How to create a good evaluation function for a game?

Introduction

A good evaluation function estimates how favorable a game state is when you cannot search all the way to the end of the game tree. The best evaluation functions are not just a pile of arbitrary weights; they are fast, correlated with winning chances, and stable enough that small search changes do not produce nonsense decisions.

Start from the Win Condition

The first question is simple: what actually causes a player to win? In chess, material, king safety, mobility, and pawn structure matter because they influence checkmate chances. In a racing game, distance to goal and blocking lines may matter more than "material."

That sounds obvious, but many bad evaluation functions start by scoring things that are easy to measure instead of things that predict success.

Combine a Few Strong Features First

A practical evaluation function is usually a weighted sum of features:

score = w1 * feature1 + w2 * feature2 + ...
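
The weighted sum maps directly onto a dictionary of named features. Below is a minimal sketch. `linear_score`, the feature names, and the weights are hypothetical and chosen only for illustration; they are not tuned values.

```python
from typing import Mapping


def linear_score(features: Mapping[str, float], weights: Mapping[str, float]) -> float:
    """Return the sum of weight * value over all named features."""
    missing = set(features) - set(weights)
    if missing:
        # Fail loudly rather than silently ignoring a feature with no weight.
        raise KeyError(f"no weight for features: {sorted(missing)}")
    return sum(weights[name] * value for name, value in features.items())


# Hypothetical weights and feature values for one position.
weights = {"material": 10.0, "mobility": 2.0, "center": 3.0}
features = {"material": 1.0, "mobility": -4.0, "center": 2.0}
print(linear_score(features, weights))  # 10 - 8 + 6 = 8.0
```

Keeping features named, rather than positional, makes it harder to pair a weight with the wrong term when the feature set changes.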

The important part is feature choice, not just weight tuning. Good starter features are:

material or resource advantage
mobility or legal move count
positional control
immediate threats
distance to objective
safety or vulnerability

Begin with a small set of meaningful terms. A bloated evaluation function with twenty weak features is often worse than a small one with four strong features.

Example for a Simple Board Game

Here is a compact Python example for a grid-based strategy game where the player wants material, mobility, and central control:

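```python
# Assumes players are numbered 0 and 1 and that `state` provides
# piece_count, legal_moves, and center_control for either player.
def evaluate(state, player):
    opponent = 1 - player

    # Each feature is "mine minus theirs", so positive values favor `player`.
    material = state.piece_count(player) - state.piece_count(opponent)
    mobility = len(state.legal_moves(player)) - len(state.legal_moves(opponent))
    center = state.center_control(player) - state.center_control(opponent)

    # Illustrative weights: material counts most, then center, then mobility.
    return (
        10 * material +
        2 * mobility +
        3 * center
    )
```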

This is not a universal formula. It is a pattern: compute a few measurable features, normalize them if needed, and combine them with interpretable weights.
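
To run the pattern end to end, `evaluate` needs a concrete `state`. The `GridState` class below is a hypothetical stand-in. The 5x5 board, the rule that a piece steps one square orthogonally onto an empty square, and treating the central 3x3 squares as "center" are all assumptions made for the demo. `evaluate` is repeated so the snippet runs on its own.

```python
from dataclasses import dataclass, field

SIZE = 5
CENTER = {(r, c) for r in range(1, 4) for c in range(1, 4)}  # central 3x3 squares


@dataclass
class GridState:
    # Maps (row, col) -> owner, where the owner is player 0 or 1.
    pieces: dict = field(default_factory=dict)

    def piece_count(self, player):
        return sum(1 for owner in self.pieces.values() if owner == player)

    def legal_moves(self, player):
        # A piece may step one square orthogonally onto an empty, in-bounds square.
        moves = []
        for (r, c), owner in self.pieces.items():
            if owner != player:
                continue
            for dr, dc in ((1, 0), (-1, 0), (0, 1), (0, -1)):
                nr, nc = r + dr, c + dc
                if 0 <= nr < SIZE and 0 <= nc < SIZE and (nr, nc) not in self.pieces:
                    moves.append(((r, c), (nr, nc)))
        return moves

    def center_control(self, player):
        return sum(1 for sq, owner in self.pieces.items() if owner == player and sq in CENTER)


def evaluate(state, player):
    opponent = 1 - player
    material = state.piece_count(player) - state.piece_count(opponent)
    mobility = len(state.legal_moves(player)) - len(state.legal_moves(opponent))
    center = state.center_control(player) - state.center_control(opponent)
    return 10 * material + 2 * mobility + 3 * center


# Player 0 has a central piece and a corner piece; player 1 has one corner piece.
state = GridState({(2, 2): 0, (0, 0): 0, (4, 4): 1})
print(evaluate(state, 0))  # 10 * 1 + 2 * (6 - 2) + 3 * 1 = 21
```

Because every feature is a difference, the score is antisymmetric: evaluating the same state for player 1 gives -21.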

Terminal States Must Dominate

A common beginner mistake is letting the heuristic override the actual game result. If a position is already decided, whether by checkmate, another win condition, or a guaranteed loss, that result must outweigh every nonterminal feature.

For example:

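```python
def evaluate(state, player):
    # Decided games come first, with a magnitude no heuristic score can reach.
    if state.is_win(player):
        return 1_000_000
    if state.is_win(1 - player):
        return -1_000_000

    # Otherwise fall back to the feature-based estimate.
    return heuristic_score(state, player)
```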

Without this rule, the engine may prefer a beautiful-looking position that is actually losing.
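
A small concrete game makes the ordering visible. The sketch below uses tic-tac-toe purely as an illustration. The string board encoding, the "open lines" heuristic, and the `WIN_SCORE` constant are assumptions, not part of any particular engine. The property that matters is that the heuristic is bounded by 8 in magnitude, so it can never compete with a terminal score.

```python
# Board: 9 characters of "X", "O", or "." in row-major order.
LINES = [
    (0, 1, 2), (3, 4, 5), (6, 7, 8),  # rows
    (0, 3, 6), (1, 4, 7), (2, 5, 8),  # columns
    (0, 4, 8), (2, 4, 6),             # diagonals
]
WIN_SCORE = 1_000_000


def is_win(board, mark):
    return any(all(board[i] == mark for i in line) for line in LINES)


def open_lines(board, mark):
    # Lines the player could still complete: no opponent mark on them.
    other = "O" if mark == "X" else "X"
    return sum(1 for line in LINES if all(board[i] != other for i in line))


def evaluate(board, mark):
    other = "O" if mark == "X" else "X"
    # Terminal results are checked first so no heuristic can outweigh them.
    if is_win(board, mark):
        return WIN_SCORE
    if is_win(board, other):
        return -WIN_SCORE
    # Nonterminal: difference in open lines, always between -8 and 8.
    return open_lines(board, mark) - open_lines(board, other)


print(evaluate("XXXOO....", "X"))  # 1000000
print(evaluate("....X....", "X"))  # 8 - 4 = 4
```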

Keep the Features Search-Friendly

The evaluation function runs at every leaf the search reaches, so it must be cheap. If one feature requires a mini-search of its own, the whole engine slows down dramatically.

This creates a real tradeoff:

more features may improve positional understanding
fewer features allow deeper search

In many games, a simpler evaluation plus deeper search beats a very sophisticated evaluation plus shallow search.
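
To see why cost matters, it helps to count how often the evaluation actually runs. The sketch below is a plain depth-limited negamax (one common way to write minimax) over a hypothetical subtraction game: each move removes 1 to 3 stones, and whoever takes the last stone wins. The heuristic is a placeholder that always returns 0. With three moves per position, the number of leaf evaluations triples with every extra ply, so any cost added to the evaluation is paid at every one of those leaves.

```python
import math

WIN = 1_000_000


def evaluate(stones):
    # Placeholder heuristic for the side to move; a real one would be game-specific.
    return 0


def negamax(stones, depth, counter):
    """Score for the side to move, searching `depth` plies; counter[0] counts leaf evaluations."""
    if stones == 0:
        # The previous player took the last stone, so the side to move has lost.
        return -WIN
    if depth == 0:
        counter[0] += 1  # a leaf: this is where the evaluation function runs
        return evaluate(stones)
    best = -math.inf
    for take in (1, 2, 3):
        if take <= stones:
            best = max(best, -negamax(stones - take, depth - 1, counter))
    return best


def count_leaf_evaluations(stones, depth):
    counter = [0]
    negamax(stones, depth, counter)
    return counter[0]


for depth in range(1, 5):
    print(depth, count_leaf_evaluations(20, depth))  # 1 3, 2 9, 3 27, 4 81
```

Terminal positions return before the depth check, which matches the rule that decided games are scored directly rather than by the heuristic.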

Test by Playing Against Itself

The fastest way to improve an evaluation function is empirical. Let two versions of the engine play each other from varied starting positions. If a new feature does not improve results, remove it or retune it.
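
A self-play match can be organized with a small harness. In this sketch, `play_game(first, second, opening)` is a hypothetical hook into your engine that plays one game from `opening` and returns +1, 0, or -1 from the first mover's point of view. Each opening is played twice with sides swapped so neither version benefits from always moving first.

```python
def run_match(engine_a, engine_b, play_game, openings):
    """Play every opening twice with sides swapped; return engine_a's (wins, draws, losses)."""
    wins = draws = losses = 0
    for opening in openings:
        for a_moves_first in (True, False):
            if a_moves_first:
                result = play_game(engine_a, engine_b, opening)
            else:
                # Negate so the result is always from engine_a's point of view.
                result = -play_game(engine_b, engine_a, opening)
            if result > 0:
                wins += 1
            elif result < 0:
                losses += 1
            else:
                draws += 1
    return wins, draws, losses


# Toy stand-in: "engines" are strength numbers and the stronger one always wins.
def toy_play_game(first, second, opening):
    return (first > second) - (first < second)


print(run_match(2, 1, toy_play_game, ["opening-1", "opening-2"]))  # (4, 0, 0)
```

A common summary is the match score (wins + draws / 2) / games; only a clear, repeatable improvement over many openings is good evidence that a new feature helps.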

You can also compare the heuristic score against positions whose outcomes are already known. If positions that later lose are often scored as strongly favorable, your features are missing something important.
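
One way to make that comparison concrete is to measure how often the sign of the score agrees with the recorded outcome, and to list the lost positions the heuristic liked most. The sketch assumes a labeled set of `(position, outcome)` pairs with outcomes +1, 0, or -1; the toy data below is invented so the example runs.

```python
def sign(x):
    return (x > 0) - (x < 0)


def agreement_rate(evaluate, labeled_positions, player):
    """Fraction of positions where the score's sign matches the known outcome (+1, 0, -1)."""
    if not labeled_positions:
        raise ValueError("need at least one labeled position")
    # Draws only match when the score is exactly 0, which is strict;
    # a small tolerance band around 0 is a possible refinement.
    hits = sum(
        1 for position, outcome in labeled_positions
        if sign(evaluate(position, player)) == outcome
    )
    return hits / len(labeled_positions)


def most_overrated_losses(evaluate, labeled_positions, player, k=3):
    """Lost positions sorted by how favorably the heuristic scored them."""
    lost = [position for position, outcome in labeled_positions if outcome < 0]
    return sorted(lost, key=lambda p: evaluate(p, player), reverse=True)[:k]


# Toy data: a "position" is just its precomputed score, so evaluate is the identity.
def toy_evaluate(position, player):
    return position


labeled = [(5, 1), (-3, -1), (4, -1), (0, 0)]
print(agreement_rate(toy_evaluate, labeled, 0))         # 0.75
print(most_overrated_losses(toy_evaluate, labeled, 0))  # [4, -3]
```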

Normalize and Avoid Double Counting

Many evaluation bugs come from mixing feature scales carelessly. If material ranges from -20 to 20 but mobility ranges from -3 to 3, the chosen weights must reflect that. Otherwise one term dominates by accident.
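
A simple fix is to divide each raw feature by its typical range before weighting, so every term lands roughly in [-1, 1] and the weights express importance rather than compensate for units. The ranges below reuse the illustrative numbers from the paragraph above; clamping to [-1, 1] is one design choice among several, and real ranges should be measured on your own positions.

```python
# Typical half-ranges from the example in the text (assumed, not measured).
FEATURE_RANGES = {"material": 20.0, "mobility": 3.0}


def normalize(features, ranges=FEATURE_RANGES):
    """Scale each feature by its typical range and clamp it to [-1, 1]."""
    return {
        name: max(-1.0, min(1.0, value / ranges[name]))
        for name, value in features.items()
    }


def score(features, weights):
    scaled = normalize(features)
    return sum(weights[name] * scaled[name] for name in scaled)


raw = {"material": 10, "mobility": -3}
print(normalize(raw))                                  # {'material': 0.5, 'mobility': -1.0}
print(score(raw, {"material": 1.0, "mobility": 1.0}))  # -0.5
```

With raw values and equal weights, the material term (10) would swamp mobility (-3). After scaling, a mobility value at the edge of its range counts for more than a middling material lead, and any remaining preference is stated explicitly through the weights.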

Another common problem is double counting. For example, "piece count" and "total piece value" may represent the same concept twice. That makes tuning harder and can distort the engine's preferences.
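
Double counting can often be spotted by measuring features over a sample of positions and flagging pairs that move almost in lockstep. The sketch below computes Pearson correlation by hand. The 0.9 threshold and the sample values are invented for illustration, and a high correlation is a hint to investigate, not proof that one feature is redundant.

```python
import math
from itertools import combinations


def pearson(xs, ys):
    """Pearson correlation of two equal-length sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return 0.0  # a constant feature has no defined correlation; treat as unrelated
    return cov / math.sqrt(var_x * var_y)


def redundant_pairs(samples, threshold=0.9):
    """samples maps feature name -> values measured on the same list of positions."""
    flagged = []
    for a, b in combinations(sorted(samples), 2):
        r = pearson(samples[a], samples[b])
        if abs(r) >= threshold:
            flagged.append((a, b, round(r, 3)))
    return flagged


# Invented feature values over four positions.
samples = {
    "piece_count": [8, 7, 6, 5],
    "piece_value": [39, 35, 30, 26],
    "mobility": [20, 25, 18, 22],
}
print(redundant_pairs(samples))  # [('piece_count', 'piece_value', 0.999)]
```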

Common Pitfalls

Scoring features that are easy to compute instead of features that correlate with winning leads to weak play.
Forgetting to give terminal wins and losses overwhelming scores makes the engine choose attractive but losing positions.
Adding too many weak terms creates a noisy heuristic that is hard to tune and expensive to run.
Ignoring feature scale can make one term dominate for accidental numeric reasons.
Failing to test the evaluation function through self-play or benchmark positions leaves you guessing about quality.

Summary

A good evaluation function estimates winning chances using a small set of strong, game-relevant features.
Terminal states must dominate the score so the engine never prefers a losing position with nice heuristics.
Simplicity matters because evaluation cost directly limits search depth.
Tune with self-play and benchmark positions, and remove features that do not improve actual decisions.