Difference between OpenAI Gym environments 'CartPole-v0' and 'CartPole-v1'

Introduction

CartPole-v0 and CartPole-v1 use the same core dynamics, action space, and observation format, but they differ in episode horizon and success criteria. Those two differences materially affect training curves and benchmark interpretation. If you compare agents across versions without normalizing these settings, results can be misleading.

Shared Mechanics Across Both Versions

Both versions model the same control task:

  • state includes cart position, cart velocity, pole angle, and pole angular velocity
  • action space is discrete with two actions, push left or push right
  • reward is +1 for every time step the episode stays alive
  • an episode fails when the pole tilts more than 12 degrees from vertical or the cart moves more than 2.4 units from the center of the track
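The failure rule in the last bullet can be sketched without the library. This is a minimal illustration, not the environment's own code: the constants `X_THRESHOLD` and `THETA_THRESHOLD` and the helper `is_failure` are names chosen here, and the bounds (2.4 track units, 12 degrees) are the commonly documented CartPole limits, assumed rather than read from an installed gym.

```python
import math

# Assumed bounds for the classic CartPole task (not read from gym itself).
X_THRESHOLD = 2.4                      # cart position limit, in track units
THETA_THRESHOLD = 12 * math.pi / 180   # pole angle limit: 12 degrees, in radians


def is_failure(obs):
    """Return True if an observation lies outside the allowed bounds.

    obs is (cart_position, cart_velocity, pole_angle, pole_angular_velocity).
    """
    x, _x_dot, theta, _theta_dot = obs
    return abs(x) > X_THRESHOLD or abs(theta) > THETA_THRESHOLD


print(is_failure((0.0, 0.0, 0.05, 0.0)))  # centered cart, nearly upright pole
print(is_failure((2.5, 0.0, 0.0, 0.0)))   # cart has left the track
print(is_failure((0.0, 0.0, 0.25, 0.0)))  # pole has tipped too far
```

Note that the velocities are part of the state but play no role in the failure test; only position and angle are bounded.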

A basic interaction loop looks the same for both versions.

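python
import gym

# Written against the gym >= 0.26 API: reset() returns (obs, info) and
# step() returns separate terminated/truncated flags. Older releases return
# only obs from reset() and a 4-tuple (obs, reward, done, info) from step().
env = gym.make("CartPole-v1")
obs, info = env.reset()

done = False
total_reward = 0.0

while not done:
    action = env.action_space.sample()  # random policy
    obs, reward, terminated, truncated, info = env.step(action)
    done = terminated or truncated  # failure, or the step cap was reached
    total_reward += reward

print(total_reward)
env.close()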

The agent API usage is identical. The important differences are in environment limits and benchmark thresholds.

Key Difference 1: Maximum Episode Length

The largest practical difference is the episode cap: a time limit that ends the episode even if the pole is still balanced.

  • CartPole-v0 max episode steps is 200.
  • CartPole-v1 max episode steps is 500.

This means an agent that balances reliably for 200 steps may still fail to sustain control for 500 steps. Training on v1 therefore pushes policies to be stable over longer horizons.

In experiments, this changes both the achievable return and convergence behavior. Returns on v0 saturate at 200, so a policy that appears solved quickly on v0 can still fall well short on v1.
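The saturation effect can be shown with a toy calculation rather than a simulation. The sketch below assumes +1 reward per step and uses hypothetical names (`MAX_STEPS`, `capped_return`); `survival_steps` stands for how long a policy would balance if no time limit existed.

```python
# Episode caps for each version, as listed above.
MAX_STEPS = {"CartPole-v0": 200, "CartPole-v1": 500}


def capped_return(survival_steps, env_id):
    """Return earned by a policy that would balance for survival_steps
    with no time limit, at +1 reward per step."""
    return min(survival_steps, MAX_STEPS[env_id])


for steps in (150, 350, 1000):
    v0 = capped_return(steps, "CartPole-v0")
    v1 = capped_return(steps, "CartPole-v1")
    print(f"survives {steps:>4} steps -> v0 return {v0}, v1 return {v1}")
```

A policy that survives 350 steps earns the maximum on v0 but only 70% of the maximum on v1, which is why v0 scores alone cannot distinguish it from a policy that balances indefinitely.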

Key Difference 2: Solved Threshold Expectations

Classic benchmarks use different solved thresholds for each version.

  • v0 is conventionally considered solved at an average return of at least 195 over 100 consecutive episodes.
  • v1 uses a much higher target of 475 over the same window, reflecting its longer horizon.

Because reward per step is identical (+1), the thresholds scale with the maximum number of steps: 195 is 97.5% of 200, and 475 is 95% of 500. Comparing raw return values between versions without accounting for this shift is not meaningful.
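A version-aware solved check might look like the sketch below. The thresholds come from the list above; the 100-episode window is the conventional choice and is an assumption here, as are the names `SOLVED_THRESHOLD` and `is_solved`.

```python
# Solved thresholds per version; the window length is an assumed convention.
SOLVED_THRESHOLD = {"CartPole-v0": 195.0, "CartPole-v1": 475.0}


def is_solved(episode_returns, env_id, window=100):
    """True if the mean of the last `window` returns meets the threshold."""
    if len(episode_returns) < window:
        return False  # not enough episodes to judge yet
    recent = episode_returns[-window:]
    return sum(recent) / window >= SOLVED_THRESHOLD[env_id]


returns = [200.0] * 100  # a policy that always lasts exactly 200 steps
print(is_solved(returns, "CartPole-v0"))
print(is_solved(returns, "CartPole-v1"))
```

The same list of returns passes the v0 check and fails the v1 check, so the environment ID has to travel with the check rather than being hard-coded.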

Impact on Reinforcement Learning Experiments

When switching from v0 to v1, expect:

  • longer training before reaching solved thresholds
  • wider variance early in training
  • greater sensitivity to instability in near-terminal states

Algorithms that optimize short-horizon stability may look strong on v0 but underperform on v1. For fair comparison, keep environment version fixed across baselines.

Reproducibility and Version Control

Environment version should be treated like a hyperparameter. Log it alongside seed, reward settings, and library versions.

python
config = {
    "env_id": "CartPole-v1",
    "seed": 123,
    "algo": "DQN",
    "gamma": 0.99,
}
print(config)

Without explicit version logging, reproduced runs may silently target a different benchmark and invalidate comparison tables.
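One way to capture library versions alongside the config is sketched below, using only the standard library. The helpers `package_version` and `build_run_record` are hypothetical names, and the package name "gym" is an assumption about what is installed; if it is missing, the field is recorded as null instead of raising.

```python
import json
import platform
from importlib import metadata


def package_version(name):
    """Installed version of a package, or None if it is not installed."""
    try:
        return metadata.version(name)
    except metadata.PackageNotFoundError:
        return None


def build_run_record(config):
    """Copy the run config and attach interpreter and library versions."""
    record = dict(config)
    record["python_version"] = platform.python_version()
    record["gym_version"] = package_version("gym")  # assumed package name
    return record


record = build_run_record(
    {"env_id": "CartPole-v1", "seed": 123, "algo": "DQN", "gamma": 0.99}
)
print(json.dumps(record, indent=2, sort_keys=True))
```

Writing this record next to each run's results makes it possible to tell later which benchmark a number actually refers to.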

Practical Migration from v0 to v1

If you have legacy code based on v0, migration steps are usually straightforward:

  1. change environment ID to v1
  2. update evaluation threshold and reporting targets
  3. increase training budget if previously tuned to 200-step episodes
  4. re-check early stopping criteria
  5. compare normalized metrics such as percentage of max return

Normalized reporting example:

python
max_steps = 500  # v1
avg_return = 420
normalized = avg_return / max_steps
print(normalized)

Normalized scores help compare policy quality across horizon changes.
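To put runs from both versions on one scale, the same division can be keyed by environment ID. This is a small sketch with made-up example numbers; `MAX_RETURN` and `normalized_return` are names chosen here, and the maximum return equals the step cap because each step is worth +1.

```python
# Maximum achievable return per version (equal to the step cap).
MAX_RETURN = {"CartPole-v0": 200, "CartPole-v1": 500}


def normalized_return(avg_return, env_id):
    """Average return as a fraction of the version's maximum return."""
    return avg_return / MAX_RETURN[env_id]


# Illustrative averages only, not measured results.
runs = [("CartPole-v0", 190.0), ("CartPole-v1", 420.0)]
for env_id, avg in runs:
    print(f"{env_id}: {normalized_return(avg, env_id):.0%} of max return")
```

In raw terms the v1 run looks more than twice as good, yet on the normalized scale the v0 run is closer to its ceiling. Normalization does not make the two tasks equivalent, so the version label should still be reported with the score.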

Typical Mistakes in CartPole Benchmarking

Common analysis mistakes include:

  • mixing v0 and v1 runs in one chart without labels
  • declaring solved by using old v0 threshold in v1 experiments
  • keeping identical training budget despite longer v1 episodes
  • reporting best single episode instead of stable average

Fixing these issues improves benchmark credibility significantly.
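The last mistake in the list is easy to demonstrate. The sketch below compares the best single episode with the best trailing-window average on made-up returns; `rolling_mean` is a hypothetical helper, and the window of 4 is chosen only to keep the example short.

```python
def rolling_mean(values, window):
    """Mean of every full window of consecutive values, in order."""
    return [
        sum(values[i - window:i]) / window
        for i in range(window, len(values) + 1)
    ]


# Illustrative returns with one lucky episode, not measured results.
returns = [20, 35, 500, 40, 60, 80, 45, 90]
print("best single episode:", max(returns))
print("best 4-episode average:", max(rolling_mean(returns, 4)))
```

One lucky episode reaches the v1 cap, while the best windowed average stays far below any solved threshold, which is the number that reflects the policy's actual reliability.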

Code Snippet for Side-by-Side Evaluation

The following snippet evaluates both versions with the same random policy. It is best read as a sanity check of the evaluation loop rather than a demonstration of the gap between versions.

python
import gym


def evaluate(env_id, episodes=20):
    """Average return of a random policy (gym >= 0.26 reset/step API)."""
    env = gym.make(env_id)
    totals = []
    for _ in range(episodes):
        obs, info = env.reset()
        done = False
        total = 0.0
        while not done:
            action = env.action_space.sample()
            obs, reward, terminated, truncated, info = env.step(action)
            done = terminated or truncated  # failure or step cap
            total += reward
        totals.append(total)
    env.close()
    return sum(totals) / len(totals)


print("v0 avg:", evaluate("CartPole-v0"))
print("v1 avg:", evaluate("CartPole-v1"))

A random policy usually fails within a few dozen steps, well before either cap, so the two averages should be close. The horizon difference only appears in returns once a policy can balance past 200 steps.

Common Pitfalls

  • Comparing v0 and v1 reward numbers directly without normalization.
  • Using v0 solved threshold for v1 training runs.
  • Forgetting to log environment version and seed in experiment metadata.
  • Migrating environment version without retuning training budget and stop criteria.
  • Publishing benchmark tables that mix environment versions in the same metric column.

Summary

  • CartPole-v0 and CartPole-v1 share dynamics but differ in episode cap and success expectations.
  • v1 is a stricter long-horizon benchmark than v0.
  • Environment version must be treated as a controlled experiment parameter.
  • Migration from v0 to v1 requires threshold and budget updates, not just ID replacement.
  • Clear labeling and normalized reporting are essential for fair RL comparisons.
