Based on the paper: Proximal Policy Optimization Algorithms (arXiv:1707.06347).
This is a trained Proximal Policy Optimization (PPO) agent playing LunarLander-v2, built with the stable-baselines3 library.
The agent learns to land a lunar module in the LunarLander-v2 environment from Gymnasium (formerly OpenAI Gym) by controlling its main engine and side thrusters while managing fuel consumption and landing precision.
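For reference, the environment interface the agent operates in is small: an 8-dimensional observation (position, velocity, angle, angular velocity, and two leg-contact flags) and 4 discrete actions (do nothing, fire left thruster, fire main engine, fire right thruster). A minimal sketch for inspecting it, assuming Gymnasium's Box2D extra is installed:

```python
import gymnasium as gym

# Requires the Box2D extra: pip install "gymnasium[box2d]"
env = gym.make("LunarLander-v2")
print(env.observation_space)  # Box with shape (8,): x/y position, x/y velocity, angle, angular velocity, leg contacts
print(env.action_space)       # Discrete(4): do nothing, fire left, fire main engine, fire right

obs, info = env.reset(seed=0)
obs, reward, terminated, truncated, info = env.step(env.action_space.sample())  # one random step
env.close()
```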
The model was trained with the following PPO hyperparameters:
| Parameter | Value |
|---|---|
| Policy | MlpPolicy |
| n_steps | 1024 |
| batch_size | 64 |
| n_epochs | 4 |
| gamma (discount factor) | 0.999 |
| gae_lambda | 0.98 |
| ent_coef (entropy coefficient) | 0.01 |
Evaluation Results:
This performance indicates the agent has successfully learned to land the lunar module; LunarLander-v2 is conventionally considered solved at an average episode return of 200 or more.
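The exact evaluation numbers are not reproduced here; as a sketch, the checkpoint can be re-evaluated locally with stable-baselines3's evaluate_policy helper (the repo_id and filename below match the usage example that follows):

```python
import gymnasium as gym
from stable_baselines3 import PPO
from stable_baselines3.common.evaluation import evaluate_policy
from stable_baselines3.common.monitor import Monitor
from huggingface_sb3 import load_from_hub

# Download the checkpoint from the Hub and load it into a PPO model
checkpoint = load_from_hub(
    repo_id="Adilbai/ppo-LunarLander-v2",
    filename="ppo-LunarLander-v2.zip",
)
model = PPO.load(checkpoint)

# Monitor records episode returns so evaluate_policy can aggregate them
eval_env = Monitor(gym.make("LunarLander-v2"))
mean_reward, std_reward = evaluate_policy(model, eval_env, n_eval_episodes=10, deterministic=True)
print(f"mean_reward = {mean_reward:.2f} +/- {std_reward:.2f}")
```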
Using the model in the LunarLander-v2 environment:
```python
import gymnasium as gym
from stable_baselines3 import PPO
from huggingface_sb3 import load_from_hub

# Download the checkpoint from the Hugging Face Hub and load it into a PPO model
checkpoint = load_from_hub(
    repo_id="Adilbai/ppo-LunarLander-v2",
    filename="ppo-LunarLander-v2.zip",
)
model = PPO.load(checkpoint)

# Create the environment with on-screen rendering
env = gym.make("LunarLander-v2", render_mode="human")

# Run the trained agent
obs, info = env.reset()
for _ in range(1000):
    action, _states = model.predict(obs, deterministic=True)
    obs, reward, terminated, truncated, info = env.step(action)
    if terminated or truncated:
        obs, info = env.reset()
env.close()
```
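To save a rollout instead of rendering it live, Gymnasium's RecordVideo wrapper can be used; this is only a sketch, assuming moviepy is installed, and the "videos" folder name is an arbitrary choice:

```python
import gymnasium as gym
from gymnasium.wrappers import RecordVideo
from stable_baselines3 import PPO
from huggingface_sb3 import load_from_hub

checkpoint = load_from_hub(repo_id="Adilbai/ppo-LunarLander-v2", filename="ppo-LunarLander-v2.zip")
model = PPO.load(checkpoint)

# Record every episode as an .mp4 under ./videos (rgb_array rendering is required for recording)
env = RecordVideo(
    gym.make("LunarLander-v2", render_mode="rgb_array"),
    video_folder="videos",
    episode_trigger=lambda episode_id: True,
)

obs, info = env.reset()
done = False
while not done:
    action, _states = model.predict(obs, deterministic=True)
    obs, reward, terminated, truncated, info = env.step(action)
    done = terminated or truncated
env.close()
```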
The PPO agent uses a Multi-Layer Perceptron (MLP) policy, stable-baselines3's MlpPolicy: a fully connected actor-critic network that maps the 8-dimensional observation vector to a probability distribution over the 4 discrete actions and to a state-value estimate.
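The exact layer sizes can be checked by printing the loaded policy module; a small sketch reusing the checkpoint loaded above:

```python
from stable_baselines3 import PPO
from huggingface_sb3 import load_from_hub

checkpoint = load_from_hub(repo_id="Adilbai/ppo-LunarLander-v2", filename="ppo-LunarLander-v2.zip")
model = PPO.load(checkpoint)

# Prints the actor-critic modules: shared feature extractor, policy (actor) head, and value (critic) head
print(model.policy)
```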
To reproduce this model's training:
```python
from stable_baselines3 import PPO
import gymnasium as gym

env = gym.make("LunarLander-v2")

model = PPO(
    policy='MlpPolicy',
    env=env,
    n_steps=1024,
    batch_size=64,
    n_epochs=4,
    gamma=0.999,
    gae_lambda=0.98,
    ent_coef=0.01,
    verbose=1,
)
model.learn(total_timesteps=500000)  # Adjust based on your training duration
```
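Continuing from the training snippet, the trained policy can be saved as a zip archive and reloaded later; the filename below mirrors the checkpoint hosted in this repo and is only an assumption:

```python
# Save the trained policy (filename mirrors the checkpoint hosted in this repo; adjust as needed)
model.save("ppo-LunarLander-v2")

# Reload it later without retraining
reloaded_model = PPO.load("ppo-LunarLander-v2")
```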
If you use this model, please cite:
```bibtex
@misc{ppo_lunarlander_2024,
  title={PPO Agent for LunarLander-v2},
  author={[Your Name]},
  year={2024},
  publisher={Hugging Face Hub},
  url={https://huggingface.co/Adilbai/ppo-LunarLander-v2}
}
```