Hey Smogon,
We’ve been working on something wild over the last year: a full open-source Pokémon AI benchmark built around Pokémon battles.
It’s called the PokéAgent Challenge, and it’s going to be hosted at NeurIPS 2025 in San Diego this December.
Our goal is simple: use Pokémon as a real testbed for AI reasoning and learning.
There are two main agents powering the challenge, with strong performance on Gens 1, 2, 3, 4, and 9 OU (and VGC support in the works!):
PokéChamp:
Large Language Model-based agents.
We use LLMs like ChatGPT, Claude, and Gemini to think ahead, model opponents, and plan actions. The agent can even explain its choices like a human player.
→ GitHub: github.com/sethkarten/pokechamp
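To give a flavor of how an LLM-based battle agent works, here's a minimal sketch of the decision loop: serialize the battle state into a text prompt, ask the model to reason about the opponent and pick a move. All names here (`BattleState`, `build_prompt`, `choose_move`) are illustrative stand-ins, not the actual PokéChamp API — see the repo for the real thing.

```python
# Hypothetical sketch of an LLM battle-agent decision loop.
# Names and structure are illustrative, not the PokéChamp codebase.
from dataclasses import dataclass, field

@dataclass
class BattleState:
    active: str                       # our active Pokémon
    opponent: str                     # opponent's active Pokémon
    moves: list = field(default_factory=list)

def build_prompt(state: BattleState) -> str:
    # Serialize the battle state into a text prompt the LLM can reason over.
    return (
        f"Your {state.active} faces the opponent's {state.opponent}. "
        f"Available moves: {', '.join(state.moves)}. "
        "Predict the opponent's likely action, then pick the best move."
    )

def choose_move(state: BattleState, llm=None) -> str:
    prompt = build_prompt(state)
    if llm is not None:
        return llm(prompt)            # e.g. a call out to ChatGPT/Claude/Gemini
    return state.moves[0]             # placeholder fallback when no LLM is wired up

state = BattleState("Garchomp", "Skarmory", ["Fire Fang", "Earthquake"])
print(choose_move(state))             # without an LLM, falls back to the first move
```

In the real systems the `llm` callable would return a parsed move choice (often with a natural-language explanation), and the prompt would carry much more context: team previews, HP, hazards, and predicted opponent sets.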
Metamon:
Reinforcement Learning agents that learn from experience: no scripts, no hand-coded rules.
→ GitHub: github.com/UT-Austin-RPL/metamon
Data
Did I mention we have the largest Pokémon battle dataset? Through a combination of human replays and bot ladder battles, we have almost 10M replays (and growing).
If you’re into bot development, battle analysis, or just want to see how close AI can get to real human play, check out:
https://pokeagent.github.io
Edit: Top methods from our challenge have made it into the top 10 for Gen 1 and Gen 9 OU. More on this in our retrospective paper soon!
