
(Re-)Introducing Foul Play: A Competitive Pokemon Battle Bot

Hello once again Smogon,

It has been ... 6 years(!?) since I first posted about a Pokemon Showdown battle-bot that I've been working on. I'm here to share an update as I believe I've made some good progress.

The unnamed project has been re-branded as foul-play. It is still a singles-focused battle bot that only plays formats with Species Clause, and it does not yet support formats with Mega Evolution, Z-Moves, or Dynamax. It continues to rely on a search-based engine that looks into the future, guided by a static, hand-crafted evaluation function, though I have completely rewritten the Python battle engine in Rust as poke-engine. One of the most impactful changes is that minimax has been replaced by Monte Carlo search. This was a game changer for Pokemon, as Monte Carlo search deals far better with simultaneous-move games.
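To illustrate the difference, here is a minimal sketch of decoupled UCT, one common way to adapt Monte Carlo search to simultaneous-move games. This is illustrative only, not poke-engine's actual implementation; the Node class and its fields are invented for the example.

```python
import math

# Illustrative decoupled UCT: each player keeps independent UCB statistics
# over their own moves, so neither side's pick for a turn is conditioned
# on seeing the other's pick for that same turn.
class Node:
    def __init__(self, my_moves, opp_moves):
        self.my_moves, self.opp_moves = my_moves, opp_moves
        self.my_visits = [0] * len(my_moves)
        self.my_value = [0.0] * len(my_moves)
        self.opp_visits = [0] * len(opp_moves)
        self.opp_value = [0.0] * len(opp_moves)
        self.total = 0

    def _ucb_pick(self, visits, values):
        for i, n in enumerate(visits):
            if n == 0:
                return i  # try every move once before exploiting
        return max(
            range(len(visits)),
            key=lambda i: values[i] / visits[i]
            + 1.41 * math.sqrt(math.log(self.total) / visits[i]),
        )

    def select_joint_move(self):
        # Both sides choose independently -- the key difference from
        # minimax, which assumes one player moves with full knowledge
        # of the other's choice.
        return (
            self._ucb_pick(self.my_visits, self.my_value),
            self._ucb_pick(self.opp_visits, self.opp_value),
        )

    def update(self, my_idx, opp_idx, result):
        # result: 1.0 = win for "me" after simulating the joint move
        self.total += 1
        self.my_visits[my_idx] += 1
        self.my_value[my_idx] += result
        self.opp_visits[opp_idx] += 1
        self.opp_value[opp_idx] += 1.0 - result
```

Because each side's statistics are decoupled, the search converges toward mixed strategies rather than the deterministic best responses minimax would commit to, which is exactly what you want when moves are chosen simultaneously.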

How does it do? In my opinion it is an above-average player when it is able to predict the opponent's unknowns well. It cannot consistently dominate the top of the ladder, but it can achieve placements that are definitely impressive. Furthermore, since the engine does not understand battle mechanics with 100% accuracy, it is certainly exploitable if the opponent knows it is playing against a bot.

Here are some results I've achieved with the bot. Note that (at least in my opinion) peak Elo is not always indicative of skill; the rank #4 in gen3ou, for example, was due to a lucky run. The GXE values shown were all recorded after the Glicko deviation had dropped below 50.
[Image: ladder results (peak ranks and GXE) across several formats]


If you would like to read a bit more about some of the techniques used by foul-play and poke-engine to achieve these rankings, as well as see some replays of the bot battling, check out: https://pmariglia.github.io/posts/foul-play
 
Sorry if I'm dumb and misunderstood it, but would it be possible to challenge it personally (send it a battle request) and use it as a training bot?
 
Is there a way I can play against it via a challenge? I'd love to try it out if possible.
You'd have to build and run it yourself locally. I'd love to provide something that people can challenge but running Foul Play, especially at its strongest settings, takes a fair bit of resources.
 
This is fascinating! Do you have any of the replays associated with the bot's run to #4 on the gen 3 OU ladder?

There are sample replays for a few different formats, including gen3ou, towards the end of this page: https://pmariglia.github.io/posts/foul-play

You can see a broader set of replays by going on https://replay.pokemonshowdown.com and searching for the two accounts I commonly tested on: Accelerock Ttar and Playing Foul. Unfortunately, these will be biased towards losses as I configured Foul Play to save replays on loss unless I was actively observing and found it interesting for some other reason.

Here are two gen3ou wins though:
- 1600 elo win
- 1800 elo win
 
I have been trying to challenge it myself for a while, but I can't seem to get it to send me a challenge request or even accept a challenge. Can you help with that?
 
I kid you not, I literally started working on a project like this, specifically for gen 3, last month. I have a bot that uses extremely similar techniques to the ones you outlined, and I've been testing it on a local Showdown server. It even beat me one time! (Attached replay) (I got too cocky)
The funny thing is that I was using poke-engine for my simulator too, so I really should have noticed the recent updates and realized you were already working on this. Oh well, at least I can rest easy knowing that somebody who actually knows how to program has done better than I ever could.

Are you going to continue working on the bot to improve it, or are you satisfied with it as is? My lofty goal was to make something that was genuinely superhuman at playing pokemon (in my case just gen3ou), and I think that's definitely possible based on what you've managed here.

I have a few technical questions as well since I'm curious:

1) Are you predicting unrevealed Pokemon on the opponent's team, or just narrowing down and predicting sets for the revealed Pokemon? If you aren't predicting unrevealed Pokemon, I've had moderate success doing so (using the teammate data for revealed Pokemon from Smogon usage stats), and I think it could lead to a significant improvement.

2) Have you looked into using sampling algorithms other than MCTS? I came across this paper, which presents a new sampling algorithm and compares it to different variants of MCTS. It looks like you're using something similar to their MCTS-UCT algorithm, so there's potentially room for improvement there?

3) Have you tested the bot with different team archetypes, or mostly the same teams? From the teams in the winning replays it looks like mostly offense, so I was curious whether it still performs well with more defensive/stall structures. I noticed that my bot often makes poor long-term decisions or underestimates moves like Toxic, which have a high but delayed impact.

Finally it goes without saying but this is a really awesome and exciting project so thanks for doing it!
 

If you have a problem installing/running, I'd suggest opening a GitHub issue.


1) Yes. For random battles the unrevealed Pokemon are sampled from the pool of Pokemon PS would put on the team. For formats like Gen3OU I do a very non-scientific sampling of the most likely Pokemon. This does not intelligently try to infer how a team is composed, but it uses Smogon's available usage statistics to do its best; I'm sure there's room for improvement here. (There's a rough sketch of the idea below this list.)

2) I have not. These seem interesting, I will definitely give them a read.

3) You're right, it's mostly offense/balanced offense that the bot is good with. Long-term decisions are hard to reason about with a search-based engine that can only see perhaps 10 turns ahead in the best case. There have to be clear HP gains made in that time for the engine to see the value.
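To make answer 1) concrete, here is a hedged sketch of sampling unrevealed teammates from usage statistics. This is an illustration rather than Foul Play's actual code; the teammates table is assumed to be shaped like the "Teammates" field in Smogon's chaos stat exports (a per-Pokemon dict of co-occurrence scores).

```python
import random

# Illustrative only: fill unrevealed team slots by sampling from teammate
# co-occurrence data. `teammates` maps a Pokemon name to {candidate: score};
# a higher score means the two appear on teams together more often.
def sample_unrevealed(revealed, teammates, team_size=6):
    team = list(revealed)
    while len(team) < team_size:
        scores = {}
        for mon in team:
            for candidate, score in teammates.get(mon, {}).items():
                if candidate not in team and score > 0:
                    scores[candidate] = scores.get(candidate, 0.0) + score
        if not scores:
            break  # no usable data; leave the remaining slots unknown
        candidates, weights = zip(*scores.items())
        team.append(random.choices(candidates, weights=weights)[0])
    return team
```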
 

Regarding 1), I did some experimenting and it looks like the bot just fills in the sides with level 1 Pikachus until there are six Pokemon on each side (instead of guessing likely teammates). Is this just a placeholder that is ignored, or is there something I'm missing?

Also, a couple of other things I noticed from the logs: the team converter appears to be ignoring IVs, and the |teamsize| battle message isn't being processed or used (useful for when the opponent tries a one-Pokemon cheese strategy to prevent phazing with Roar, etc.). I'll make issues for these.


Here's an example of what I'm talking about: the state being used and the result of the MCTS are identical despite the meaningful difference between teamsize = 6 and teamsize = 1 for the opponent.
[Screenshots: MCTS state and output, identical for opponent teamsize = 6 and teamsize = 1]
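For reference, |teamsize| arrives as a sim-protocol line like |teamsize|p2|6. A minimal sketch of consuming it (battle_state here is a hypothetical stand-in for wherever the bot tracks per-player information):

```python
# Minimal sketch of handling the |teamsize| protocol message,
# e.g. "|teamsize|p2|6".
def handle_protocol_line(line, battle_state):
    parts = line.split("|")
    if len(parts) >= 4 and parts[1] == "teamsize":
        player, size = parts[2], int(parts[3])
        # With size < 6, the search should not pad that side with guessed
        # teammates (and phazing moves like Roar have no targets to drag in).
        battle_state.team_sizes[player] = size
```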



Let me know if I am being annoying, I'm new to this stuff.
 
I've been using this battle bot for the past few weeks to practice formats that are non-ladderable / ladders that are inactive, and this is some seriously impressive work.

The build process is very streamlined if you have Docker installed, which is very much appreciated.

To populate TeamDatasets for an arbitrary format, I used the sample sets available under https://play.pokemonshowdown.com/data/sets (converting them to the correct structure, which I think was enough). For SmogonDatasets, your existing bootstrapping method worked great. Once configured with these sets, and after giving it some computational juice and teams to choose from, the bot is plug-and-play -- and plays really well!
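In case it's useful to anyone, a rough sketch of the kind of conversion I mean is below. The file path and JSON layout are guesses on my part (format -> species -> set name -> set), and the output structure should be adjusted to whatever TeamDatasets actually expects.

```python
import json
import urllib.request

# Rough sketch only: URL path and JSON layout are assumptions.
URL = "https://play.pokemonshowdown.com/data/sets/gen3.json"  # hypothetical path

with urllib.request.urlopen(URL) as resp:
    data = json.load(resp)

# Regroup the sample sets for one format per species.
sets_by_species = {}
for format_id, species_sets in data.items():
    if format_id != "gen3ou":
        continue
    for species, named_sets in species_sets.items():
        sets_by_species.setdefault(species, []).extend(named_sets.values())

with open("gen3ou_sets.json", "w") as f:
    json.dump(sets_by_species, f, indent=2)
```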

Being able to tell the bot to select teams from a directory at random is another huge positive. The way it has been set up makes it an invaluable prepping tool for tournaments in my eyes: you can practice specific matchups or player profiles without the risk of information leakage that usually comes with testing on ladder. And of course, most importantly, the bot puts up a good challenge that makes testing worthwhile -- well done!!
 
I've tried this project and feel it has considerable potential. Would you be able to tell me how to set the optimal search-time-ms and search-parallelism?
Additionally, I read the evaluation function in your poke-engine and would like to ask how you selected and set the values inside it.
 
Thank you so much for the kind words.

I completely forgot about that set data PS has. I am actually working on a data refactor right now and I'll have Foul Play auto-include that data.

Search parallelism is going to be limited by your hardware, so I can't give you a universal answer. Generally more is better, though I suspect there are diminishing returns beyond maybe 8 parallel battles searched; I don't have data to back this up.

Search time has an upper limit determined by the PS timer. PS gives 10 seconds per move (or was it 15?), so I usually allow about 7 seconds per decision. However, Foul Play will run 2 batches of searches if it detects that it has enough time left. Searching deeper also requires more memory, so this too is somewhat limited by hardware.
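To illustrate the batching idea (the numbers and names here are for illustration, not the actual scheduling code):

```python
# Illustrative time budgeting: spend one search batch per decision, and run
# a second batch only if enough of the move timer would remain afterwards.
PS_MOVE_TIME_MS = 10_000   # rough per-move allowance from the PS timer
SEARCH_TIME_MS = 7_000     # budget for one search batch
SAFETY_MARGIN_MS = 2_000   # slack for network latency and engine overhead

def batches_to_run(time_remaining_ms, max_batches=2):
    batches = 0
    while (time_remaining_ms - SAFETY_MARGIN_MS >= SEARCH_TIME_MS
           and batches < max_batches):
        batches += 1
        time_remaining_ms -= SEARCH_TIME_MS
    return batches

print(batches_to_run(PS_MOVE_TIME_MS))  # 1 batch fits in a fresh 10s timer
print(batches_to_run(17_000))           # 2 batches fit with banked time
```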
 
Thank you very much for your patient reply. My current setup is the 7000+14 combination, and it feels like it's working quite well. However, based on what you said, I'll need to reduce it and test again.

I've run extensive tests in the past few days and achieved a peak of 1930+ (Silver Badge) on the gen9ou ladder, which truly shows its power.

If it's convenient, and as I asked in my second question before, I'm very interested to know how exactly you set the parameter values in the evaluation function.

I just watched your quarterfinal match; you crushed your opponent by a large margin. Good luck in your upcoming matches. I was wondering whether the version you used in the tournament has been further optimized.
 
1930 is definitely impressive. I've had similar success in gen9ou as well.

Right, I didn't answer your other question. The evaluation function was made by trial and error, based on my understanding of certain formats and metagames; I do not think it is completely optimized. Tuning it algorithmically is unfortunately really slow, and I haven't had much success with it, though I'm sure it is possible. When I make changes to it I run a lot of quick self-play matches against the previous version, and if that shows promise I'll run it a bunch on the ladder.
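For what it's worth, a quick way to sanity-check a self-play run is a plain z-test on the win rate; a minimal sketch:

```python
import math

# Is the new version's win rate against the old one distinguishable from
# a coin flip? Normal-approximation one-sided z-test; reasonable for a few
# hundred games or more.
def is_improvement(wins, games, z_crit=1.96):
    p = wins / games
    se = math.sqrt(0.25 / games)  # standard error under the 50% null
    return (p - 0.5) / se > z_crit

# e.g. 540 wins in 1000 games gives z ~ 2.5: a small but real edge
print(is_improvement(540, 1000))  # True
```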
 
Here's my line of thinking:

First, gather a large amount of match data. While the eval.rs function is already very smart and handles crucial non-linear interactions (like checking for Guts in evaluate_burned, or HeavyDutyBoots in evaluate_hazards), its "values" are still built on a set of hand-tuned const constants.

My plan is to write a script that extracts features from the logs as X variables, then runs a comprehensive logistic regression (possibly with L2/Ridge regularization) against the battle's final result (win/loss as the Y variable). I know this linear model is too simple to learn those complex interactions automatically. Instead, its goal is to find the optimal average weights for the function's existing linear "skeleton."

The model's resulting weights will give a data-driven baseline to manually adjust those const values in eval.rs. Finally, we can conduct A/B tests (new params vs. old params in a Bo99) to verify that these adjustments had a positive effect.
This "Analyze -> Tweak -> A/B Test" loop is still lengthy, and its ceiling is still limited by the hand-crafted interactions.

A more advanced, long-term optimization would be to move toward a full Reinforcement Learning (RL) loop, similar to AlphaZero. This would involve replacing the entire hand-crafted eval.rs function with a non-linear neural network. The training could then be fully automated via self-play, which would remove the manual bottleneck and allow the AI to automatically discover all those complex, non-linear synergies that we are currently trying to code by hand.
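As a very schematic sketch of that direction (everything here is illustrative; the self-play plumbing that would generate states and outcomes is omitted):

```python
import torch
import torch.nn as nn

# Schematic only: a small value network standing in for the hand-crafted
# eval, trained on the final outcomes of self-play games.
class ValueNet(nn.Module):
    def __init__(self, n_features):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(n_features, 128), nn.ReLU(),
            nn.Linear(128, 1), nn.Tanh(),  # value in [-1, 1]
        )

    def forward(self, x):
        return self.net(x)

def training_step(model, optimizer, states, outcomes):
    # states: (batch, n_features) feature tensors from self-play positions
    # outcomes: (batch, 1) final results of those games (+1 win / -1 loss)
    optimizer.zero_grad()
    loss = nn.functional.mse_loss(model(states), outcomes)
    loss.backward()
    optimizer.step()
    return loss.item()
```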

On that note, I've noticed the current eval.rs logic seems to overlook the Unaware ability. This aligns perfectly with what I observed in my bot's test runs, where my Kingambit repeatedly tried to boost against Dondozo, which is a futile strategy. It's a great example of why enhancing those hand-crafted interactions is so critical.

I look forward to you optimizing them better in the future!
 