
(Re-)Introducing Foul Play: A Competitive Pokemon Battle Bot

Hello once again Smogon,

It has been ... 6 years(!?) since I first posted about a Pokemon Showdown battle-bot that I've been working on. I'm here to share an update as I believe I've made some good progress.

The unnamed project has been re-branded as foul-play. It is still a singles-focused battle bot that only plays formats with Species Clause, and it does not yet support formats with Mega Evolution, Z-Moves, or Dynamax. It continues to rely on a search-based engine that looks into the future, guided by a static, hand-crafted evaluation function, though I have completely rewritten the Python battle engine in Rust as poke-engine. One of the most impactful changes is that minimax has been replaced by Monte Carlo search. This was a game changer for Pokemon, as Monte Carlo search deals far better with simultaneous-move games.
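To illustrate the difference, here is a minimal sketch of decoupled UCT, one common way to adapt Monte Carlo search to simultaneous-move games. This is illustrative only, not poke-engine's actual implementation; the Node class and its fields are invented for the example.

```python
import math

# Illustrative decoupled UCT: each player keeps independent UCB statistics
# over their own moves, so neither side's pick for a turn is conditioned
# on seeing the other's pick for that same turn.
class Node:
    def __init__(self, my_moves, opp_moves):
        self.my_moves, self.opp_moves = my_moves, opp_moves
        self.my_visits = [0] * len(my_moves)
        self.my_value = [0.0] * len(my_moves)
        self.opp_visits = [0] * len(opp_moves)
        self.opp_value = [0.0] * len(opp_moves)
        self.total = 0

    def _ucb_pick(self, visits, values):
        for i, n in enumerate(visits):
            if n == 0:
                return i  # try every move once before exploiting
        return max(
            range(len(visits)),
            key=lambda i: values[i] / visits[i]
            + 1.41 * math.sqrt(math.log(self.total) / visits[i]),
        )

    def select_joint_move(self):
        # Both sides choose independently -- the key difference from
        # minimax, which assumes one player moves with full knowledge
        # of the other's choice.
        return (
            self._ucb_pick(self.my_visits, self.my_value),
            self._ucb_pick(self.opp_visits, self.opp_value),
        )

    def update(self, my_idx, opp_idx, result):
        # result: 1.0 = win for "me" after simulating the joint move
        self.total += 1
        self.my_visits[my_idx] += 1
        self.my_value[my_idx] += result
        self.opp_visits[opp_idx] += 1
        self.opp_value[opp_idx] += 1.0 - result
```

Because each side's statistics are decoupled, the search converges toward mixed strategies rather than the deterministic best responses minimax would commit to, which is exactly what you want when moves are chosen simultaneously.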

How does it do? In my opinion it is an above-average player when it is able to predict the opponent's unknowns well. It cannot consistently dominate the top of the ladder, but it can achieve placements that are definitely impressive. Furthermore, since the engine does not understand battle mechanics with 100% accuracy, it is certainly exploitable if the opponent knows it is playing against a bot.

Here are some results I've achieved with the bot. Note that (at least in my opinion) peak Elo is not always indicative of skill; the rank #4 in gen3ou, for example, was due to a lucky run. The GXE values shown were all recorded after the Glicko deviation had dropped below 50.
[Image: ladder results (peak ranks and GXE) across several formats]


If you would like to read a bit more about some of the techniques used by foul-play and poke-engine to achieve these rankings, as well as see some replays of the bot battling, check out: https://pmariglia.github.io/posts/foul-play
 
Sorry if I'm dumb and misunderstood it, but would it be possible to challenge it personally (send it a battle request) and use it as a training bot?
 
Is there a way I can play against it via a challenge? I'd love to try it out if possible.
You'd have to build and run it yourself locally. I'd love to provide something that people can challenge but running Foul Play, especially at its strongest settings, takes a fair bit of resources.
 
This is fascinating! Do you have any of the replays associated with the bot's run to #4 on the gen 3 OU ladder?

There are sample replays for a few different formats, including gen3ou, towards the end of this page: https://pmariglia.github.io/posts/foul-play

You can see a broader set of replays by going on https://replay.pokemonshowdown.com and searching for the two accounts I commonly tested on: Accelerock Ttar and Playing Foul. Unfortunately, these will be biased towards losses as I configured Foul Play to save replays on loss unless I was actively observing and found it interesting for some other reason.

Here are two gen3ou wins though:
- 1600 elo win
- 1800 elo win
 
I have been trying to challenge it myself for a while, but I can't seem to get it to send me a challenge request or even accept a challenge. Can you help with that?
 
I kid you not, I literally started working on a project like this, specifically for gen 3, last month. I have a bot that uses extremely similar techniques to the ones you outlined, and I've been testing it on a local Showdown server. It even beat me one time! (Attached replay) (I got too cocky)
The funny thing is that I was using poke-engine for my simulator too, so I really should have noticed the recent updates and realized you were already working on this. Oh well, at least I can rest easy knowing that somebody who actually knows how to program has done better than I ever could.

Are you going to continue working on the bot to improve it, or are you satisfied with it as is? My lofty goal was to make something that was genuinely superhuman at playing pokemon (in my case just gen3ou), and I think that's definitely possible based on what you've managed here.

I have a few technical questions as well since I'm curious:

1) Are you predicting unrevealed Pokemon on the opponent's team, or just narrowing down and predicting sets for the revealed Pokemon? If you aren't predicting unrevealed Pokemon, I've had moderate success doing so (using the teammate data for revealed Pokemon from Smogon usage stats), and I think it could lead to a significant improvement.

2) Have you looked into using sampling algorithms other than MCTS? I came across this paper, which presents a new sampling algorithm and compares it to different variants of MCTS. It looks like you're using something similar to their MCTS-UCT algorithm, so there's potentially room for improvement there?

3) Have you tested the bot with different team archetypes, or mostly the same teams? From the teams in the winning replays it looks like mostly offense, so I was curious whether it still performs well with more defensive/stall structures. I noticed that my bot often makes poor long-term decisions or underestimates moves like Toxic, which have a high but delayed impact.

Finally it goes without saying but this is a really awesome and exciting project so thanks for doing it!
 

If you have a problem installing/running, I'd suggest opening a GitHub issue.


1) Yes. For random battles the unrevealed Pokemon are sampled from the pool of Pokemon PS would put on the team. For formats like Gen3OU I do a very non-scientific sampling of the most likely Pokemon. This does not intelligently try to infer how a team is composed, but it uses Smogon's available usage statistics to do its best; I'm sure there's room for improvement here. (There's a rough sketch of the idea below this list.)

2) I have not. These seem interesting, I will definitely give them a read.

3) You're right, it's mostly offense/balanced offense that the bot is good with. Long-term decisions are hard to reason about with a search-based engine that can only see perhaps 10 turns ahead in the best case. There have to be clear HP gains made in that time for the engine to see the value.
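To make answer 1) concrete, here is a hedged sketch of sampling unrevealed teammates from usage statistics. This is an illustration rather than Foul Play's actual code; the teammates table is assumed to be shaped like the "Teammates" field in Smogon's chaos stat exports (a per-Pokemon dict of co-occurrence scores).

```python
import random

# Illustrative only: fill unrevealed team slots by sampling from teammate
# co-occurrence data. `teammates` maps a Pokemon name to {candidate: score};
# a higher score means the two appear on teams together more often.
def sample_unrevealed(revealed, teammates, team_size=6):
    team = list(revealed)
    while len(team) < team_size:
        scores = {}
        for mon in team:
            for candidate, score in teammates.get(mon, {}).items():
                if candidate not in team and score > 0:
                    scores[candidate] = scores.get(candidate, 0.0) + score
        if not scores:
            break  # no usable data; leave the remaining slots unknown
        candidates, weights = zip(*scores.items())
        team.append(random.choices(candidates, weights=weights)[0])
    return team
```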
 

Regarding 1), I did some experimenting and it looks like the bot just fills in the sides with level 1 Pikachus until there are six Pokemon on each side (instead of guessing likely teammates). Is this just a placeholder that is ignored, or is there something I'm missing?

Also, a couple of other things I noticed from the logs: the team converter appears to be ignoring IVs, and the |teamsize| battle message isn't being processed or used (useful for when the opponent tries a one-Pokemon cheese strategy to prevent phazing with Roar, etc.). I'll make issues for these.


Here's an example of what I'm talking about: the state being used and the result of the MCTS are identical despite the meaningful difference between teamsize = 6 and teamsize = 1 for the opponent.
[Screenshots: MCTS state and output, identical for opponent teamsize = 6 and teamsize = 1]
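For reference, |teamsize| arrives as a sim-protocol line like |teamsize|p2|6. A minimal sketch of consuming it (battle_state here is a hypothetical stand-in for wherever the bot tracks per-player information):

```python
# Minimal sketch of handling the |teamsize| protocol message,
# e.g. "|teamsize|p2|6".
def handle_protocol_line(line, battle_state):
    parts = line.split("|")
    if len(parts) >= 4 and parts[1] == "teamsize":
        player, size = parts[2], int(parts[3])
        # With size < 6, the search should not pad that side with guessed
        # teammates (and phazing moves like Roar have no targets to drag in).
        battle_state.team_sizes[player] = size
```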



Let me know if I am being annoying, I'm new to this stuff.
 
I've been using this battle bot for the past few weeks to practice formats that are non-ladderable / ladders that are inactive, and this is some seriously impressive work.

The build process is very streamlined if you have Docker installed, which is very much appreciated.

To populate TeamDatasets for an arbitrary format, I used the sample sets available under https://play.pokemonshowdown.com/data/sets (converting them to the correct structure, which I think was enough). For SmogonDatasets, your existing bootstrapping method worked great. Once configured with these sets, and after giving it some computational juice and teams to choose from, the bot is plug-and-play -- and plays really well!
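In case it's useful to anyone, a rough sketch of the kind of conversion I mean is below. The file path and JSON layout are guesses on my part (format -> species -> set name -> set), and the output structure should be adjusted to whatever TeamDatasets actually expects.

```python
import json
import urllib.request

# Rough sketch only: URL path and JSON layout are assumptions.
URL = "https://play.pokemonshowdown.com/data/sets/gen3.json"  # hypothetical path

with urllib.request.urlopen(URL) as resp:
    data = json.load(resp)

# Regroup the sample sets for one format per species.
sets_by_species = {}
for format_id, species_sets in data.items():
    if format_id != "gen3ou":
        continue
    for species, named_sets in species_sets.items():
        sets_by_species.setdefault(species, []).extend(named_sets.values())

with open("gen3ou_sets.json", "w") as f:
    json.dump(sets_by_species, f, indent=2)
```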

Being able to tell the bot to select teams from a directory at random is another huge positive. The way it has been set up makes it an invaluable prepping tool for tournaments in my eyes: you can practice specific matchups or player profiles without the risk of information leakage that usually comes with testing on ladder. And of course, most importantly, the bot puts up a good challenge that makes testing worthwhile -- well done!!
 
I've tried this project and feel it has considerable potential. Would you be able to tell me how to set the optimal search-time-ms and search-parallelism?
Additionally, I read the evaluation function in your poke-engine and would like to ask how you selected and set the values inside it.
 
Thank you so much for the kind words.

I completely forgot about that set data PS has. I am actually working on a data refactor right now and I'll have Foul Play auto-include that data.

Search parallelism is going to be limited by your hardware, so I can't give you a universal answer. Generally more is better, though I suspect there are diminishing returns beyond maybe 8 parallel battles searched; I don't have data to back this up.

Search time has an upper limit determined by the PS timer. PS gives 10 seconds per move (or was it 15?), so I usually allow about 7 seconds per decision. However, Foul Play will run 2 batches of searches if it detects that it has enough time left. Searching deeper also requires more memory, so this too is somewhat limited by hardware.
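To illustrate the batching idea (the numbers and names here are for illustration, not the actual scheduling code):

```python
# Illustrative time budgeting: spend one search batch per decision, and run
# a second batch only if enough of the move timer would remain afterwards.
PS_MOVE_TIME_MS = 10_000   # rough per-move allowance from the PS timer
SEARCH_TIME_MS = 7_000     # budget for one search batch
SAFETY_MARGIN_MS = 2_000   # slack for network latency and engine overhead

def batches_to_run(time_remaining_ms, max_batches=2):
    batches = 0
    while (time_remaining_ms - SAFETY_MARGIN_MS >= SEARCH_TIME_MS
           and batches < max_batches):
        batches += 1
        time_remaining_ms -= SEARCH_TIME_MS
    return batches

print(batches_to_run(PS_MOVE_TIME_MS))  # 1 batch fits in a fresh 10s timer
print(batches_to_run(17_000))           # 2 batches fit with banked time
```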
 
Thank you very much for your patient reply. My current setup is the 7000+14 combination, and it feels like it's working quite well. However, based on what you said, I'll need to reduce it and test again.

I've run extensive tests in the past few days and achieved a peak of 1930+ (Silver Badge) on the gen9ou ladder, which truly shows its power.

If it's convenient, and as I asked in my second question before, I'm very interested to know how exactly you set the parameter values in the evaluation function.

I just watched your quarterfinal match; you crushed your opponent by a large margin. Good luck in your upcoming matches. I was wondering whether the version you used in the tournament has been further optimized.
 
1930 is definitely impressive. I've had similar success in gen9ou as well.

Right, I didn't answer your other question. The evaluation function was made by trial and error, based on my understanding of certain formats and metagames; I do not think it is completely optimized. Tuning it algorithmically is unfortunately really slow, and I haven't had much success with it, though I'm sure it is possible. When I make changes to it I run a lot of quick self-play matches against the previous version, and if that shows promise I'll run it a bunch on the ladder.
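For what it's worth, a quick way to sanity-check a self-play run is a plain z-test on the win rate; a minimal sketch:

```python
import math

# Is the new version's win rate against the old one distinguishable from
# a coin flip? Normal-approximation one-sided z-test; reasonable for a few
# hundred games or more.
def is_improvement(wins, games, z_crit=1.96):
    p = wins / games
    se = math.sqrt(0.25 / games)  # standard error under the 50% null
    return (p - 0.5) / se > z_crit

# e.g. 540 wins in 1000 games gives z ~ 2.5: a small but real edge
print(is_improvement(540, 1000))  # True
```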
 
Here's my line of thinking:

First, gather a large amount of match data. While the eval.rs function is already very smart and handles crucial non-linear interactions (like checking for Guts in evaluate_burned, or HeavyDutyBoots in evaluate_hazards), its "values" are still built on a set of hand-tuned const constants.

My plan is to write a script that extracts features from the logs as X variables, then runs a comprehensive logistic regression (possibly with L2/Ridge regularization) against the battle's final result (win/loss as the Y variable). I know this linear model is too simple to learn those complex interactions automatically. Instead, its goal is to find the optimal average weights for the function's existing linear "skeleton."

The model's resulting weights will give a data-driven baseline to manually adjust those const values in eval.rs. Finally, we can conduct A/B tests (new params vs. old params in a Bo99) to verify that these adjustments had a positive effect.
This "Analyze -> Tweak -> A/B Test" loop is still lengthy, and its ceiling is still limited by the hand-crafted interactions.

A more advanced, long-term optimization would be to move toward a full Reinforcement Learning (RL) loop, similar to AlphaZero. This would involve replacing the entire hand-crafted eval.rs function with a non-linear neural network. The training could then be fully automated via self-play, which would remove the manual bottleneck and allow the AI to automatically discover all those complex, non-linear synergies that we are currently trying to code by hand.
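As a very schematic sketch of that direction (everything here is illustrative; the self-play plumbing that would generate states and outcomes is omitted):

```python
import torch
import torch.nn as nn

# Schematic only: a small value network standing in for the hand-crafted
# eval, trained on the final outcomes of self-play games.
class ValueNet(nn.Module):
    def __init__(self, n_features):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(n_features, 128), nn.ReLU(),
            nn.Linear(128, 1), nn.Tanh(),  # value in [-1, 1]
        )

    def forward(self, x):
        return self.net(x)

def training_step(model, optimizer, states, outcomes):
    # states: (batch, n_features) feature tensors from self-play positions
    # outcomes: (batch, 1) final results of those games (+1 win / -1 loss)
    optimizer.zero_grad()
    loss = nn.functional.mse_loss(model(states), outcomes)
    loss.backward()
    optimizer.step()
    return loss.item()
```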

On that note, I've noticed the current eval.rs logic seems to overlook the Unaware ability. This aligns perfectly with what I observed in my bot's test runs, where my Kingambit repeatedly tried to boost against Dondozo, which is a futile strategy. It's a great example of why enhancing those hand-crafted interactions is so critical.

I look forward to you optimizing them better in the future!
 