An OU Bot

Approved by AM.

Hi, I wrote a bot that automatically plays on the OU ladder (EDIT: I asked Smogon staff members for permission before running it online). Currently, it is extremely bad. Its only win so far was against a poor chap who gave Toxic Orbs to both a Breloom and a Gliscor, neither of which had Poison Heal. Oops.
While bad now, it should theoretically learn over time, but I have no idea what asymptotic skill level it will converge on. Hopefully much higher.

I heard that there may already be plenty of other bots playing, but I haven't been able to find any info. Let me know if you have any!


This post explains the general idea of what is going on. I discuss the concepts in general terms, followed by more technical explanations hidden under tags.
I feel like my code looks like this, so I am a little embarrassed about it. I have never done anything like this before; I got so excited reading about different concepts and ideas that I wanted to try them out, and settled on Pokemon as a guinea pig. If I started over, I would do a lot of things very differently.

There are three levels of learning going on:
1) Choosing a Team
2) Making inferences about the opponent's team
3) Deciding on actions



1) Team Choice
This is simple: the bot currently has a list of 52 teams and records an action value for each. If a team wins, its action value goes up; if it loses, its action value goes down.
The probability of picking a team is related to its action value: the better the team has performed, the more likely the bot is to pick it.

The action values are essentially running means. After each game, they are updated according to the following formula:
if the team won:
new action value = old action value + alpha * (1 - old action value)
if the team lost:
new action value = old action value - alpha * old action value

If alpha were equal to 1 / (total number of games the team has played), the action value would be identical to the arithmetic mean of its results, and, assuming a static metagame and static skill levels, it would converge on the team's true win rate. In reality, neither assumption holds: metagames change, and hopefully the AI's skill will increase, as will that of its opponents as it slowly climbs the ladder. For this reason, an alpha that decreases over time is not ideal, so a fixed alpha of 0.05 is used.
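In code, this update is just a small step toward 1 on a win and toward 0 on a loss; a minimal sketch (the names here are mine, not the bot's):

```python
ALPHA = 0.05  # fixed step size: a constant alpha keeps tracking a shifting metagame

def update_action_value(old_value, won):
    """Nudge a team's action value toward 1 after a win, toward 0 after a loss."""
    target = 1.0 if won else 0.0
    return old_value + ALPHA * (target - old_value)
```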

To choose a team given the action values, the bot calculates
e^(action value / tau)
where tau equals 0.1, for each team.
The probability of picking a team equals this number divided by the sum of all such numbers for the teams. For example, if there were three teams with action values of 0.55, 0.5, and 0.4, we would have:
e^(0.55/0.1) = 245
e^(0.5/0.1) = 148
e^(0.4/0.1) = 55
making their probabilities of being chosen roughly 0.55, 0.33, and 0.12.

The constant tau decides how strongly the choice is biased toward higher-rated teams: the lower tau is, the less likely the bot is to pick a team it thinks is bad.
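Here is a minimal sketch of the softmax choice described above (function and variable names are illustrative, not taken from the bot's code):

```python
import math
import random

TAU = 0.1  # lower tau = greedier team choice

def choose_team(action_values, tau=TAU):
    """Pick a team index with probability proportional to e^(value / tau)."""
    weights = [math.exp(v / tau) for v in action_values]
    total = sum(weights)
    probs = [w / total for w in weights]
    return random.choices(range(len(action_values)), weights=probs)[0]

# Action values of 0.55, 0.5, 0.4 give pick probabilities of about 0.55, 0.33, 0.12.
print(choose_team([0.55, 0.5, 0.4]))
```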


2) Inferences about the opponent's team
While reality itself may be certain - and that includes what your opponent is using - none of us are blessed to know it all.
When it comes to figuring things out, things don't get better than Bayes' theorem (sorry, frequentists).
Imagine a huge number of different worlds, different realities where different facts hold true. We're living in one of them, but don't yet know which one it is.
In some worlds, Chansey holds Eviolite and in others it holds Leftovers - but from the get-go, we can say the former is far more likely.
Every time something happens, like an attack that causes damage or one Pokemon out-speeding another, the bot asks "how likely was this to happen, given each possible reality?"
A Skarmory taking little damage from a physical attack is likely in a world where it has maximum Defense and HP investment, less so in a world where it is specially defensive, and less still in a world where it has the stat spread of a sweeper.
If something was more likely to happen in one world than another -- and then that something happened -- the bot will assign a higher probability to the belief that it is living in that world, and a lower probability to the others.

In this way, if the bot sees a Skarmory take more damage from an attack than expected, it will guess that the Skarmory may also be faster and hit harder than it thought.
In some cases, though, this is very simple. "The opposing Skarmory restored a little HP using its Leftovers!" is going to appear with probability 0 in the worlds where that Skarmory is holding a Rocky Helmet or a Shed Shell.


Formally, Bayes' theorem is:
p(H|E) = p(H) * p(E|H) / p(E)
meaning: the probability of a hypothesis given some evidence equals
the prior probability we assigned to that hypothesis, times how likely the evidence was to show up if that hypothesis were true, divided by how likely the evidence was to show up across the set of all hypotheses. You can substitute "reality" or "world we're living in" for "hypothesis".
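As a concrete illustration, here is the kind of update performed over item hypotheses; the priors and likelihoods below are made-up numbers, not Smogon stats:

```python
def bayes_update(priors, likelihoods):
    """priors: {hypothesis: p(H)}; likelihoods: {hypothesis: p(E|H)}.
    Returns the posterior p(H|E) for every hypothesis."""
    unnormalized = {h: priors[h] * likelihoods[h] for h in priors}
    evidence = sum(unnormalized.values())  # p(E), summed over all hypotheses
    return {h: v / evidence for h, v in unnormalized.items()}

# Made-up priors over Chansey's item, then the game prints
# "The opposing Chansey restored a little HP using its Leftovers!"
priors = {"Eviolite": 0.90, "Leftovers": 0.08, "Shed Shell": 0.02}
likelihoods = {"Eviolite": 0.0, "Leftovers": 1.0, "Shed Shell": 0.0}
print(bayes_update(priors, likelihoods))  # -> Leftovers: 1.0, everything else: 0.0
```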

Prior probabilities come from Smogon's published stats. It is simple to get prior probabilities for items, abilities, and attacks.
Stat distributions are a little more complicated. If you look at the 1825-movesets file, we see the following list of spreads for Landorus-Therian, the most used Pokemon among the highest ranked players:
Spreads
Jolly:0/252/0/0/4/252 11.981%
Jolly:0/252/4/0/0/252 8.519%
Impish:252/4/252/0/0/0 5.306%
Jolly:0/252/24/0/0/232 5.005%
Impish:252/0/240/0/8/8 4.813%
Jolly:32/252/4/0/0/220 4.533%
Other 59.842%

The first two have basically the same stats, and it often gets worse when you look at the remaining 59.842% -- there are huge numbers of extremely similar hypotheses to compare. We don't have the time to care about +/- 4 EVs.
For each spread used more than (currently) 1 in 1,000 times, the Pokemon's stats are calculated. The bot then uses affinity propagation to cluster the stat spreads so that there is a manageably small number of spreads each Pokemon may belong to.
Kernel density estimates, with Gaussian kernels, are then used to represent the variation of each stat within each group. These (along with the uniform 0.85-1.00 damage roll that Pokemon damage calculators use) produce likelihood estimates ( p(E|H) ) for damage, as well as for who outspeeds whom.
Items and abilities are mixed in as well, so that Choice items, Assault Vests, Sharp Beaks, etc. are considered too.
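Roughly, the clustering and density-estimation step could look like the following sketch using scikit-learn and SciPy; the stat rows are placeholders, not real usage data:

```python
import numpy as np
from sklearn.cluster import AffinityPropagation
from scipy.stats import gaussian_kde

# Each row: calculated stats (HP / Atk / Def / SpA / SpD / Spe) for one spread of a
# single Pokemon, keeping only spreads used more than 1 in 1,000 times.
# These rows are placeholders, not real usage data.
stat_spreads = np.array([
    [319, 389, 217, 158, 207, 309],  # offensive-looking spreads...
    [319, 389, 221, 158, 206, 305],
    [327, 389, 218, 158, 206, 297],
    [385, 267, 372, 158, 217, 200],  # ...and defensive ones
    [385, 265, 368, 158, 221, 204],
    [389, 267, 366, 158, 219, 202],
])

clusterer = AffinityPropagation().fit(stat_spreads)
labels = clusterer.labels_  # cluster assignment for each spread

# Within one cluster, a Gaussian KDE over a single stat (here Speed) gives a
# likelihood p(E|H) for observations such as who outsped whom.
cluster_speeds = stat_spreads[labels == labels[0], 5]
speed_kde = gaussian_kde(cluster_speeds)
print(speed_kde(303))  # density of a Speed stat of 303 under this cluster
```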

This is slow, so the bot relies heavily on caching here. Whenever a Pokemon is encountered for the first time, its initial state of uncertainty is computed, saved to the hard drive, and loaded from disk on every later encounter. Each time the probabilities are updated, the cache loaded into memory is discarded and recalculated. If none of the opponent's six Pokemon have been seen before, the bot will run out of time before the game even starts and lose; if all have been seen, it acts quickly. I am pre-building caches for the 200 most used Pokemon, as I don't want it to be a jerk and make people wait on it.
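The caching can be as simple as pickling each Pokemon's initial belief state to disk; a sketch, with the file layout and function names being purely illustrative:

```python
import os
import pickle

CACHE_DIR = "belief_cache"  # hypothetical location for the precomputed priors

def load_initial_beliefs(pokemon_name, build_beliefs):
    """Return the precomputed state of uncertainty for a Pokemon, building and
    saving it the first time that Pokemon is ever encountered."""
    path = os.path.join(CACHE_DIR, pokemon_name + ".pkl")
    if os.path.exists(path):
        with open(path, "rb") as f:
            return pickle.load(f)
    beliefs = build_beliefs(pokemon_name)  # the slow part: clustering, KDEs, priors
    os.makedirs(CACHE_DIR, exist_ok=True)
    with open(path, "wb") as f:
        pickle.dump(beliefs, f)
    return beliefs
```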


3) Deciding on actions
To be succinct: an artificial neural network (NN) uses information on the game to compare the value of decisions and choose one.
This information includes:
- Weighted average of damage estimates for each of the bot's Pokemon's attacks vs. each of the opponent's Pokemon. The weights are the probabilities assigned to each of the opponent's Pokemon's possible sets (sketched in code after this list).
- Weighted average of damage estimates for the four most probable attacks of each of the opponent's Pokemon vs. each of the bot's Pokemon, weighted the same way.
- HP of each Pokemon, as well as "effective HP", which equals HP minus entry hazard damage for Pokemon on the bench.
- Status conditions
- General board condition information: eg weather, entry hazards, is a Pokemon locked into an attack or trapped?
- Probabilities of each Pokemon out-speeding one another
- Limited to only the Pokemon that are out:
- Weighted average of the ratio of the bot's Pokemon's Speed to that of each of the opponent's Pokemon. I figured this would be useful for deciding whether or not to use a Speed-boosting move.
- Same as the above, comparing the opposing Pokemon to each of the bot's.
- General information on attacks. For example, priority, probability of burning the opponent, etc.
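The weighted-average damage features above boil down to something like this (a sketch; estimate_damage and the hypothesis list are placeholders, not the bot's actual code):

```python
def expected_damage(move, attacker, defender_hypotheses, estimate_damage):
    """Average a damage estimate over the sets the opposing Pokemon might have.

    defender_hypotheses: list of (hypothesis, probability) pairs reflecting the
                         bot's current beliefs about the opponent's set.
    estimate_damage:     a damage calculator, e.g. one built on the KDEs above.
    """
    return sum(prob * estimate_damage(move, attacker, hypothesis)
               for hypothesis, prob in defender_hypotheses)
```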

Given such information, the NN tries to predict the best action.
I am far from an expert on neural networks. There is a lot of theory relating network structure to what networks are capable of learning; I have heard it said that one's understanding of this is what separates the men from the boys in machine learning. I am still a boy, with a lot to learn.
I am using two hidden layers of sigmoid neurons, as Pokemon play involves making a huge number of conditional decisions, e.g.:
- You may do something different depending on whether you're faster or they're faster. Unless priority is involved.
- Do something different depending on whether they're 1.49 times faster or 1.51 times faster (dragon dance, or not?).
- If they're burned, have status, etc, you may not want to try to burn them. Or maybe you do, if they might switch.
- Restoring health is great, unless you're already at full health or they can do more than 50% damage.
- Clearing entry hazards is great sometimes, but bad others - eg when your side is clear.
- Similar with laying them: stealth rock is awesome the first time, but less stellar the third.

The specific reason WHY is what is hard to learn. Sometimes Power Whip is bad because it missed. Which of the 897 data inputs explains that?
We humans are smart because we can use logic to narrow the near-infinite range of possible hypotheses down to a tangibly small number: the facts that they happened to be paralyzed, that the fourth Pokemon in their party was KOed, that some other attack has a chance to paralyze, and that the fifth knows a priority move have nothing to do with it. We know it was the accuracy, and the move just happened to miss. All the conditionals mentioned above are even more complicated to figure out.
The NN has a hard time; it needs a huge sample of games to rule out bad hypotheses and begin to narrow its focus onto those that actually work.

An example of something wrong it learned: if the opponent isn't in range to be KOed in one hit, use not-very-effective attacks to catch whatever switches in with a heavy hit. This works sometimes -- using Fire Punch instead of Outrage against their Water-type might catch their Ferrothorn -- however, I've seen it use Earthquake against someone's mono-Flying team.
The better hypothesis it should have learned is to look for the attack that deals the most damage to the opponent's benched Pokemon, not to pick attacks that deal little to the one that is out.
Coding superstitious AI is easy. Logic and rationality are hard.

I used PyBrain's reinforcement learning package with a few modifications.
Currently I am using a NN with two hidden layers, the first with 180 sigmoid neurons and the second with 90 -- two layers so that it can learn all the conditional decision boundaries.
When battling the AI against itself, this size consistently outperformed a larger network. The larger one can learn more complex relationships -- including more complex superstitions -- so it needs a much larger sample of games before it learns anything useful. I will keep training both sizes, and with a big enough sample of games the bigger one may eventually outperform the smaller.

The network is trained with Rprop-, which simply shifts each weight a fixed step in the direction that reduces the error, rather than by a fraction of the error gradient as standard backprop does.
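For reference, a network of this shape and an Rprop- trainer can be set up in PyBrain roughly like this (a sketch of the standard PyBrain API, not the bot's actual training loop):

```python
from pybrain.tools.shortcuts import buildNetwork
from pybrain.structure import SigmoidLayer
from pybrain.supervised.trainers.rprop import RPropMinusTrainer
from pybrain.datasets import SupervisedDataSet

# 906 inputs (897 state features + 9 action flags), two sigmoid hidden layers,
# one output: the predicted action value for that state-action pair.
net = buildNetwork(906, 180, 90, 1, hiddenclass=SigmoidLayer)

dataset = SupervisedDataSet(906, 1)  # filled with (state + action flags, target value)
trainer = RPropMinusTrainer(net, dataset=dataset)
# trainer.train() then runs one epoch of Rprop- over the stored samples.
```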

Reinforcement Learning Paradigm
The NN gets fed 906 inputs: the 897 for the state, plus one more for each of the 9 possible actions, indicating whether that is the action being evaluated (the 9 actions are attacks 1-4 plus switching to Pokemon 2 through 6; Mega Evolution happens automatically, and at the start of the game choosing any attack action sends out Pokemon 1 as the lead).

For each possible action, the neural network predicts an action value. Using a Gibbs softmax, as with choosing teams, it then picks its move. Tau decreases steadily over time as the network gets better at accurately predicting values.

When training the network, it looks at datasets of past games and tries to make the network more accurately match each state-action pair to:
current estimated value + 0.5 * (reward given + estimated value of the best action in the state that followed - current estimated value)
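In other words, the training target is a standard temporal-difference update with a step size of 0.5 and no discounting; a sketch:

```python
LEARNING_RATE = 0.5

def training_target(current_value, reward, best_next_value):
    """Value the network is trained toward for one state-action pair.

    current_value:   the network's current estimate for (state, action)
    reward:          reward received that turn (e.g. +1 win, -1 loss, 0 otherwise)
    best_next_value: highest estimated value among the legal actions in the next state
    """
    return current_value + LEARNING_RATE * (reward + best_next_value - current_value)
```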

Initially, I set the reward to 0 except when a game ends, where it was +1 for a win and -1 for a loss. What this does, eventually, is shift the values of every turn before the last to reflect the impact each decision has on the bot's ability to eventually win: a good action is one that brings you to a state from which you are more likely to win the game, and a bad action is one that brings you to a state where that is less likely.
This method bears some resemblance to consequentialism and evolution: it only sees the end result, yet that is enough to eventually reinforce everything we want (animals like eating food; cats like chasing things because that behavior helps them find it), even though those intermediate behaviors are one step removed from the only criterion that ultimately matters. KOing a Pokemon should be reinforced, as it leads to a state from which the bot is more likely to win.
This, however, is slow. I am growing impatient with the bot's flagrant stupidity, and am feeling compelled to intervene and specifically reinforce behaviors I want to see. It took selective breeding far less time to produce a greyhound than it took evolution to produce a cheetah (although, admittedly, the cheetah is quicker - despite running being the greyhound's sole purpose in life - AND the cheetah has thicker leg bones, so it suffers running-related injuries less often...).

The biggest reason for this change:
- As it is playing online vs. real humans, it will hardly ever win, so I need to provide another source of reinforcement; otherwise all possible actions end in the same reward, and all the AI will learn is "nothing matters". I don't want the neural network to suffer from something comparable to learned helplessness or depression. :p

I am going to provide a reward every time it KOs an opposing Pokemon. I may consider rewarding / punishing other things -- switching, perhaps, as the bot has an unnatural proclivity for switching.

The biggest change I made to PyBrain's code was adding "restriction lists": in Pokemon, some of the 9 moves are often forbidden on a given turn. You can't switch to a Pokemon that is knocked out, or use Earthquake while locked into Outrage.
Thus, when the NN makes decisions, the bot
a) removes the illegal moves from its list of possible actions, and
b) saves this list, so that when it learns from the game later it doesn't overestimate the "estimated value of the best action for the state that followed".
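In pseudocode, the masking amounts to something like this (names are illustrative, and the softmax step is omitted for brevity):

```python
def evaluate_legal_actions(net, state_features, legal_actions, num_actions=9):
    """Score only the actions on this turn's restriction list.

    state_features: list of 897 floats describing the game state.
    legal_actions:  indices of the actions allowed this turn, e.g. [0, 1, 2, 5]
                    when some moves or switches are forbidden.
    Illegal actions are never scored, so they also can't inflate the
    "best action for the state that followed" term during training.
    """
    values = {}
    for action in legal_actions:
        action_flags = [1.0 if i == action else 0.0 for i in range(num_actions)]
        values[action] = net.activate(state_features + action_flags)[0]
    return values
```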

This, however, also allows me to forbid actions that are not mechanically illegal - just almost always terrible when certain conditions are met, but otherwise fine.


In other words, I can intervene and forbid actions I think are almost always bad. So far, I have forbidden:
-Using "Rain Dance" while it is already raining, if the opponent can't change the weather.
-Using stealth rock, if the opponent's side already has rocks and the opposing pokemon can't that is out can't learn defog or rapid spin.
-Continuing to use stat increasing moves after that stat already reached 6.
-Heal bell, when no one has a status condition.
-Defog or rapid spin if the opponent doesn't know a hazard move, and the bot's side doesn't currently have hazards. Rapid spin is allowed if the Pokemon is afflicted by leech seed.
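Each of these is just a hand-written predicate checked before the restriction list is built. For example, the Stealth Rock rule might look like the following sketch, with hypothetical field names:

```python
def stealth_rock_forbidden(battle):
    """Forbid Stealth Rock when rocks are already up and the opposing Pokemon
    that is out can't remove them. All field names here are hypothetical."""
    return (battle.opponent_side.has_stealth_rock
            and not battle.opponent_active.can_learn("Defog")
            and not battle.opponent_active.can_learn("Rapid Spin"))
```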

Three questions:
-Any other ideas to add?
-Anyone know anything about machine learning?
-Anyone know anything about programming JavaScript? I know nothing, and my code has been awkwardly running a browser, clicking buttons, and reading text. It would be faster (and probably more stable) to communicate with the server directly. I would rather not run any more Firefox windows than necessary; I could not get any headless browsers to successfully run the JavaScript.
 
RichieTheGarchomp

Hmm... I swear I heard that bots/cheating weren't allowed. I would have thought that bots were a form of cheating, but I guess not.

Still, that's pretty cool anyway.
 

Disaster Area

yah TFL greninja'd me; there's a lot to learn from David Stone's work. If you get to something approaching that then I'm happy to help with suggestions and stuff :]

[Obi = David Stone. PS: you can still find him on IRC, although he's idle a lot of the time. He'll be in #smogon and #pokemon at least when he's on, as well as #insidescoop]
 

leremyju

I mean, the metagame is really diverse; maybe you should start small, like LC or PU. It would require learning a different tier, but if you can make it work there, it would be easier to apply to a higher tier.
 
This is sick; I feel like they should make a ladder specifically for bots, lol. It does seem kinda cheap, but hey, if it's not banned, who cares? Also, would it be possible to make a bot with some sort of AI that works like an amiibo (not sure I spelled that right :/)? That could be way cool!
 
This seems really interesting, it would be nice if we could have an army of them. However, I am uncertain if this is the correct place to post this. Good luck!
 
Hmm... I swear I heard that bots/cheating weren't allowed. I would have thought that bots were a form of cheating, but I guess not.

Still, that's pretty cool anyway.
Maybe you were getting mixed up with spambots? I thought the same though.

Skipping team building would really help; machines are rarely useful for that. Just put a few good teams in there and let it pick between them. Also, train the bot before putting it on the ladder.
 

blinkie

There are a lot of Challenge Cup 1v1 bots; maybe you could start there, as it would be a lot easier.
 
Am I the only one not hyped about having bots on the ladder? Hearthstone had a major bot epidemic a few months back, and playing against the things was hell (I heard one managed to reach Rank 1 Legend for a little while). While I doubt this will become as widespread, the ethics of the whole thing bothers me a bit. Most people play on Showdown to play other people, not bots (otherwise they would just play the Battle Tower or something).

That said, I'm really impressed that you managed to put this together, so nice work.
 

Disaster Area

formerly Piexplode
If people actually made top-level bots then we would handle it appropriately (and thanks to obi's work, the issue has been raised and discussed before, so it's not desperately unfamiliar territory), but as it is I don't think anyone's made one anywhere near as good as obi's.
 

Albacore

This looks amazing! I also really want to see replays of this, mainly to see how the bot works, what misplays it commits and how it can be improved.

A few things I can think of to improve this:
- Preventing the bot from setting up beyond what is necessary to sweep the rest of the team.
- Making sure the bot tries to get SR up as soon as possible
- Using Substitute whenever the opposing Pokemon cannot break it
- Using recovery if the Pokemon is expected to be under 50% health by the end of the turn and if the opponent cannot 2HKO it.
- This is pretty vague, but the more useful or needed a Pokemon is to win, the more conservatively it should be played and kept healthy. If a Pokemon is too weak to be of use, it should not be saved.
 

Disaster Area

- This is pretty vague, but the more useful or needed a Pokemon is to win, the more conservatively it should be played and kept healthy. If a Pokemon is too weak to be of use, it should not be saved.
depends on the tier
- Making sure the bot tries to get SR up as soon as possible
SR can be overvalued (read on obi's bot - obi had to halve the value of SR and it improved the bot's play massively)
- Using Substitute whenever the opposing Pokemon cannot break it
- Using recovery if the Pokemon is expected to be under 50% health by the end of the turn and if the opponent cannot 2HKO it.
Both are situational. Sub's useless in that scenario if your own health is below Sub-creating range, whilst using recover whilst poisoned might make the scenario worse.

You potentially have scenarios where both of these are true and that's conflicting.
 