Using Usage Statistics to Analyze Metagame Balance

migzoo

new money
Hi Smogon!

You probably don't know me, as I am not very active on the forums, but I am an active Showdown user with a particular interest in math and the power of data analysis in competitive Pokémon. I was recently thinking about how suspect testing and ban discussions tend to be extremely subjective, which led me to read this thread: http://www.smogon.com/forums/threads/characteristics-of-a-desirable-pokemon-metagame.66515/. Doug identifies Balance as one of the main desired characteristics of a metagame, and one that should be weighed heavily during the banning/suspecting process. But one of the main responses is that Balance, like the other characteristics he identifies, is hard to quantify. So, inspired in part by the Gini coefficient I learned about in microeconomics, I attempted to formulate a way to quantify the relative balance of a given metagame at a given point in time by analyzing usage statistics. The fruits of my labor are contained in the attached Excel file.

Now, on to my methodology. I started by pulling the usage percentages of the top 100 most used Pokémon in Gen 5 OU for every month from April 2011 to November 2013. For each month, I plotted those 100 points, with rank in the usage stats as the x-coordinate and usage percentage as the y-coordinate. I then ran a logarithmic regression on each month's data set, most of which yielded a respectable R^2 of around .98. Logically speaking, a metagame with a sharp drop-off in usage beyond the first handful of top Pokémon is relatively unbalanced, by Doug's definition of Balance. So one way to analyze balance is to look at the regression: the more negative the coefficient of the logarithmic term, the less balanced the metagame.

As points of comparison, I also calculated the 20:20 ratio and Palma ratio for each month's statistics (two metrics commonly used to measure income inequality). I was also going to calculate the Gini, Hoover, and Theil indices, but those are more tedious, and I was lazy. I then plotted the logarithmic coefficients, 20:20 ratios, and Palma ratios over time, and the graphs of the three metrics turn out to be very consistent with one another. Together they give us a post-mortem of the Gen 5 OU metagame: valleys indicate relative balance, peaks indicate relative imbalance. Also attached is a picture of the logarithmic coefficient graph, marked with bans and other significant events that may have affected the balance of the metagame. Some of the trends make sense: during the early stages of BW and BW2, the tier appeared relatively balanced as people were still getting a feel for the new metagame, and the downward trend from September 2011 onward indicates that Smogon's bans were generally good for the balance of OU.



If this idea takes off, it could have significant implications for Smogon policy. We could compare the balance of a metagame before and during a suspect test and use that comparison to help inform a ban verdict. In a perfect world, we would have ways to quantify all of the characteristics Doug described in his post; that would help shift the metagame development process from being subjective to being more objective. As of now, the only empirical evidence commonly used in suspect discussions is damage calcs (lol).

So anyways, if you managed to read all of this, congrats and thanks. Let me know what you think!
 


Hey, great thinking, using income-inequality concepts to measure metagame balance!
I was just wondering if you or anyone reading this knows anything about the algorithms used to decide bans? I remember one that worked on the basis of players reaching a certain high Elo score: if players could reach a very high Elo with Pokémon X against players not using that Pokémon, then X was OP. One advantage there is that it relies on a number reflecting actual wins with that Pokémon, not just its popularity.

You could apply the same Gini-curve principle to Elo scores for games where certain Pokémon are involved. But I don't know what the stats we have are like... where does Smogon let you get at all their data?
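For what it's worth, Smogon does publish monthly usage stats as plain-text tables at smogon.com/stats. A rough parser for the pipe-delimited rows might look like the sketch below; the column layout (rank, name, usage %) is assumed from the familiar file format and should be checked against the actual month you download:

```python
import re

# Matches rows like " | 1 | Landorus-Therian | 38.401% | ..." from a
# Smogon monthly usage-stats text file. Header and separator rows have
# no integer in the first column, so they are skipped automatically.
ROW = re.compile(r"\|\s*(\d+)\s*\|\s*([\w .'-]+?)\s*\|\s*([\d.]+)%")

def parse_usage(text, top_n=100):
    """Return [(rank, name, usage_pct), ...] for the first top_n rows."""
    rows = []
    for line in text.splitlines():
        m = ROW.search(line)
        if m:
            rows.append((int(m.group(1)), m.group(2), float(m.group(3))))
        if len(rows) >= top_n:
            break
    return rows
```

Pulling one file per month and feeding the usage column into the balance metrics would automate the manual data entry the OP describes.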
 

Disaster Area

formerly Piexplode
This is an interesting concept; sorry about the necrobump. I'm curious: is it possible to make a format for inputting usage stat documents over a series of months that's more user-friendly than it currently is, or does it require manually inputting data? I'm working on essentially a project of making my own fakemon metagame (it's a bit of a leap from CAP's concept), and a tool like this would be very interesting to have :]
 
Wow, this is really awesome; I would love to be able to compare and contrast the current metagame with past gens that were deemed "balanced" (such as Gen 3). This would really come in handy with the current suspect test on Metagrossite. I think you're absolutely right that it's laughable that the only empirical evidence used in suspect tests is damage calculations. I find our community's ability to shape tiering policy to be a bit…questionable at times.

I'm wondering, have you been keeping up with the current metagame or no?
 
Measuring balance like this is certainly something new, but I would not go as far as using it to decide the verdict of a suspect test.

When you look at trends, it's better to observe what happens in the long run; whatever happens right away is not that important. The increase in balance after a ban can be explained by players trying out new things in a new metagame, which inherently makes it more diverse. That's the short term, but in the long term? The graph slowly rises back to a peak as high as the pre-ban one. Something that was not broken pre-ban is now broken, and we're back to square one.

While you can interpret the bans as having a good effect on the meta (in the short term), you can also interpret them as Smogon doing nothing but creating an endless cycle of bans (in the long term). I'm sure that if we did something stupid like ban Heatran or Rotom-W, there would be a balance surge in the immediate 1-2 months, then a sharp rise back to imbalance a few months later.

If anything, the average trend seems to be going up, towards imbalance:



IMO these calculations should not be part of any suspect test.
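The long-run-versus-short-run distinction made above can be checked rather than eyeballed: smooth the monthly balance metric with a moving average wide enough to wash out the short-lived post-ban dips, then look at the direction of the smoothed series. A minimal sketch, assuming the monthly coefficients have already been extracted into a list:

```python
import numpy as np

def smoothed_trend(monthly_coeffs, window=6):
    """Moving average of a monthly balance metric.

    A six-month window suppresses the transient dips that follow each
    ban, so the long-run direction of the metric is easier to judge.
    """
    coeffs = np.asarray(monthly_coeffs, dtype=float)
    kernel = np.ones(window) / window
    return np.convolve(coeffs, kernel, mode="valid")
```

If the smoothed series still climbs across the whole Gen 5 period, that supports the "endless cycle of bans" reading; if it declines, it supports the OP's.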
 

migzoo

new money
I'm wondering, have you been keeping up with the current metagame or no?
I've been busy with school lately, but I keep up with the RU metagame.

I don't think I understand your point. In the second and third paragraphs you argue that what matters is the long term, and that Smogon's bans do nothing in the long term. But then you suggest that empirical evidence should not be used in suspect tests? If anything, this data is the best we've got in terms of visualizing the long term. I'd like to clarify that I don't want these calculations to be the sole basis of metagame decisions, but rather a piece of evidence that can be used in metagame/suspect discussions.
 
I don't think I understand your point. In the second and third paragraphs you argue that what matters is the long term, and that Smogon's bans do nothing in the long term. But then you suggest that empirical evidence should not be used in suspect tests? If anything, this data is the best we've got in terms of visualizing the long term. I'd like to clarify that I don't want these calculations to be the sole basis of metagame decisions, but rather a piece of evidence that can be used in metagame/suspect discussions.
Having this data is certainly better than having none. But short-term figures have little use in telling us whether a ban was appropriate; that's all I'm saying. How does comparing the balance of a suspect ladder mean anything? It's a new ladder; you can't be certain things will stay that way, not unless you can predict what the metagame will look like once it settles down.

While I'm at it, might as well spill out some other problems I see in your model:
  • It does not measure balance at all; it measures what we call "centralization" (or what Doug refers to as "variety"). Balance itself is much more than equality of usage. Remember when Greninja was around: unviable stuff like Porygon2 started getting more usage to counter it. This is a bad thing, yet more Porygon2 usage = less centralization, which inherently shows up as a good thing on your graphs.
  • Doug says "variety is the spice of life but too much variety is chaos." So having perfect 20:20, Palma, etc. ratios is actually a bad thing. For your measurement to work, it should be compared against the ratio of a "balanced" metagame. I don't know how we should determine that number, but I know it's not going to be anywhere near 1. Basically, the troughs on your graphs are not necessarily a good thing either.
  • You didn't actually calculate the proper 20:20 or Palma ratios: you assumed the Pokémon ranked 81-100 are the bottom 20%. Done correctly, it should be the true bottom 20%. I'm aware that doing it that way is meaningless, because the true bottom 20% of mons are not even relevant to the metagame; the real problem is that 20:20 and Palma don't model our metagame very well. A better approach is something like the Gini index (as you mentioned) or the lambda of an exponential distribution, which look at the overall picture rather than just the top and bottom x%.

Don't take it that I'm criticizing you for the sake of it, I'm doing so in case it might help you improve a few things. Perhaps the last two points are the same problem - that you somehow decided that a perfectly "balanced" uncentralized metagame consists of the top 100 pokemon having the same usage.
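For reference, the Gini index suggested in the bullets above is straightforward to compute over a month's usage shares. A minimal sketch (the function name is mine), using the standard formula over values sorted ascending:

```python
def gini(values):
    """Gini coefficient of non-negative values.

    0 means perfectly even usage across all Pokemon; values approaching
    1 mean usage concentrated in a single Pokemon. Unlike 20:20 or
    Palma, every rank contributes, not just the top and bottom slices.
    """
    xs = sorted(values)  # ascending
    n = len(xs)
    total = sum(xs)
    # Standard form: G = 2 * sum(i * x_i) / (n * total) - (n + 1) / n
    weighted = sum(i * x for i, x in enumerate(xs, start=1))
    return 2 * weighted / (n * total) - (n + 1) / n
```

As the second bullet argues, a Gini of 0 would not actually be desirable for a metagame; the number is only useful for comparing months against each other.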
 

migzoo

new money
  • It does not measure balance at all; it measures what we call "centralization" (or what Doug refers to as "variety"). Balance itself is much more than equality of usage. Remember when Greninja was around: unviable stuff like Porygon2 started getting more usage to counter it. This is a bad thing, yet more Porygon2 usage = less centralization, which inherently shows up as a good thing on your graphs.
Fair enough. The definitions of balance and centralization are hazy at best.
  • Doug says "variety is the spice of life but too much variety is chaos." So having perfect 20:20, Palma, etc. ratios is actually a bad thing. For your measurement to work, it should be compared against the ratio of a "balanced" metagame. I don't know how we should determine that number, but I know it's not going to be anywhere near 1. Basically, the troughs on your graphs are not necessarily a good thing either.
  • You didn't actually calculate the proper 20:20 or Palma ratios: you assumed the Pokémon ranked 81-100 are the bottom 20%. Done correctly, it should be the true bottom 20%. I'm aware that doing it that way is meaningless, because the true bottom 20% of mons are not even relevant to the metagame; the real problem is that 20:20 and Palma don't model our metagame very well. A better approach is something like the Gini index (as you mentioned) or the lambda of an exponential distribution, which look at the overall picture rather than just the top and bottom x%.
This project is still a work in progress; I plan on implementing Gini in the future. I only used the top 100 and 20:20/Palma for simplicity's sake. On the matter of the "perfect" number, it's impossible to come up with such a number objectively (not that the community hasn't come up with arbitrary numbers before). Different people have different preferences with regards to the metagame, so they will have different ideas of what is "too centralized/balanced" or "not centralized/balanced enough." Until someone comes up with such a number (if ever), I think this metric will still be useful for determining relative centralization/balance.

Thanks for the recent wave of feedback, everyone; there was little actual feedback when I first posted this, so your responses are much appreciated. I've been re-motivated to work on this project. I also encourage others to work on this themselves; I might make the project open-source once I dig up the script I wrote to parse usage data.
 
Hey, did you ever end up getting anywhere with this project? I, too, have been inspired by my economics courses to use similar metrics to analyze balance/centralization with the goal of finding a better way to resolve suspect tests.
 
