Mystery Ratings Demystified

Antar · Jan 21, 2014

tl;dr--the two "Mystery Ratings" on the ladders are systems we are testing out for suspect requirement purposes. They are designed so that a score of 2000 will indicate having achieved reqs.

If you've looked at the PS ladders today, you may have noticed two columns besides the familiar Elo, GXE and Glicko marked "Mystery Rating A" and "Mystery Rating B," and you may have asked yourself, "huh?"

Well, these two "mystery ratings" are two systems I personally designed that we're testing out for determining requirements for suspect tests.

In the past, determining reqs has been a thorny issue, because conventional rating systems (Elo, Glicko) are designed to measure skill, not achievement. So oftentimes the top-rated player has been someone who's, say, gone 20-0, sitting above a player with a few losses but many more games. Does that 20-0 player deserve reqs? Hell, do they deserve to top the ladder? The consensus has been an emphatic "no." But if estimated skill alone shouldn't determine player rating, what should?

The simplest thing would be to determine reqs based on W-L ratio and number of battles, but there's a problem with W-L ratio: Pokemon Showdown does matchmaking that attempts to pair players with players of the same skill level as often as possible, so in a perfect world, everyone would have the same win-rate of 50%.

But while we can't use literal win-rates, we can use another measure that we already compute to get at the same idea. GXE is a measure that was developed by Smogon legend X-Act as an alternative ranking system to conservative rating estimates such as ACRE. It represents the percentage odds of you winning a battle against a randomly selected person on the ladder. In other words, GXE corresponds to what we would expect your win-rate to be if we weren't doing any matchmaking. Thus, we use it as a substitute for W-L ratio.

And that leads me directly to our two "mystery ratings," which are actually named COIL and ARMS. They're both calculated entirely based on GXE and number of battles, and they're both designed such that a score of 2000 indicates having met reqs for most suspect tests. The difference between the two is that, with COIL, your rating eventually converges to a fixed value, while with ARMS, the number of points you gain from a victory never decreases, unlike what you may have noticed happening with ACRE, Elo and pretty much any other rating system.

The mathematical details of the two systems follow:

Your Converging Order-Invariant Ladder (COIL) score is calculated based strictly on your GXE and the number of battles you've played. The formula is the following:

C=40*GXE*2^(-B/N)

where N is the number of battles you've played and B is a threshold set prior to the suspect test corresponding to the minimum number of battles required to achieve reqs. Reqs is explicitly defined as a C of 2000, and the number of battles needed to reach reqs is a function of your GXE:

N=B/(1+log2(GXE/100))

So for a B of 40, a player with a GXE of 90 will require 48 battles to meet reqs, and a player with a GXE of 75 will require 69. A player with a GXE of 60 will need 153 battles to reach reqs (the formula gives about 152.07, and 152 battles leaves C just under 2000), and a player with a GXE of 50 never will.
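A minimal Python sketch of the COIL formula and the battles-needed calculation above. It assumes GXE is on a 0-100 scale and that a player with no battles has a score of 0 (the limit of the formula as N shrinks); the function names are illustrative and not taken from the Pokemon Showdown codebase.

```python
import math

def coil(gxe: float, n: int, b: float = 40.0) -> float:
    """COIL score C = 40 * GXE * 2^(-B/N) after n battles."""
    if n <= 0:
        return 0.0  # assumption: no battles played means no score
    return 40.0 * gxe * 2.0 ** (-b / n)

def coil_battles_needed(gxe: float, b: float = 40.0, target: float = 2000.0) -> float:
    """Smallest whole number of battles with coil(...) >= target, or inf."""
    limit = 40.0 * gxe  # value C approaches as n grows
    if limit <= target:
        return math.inf
    # Solve target = limit * 2^(-b/n) for n, then round up.
    return math.ceil(b / math.log2(limit / target))

for gxe in (100, 90, 75, 60, 50):
    print(gxe, coil_battles_needed(gxe))
```

Rounding up matters near the boundary: for a GXE of 60 the exact solution is roughly 152.07, so 152 battles leaves C slightly below 2000 and the 153rd battle is the one that crosses it.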

In the long term, a player's rating will converge to 40 times their GXE. So the very top players may end up with a COIL of around 4000, while good players will end up with COILs around 3000.

The formula for ARMS (Achievement Rating Measure [Simulated]) is quite a bit simpler, though foundationally the idea is the same, basing score entirely off of GXE and number of battles played:

A=1000+(2*GXE-100)*N/4

Note that, unlike with COIL, there is no parameter that needs to be set in advance. The only thing that will change between suspect tests is the value of A needed to satisfy reqs. Assume that reqs correspond to an A of 2000. Then the formula for number of battles needed to get reqs is:

N=4000/(2*GXE-100)

which means...

a player with GXE=100 needs 40 battles to get reqs
GXE=90: 50
GXE=75: 80
GXE=60: 200
GXE<=50: infinite

So under ARMS, it takes a few more battles to get reqs for non-perfect players than it does under COIL. However, the nice thing is that, if a tiering council leader decides after the fact that the bar was set too high, he or she can simply lower it, from 2000 to, say, 1800, which corresponds to:

32/40/64/160/infinite battles needed to get reqs for the above GXEs.
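The ARMS numbers above can be reproduced with a short Python sketch. It assumes GXE is on a 0-100 scale and is held fixed while battles accumulate; the function names are illustrative, not from any real ladder code.

```python
import math

def arms(gxe: float, n: int) -> float:
    """ARMS score A = 1000 + (2*GXE - 100) * N / 4."""
    return 1000.0 + (2.0 * gxe - 100.0) * n / 4.0

def arms_battles_needed(gxe: float, cutoff: float = 2000.0) -> float:
    """Smallest whole number of battles with arms(...) >= cutoff, or inf."""
    per_battle = (2.0 * gxe - 100.0) / 4.0  # score change per battle at fixed GXE
    if cutoff <= 1000.0:
        return 0  # everyone starts at 1000
    if per_battle <= 0:
        return math.inf
    return math.ceil((cutoff - 1000.0) / per_battle)

for cutoff in (2000.0, 1800.0):
    print(cutoff, [arms_battles_needed(g, cutoff) for g in (100, 90, 75, 60, 50)])
```

Because the cutoff only appears as the numerator, lowering it scales every player's required battle count by the same factor, which is why the bar can be moved after the fact.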

One problem with ARMS is that your rating never levels off--the more you play (as long as GXE>50), the higher your score. This makes assigning objective achievement levels somewhat difficult. In addition, in the long term, new players will be unable to catch up ratings-wise to players who have been playing for years. Also, ARMS shares a problem with COIL, in that losses still increase your score in the short term, that is, until your Glicko score and GXE update.

If the reaction to ARMS is particularly positive, we actually have a plan to address these concerns by implementing an ARMS-like system as a real ladder score called WARM (Windowed Achievement Rating Measure), which is calculated the following way:

All players start with a WARM of 1000.
Upon winning a battle, you gain a number of points proportional to your opponent's GXE, namely GXE_o/4 (so 25 points for the very best opponent, 13 for your average random opponent)
Upon losing a battle, you lose a number of points proportional to 100 minus your opponent's GXE, namely (100-GXE_o)/4.
If your WARM falls below 1000, it is reset back to 1000.
The "W" part of WARM refers to the fact that only 28 days of battles are tracked, so (1) inactive players will fall off the ladder relatively quickly, (2) new players will be able to catch up to old players and (3) ratings won't just continue growing off until infinity.
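The WARM rules listed above can be sketched in Python as follows. The post does not say how the 28-day window interacts with the 1000-point floor, so this sketch assumes the rating is recomputed by replaying only the battles from the last 28 days, oldest first, starting from 1000. The data layout and function names are hypothetical.

```python
from datetime import datetime, timedelta

WINDOW = timedelta(days=28)
BASE = 1000.0

def warm_delta(opponent_gxe: float, won: bool) -> float:
    """Points gained for a win (GXE_o/4) or lost for a loss ((100-GXE_o)/4)."""
    return opponent_gxe / 4.0 if won else -(100.0 - opponent_gxe) / 4.0

def warm(battles, now: datetime) -> float:
    """battles: iterable of (timestamp, opponent_gxe, won) tuples."""
    recent = sorted((b for b in battles if now - b[0] <= WINDOW),
                    key=lambda b: b[0])
    rating = BASE
    for _, opponent_gxe, won in recent:
        # Assumption: the floor is applied after every battle.
        rating = max(BASE, rating + warm_delta(opponent_gxe, won))
    return rating

now = datetime(2014, 2, 1)
history = [
    (now - timedelta(days=40), 100.0, True),   # outside the window, ignored
    (now - timedelta(days=3), 100.0, True),    # +25
    (now - timedelta(days=2), 50.0, False),    # -12.5
]
print(warm(history, now))
```

Replaying from scratch keeps the sketch simple; a real ladder would more likely update incrementally and expire old battles on a schedule.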

At the end of the day, the primary purpose of these rating systems is to facilitate determining reqs. In the sections above, I give a list of the number of battles needed to achieve reqs (reach a score of 2000) under each system.

In the below graph, the x-axis represents GXE score, and the y-axis represents number of battles you've had.

If you find the spot on the graph that corresponds to your GXE and your number of battles, and it's above the curves, then you've achieved reqs--it's as simple as that.

Note that ARMS reqs are harder to meet than COIL reqs, but this can of course be adjusted by changing some parameters. The idea here is mainly to give you a rough idea of how these things work.

So now the next step is to have some suspect tests, and for players and tiering heads to try out each system and decide which they like better. We're only going to keep one, so in a couple of weeks I'll be soliciting feedback about which system people like better. In the meantime, ignore the "mystery ratings" or start tracking them and getting a feel for how they work. I will provide CSV data of the complete ladder to anyone who wants it upon request.

Mr.E · Jan 21, 2014

Mystery Rating is my personal rating system of what I think about each individual battler.

Vryheid · Jan 21, 2014

"A=1000+(2*GXE-100)*N/4"

Assuming a GXE of 40 and someone who's finished 500 battles, I come out to an ARMS value of -1500, which clearly doesn't make any sense. Can you please explain this equation a little more, or am I missing something here?

Antar · Jan 21, 2014

Vryheid: a GXE of 40% indicates that, paired against random opponents, you would only win 40% of your battles, meaning you'll lose more often than you win. And yes, you'll end up with a very negative rating if you play for long enough. And no, you'll never achieve suspect reqs no matter how much you play.

All of this is by design. Note that your COIL score will asymptote to 1600, which is not negative, but it's also not enough to ever get reqs.
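As a quick check of the numbers in this exchange, here is the arithmetic for a GXE of 40 after 500 battles, written out in Python. The value B = 40 is an assumption carried over from the example in the first post; the COIL limit of 40 times GXE does not depend on it.

```python
gxe, n, b = 40.0, 500, 40.0  # b = 40 is assumed, as in the first post's example

arms = 1000.0 + (2.0 * gxe - 100.0) * n / 4.0   # ARMS formula
coil = 40.0 * gxe * 2.0 ** (-b / n)             # COIL formula
coil_limit = 40.0 * gxe                         # what COIL approaches as n grows

print(f"ARMS after {n} battles: {arms}")
print(f"COIL after {n} battles: {coil:.1f} (limit {coil_limit})")
```

ARMS keeps falling by 5 points per battle at this GXE, while COIL climbs toward 1600 from below without ever reaching 2000.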

Vryheid · Jan 21, 2014

Antar said:
Vryheid: a GXE of 40% indicates that, paired against random opponents, you would only win 40% of your battles, meaning you'll lose more often than you win. And yes, you'll end up with a very negative rating if you play for long enough. And no, you'll never achieve suspect reqs no matter how much you play.

All of this is by design. Note that your COIL score will asymptote to 1600, which is not negative, but it's also not enough to ever get reqs.

Alright, that makes sense. You could theoretically go infinitely high as long as you kept your win rate above 50%, interesting. I have one other question about the COIL formula:

C=40*GXE*2^(-B/N)

where B is a threshold set prior to the suspect test corresponding to the minimum number of battles required to achieve reqs. Reqs is explicitly defined as a C of 2000, and the number of battles needed to reach reqs is a function of your GXE:

By "Reqs is explicitly defined as a C of 2000" do you mean that Reqs is a constant separate from the above formula, or that "C" in the equation above literally is 2000? I think the former would make more sense but I'd like to make sure.

Antar · Jan 21, 2014

Vryheid said:
By "Reqs is explicitly defined as a C of 2000" do you mean that Reqs is a constant separate from the above formula, or that "C" in the equation above literally is 2000? I think the former would make more sense but I'd like to make sure.

I mean that with ARMS, if you want to change how difficult it is to get reqs, all you have to do is move the cutoff, but with COIL, you can't just say, "fuck it--we're gonna use a cutoff of 1800" because then weird things will happen, like players with GXEs of 45 getting reqs.
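The reason moving the COIL cutoff behaves differently can be shown with a small sketch. Since a COIL score approaches 40 times GXE, a cutoff is only ever reachable by players whose GXE is above cutoff/40, whereas under ARMS the dividing line stays at a GXE of 50 for any cutoff above the 1000 starting score. The function name is illustrative.

```python
def coil_gxe_floor(cutoff: float) -> float:
    """GXE at which the long-run COIL score (40 * GXE) equals the cutoff.

    Players at or below this GXE can never reach the cutoff under COIL.
    """
    return cutoff / 40.0

for cutoff in (2000.0, 1800.0):
    print(f"COIL cutoff {cutoff:.0f}: needs GXE above {coil_gxe_floor(cutoff):.0f}")
```

So dropping the COIL cutoff to 1800 lets players with a GXE just over 45, who would lose more often than they win against random opponents, eventually qualify.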

UltiMario · Jan 21, 2014

I'm going to comment on one thing I saw with the various ladder systems from someone I battled. Removing name because I'm going to try and minimize the callouts.

Elo 1540, GXE 71.4, Glicko 1673 ± 28, COIL 2414.2, ARMS 2765.5

This person had 98-67 w/l at the time, less than 1.5 w/l. It's not a good ratio, and based on his play and his team, I would not say this is someone I would want voting in the ladder system, but they've overachieved reqs by an incredibly large amount by COIL or ARMS, as well as having a more-than-respectable 1540 elo despite not being a player that really warrants the rank. Their COIL and ARMS are also significantly higher than those of most players closer to the top of the ladder, probably due to the sheer number of games. There are also people who definitely would make reqs, but their numbers simply don't agree with COIL or ARMS:

Elo 1717, GXE 84.7, Glicko 1825 ± 50, COIL 1901.5, ARMS 1832.8
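Both ladder rows can be checked against the formulas from the first post. The sketch below assumes the five numbers in each row are Elo, GXE, Glicko ± RD, COIL and ARMS in that order, and that B = 40. The first player's 165 battles come from the stated 98-67 record; the second player's battle count is not given, so 48 is inferred here by solving the ARMS formula for N.

```python
def coil(gxe: float, n: int, b: float = 40.0) -> float:
    """COIL score C = 40 * GXE * 2^(-B/N)."""
    return 40.0 * gxe * 2.0 ** (-b / n)

def arms(gxe: float, n: int) -> float:
    """ARMS score A = 1000 + (2*GXE - 100) * N / 4."""
    return 1000.0 + (2.0 * gxe - 100.0) * n / 4.0

def battles_from_arms(gxe: float, a: float) -> float:
    """Invert the ARMS formula to recover the number of battles."""
    return (a - 1000.0) * 4.0 / (2.0 * gxe - 100.0)

# Row 1: GXE 71.4, 98 wins + 67 losses = 165 battles.
print(round(coil(71.4, 165), 1), round(arms(71.4, 165), 1))

# Row 2: GXE 84.7, battle count inferred from the listed ARMS of 1832.8.
n2 = round(battles_from_arms(84.7, 1832.8))
print(n2, round(coil(84.7, n2), 1), round(arms(84.7, n2), 1))
```

Under these assumptions both rows come out within rounding of the listed values, which shows how much of the gap between the two players is explained by battle count alone (165 versus roughly 48).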

Yes, I like a system where we award people for playing more, where ACRE promoted not playing on an account after you reached your goal score. But these systems overly benefit those who play more, leading to people who are fairly awful getting to very high on the ladder and making reqs. The ladder will continue to inflate due to this, which leaves us in the same position ACRE left us.

With the current system in place, I can only ever imagine something like the way doubles did its suspect reqs (# of games + good w/l) being viable, since none of these scores seem valid unless you go by an extremely high elo that's impossible to reach by dumb luck like 1700 (and that's only for THIS month, maybe next month that big number will be 1900), which alienates most of the community from even having a chance to vote.

I don't know what the fix is, I don't know if there is a good fix, but imo what we're doing right now isn't giving us any better results than ACRE is, and it doesn't look like COIL or ARMS is going to either. Glicko-1 is even looking better than everything else at this point, and I just don't find that to be a good thing.

Antar · Jan 21, 2014

So, as I keep saying, W-L ratio is meaningless. Ideally everyone would have the same W-L ratio of 50%.

A GXE of 71.4 is quite good, and to get a Glicko RD of 28 means they must have had a metric ton of battles.

No one is (currently) suggesting that we rank the ladder in terms of COIL or ARMS. These are merely metrics for determining suspect reqs, and from what I've heard from tiering leaders, participation is more important than skill.

If I hear from tiering leaders that skill level is not being weighted highly enough, that can be adjusted, but so far I've heard nothing of the sort, and I've been bandying about ideas with them for months now.

TheDuckChris · Jan 22, 2014

I'm not understanding where the number of battles you have played comes in. As I see it, N is calculated beforehand based on your GXE (which changes, I think?). Doesn't this mean both rankings would just stay constant except for when your GXE changes? I understand the part about how many battles you need to play to get reqs, I'm just unsure how it fits in the equation.

Slayer95 · Jan 22, 2014

ChrisTehAwesome said:
I'm not understanding where the amount of battles you have played comes in. As i see it, N is calculated beforehand based on your GXE (which changes i think?). Doesn't this mean both rankings would just stay constant except for when your GXE changes? I understand the part about how many battles you need to play to get rankings, I'm just unsure how it fits in the equation.

N is not calculated beforehand. The expression with N on the left side comes from solving the COIL equation for the minimal N such that C >= 2000, given B = 40.

Of course, if you plug that number of battles into the definition of COIL, you will get the threshold score of 2000. However, for any other N, you will get a different C.
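To make the role of N concrete, the sketch below treats N as the running count of battles played and simply steps through it until C first reaches 2000, instead of using the closed-form expression. It assumes B = 40 and a GXE that stays fixed while battles accumulate; the function names are illustrative.

```python
def coil(gxe: float, n: int, b: float = 40.0) -> float:
    """COIL score after n battles: C = 40 * GXE * 2^(-B/N)."""
    return 40.0 * gxe * 2.0 ** (-b / n)

def first_n_reaching(gxe: float, target: float = 2000.0,
                     b: float = 40.0, max_n: int = 100_000):
    """Step through battle counts and return the first one with C >= target."""
    for n in range(1, max_n + 1):
        if coil(gxe, n, b) >= target:
            return n
    return None  # not reached within max_n battles

# C keeps rising with every battle at a fixed GXE of 90.
for n in (10, 20, 40, 47, 48, 100, 1000):
    print(n, round(coil(90.0, n), 1))

print(first_n_reaching(90.0), first_n_reaching(75.0), first_n_reaching(50.0))
```

The brute-force search lands on the same battle counts as the closed-form expression, and it never finds one for a GXE of 50 because C only approaches 2000 from below.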

TheDuckChris · Jan 22, 2014

hmm okay so N is just number of battles played, that makes sense. So those numbers then are just estimates of how many battles you would need to play for a given GXE?

Follow up question: does GXE eventually level off or will it keep changing?

Antar · Jan 23, 2014

GXE is a measure of skill. So if you get better (or worse) as a player or change your team, your GXE will change to reflect your new skill level.

fish6067 · Jan 24, 2014

I don't really understand the mathematical stuff all too well but I had an idea which I and a few other people in the studio thought would be an awesome system in the ladder. Now I don't have any formulas because like I said the mathematical stuff confuses me but I'll just try to describe it:

What if there is a rating system we base on... let's say Elo for the sake of the example. Now, the concern is Elo isn't favouring someone who has a really large streak (Antar said 20-0). I suggest PS implements a rating system that deviates positively or negatively as you build a longer streak of wins or losses respectively. So, if you get a win streak of 4, you will get slightly more points than normal (maybe 1.1x or something small), but if it gets to 7 or 8 you might get 1.5x what you would've gotten and so on. Now I'm not sure if this would lead to inflation or not and how big the multipliers would have to be, but if some mathematical wiz could put this into a formula to calculate skill I'd be really impressed! Of course, it may not be possible... :(

Oh yeah sorry if this isn't really the right place to post this but it seemed as good as any

Antar · Jan 24, 2014

fish6067, I'm not sure why you'd want to reward players based on luck (which is what a winning streak is in a haxxy game like Pokemon), but you can design rating systems until the cows come home. My approach to rating systems is that they either measure skill (which is grounded in solid math and probability theory) or they measure achievement, which is what these "mystery" rating systems are designed to do. I don't see how your system does either.

Mr.E · Jan 25, 2014

I'm no tiering leader but I would say skill is more important than participation with regard to suspect testing. Metagame balance should not be decided by average or slightly-above-average Joes who happened to play enough games to get a vote; they still lack the understanding necessary to distinguish things that simply require good and fair counterplay from true imbalance. Participation should be more rewarded for everyday ladder play and ranking.

Incidentally this seems to be the reverse of how the current ladder is set up, unless I'm dyslexic as all hell.

Red Cat · Jan 25, 2014

Mr.E said:
I'm no tiering leader but I would say skill is more important than participation with regard to suspect testing. Metagame balance should not be decided by average or slightly-above-average Joes who happened to play enough games to get a vote, they still lack the understanding necessary to distinguish things that simply require good and fair counterplay from true imbalance. Participation should be more rewarded for everyday ladder play and ranking.

I disagree with that statement. Voting privileges in suspect tests should not be a reward for being one of the best players. Anyone who displays sufficient competence and experience in battling should be able to vote. For suspect tests, experience is much more important since you have to play many teams which have the suspect and many teams which do not have the suspect to be able to make an informed judgement whether the suspect is over-centralizing or not. Skill does have some importance; if you are unwilling to improve your team or battling skills then most top threats will seem broken to you and you won't be able to make an informed decision. However, just because you are an expert at something does not mean you own a monopoly on the knowledge and ideas of that subject. A player with an extremely high winning percentage over fewer battles is not necessarily able to make an informed and unbiased decision on whether something should be banned or not. Suspect tests do not just affect the top of the ladder; they affect everyone on the ladder. As a result, suspect tests should allow a variety of skill levels to have a say. I think ARMS would be the better rating system for these reasons because ARMS is more flexible and values experience more.

Mr.E · Jan 26, 2014

Good point, a player should have a sufficient number of battles to ensure proper exposure to the suspect in question. That said, a player should still need to display a high degree of competence rather than simply being average. The opinions of lower ladder players are, if you actually care about competitive balance, not relevant precisely because these are most likely the people to whom "most top threats will seem broken to you and you won't be able to make an informed decision." That's speaking as a pure competitor, though; it's not necessarily the best way of cultivating and maintaining the largest possible community.

Red Cat · Jan 26, 2014

Mr.E said:
Good point, a player should have a sufficient number of battles to ensure proper exposure to the suspect in question. That said, a player should still need to display a high degree of competence rather than simply being average. The opinions of lower ladder players is, if you actually care about competitive balance, not relevant precisely because these are most likely the people to which "most top threats will seem broken to you and you won't be able to make an informed decision." That's speaking as a pure competitor, though, it's not necessarily the best way of cultivating and maintaining the largest possible community.

Since the distribution of player skills on the ladder is probably skewed with there being many more bad players than good, I agree that the "average" player is probably not very good. I'm not suggesting that people who use teams full of NU pokemon should be allowed to vote, but players who have shown that they are good battlers over many battles, but not necessarily great, should be allowed to vote. Anyway, it looks like the tentative threshold for the first OU suspect test is an Elo of 1700 with a Glicko 1 deviation less than 50. I think it is a good system, but we will see if 1700 is too high or not.

MJB · Jan 27, 2014

You guys have to bear in mind that the number of games needed to achieve reqs approaches infinity the closer you are to a 50% winrate, which, when you factor in matchmaking, will be everyone except the top of the ladder, as they will have far fewer equally skilled players to be matched with and so will be paired with weaker players. With the old ladder system (glicko2) it was rare for people under 1800 to have over a 50% winrate, which was about the top 10% or so of players iirc.

Antar · Jan 27, 2014

Incorrect, MJB. The number of battles tends to infinity as GXE, which is predicted winrate on a ladder without matchmaking, tends to 50.

DontStealMyPenguin · Feb 26, 2014

Has decay just been added??

Antar · Feb 26, 2014

DontStealMyPenguin, the "mystery ratings" are being discussed now by us muckety-mucks, but no--we'll likely be going a different route with these systems instead of adding decay.

Edit: to be clear--"these systems" refers to the Mystery Ratings, not to Glicko or Elo.

DontStealMyPenguin · Feb 26, 2014

My score went down by 10 points last night though. And the person below me's score went down by nine. None of us battled. Because if any of us did, we would have lost many more points as I'm 1st and the other guy was 2nd (ubers ladder).

Antar · Feb 26, 2014

DontStealMyPenguin

Your Elo you mean? That has nothing to do with this thread. If you mean your "mystery rating," then it's possible, because "mystery ratings" are based off of GXE which *does* experience *some* decay.

Edit: Elo is the number on the far left that the ladder is sorted by.

DontStealMyPenguin · Feb 26, 2014

Yeah sorry, it was my ELO that decayed
