tl;dr--the two "Mystery Ratings" on the ladders are systems we are testing out for suspect requirement purposes. They are designed so that a score of 2000 will indicate having achieved reqs.
If you've looked at the PS ladders today, you may have noticed two columns besides the familiar Elo, GXE and Glicko marked "Mystery Rating A" and "Mystery Rating B," and you may have asked yourself, "huh?"
Well, these two "mystery ratings" are two systems I personally designed that we're testing out for determining requirements for suspect tests.
In the past, determining reqs has been a thorny issue, because conventional rating systems (Elo, Glicko) are designed to measure skill, not achievement. So oftentimes the top-rated player has been someone who's gone, say, 20-0, sitting above a player with a few losses but many more games. Does that 20-0 player deserve reqs? Hell, do they deserve to top the ladder? The consensus has been an emphatic "no." But if estimated skill alone shouldn't determine player rating, what should?
The simplest thing would be to determine reqs based on W-L ratio and number of battles, but there's a problem with W-L ratio: Pokemon Showdown does matchmaking that attempts to pair players with players of the same skill level as often as possible, so in a perfect world, everyone would have the same win-rate of 50%.
But while we can't use literal win-rates, we can use another measure that we already compute to get at the same idea. GXE is a measure that was developed by Smogon legend X-Act as an alternative ranking system to conservative rating estimates such as ACRE. It represents the percentage odds of you winning a battle against a randomly selected person on the ladder. In other words, GXE corresponds to what we would expect your win-rate to be if we weren't doing any matchmaking. Thus, we use it as a substitute for W-L ratio.
And that leads me directly to our two "mystery ratings," which are actually named COIL and ARMS. They're both calculated entirely based on GXE and number of battles, and they're both designed such that a score of 2000 indicates having met reqs for most suspect tests. The difference between the two is that, with COIL, your rating eventually converges to a fixed value, while with ARMS, the number of points you gain from a victory never decreases the way you may have noticed it doing with ACRE, Elo and pretty much any other rating system.
The mathematical details of the two systems follow:
Your Converging Order-Invariant Ladder (COIL) score is calculated based strictly on your GXE and the number of battles you've played. The formula is the following:
C=40*GXE*2^(-B/N)
where N is the number of battles you've played and B is a threshold, set prior to the suspect test, corresponding to the minimum number of battles required to achieve reqs (that is, the number a player with a GXE of 100 would need). Reqs is explicitly defined as a C of 2000, and the number of battles needed to reach reqs is a function of your GXE:
N=B/(1+log2(GXE/100))
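For anyone who wants the missing step, that expression comes from setting C to 2000 in the COIL formula and solving for N (reading N as the number of battles played). It only has a solution when GXE is above 50:

```latex
\begin{align*}
40 \cdot \mathrm{GXE} \cdot 2^{-B/N} &= 2000 \\
2^{-B/N} &= \frac{50}{\mathrm{GXE}} \\
\frac{B}{N} &= \log_2\frac{\mathrm{GXE}}{50} = 1 + \log_2\frac{\mathrm{GXE}}{100} \\
N &= \frac{B}{1 + \log_2(\mathrm{GXE}/100)}
\end{align*}
```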
So for a B of 40, a player with a GXE of 90 will require 48 battles to meet reqs, and a player with a GXE of 75 will require 69. A player with a GXE of 60 will need about 152 battles to reach reqs, and a player with a GXE of 50 never will.
In the long term, a player's rating will converge to 40 times their GXE. So the very top players may end up with a COIL of around 4000, while good players will end up with COILs around 3000.
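If you'd rather play with the numbers yourself, here is a minimal Python sketch of the two COIL formulas above. The function names, the default B of 40 and the handling of zero battles are my own choices for illustration, not anything taken from the PS code:

```python
import math


def coil(gxe: float, battles: int, b: float = 40.0) -> float:
    """COIL score C = 40 * GXE * 2^(-B/N) after N battles."""
    if battles <= 0:
        return 0.0  # assumption: no battles played means no score yet
    return 40.0 * gxe * 2.0 ** (-b / battles)


def coil_battles_needed(gxe: float, b: float = 40.0) -> float:
    """Battles N at which C reaches 2000: N = B / (1 + log2(GXE/100))."""
    if gxe <= 50.0:
        return math.inf  # 40 * GXE tops out at 2000 or less, so reqs are never reached
    return b / (1.0 + math.log2(gxe / 100.0))


# Print the (fractional) battle counts for a few sample GXEs.
for sample_gxe in (100, 90, 75, 60, 50):
    print(sample_gxe, coil_battles_needed(sample_gxe))
```

The results are fractional (roughly 47.2, 68.4 and 152.1 for GXEs of 90, 75 and 60), so round up to get a whole number of battles.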
The formula for ARMS (Achievement Rating Measure [Simulated]) is quite a bit simpler, though foundationally the idea is the same, basing score entirely off of GXE and number of battles played:
A=1000+(2*GXE-100)*N/4
Note that, unlike with COIL, there is no parameter that needs to be set in advance. The only thing that will change between suspect tests is the value of A needed to satisfy reqs. Assume that reqs correspond to an A of 2000. Then the formula for number of battles needed to get reqs is:
N=4000/(2*GXE-100)
which means...
- a player with GXE=100 needs 40 battles to get reqs
- GXE=90: 50
- GXE=75: 80
- GXE=60: 200
- GXE<=50: infinite
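The same arithmetic as a quick Python sketch, in case you want to try other GXEs or a different reqs threshold. The function names and the `target` parameter are made up for illustration; the formulas are the two given above:

```python
import math


def arms(gxe: float, battles: int) -> float:
    """ARMS score A = 1000 + (2*GXE - 100) * N / 4 after N battles."""
    return 1000.0 + (2.0 * gxe - 100.0) * battles / 4.0


def arms_battles_needed(gxe: float, target: float = 2000.0) -> float:
    """Battles needed for A to reach `target` (2000 gives N = 4000 / (2*GXE - 100))."""
    if gxe <= 50.0:
        return math.inf  # the score never rises above 1000 at or below 50 GXE
    return 4.0 * (target - 1000.0) / (2.0 * gxe - 100.0)


# Print the battle counts for the GXEs listed above.
for sample_gxe in (100, 90, 75, 60, 50):
    print(sample_gxe, arms_battles_needed(sample_gxe))
```

Raising or lowering `target` is the one knob that would change between suspect tests.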
One problem with ARMS is that your rating never levels off--the more you play (as long as GXE>50), the higher your score. This makes assigning objective achievement levels somewhat difficult. In addition, in the long term, new players will be unable to catch up ratings-wise to players who have been playing for years. Also, ARMS shares a problem with COIL: losses still increase your score in the short term, that is, until your Glicko score and GXE update.
If the reaction to ARMS is particularly positive, we actually have a plan to address these concerns by implementing an ARMS-like system as a real ladder score called WARM (Windowed Achievement Rating Measure), which is calculated the following way:
- All players start with a WARM of 1000.
- Upon winning a battle, you gain a number of points proportional to your opponent's GXE, namely GXE_o/4 (so 25 points for the very best opponent, 13 for your average random opponent)
- Upon losing a battle, you lose a number of points proportional to 100 minus your opponent's GXE, namely (100-GXE_o)/4.
- If your WARM falls below 1000, it is reset back to 1000.
- The "W" part of WARM refers to the fact that only 28 days of battles are tracked, so (1) inactive players will fall off the ladder relatively quickly, (2) new players will be able to catch up to old players and (3) ratings won't just keep growing off to infinity.
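To make those rules concrete, here is a rough Python sketch of how a WARM score could be computed from a battle log. WARM isn't implemented yet, so plenty here is assumption on my part: the `(timestamp, won, opponent_gxe)` record layout, recomputing the score from scratch over the battles still inside the window, and applying the 1000-point floor after every battle.

```python
from datetime import datetime, timedelta

WINDOW = timedelta(days=28)  # only the last 28 days of battles count


def warm(battles, now):
    """WARM score from (timestamp, won, opponent_gxe) records, oldest first."""
    score = 1000.0  # all players start with a WARM of 1000
    for when, won, opp_gxe in battles:
        if now - when > WINDOW:
            continue  # battle has aged out of the window
        if won:
            score += opp_gxe / 4.0            # gain GXE_o/4 for a win
        else:
            score -= (100.0 - opp_gxe) / 4.0  # lose (100-GXE_o)/4 for a loss
        score = max(score, 1000.0)            # assumed: floor applied after each battle
    return score


now = datetime(2014, 1, 29)
log = [
    (datetime(2013, 12, 1), True, 100.0),  # older than 28 days, ignored
    (datetime(2014, 1, 10), True, 100.0),  # +25
    (datetime(2014, 1, 11), False, 60.0),  # -10
    (datetime(2014, 1, 12), True, 50.0),   # +12.5
]
print(warm(log, now))
```

Whether the floor gets applied per battle or only at the end, and whether the score is recomputed or decayed as battles age out, would change the numbers, so treat this as one possible reading of the rules.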
At the end of the day, the primary purpose of these rating systems is to facilitate determining reqs. In the sections above, I give a list of the number of battles needed to achieve reqs (reach a score of 2000) under each system.
In the graph below, the x-axis represents your GXE, and the y-axis represents the number of battles you've had.

If you find the spot on the graph that corresponds to your GXE and your number of battles, and it's above the curves, then you've achieved reqs--it's as simple as that.
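The same check can be done without the graph: plug your GXE and battle count into both formulas and compare against 2000. A rough sketch, assuming B=40 for COIL and a reqs threshold of 2000 for both systems (the function name is made up for illustration):

```python
def meets_reqs(gxe: float, battles: int, b: float = 40.0):
    """Return (coil_ok, arms_ok) for a given GXE and number of battles."""
    # COIL: C = 40 * GXE * 2^(-B/N); treated as 0 with no battles (assumption)
    coil = 40.0 * gxe * 2.0 ** (-b / battles) if battles > 0 else 0.0
    # ARMS: A = 1000 + (2*GXE - 100) * N / 4
    arms = 1000.0 + (2.0 * gxe - 100.0) * battles / 4.0
    return coil >= 2000.0, arms >= 2000.0


# A 90-GXE player at 48 battles versus 50 battles.
print(meets_reqs(90, 48))
print(meets_reqs(90, 50))
```

Being "above the curve" for a system is the same thing as that system's entry coming back True.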
Note that ARMS reqs are harder to meet than COIL reqs, but this can of course be adjusted by changing some parameters. The idea here is mainly to give you a rough idea of how these things work.
So now the next step is to have some suspect tests, and for players and tiering heads to try out each system and decide which they like better. We're only going to keep one, so in a couple of weeks I'll be soliciting feedback about which system people prefer. In the meantime, either ignore the "mystery ratings" or start tracking them to get a feel for how they work. I will provide CSV data of the complete ladder to anyone who wants it upon request.