As has been remarked upon in other threads, Pokémon Showdown has a problem with systemic smurfing incentivized by two main sources: our current suspect test procedure, and OLTs. This smurfing creates bad experiences for genuine new and low-ladder players, wildly inconsistent experiences for suspect test participants ranging from tediously easy to brutally difficult, and significant damage to our weighted usage statistics. Slayer95 has proposed a solution in this forum that aims to treat the symptoms of smurfing. That proposal spawned from a discussion on Discord where he also suggested another idea, which I worked with him to flesh out: bypass the creation of new accounts for suspect tests and let users qualify with their existing, already-ranked accounts by resetting only the rating deviation of their Glicko score. This aims to treat one of the major causes of smurfing instead. After further discussion, Chains of Markov and I realized that also resetting a player’s W/L record (something players can already reset themselves) would allow existing accounts to work for COIL suspects as well. Having realized that this idea and the one in the previously linked proposal were compatible but not dependent on one another, Slayer and I agreed they could be posted separately.
Allow me to explain some of the systems already at work. Suspect tests nowadays frequently set a minimum GXE as a requirement, alongside other requirements like minimum Elo or number of games; see OU’s two most recent suspects of Gliscor and Kyurem. GXE is derived from a user’s Glicko score, which itself is in two parts: a rating and a rating deviation, e.g. 1831 ± 37. By Glickman’s design, a player’s rating deviation, or RD for short, decreases as they play games and increases gradually over time. On Pokémon Showdown, a player who is inactive for one year (365 days to be exact) will have their RD increase to the maximum of 130, indicating that their rating is as uncertain as a new player’s. GXE represents the calculated odds a given player has of beating a new player with the default Glicko score (1500 ± 130), and serves as an accurate estimate of their odds of beating the average player on the ladder. It increases as a player’s Glicko rating increases, and if their rating is over 1500, it also increases as their RD gets smaller. Additionally, players with over 100 RD are hidden from PS’s ladder rankings.
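For anyone who wants to check the numbers, here is a minimal sketch of the GXE calculation as described above: Glickman's expected-outcome formula applied against a default 1500 ± 130 player. The function names are my own, and I'm assuming both players' RDs are combined the way Glickman does for predicting a game; this is not taken from PS's server code. With those assumptions it gives 91.5% for 1954 ± 61 and 90.5% for 1954 ± 130, matching the simulation logs later in this post.

```python
import math

Q = math.log(10) / 400  # Glicko scaling constant


def g(rd: float) -> float:
    """Glickman's attenuation factor: shrinks toward 0 as uncertainty grows."""
    return 1 / math.sqrt(1 + 3 * Q**2 * rd**2 / math.pi**2)


def expected_score(r1: float, rd1: float, r2: float, rd2: float) -> float:
    """Chance that player 1 beats player 2, accounting for both RDs."""
    combined_rd = math.hypot(rd1, rd2)  # sqrt(rd1^2 + rd2^2)
    return 1 / (1 + 10 ** (-g(combined_rd) * (r1 - r2) / 400))


def gxe(rating: float, rd: float,
        base_rating: float = 1500.0, base_rd: float = 130.0) -> float:
    """GXE in percent: odds of beating a brand new (default) player."""
    return round(100 * expected_score(rating, rd, base_rating, base_rd), 1)


print(gxe(1954, 61))   # established account
print(gxe(1954, 130))  # same rating right after an RD reset
```

Note how the same rating yields a lower GXE at RD 130 than at RD 61, which is the "GXE increases as RD gets smaller" behavior for players above 1500.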
With all that in mind, resetting a player’s RD has the following effects:
- Their Elo rating remains untouched, so their matchmaking is unaffected
  - In the current system, anyway. With Slayer’s proposed Glicko-based matchmaking, players who reset their RD would mostly get matched with other players who had also reset their RD. Players starting an attempt at qualifying for a suspect already frequently get matched with other players doing the same, and while that proposal would cut established bad players out of their matchmaking pool, combining this one with it would further eliminate genuine new players from that pool
- They can earn their ladder ranking and GXE back quickly with consistent play at their level, which I will show further on
I am proposing that, when a suspect test is in progress, a player be allowed to use a command to reset the W/L and RD on their existing account. They would then play games starting at their current matchmaking level, and after a few games (usually around 5), their RD would fall back to 100, restoring them to the ladder rankings and giving them a non-provisional Glicko rating and thus a GXE. While their GXE will likely be lower than it was before the reset, getting through those few games may be enough on its own to qualify a user who already met the requirements with some GXE to spare. As such, leaders issuing suspect tests with GXE as the main requirement may want to set additional requirements. To that end, COIL already functions as a dynamic minimum game requirement and would work very well in this system.
At this point, you may be wondering whether this actually makes qualifying any easier for high-ranking players compared to starting from a new alt if game minimums are required anyway. To answer that, consider a suspect test with a target COIL of 2900 and a B value of 4, both commonly used values today. A player with 85.0% GXE would only need to have played 18 games to hit that COIL target, but reaching 85.0% GXE from a fresh alt in 18 games is practically impossible. By allowing players to start the test with their existing accounts, well-qualified players can be rewarded for their recent skill and activity with a shorter path to qualification that doesn’t have them slogging their way through low and mid ladder. It also gives players closer to the qualification threshold a chance to prove themselves without worrying about an early loss to a top 50 player killing that attempt’s chance at qualifying.
To demonstrate this, I simulated a real OU ladder player with 1800 Elo, 80.0% GXE, and a Glicko score of 1762 ± 27 (changed to 1762 ± 130 for the simulation) playing against a handful of other players within 50 Elo of them. Each game’s result was random, with the chance of winning calculated according to Glickman’s formula for comparing two Glicko scores. (GXE uses this same formula to calculate a player’s chance of beating a 1500 ± 130 player, which represents a brand new player, as an estimate of their chance of beating a random player off the ladder.) Across 1000 simulations of up to 200 battles each, this 80.0% GXE player reached 2800 COIL in 25 games on average and 2900 COIL in 34 games on average, with only 2 of the 1000 simulations ending without them reaching 2900. Compare this to real user sufys’s qualification for the recent PU suspect, where they needed 41 games to reach 2800 COIL with a GXE of 78.9% despite going 35-6.
For OU leaders, note that in these simulations, the player was able to re-achieve a non-provisional GXE of at least 80.0% in 19 games on average, with 75% of attempts making it in 19 or fewer games. Keeping in mind that this player is only barely skilled enough to qualify for an 80.0% GXE requirement, I think this demonstrates that this method of testing qualification is accessible to even the weakest qualified players. For other leaders who would prefer that players still have to play around 40+ games to qualify, using higher B values for your tests can accomplish this, but remember that the goal of this change is to disincentivize smurfing. Raising the B value too high could make starting from one’s main account no quicker than starting from a fresh one.
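To make the COIL arithmetic above easy to check, here is a small sketch using the COIL formula as Smogon publishes it, COIL = 40 × GXE × 2^(-B/N), where N is games played. The helper names are mine. The formula reproduces the COIL column in the simulation logs below (for example, 91.9% GXE after 1 game at B = 4 gives 229.75), and it assumes GXE stays constant, which a real attempt won't quite manage.

```python
import math
from typing import Optional


def coil(gxe_percent: float, games: int, b: float) -> float:
    """COIL for a given GXE (in percent), games played, and B value."""
    if games <= 0:
        return 0.0
    return 40 * gxe_percent * 2 ** (-b / games)


def min_games(gxe_percent: float, target: float, b: float) -> Optional[int]:
    """Fewest games needed to reach `target` COIL at a constant GXE.

    Returns None when the target is unreachable, since COIL can never
    exceed 40 * GXE no matter how many games are played.
    """
    ceiling = 40 * gxe_percent
    if ceiling <= target:
        return None
    # Solve 40 * GXE * 2^(-B/N) >= target for N.
    return math.ceil(b / math.log2(ceiling / target))


print(min_games(85.0, 2900, 4))  # 18, the figure quoted above
print(min_games(85.0, 2900, 9))  # 40, with a hypothetical higher B value
```

The second call shows how a leader who wants roughly 40 games from an 85% GXE player could get there by raising B, at the cost of narrowing the head start an existing account has over a fresh alt.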
If this all sounds too good to be true, you might be asking why resetting RD is even necessary at all. Indeed, resetting W/L is all that’s needed for an existing account to participate in a COIL test. However, consider the case of extremely over-qualified players, like those above 90% GXE. Conceivably, such a player could lose most or even all of their games and still qualify for an Elo+GXE or COIL suspect test after 12 or so games. This is why RD matters. Resetting RD makes Glicko more volatile, so even very highly ranked players could still lose a lot of rating and thus GXE from losing games, forcing them to win at least some games to qualify. To demonstrate, here are some simulations I ran with this code:
Real OU player adriano spadoto without their RD reset, simulated phoning it in against a mock ladder winning only one in 7 games and still qualifying for a 3000 COIL requirement.
Code:
adriano spadoto:
91.5% - 1954 +- 61 after 0 games | NaN COIL
91.9% - 1964 +- 60 after 1 games | W | 229.75 COIL
91.5% - 1954 +- 59 after 2 games | L | 915 COIL
91.1% - 1945 +- 58 after 3 games | L | 1446.1223583430299 COIL
90.7% - 1935 +- 58 after 4 games | L | 1814 COIL
90.3% - 1926 +- 57 after 5 games | L | 2074.549229124645 COIL
89.9% - 1917 +- 56 after 6 games | L | 2265.338047710982 COIL
89.5% - 1908 +- 56 after 7 games | L | 2409.1613448119174 COIL
89.9% - 1917 +- 55 after 8 games | W | 2542.7559851468245 COIL
89.5% - 1909 +- 54 after 9 games | L | 2630.824741173322 COIL
89.1% - 1900 +- 54 after 10 games | L | 2701.0069215215294 COIL
88.7% - 1892 +- 53 after 11 games | L | 2757.516743861611 COIL
88.3% - 1885 +- 52 after 12 games | L | 2803.3502577758404 COIL
87.9% - 1877 +- 52 after 13 games | L | 2840.692681203238 COIL
87.5% - 1869 +- 51 after 14 games | L | 2871.1737460267327 COIL
87.9% - 1877 +- 51 after 15 games | W | 2922.6324428380417 COIL
87.5% - 1869 +- 50 after 16 games | L | 2943.1374533880007 COIL
87.1% - 1862 +- 50 after 17 games | L | 2959.699091054472 COIL
86.7% - 1855 +- 49 after 18 games | L | 2972.9221325344565 COIL
86.2% - 1848 +- 49 after 19 games | L | 2979.837088034976 COIL
85.8% - 1842 +- 49 after 20 games | L | 2987.729533232298 COIL
85.4% - 1835 +- 48 after 21 games | L | 2993.4969171476573 COIL
85.8% - 1841 +- 48 after 22 games | W | 3025.6211871463447 COIL
The same player with the same W/L but this time with their RD reset to 130 before they start
Code:
adriano spadoto:
90.5% - 1954 +- 130 after 0 games | | NaN COIL
92.2% - 1994 +- 123 after 1 games | W | 230.5 COIL
90.9% - 1958 +- 117 after 2 games | L | 909 COIL
89.5% - 1924 +- 111 after 3 games | L | 1420.7239415115387 COIL
88.1% - 1893 +- 107 after 4 games | L | 1762 COIL
89.6% - 1922 +- 102 after 5 games | W | 2058.4674521546867 COIL
88.3% - 1895 +- 99 after 6 games | L | 2225.020574114346 COIL
87.0% - 1870 +- 95 after 7 games | L | 2341.8663351802998 COIL
85.6% - 1846 +- 92 after 8 games | L | 2421.1336187827387 COIL
86.9% - 1868 +- 90 after 9 games | W | 2554.398547574991 COIL
85.7% - 1847 +- 87 after 10 games | L | 2597.938194998822 COIL
84.4% - 1827 +- 85 after 11 games | L | 2623.8378036293116 COIL
83.0% - 1807 +- 83 after 12 games | L | 2635.085746267211 COIL
84.3% - 1825 +- 81 after 13 games | W | 2724.350318833139 COIL
83.0% - 1807 +- 79 after 14 games | L | 2723.513381945358 COIL
81.8% - 1789 +- 78 after 15 games | L | 2719.8103961792017 COIL
80.4% - 1772 +- 76 after 16 games | L | 2704.322871455946 COIL
81.6% - 1786 +- 75 after 17 games | W | 2772.8064963265774 COIL
80.3% - 1770 +- 74 after 18 games | L | 2753.46767292407 COIL
79.0% - 1755 +- 73 after 19 games | L | 2730.9411827698736 COIL
77.6% - 1739 +- 71 after 20 games | L | 2702.188948471169 COIL
78.7% - 1751 +- 70 after 21 games | W | 2758.6441145142935 COIL
77.4% - 1736 +- 69 after 22 games | L | 2729.4065254676816 COIL
As you can see, resetting RD forces even highly skilled players with well-established rankings to at least try to earn a COIL requirement. This should keep things fair to lower-ranked players and engaging for higher-ranked players. There are also the rare unban suspect tests to consider: in gen 8, Zamazenta-Crowned was temporarily unbanned from OU for the duration of its suspect test, so players’ rankings from before that test started genuinely were less accurate to the state of the metagame during the test. It may seem odd to argue for doing something every suspect for the sake of rare outliers (the only unban suspect in an official tier so far this gen was Indeedee-M in NU), but I would argue that their rarity makes it all the more important to do it every time so leaders don’t forget it’s even an option.
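The volatility effect can also be seen in a single Glicko-1 update. Below is a minimal sketch of Glickman's one-game update, not the simulation code I linked; the fixed 1950 ± 50 opponent is a made-up stand-in for a mock ladder, and I'm applying the update after every game, which may not match how PS batches its rating periods.

```python
import math

Q = math.log(10) / 400  # Glicko-1 scaling constant


def g(rd: float) -> float:
    """Glickman's attenuation factor for an opponent's rating deviation."""
    return 1 / math.sqrt(1 + 3 * Q**2 * rd**2 / math.pi**2)


def glicko_update(r, rd, opp_r, opp_rd, score):
    """Apply one game of Glicko-1 (score: 1.0 for a win, 0.0 for a loss)."""
    g_opp = g(opp_rd)
    expected = 1 / (1 + 10 ** (-g_opp * (r - opp_r) / 400))
    # Information gained from this game (1 / d^2 in Glickman's notation).
    info = Q**2 * g_opp**2 * expected * (1 - expected)
    precision = 1 / rd**2 + info
    new_r = r + (Q / precision) * g_opp * (score - expected)
    new_rd = math.sqrt(1 / precision)
    return new_r, new_rd


def rating_after_losses(r, rd, losses, opp=(1950.0, 50.0)):
    """Lose `losses` games in a row against a hypothetical fixed opponent."""
    for _ in range(losses):
        r, rd = glicko_update(r, rd, opp[0], opp[1], 0.0)
    return r, rd


established = rating_after_losses(1954.0, 61.0, 5)
reset = rating_after_losses(1954.0, 130.0, 5)
print(f"established: {established[0]:.0f} +- {established[1]:.0f}")
print(f"reset:       {reset[0]:.0f} +- {reset[1]:.0f}")
```

Because the rating change is scaled by the player's own uncertainty, the account starting at RD 130 falls several times further over the same five losses than the one starting at RD 61, which is why a reset account can't coast on its old rating.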
The proposed new process for players would be as follows:
- A suspect test and its requirements are announced, and the test is scheduled
- Once the suspect is officially underway, players are given the ability to use a button/command to participate in the suspect test with their existing alt
- This button/command resets their RD and W/L and stores the date and time when this happens
  - If Slayer’s concurrent proposal to switch to Glicko-based matchmaking is accepted, their first few (5 or so) games would be almost exclusively against other players with RD>100, which would mostly be other players of their skill level also participating in the suspect or who haven’t played at all in a while
- If/when they have achieved the requirements of the suspect test, if they’ve linked their account to Smogon (which can be done before, during, or after earning reqs), they’ll see their name in www.smogon.com/tools/suspects and be qualified to vote here on the forums when the time comes
- Allowing repeated resets is in keeping with current suspect test standards, where users may try with as many new alts as they wish. However, I think a limit on how frequently a user can reset their RD is ideal, as a user could potentially raise their Glicko faster than intended by repeatedly resetting during a winning streak, or tilt themselves into oblivion by repeatedly resetting during a losing streak, perhaps in an attempt to game the system by starting on a win. Resets could be limited to once per day, could require a certain number of games played since the last reset in this suspect (5-10 seems like a good range for this), or maybe a combination of both. Manual RD resets would not be available outside of suspects to minimize unintended abuse
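As a rough illustration of the reset limits suggested in the last point, here is a sketch of what an eligibility check could look like. Everything in it is hypothetical: the `ResetState` record, the `can_reset` function, and the default limits (one day, 5 games) are placeholders for whatever the PS developers would actually build, and it uses the "combination of both" variant.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import Optional


@dataclass
class ResetState:
    """Hypothetical per-account record of the last RD/W-L reset."""
    last_reset: Optional[datetime] = None
    games_since_reset: int = 0


def can_reset(state: ResetState, now: datetime, suspect_active: bool,
              cooldown: timedelta = timedelta(days=1),
              min_games: int = 5) -> bool:
    """Return True if the account may reset its RD and W/L right now."""
    if not suspect_active:
        return False  # manual resets only exist during a suspect test
    if state.last_reset is None:
        return True   # first reset of this suspect is always allowed
    waited_long_enough = now - state.last_reset >= cooldown
    played_enough = state.games_since_reset >= min_games
    return waited_long_enough and played_enough


def apply_reset(state: ResetState, now: datetime) -> None:
    """Record the reset time and start counting games again."""
    state.last_reset = now
    state.games_since_reset = 0
```

Swapping the final `and` for an `or` would give the looser "either limit" version, and the stored timestamp doubles as the date and time the process above says should be saved on reset.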
Please ask if there’s anything you want me to clarify, and please comment if you think there might be some method of abusing this system I haven’t covered. Thank you.