Proposal: An alternate suspect test procedure that lets users qualify with their main accounts

As has been remarked upon in other threads, Pokemon Showdown has a problem with systemic smurfing incentivized by two main sources: our current suspect test procedure and OLTs. This smurfing creates bad experiences for genuine new and low-ladder players, wildly inconsistent experiences for suspect test participants ranging from tediously easy to brutally difficult, and significant damage to our weighted usage statistics. Slayer95 has proposed a solution to this forum that aims to treat the symptoms of smurfing. That proposal spawned from a discussion on Discord where he also suggested another idea, which I worked with him to flesh out: bypass the creation of new accounts for suspect tests and allow users to qualify with their existing, already-ranked accounts by resetting only the rating deviation of their Glicko score. This aims to treat one of the major causes of smurfing instead. After further discussion, Chains of Markov and I realized that resetting a player’s W/L record (something players can already do themselves) would allow existing accounts to work for COIL suspects as well. Having realized that this idea and the one in the previously linked proposal were compatible but not dependent on one another, Slayer and I agreed they could be posted separately.

Allow me to explain some of the systems already at work. Suspect tests nowadays frequently set a minimum GXE as a requirement, alongside other requirements like a minimum Elo or number of games; see OU’s two most recent suspects of Gliscor and Kyurem. GXE is derived from a user’s Glicko score, which itself has two parts: a rating and a rating deviation, e.g. 1831 ± 37. By Glickman’s design, a player’s rating deviation, or RD for short, decreases as they play games and increases gradually over time. On Pokémon Showdown, a player who is inactive for one year (365 days to be exact) will have their RD increase to the maximum of 130, indicating that their rating is as uncertain as a new player’s. GXE represents the calculated odds a given player has of beating a new player with the default Glicko score, and it serves as an accurate estimation of their odds of beating the average player on the ladder. It increases as a player’s Glicko rating increases, and if their rating is over 1500, it also increases as their RD gets smaller. Additionally, players with over 100 RD are hidden from PS’s ladder rankings.
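As a rough sketch of the GXE calculation described above (not PS's actual code), the following applies Glickman's expected-score formula against a default 1500 ± 130 player. The function names are made up for illustration, and rounding to one decimal place is an assumption. With these assumptions it reproduces the 91.5% and 90.5% figures that appear in the simulation output later in this thread.

```javascript
// Sketch only: GXE as the Glicko-1 expected score against a 1500 +- 130 player.
const Q = Math.LN10 / 400;

// Glickman's attenuation factor for a given rating deviation.
function g(rd) {
  return 1 / Math.sqrt(1 + (3 * Q * Q * rd * rd) / (Math.PI * Math.PI));
}

// Expected score of player A against player B, combining both deviations.
function expectedScore(ratingA, rdA, ratingB, rdB) {
  const combinedRd = Math.sqrt(rdA * rdA + rdB * rdB);
  return 1 / (1 + Math.pow(10, (-g(combinedRd) * (ratingA - ratingB)) / 400));
}

// GXE as a percentage rounded to one decimal place (rounding is an assumption).
function gxe(rating, rd) {
  return Math.round(expectedScore(rating, rd, 1500, 130) * 1000) / 10;
}

console.log(gxe(1954, 61)); // 91.5
console.log(gxe(1954, 130)); // 90.5
```

Note how the same rating yields a lower GXE at RD 130 than at RD 61, which is the "GXE increases as RD gets smaller" effect for ratings above 1500.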

With all that in mind, resetting a player’s RD has the following effects:
  • Their Elo rating remains untouched, so their matchmaking is unaffected
    • In the current system, anyway. With Slayer’s proposed Glicko-based matchmaking, players who reset their RD would mostly get matched with other players who had also reset their RD. Players starting an attempt at qualifying for a suspect already frequently get matched with other players doing the same, and while that proposal would cut established bad players out of their matchmaking pool, combining this one with it would further remove genuine new players from that pool
    • They can earn it back quickly with consistent play at their level, which I will show further on
COIL is a function of GXE and games played. The higher your GXE and the more games you’ve played, the higher your COIL will be. When calculating COIL, PS simply adds up a player’s W/L record (the one on the stats page that players can already reset for themselves manually) to get the total number of games played. COIL is customizable, and the tier leader setting the requirement can define a B value that changes how it is calculated, with higher B values requiring more games to qualify in general and lower B values increasing the minimum GXE needed to qualify at all. Modern suspect tests usually use a B value of 3 or 4. Unlike with RD, resetting an existing player’s W/L at the start of an attempt at qualifying for a suspect test has essentially no effect on anything. Indeed, for as long as I can remember, players have already been allowed to freely reset it.
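For reference, here is a minimal sketch of COIL as a function of GXE, games played, and B. The post does not spell out the formula; the one used here, COIL = 40 × GXE × 2^(−B/N), is the commonly cited definition and matches the simulation output later in this thread with B = 4. The function name is hypothetical.

```javascript
// Sketch only: COIL from GXE (as a percentage), games played, and the B value.
function coil(gxePercent, games, b) {
  return 40 * gxePercent * Math.pow(2, -b / games);
}

console.log(coil(91.5, 2, 4)); // 915
console.log(coil(90.7, 4, 4));
console.log(coil(90.7, 4, 7)); // a higher B gives a lower COIL for the same record
```

Because the 2^(−B/N) factor approaches 1 as games increase, COIL approaches 40 × GXE from below, which is why more games and a higher GXE both push it up.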

I am proposing that, when a suspect test is in progress, a player be allowed to use a command to reset the W/L and RD on their existing account. They would then play games starting at their current matchmaking level, and after playing a few games (usually around 5), their RD would reach 100 again, restoring them to the ladder ranking and giving them a non-provisional Glicko rating and thus a GXE. While their GXE will likely be lower than it was before the reset, this alone may be enough to qualify a user who met the qualifications already with some GXE to spare. As such, leaders issuing suspect tests with GXE as the main requirement may want to set additional requirements. To that end, COIL already functions as a dynamic minimum game requirement and would work very well in this system.
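To illustrate how quickly RD recovers after a reset, here is a sketch of the Glicko-1 RD update applied once per game. Several things here are assumptions rather than PS's actual behavior: the update runs after every game, every opponent has the same rating as the player and a fixed RD of 50, and the time-based RD increase and the RD floor are ignored. The function names are hypothetical.

```javascript
// Sketch only: how many games it takes for RD to shrink to a target value.
const Q = Math.LN10 / 400;

function g(rd) {
  return 1 / Math.sqrt(1 + (3 * Q * Q * rd * rd) / (Math.PI * Math.PI));
}

// RD after one game against a single opponent (Glicko-1, one-game rating period).
function nextRd(rd, rating, oppRating, oppRd) {
  const gOpp = g(oppRd);
  const e = 1 / (1 + Math.pow(10, (-gOpp * (rating - oppRating)) / 400));
  const invDSquared = Q * Q * gOpp * gOpp * e * (1 - e);
  return 1 / Math.sqrt(1 / (rd * rd) + invDSquared);
}

// Count games until RD is at or below targetRd, assuming evenly matched opponents.
function gamesToReachRd(startRd, targetRd, oppRd, maxGames = 1000) {
  let rd = startRd;
  let games = 0;
  while (rd > targetRd && games < maxGames) {
    rd = nextRd(rd, 1500, 1500, oppRd);
    games++;
  }
  return games;
}

console.log(gamesToReachRd(130, 100, 50));
console.log(gamesToReachRd(130, 40, 50));
```

Under these assumptions, getting from 130 back to 100 takes only a handful of games, while each further step down takes progressively longer.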

At this point, you may be wondering if this actually makes qualifying any easier for high ranking players compared to starting from a new alt if game minimums are required anyway. To answer that, consider a suspect test with a target COIL of 2900 and a B value of 4, both commonly used values today. A player with 85.0% GXE would only need to have played 18 games to hit that COIL target, but reaching 85.0% GXE from a fresh alt in 18 games is practically impossible. By allowing players to start the test with their existing accounts, well-qualified players can be rewarded for their recent skill and activity with a shorter path to qualification that doesn’t have them slogging their way through low and mid ladder. It also provides players closer to the qualification threshold a chance to prove themselves without worrying about an early loss to a top 50 player killing that attempt’s chance at qualifying. To demonstrate this, I simulated a real 1800 Elo 80.0% GXE player with 1762 ± 27 (changed to 1762 ± 130 for the simulation) Glicko from the OU ladder playing against a handful of other players within 50 Elo of them, with a random chance of winning each time calculated according to Glickman’s formula for comparing two Glicko scores (GXE uses this formula to calculate players’ chances at beating a 1500 ± 130 player, which represents a brand new player, to estimate their chance at beating a random player off the ladder). Across 1000 simulations of up to 200 battles each, this 80.0% GXE player was able to reach 2800 COIL in 25 games on average and 2900 COIL in 34 games on average, with only 2 of the 1000 simulations ending with them unable to eventually reach 2900. Compare this to real user sufys’s qualification for the recent PU suspect, where they needed 41 games to reach 2800 COIL with a GXE of 78.9% despite going 35 and 6:
[Attached image: 1733633273717.png]
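The "85.0% GXE needs 18 games for 2900 COIL at B = 4" figure above can be checked with a short search. This is a sketch that assumes the commonly cited COIL formula, 40 × GXE × 2^(−B/N), and holds GXE fixed; in practice GXE moves with every game. The function names are hypothetical.

```javascript
// Sketch only: fewest games needed to hit a COIL target at a fixed GXE.
function coil(gxePercent, games, b) {
  return 40 * gxePercent * Math.pow(2, -b / games);
}

function gamesNeeded(gxePercent, targetCoil, b, maxGames = 10000) {
  // COIL approaches 40 * GXE from below, so lower GXEs can never reach the target.
  if (40 * gxePercent <= targetCoil) return null;
  for (let n = 1; n <= maxGames; n++) {
    if (coil(gxePercent, n, b) >= targetCoil) return n;
  }
  return null;
}

console.log(gamesNeeded(85.0, 2900, 4)); // 18
console.log(gamesNeeded(72.0, 2900, 4)); // null
```

The `null` case shows why a COIL target also acts as a hard GXE floor: at 2900 COIL, no number of games is enough at or below 72.5% GXE.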

For OU leaders, note that in these simulations, the player was able to re-achieve a non-provisional GXE of at least 80.0% in 19 games on average, with 75% of attempts making it in 19 or fewer games. Keeping in mind that this player is only barely skilled enough to qualify for an 80.0% GXE requirement, I think this demonstrates that this method of testing qualification is accessible to even the weakest qualified players. For other leaders who would prefer players still have to play around 40+ games to qualify, using higher B values for your tests can accomplish this, but remember that the goal of this change is to disincentivize smurfing. Raise the B value too high and starting from one’s main account may no longer be any quicker than starting from a fresh one.

If this all sounds too good to be true, you might be asking why resetting RD is even necessary at all. Indeed, resetting W/L is all that’s needed for an existing account to participate in a COIL test. However, consider the case of extremely over-qualified players, like those above 90 GXE. Conceivably, a high GXE player could lose most or even all of their games and still qualify for an Elo+GXE or COIL suspect test after 12 or so games. This is why RD matters. Resetting RD makes Glicko more volatile, so even very highly ranked players could still lose a lot of rating and thus GXE from losing games, forcing them to win at least some games to qualify. To demonstrate, here are some simulations I ran with this code:

Real OU player adriano spadoto without their RD reset, simulated phoning it in against a mock ladder winning only one in 7 games and still qualifying for a 3000 COIL requirement.
Code:
adriano spadoto:
91.5% - 1954 +- 61 after 0 games |   | NaN COIL
91.9% - 1964 +- 60 after 1 games | W | 229.75 COIL
91.5% - 1954 +- 59 after 2 games | L | 915 COIL
91.1% - 1945 +- 58 after 3 games | L | 1446.1223583430299 COIL
90.7% - 1935 +- 58 after 4 games | L | 1814 COIL
90.3% - 1926 +- 57 after 5 games | L | 2074.549229124645 COIL
89.9% - 1917 +- 56 after 6 games | L | 2265.338047710982 COIL
89.5% - 1908 +- 56 after 7 games | L | 2409.1613448119174 COIL
89.9% - 1917 +- 55 after 8 games | W | 2542.7559851468245 COIL
89.5% - 1909 +- 54 after 9 games | L | 2630.824741173322 COIL
89.1% - 1900 +- 54 after 10 games | L | 2701.0069215215294 COIL
88.7% - 1892 +- 53 after 11 games | L | 2757.516743861611 COIL
88.3% - 1885 +- 52 after 12 games | L | 2803.3502577758404 COIL
87.9% - 1877 +- 52 after 13 games | L | 2840.692681203238 COIL
87.5% - 1869 +- 51 after 14 games | L | 2871.1737460267327 COIL
87.9% - 1877 +- 51 after 15 games | W | 2922.6324428380417 COIL
87.5% - 1869 +- 50 after 16 games | L | 2943.1374533880007 COIL
87.1% - 1862 +- 50 after 17 games | L | 2959.699091054472 COIL
86.7% - 1855 +- 49 after 18 games | L | 2972.9221325344565 COIL
86.2% - 1848 +- 49 after 19 games | L | 2979.837088034976 COIL
85.8% - 1842 +- 49 after 20 games | L | 2987.729533232298 COIL
85.4% - 1835 +- 48 after 21 games | L | 2993.4969171476573 COIL
85.8% - 1841 +- 48 after 22 games | W | 3025.6211871463447 COIL
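The same player, this time with their RD reset to 130 before they start. In this run they win one game in 4 rather than one in 7, and they still fall short of the 3000 COIL requirement after 22 games: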
Code:
adriano spadoto:
90.5% - 1954 +- 130 after 0 games |   | NaN COIL
92.2% - 1994 +- 123 after 1 games | W | 230.5 COIL
90.9% - 1958 +- 117 after 2 games | L | 909 COIL
89.5% - 1924 +- 111 after 3 games | L | 1420.7239415115387 COIL
88.1% - 1893 +- 107 after 4 games | L | 1762 COIL
89.6% - 1922 +- 102 after 5 games | W | 2058.4674521546867 COIL
88.3% - 1895 +- 99 after 6 games | L | 2225.020574114346 COIL
87.0% - 1870 +- 95 after 7 games | L | 2341.8663351802998 COIL
85.6% - 1846 +- 92 after 8 games | L | 2421.1336187827387 COIL
86.9% - 1868 +- 90 after 9 games | W | 2554.398547574991 COIL
85.7% - 1847 +- 87 after 10 games | L | 2597.938194998822 COIL
84.4% - 1827 +- 85 after 11 games | L | 2623.8378036293116 COIL
83.0% - 1807 +- 83 after 12 games | L | 2635.085746267211 COIL
84.3% - 1825 +- 81 after 13 games | W | 2724.350318833139 COIL
83.0% - 1807 +- 79 after 14 games | L | 2723.513381945358 COIL
81.8% - 1789 +- 78 after 15 games | L | 2719.8103961792017 COIL
80.4% - 1772 +- 76 after 16 games | L | 2704.322871455946 COIL
81.6% - 1786 +- 75 after 17 games | W | 2772.8064963265774 COIL
80.3% - 1770 +- 74 after 18 games | L | 2753.46767292407 COIL
79.0% - 1755 +- 73 after 19 games | L | 2730.9411827698736 COIL
77.6% - 1739 +- 71 after 20 games | L | 2702.188948471169 COIL
78.7% - 1751 +- 70 after 21 games | W | 2758.6441145142935 COIL
77.4% - 1736 +- 69 after 22 games | L | 2729.4065254676816 COIL

As you can see, resetting RD forces even highly skilled players with well-established rankings to at least try to earn a COIL requirement. This should keep things fair to lower-ranked players and engaging for higher-ranked players. There are also the rare unban suspect tests to consider: in gen 8, Zamazenta-Crowned was temporarily unbanned from OU for the duration of its suspect test, so players’ rankings from before that test started genuinely were less accurate to the state of the metagame during the test. It may seem odd to argue for doing something every suspect for the sake of rare outliers (the only unban suspect in an official tier so far this gen was Indeedee-M in NU), but I would argue that their rarity makes it all the more important to do it every time so leaders don’t forget it’s even an option.
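The extra volatility after a reset comes from the size of each individual rating update. Here is a sketch of the Glicko-1 rating change for a single game, comparing the same loss at RD 61 and at RD 130. The opponent (equal rating, RD 50) and the one-game update period are assumptions made for illustration, and the function names are hypothetical.

```javascript
// Sketch only: rating change from one game under Glicko-1.
const Q = Math.LN10 / 400;

function g(rd) {
  return 1 / Math.sqrt(1 + (3 * Q * Q * rd * rd) / (Math.PI * Math.PI));
}

// score is 1 for a win and 0 for a loss.
function ratingDelta(rating, rd, oppRating, oppRd, score) {
  const gOpp = g(oppRd);
  const e = 1 / (1 + Math.pow(10, (-gOpp * (rating - oppRating)) / 400));
  const invDSquared = Q * Q * gOpp * gOpp * e * (1 - e);
  return (Q / (1 / (rd * rd) + invDSquared)) * gOpp * (score - e);
}

// The same loss, before and after an RD reset.
const establishedLoss = ratingDelta(1954, 61, 1954, 50, 0);
const resetLoss = ratingDelta(1954, 130, 1954, 50, 0);
console.log(establishedLoss.toFixed(1), resetLoss.toFixed(1));
```

Under these assumptions a loss at RD 130 costs several times as many points as the same loss at RD 61, which is roughly the difference visible in the first few games of the two simulated runs.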

The proposed new process for players would be as follows:
  • A suspect test and its requirements are announced, and the test is scheduled
  • Once the suspect is officially underway, players are given the ability to use a button/command to participate in the suspect test with their existing alt
  • This button/command resets their RD and W/L and stores the date and time when this happens
    • If Slayer’s concurrent proposal to switch to Glicko-based matchmaking is accepted, their first few (5 or so) games would be almost exclusively against other players with RD>100, which would mostly be other players of their skill level also participating in the suspect or who haven’t played at all in a while
  • If/when they have achieved the requirements of the suspect test, if they’ve linked their account to Smogon (which can be done before, during, or after earning reqs), they’ll see their name in www.smogon.com/tools/suspects and be qualified to vote here on the forums when the time comes
    • This is in keeping with current suspect test standards where users may try with as many new alts as they wish. However, I think a limit to how frequently a user can reset their RD is ideal, as a user could potentially raise their Glicko faster than intended by repeatedly resetting during a winning streak, or tilt themselves into oblivion by repeatedly resetting during a losing streak, perhaps in an attempt to game the system by starting on a win. Resets could be limited to once per day, could require a certain number of games played since the last reset in the current suspect (5-10 seems like a good range for this), or maybe a combination of both. Manual RD resets would not be available outside of suspects to minimize unintended abuse
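One way the reset limit could look is sketched below. It implements the "combination of both" option: a one-day cooldown plus a minimum number of games since the last reset. The state shape, field names, and defaults are all hypothetical and not part of PS.

```javascript
// Sketch only: throttle for the suspect-test reset command.
const DAY_MS = 24 * 60 * 60 * 1000;

// state.lastResetTime: ms timestamp of the last reset in this suspect, or null.
// state.gamesSinceReset: ranked games played since that reset.
function canReset(state, now, { minGames = 5, cooldownMs = DAY_MS } = {}) {
  if (state.lastResetTime === null) return true; // first opt-in is always allowed
  const waitedLongEnough = now - state.lastResetTime >= cooldownMs;
  const playedEnough = state.gamesSinceReset >= minGames;
  return waitedLongEnough && playedEnough;
}

const state = { lastResetTime: Date.now() - 2 * DAY_MS, gamesSinceReset: 3 };
console.log(canReset(state, Date.now())); // false: only 3 games since the last reset
```

Dropping either condition (for example by passing `minGames: 0`) gives the once-per-day or games-only variants.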
From a technical standpoint, this seems easy enough to implement, as I’ve been told by Mia and Chaos. At this point, I should mention that this proposal does not aim to prevent users from making a new alt to qualify for a suspect test, but I hope I have sufficiently demonstrated that for players who already ladder, this new procedure will be an easy and fast enough method of demonstrating their qualifications to make it the preferred option over smurfing.

Please ask if there’s anything you want me to clarify, and please comment if you think there might be some method of abusing this system I haven’t covered. Thank you.
 
I think, as long as this method is "safe" to use, it's fine to implement, regardless of whether players would prefer to use it. We can then see if it makes it easier in practice and iterate from there.

The question of "safe" seems to be: when a user goes from the initial RD to a precise RD, is the confidence in the ELO/GXE at the end independent of their ELO/GXE at the start? "Confidence" probably needs to be another term given a more precise definition.

Resetting the W/L, if visible on /rank, might be unattractive to players. This could be avoided in implementation by storing the # of games at reset time, and then calculating COIL with total games - stored games.
 
That’s a good point about W/L. We can’t just store # of games at reset time, though. PS doesn’t store a player’s total number of games anywhere. Instead, it stores number of wins, number of losses, and number of ties, and adds those numbers together whenever a total number of games is needed. To make what you’re describing work, we would need to archive the current W/L/T record and then add it to the new ongoing totals whenever we want to display their lifetime total. Their current W/L/T would then be used to calculate COIL; since calculating the lifetime total would require adding both totals together, there’s no need for any subtraction. I think that archived record could still fit into one row of the database, though, since we would never need to read or write those archived win and loss records independently of one another. If a player participates in multiple suspects in the same format with the same alt, or resets their progress in a suspect they were already participating in, you would combine their current W/L/T into their archived W/L/T all at once.
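The archive-and-add approach could be sketched as follows. The record shape and function names are hypothetical, and this says nothing about how PS actually lays out its ladder table.

```javascript
// Sketch only: keeping an archived W/L/T record next to the current one.
function emptyRecord() {
  return { wins: 0, losses: 0, ties: 0 };
}

function totalGames(record) {
  return record.wins + record.losses + record.ties;
}

// On opt-in (or re-opt-in), fold the current record into the archive all at once.
function optIn(entry) {
  return {
    archived: {
      wins: entry.archived.wins + entry.current.wins,
      losses: entry.archived.losses + entry.current.losses,
      ties: entry.archived.ties + entry.current.ties,
    },
    current: emptyRecord(),
  };
}

// Lifetime display adds both records; COIL only looks at the current one.
const lifetimeGames = (entry) => totalGames(entry.archived) + totalGames(entry.current);
const coilGames = (entry) => totalGames(entry.current);

let entry = { archived: emptyRecord(), current: { wins: 120, losses: 80, ties: 1 } };
entry = optIn(entry);
entry.current.wins += 3;
console.log(lifetimeGames(entry), coilGames(entry)); // 204 3
```

No subtraction is needed anywhere: the displayed lifetime total is a sum, and the COIL game count is read directly from the current record.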
However, there are some problems with this. Obscuring how a player’s W/Ls are tracked this way does make the freely accessible “Reset W/L” button even more dangerous. Players can already potentially erase their progress in a COIL test by clicking it, and if players can’t see that their W/L record is being tracked separately for the suspect test, they might not realize that clicking that button will erase that record, too. Players also sometimes use 3rd party calculator tools to judge how many more games they will need to qualify for a suspect test with their current progress. Not being able to see the number of games being used to calculate their COIL could impede their use of these calculators, but number of games can be easily calculated given GXE and COIL, so this is easily worked around. Still, I’d imagine some players would like to be able to see how many games they’ve played during the current suspect without having to do any reverse engineering, so perhaps that information could be displayed on Smogon’s suspect tool once they’ve qualified.
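Recovering the game count from a displayed GXE and COIL is a one-line inversion, assuming the commonly cited formula COIL = 40 × GXE × 2^(−B/N) and a known B value. The function name is hypothetical, and because displayed GXE is rounded, the result is rounded to the nearest whole game.

```javascript
// Sketch only: solve COIL = 40 * GXE * 2^(-B/N) for N.
function gamesFromCoil(gxePercent, coilValue, b) {
  return Math.round(b / Math.log2((40 * gxePercent) / coilValue));
}

// Values taken from the simulation output earlier in the thread (B = 4).
console.log(gamesFromCoil(91.5, 915, 4)); // 2
console.log(gamesFromCoil(89.5, 2409.1613448119174, 4)); // 7
```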

I’m not really sure what other word to use besides “confidence”. “Confidence interval” is a commonly used term in statistics and probability, and while I think RD is technically not a confidence interval in the strictest sense, it functions very similarly to one. It represents the range within which we are confident the given player’s true skill lies; the narrower the range, the more confident we are that the base rating is close to being correct.

Elo is essentially independent from Glicko. Unlike Glicko, the Elo system doesn’t have any kind of confidence value. However, Pokemon Showdown uses whether a player’s Glicko RD is above 100 to determine whether to show that player on the Elo-sorted leaderboards, so you could say that’s our confidence value for Elo. That said, a player who had <101 RD before resetting only needs to play 5 games to get it back down to 100, effectively restoring Pokémon Showdown’s full confidence in their Elo. As for Glicko, if a player has 70 RD before they reset, it should only take about 20 games to get back to where it was, meaning PS will be just as confident in the accuracy of their Glicko rating at that point as it was before. However, the smaller their RD was before resetting, the longer it will take. The minimum RD on Pokémon Showdown is 25, which takes quite a while to reach from 130. 50 RD is reachable in 40 games, but 40 RD takes around 70 games, 30 RD takes 125, and for a player to reach the minimum 25 RD, representing the highest possible confidence in their Glicko rating, they would need upwards of 180 games, not counting the daily loss of confidence built into the Glicko system that affects all players, regardless of activity. If this, like the W/L reset, is undesirable, we could preserve their original Glicko by storing the RD 130 Glicko used for the suspect in a separate row of the database. These two scores could then be tracked and updated in parallel as the user ladders, with the suspect test Glicko being the one displayed while the test is ongoing and the one used to calculate COIL. Then, when the testing period is over, PS would switch to displaying their true Glicko rating and GXE, which would be no different than if they had played all the same matches with the same results without having opted into the suspect test. In this way, any damage to longstanding records is completely nullified.
This is a really, really good idea, if I do say so myself, so consider it part of my proposal going forward.
 
The W/L reset button was probably added when COIL was discontinued. I wonder if we should still have that option, or if at least it should disqualify the alt in the current suspect system.

Maintaining two sets of scores sounds complicated but is probably possible.
 
Here’s everything I think maintaining two sets of scores would entail:
When a player uses the command/button to opt into a suspect test, their current Glicko rating is copied into a duplicate entry in the table with the RD set to 130 (and the current date and time are stored, as well as the player’s current total W/L/T). We would add a function to check if a player is participating in the current suspect for a given format, which could look something like this:
Code:
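// Hypothetical sketch: returns the suspect-test rating if the user opted in
// after the format's current suspect test started, otherwise null.
class User {
   getSuspectRating(format) {
      if (format.suspect?.ongoing && this.lastOptInTime > format.suspect.startTime) {
         return this.suspectRating;
      }
      return null;
   }
}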
When the player finishes a ranked match, both ratings get updated, which I think should be as simple as duplicating the single block of code that handles updating a given player’s Glicko, using the above function. If they have two ratings, a player’s normal rating would always be used to calculate how their opponent’s rating would change. Code that gets a player’s GXE would do something like user.getSuspectRating(format) || user.glicko, unless it’s specifically for a suspect, in which case it would do something else if the player isn’t participating in the suspect. Code that calculates a player’s COIL could just always use the suspect rating, since with the automation suspect tests have now there’s no actual need to display a user’s COIL for a format that doesn’t have an active suspect.

I think that’s all? There’s no actual need to clear the suspect rating entry when the suspect ends since it wouldn’t be used for anything at that point.
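To make the parallel update concrete, here is a sketch of what updating both ratings after a match might look like. It uses a single-game Glicko-1 update, which may not match how PS batches its rating periods, and all of the names (`glickoUpdate`, `applyResult`, `suspectRating`) are hypothetical. Using the opponent's normal rating as the input for the suspect rating update is an assumption; the post only specifies it for the opponent's own rating change.

```javascript
// Sketch only: updating a normal and a suspect-test rating in parallel.
const Q = Math.LN10 / 400;

function g(rd) {
  return 1 / Math.sqrt(1 + (3 * Q * Q * rd * rd) / (Math.PI * Math.PI));
}

// Single-game Glicko-1 update; returns a new { rating, rd } without mutating inputs.
function glickoUpdate(player, opponent, score) {
  const gOpp = g(opponent.rd);
  const e = 1 / (1 + Math.pow(10, (-gOpp * (player.rating - opponent.rating)) / 400));
  const precision = 1 / (player.rd * player.rd) + Q * Q * gOpp * gOpp * e * (1 - e);
  return {
    rating: player.rating + (Q / precision) * gOpp * (score - e),
    rd: Math.sqrt(1 / precision),
  };
}

// user.suspectRating is null unless the user opted into the current suspect test.
function applyResult(user, opponent, score) {
  // The opponent's normal rating is the input for both updates.
  user.glicko = glickoUpdate(user.glicko, opponent.glicko, score);
  if (user.suspectRating) {
    user.suspectRating = glickoUpdate(user.suspectRating, opponent.glicko, score);
  }
}

const user = {
  glicko: { rating: 1800, rd: 40 },
  suspectRating: { rating: 1800, rd: 130 },
};
const opponent = { glicko: { rating: 1800, rd: 50 }, suspectRating: null };
applyResult(user, opponent, 1); // user wins
console.log(user.glicko, user.suspectRating);
```

A real implementation would need to compute both players' updates from their pre-match ratings before writing either one back; this sketch only updates one side.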
 
After some input from Slayer95, I’d like to revise some implementation details. First, he suggested that it would be preferable to use the extra database column needed to track the number of suspect test games for simply a running total of games played during the test, rather than the total of games played before the suspect. He said that it would be nice to have that information available after the testing period so leaders can take it into consideration when deciding what reqs they want to set in the future. I agree that storing the games count directly like this is a better way of accomplishing that and is otherwise not meaningfully worse than the previously suggested approach.
Second, he suggested that rather than store the date and time of the most recent suspect-test-opt-in, we only need to store a unique ID for the suspect test to know if a player’s suspect test data is relevant to the current test. This would certainly be lighter on storage space in the database; it’s hard to imagine any single format in any particular generation holding over 255 suspect tests in its lifetime, so a suspect test ID could easily fit into 1 or 2 bytes. This is much smaller than the big chunk of space a time code would need, but it would prevent us from throttling reset attempts on a timer, so a minimum games requirement would need to be used in that case. On that note, I do want to clarify that accounts whose first ever battle happened after the start of the suspect test would still be eligible to qualify under this system; they would not need a suspect test ID associated with them if they are sufficiently fresh.
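With a suspect test ID in place of the opt-in timestamp, the participation check could be sketched like this. The field names (`suspectId`, `suspect.id`) are hypothetical, and the function is written standalone rather than as a method for brevity.

```javascript
// Sketch only: ID-based variant of the suspect rating lookup.
function getSuspectRating(user, format) {
  const suspect = format.suspect;
  // The stored ID must match the format's current, ongoing suspect test.
  if (suspect?.ongoing && user.suspectId === suspect.id) {
    return user.suspectRating;
  }
  return null;
}

const format = { suspect: { id: 12, ongoing: true } };
const participant = { suspectId: 12, suspectRating: { rating: 1762, rd: 130 } };
const stale = { suspectId: 11, suspectRating: { rating: 1700, rd: 90 } };
console.log(getSuspectRating(participant, format)); // { rating: 1762, rd: 130 }
console.log(getSuspectRating(stale, format)); // null
```

Data left over from an earlier test is ignored automatically because its ID no longer matches, so nothing needs to be cleared when a test ends.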

On the topic of storage space, Slayer also brought to my attention that PS is apparently storing Glicko in 16 bytes per player per format, using 8 bytes each for rating and rating deviation. Glicko rating realistically falls within the range of 1000-2000 on PS, so it’s inconceivable that any player’s would ever exceed the bounds of a 2-byte signed integer, and rating deviation is strictly constrained to a range of 25-130, which easily fits within a single byte. If it’s feasible to resize those columns of our database to free up the 13 bytes of wasted space, we could easily fit the other 6 bytes needed for the extra 3 bytes of secondary, suspect-test-specific Glicko score, 2 bytes of games played during the current suspect (a player would need to play 1 battle per minute every minute of every hour of every day for over 6 and a half weeks to overflow an unsigned short), and 1 byte needed for a suspect ID.

Edit: The limit for battles is 4 battles per minute, so a bot could theoretically play 65,536 battles in only a week and a half of spamming at the absolute limit, which is short enough to fit within a suspect testing period. Bump that one up to 4 bytes if that’s a concern.
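The overflow estimates above are simple arithmetic and can be checked with a few lines. This is only a sketch; the function name is made up, and the rates are the 1 and 4 battles per minute mentioned in the posts.

```javascript
// Sketch only: days of nonstop play needed to overflow an unsigned game counter.
function daysToOverflow(bits, battlesPerMinute) {
  const capacity = 2 ** bits; // number of distinct values the counter can hold
  const minutes = capacity / battlesPerMinute;
  return minutes / (60 * 24);
}

console.log(daysToOverflow(16, 1).toFixed(1)); // "45.5", about six and a half weeks
console.log(daysToOverflow(16, 4).toFixed(1)); // "11.4", about a week and a half
console.log(daysToOverflow(32, 4) / 365); // a 4-byte counter lasts on the order of 2000 years
```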

Edit 2: I’ve since realized that a proper implementation of Glicko-1 is meant to have decimal precision for RD, while R is still meant to be rounded to the nearest whole. Therefore, we need at least 4 bytes to store each RD value as a float, and it may be simpler to leave the existing ones as 8 byte doubles. I think 4 bytes is plenty of precision for a value that is strictly confined to the range of [25.0, 130.0], especially since the value is only ever presented to users as a whole number.
 
I don't think policy review is the best venue to save singular bytes in a database schema :p

Regardless, as the sort-of-co-sponsor of this proposal, I'm beginning to seriously doubt whether resetting deviation is desirable. On a practical level, resetting everyone's deviation (and thus among other things removing them from the leaderboard) seems a harsh measure, and tracking multiple ratings for every player to switch to post-suspect seems a nightmare. Either of these would lead to a plethora of not-unreasonable complaints from suspect test participants.

On a theoretical level, the point is to have an (at least somewhat) accurate assessment of a player's skill. But that's the purpose of Glicko/GXE both in general and in the context of suspect tests, so why would they require different rating systems? It seems better to simply take the more accurate rating and use it for both purposes.

Then it comes down to which is the more accurate rating, and I think that's clearly the one that's not been reset. If someone has very high Glicko with low deviation, that means that they recently displayed exceptional skill on the ladder. To reset this when the suspect test starts is to pretend like there is almost no transferable skill between laddering and laddering a couple days later. Even in the very rare case of a suspect test that includes a changed element (such as unbanning the suspect mon on ladder during the test), we surely agree that the best ladder players will still be good. In the much more common scenario, literally nothing has changed except a suspect test having been started. We want these players to play games with the suspect in mind to qualify, of course, but note that COIL still demands games played.

And resetting can not only be harmful to good players, it can also unjustly reward players who should not get requirements. Resetting deviation while starting from a medium-high GXE can allow a medium-skilled player who shouldn't get reqs to balloon their rating in a few lucky games. My main point, I suppose, is that resetting deviation is not risk-free, and in my opinion also not needed.

To take the example of adriano spadoto, I would argue that them getting reqs despite losing most of their games is not too bad, as they've shown their competence already. Indeed, they only "lose most of their games" if you refuse to look at their older games. With such a high GXE (lower now, it seems like they lost since I started typing this) they must have won a vast majority of games leading up to the suspect test. But even if you disagree with this viewpoint, I can assuage some concerns. Winning 1 in 7 games, they would absolutely not get reqs. See, the simulation relies on them getting matched with people with roughly the same Glicko and deviation, but those people don't exist. If you're the highest Glicko around, you'll only battle against people with much lower scores, and thus lose many more points on a loss, which their drop to 89.5% during the writing of this post seems to confirm. Taking the current highest GXE on the OU ladder, "cadaei b fece" is the only one with GXE>90% and has 93.1% with 2001 ± 74 (very impressive!). The two players around them on the ladder, who would be the most likely to face them, have 1818 and 1777 Glicko respectively. Losing to them would cause a significant Glicko loss, thus requiring even this player to perform somewhat well to qualify.

Thus, I find myself back at my original proposal in devcord: simply using COIL on unrestricted accounts, with the COIL games counter starting when a player joins the suspect test. This seems easiest to implement, easiest to understand, and mathematically the most correct. It also prevents suspect tests being used to manipulate the ladder in any way, as they will not affect the normal mechanisms, which is ideal. For players who really don't want to play on their main in a suspect test, they can have a suspect alt, or they can still start fresh if preferred.
 
I don't think policy review is the best venue to save singular bytes in a database schema :p

Regardless, as the sort-of-co-sponsor of this proposal, I'm beginning to seriously doubt whether resetting deviation is desirable. On a practical level, resetting everyone's deviation (and thus among other things removing them from the leaderboard) seems a harsh measure, and tracking multiple ratings for every player to switch to post-suspect seems a nightmare. Either of these would lead to a plethora of not-unreasonable complaints from suspect test participants.

On a theoretical level, the point is to have an (at least somewhat) accurate assessment of a player's skill. But that's the purpose of Glicko/GXE both in general and in the context of suspect tests, so why would they require different rating systems? It seems better to simply take the more accurate rating and use it for both purposes.

Then it comes down to which is the more accurate rating, and I think that's clearly the one that's not been reset. If someone has very high Glicko with low deviation, that means that they recently displayed exceptional skill on the ladder. To reset this when the suspect test starts is to pretend like there is almost no transferable skill between laddering and laddering a couple days later. Even in the very rare case of a suspect test that includes a changed element (such as unbanning the suspect mon on ladder during the test), we surely agree that the best ladder players will still be good. In the much more common scenario, literally nothing has changed except a suspect test having been started. We want these players to play games with the suspect in mind to qualify, of course, but note that COIL still demands games played.

And resetting can not just be harmful to good players, it can also unjustly reward players who should not get requirements. Resetting deviation while starting from a medium-high GXE can allow a medium-skilled player who shouldn't get reqs to balloon their rating in a few lucky games. My main point, I suppose, is that resetting deviation is not risk-free, and in my opinion also not needed.

To take the example of adriano spadato, I would argue that them getting reqs despite losing most of their games is not too bad, as they've shown their competence already. Indeed, they only "lose most of their games" if you refuse to look at their older games. With such a high GXE (lower now, it seems like they lost since I started typing this) they must have won a vast majority of games leading up to the suspect test. But even if you disagree with this viewpoint, I can assuage some concerns. Winning 1 in 7 games, they would absolutely not get reqs. See, the simulation relies on them getting matched with people with roughly the same Glicko and deviation, but those people don't exist. If you're the highest Glicko around, you'll only battle against people with much lower scores, and thus lose many more points on a loss, which them dropping down to 89.5% during the writing of this post seems to prove. Taking the current highest GXE on the OU ladder, "cadaei b fece" is the only one with GXE>90% and has 93.1% with 2001 ± 74 (very impressive!). The two players around them on the ladder, who would be the most likely to face them, have 1818 and 1777 Glicko respectively. Losing to them would cause a significant Glicko loss, thus requiring even this player to perform somewhat well to qualify.

Thus, I find myself back at my original proposal in devcord: simply using COIL on unrestricted accounts, with the COIL games counter starting when a player joins the suspect test. This seems easiest to implement, easiest to understand, and mathematically the most correct. It also prevents suspect tests from being used to manipulate the ladder in any way, as they will not affect the normal mechanisms, which is ideal. Players who really don't want to play on their main in a suspect test can have a suspect alt, or they can still start fresh if preferred.
In other words, while there are concerns about potential abuses in both systems, without an RD reset the abuses would mostly come from over-qualified players, and with an RD reset they would mostly come from under-qualified players. When you put it that way, it does sound like resetting RD is more trouble than it’s worth. Recent events in OU seem to suggest that there is cause for concern about abuse from high-level players, but whatever windows your proposal would open for them would be much more transparent avenues of “abuse” and about as far from cheating as an exploit could be. More importantly, I think any option that allows these players to skip the low ladder grind will remove most of the incentive to cheat in the first place. There’s potential for boosting-related abuses in both systems, but now that I think about it, letting players control when they reset their RD would almost certainly make that worse. I can agree, then, that your proposal is the better option for COIL tests.

However, it does nothing for tests with simple GXE requirements. COIL is not easier to understand than a GXE + Elo requirement for players or leaders, in part because players don’t have COIL unless a suspect test is underway. It’s simply more difficult for leaders to figure out what COIL target to set for their format, and it’s more difficult for players to understand what they need to do to meet one. Committing to only tracking games separately for COIL tests means committing to either using COIL for all suspect test requirements or forcing tests that don’t use COIL to resort to requiring fresh alts or tournament placement. If everyone is on board with COIL tests being the site-wide standard, then I wholeheartedly agree with ditching the RD reset aspect of the proposal, but otherwise it should at least be available as an option that can be enabled when creating a suspect test.
 
I don't think policy review is the best venue to save singular bytes in a database schema :p

Regardless, as the sort-of-co-sponsor of this proposal, I'm beginning to seriously doubt whether resetting deviation is desirable. On a practical level, resetting everyone's deviation (and thus among other things removing them from the leaderboard) seems a harsh measure, and tracking multiple ratings for every player to switch to post-suspect seems a nightmare. Either of these would lead to a plethora of not-unreasonable complaints from suspect test participants.

On a theoretical level, the point is to have an (at least somewhat) accurate assessment of a player's skill. But that's the purpose of Glicko/GXE both in general and in the context of suspect tests, so why would they require different rating systems? It seems better to simply take the more accurate rating and use it for both purposes.

After considering this more, I’ve realized that if we’re only going to track games played during the suspect and otherwise leave existing accounts’ ratings untouched, there’s no good reason to allow players to manually reset their game count on their established alts or even require them to manually opt in in the first place. That is, it would be in every player’s best interest to opt in as soon as possible, so why not just start counting games for everyone automatically? The way COIL works, all any player needs is to at some point simultaneously achieve a high enough GXE and game count. If a GXE of 82.0% requires 20 games to hit a particular target, it doesn’t matter if a player fails to achieve that GXE by their 20th game; having 82.0% GXE at any point after having played at least 20 games is enough to qualify. A player who fails to qualify for 50 games due to low GXE can still eventually qualify if they turn things around and raise their GXE to something that qualifies in 75 games, while a player who played consistently at that level would still need around 75 games—potentially less if they hit an upwards fluctuation towards the end, but also potentially more if they don’t. COIL is designed to protect against early lucky streaks, but in doing so it rewards late ones. Still, the exceptional simplicity of being able to automatically opt everyone in with no action on their part needed is a huge plus in favor of your proposal.
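To make the "at some point simultaneously" condition concrete, here is a minimal Python sketch. It assumes the commonly cited COIL formula, COIL = 40 × GXE × 2^(−B/N), with GXE in percent and N the number of games counted so far. The function names and both GXE histories are made up for illustration and are not anything PS actually runs.

```python
def coil(gxe: float, games: int, b: float) -> float:
    """COIL after `games` games, assuming COIL = 40 * GXE * 2^(-B/N)."""
    return 40.0 * gxe * 2.0 ** (-b / games)


def first_qualifying_game(gxe_history, target, b):
    """Return the first game count at which COIL >= target, or None.

    gxe_history[i] is the player's GXE (in percent) after game i + 1.
    """
    for games, gxe in enumerate(gxe_history, start=1):
        if coil(gxe, games, b) >= target:
            return games
    return None


# Hypothetical histories for a 2900 target with B = 4.
steady = [76.0] * 80                    # sits at 76% GXE throughout
late_surge = [70.0] * 50 + [76.0] * 30  # sits at 70%, then climbs to 76%

print(first_qualifying_game(steady, 2900, 4))
print(first_qualifying_game(late_surge, 2900, 4))
```

Under these assumptions both histories first qualify at the same game count, because only the current GXE and the total number of counted games enter the formula. How the player got to that GXE doesn't matter.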

However, I think you overstated the significance and likelihood of a player with “medium-high” GXE getting lucky and ballooning their rating in a few games. In a 2900 COIL test with B=4, a player who has only played 13 games would need a GXE of at least 89.8% to qualify. A player whose real skill level is 80.0% GXE—someone who already can qualify if they play just 29 games at their level—would need one hell of a lucky streak to be at 89.8% GXE after 13 games, even with an RD reset. Every full point of GXE less than 90 requires an additional game, down to 84.0% GXE requiring 19 games. Beyond that point, additional losses from whatever early peak they reached increase the number of games they need faster and faster. I’m not really sure if 80.0% matches what you consider “medium-high”, but I’d like to think “medium-high” players should be able to get reqs without needing to get lucky. A COIL target of 2900 hard requires a GXE above a mere 72.5%, and a B value of 4 means that a player needs just 74.5% GXE to be able to qualify in less than 100 games. That is to say, any player who is remotely within range of reaching 85%+ GXE and still being there after 20 games, even with the RD reset, is almost certainly someone who should be able to get reqs, and at worst is reaching them in half as many games as most players of their skill level. On that note, I will once again point out that the goal of this change is to disincentivize smurfing by providing established, likely to qualify players a faster and less disruptive option.
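For anyone who wants to check those game counts, here is a rough Python sketch. It assumes the standard COIL formula, COIL = 40 × GXE × 2^(−B/N), solved for N; `games_needed` is just an illustrative name.

```python
import math


def games_needed(gxe, target, b):
    """Smallest whole number of games N with 40 * GXE * 2^(-B/N) >= target.

    Returns None when GXE is at or below target / 40, since no number of
    games can then reach the target.
    """
    if 40.0 * gxe <= target:
        return None
    return math.ceil(b / math.log2(40.0 * gxe / target))


# 2900 COIL target with B = 4, GXE given in percent.
for gxe in (89.8, 84.0, 80.0, 72.5):
    print(gxe, games_needed(gxe, 2900, 4))
```

With these inputs the sketch gives 13 games at 89.8%, 19 at 84.0%, and 29 at 80.0%, and it reports that 72.5% can never qualify, which lines up with the figures above.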

Now, on the topic of middle-high GXEs and trying to disincentivize smurfing, 70-80 GXE is around the range where both our proposals struggle to address a particular problem that has yet to be discussed.
[Attached graph: IMG_0538.jpeg]

In this graph, the horizontal axis represents a player’s true GXE, and the vertical axis represents number of games played. The reddish region encompasses all the ways a player can be qualified for a COIL target (this particular example is a 2800 target with B = 4). The blue line represents a very rough estimate of how many games a player of a given GXE needs to get back to that level from a fresh alt. The exact curve isn’t important; it’s just there to demonstrate that any steadily increasing curve like it would cross the decreasing red curve at some point. Now, in our current suspect test system, only the part of the red region that is overlapped by the blue one is realistically achievable when qualifying for suspect tests, and the goal of this thread is to find a way for players to land in the red region below the blue line, saving themselves time. But time is exactly the problem with the portion of the graph where red, blue, and green overlap: the narrow band of GXEs for which this hypothetical test requires more games to meet the COIL target than it does for players at that level to get there from a fresh alt. For these players, it is almost certainly faster to smurf, as their first 35 or so battles will be against players below their skill level, resulting in less time actually spent battling. This time-saving opportunity likely extends somewhat beyond the green line, as 30 quick battles can still be shorter than 20 evenly matched ones. I think my proposal handles this problem better, because the chance of an early lucky break that an RD reset provides for players in and near this range gives them a reason not to smurf.
After all, an early lucky streak from their main’s resting point lets them potentially qualify in 20 games instead of 40, where the same luck from a fresh start would be near meaningless to their win rate and speed, and at best would only let them reach their resting GXE sooner, at which point they would just have to keep playing to pad out their game count. On the other hand, an early streak of bad luck would at worst put a player starting from their resting GXE into a lower-level matchmaking pool, which they can easily climb out of, and they would still have a simpler time regaining their lost GXE than they would reaching it from a fresh alt with neutral luck, let alone a bad start.

This brings me to a problem with not having players reset their RD when they participate in a suspect: they already can anyway. As has been explained previously, the Glicko system is designed with RD “decay” built in. The theory is that the rating of a player who has been sufficiently inactive is as uncertain as the default rating given to a brand new player, so everyone’s RD is always trending towards the maximum value and only decreases with activity. As such, a player who has been inactive for a while can already have an account with a medium-high Glicko rating and 100-130 RD, letting them take advantage of the opportunities you outline as being unfairly advantageous. Now, in a normal Glicko system this wouldn’t be much of an issue, as players aren’t supposed to have more than one Glicko rating to their name. They would have to genuinely be inactive, at least within the given ranked system, for around a year or more just to have one chance at an early lucky streak. But Pokémon Showdown isn’t a normal Glicko system. Pokémon Showdown lets players have as many alts as they like, so they can feasibly set up an alt in advance and leave its rating to decay while laddering on their main, and thanks to our current suspect test system and OLTs, for years Smogon has been encouraging, nay, requiring them to perform the steps to set these alts up regularly. There are hundreds of alts like these on every main ladder, so dozens of players in every format are already set up to gain these unfair advantages while many more players are not. This makes those advantages even less fair by leaving them only accessible to users who have (unwittingly) prepared for them long in advance.

Now, you may be thinking: “if those early lucky streaks really are lucky, wouldn’t they need to burn multiple alts per suspect test on average to qualify? Surely they’d run out eventually.” Well, for starters, if they have enough alts (which they very well may, given that these are likely players who were already having to make many alts per suspect test in the past), then they can get a rotation going where, by the time they’ve gone through all their alts, the ones they used first will have decayed back to an abusable position. As for players who don’t have enough alts… well, they can just make more. In other words, your proposal doesn’t actually prevent half-smurfing abuses; it only makes them inaccessible to everyone who hasn’t already been smurfing and makes them harder in a way that incentivizes more smurfing to make up the difference. It is my design philosophy that, if an exploit cannot be completely removed or defended against, it should be made an intentional feature, both so that it can be more easily designed around and accounted for, and so that it can be equally accessible to everyone and not just the ill-meaning who are able and willing to find and leverage said exploit.

As I said before in this post, COIL is designed to protect against early lucky streaks. If there is still concern about the potential for that kind of abuse, as you have described it, in the system I am proposing, then I will reiterate that COIL is customizable, so it can be calibrated to defend against said streaks more strongly without adding too many extra games to the requirements for players with GXEs closer to the intended minimum. As I said in my original post, raising the B value for a COIL test raises the minimum games requirement for everyone, and I’d like to elaborate on that here. Again, setting a COIL requirement for a suspect test requires choosing two values: a B value and a COIL target. The COIL target is really what sets the minimum GXE; it’s simply 40 times the highest GXE that cannot qualify for the suspect, no matter how many games are played. This also sets the baseline for how many games higher GXEs need to qualify; the closer they are to that threshold, the more they’ll need. However, that threshold is not realistically the minimum. A COIL target of 2900 setting a GXE threshold of 72.5% would require a player with 72.6% GXE to have played a whopping 500 games to qualify. And that’s with a B value of 1; the B value in the COIL formula is literally just a flat multiplier on how many games everyone needs to qualify. Adjust the B value in that previous example to 4, and now players with 72.9% GXE need 500 games! Essentially, raising the B value raises the realistic GXE requirement beyond the absolute floor set by the COIL target. So you can simultaneously raise the B value and lower the COIL target a little to flatten out the number of games required for everyone while preserving the realistic GXE minimum. And I do mean “a little”.
Adjusting a 2900 COIL B=4 test to 2880 COIL B=5 requires 2 to 6 more games from everyone above 74.8% GXE (the additional games needed peaks at 76.9% GXE), an unchanged 100 games for 74.5% GXE, and while this would mean GXEs lower than 74.5% that are capable of qualifying require fewer games, this doesn’t really look significant to me until 73.2% GXE where a 289 game requirement shrinks to 210. Hopefully this extremely minor adjustment and others like it are enough to push these hypothetical undeserved lucky streaks far enough out of the realm of possibility.
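Here is a small Python sketch for comparing two COIL settings side by side. It again assumes COIL = 40 × GXE × 2^(−B/N); the function names and the sample GXE values are only illustrative.

```python
import math


def games_needed(gxe, target, b):
    """Smallest whole N with 40 * GXE * 2^(-B/N) >= target, else None."""
    if 40.0 * gxe <= target:
        return None
    return math.ceil(b / math.log2(40.0 * gxe / target))


def compare(gxes, old, new):
    """Return (gxe, games under old, games under new) for each GXE.

    `old` and `new` are (target, B) pairs.
    """
    return [(g, games_needed(g, *old), games_needed(g, *new)) for g in gxes]


# Sample GXEs, all above both floors (72.5% and 72.0%).
rows = compare((73.2, 76.9, 80.0, 85.0, 90.0), (2900, 4), (2880, 5))
for gxe, before, after in rows:
    print(f"{gxe:5.1f}%  {before:4d} -> {after:4d}")
```

The 73.2% row reproduces the drop from 289 to 210 games mentioned above, while the higher sample GXEs each need a few more games under the 2880/B=5 setting.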
 
Resetting the game count and then recalculating COIL for existing accounts feels like a good solution. Most people who are going for reqs have 75-85% GXE on their mains, and that results in a pretty reasonable game count. This also helps prevent frustration, as the GXE for a "mature" account is more stable, and it would get rid of the folks resetting 10 times until their fresh alt goes 20-0.
 