Resource Everything You Ever Wanted to Know About Ratings

pokemonisfun · Jan 29, 2023

Sijih said:
Here are two recreations of the graph for you. The left one is probably what it looked like. In the right one I added indicators of where the 1500 and 1700 numbers Antar is using in the examples are.

I'm pretty sure you and I may be the only people who ever look at this thread, but I'll post this graph here first so anyone can give feedback/make the graph look nicer. Maybe I could add some shading to show probabilities, or add lines to indicate where the distribution means are.

I can also post the code I used to generate this if anyone wants.

If there are no objections I'll ask a member of upper staff to edit one of my graphs into the post in a day or three.

My only comment for now: THANK YOU!

genisu · Oct 27, 2023

Very informative, helped me understand showdown ratings tysm!

Shadowys · Dec 31, 2024

I'm redirected here to provide a critical review of Glicko and its misuse through GXE in Smogon.

Rating systems are fundamental to competitive gaming, helping match players of similar skill and track improvement over time. The Glicko rating system, developed by Mark Glickman, improves upon the Elo system by introducing a measure of rating reliability. Serious competitive electronic games such as Dota 2 and League have adopted it or similar systems to matchmake players. Here on Smogon, however, we instead use a derived approximation called GXE by X-Act (original post here). My goal is to show that Glicko by itself is suitable for ladder tiering and tournament seeding, and that GXE is ultimately harmful in any serious use.

On Glicko

As mentioned in the OP, the Glicko system uses two primary numbers to track player skill: Rating (R) and Rating Deviation (RD). The rating, typically starting at 1500, represents the estimated skill level. The Rating Deviation, usually starting at 350 (though Smogon may be using 130), indicates the uncertainty in this estimate. This two-number approach is crucial because it tells us not just how good we think a player is, but how confident we are in that assessment.

Rating Deviation naturally evolves throughout a player's career. It decreases as players complete more games, increases during periods of inactivity, and responds to the variety of opponents faced. A player who regularly competes against diverse opponents will typically see their RD decrease more rapidly than one who faces the same opponents repeatedly.

Typical RD ranges for different player categories:
- New players: ~350
- Active players: 60-110
- Regular players: 30-80
- Very active/professional players: 20-40
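
The way RD shrinks with games played and grows with inactivity can be sketched from Glickman's published Glicko-1 formulas. This is a minimal illustration, not Pokémon Showdown's actual code: the function names, the decay constant `c`, and the default cap of 350 are assumptions chosen for the example.

```python
import math

Q = math.log(10) / 400  # Glicko-1 scaling constant

def g(rd: float) -> float:
    """Attenuation factor: discounts a result by the opponent's uncertainty."""
    return 1 / math.sqrt(1 + 3 * Q**2 * rd**2 / math.pi**2)

def expected(r: float, r_opp: float, rd_opp: float) -> float:
    """Expected score against one opponent, as used inside the rating update."""
    return 1 / (1 + 10 ** (-g(rd_opp) * (r - r_opp) / 400))

def rd_after_inactivity(rd: float, periods: int,
                        c: float = 30.0, rd_max: float = 350.0) -> float:
    """RD grows with idle rating periods; c and rd_max are assumed values."""
    return min(math.sqrt(rd**2 + c**2 * periods), rd_max)

def update(r: float, rd: float, results):
    """One rating period. results is a list of (r_opp, rd_opp, score),
    where score is 1 for a win, 0.5 for a draw, and 0 for a loss."""
    terms = [(g(rd_o), expected(r, r_o, rd_o), s) for r_o, rd_o, s in results]
    inv_d2 = Q**2 * sum(gj**2 * e * (1 - e) for gj, e, _ in terms)
    denom = 1 / rd**2 + inv_d2
    new_r = r + (Q / denom) * sum(gj * (s - e) for gj, e, s in terms)
    new_rd = math.sqrt(1 / denom)  # always smaller than rd after any games
    return new_r, new_rd

# Worked example from Glickman's Glicko description: a 1500/200 player beats
# a 1400/30 opponent, then loses to 1550/100 and 1700/300 opponents.
print(update(1500, 200, [(1400, 30, 1), (1550, 100, 0), (1700, 300, 0)]))
# about (1464, 151.4)
```

Note that the new RD depends only on who was played, not on the results, which is why playing any games at all reduces uncertainty.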

On GLIXARE aka GXE
(original post here) A formula called GLIXARE was proposed to convert Glicko ratings into a single number:

Code:

GLIXARE Rating = 0, if RD > 100

GLIXARE Rating = round(10000 / (1 + 10^(((1500 - R) * pi / sqrt(3 * ln(10)^2 * RD^2 + 2500 * (64 * pi^2 + 147 * ln(10)^2)))))) / 100, otherwise
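
For readers who want to try values, the formula above translates directly into Python. This is a plain transcription of the quoted formula; the function name `glixare` is an assumption for illustration.

```python
import math

def glixare(r: float, rd: float) -> float:
    """GLIXARE rating (a percentage) from a Glicko rating and deviation."""
    if rd > 100:
        return 0.0  # ratings with RD above the threshold are reported as zero
    ln10_sq = math.log(10) ** 2
    denom = math.sqrt(3 * ln10_sq * rd**2
                      + 2500 * (64 * math.pi**2 + 147 * ln10_sq))
    exponent = (1500 - r) * math.pi / denom
    # Round to two decimal places of a percentage, as in the formula.
    return round(10000 / (1 + 10 ** exponent)) / 100

print(glixare(1500, 50))   # 50.0: a 1500-rated player sits exactly at 50%
print(glixare(1500, 101))  # 0.0: RD is over the threshold
```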

GXE as an absolute number was proposed to replace CRE as a definitive measure of a player's skill, because players may otherwise be difficult to compare. This is a fundamental misuse of Glicko-1: it takes a statistical formula and produces an absolute number to rank or tier players, for example in ladder requirements for suspect tests.

The GLIXARE formula suffers from several fundamental problems. First, its RD threshold of 100 is unrealistically low. Most active players naturally have RD values above 100, meaning the formula would assign them a rating of zero. This creates artificial "dead zones" where ratings become meaningless and completely ignores valid skill information from newer players.

The mathematical structure of GLIXARE introduces additional concerns. Its unnecessarily complex scaling and non-linear RD effects create unpredictable rating changes. The formula misuses RD's statistical properties and rigidly centers everything around 1500, limiting its flexibility. These issues make it poorly suited for practical applications like tracking player progress, facilitating matchmaking, and, by extension, setting ladder reqs.

These inherent flaws surface as dubious player behaviour when getting ladder reqs with GXE: players start multiple new alts hoping for a lucky streak (which COIL was an attempt to mitigate), and some excellent players spend too much time trying to get reqs simply because they were unlucky. This isn't sustainable, and it is a needless waste of system resources and human effort.

Better Ways to Use Glicko for Smogon

Instead of forcing ratings into a single number, we should be using R and RD from Glicko directly (with an optional minimum number of games).

To quote the typical RD ranges again:
- New players: ~350
- Active players: 60-110
- Regular players: 30-80
- Very active/professional players: 20-40

We can set ladder reqs to, say, R >= 1700 with RD <= 100 to filter for regular players, or R >= 1800 with RD <= 50 for professional or tournament-winner-adjacent players. This would encourage participation among excellent players by using a more accurate range of values instead of an artificial number like GLIXARE, which fundamentally misunderstands that Glicko was meant to be a statistical formula rather than an absolute number. This change would also encourage players to stick with the one account used for ladder reqs, because Glicko-1 responds relatively quickly to rating changes.
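
A requirement of this kind is simple to express as a check on both numbers. The sketch below is only an illustration of the proposal: the `LadderReq` type, the `qualifies` function, and the `min_games` default are hypothetical names, and the thresholds are the ones suggested in the paragraph above.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class LadderReq:
    min_rating: float   # minimum Glicko rating R
    max_rd: float       # maximum rating deviation RD
    min_games: int = 0  # optional minimum number of games

def qualifies(r: float, rd: float, games: int, req: LadderReq) -> bool:
    """True only if the rating is high enough AND certain enough."""
    return r >= req.min_rating and rd <= req.max_rd and games >= req.min_games

REGULAR = LadderReq(min_rating=1700, max_rd=100)
PROFESSIONAL = LadderReq(min_rating=1800, max_rd=50)

print(qualifies(1750, 80, 40, REGULAR))       # True
print(qualifies(1850, 90, 40, PROFESSIONAL))  # False: RD is still too high
```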

Edit: To explain further why professional players are marked as R >= 1800 and RD <= 50: this means there's 95% confidence that the player's true Elo is between roughly 1700 and 1900 (R ± 2 RD), and 1800 is the typical range of professional players in other games. (The beauty of Glicko is that it ignores what is being played and produces a very accurate and fast way to get the true rating of a player.) Consequently, said professional player has a 55%-95% range of win rate against a player of R=1500 and RD=130. The "best estimate" provided by directly comparing two numbers, which is what GXE uses, is 85%, but it neglects to mention that this comes with a high deviation of 133 rating points. That's the problem with GXE: it doesn't tell the full story.
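
One way to compute an interval and a win-rate range of this kind is sketched below. It assumes the usual normal approximation (R ± 1.96 RD) and Glickman's expected-score formula evaluated at the two ends of the interval; the post does not say how the opponent's own uncertainty was treated, so the resulting range depends on that choice. Function names are illustrative.

```python
import math

Q = math.log(10) / 400  # Glicko-1 scaling constant

def g(rd: float) -> float:
    """Attenuation factor for an opponent with deviation rd."""
    return 1 / math.sqrt(1 + 3 * Q**2 * rd**2 / math.pi**2)

def expected(r: float, r_opp: float, rd_opp: float) -> float:
    """Expected score of a player rated r against (r_opp, rd_opp)."""
    return 1 / (1 + 10 ** (-g(rd_opp) * (r - r_opp) / 400))

def rating_interval(r: float, rd: float, z: float = 1.96):
    """Approximate confidence interval for the true rating (95% at z=1.96)."""
    return r - z * rd, r + z * rd

def win_range(r: float, rd: float, r_opp: float, rd_opp: float, z: float = 1.96):
    """Expected score at the low and high ends of the player's interval."""
    low, high = rating_interval(r, rd, z)
    return expected(low, r_opp, rd_opp), expected(high, r_opp, rd_opp)

print(rating_interval(1800, 50))       # about 1702 to 1898
print(win_range(1800, 50, 1500, 130))  # low and high expected scores
```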

---

While attempts to simplify Glicko into a single number like GLIXARE are understandable, they often sacrifice valuable information about rating uncertainty. Instead, embrace Glicko's two-number system and use confidence intervals for tiering players. This approach provides a more nuanced and accurate picture of player skill while accounting for the natural uncertainty in skill assessment.

pyuk · Dec 31, 2024

Shadowys said:
I'm redirected here to provide a critical review of Glicko and its misuse through GXE in Smogon.

You seem to be misunderstanding some things. First, 130 isn’t just what new players start at; it’s the maximum* RD on PS. It only takes about 5 games, win or lose, for a player’s RD to get to or below 100, so that threshold is more appropriate than you claim.

Second, we don’t use GXE for ladder ranking or matchmaking. We use Elo for both of those. GXE is only ever** used for suspect test requirements. This is notably not what X-Act originally proposed GXE be used for; GXE was only ever proposed to replace CRE as the metric by which players are ranked against one another on the ladder, as while it does not (and never was intended to) accurately estimate a given player’s average chance at beating a random opponent from the ladder, it does (at least, in X-Act’s sample case) estimate this chance very accurately for the sole purpose of ranking players against each other. To that end, GXE is only meaningful in context, and I agree that it is inappropriate for suspect tests whose requirements are founded in absolutes derived with mathematical rigor. However, suspect test targets are hardly so objective. They are, like your suggested R and RD targets for suspects seem to be, estimations based on an eyeballing of the ladder rankings and feel. For suspect test targets that are produced from the context of the ladder, GXE should function well enough as a target to beat.

Third, GXE is ultimately just Mark Glickman’s own formula for calculating the chance of one player beating another used against the best standardized dummy value available. To call it “unnecessarily complex” seems like a condemnation of the entire Glicko-1 system. Indeed, that formula of Glickman’s is the entire basis for the Glicko-1 system, as it is the only means by which Glicko-1 ratings are meaningfully comparable. It was with that formula that X-Act calculated the “true rankings” of the players in their sample case and found GXE to be a very close approximation of those rankings if not the actual chances-to-beat themselves.
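
The relationship between GXE and Glickman's formula can be checked numerically. The sketch below assumes the "dummy" opponent is a player with R = 1500 and RD = 350, which is what the constants in the GLIXARE formula quoted earlier in the thread work out to (2500 × 64 = 400² and 2500 × 147 = 3 × 350²). Function names are illustrative, and the rounding and RD > 100 cutoff are left out so the two expressions can be compared directly.

```python
import math

Q = math.log(10) / 400  # Glicko-1 scaling constant

def g(rd: float) -> float:
    return 1 / math.sqrt(1 + 3 * Q**2 * rd**2 / math.pi**2)

def expected_outcome(r_a: float, rd_a: float, r_b: float, rd_b: float) -> float:
    """Glickman's expected outcome of a game between two rated players,
    which combines both players' deviations."""
    combined_rd = math.sqrt(rd_a**2 + rd_b**2)
    return 1 / (1 + 10 ** (-g(combined_rd) * (r_a - r_b) / 400))

def glixare_unrounded(r: float, rd: float) -> float:
    """GLIXARE as a percentage, without rounding or the RD cutoff."""
    ln10_sq = math.log(10) ** 2
    denom = math.sqrt(3 * ln10_sq * rd**2
                      + 2500 * (64 * math.pi**2 + 147 * ln10_sq))
    return 100 / (1 + 10 ** ((1500 - r) * math.pi / denom))

for r, rd in [(1500, 50), (1700, 80), (1800, 50), (1300, 100)]:
    via_glickman = 100 * expected_outcome(r, rd, 1500, 350)
    print(r, rd, via_glickman, glixare_unrounded(r, rd))  # the two columns agree
```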

Now, perhaps our tier leaders are setting uninformed GXE/COIL targets for lack of ladders ranked by GXE. This is a genuine concern; GXE isn’t being used for the one thing it actually gets right, so how can we expect that anyone understands what a given GXE rating actually means? That said, Glicko targets would be similarly dubious, as our ladders aren’t sorted by Glicko either. Indeed, they can’t be, as Glicko ratings consist of two independent numbers. In order to use them for rankings, one must first coerce them into a single number, which is what CRE, ACRE, and GXE are all designed to do, and I believe I have listed those in order of ascending effectiveness.

I will say that I do have my own issue with COIL in particular. The COIL formula is continuous at GXE >= 100.0%, and it seems to me that the theoretical formula that exactly correctly calculates what COIL tries to approximate should not be continuous at that range. After all, it is of course utterly impossible for anyone to have over 100% GXE***. I won’t go into more detail in this post on that matter for the same reason I opted not to bring it up in my ongoing Policy Review thread on suspect test reform, but I thought I’d at least mention it here as long as we’re criticizing rating systems.

* I recently discovered a likely-long-standing bug with PS’s Glicko-1 implementation that causes it to not apply the RD maximum when applying decay to inactive players, which lets those players’ RDs balloon to unlimited heights. I have already opened a pull request to fix this, but I’m just now realizing that PS is also only applying the decay formula to players who played no games at all during the rating period, when it should be applied to everyone in every rating period in a proper Glicko-1 implementation. I will shortly be amending my pull request to address that as well.

** I don’t know how OLTs work. They and some unofficial competitions may be using GXE as well.

*** It is theoretically possible for a player’s GXE to be 100.0%, as GXE is rounded to the nearest tenth of a percent, but show me a player with 99.96% GXE, and I’ll show you a cheater.

Shadowys · Dec 31, 2024

pyuk said:
Third, GXE is ultimately just Mark Glickman’s own formula for calculating the chance of one player beating another used against the best standardized dummy value available

1. The typical max RD is 350; Smogon has changed that to 130. I don’t think I’m quoting anything different here. The arbitrary dead zone where a new player’s skill is ignored is why GXE is prone to initial streaks, while Glicko, with its two parameters R and RD, is much more accurate with fewer games. In other games that use Glicko, placement matches function the same way; they don’t ignore new player skill.

2. Yes, we don’t; I was only referring to other games that do use Glicko for general ranking, and I’m singling GXE out for its attempted use in suspect tests, which serves to sort players into qualified and unqualified. I disagree that GXE is accurate: it disregards important information about RD and treats R and RD as single numbers instead of a range of values.

3. No. Glicko on its own retains the uncertainty of trying to predict someone’s rating, while GXE removes that by trying to collapse the RD into some probability that is based on arbitrary variables. Glicko itself did not fold R and RD into a single value, for obvious reasons: R by itself is already a comparative value for predicting whether a given player can beat another player with another R, and it functions similarly to an Elo rating. RD is the uncertainty in using said single number. This is why GXE is fundamentally unsound and a misuse of Glicko: it ignores the intrinsic uncertainty available in the system.

4. I selected the Glicko ratings of 1800 R and 50 RD because they give 95% confidence that your true Elo is between roughly 1700 and 1900 (R ± 2 RD), and 1800 is also the typical range for expert players in other games. Using Glicko-1's formula to calculate the win probability against someone of 1500 R and 130 RD yields a range of 55%-95% win rate. Again, this is what's wrong with GXE. GXE is a "best estimate" that doesn't actually tell the full picture.
