capefeather
This post has been quite a few weeks coming. Over the months, I've seen certain complaints against the suspect test and the rating system that powers it. The trouble is, many of them aren't far from the truth. Proven and/or dedicated battlers aren't making reqs while some... questionable characters are. People are robbed of their req status after a string of hax-heavy battles. Finally, the current method of dealing with all this is terribly inadequate. I'm not here to scold anyone or assign blame; I'm just stating certain facts and clear trends, and I'm hoping that some change comes out of this.
A note: I do get that a lot of this isn't as clear as I thought it would be! I've edited this OP quite a few times already based on people asking me to clarify one thing or another. I'm not unwilling to edit it more for this purpose.
Ratings
(btw, lines 78-92 of this part of PO's code calculate what rating each player in a battle gains or loses. Basically, it depends on the difference between the two ratings and on the number of times that one or the other player (that part isn't entirely clear to me) has battled, though the battle count only factors in for the first five battles on an account.)
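For anyone who doesn't want to read the source, the general shape of this kind of update is Elo-style. Here's a minimal sketch; the logistic curve and the specific K values are my guesses for illustration, not PO's actual constants:

```python
def expected_score(rating_a, rating_b):
    """Probability that A beats B under a logistic (Elo-style) model."""
    return 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / 400.0))

def update(rating_a, rating_b, a_won, games_played_a):
    """Return A's new rating. K is larger for provisional accounts
    (here: the first five battles), mirroring what the PO code seems
    to do. The K values 48/24 are illustrative, not PO's."""
    k = 48 if games_played_a < 5 else 24
    actual = 1.0 if a_won else 0.0
    return rating_a + k * (actual - expected_score(rating_a, rating_b))
```

The key property for everything that follows: once the provisional period ends, K is fixed, so every battle moves the rating on the same scale forever.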
Pokémon Online has taken several steps backward from Shoddy Battle / Pokémon Lab, and its rating system is a big part of that. Now, I'm not going to shoot my mouth off about this without some kind of evidence, but this particular step backward is plain from the fact that we've gone from a convergent rating system to an at best oscillatory one. What I mean is, on Shoddy Battle and PL, a player's rating fluctuates less over time as they battle. On PO, however, one can gain or lose 7-24 points per battle no matter how long one has played, never really settling on a "true" rating but bouncing back and forth around it.
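The contrast is easy to state abstractly: a convergent system shrinks its step size as it accumulates evidence about a player (Glicko, which Shoddy Battle used, does this through a shrinking ratings deviation), while a fixed-K system never does. A toy comparison (not either simulator's actual code; the decay rate below is invented):

```python
def fixed_k_step(n_games, k=24.0):
    """Per-battle rating swing in a fixed-K (Elo-style) system:
    the same no matter how many games you have played."""
    return k

def shrinking_step(n_games, k0=48.0):
    """Toy convergent system: the per-battle swing decays as games
    accumulate, loosely imitating how a ratings deviation shrinks
    in Glicko. The 1 + n/10 decay is made up for illustration."""
    return k0 / (1.0 + n_games / 10.0)
```

After 200 battles the fixed-K player still swings 24 points a game, while the convergent player's swings have shrunk to a couple of points; the first rating oscillates indefinitely, the second settles.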
This fact alone means that PO's rating system doesn't even try to measure a player's skill level precisely. Better players will tend to have higher ratings, but that just shouldn't be good enough for something as important as suspect voting. The fact that the system seems to do what it's supposed to do provides a convenient illusion, in which qualified voters can't help but see complaints like this one as excuses for inadequacy, so the problem never gets solved.
Exclusivity
A huge problem that I find with the current suspect testing system is that it's very, very exclusive. The DPP Stage 3 Round 2 test had around 90 voters drawn from a much smaller pool. The misconception I keep running into is the comparison to other tests, notably ones that involved writing arguments to the tiering organizers; that comparison makes people think the current problems can be solved by raising the requirement, making the test even more exclusive. I hope that by the end of this post, people will see that this will not solve the problem; it will only shut out even more deserving users and make the voter selection more arbitrary.
A demonstration
For the purposes of this post, I wrote a ~100-line program in Python using the SciPy module (it's no MATLAB, but eh) that attempts to simulate the probability distributions of the ratings of players at different skill levels. I've made the following assumptions:
1. "New" users come in at a constant rate
2. Players battle at the same rate for 201 battles (Recently, the #1 battler was determined to have played 95 battles in about half the time of a suspect test. I'm going by this figure.)
3. If a player's rating drops below 1000, they quit (and presumably make an alt as part of a "wave" of "new" users)
I may have forgotten to list an assumption somewhere. Anyway:
the code
Included is a commented introduction that reads:
"This program builds a matrix (called "rating") that looks at the "real" skill levels of each hypothetical player (rows) and gives a probability distribution of the rating that PO gives it (columns). At each "round", rating changes are recorded in a different matrix (called "new_rating"), which is then copied over onto the rating matrix to start the cycle anew. Each entry goes through a process of "searching" for other battlers; each possible matchup with a different entry has a weighted probability associated with it, and then the possible new rating probabilities are added based on the probability that the better player will beat the worse player and vice versa (a figure controlled by the constant "hax"). Additionally, the end of each "round" introduces a "wave" of new user accounts starting at rating 1000. A separate list is also made to analyze the probabilities for the best players in the "original wave", henceforth known as "star battlers"."
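Since the full script is hidden behind the spoiler, here is a heavily condensed sketch of the kind of probability-mass update the introduction describes. It tracks a single player's rating distribution against a fixed win probability; the real program also handles matchup weighting, waves of new accounts, and the "hax" constant, none of which appear here:

```python
import numpy as np

def one_round(dist, p_win):
    """One "round" of the probability-mass update. dist[i] is the
    probability that the player currently sits in rating bin i. After
    a battle, mass shifts up one bin with probability p_win and down
    one bin otherwise (a crude stand-in for the fixed point swing);
    mass at the boundary bins stays put. Illustrative only."""
    new = np.zeros_like(dist)
    for i, p in enumerate(dist):
        up = min(i + 1, len(dist) - 1)
        down = max(i - 1, 0)
        new[up] += p * p_win
        new[down] += p * (1.0 - p_win)
    return new
```

Start with all the mass in one bin (everyone at 1000), iterate, and the distribution spreads out into the rough bell curves discussed under the results.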
The simulation puts in the "kfactor" (the multiplier that controls the rating variations) properly only for the first "wave" of battlers, but it shouldn't make a terribly large difference. This simulation obtained the following results:
the results
Probabilities that players of each level made reqs
1400:
5 : 0.0433088551131
6 : 0.956691144887
7 : 1.2699405546
1350:
3 : 0.00140081646388
4 : 0.0154342915184
5 : 0.133125780904
6 : 0.850039111113
7 : 1.04840930708
Proportion of *s who make 1400: 0.0111544260969
Proportion of *s who make 1350: 0.141927539828
Proportion of people who make 1400: 0.00125477482913
Proportion of people who make 1350: 0.0193391671513
The numbers are a bit lower than the actual results that we have seen, but please bear with me.
The main thing to notice is that the average rating for each skill level is actually pretty low. The rating cutoff catches only the "tail ends" of these roughly bell-curved rating distributions. I don't think it's a stretch to believe this happens in reality too. The implications below mostly follow from this.
Let's first look at a rating cutoff of 1400, which lets in 0.13% of all of the battlers. This is probably even more exclusive than the cutoff that we've ended up using, but perhaps the points will be made clearer this way.
Let's consider the so-called "star battlers" ("Level *"), those battlers who can beat everyone other than maybe each other barring "outside" factors AND who have participated in the entire process from t = 0. In this setup, they represent 1 in 1400 battlers, which maybe translates to four potential voters, and yet they have a 1.1% probability of actually making it into the voter list. The ones who deserve it most don't get it. Now, it's probable that people who started out a little bit later got in, but this finding stands.
We also see that about 4.33% of the voting list consists of "false positives", battlers who aren't actually the best but I guess got lucky or something. This may not seem like a huge deal at first, but when you consider how close some of the voting results have been, it becomes quite a bit more worrisome. At the whim of maybe two "false positives" out of 50, the following results could have gone differently:
Drizzle R1 (58.5%) (it could have been banned in R1)
Shadow Tag (43.6%)
Manaphy (60.5% R1, 61.7% R2) (it could have been banned in R1)
Blaziken (72.5%)
Deoxys-S (53.7%)
Latios R3 (55.0%)
Now, this is not just a meaningless hypothetical. I'm fairly certain that anybody who's really looked at the past three voter lists has found some "questionable" individuals. This is seen as inevitable, and the hope is that the rest of the population will offset them. So much for that.
Now let's lower the cutoff a little, to 1350. This is where things get really interesting. The "star battlers'" chances increase to a not totally unhealthy 14.2% (still...), and the overall proportion of voters is 13.5%. Maybe now we'll at least get most of the battlers we want into the voting pool! Alas, whether or not that is achieved, an alarming 15% of voters are "false". This would call at least some of the following results into question:
Deoxys-A (95.8%)
Deoxys-N (71.8% R1, 71.1% R2)
Excadrill (37.2% R1, 34.0% R2, 23.3% R3)
"Aldaron's Proposal" (72.1%)
Latios R2 (38.6%)
Reuniclus (26.1% R2, 22.5% R3)
Thundurus (39.5%)
Brightpowder + Lax Incense (86.8%)
Drizzle R3 (35.9%)
Drought (21.6%)
...which leaves the Round 1 results of Darkrai, Latios, Shaymin-S and Moody untouched.
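The "two false positives out of 50" reasoning above is simple arithmetic: each voter who switches sides moves the final percentage by 1/n. Here's a sketch of that check; the 50-voter pool and the pass thresholds are assumptions for illustration, since the actual cutoffs and voter counts varied by test:

```python
import math

def flips_to_change_outcome(pct_for, n_voters, threshold):
    """How many voters would need to switch sides to move a result
    across the pass threshold. One switch changes the 'for' count by
    one vote, i.e. the share by 1/n_voters. Both n_voters and
    threshold are hypothetical inputs, not the tests' real parameters."""
    votes_for = round(pct_for * n_voters)
    votes_needed = math.ceil(threshold * n_voters)
    if votes_for >= votes_needed:
        # currently passing: count switches to drop just below the cutoff
        return votes_for - votes_needed + 1
    # currently failing: count switches to climb up to the cutoff
    return votes_needed - votes_for
```

With 50 voters and a hypothetical 55% cutoff, a 58.5% result is two switched votes away from failing, which is why even a handful of "false positives" matters.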
"Special Permission"
Where does this leave the "special permission applications" that those who somehow don't make reqs are left to write? As I see it, these applications are supposed to patch over the flaws inherent in the rating system. The main problem is that the instructions for these applications are really vague, and I suspect that ratings still weigh heavily on who gets through, which defeats the whole point of having the applications.
Other thoughts
I'm not saying that most of the test was illegitimate or that the voters are a sham or anything like that. I'm just saying that this matters a lot. I may not have achieved the most realistic results on my little program, and maybe a "hax" factor of 1/8 isn't completely realistic, either, but what it says is pretty concerning to me.
The worst part for me is that all this largely stems from the 200 rating difference minimum in ladder matchmaking. The ladder behaves like a sort of gravity well, or quicksand to an extent, because until you hit 1200 you could be facing an opponent of ANY skill level, and that effect never truly fades. You battle an opponent whose rating is wrong, so the points you gain or lose are also wrong, and those errors feed further incorrect rating fluctuations.
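To make the gravity well concrete, here's a toy version of windowed matchmaking. It assumes (my reading of the rule, so treat the exact mechanics as an assumption) that opponents are drawn from within 200 rating points of you:

```python
import random

def find_opponent(me, pool, window=200):
    """Return a random opponent whose rating is within `window` points
    of mine. `pool` is a list of dicts with 'rating' and 'skill' keys.
    The 200-point window is my assumption about PO's matchmaking."""
    candidates = [p for p in pool
                  if p is not me and abs(p["rating"] - me["rating"]) <= window]
    return random.choice(candidates) if candidates else None
```

Since every new account starts at 1000, the band around 1000 mixes every skill level, so the outcomes of those matches, and the rating points they move, carry very little information about either player.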
The other thing is the encouragement of alts. The fact that making an alt is optimal in certain obvious cases (e.g., you lose one of your first five battles, or you drop below a 1000 rating) makes this painfully clear. People don't seem to understand how much this matters. No legitimate system for determining skill level would allow players to throw out win/loss records because it's "optimal". Could you imagine if they could do that in football or soccer or tennis or any other sport? So why are we so reliant on such a system here?
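To make the "lose one of your first five battles" case concrete, here's a back-of-the-envelope comparison. It assumes (my assumption, not PO's documented constants) a doubled K-factor for the first five battles, and it treats every opponent as equal-rated, so each battle moves the rating by K/2:

```python
def rating_after(results, k_prov=48, k_norm=24, start=1000):
    """Rating after a sequence of wins (True) / losses (False) against
    always-equal-rated opposition, the simplest Elo case: each battle
    moves the rating by K/2. The doubled provisional K for the first
    five battles is my assumption about PO's scheme."""
    r = start
    for i, won in enumerate(results):
        k = k_prov if i < 5 else k_norm
        r += (k / 2) if won else -(k / 2)
    return r
```

Five straight wins on a fresh alt (1120) beat a loss followed by five straight wins on the old account (1084), even though the old account has the same number of wins plus the evidence of an extra game. When deleting evidence improves your rating, the incentive is broken.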
I don't make alts. I don't believe in gaming the ladder, even now, when it practically begs me to do so. At the same time, I wish I could sit higher on the ladder more often so that I could battle better players more often. But when I get haxed out of several battles to the point where I'm stuck in low-rated, boring battles, that's simply not a fun or motivating situation. Here, I'm forced to choose between honesty and a good game. I'm not willing to choose.
Hell, I don't even expect the rating system to be "fixed" any time soon (though abolishing the 200 minimum would help a lot). It would take quite a bit of effort to redo the rating system to make it as reasonable as what Shoddy Battle used to use. Despite everything, coyotte508 can do whatever he wants with his program. That is why I am going after the suspect test system instead (though obviously I can't blame the suspect test leaders, either).
Summary
So what have we found from all of this?
1. The rating system is unreliable for gathering informed voters, largely because it doesn't actually try to arrive at a single "true" rating for a player.
2. The rating system gives a convincing illusion that it does what the suspect test wants it to do. After all, better players TEND to do better. *thumbs up*
3. The suspect test voting pool is extremely exclusive - more so than DPP Stage 3 Round 2's - and it has to be, for all the wrong reasons.
4. The voters and even the tiering leaders have little choice but to perpetuate the lie unwittingly, even through the "special permission applications", despite knowing or at least suspecting that something is wrong here.
Well, then give us a solution, you crybaby!
Well, I've actually proposed the solution I'm about to present once before, notably last September when Cathy attempted to take over the tiering process, but I guess there were other things on people's minds back then, which is understandable.
Paragraphs have been seen as the ideal, but they've been rejected for taking way too much time. However, the Smogon Council system introduced an interesting alternative: an IRC conversation between the "council" members and only between them. Now, it would be quite a bit harder to organize the same system for 50+ voters and get everybody into the same conversation, but ultimately I don't see that as completely necessary. Voters should engage in conversation not only to prove that they're actually competent but also to demonstrate that they care about more than just their preferences. Of course, we'd also need to consider lowering the rating requirement so that more of the "right" people get in.
What I would personally also like to see (especially if the rating requirement isn't adjusted) is a way of letting more people in. What I have in mind is not a private channel but a public one with mute on and current voters voiced, so that interested people can at least watch what's going on. I could see people earning "temp voice" status through alternative credentials like successful contributions to C&C/CAP or doing well in a relevant tournament. Bad apples out, good apples in.
I'm not posting all this just for my own sake. I don't fully expect to make voting privileges even if these measures were taken, though it would have been nicer if the reason were that my 1600 mean rating on Shoddy/PLab still wasn't enough, rather than what is going on currently. I don't think I would have bothered with this if it weren't for the people I've watched get caught in completely unfair situations, or if a team based on Cynthia hadn't been able to make a solid 1200 rating. This is a real problem that affects many people, and I know that a lot of people get that there really is something wrong, even if they don't want to admit it.