np: BW OU Suspect Testing Round 10 - Hazard

Meru · Feb 20, 2013

Anybody who says that they didn't notice the differences between the Suspect ladder and the other ladder must be really oblivious. Throughout the 80-something battles you did to get reqs, did the influx of SR Terrakion, Aero, Azelf, Mew, Froslass, Roserade, and Custap+Sturdy users not tip you off?

Shining_Latios · Feb 20, 2013

Meru said:
Anybody who says that they didn't notice the differences between the Suspect ladder and the other ladder must be really oblivious. Throughout the 80-something battles you did to get reqs, did the influx of SR Terrakion, Aero, Azelf, Mew, Froslass, Roserade, and Custap+Sturdy users not tip you off?

I never saw Aero, Azelf, Mew, Froslass, or Roserade in any of my battles in the first place.

Tassa · Feb 20, 2013

Pwnemon said:
Ferrothorn and Skarm can setup alongside.

When? The best they can do is attack; Magic Coat is a huge pain for them otherwise. (I played only a few matches with Deo-D, but I remember this stupid Skarm who tried to set up his hazards on me, giving me the SR I lacked in my first test team.)

Pwnemon said:
E @ 2 above: Zam can run encore or Taunt (why don't more?)

Magic Coat, Mental Herb.

Curtains said:
What is stopping a focus sashed terrakion to get SR up? Nothing.

Sableye : taunt turn 1, burn turn 2, recover turn 3.
Espeon may be KOed by Stone Edge, but it leaves you at 1 HP while preventing SR, so you cannot set SR. This holds for whatever Espeon set you run, as long as you have Psychic.

Curtains said:
Also rain is dominant? I think i recently seen a stat that rain is only second to HO or non weather. So that is false.

Stats are one thing.
Analyzing them properly is another.
Do you realize that "non weather" is not a playstyle defined by some "non-weather mon" that we see in all these teams? That non weather is what we had in 493 (bar some Rain Dance and Aboma, which made real weather teams; teams with Tyra were rarely strongly weather-oriented)?
Are you saying that 80% of the 493 metagame was a single playstyle?

To Meru, I played something like 140 matches (I failed my first attempt, which is funny since my win rate was similar in both), and I cannot judge objectively since the second time I used a Custap Forre. It was even funny when someone asked me if I was MM because of that. It's pretty effective, but can do nothing against Taunt or smart play (first: don't hit it too effectively; second: kill it).
For Aero and Roserade I didn't see any; Mew/Azelf/Froslass maybe one of each. I've seen an honest number of SR Terrakion though (which I happily killed before safely spinning their hazards and saving Forre for later, or I just ended up with a half-health Forre and no hazards up on either side).

Jukain · Feb 20, 2013

okay I got reqs a little while ago, and I must say, there were a ton of sash rak on the suspect ladder. there was quite a bit of azelf and custap forry / skarm. there were a few crustle and mew also. people were desperately trying to replace deoxys-d, and they're all really effective. I've found them to be as effective as deoxys-d if not more, as many of them are really fast and thus can prevent pretty much everything from getting up hazards. also, kyurem-b and latias are insanely good in this metagame. latias just checks a ton of important stuff, while kyurem-b (I ran mixed with fusion bolt / earth power / ice beam / roost with max satk / some hp / some spd) can hit so many threats hard.

the suspect ladder itself was really annoying; I lost a bunch of battles right off the bat, so I ended up playing like 150+ games. all I have to say is that glicko2 is fucked up, people had reqs with worse records than I had at various points.

anyway I'm going to vote ou, deoxys-d is nowhere near broken.

EDIT: also seriously forfeiting games to lower dev? it's fine if you play the game out and think you have no chance of winning, not wanting to waste time, but forfeiting right away is dumb.

Windsong · Feb 20, 2013

I'd just like to say that I'm really disappointed with this ladder round and I feel that whatever the result, be it ban or unban for deo-d, it should be pretty much completely discounted. The system is great when we're letting players vote who have solid win ratios and rankings and clearly know the tier, but when players with 2:3 (and worse) win:loss ratios are qualifying it just completely makes the whole system kind of worthless. This was probably some error with the ladder of course, but if in the future making reqs could be based off w:l rate in addition to score somehow I'd have a lot more faith in the voting system.

Rhys DeAnno · Feb 20, 2013

Multiple OU Ladders and Their Effects on Glicko2 and the Suspect Process

Windsong raises a good point, and I've actually been thinking about the whole thing lately and I'll post up my thoughts here.

It's likely that everyone who's been laddering in the two most recent OU suspect tests has noticed the stark differences between the environments and how Glicko2 has responded to them. In the Torn-T test, a relatively consistent performance overall was required to succeed, usually with a win percentage of at least 80 or so over about 70 games. In the Deo-D test, it required a win percentage of at least 60 or so over about 90 games. The reason behind this large difference was that the standard OU ladder was closed for the Torn-T test and open for the Deo-D test.

The open OU ladder meant that most people who were not looking to qualify did not play on the suspect ladder. This obviously included most of the poor and average battlers who make up the bulk of OU's population. This had a number of obvious and subtle consequences on the suspect ladder:

The suspect ladder had far fewer people battling on it, which resulted in wait times for battles. These wait times were often not too bad, but could be as long as a couple of minutes in off-hours.
The much smaller population of the suspect ladder meant that there was no large "heat sink" of Glicko2 to keep a normalized distribution. A large number of alts get abandoned by suspecters after a few early losses, and this causes their "negative points" to be trapped away and isolated from the system. If the population is large and few people do this it does not have noticeable effects, but if a large share of the population does this it results in inflated ladder ratings for the entire system.
This inflation was not time-independent: it grew as the test went on. As more and more accounts were created and abandoned, the average Glicko2 rating of a ladderer became very high, probably well over 2000 by the last days of the test.
These higher ratings resulted in more spread out ratings in general, making deviation fall more slowly and increasing the number of battles needed to reach +/-55.
Due to this large amount of required battles and the generally inflated ratings of the ladder, many suspect testers finished with a string of forfeits to get qualifications more quickly. Unfortunately this probably poisoned and inflated the ladder even further.

While I lack the statistical and programming acumen to easily run simulations to investigate the above effects, I think the general outline they paint is clear enough. While the above situation might or might not be bad, it was certainly much different from the situation with the Torn-T test, and I'm concerned about the validity of our testing process if we test for bans under such different conditions. I think we should probably standardize whether we turn Standard OU on or off during OU suspect testing so we have more predictable tests, and so we can adjust ladder reqs accordingly for either environment. My main concern is a rating of 2000 +/- 55 probably does not indicate the same thing during this test as it did during the last test, and drawing our suspect voters from different populations during different tests is going to corrupt our suspect process.
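
A rough way to see the "trapped negative points" effect is a toy simulation. The sketch below is only an illustration under loud assumptions: it uses plain fixed-K Elo instead of Glicko2, gives every player equal skill (each game is a coin flip), and uses made-up values for `K`, `START`, and the population size. The abandonment rule (drop the account as soon as it falls below the starting rating and come back on a fresh alt) is a simplification of what suspecters actually did.

```python
import random

K = 32          # hypothetical fixed Elo K-factor (not the ladder's real setting)
START = 1500.0  # hypothetical starting rating for every fresh account


def expected(r_a, r_b):
    """Elo expected score of player A against player B."""
    return 1.0 / (1.0 + 10 ** ((r_b - r_a) / 400.0))


def simulate(abandon_after_loss, games=5000, players=50, seed=0):
    """Return the mean rating of the active accounts after `games` games."""
    rng = random.Random(seed)
    ratings = [START] * players
    for _ in range(games):
        a, b = rng.sample(range(players), 2)
        a_wins = rng.random() < 0.5  # assumption: everyone has equal skill
        delta = K * ((1.0 if a_wins else 0.0) - expected(ratings[a], ratings[b]))
        ratings[a] += delta
        ratings[b] -= delta
        if abandon_after_loss:
            loser = b if a_wins else a
            if ratings[loser] < START:
                # The loser abandons the account and returns on a fresh alt.
                # The points it lost stay with the winner, so they never
                # flow back into the active population.
                ratings[loser] = START
    return sum(ratings) / players


closed_mean = simulate(abandon_after_loss=False)
inflated_mean = simulate(abandon_after_loss=True)
print(f"no abandoned alts: {closed_mean:.1f}")
print(f"abandoned alts:    {inflated_mean:.1f}")
```

Without abandonment the rating pool is zero-sum, so the mean stays at the starting value; with abandonment every discarded alt leaves its lost points in circulation and the mean of the active accounts drifts upward.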

Deluks917 · Feb 20, 2013

The problem seemed to be that most people realized the best plan was to quit if you lost any of your first, say, 8 games. When basically everyone does this the avg rating will rise throughout the test, as in fact dramatically happened at the end.

It would have been much better imo to have just used regular ladder ranking for this. For one the ratings were reliable. And for another you actually would have gotten to play with Deo-D.

I mean by the end people were recommending forfeiting as a strat (I recommended playing random crap teams but that is far from a forfeit strategy).

Raising the reqs for this seriously favors people who played at the end of the test. During the beginning of the test ratings were fairly normal. Maybe it's best to just throw this test out, though people spent a lot of time trying to get deviations low.

edit: I meant I don't see a fair way to handle this test. People thought the requirements were one thing and acted accordingly.

Curtains · Feb 20, 2013

One solution is to just have the top 30 battlers with a reasonable rating minimum. Also eliminate the extra ladder so it won't take an hour to do 4 battles. Besides I can rarely see an instance where not having the suspect on the ladder creates interesting and informed discussion. This is especially true with a pokemon whose most suspecting quality is the ability to use it easily.

Cyrrona · Feb 20, 2013

The ratings obviously got really screwed up towards the end, but I think trashing the entire round would be a huge mistake. Qualifying was a significant time investment for most people, and it'd be massively discouraging and wasteful to toss all of that work out. If people are concerned about this crop of voters' metagame credentials, the council could consider requiring short justification paragraphs from those teetering on the W/L fence (which shouldn't take too long to review if we limit the requirement to those suggested). I'd much prefer any of these sorts of compromises to the drastic alternatives referenced above. The last portion of this test was far from ideal, and I'm sure we'll be taking steps to prevent similar problems in the future...but the current Round 10 is definitely salvageable.

EDIT:

Rhys DeAnno said:
I think requiring any sort of paragraphs would be a huge mistake. In the end evaluating paragraphs is always going to be a judgement call, so if we're basing quals on paragraphs we might as well just have the council making fiat decisions. I think what's done is done for this test, and we should lie in the bed we made and focus on improving results of subsequent tests.

The suspect process is filled with elements of subjectivity as it stands... I'm not advocating regular paragraph requirements, but I think it'd be a decent one-time option in this situation for verifying some of the users and clearing doubts about the vote's legitimacy. Although I understand others might be more skeptical, I trust the council would review the submissions as impartially as possible and determine (to the best of their ability) whether someone's actually an idiot or a reasonably knowledgeable player. I get the anxiety about unintentional bias, even if I think we could probably minimize it to acceptable levels in this case...we could, though, potentially safeguard against that sort of thing by adding another layer of transparency and posting the submissions publicly. Whether we use this idea, work out another solution, or leave what we've collected untouched, I'm just primarily focused on preserving what we've accomplished so far.

EDIT EDIT: Agreeing with the anti-W/L-ratio sentiments below.

Rhys DeAnno · Feb 20, 2013

Cyrrona said:
If people are concerned about this crop of voters' metagame credentials, the council could consider requiring short justification paragraphs from those teetering on the W/L fence (which shouldn't take too long to review if we limit the requirement to those suggested).

I think requiring any sort of paragraphs would be a huge mistake. In the end evaluating paragraphs is always going to be a judgement call, so if we're basing quals on paragraphs we might as well just have the council making fiat decisions. I think what's done is done for this test, and we should lie in the bed we made and focus on improving results of subsequent tests.

EDIT: More thoughts

Another big mistake is to trust W/L percentages as some kind of gospel just because Glicko2 was goofy this test. Someone could easily have faced much more difficult competition and have a justly high Glicko2 rating for a meh win percentage, or have faced lots of weak opponents and have a great win percentage compared to their Glicko2. Additionally, lots of people operated under the assumption that 2000 +/- 55 was sufficient and ended with a string of forfeits, which would obviously impact their win percentage in a very negative fashion. Really, W/L percentage is an even shittier metric of measurement than corrupted Glicko2 is, since at least Glicko2 is attempting to be intelligent and W/L percentage doesn't even try.

Response to above Edit:

but I think it'd be a decent one-time option in this situation for verifying some of the users and clearing doubts about the vote's legitimacy

It would actually make me completely doubt the validity of the process, since there has been tons of bad blood about Deoxys-D and a vast disconnect between different elements of the playerbase concerning if it should have been tested at all compared to other things like Drizzle. It'd be easy even for somebody trying to remain impartial to have their judgement of the paragraphs twisted by their position, especially in the borderline competent cases. If the reviewer is even slightly less harsh on one side or the other that could completely skew the vote.

Reymedy · Feb 21, 2013

Windsong said:
I'd just like to say that I'm really disappointed with this ladder round and I feel that whatever the result, be it ban or unban for deo-d, it should be pretty much completely discounted. The system is great when we're letting players vote who have solid win ratios and rankings and clearly know the tier, but when players with 2:3 (and worse) win:loss ratios are qualifying it just completely makes the whole system kind of worthless. This was probably some error with the ladder of course, but if in the future making reqs could be based off w:l rate in addition to score somehow I'd have a lot more faith in the voting system.

I don't understand it.

People who have a shitty W/L ratio in the end met the reqs for a simple reason : they had an almost perfect W/L ratio at the start.
So it's too easy to trash them and say that they suck in the metagame for this reason.

Want an example ? I have a terribad W/L ratio this suspect and you know what ? I don't care the slightest and I think I know this metagame enough to vote. I had so much glicko² at one point and the deviation was lowering so slowly that I did not give a single duck to my games. It was because I was flawless in the start that I could afford that.
I think it's unfair to look at my W/L in the end and say "okay he has a bad opinion on this metagame, he's probably bad". I played 110 games seriously, and my W/L is horrible because I had 2.5K glicko² at 90 deviation and I was like "Okay screw this, I don't care anymore, let's play gimmicks or whatever because anyway I just need to lower my deviation now".

So yes, too easy to stare at us from your "good ratio". One could say that your ratio was good because you faced bad players, and could argue that his shitty W/L is due to the level of the players he played against himself.
Improve the ladder ? Force people to not run dozen alts, is the only possible solution. But this will never be done.
No "improve W/L, glicko² etc.." nonsense please, else I'm just gonna make another alt, with a perfect W/L ratio and everybody will do the same. In the end we'll play between people at 3K Acre or just wait the perfect win succession.
And W/L means nothing, imagine that on the ladder, there are only 10 of the best OU BW players on smogon right now. They all "deserve" to vote, but what will be their W/L ratio ?
You see what I mean I guess, W/L depends on who you meet. And in every game, there is a loser, does that mean this loser wasn't good ? No, you can't say that because you don't know his level just by watching his W/L and I don't even understand how you came to that shortcut.

Last suspect I had a good W/L ratio, one of the best iirc I did cry about the number of games I had to play, I did not trash the other people doing the suspect because they had a less good ratio because I have no clue who they had to play against.

TL;DR : Don't dare take away the vote from me after 110 games or I go mad.

EDIT : I thought it was a ratio of 2 OR 3 W for a L that you were talking about.. oh well it's true that 2 wins for three losses is beyond my expectations of what is a low ratio.. whatever x)

Sacaen · Feb 21, 2013

One issue you're forgetting about having a lot of people string forfeits at the end of their run to get requirements is that it boosts other people up flawlessly. (and I'm not saying you or anyone in particular did this, Reymedy)

Because the ladder was so empty, it really was not hard at all to simply view who was currently doing matches on the ladder, and if they were chain forfeiting you could queue up with them and get a chain of essentially guaranteed wins. I didn't have the time (or care, I'm pretty neutral on Deo-D's situation) to actually get reqs myself, but I know for a fact how manipulable it was, as I was able to do this just by watching some suspect matches, easily figuring out that they were chain forfeiting every time they got a match, and then queueing for a suspect match and getting a free win. It should be put into question how many people who got requirements got there with the (even unintended) help of other people being on at the same time forfeiting to lower their deviation. 'Broken' describes it quite nicely.

The situation above really should not be allowed to exist. Ladderers shouldn't be put into a position where chain forfeiting is what they need to do to get reqs in a timely fashion, as it can corrupt others' ratings.

Iconic · Feb 21, 2013

fyi people with solid win ratios but whose deviations can't be lowered enough due to inflated ratings are almost always accepted as special applications if they have played enough games, and nearly all people who qualify with sub 50% win ratios are due to throwing games because they're too lazy to lower deviation legitimately, so i'm not sure where this 'disappointment' is coming from lol

aldaron and i were talking a couple of days ago about implementing a minimum 50% win percentage to discourage people from throwing games to lower deviation, because as heist pointed out there exists this culture of losing on the ladder for the sake of meeting reqs. i think jabba actually brought this up last round but it completely slipped my mind until a few days ago. ratings have the tendency to get really inflated on the suspect ladder towards the end of the test, making it hard to lower deviation, but as i said before that's why we have special apps. glicko is not perfect but it's certainly not as bad as some of you think (i'd like to see you guys devise an algorithm for measuring skill in such a luck-based game!!). these details will certainly be hammered out before the end of the next test

Jayde · Feb 21, 2013

What I don't like about the ladder system is how easy it is to hit reqs. You honestly just have to get lucky with ratings for the first ~10 matches, and after that you can make reqs by forfeiting over half of your remaining matches. With this ladder, almost anyone can get reqs if they put some time in. Excessive forfeiting also gives ladderers many undeserved points. During my forfeit spree, I gave at least 10 wins to at least 2 people, one of whom ended up barely hitting reqs. I understand that this issue is the council's to deal with, but I doubt that this is what they intend.

I'd honestly set a win ratio for reqs, maybe something around 2:1 or 5:3 (with a set number of minimum battles). This would put a limit on the number of forfeits as well as the number of undeserved voters. I know that the difficulty of reqs is the council's to decide, but this is just my take on the matter, and I doubt that they want reqs to be this much of a joke either.

dice · Feb 21, 2013

i still had over a 50% win ratio and i threw ~20 matches

the problem with the current ps system is the inflation toward the end of the round if we were going to move to a higher glicko2 score to get reqs. if you ladder at the beginning of the round, you're going to be facing lower kiddies on the ladder and getting a higher glicko2 is much harder than at the end of the round. when we laddered on PO the ladder score was much more stable than on PS, since it didn't inflate nearly as badly because people could set a ranking variation. additionally, i understand deviation is so you play a certain amount of battles, but seriously it's just a hassle from 65 --> 55. if we were going to try and improve the ladder score system, i'd say 21(50?) glicko2 and 65 dev.

Deluks917 · Feb 21, 2013

Honestly, is there any evidence GLICKO is more stable than regular ELO with a reasonable K value?

I also support 2150+-65 GLICKO2 if we are going to stick with GLICKO2.

Though again this heavily favors people who play at the end of the suspect test.

This problem was much less severe during the garchomp and Tornadus Tests. Maybe our userbase for this test was too small for the system to handle.

I think ELO with constant (potentially fairly large) K value should be considered. Theoretically this reduces the rate at which you attain your true skill. But it has the advantage that your first and last matches affect your rating fairly equally.

To be exact, if K is 25 and I go on a new account and beat EO, I would get almost 25 points. However, if later, after 70 battles, I battle EO again and we have similar ratings, I would gain/lose 12.5 points. The actual number of games I have played does not affect anything. The only downside to this is that GLICKO cushions the rating loss when an established account loses to a new random person (they could be Heist or whatever). But in ELO with fixed K you can lose at most K points anyway. And it's not like losing to new players doesn't hurt in GLICKO. This is I suppose not exactly equal, but it is nothing like GLICKO. I remember beating some high ranked guy (Volta?) on my 2nd game on one account and I instantly shot to like 2700 GLICKO2. This sort of thing does not happen with ELO.
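
For anyone who wants to check the K = 25 numbers, here is a minimal sketch of the standard fixed-K Elo update. The specific ratings (a fresh account at 1000 and a top player at 2000) are hypothetical values chosen only to illustrate the two cases; the function names are made up for this example.

```python
def elo_expected(r_self, r_opp):
    """Standard Elo expected score for the player rated r_self."""
    return 1.0 / (1.0 + 10 ** ((r_opp - r_self) / 400.0))


def elo_gain_for_win(r_self, r_opp, k=25):
    """Points gained for a win under fixed-K Elo; never exceeds k."""
    return k * (1.0 - elo_expected(r_self, r_opp))


# Hypothetical ratings: a fresh account at 1000 beating a top player at 2000.
new_vs_top = elo_gain_for_win(1000, 2000)
# Same opponent later, once both ratings are similar.
even_match = elo_gain_for_win(2000, 2000)

print(f"fresh account beats top player: +{new_vs_top:.2f}")
print(f"evenly rated win:               +{even_match:.2f}")
```

The gain depends only on the two current ratings and K, not on how many games either account has played, which is the property being argued for here.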

Another fairly ridiculous fact in GLICKO is that if you can manage to play high rated players in your first battles due to the server being depopulated, you are way better off. Say you have a real skill of 2200. If you play people with real skill 1900, 1950, 2000, 1800, 2100 in your first 5 battles, you are dramatically better off than if you play people with 1500, 1600, 1700, 1800, 1900 (assuming equal deviation).
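
The early-opponent effect can be sketched with the original Glicko (Glicko-1) update, which is simpler than Glicko2 and has no volatility term, so treat this as an approximation of what the ladder does, not its actual code. Assumptions: a fresh account starts at 1500 with deviation 350, every opponent has a deviation of 60, each game is its own rating period, and the new player wins all five games.

```python
import math

Q = math.log(10) / 400.0


def g(rd):
    """Glicko-1 factor that discounts an opponent's uncertain rating."""
    return 1.0 / math.sqrt(1.0 + 3.0 * Q * Q * rd * rd / math.pi ** 2)


def glicko1_update(r, rd, opp_r, opp_rd, score):
    """One-game Glicko-1 update; returns the new (rating, deviation)."""
    g_opp = g(opp_rd)
    e = 1.0 / (1.0 + 10 ** (-g_opp * (r - opp_r) / 400.0))
    inv_d2 = Q * Q * g_opp * g_opp * e * (1.0 - e)
    denom = 1.0 / (rd * rd) + inv_d2
    new_r = r + (Q / denom) * g_opp * (score - e)
    new_rd = math.sqrt(1.0 / denom)
    return new_r, new_rd


def rating_after_wins(opponents, r=1500.0, rd=350.0, opp_rd=60.0):
    """Rating after beating each opponent in order (assumed start values)."""
    for opp_r in opponents:
        r, rd = glicko1_update(r, rd, opp_r, opp_rd, 1.0)
    return r


strong = rating_after_wins([1900, 1950, 2000, 1800, 2100])
weak = rating_after_wins([1500, 1600, 1700, 1800, 1900])
print(f"after five wins vs strong opponents: {strong:.0f}")
print(f"after five wins vs weak opponents:   {weak:.0f}")
```

Under these assumptions the account that met the stronger opponents ends its first five games with a clearly higher rating, because the large starting deviation makes each early result move the rating a long way.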

Rhys DeAnno · Feb 21, 2013

Jayde said:
I'd honestly set a win ratio for reqs, maybe something around 2:1 or 5:3 (with a set number of minimum battles). This would put a limit on the number of forfeits as well as the number of undeserved voters. I know that the difficulty of reqs is the council's to decide, but this is just my take on the matter, and I doubt that they want reqs to be this much of a joke either.

The problem with this is you wait until a bunch of idiots are on the ladder and play then to get a high win ratio. We're using Glicko2 for a reason: it's because using W/L record is completely stupid. Remedy makes an excellent point that if you happen to be laddering when all the usual suspects for OU are laddering some people are going to have a less than 50% win rate, but Glicko2 is supposed to account for this by judging the degree of difficulty.

Maybe the system could judge when ladder accounts have been abandoned and correct for it with some kind of negative bonus pool? It's the process of trashing accounts that is confusing Glicko2, so the fix is going to be something to do with recognizing the trashed accounts and compensating. Also during last round the situation seemed less severe, probably because regular OU was shut down and we had all the non-serious players behaving normally and grounding the system. I think the first precaution we should take if we're interested in fixing things is not to run two OU ladders at once anymore, to take advantage of the heat sink our casual players provide Glicko2.

MMII · Feb 21, 2013

Except that the accounts are trashed because your first 15ish battles are the key ones, and if hax screws you over on any of them you are likely to have a harder time reaching a higher peak, as the first battles count far more than the last ones. I don't think we should further punish the ladderers who are already annoyed about having to start completely over from scratch and create a new alt name. I'm liking the sound of this ELO system, as it's this massive difference between the early and late games that is pushing people to create new alts and go on forfeit sprees at the end in the first place.

Woodchuck · Feb 21, 2013

Really, suggesting ELO? ELO was essentially the rating system we used on PO, and it was garbage. You could easily get haxed out of reqs because you would lose such tremendous amounts of points to poor luck against lower ranked players, wasting the efforts you had made for hours before. You'd spend many battles with +1 -30 differential, and losing just one to hax sets you back 30 battles. ELO is alright for chess, but in a game with such a luck aspect as Pokemon, it is grossly inadequate for the job. There are other ways we could set reqs, but going to ELO is a step backwards.

I also have a problem with the win ratios idea. With an ideal rating system that perfectly assigned each person their skill level, everyone would have near 50% win ratios. The point of the rating system is to match you up against players of roughly equal skill, so in evenly matched games, you should be winning roughly half the time. The only people with disparate win ratios should be those at the very very top and the very very bottom -- but this clearly isn't happening. Either way, the fact that we have a rating system placing people on the ladder at all makes win ratios a horribly inaccurate way to evaluate player level.

Win ratios were never designed to be an adequate measure of skill in a laddering situation; attempting to use them for any meaningful purpose is a bad idea.

Cyrrona · Feb 21, 2013

For what it's worth, I think the simplest/least disruptive short-term solution is something like this:

1) Revert to our one-ladder system instead of splitting OU and Suspect
2) Set the qualifying benchmark at 2000 +/- 65

Like others have noted, the larger playerbase should be able to offset any ripples that certain laddering practices might cause. On that note, I'd expect the number of "strategic forfeits" to fall significantly with this modest (but much more manageable) deviation change. I don't see any real need to raise the actual rating requirement if this shift can mitigate the inflation problems...we've successfully run a handful of single-ladder tests with thresholds of 2000 in the past, and I personally don't think raising the minimum dev by 10 points would have any noticeable impact on voter quality.

Deluks917 · Feb 21, 2013

One ladder would basically eliminate most of the problems.

Though I would still advocate going to 2050 or 2100. But seems like there have been concerns when not enough people voted.

Jayde · Feb 21, 2013

Rhys DeAnno said:
The problem with this is you wait until a bunch of idiots are on the ladder and play then to get a high win ratio. We're using Glicko2 for a reason: it's because using W/L record is completely stupid. Remedy makes an excellent point that if you happen to be laddering when all the usual suspects for OU are laddering some people are going to have a less than 50% win rate, but Glicko2 is supposed to account for this by judging the degree of difficulty.

I don't get what you mean by "wait until a bunch of idiots are on the ladder". It's not like the skill level of the ladder really fluctuates throughout the round.

Also, I don't think you understood me correctly. I'm not saying that we should scratch Glicko2, I'm just saying that we should incorporate a win ratio as well. Using a rating system solely based on W/L wouldn't be ideal, but clearly, neither is the Glicko2 system when people with awful win ratios or 40+ forfeits are hitting reqs.

Rhys DeAnno · Feb 21, 2013

Jayde said:
I don't get what you mean by "wait until a bunch of idiots are on the ladder". It's not like the skill level of the ladder really fluctuates throughout the round.

The skill level of the ladder fluctuates massively depending on the time of day you're playing. I have seen so many ridiculous gimmicks from silly players in late/early morning EST on the OU ladder that I'm probably permanently traumatized. Win ratio is completely useless as a statistic in a metagame with such disparate skill levels, you might as well base qualifications entirely on chance.

Lavos · Feb 21, 2013

i played somewhere like 90+ games with a glicko2 of above 3000 the whole time, and my deviation is still in the high 170s because of how broken the ladder system is on showdown. basically, if you have a really high rank, your deviation goes nowhere. it took like 10 battles to lower it from 190 to 189.

i don't know the solution, but i know the problem
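
One possible explanation for the stuck deviation, sketched with the Glicko-1 deviation update (Glicko2 adds a volatility term, so this is only an approximation): a game whose outcome is almost certain carries very little information, so it barely shrinks the deviation. The ratings below (3000 vs. 2000, opponent deviation 60) are assumed values for illustration, not data from the ladder.

```python
import math

Q = math.log(10) / 400.0


def g(rd):
    """Glicko-1 factor that discounts an opponent's uncertain rating."""
    return 1.0 / math.sqrt(1.0 + 3.0 * Q * Q * rd * rd / math.pi ** 2)


def rd_after_game(r, rd, opp_r, opp_rd=60.0):
    """Deviation after one game; in Glicko-1 it does not depend on the result."""
    g_opp = g(opp_rd)
    e = 1.0 / (1.0 + 10 ** (-g_opp * (r - opp_r) / 400.0))
    # e * (1 - e) is close to 0 when the expected result is near-certain.
    inv_d2 = Q * Q * g_opp * g_opp * e * (1.0 - e)
    return math.sqrt(1.0 / (1.0 / (rd * rd) + inv_d2))


even = rd_after_game(3000.0, 190.0, 3000.0)      # opponent at the same rating
mismatch = rd_after_game(3000.0, 190.0, 2000.0)  # opponent 1000 points lower
print(f"deviation after an even game:    {even:.1f}")
print(f"deviation after a mismatch game: {mismatch:.1f}")
```

If nearly every available opponent is rated far below you, each game looks like the mismatch case, and the deviation falls by a fraction of a point per battle.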

Hugin · Feb 22, 2013

If people are forfeiting games to lower their deviation, then that's clearly a problem. And, let's be clear, not a problem due to player action. Winning should always be the best strategic move; if it isn't, then the game is broken, and the players are just playing it as best they can.

Creating rules to punish forfeits or require a certain w/l ratio is just a bandage, the problem is how deviation is calculated. If Glicko2 would ignore battles after you have a certain number of additional battles, that would probably fix some of the issues. Players discard accounts because there's no way to recover from a single loss to someone at 1700, and the discarded accounts cause the inflation. Give a way to recover from early hax, and situations like Lavos' don't happen. Stop the inflation, and getting to 2200 actually means something.
