Ubers suspect testing aftermath

dice · Nov 19, 2014

utl -

8-BIT Luster · Nov 19, 2014

I feel like, while the suspect test was a bad idea in general and there was bias in the acceptance of votes, Fireburn/MM2 are not entirely to blame.
Fireburn is a good guy, and although he made a mistake, he owned up to it. He is by no means corrupt, nor did he intend to suppress the will of the people or whatever.
Melee read a lot of the votes on the original Gengarite Suspect, and that didn't get banned. He, too, clearly was trying to be as just as possible. (I know that the Shadow Tag Suspect is far more controversial, but it still shows the restraint MM2/Fireburn can demonstrate.)
EDIT: The whole test was a bad idea, as it just led to divisive "partisanship." With the abolishment of suspect testing, I'm assuming that there will be another way to determine whether or not something should be banned. The ubers tiering council will probably hold an executive vote to determine whether something will be banned.
I'm glad that Fireburn owned up to his mistake, and it takes great moral courage to do so.

WebBowser also brought up an interesting suggestion in the posting of high-level replays showing a team taking advantage of the ability in question. However, there's always the possibility of low damage rolls or crucial misses that might make it hard to find a suitable replay. It's a good suggestion, however, and is worthy of looking into.

The whole debacle was one of human nature, and try as we may, we can never escape that.
Jibaku/Melee/Fireburn are not little gremlins that are out to change your vote and throw the world into chaos or whatever. They're just people, and in the case of the latter two, people whose bias may have factored into the rejection of certain paragraphs.

In conclusion, Team Fireburn lgi

AM · Nov 19, 2014

WebBowser's idea is nice on paper, but replays wouldn't actually do much: someone could play 100 terrible matches, get one lucky game to showcase their ability, and it would still be subjective. I was discussing this with some people, including brokenwings, and he brought up a good point: if and when suspect tests ever take place in Ubers again, you would be better off setting the criterion at a certain GXE after a certain number of battles or more. The majority of the votes fell under the 80 GXE range anyway, and the ladder isn't the greatest tool for judging one's knowledge and understanding of a metagame. However, if you want something objective, GXE in correlation with battles played can work, because it shows who is a consistent enough player to have a general understanding and who isn't. Obviously you would set it to whatever seems appropriate, but if something similar ever takes place again, that's something to consider.

HeIIraiser · Nov 19, 2014

keep calm and continue using the ubers formula:
Xerneas / Arceus / Mega Gengar / Palkia / Klefki / SR user
-outl

asterat · Nov 19, 2014

My first idea here is to somehow edit the COIL system. I mean, maybe we can isolate some factors of high-level play that there is no substitute for. For example, what if games were ignored until the player reached a GXE of maybe 75, and then the next 20 games were counted and the player had to win 90% or something? These numbers are pulled out of the sky based on personal experience. Or maybe we could simply base suspect tests on GXE or some altered form of it. Setting a COIL requirement enforces a minimum GXE, and with ubers it's really low; I think it's about 60. This means that anybody who dicks around on the ladder for long enough can get reqs. Making the COIL requirement high forces players with good GXEs to spend as much time as low-GXE players spend under a low one. So maybe a player gets reqs once their GXE reaches a certain number. Perhaps COIL could also be changed so that after game x it no longer changes. This enforces a minimum GXE without forcing players to play a ridiculous number of games. There are lots of ways to edit it so it serves us better; these are just some ideas.
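
For anyone who wants to poke at the "gate, then window" idea, here is a minimal Python sketch. It assumes each ladder game is recorded as a (GXE after the game, won?) pair; that record format, the function name, and the defaults (75 GXE gate, 20-game window, 90% win rate) are illustrative only, taken from the admittedly made-up numbers in the post rather than from any real ladder tooling.

```python
from typing import Sequence, Tuple

# Hypothetical record format: (GXE after the game, whether the game was won).
Game = Tuple[float, bool]


def meets_window_rule(
    games: Sequence[Game],
    gxe_gate: float = 75.0,
    window: int = 20,
    min_win_rate: float = 0.9,
) -> bool:
    """Ignore games until GXE first reaches the gate, then judge the next `window` games."""
    for i, (gxe, _won) in enumerate(games):
        if gxe >= gxe_gate:
            # Only the games played AFTER the gate was reached are counted.
            counted = games[i + 1 : i + 1 + window]
            if len(counted) < window:
                return False  # the player has not finished the window yet
            wins = sum(1 for _gxe, won in counted if won)
            return wins / window >= min_win_rate
    return False  # the gate was never reached


if __name__ == "__main__":
    warmup = [(70.0, True)] * 5 + [(76.0, True)]          # gate reached on game 6
    history = warmup + [(80.0, True)] * 18 + [(80.0, False)] * 2
    print(meets_window_rule(history))
```

One design question this leaves open is whether a player who fails the window gets to try again on the same account; the sketch only ever judges the first window after the gate.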

WebBowser · Nov 20, 2014

AM My suggestion was for a replay to be used in addition to the COIL/GXE/ELO/whatever-the-heck-Antar-comes-up-with-next requirements. I'm sorry if I was unclear earlier. Obviously if all it took was 1 good replay to get voting reqs, it would be easy to do just as you said, screw around until you just happened to get a good replay. However to get ladder reqs AND get a good replay? That's harder to fake.

That's not to say my idea is perfect. For starters, one could get a replay first, then swap to some random team that he's more comfortable with and proceed to wreck ladder from there, but I'm personally ok with that because the req still accomplished its duty. It forces the voter to actually experience the uncompetitive element firsthand. I feel that in order to understand any uncompetitive or broken threat, you need to actually use it in battle (that's not to say that the converse is true, but meh).

FlyingIsOP · Nov 20, 2014

If you guys thought Stag was bad. You guys got another thing coming.

Thugly Duckling · Nov 20, 2014

FlyingIsOP said:
If you guys thought Stag was bad. You guys got another thing coming.

Lesnarquaza fuck yeah

faint · Nov 20, 2014

Would just like to chime in and say that I believe all votes should be visible... Not sure if this was addressed in an earlier post. I really cba to read all these replies.

Also, as I said on IRC, I am deeply saddened that Melee Mewtwo is no longer a mod. I mentioned just last week that I considered him to be a great leader. I personally consider him to be the best staff member Ubers has had in a very long time (probably since D/P Jibaku). I have faith in Sweep, though, so I'm sure we will be fine.

chaos · Nov 20, 2014

FlyingIsOP said:
If you guys thought Stag was bad. You guys got another thing coming.

lol this is so true

faint · Nov 20, 2014

Also: like this post if you want evasion banned.

hausdog · Nov 21, 2014

FlyingIsOP said:
If you guys thought Stag was bad. You guys got another thing coming.

focus sash cloyster says hi

GlassGlaceon · Nov 21, 2014

hausdog said:
focus sash cloyster says hi

Please tell me you're joking

trikx_insane · Nov 21, 2014

hausdog said:
focus sash cloyster says hi

Hazards say Hi as well.

G-Von · Nov 21, 2014

I've read through this thread and the entire time I've been wondering "whose idea was it to appoint two openly biased mods with the same opinion to read the paragraphs?" Sure there was someone with no bias in one direction or the other, but just think about it when you watch the news and there's a discussion on a matter. Do you really think that one person's unbiased thoughts are going to have any sort of impact on the other two? Chances are the person who has no stance on this issue in either direction will get crowded out by the other two. I believe this was the case in the stag suspect testing review where MM2 and FireBurn were pro-ban and Jibaku was neutral on this matter. Why couldn't either one of MM2 or FireBurn have been replaced with someone like shrang, who is a trusted user and was on the anti-ban side of the argument (based on what I read in Jibaku's post)? This way there would have been someone who was pro-ban, anti-ban, and neutral on the matter of Shadow Tag, and it would have definitely yielded a more accurate depiction of the suspect test. I am completely dumbfounded that the situation was handled the way it was.

In the future, I really don't think raising the reqs will do much in weeding out the less skilled and informed players from the skilled and informed ones, but simply using common sense in choosing who to do the reviews of the votes would aid this issue tremendously. If I followed the suspect thread and heard who was doing the reviewing, I would've said fuck it they're gonna ban ST regardless of what I do.

faint · Nov 21, 2014

G-Von said:
I've read through this thread and the entire time I've been wondering "whose idea was it to appoint two openly biased mods with the same opinion to read the paragraphs?" Sure there was someone with no bias in one direction or the other, but just think about it when you watch the news and there's a discussion on a matter. Do you really think that one person's unbiased thoughts are going to have any sort of impact on the other two? Chances are the person who has no stance on this issue in either direction will get crowded out by the other two. I believe this was the case in the stag suspect testing review where MM2 and FireBurn were pro-ban and Jibaku was neutral on this matter. Why couldn't either one of MM2 or FireBurn have been replaced with someone like shrang, who is a trusted user and was on the anti-ban side of the argument (based on what I read in Jibaku's post)? This way there would have been someone who was pro-ban, anti-ban, and neutral on the matter of Shadow Tag, and it would have definitely yielded a more accurate depiction of the suspect test. I am completely dumbfounded that the situation was handled the way it was.

In the future, I really don't think raising the reqs will do much in weeding out the less skilled and informed players from the skilled and informed ones, but simply using common sense in choosing who to do the reviews of the votes would aid this issue tremendously. If I followed the suspect thread and heard who was doing the reviewing, I would've said fuck it they're gonna ban ST regardless of what I do.

MM2 wanted to avoid bias by asking anti-ban users to have a say in the paragraphs, but Hugen didn't believe that was necessary and trusted their judgement. There was going to be bias regardless of how the vote went, though. If MM2 did not believe a no-ban vote was qualified, I likely would have seen it otherwise had I read it, as I was also no-ban.

oookillemd-tan · Nov 21, 2014

chaos said:
This is also the reason that UU works differently than all of the other tiers..

I've never really looked into it. How does it work, may I ask?

Celticpride · Nov 21, 2014

oookillemd-tan said:
I've never really looked into it. How does it work, may I ask?

The tier leader bans anything he thinks could potentially be broken. Then they retest the banned mons one by one to see what's really broken. After a testing period, a council of around 10 or 12 votes on whether the suspect in question should be banned or not. Not everything gets sent back to BL; things do get to stay in UU (Mega Zam, Haxorus, Mega Doom and Hydreigon got to stay). The main plus is that it gets the really broken things out quickly. The downside is that you could potentially ban a whole group of mons that balance each other out and thus keep the tier balanced as a whole.

Link to tiering policy:
http://www.smogon.com/forums/threads/xy-uu-tiering-system-voting-records.3520708/

xJownage · Nov 22, 2014

In the future, if there is a paragraph requisite or something of the like, the less skilled need to be weeded out mechanically, not by a human. Any human has his own pet peeves or views on topics, so they are likely to read certain ideas and be tempted to throw them out simply because they disagree. The voting council could only have claimed to be objective if they were able to ignore all of their own opinions about what counts as flawed logic. They can't read something, disagree with it, and then say it is not a worthy argument. Even if that opinion holds no truth, it is the opinion of a player who was good enough to get the requirements supposedly necessary to vote. To suggest that they are not knowledgeable enough about the tier isn't just insulting to them, but completely defeats the purpose of making requirements in the first place.

That being said, I feel it is very condescending to throw out the votes of those who met a requirement because the people tallying the votes feel that their logic "isn't good enough" for them to have a say in the decision. The only way to filter is to increase the difficulty of the prerequisite. Using a human filter is just stupid. Anybody with enough knowledge of the tier to be called upon in making such a massive decision is going to have some sort of bias, and there is going to be some argument that they would automatically want to reject on sight because of their own disagreement.

The system of the previous suspect vote was this: spend hours of your time to get the prerequisite, and if you do we will consider your vote.

Honestly, I didn't even deem the suspect test worth my time from the beginning because I knew there was no way that the vote filtering would be objective, and I knew that there was a good chance that whatever the top ~15 most influential players wanted would happen.

The only way to have a filter in any way is something determined in battle or systematically. If it is based off of somebody's judgement, this problem will arise every time. The only ways to do this are implementing a battling system (you have to somehow "prove your worth" in the tier) or just increasing the reqs. Changing the reqs means making them higher and/or changing the way they are calculated, i.e. those who do like 300 battles can't just get in because they were okay. That is my #1 problem with the COIL system: if you do enough battles you will almost always end up gaining the reqs necessary, and you have to expend massive amounts of time to get high enough no matter how good you are. A new system would need to separate the high level players from the mediocre ones (using GXE with a certain number of battles needed is an idea). Another idea is to make it into a point system, such as tourney wins counting for a certain number of points dependent on the ladder values of the players in the tourney and the number of players, alongside win%, GXE, COIL, AND ELO on the ladder.

There are ways to make it work, but human filters were doomed from the beginning and will never work. Proof it will never work? Where has a system like this ever worked anywhere in society?

Suspect testing in ubers is not the problem (although I personally think that we cannot suspect anything that would ban a pokemon as a whole), suspect testing using humans as a filter for mediocre players was the problem.

WebBowser · Nov 22, 2014

xJownage I agree with the need for a more objective standard for suspect reqs. Out of curiosity, what do you think of my idea of requiring a high-ish level replay of the suspect in question in addition to normal reqs? I honestly think that the TLs had the right idea in that simply being good at laddering isn't enough to be able to determine whether or not something is uncompetitive, but they just need a more objective way to understand it. I asked myself this when I came up with the idea: "Do I want someone who has never used mega mawile or aegislash or baton pass making the decision on whether or not it belongs in the tier?" (those were suspects I personally participated in). I honestly believe that you simply cannot understand what a pokemon or move does unless you actually use it yourself, even if it's "obviously broken/uncompetitive" like OHKO moves, evasion, or Mega Salamence (for OU, mega dragon should be fine here obv). This goes double for aspects of the metagame whose potential power/uncompetitiveness is much more subtle, like Deo-S or Shadow Tag.

As for COIL, I highly recommend taking up a discussion with Antar, but allow me first to stage a brief defense of the rating system. COIL is a saturating function of your GXE and your number of battles N (this one to be exact: C=40*GXE*2^(-B/N)), where B is a constant set by the TLs. For a given GXE, COIL approaches a ceiling of 40 * GXE; after about B * 10 battles it sits at roughly 93% of that ceiling (e.g. for a GXE of 60 and a B of 40, the ceiling is 2400 and about 400 battles gets you to roughly 2240), and for all intents and purposes it stops rising there. Because the ceiling scales with GXE, a higher GXE will allow you to pass any given COIL target much, much faster.
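
To put rough numbers on the formula quoted above, here is a minimal Python sketch. It takes C = 40 * GXE * 2^(-B/N) at face value, with N as battles played and B as the tier constant. The function names and the sample values (B = 40, a target of 2400) are assumptions for illustration, not the settings of any actual suspect test.

```python
import math
from typing import Optional


def coil(gxe: float, battles: int, b: float) -> float:
    """COIL as quoted in the post: 40 * GXE * 2^(-B/N), with N = battles played."""
    if battles <= 0:
        return 0.0  # no games played, no COIL
    return 40.0 * gxe * 2.0 ** (-b / battles)


def battles_needed(gxe: float, b: float, target: float) -> Optional[int]:
    """Smallest N with coil(gxe, N, b) >= target, or None if the target is out of reach."""
    ceiling = 40.0 * gxe  # COIL approaches this value but never reaches it
    if target >= ceiling:
        return None
    # Solve target = ceiling * 2^(-B/N) for N, then round up to a whole battle.
    return math.ceil(b / math.log2(ceiling / target))


if __name__ == "__main__":
    print(round(coil(60, 400, 40)))      # GXE 60, B 40, 400 battles: roughly 2239
    for gxe in (60, 70, 80):
        print(gxe, battles_needed(gxe, 40, 2400))
```

Under these assumed numbers a GXE of 60 can never reach 2400 (that is exactly its ceiling), while higher GXEs get there in far fewer battles, which is the "higher GXE is much faster" point in numeric form.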

The main difficulty of the COIL system is that it requires TLs to essentially predict the "baseline" GXE for players they deem competent enough to vote. It is a well known fact that the longer a ladder is up between GXE resets, the more the average GXE/ELO will inflate as more players start playing the ladder. Most suspects last about 2 weeks, but this one lasted for far longer, making it reasonable that the uber TLs underestimated the COIL needed to distinguish "competent players". Compounding this is the fact that the ubers ladder is notorious for attracting bad players, which magnifies this issue even further because there is a fairly solid base of really awful ladderers to further inflate the average GXE. This issue can be adequately resolved by simply raising the COIL requirement and/or reducing the suspect test length.

Addressing your second issue is a bit harder though, as you're basically saying that it takes too long for skilled players to hit the GXE requirements. This would imply that the constant B was set too high by the TLs (lower B means fewer battles needed to asymptote). Lowering B would basically encourage players to not stay on any one alt too long and switch alts often to try and get the win streak needed to obtain reqs, while making it too high brings about the problem you mentioned. Obviously, the length of the suspect test also needs to be considered when determining this value.

So yeah, that's the COIL system in a nutshell, along with the cause of the issues experienced by players and the methods for fixing them (please please please yell at me if I got any of that horribly wrong). I don't think there's anything wrong with COIL itself, but it does require TLs to both have a good head for mathematics AND have a good idea of what the suspect ladder is going to look like. Setting all the right variables is pretty difficult, to the point where it's borderline arbitrary, but that's probably going to be an issue no matter what rating system you use.

Proof that COIL is perfectly capable of weeding out lesser players:

Antar's (probably better) explanation of COIL: http://www.smogon.com/forums/threads/coil-explained.3508013/

ApplepieFTW · Nov 23, 2014

WebBowser said:
xJownage I agree with the need for a more objective standard for suspect reqs. Out of curiosity, what do you think of my idea of requiring a high-ish level replay of the suspect in question in addition to normal reqs? I honestly think that the TLs had the right idea in that simply being good at laddering isn't enough to be able to determine whether or not something is uncompetitive, but they just need a more objective way to understand it. I asked myself this when I came up with the idea: "Do I want someone who has never used mega mawile or aegislash or baton pass making the decision on whether or not it belongs in the tier?" (those were suspects I personally participated in). I honestly believe that you simply cannot understand what a pokemon or move does unless you actually use it yourself, even if it's "obviously broken/uncompetitive" like OHKO moves, evasion, or Mega Salamence (for OU, mega dragon should be fine here obv). This goes double for aspects of the metagame whose potential power/uncompetitiveness is much more subtle, like Deo-S or Shadow Tag.

As for COIL, I highly recommend taking up a discussion with Antar, but allow me first to stage a brief defense of the rating system. COIL is a saturating function of your GXE and your number of battles N (this one to be exact: C=40*GXE*2^(-B/N)), where B is a constant set by the TLs. For a given GXE, COIL approaches a ceiling of 40 * GXE; after about B * 10 battles it sits at roughly 93% of that ceiling (e.g. for a GXE of 60 and a B of 40, the ceiling is 2400 and about 400 battles gets you to roughly 2240), and for all intents and purposes it stops rising there. Because the ceiling scales with GXE, a higher GXE will allow you to pass any given COIL target much, much faster.

The main difficulty of the COIL system is that it requires TLs to essentially predict the "baseline" GXE for players they deem competent enough to vote. It is a well known fact that the longer a ladder is up between GXE resets, the more the average GXE/ELO will inflate as more players start playing the ladder. Most suspects last about 2 weeks, but this one lasted for far longer, making it reasonable that the uber TLs underestimated the COIL needed to distinguish "competent players". Compounding this is the fact that the ubers ladder is notorious for attracting bad players, which magnifies this issue even further because there is a fairly solid base of really awful ladderers to further inflate the average GXE. This issue can be adequately resolved by simply raising the COIL requirement and/or reducing the suspect test length.

Addressing your second issue is a bit harder though, as you're basically saying that it takes too long for skilled players to hit the GXE requirements. This would imply that the constant B was set too high by the TLs (lower B means fewer battles needed to asymptote). Lowering B would basically encourage players to not stay on any one alt too long and switch alts often to try and get the win streak needed to obtain reqs, while making it too high brings about the problem you mentioned. Obviously, the length of the suspect test also needs to be considered when determining this value.

So yeah, that's the COIL system in a nutshell, along with the cause of the issues experienced by players and the methods for fixing them (please please please yell at me if I got any of that horribly wrong). I don't think there's anything wrong with COIL itself, but it does require TLs to both have a good head for mathematics AND have a good idea of what the suspect ladder is going to look like. Setting all the right variables is pretty difficult, to the point where it's borderline arbitrary, but that's probably going to be an issue no matter what rating system you use.

Proof that COIL is perfectly capable of weeding out lesser players:

Antar's (probably better) explanation of COIL: http://www.smogon.com/forums/threads/coil-explained.3508013/

ou's coil is waaay higher (like actually difficult for non-top players), and they have a ladder that's actually halfway serious. our ladder is "of lesser quality". raising coil won't improve anything. the only thing it will do is add an extra two hours to a very easy grind. the players that it's supposed to "weed out" can just steal a sample team or something and play at times no good players are on. it would just make it so they have to spend more time. paragraphs are probably the best thing to have happened all suspect.

WebBowser · Nov 23, 2014

ApplepieFTW So basically what you are saying is that any mediocre player can obtain any GXE he/she desires simply by stealing an RMT and laddering at a certain time of day? For any given COIL, there is a minimum GXE needed to obtain it, or else you simply will not reach it period.
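
The minimum-GXE point can be read straight off the formula quoted earlier in the thread, C = 40 * GXE * 2^(-B/N): the 2^(-B/N) factor is always below 1, so nobody reaches a COIL of C without a GXE above C / 40. A small Python sketch, assuming that formula; the function name and the example numbers (target 2400, B = 40, a 100-battle budget) are hypothetical.

```python
from typing import Optional


def min_gxe_for_coil(target: float, b: float, max_battles: Optional[int] = None) -> float:
    """Lowest GXE that can reach `target` COIL, optionally within a battle budget.

    With no budget this returns the floor target / 40, which a real player must
    strictly exceed because 2^(-B/N) never quite reaches 1.
    """
    if max_battles is None:
        return target / 40.0
    if max_battles <= 0:
        raise ValueError("max_battles must be positive")
    # Within N battles the best achievable COIL is 40 * GXE * 2^(-B/N).
    return target / (40.0 * 2.0 ** (-b / max_battles))


if __name__ == "__main__":
    print(min_gxe_for_coil(2400, 40))                  # unlimited grinding
    print(round(min_gxe_for_coil(2400, 40, 100), 2))   # only 100 battles available
```

The gap between those two numbers is the part of the requirement that extra laddering time can buy, which seems to be where the two sides of this argument actually differ.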

ApplepieFTW · Nov 23, 2014

WebBowser said:
ApplepieFTW So basically what you are saying is that any mediocre player can obtain any GXE he/she desires simply by stealing an RMT and laddering at a certain time of day? For any given COIL, there is a minimum GXE needed to obtain it, or else you simply will not reach it period.

im saying that they can easily spend an extra hour on this bad ladder. in our situation, coil not only does not equate to skill, it most certainly does not equate to knowledge of the suspect.

WebBowser · Nov 23, 2014

ApplepieFTW said:
im saying that they can easily spend an extra hour on this bad ladder. in our situation, coil not only does not equate to skill, it most certainly does not equate to knowledge of the suspect.

Hmmmm... There seem to be two separate issues here. One can be knowledgeable about a suspect without being skilled and vice versa. I agree that COIL alone has very little to do with someone's knowledge of the suspect, as one can easily make a competent ladder team that doesn't involve whatever is being suspected, and as long as said suspect isn't obscenely overused on the ladder (à la OU Genesect), one will probably do reasonably well even with zero knowledge of how the suspect works. This is why, a few posts back, I proposed requiring voters to submit a high-ish level replay showing the suspect in use along with their vote.

However, I will say this. If COIL is unable to adequately determine a player's skill on the ladder, then no rating system will be adequate without some serious adjustments to the ubers matchmaking system. I find this to be somewhat hard to believe, but I am no ubers expert, so meh.

xJownage · Nov 23, 2014

To begin with, that is why I said that the COIL system may not be the best idea. The fact that mediocre players can expend enough time and get that value is absurd to me. A streak is unnecessary to gain a ranking I would want to talk about; my main thing is judging the actual battler's skill in a way that doesn't go up just because they do more battles.

I never said that good players didn't have the time; what I implied is that bad players who have a ton of time can still get the reqs, which defeats the purpose of it in the first place. If a COIL ranking was set too high, some really good players who don't have hours and hours of time on their hands (or are unwilling to devote that much time) may not get to vote either. I was probably good enough to get reqs and I have very good knowledge of the tier, but I did not deem it worth the time and effort I would have to put forth because I didn't want to do like 80 battles or something (my gxe was around 80 after about 15-20 battles) to gain reqs. Meanwhile, some people who gained reqs had GXEs of less than 70, but because they expended so much time they still managed reqs. What I am trying to say with this hideous ramble is that the COIL system is inherently flawed in that it rewards players for having very, very high numbers of battles.

GXE with a minimum number of battles wouldn't work because of streaks, obviously, and the fact that people would repetitively work to attempt to get the streak. At least, this is what you say. But, in my opinion, as long as the minimum number of battles (say, 50) isn't too low and the GXE is high enough, it will still only reward good players. While the mediocre group may be able to get 80~ish GXE in about 20 battles, sustaining it is unlikely if they do not have the knowledge of the tier. Maybe higher level players have to reset using alts as well, but is that really a bad thing? It still will help us show the actual skill of the players rather than the amount of time they have (no mediocre players are sustaining 80 GXE for over 50 battles) on their hands.
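
As a rough way to compare the two kinds of requirement, here is a minimal Python sketch that checks the same players against a COIL target and against a flat "GXE floor plus minimum battles" rule. The COIL formula is the one quoted earlier in the thread; the thresholds (B = 40, target 2400, 80 GXE over at least 50 battles) and the two sample players are made-up illustrations loosely based on the numbers mentioned in this post, not real ladder data.

```python
def coil(gxe: float, battles: int, b: float) -> float:
    """COIL as quoted earlier in the thread: 40 * GXE * 2^(-B/N)."""
    if battles <= 0:
        return 0.0
    return 40.0 * gxe * 2.0 ** (-b / battles)


def passes_coil(gxe: float, battles: int, b: float = 40.0, target: float = 2400.0) -> bool:
    """COIL-style requirement: more battles can compensate for a lower GXE."""
    return coil(gxe, battles, b) >= target


def passes_gxe_floor(gxe: float, battles: int, min_gxe: float = 80.0, min_battles: int = 50) -> bool:
    """Flat requirement: GXE must be sustained over a minimum number of battles."""
    return battles >= min_battles and gxe >= min_gxe


if __name__ == "__main__":
    players = {
        "grinder (68 GXE, 400 battles)": (68.0, 400),
        "strong (80 GXE, 60 battles)": (80.0, 60),
    }
    for name, (gxe, battles) in players.items():
        print(name, passes_coil(gxe, battles), passes_gxe_floor(gxe, battles))
```

With these assumed thresholds the grinder clears COIL but not the flat rule, and the strong player does the reverse. It says nothing about how easy an 80 GXE streak is to fish for on alts, which is the counterargument raised above.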

The point is that whatever filters have to be done electronically or mechanically. Anything judged by humans will be flawed for obvious reasons, and having a "balanced committee" doesn't really fix this due to individual characteristics that will never mash up properly. If anybody has a view of the argument at all (whether or not neutral or favored to one side) they will have judgmental characteristics that will prevent them from making truly accurate assumptions about a player's actual knowledge.

Electronic filters were the whole point of using the COIL system, and to some extent, using suspect ladders in the first place. They created a numerical way to determine a player's skill. Unfortunately, there is no practical way to determine somebody's knowledge, hence we still have an issue now. The only actual way to do this is some sort of system where somebody makes a team or something and then people have to show its weaknesses to display knowledge of the tier, although this would be a very shallow system and thus probably not work. Some sort of point system like I suggested earlier is an idea, as well as potentially using matches against "proven" members of the ubers community.

Regardless, some system must be used that is not human judgement, otherwise we will be facing this same issue again sometime down the road.

I would be somewhat opposed to a "council" deciding for an entire tier; it is just not that simple. As MM2, Fireburn, Shrang, and several others proved, when higher level players decide on something they are very adamant. Not only that, but the community, even higher level players, has little to no influence over them. The problem with this is that it puts the entire tier in the hands of just a few people and really nobody else, due to that lack of influence. This would cause problems for obvious reasons, and would throw the entire tier into chaos because of the change (if this had been the case from the beginning people would accept it, but it isn't). The reality is that if the council decided to ban something tomorrow, even if they had good reason to classify it as uncompetitive, I and many others would be suitably outraged at the lack of communication and the stranglehold just a few select elites have over the tier.

While chaos has reiterated that ubers suspect tests are a bad idea, just having a few elites in control of the tier will lead eventually to bans, as the upper echelon of the community seems to feel otherwise. Jibaku also stated that the older suspect tests were geared towards "unbanning things" and ensuring "everything that is banned has good reason to stay there," but then said that somewhere the intentions were twisted into using those to justify a "classic suspect test." The only reason that a ban on pokemon was even considered was because of the previous suspect tests, which were unbans and not bans, and were not supposed to be any kind of justification for a standard suspect test. The point is, I don't believe ubers should be geared towards banning anything that does not create CPU generated 50/50s (i.e. evasion/rng related stuff), and this is the primary view of the community, but a small group of people are thinking about ignoring this and trying to ban things against the wishes of the community (i.e. what would happen if we had a council) and apparently have the power to do so without our approval. A council would only rip apart the small portion of the upper level ubers community from all the other players, skilled and noobish, and could ultimately result in the demise of the tier's popularity.
