[Proposal] A new way of testing suspects

guoguo · Dec 7, 2008

I feel that the current system for determination of tiers, and I’m sure others concur, is substandard. There are many with voting privilege who don’t necessarily have a firm grasp on our current metagame or Smogon’s philosophy. That said, I don't believe Bold Voting will do much better since it can only filter the votes in the current testing process through the arbitrary judgment of whoever chooses which votes count. It's for these reasons that I'm proposing a new way of testing suspects. The voting process will still be based on merit and I feel this process will take a lot of bias out of the voting process. The testing system will require work to implement, but it is necessary in order to filter out the bias.

Setup:
1. Everyone who wishes to partake in the process must register their accounts and decide if the suspect is OU or Uber before the testing commences. They will then be separated into two groups, OU and Uber, based on their decision.
2. People in the Uber group must use the suspect.
3. People in the OU group may not use the suspect.
4. Testing will commence for a period of time on a different, suspect ladder (a different server or an honor system clause may be necessary).
5. If a player wishes to switch their decision, they must use a new account with a blank rating. Obviously, you may only have one account per person.
6. We will use the statistics on the ladder at the end of the testing to determine the Suspect's tiering. It will be deemed Uber if there are significantly more top players in the Uber group than the OU group (more than a 2:1 ratio).
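
As a rough illustration of Step 6, here is a minimal Python sketch. The proposal only fixes the 2:1 margin; the cutoff for "top players" (`top_n`), the `(player, group, rating)` layout, and the function name `suspect_tier` are all assumptions made for the example.

```python
from typing import List, Tuple

# Hypothetical ladder entry: (player, group, rating), where group is "Uber" or "OU".
LadderEntry = Tuple[str, str, float]

def suspect_tier(ladder: List[LadderEntry], top_n: int = 20, ratio: float = 2.0) -> str:
    """Return "Uber" if the Uber group outnumbers the OU group by more than
    `ratio` to 1 among the `top_n` highest-rated accounts, otherwise "OU"."""
    # Take the top_n accounts by rating, highest first.
    top = sorted(ladder, key=lambda entry: entry[2], reverse=True)[:top_n]
    uber = sum(1 for _, group, _ in top if group == "Uber")
    ou = len(top) - uber
    # Strictly more than ratio:1, matching "more than a 2:1 ratio".
    return "Uber" if uber > ratio * ou else "OU"
```

With `top_n=4`, three Uber-group accounts and one OU-group account in the top four would give "Uber", while a 2:1 split exactly would not clear the margin.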

(Optional) Deviation requirements for voting to promote battling.

This testing method cements the implied definition of Uber that was unverifiable before: A suspect is considered Uber if its usage generates a decided advantage over an opponent who is not using the suspect, even when said opponent has full knowledge of the presence of the suspect on the opposing team and tries their best to inhibit the suspect from fulfilling its intended role. This definition should fit for all Ubers and previous suspects voted into Uber. The question "How much of an advantage makes it Uber" will be determined by the arbitrary margin set in Step 6.

Why it works:

-By doing well in the ladder, you are essentially "proving" your vote. If the suspect is indeed broken, then those in the Uber group will have a decided advantage, and therefore, reach the rank necessary to prove it Uber. This is true for the opposite case as well. Doing well in the ladder against the Uber suspect without employing it also reinforces your vote.
-The system forcibly creates a faux centralization and gauges the suspect’s performance in a certain environment.
-Unlike the current system, there are a limited number of spots which matter, which gives motive for the battlers to do their best instead of simply meeting the requirements to vote.
-The test guarantees that all of those who are assuming its tiering have to be in an environment where they are forced to play with or against the suspect.
-It gets rid of most of the self-interest because in this process, people can only prove or disprove that a Suspect is Uber. For example, if someone has a team that is only weak to Skymin, he will vote it into Uber no matter what the circumstances because it helps his team out. However, in this test, he can only benefit the testing process because he is proving or disproving his opinion by winning or losing.

Foreseeable problems for this method of testing:

-Insufficient number of testers- There may be many more players in one group than the other. In that case, take volunteers for the other side or pick testers to switch positions in a random fashion.
-Mass Conspiracy- If somehow the best players are lumped into one category while the worst are in the other, then the results would be obviously skewed.
-Human Error- If somehow the testers voting Uber do not use the suspect to the fullest extent, similar to how Deoxys-S’s full potential was realized quite late into its usage period.
-Too difficult to implement- Unavoidable, that one is really a bummer =/

Thoughts, responses and criticisms are, of course, welcome. Anything thoughtful will only help the testing method.

Lastly, thanks to outofdashwz for helping me bounce around ideas and proofread the post, Obi/ipl for inspiring this idea.

Syberia · Dec 7, 2008

Uhh, what? Isn't this exactly what we're trying to avoid, which is people deciding their votes before the test process is begun?

guoguo · Dec 7, 2008

Syberia said:
Uhh, what? Isn't this exactly what we're trying to avoid, which is people deciding their votes before the test process is begun?

That is why they're free to switch votes mid-test. It really doesn't matter who decides what someone is voting for, as long as they're participating in the test. The point of having people vote beforehand ensures that they are productive in the voting process because they are trying to prove a pokemon to be Uber/OU.

Edit: P.S. Where do you live in Irvine? I live in the same city

Matthew · Dec 7, 2008

I somewhat like this idea. We "assume" that the suspect is either OU or Uber. The people who say Uber are allowed to use it in their teams while the people who said OU cannot. If the people who do use it are much higher than the people who didn't, then it is Uber. If the people who don't employ it are still just as high as the people who did employ it, then it is OU.

lilyhollow · Dec 7, 2008

I didn't read much of this yet but

5. If a player wishes to switch their vote, they must use a new account with a blank rating. Obviously, you may only have one account per person.

this horribly discourages switching votes doesn't it?

guoguo · Dec 7, 2008

lilyhollow said:
this horribly discourages switching votes doesn't it?

I suppose you could allow vote switching until the second half of the testing without changing the results of the test, but any later and it'll screw things up. Thanks for pointing that out

the_artic_one · Dec 7, 2008

Yeah it would be better to just require that people must make teams both with and without the suspect and I'm not even sure that would be such a great idea.

Mia [old] · Dec 7, 2008

I do not see why the usage of the suspect Pokemon should be restricted or required for eligible voters. Regardless of whether a player uses that Pokemon, its properties should become evident through repeated interaction as an opponent. You must adapt to its presence if you wish to defeat it and you need to be aware of its downfalls when playing it. Ultimately, playing with and against the suspect Pokemon provides experience enough to understand it and to determine where it properly belongs.

The voting system is brilliant. You have a predetermined filter to find the players with the best understanding of the game. Once you have the best competitive players, you allow them to mold competitive game play. Not only does this minimize the effort put forth by the staff, it also allows Smogon to reinforce its reputation as a competitive Pokemon site while demonstrating a proper respect for the community members and their wishes. Since all isolated competitive communities should encourage competitive and community growth, this is the best and easiest way to do all of the above. It gets even better, as that predetermined filter is so easy to adjust to ensure that the proper group set for voting is selectively superior. Is the voting requirement too low? Just raise it. The voting system is definitely the best way to make changes for this particular site.

That said, I think it could also be refined. At the moment, there is no clear way to test the suspect Pokemon. There are also no defined values for the percentage of what will constitute a solid decision, and there are no definitions to guide the eligible voters towards the best decision. There should be no interference with the voting pool or the decisions made therein.

For example, with the recent Shaymin-S vote, they had the right idea, but it was not executed properly. There is no Shaymin-S exclusive suspect test ladder; rather, it was lumped into the standard ladder. There was nothing that stated that a 60% majority will be required before a decision is made. We have no definition of Uber past what the community generally feels does not belong in OU. Some members in the voting thread have made it clear that they intend to discard votes with undeveloped logic behind them, even though the voters have already fulfilled the eligibility requirement. The Shaymin-S vote also lacks a short term solution should the vote come down to a deadlock or a tie.

I feel that all of these errors are small and can be fixed easily for future votes so long as we learn from our mistakes this time around. This is how I would propose that we organize future suspect votes.

umbarsc · Dec 7, 2008

Mia said:
The voting system is definitely the best way to make changes for this particular site.

I disagree with that reasoning, because an excellent archer still probably won't be able to make a bow as well as someone who is skilled at building bows. In the same way, people that are good at Pokemon might not know the makings of a healthy metagame, and tiering should be as free of bias as possible.

Bob Marley · Dec 7, 2008

Since I am from PB, I am rather unfamiliar with how the system currently works, but I don't think limiting one's testing to one side would help. Instead, we should have a control group test both sides for a period of time and then vote. This might move those who would vote purely out of hatred for the suspect toward a reasonable opinion based on testing.

Mia [old] · Dec 7, 2008

umbarsc said:
The people that are good at Pokemon might not know the makings of a healthy metagame, and tiering should be as free of bias as possible.

If you avoid irrelevant comparisons, you will be much easier to understand!

Since we are structuring the game to fit what we deem as "competitive", we have already taken upon ourselves a personalized understanding of what a healthy metagame and proper tiering should be. That said, we have already included bias to create the format that is currently used, and bias will be included regardless of any alternatives that are used.

However, if you can think of a way to implement the general interest of the competitive community in a more effective manner or with less bias, post it here so we can learn from it and use it =)

edit: While I'm making requests, I would really appreciate the input from any members directly related to the decision making here at Smogon to add further insights that we can also consider.

guoguo · Dec 8, 2008

Mia said:
I do not see why the usage of the suspect Pokemon should be restricted or required for eligible voters. Regardless of whether a player uses that Pokemon, its properties should become evident through repeated interaction as an opponent. You must adapt to its presence if you wish to defeat it and you need to be aware of its downfalls when playing it. Ultimately, playing with and against the suspect Pokemon provides experience enough to understand it and to determine where it properly belongs.

If you understand the aims of the test and the results that we are looking for, then you'll understand why the usage is necessarily restricted and required. In simpler terms, we put the testers in two groups, let them ladder for a bit, and whichever group holds more of the top spots gets its way. Restricting usage allows us to keep track of the performance of the Suspect through the players who use it.

Mia said:
The voting system is brilliant...
...it gets even better, as that predetermined filter is so easy to adjust to ensure that the proper group set for voting is selectively superior. Is the voting requirement too low? Just raise it. The voting system is definitely the best way to make changes for this particular site.

guoguo said:
I feel that the current system for determination of tiers, and I’m sure others concur, is substandard. There are many with voting privilege who don’t necessarily have a firm grasp on our current metagame or Smogon’s philosophy.

Mia said:
At the moment, there is no clear way to test the suspect Pokemon.

guoguo said:
It's for these reasons that I'm proposing a new way of testing suspects.

Mia said:
We have no definition of Uber past what the community generally feels does not belong in OU.

guoguo said:
This testing method cements the implied definition of Uber that was unverifiable before: A suspect is considered Uber if its usage generates a decided advantage over an opponent who is not using the suspect, even when said opponent has full knowledge of the presence of the suspect on the opposing team and tries their best to inhibit the suspect from fulfilling its intended role.

I would suggest doing some more reading and responding.

Sustained Serenity · Dec 8, 2008

The reason people need to acquire ratings on both standard and suspect is to have a feel for both metagames. They may say something is broken in standard, but in suspect it is not really an issue. They decide which metagame is better. How can they see which is better if they can't play the other metagame?

Mia [old] · Dec 8, 2008

guoguo said:
I would suggest doing some more reading and responding.

I responded to umbarsc, not to your initial post. The purpose of my post was to convey my opinions to umbarsc, not to act as a repetition of your own. If you feel that I have wronged you in some way, I apologize.

Sudo · Dec 8, 2008

guoguo said:
If you understand the aims of the test and the results that we are looking for, then you'll understand why the usage is necessarily restricted and required. In simpler terms, we put the testers in two groups, let them ladder for a bit, and whichever group holds more of the top spots gets its way. Restricting usage allows us to keep track of the performance of the Suspect through the players who use it.

This is the reason I don't feel this proposal of yours is a sufficient improvement over what is being discussed by the administration. Here you're making the testers the control group; i.e., you're assuming that all of the testers have the same exact amount of skill, which is not the case. You could be inadvertently distributing all of the skilled players to one side, based on a vote that's been formed without any experience to back it up.

tldr; this is simply a science experiment gone awry.

Evil Hamster · Dec 8, 2008

I would personally just let people vote whatever they want, however many people there are on either side, and then just take the average rating of those that you do have. Chances are, if you can't find enough people on one side, then it's probably really obvious which way the vote is going anyway. I can't say I'm a fan of the idea though.
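
A minimal sketch of the average-rating comparison described above, assuming the same hypothetical `(player, group, rating)` ladder layout; the function name `average_rating_by_group` is made up for the example, and how large a gap between the averages should count is left open, as it is in the post.

```python
from statistics import mean

def average_rating_by_group(ladder):
    """Map each group name ("Uber" or "OU") to the mean rating of its accounts.

    `ladder` is an iterable of (player, group, rating) tuples. Groups with no
    accounts simply do not appear in the result.
    """
    ratings = {}
    for _player, group, rating in ladder:
        ratings.setdefault(group, []).append(rating)
    return {group: mean(values) for group, values in ratings.items()}
```

Unlike a top-spots count, this uses every registered account, so uneven group sizes do not need to be corrected by moving testers around.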

CardsOfTheHeart · Dec 8, 2008

I have to disagree with any process where votes are decided before the process begins, especially those where the votes are hard/costly to change because of the rules. Like Syberia posted, it's the kind of thing that I think we need to avoid.

That said, I do like the idea of registering accounts for the test. It could save us quite a bit of work, especially when confirming accounts. Here is what I would add to it:

Require a minimum number of battles to be conducted on each account.

We want our voters to have sufficient experience with the suspect in order to make an educated decision about its status whether it be "ban/not ban" when it comes to attacks (Evasion/OHKOs) or "OU/Uber" when it comes to Pokemon (Lati@s/Manaphy). If we require the voter to participate in X number of battles, then we can feel more confident that the voter has sufficient experience.

cim · Dec 8, 2008

I think we should do minimum number of battles but ignore rating entirely. If something is truly broken, then battles will come down to speed ties or hax much more often and thus it will be harder to get a good rating than normal.

Also Skymin is conclusive proof that good rating and good vote have absolutely no correlation.

guoguo · Dec 8, 2008

Sudo said:
This is the reason I don't feel this proposal of yours is a sufficient improvement over what is being discussed by the administration. Here you're making the testers the control group; i.e., you're assuming that all of the testers have the same exact amount of skill, which is not the case. You could be inadvertently distributing all of the skilled players to one side, based on a vote that's been formed without any experience to back it up.

tldr; this is simply a science experiment gone awry.

I addressed this issue near the end of my post. Chances are, the skill levels of the two groups would be even enough not to affect the results enough unless the Suspect was incredibly borderline Uber.

CardsOfTheHeart said:
I have to disagree with any process where votes are decided before the process begins, especially those where the votes are hard/costly to change because of the rules. Like Syberia posted, it's the kind of thing that I think we need to avoid.

That said, I do like the idea of registering accounts for the test. It could save us quite a bit of work, especially when confirming accounts. Here is what I would add to it:

Require a minimum number of battles to be conducted on each account.

We want our voters to have sufficient experience with the suspect in order to make an educated decision about its status whether it be "ban/not ban" when it comes to attacks (Evasion/OHKOs) or "OU/Uber" when it comes to Pokemon (Lati@s/Manaphy). If we require the voter to participate in X number of battles, then we can feel more confident that the voter has sufficient experience.

In my opinion, most voters have already decided to vote one way or the other before even testing, regardless of what happens during testing. This sort of test is not to shape the opinions of the voters, but to test the viability of the Suspect in question. In this respect, it matters very little what you vote for, but that you "prove" your vote by doing well in the ladder, much like what ipl did for Wobbuffet and Deoxys-E. In theory, all votes could be decided beforehand randomly, and you should end up with similar results. In practice, however, because voters want a Suspect in one tier or another, they would tend to act counterproductively if placed in the wrong group. Hence, allowing participants to decide beforehand.

Abacus · Dec 8, 2008

I like your idea a fair deal and I hope that everyone at least reads and understands your proposal. However, some of the implementation is a little sticky and has a few problems, but I think that a slight change could fix that. The basic principle behind your idea is to see whether teams with the suspect perform better than teams without, even with complete preparation with the suspect in mind, right? So couldn't we test this just by looking at the win ratio of teams containing the suspect, without needing all the other rules that go with your proposal? This seems far easier to implement and doesn't require any additional work on the part of the actual battlers (thus attracting a larger pool of testers). Is this fair or am I misunderstanding what you are saying?
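
A minimal sketch of that win-ratio check, assuming the server could log, for each battle, whether the winning and losing teams contained the suspect. The log format and the name `suspect_win_rate` are hypothetical. Battles where both teams or neither team used the suspect are skipped, since they say nothing about suspect versus non-suspect.

```python
def suspect_win_rate(battles):
    """Return the fraction of suspect-vs-non-suspect battles won by the
    team carrying the suspect, or None if no such battles were logged.

    `battles` is an iterable of (winner_used_suspect, loser_used_suspect)
    boolean pairs.
    """
    wins = losses = 0
    for winner_used, loser_used in battles:
        if winner_used and not loser_used:
            wins += 1      # suspect team beat a non-suspect team
        elif loser_used and not winner_used:
            losses += 1    # suspect team lost to a non-suspect team
    total = wins + losses
    return wins / total if total else None
```

A rate well above 0.5 would suggest the suspect gives a decided advantage, though this raw figure does not control for the skill of the players involved.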

umbarsc · Dec 8, 2008

Mia said:
If you avoid irrelevant comparisons, you will be much easier to understand!

Since we are structuring the game to fit what we deem as "competitive", we have already taken upon ourselves a personalized understanding of what a healthy metagame and proper tiering should be. That said, we have already included bias to create the format that is currently used, and bias will be included regardless of any alternatives that are used.

However, if you can think of a way to implement the general interest of the competitive community in a more effective manner or with less bias, post it here so we can learn from it and use it =)

edit: While I'm making requests, I would really appreciate the input from any members directly related to the decision making here at Smogon to add further insights that we can also consider.

The analogy was perfectly relevant. What I'm saying is that using an object/system is a skill not necessarily correlated with building/maintaining that object/system.

Sudo · Dec 8, 2008

guoguo said:
I addressed this issue near the end of my post. Chances are, the skill levels of the two groups would be even enough not to affect the results enough unless the Suspect was incredibly borderline Uber.

You acknowledged it as a problem, but you didn't propose anything to resolve it regardless. Your "solution" is to hope that the two groups will near each other in skill, but with the guidelines you suggested there's a high probability that it won't be the best case scenario you want. You're letting anyone sign up; how do you ascertain their skill level (since the only votes that'll be counted are those who get to the very top)? This is really the biggest flaw with your plan. We're not dealing with automated programs here, we're dealing with people. Some people are inherently better at pokemon than others are; how will an arbitrary vote (i.e. one decided pretty much at random since the testers have no experience at this point with the suspect) evenly and objectively make sure both sides are comprised of equal skill?

You also mention deviation requirements could be optional "to promote battling". Are these requirements necessary before testing (to see if you're good enough to get into the program) or during the test (like the system being proposed in the PR forum)?

Now, I don't mean to sound like I'm totally against this proposal of yours. I do like how your plan makes sure the active community members of Smogon are the ones conducting this test, not people who come back from vacation and happen to realize their account is eligible to vote. This method also eliminates the likelihood of multiple accounts/votes. I also like how a definition of Uber could be proved by this process, but until we can resolve the skill issue the results can't be accurate.

CardsOfTheHeart · Dec 8, 2008

guoguo said:
In my opinion, most voters have already decided to vote one way or the other before even testing, regardless of what happens during testing.

You'll get no argument from me on that, unfortunately. :/

guoguo said:
This sort of test is not to shape the opinions of the voters, but to test the viability of the Suspect in question.

You're not trying to shape the opinions of the voters during the test, yet you require them to form an opinion before the test even begins? I thought we were trying to avoid that...

Okay, I need to ask this: are you encouraging voters to vote before testing begins and not change their minds? The way the actual vote is conducted pretty much requires that voters don't switch their vote in order to have an impact on the vote.

If voters are making up their minds beforehand and aren't given total freedom to change their minds, how is that much different from what's happening now? At least the current system allows you to easily change your mind during the process. This system seems to encourage everything to stay the same.

guoguo said:
In this respect, it matters very little what you vote for, but that you "prove" your vote by doing well in the ladder, much like what ipl did for Wobbuffet and Deoxys-E. In theory, all votes could be decided beforehand randomly, and you should end up with similar results. In practice, however, because voters want a Suspect in one tier or another, they would tend to act counterproductively if placed in the wrong group. Hence, allowing participants to decide beforehand.

This voting method places weights on each vote according to the voter's skill; if the voter isn't skilled enough to get a top spot, the vote doesn't count. That's all well and good, but what about the skilled voters that want to change their mind at the last minute? This system says "screw you" to that because if you want to change your vote, then you have to reset your account, meaning that your vote probably won't count and you're otherwise stuck with a vote that you don't want to cast.

Long story short, your system looks to encourage entering the test with an opinion in mind and not changing it throughout the test, the opposite of what I believe the testing process should do. I want voters to enter the test with no opinion on the matter and form an educated opinion through their experience during testing.

'But that's impossible, Cards. Voters are already going into these tests with their own opinions on a suspect. Testing isn't going to make a difference to several of these opinions.'

If this is such a problem, then why does it look like you are encouraging more of it?

I am Gengar · Dec 8, 2008

umbarsc said:
I disagree with that reasoning, because an excellent archer still probably won't be able to make a bow as well as someone who is skilled at building bows. In the same way, people that are good at Pokemon might not know the makings of a healthy metagame, and tiering should be as free of bias as possible.

sigged for awesomeness

umbarsc · Dec 8, 2008

CardsOfTheHeart, I don't think the idea is to take the voters' votes into account. Unless I'm mistaken, the idea isn't to vote exactly, but to give your opinion before the testing starts, and you sort of "prove" your opinion by battling with or without the Pokemon (depending on whether you think it's Uber or not). Then, sort of see the average of how well the "Uber" people are doing and how well the "OU" people are doing. I don't believe the votes are actually taken into account when determining tier.
