Fixing UU

Caelum

qibz official stalker
X-Act said:
EDIT: Well, what I mean is that, as I understood it back then, stuff goes to UU by default, unless a Pokemon is extremely obvious that it is not UU. If that's what you told me, then ignore this post.
Yes, that is what I meant. I don't think (and I don't think most others think) that Dugtrio or Donphan pose a significant threat. Sorry for the mix-up.

X-Act said:
My point is that a single Pokemon introduced to UU can be a check for Pokemon that were voted BL previously. Or, a single Pokemon going out of UU can imbalance the UU metagame. In both cases, a retest of certain Pokemon would be needed every 3 months... which is tedious, and something that needs to be addressed.
I've stated a few times that I don't believe a singular Pokemon can be relied upon to call into question the status of a Pokemon. Just to take your example, Raikou + Dugtrio. While I appreciate that Dugtrio can pose a significant threat to Raikou, I don't think one check or counter means something isn't broken when without it, it is. If that is the case, I'd question whether the decision was right in the first place. Just an example, Cresselia (particularly Trick-Scarf), can decently counter Garchomp and Choice Band Ice Punch + Ice Shard Weavile can act as a check to almost any Garchomp.

A situation in which multiple Pokemon call into question whether something is broken is unlikely, and it can be dealt with if it comes up.

I think the nature of DP makes it so that a singular check or counter (or even 2 or 3) doesn't affect the overall status of a Pokemon which is why I don't see the situations you described being a significant issue. I suppose you view that differently, but that has always been my view.
 

DougJustDoug

Knows the great enthusiasms
We are going to use Baton Pass with the UU leadership team!

The current UU leadership team of Caelum, RB Golbat, and Great Sage have done a great job of organizing testing for the UU tier so far. Their hard work has helped make UU one of the most active, interesting metagames in competitive pokemon. They will conclude the current suspect testing round -- which only has voting remaining, IIRC.

After voting is completed, we will transition leadership responsibilities to JabbaTheGriffin, jrrrrrrr, and Reachzero. This team has some new ideas of how to move forward with suspect testing in UU. I look forward to seeing the UU metagame continue to thrive with their guidance.
 

Jumpman16

np: Michael Jackson - "Mon in the Mirror" (DW mix)
As has been said several times in this thread, the simple act of saying "your vote is partially dependent on how much you see and use this Suspect" will automatically inflate the use of that Suspect in the metagame. As UU votes are relative to the metagame even more so than OU, placing SExp on the table causes a great deal of artificial external influence on the metagame.

It's been gone over at least 4 times now ._.
As I said on IRC, this is impossible if supposed suspects have not even been nominated yet. As far as I'm concerned, it's the fairest thing in the world to question whether and how much a player actually used a pokemon he wants to vote on, especially when the only "influence" would be present regardless of whether we used SEXP or not.

What if the suspect actually and truly is not particularly good? This means only the players that can work around it can vote at all, which makes the bias inherently toward "ban it", even if just a little bit, as the only people that get to vote on a Suspect are the ones that use it successfully (and the people that use it successfully are obviously much more likely to vote it Uber).

Yes, there are some cases when it is obvious a Pokémon is at best extremely good, but this will definitely not be the case for every UU suspect and we can't afford to fuck shit up.
"Good" is just about the most subjective concept imaginable, SEXP is not. See why I want to use SEXP to do away with as much subjectivity as possible? If the Suspect isn't good, I highly doubt it would have even been both nominated and actually voted a Suspect by whatever majority of UU players actually decide that. Name one UU Suspect or even OU Suspect that wasn't widely regarded as at least "good".

And you can say you don't trust the metrics of the SEXP as much as you want. I trust one's assessment of "good" even less.

The same reason we should listen to someone who wins with Crobat. They've experienced the Pokémon. If he only faced Crobat 6 or 7 times and managed to not be fucked up by it to the point of being pissed enough to vote for its Suspect nomination, he obviously didn't have a problem with it. I'd prefer he faced it more and all, but if he honestly didn't see it as an asset to his team and enough other people didn't that he only faced it "6 or 7 times" then it either won't be a Suspect or deserves to be a landslide vote for no ban.
You're working under the assumption that this player played against people who also "didn't see it as an asset" to their own teams, or at least would therefore vote OU without question. That is a very faulty assumption to make. There are several players who made the Rating and Deviation for the Latios test and barely used Latios, only saw him a handful of times relative to the other players, and actually still voted him uber in their submissions.

For example, Stathakis missed the upper reqs for the Latios test by half a deviation point, going 41-14 in the process. He voted Latios uber. Guess how many times he actually used Latios? Zero. Limitless made the upper reqs, going 72-15 in the process. He voted Latios uber. Guess how many times he actually used Latios? Zero. paramylodon made the lower reqs and was 81-42 in the Latios test. He was "unsure of how he would vote", which shouldn't at all surprise you when you realize how many times he actually used Latios (you have one guess).

There's obviously the chance that players will naturally not run into a Suspect much when playing battles. It's unfortunate, but this is a risk they bring upon themselves for not using a Suspect they were encouraged to use and to break and relying on others to use it for them so they might know what they're talking about when asked to vote.

People want to sound off on a Suspect because voting is seen as a Good Thing on Smogon, as it should. In UU, "experiencing the Pokémon" is as important as "experiencing the metagame", and you can only 100% ensure one or the other. But honestly, with ranking and deviation votes, if you can make rank and only run into one or two of a Suspect then it's clear everyone isn't completely overwhelmed by how good it is. If you manage to do well enough against it that your ranking isn't ruined then it's pretty obvious you know how to deal with it, which means you can decide how "good" it is.
Everybody who plays competitive pokemon wants a say in the tiering of competitive pokemon, this has been true forever in every single generation (yes, even RBY, "Mewtwo's Cheapness Revisited" is a topic posted on Azure Heights over nine years ago a few of us here remember). SEXP ensures that you literally know what you're talking about.

And again, you're working under a very faulty assumption with your reliance upon Rating and Deviation. There are many, many people who made rank easily without having good Latios SEXP numbers relative to everyone else, or barely made rank without having good Latios SEXP numbers relative to everyone else, meaning their Rating/Dev would have been a lot higher if we didn't factor in the...entire reason for having the Suspect Test. This is the entire reason SEXP was implemented.
 

cim

happiness is such hard work
As I said on IRC, this is impossible if supposed suspects have not even been nominated yet. As far as I'm concerned, it's the fairest thing in the world to question whether and how much a player actually used a pokemon he wants to vote on, especially when the only "influence" would be present regardless of whether we used SEXP or not.
The last sentence is an attempt to minimize the "data contamination" problem, but "the data's available even if we don't measure it" is missing the point. People will know what the Suspects are going to be early on. It won't be official, but people will battle with Suspect X and Suspect Y based on rumors and hearsay. Before both Suspect Tests it was blatantly obvious which would be nominated. If you want proof, look in the old UU threads two weeks before it happened.

It's going to influence how we play.

"Good" is just about the most subjective concept imaginable, SEXP is not.
Considering SEXP is still subjectively applied, the fact that the number you see on your screen is "objective" doesn't count for shit.

See why I want to use SEXP to do away with as much subjectivity as possible?
Opinions on Pokémon are subjective things. To do away with subjectivity in an opinion poll just doesn't make much sense.

If the Suspect isn't good, I highly doubt it would have even been both nominated and actually voted a Suspect by whatever majority of UU players actually decide that. Name one UU Suspect or even OU Suspect that wasn't widely regarded as at least "good".
Crobat in the first phase. How it was being used back then was not in a manner I'd call "good". I attempted to use Crobat, but it never fit my style of team very well. By the second voting phase that changed a little but at the time Crobat just was a worse option over other Pokémon I was looking for.

Regardless of the specific example, it's 100% bound to happen.

And you can say you don't trust the metrics of the SEXP as much as you want. I trust one's assessment of "good" even less.
Except you can evaluate what one thinks is "good" or not in a bold vote. You can't evaluate SEXP. None of us can. I don't trust anything I can't analyze myself, and neither do a ton of people.

You're working under the assumption that this player played against people who also "didn't see it as an asset" to their own teams, or at least would therefore vote OU without question. That is a very faulty assumption to make. There are several players who made the Rating and Deviation for the Latios test and barely used Latios, only saw him a handful of times relative to the other players, and actually still voted him uber in their submissions.
And before you looked at Suspect Experience, you thought their rationale looked fine and dandy? Really?

There's obviously the chance that players will naturally not run into a Suspect much when playing battles. It's unfortunate, but this is a risk they bring upon themselves for not using a Suspect they were encouraged to use and to break and relying on others to use it for them so they might know what they're talking about when asked to vote.
I thought earlier you were saying players didn't necessarily have to use the Suspect to gain Suspect Experience. Anyway, this is kind of my whole point. When people have a distinct advantage in using Pokémon likely to be banned, they will use them in order to vote, regardless of whether or not they find they gain a competitive advantage in doing so.

Everybody who plays competitive pokemon wants a say in the tiering of competitive pokemon, this has been true forever in every single generation (yes, even RBY, "Mewtwo's Cheapness Revisited" is a topic posted on Azure Heights over nine years ago a few of us here remember). SEXP ensures that you literally know what you're talking about.
No it doesn't. You can still vote for a shitty reason without it. I could be like FiveKRunner and vote based on Nintendo tiering. Sucking up and using the Pokémon a few hundred times (and well) won't somehow make me have a revelation. Objectively.

And again, you're working under a very faulty assumption with your reliance upon Rating and Deviation. There are many, many people who made rank easily without having good Latios SEXP numbers relative to everyone else, or barely made rank without having good Latios SEXP numbers relative to everyone else, meaning their Rating/Dev would have been a lot higher if we didn't factor in the...entire reason for having the Suspect Test. This is the entire reason SEXP was implemented.
In OU, it makes more sense than UU, yes, as the Suspect Test exists to... test the Pokémon. I mean I still don't agree, but whatever. My point (and Obi's point, Colin's point, Jabba's point, and almost everyone I've seen involved in UU) is that the harm SEXP does to a metagame is greater than the good. And among the reasons I don't like it in OU (which still apply), the others "push it over the edge" since there is no Suspect Ladder (and shouldn't be).
 

Jumpman16

np: Michael Jackson - "Mon in the Mirror" (DW mix)
The last sentence is an attempt to minimize the "data contamination" problem, but "the data's available even if we don't measure it" is missing the point. People will know what the Suspects are going to be early on. It won't be official, but people will battle with Suspect X and Suspect Y based on rumors and hearsay. Before both Suspect Tests it was blatantly obvious which would be nominated. If you want proof, look in the old UU threads two weeks before it happened.

It's going to influence how we play.
So if we went back and looked at the SEXP data of the people who first voted on Abomasnow and Gallade, and we found that some of the voters only used and experienced these pokemon two or three times, you would have no problem with me or Doug calling out these users for having voted on pokemon they literally had virtually no experience with, right?

Preemptively citing the notion of SEXP influencing how people would play and further implying that this would be a negative influence is a huge cop-out. If you want to do your Good Thing and vote on a Suspect, then use the pokemon that people are whispering will be Suspects. If you don't, and somehow come across it much less than the rest of the UU players who equally did not know it was going to be a Suspect, then you shouldn't get to vote on it.

I don't think you understand that it isn't possible for most other people to think something is shitty, but for it also to be used enough by anyone for anyone to gain decent SEXP on it regardless of whether Doug or I actually decide to use SEXP for the test.

Considering SEXP is still subjectively applied, the fact that the number you see on your screen is "objective" doesn't count for shit.
Everything about deciding voters for the Suspect Test Process is subjective. Doug and I have disqualified voters who made the Rating and Deviation thresholds because we found out they cheated. Aeolus and I have had to wade through the reasoning of hundreds of voters for Manaphy, Latios and Latias and subjectively decide who had enough experience to vote. Tangerine and I had to wade through hundreds of straight-up bold votes on Wobbuffet and Deoxys-S last year and subjectively decide who had enough experience to vote.

Even you personally have argued that there shouldn't even be an Upper Tier, one of the few suggestions of yours I have agreed with and put in place. This change did nothing else but do away with the only objective, carte blanche manner in which players could 100% cast their votes without being required to write submissions for the Tiering Contributors' critique. Chris, you are being pessimistic to a fault here with your resistance to the only objective metric in the Suspect Test Process that pertains to the Suspects themselves, and it is not coming off well.

Opinions on Pokémon are subjective things. To do away with subjectivity in an opinion poll just doesn't make much sense.
This reminds me of a quote from obi: "That's like saying 'I won't change my opinion that eggs are purple'." An "opinion" that Choice Specs Latios is dead weight when you used Latios a whopping one time, faced it ten times, and went 1-9 in those battles is a terrible "opinion" that should have no bearing on its tiering. These kinds of opinions can only be weeded out with SEXP, and rather easily, too.

Crobat in the first phase. How it was being used back then was not in a manner I'd call "good". I attempted to use Crobat, but it never fit my style of team very well. By the second voting phase that changed a little but at the time Crobat just was a worse option over other Pokémon I was looking for.

Regardless of the specific example, it's 100% bound to happen.
Wrong. I'm aware that everyone voted it UU, but it was still nominated by enough people whose reasoning passed the judgment of both Caelum and then Tangerine to be made a Suspect. That makes it "good" by the only application of the word that matters in the Suspect Test Process.

Except you can evaluate what one thinks is "good" or not in a bold vote. You can't evaluate SEXP. None of us can. I don't trust anything I can't analyze myself, and neither do a ton of people.
You don't have to "evaluate" whether someone is lying to you about how much they used a Suspect any more than you have to evaluate whether someone is lying to you when he says the egg you are both looking at is purple. (Even if it were actually purple for some reason, you could both see that and be able to evaluate him with no problem.) It's cliché, but "numbers don't lie".

And before you looked at Suspect Experience, you thought their rationale looked fine and dandy? Really?
I have yet to read a submission that I thought was "fine and dandy" only to see that the player had very low SEXP. But what difference does that make to the entire Suspect Test Process anyway? People have always been capable of lying in their votes since Tangerine and I first tallied bold votes last summer. SEXP can only help talliers determine whether or not the more questionable submissions are actually backed by fact. I don't see how you can possibly argue that using SEXP makes it somehow easier for players to deceive talliers.

I thought earlier you were saying players didn't necessarily have to use the Suspect to gain Suspect Experience. Anyway, this is kind of my whole point. When people have a distinct advantage in using Pokémon likely to be banned, they will use them in order to vote, regardless of whether or not they find they gain a competitive advantage in doing so.
Wrong again. The bit you cut out about Stathakis and Limitless making the Upper Reqs (Stathakis stopped under the assumption he was in) clearly demonstrated cases where players had a distinct advantage not using a Suspect, not using them in order to vote, but voting them uber anyway. Remember, neither of them used Latios even once. And to add to that, I also know for a fact that both Stathakis and Limitless had a better record playing against Latios than they did when they were not playing against it, in battles that were, for all intents and purposes, standard OU battles.

They literally had a competitive advantage in not doing so. And I know all of this because of SEXP. Even if these were the only two cases I could cite, you would still be flat out wrong in your assumption.

No it doesn't. You can still vote for a shitty reason without it. I could be like FiveKRunner and vote based on Nintendo tiering. Sucking up and using the Pokémon a few hundred times (and well) won't somehow make me have a revelation. Objectively.
Again, nothing anyone has ever suggested has allowed us to circumvent the need to analyze people's reasoning, a notion with which you would personally disagree anyway. Your point is moot and will be forever.

In OU, it makes more sense than UU, yes, as the Suspect Test exists to... test the Pokémon. I mean I still don't agree, but whatever. My point (and Obi's point, Colin's point, Jabba's point, and almost everyone I've seen involved in UU) is that the harm SEXP does to a metagame is greater than the good. And among the reasons I don't like it in OU (which still apply), the others "push it over the edge" since there is no Suspect Ladder (and shouldn't be).
Jabba was originally under the impression that SEXP would be used the way it currently is for the OU Stage 3 test (as were you). Colin and Obi soon digressed to arguing against the need for keeping SEXP a hidden metric, or at least you can't blame me for seriously not remembering their actual issue with it harming the UU metagame specifically, since the former was what they were arguing about for over an hour on IRC with me, not any harm it does specifically to the UU test.
 

Hipmonlee

Have a nice day
They literally had a competitive advantage in not doing so. And I know all of this because of SEXP. Even if these were the only two cases I could cite, you would still be flat out wrong in your assumption.
His assumption is still reasonable. He never said every person would do so. His point has always been you are introducing a bias, not that you are introducing a hard and fast rule that everyone will do something.

Also, you haven't actually provided any evidence they had an advantage by not using the suspect; they might have done even better with the suspect on their team. This is all but irrelevant in their case anyway, as they qualified comfortably. The people who one would think would be most influenced by this are those who are the least confident of their ability to qualify without using the suspect.

Also, you don't actually know this because of SEXP; you know this by looking at their win-loss records with various criteria (at least this is how you have presented it to us).

Your examples of the applications of suspect experience in this thread all seem to be a case of "did this person lie", which you said on IRC yesterday is not (only?) what suspect experience measures. Given that you are making people write submittals, I think that looking at their results and using them to work out if a person is telling the truth is a good idea. But that doesn't require a suspect rating or a suspect experience formula. It seems that there is some other use for this formula, and that is where I, and I think other people opposed to the idea, take issue with it.

Based on my experience with the Latios test, where I qualified for voting despite only battling for one weekend, with a standard OU team with Latios hastily subbed in, and didn't have much success at all (or was this the Latias test? I don't remember, one of the two), you are adding a group of players to the voter pool who weren't able to make the rating/deviation requirements, based on the fact that they used the suspect in all of their battles (and possibly you are also removing people from the voter pool who did make the requirements but who didn't use the suspect in their battles).

The whole point of the suspect test process is to canvass the opinions of top level battlers, because we want the ruleset to be based on top level battling. If you introduce lower level battlers just because they used a certain pokemon in their team(s) then you are distorting the results of the test.

It seems to me, from what you have said of it, that suspect experience is either trivial, or it is creating a bias, probably in favour of voting suspects uber. I really can't see how it could be anything different. Yes, there will be cases where the opposite will happen, but that doesn't change the fact that there was a bias present.

Have a nice day.
 

haunter

Banned deucer.
Also, you haven't actually provided any evidence they had an advantage by not using the suspect; they might have done even better with the suspect on their team. This is all but irrelevant in their case anyway, as they qualified comfortably. The people who one would think would be most influenced by this are those who are the least confident of their ability to qualify without using the suspect.
I don't think that any sort of evidence can be provided for something like that, but I think that an example of what Jumpman means has occurred even during the recent Stage 3. I'm talking about the user Garchompfan23 (who I later found out was Lady Bug): he built a perfect anti-(suspect) metagame team, with only Latios on his team, and managed to climb the ladder to 1st position in a few days (of course also thanks to his undoubted battling skills), just by being able to counter most of the suspect teams he faced, which, obviously, were filled with suspects. This is just one example, but things like this happened during the Latias/Latios and Manaphy tests too, all of which I played.

The whole point of the suspect test process is to canvass the opinions of top level battlers, because we want the ruleset to be based on top level battling. If you introduce lower level battlers just because they used a certain pokemon in their team(s) then you are distorting the results of the test.
Of course Jumpman doesn't need me to defend him, but I don't think he ever stated something like that, just take a look at this post.

Also, I'd like to say that I completely agree with using the SEXP system in granting voting rights. I don't think that tiering decisions should be made solely on the basis of a preset rating/deviation, but rather on a combination of actual knowledge of the suspect Pokemon (which, by the way, can't be acquired just by facing them) and a good win rate with them.

Have a nice day.
 

Jumpman16

np: Michael Jackson - "Mon in the Mirror" (DW mix)
His assumption is still reasonable. He never said every person would do so. His point has always been you are introducing a bias, not that you are introducing a hard and fast rule that everyone will do something.
Wrong. Chris said: "When people have a distinct advantage in using Pokémon likely to be banned, they will use them in order to vote". This is wrong. "Will". Compare "they will use them in order to vote" to your "everyone will do something" and tell me why he isn't 100% wrong in stating this absolute and why I shouldn't call him on it. I'm not going to accept any black-and-white blanket statements formed as arguments when they are misleading.

Also, you haven't actually provided any evidence they had an advantage by not using the suspect; they might have done even better with the suspect on their team. This is all but irrelevant in their case anyway, as they qualified comfortably. The people who one would think would be most influenced by this are those who are the least confident of their ability to qualify without using the suspect.
First of all, I don't have to provide any evidence period, or could fabricate it to my benefit if I did. Why should you believe me that Stathakis was 41-14 and that Limitless was 72-15, or more importantly that both really did never use Latios? Second, you can only guess or hope that they might have done better with the Suspect in an attempt to help your argument. But do you honestly think these two players or anyone else wouldn't use Latios or any Suspect if they felt it would help them not only win, but to gain the SEXP that I even then had hinted you get by using the Suspect? Do you think they would intentionally do a disservice to themselves and their chances of qualifying by not using a Suspect they really felt would give them a competitive advantage? Do you think Stathakis would have risked missing the mark by half a deviation point if he knew using Latios on his team would grant him a definite competitive advantage?

And they qualified comfortably because they still actually gained a lot of SEXP despite not using Latios, which contrasts both with the overriding fear that you can't get SEXP if you don't use them and the assumption that people who are inclined to think a Suspect is BL/Uber will be inclined to use it.

Also, you don't actually know this because of SEXP; you know this by looking at their win-loss records with various criteria (at least this is how you have presented it to us).
Wrong. I've presented this to you this way because saying more would divulge more details about the SEXP metrics than I want. Even regardless of that, I specifically said that "I know all of this because of SEXP." You can call it semantics all you want, but I purposely didn't say "I know this by looking at their SEXP" or "their SEXP reflected this phenomenon". Can you conceive of any reason anyone would have to look at anyone's separate win-loss ratios with and without a Suspect? Why would Doug, the only person privy to such numbers before I had the idea of Suspect Experience, have any cause to dig through millions of cells of battle data to look at something like that?

Are you guys finally getting the idea that maybe I actually have thought about the metrics of the Suspect EXP formula for a long, long time, and that maybe the rest of you haven't thought of things like why I would possibly be able to reference any player's separate win-loss ratios with and without a Suspect at a glance, and why I think that might matter?

Your examples of the applications of suspect experience in this thread all seem to be a case of "did this person lie", which you said on IRC yesterday is not (only?) what suspect experience measures. Given that you are making people write submittals, I think that looking at their results and using them to work out if a person is telling the truth is a good idea. But that doesn't require a suspect rating or a suspect experience formula. It seems that there is some other use for this formula, and that is where I, and I think other people opposed to the idea, take issue with it.
You take issue with it because you don't know what it is. As was belabored on IRC 15 hours ago, there are reasons Doug and I and now the UU Suspect Test heads are going to keep the formula private, among them the ability to easily game the system by knowing the metrics. The popular rebuttal to that has been "any metric that can be gamed is a poor metric", to which I would remind our community that every single Suspect Test ever has been more easily gamed without SEXP than with.

The opposition to the concept of SEXP seems to be forgetting how easy it was for people to actually vote. When we did bold voting as the only requirement for Wobbuffet and Deoxys-S, one could easily vote without having played on the Ladder with or against these Suspects, a concern that's completely addressed by SEXP. When Rating/Dev was the only requirement for Garchomp, one could easily vote without having any experience with or against Garchomp (which was kind of impossible since Garchomp wasn't actually on the Suspect Ladder for that month but my point stands), a concern that's completely addressed by SEXP. When a combination of both bold voting and Rating/Dev was implemented for Deoxys-S again and Shaymin-S, one could still vote without having picked up concrete experience on the Suspect, a concern that's completely addressed by SEXP. SEXP has improved upon the major flaws of every single iteration of the Suspect Test Process.

Based on my experience with the Latios test, where I qualified for voting despite only battling for one weekend, with a standard OU team with Latios hastily subbed in, and didn't have much success at all (or was this the Latias test, I don't remember, one of the two), you are adding a group of players to the voter pool who weren't able to make the rating/deviation requirements, based on the fact that they used the suspect in all of their battles (and possibly you are also removing people from the voter pool who did make the requirements but who didn't use the suspect in their battles).
You're talking about the Latios test. The one where despite "only battling for one weekend" you had more battles than almost 40% of the 186 accounts Doug and I pulled SEXP on. The one where you used Latios in all of your 51 battles. The one where you had a better record when your opponent didn't also use Latios than when he did. I could easily argue that the preliminary metrics that I posted in Inside Scoop tipped you, a badgeholder who posted in that thread, off that I would specifically be looking at players who used the Suspect, but I won't stoop to that level.

The whole point of the suspect test process is to canvass the opinions of top level battlers, because we want the ruleset to be based on top level battling. If you introduce lower level battlers just because they used a certain pokemon in their team(s) then you are distorting the results of the test.
Wrong. The point of the Suspect Test is to canvass the opinion of those who experience the Suspect. Otherwise we would still be using bold voting and Rating/Deviation as our only requirements. We want our ruleset to be based on people who know what they're talking about, not people who are good at laddering.

It seems to me, from what you have said of it, that suspect experience is either trivial, or it is creating a bias, probably in favour of voting suspects uber. I really can't see how it could be anything different. Yes, there will be cases where the opposite will happen, but that doesn't change the fact that there was a bias present.

Have a nice day.
The way you and everyone have thrown around bias with a pejorative connotation has me and Doug baffled. Yes, technically, we are biased. Biased towards the opinions of those who experience Suspects, and not those who can ladder up without using or seeing the Suspect. Hell, I'm also biased when reading voting submissions. Biased towards people who can demonstrate a grasp of the English Language and the fundamentals of a convincing, logical argument. You guys are regarding a "bias" towards those who we objectively see are experiencing the suspect more than others like it is inherently a bad thing.
 

DougJustDoug

Knows the great enthusiasms
You have to "experience" a pokemon to know much about it. I quote the word "experience" -- because I am not just talking about "using" it. There are many facets to "experiencing" a Pokemon.

Does putting a suspect on your team connote "experience"? Yes, it does. Because, presumably, you had to think about that pokemon's role in relation to other team members. If you later battle 100 times, and 6-0 every opponent with your lead pokemon and never use the suspect in battle -- I could argue you still got a little bit of experience with the suspect. More experience than people that never even bothered to try and put the pokemon on their team.

Does using a suspect in battle connote "experience"? Yes it does. Because despite all your best laid plans during team building, there's nothing like sending that pokemon out in battle and seeing if it works the way you planned. Whether the suspect works or not, you probably learned something about the pokemon by battling with it.

Does facing a suspect in battle connote "experience"? Yes it does. Because despite all your team building planning, and all your battle experiences with using a suspect -- there's always other people out there with different ideas and different strategies. By facing a suspect wielded by a different battler, you learn new things about that pokemon and how it can be used.

I won't keep going into finer details of experience. Hopefully, you get my point. I mention it here again (we've said it many times before) because Jumpman and I have continually referred to "experience" with relation to SEXP, and yet many people in this thread and other places continue to imply that SEXP = "Using a pokemon". And because these implications are being repeated over and over by people (CIM, SDS, etc -- I'm looking at you) -- members of the community reading these discussions now accept it as fact that "SEXP = Using a pokemon". Well, it is not true.
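
To make the three facets above concrete, here is a minimal Python sketch that tallies them separately for one player. It is purely illustrative and is not the SEXP formula, which is not disclosed anywhere in this thread. The `Battle` record, its field names, and the `experience_facets` helper are all hypothetical, and no weighting between the facets is implied.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Battle:
    """Hypothetical record of one battle from a single player's point of view."""
    player_team: frozenset     # Pokemon the player brought
    player_used: frozenset     # Pokemon the player actually sent out
    opponent_used: frozenset   # Pokemon the opponent sent out


def experience_facets(battles, suspect):
    """Count, per facet, how many battles involved the suspect.

    This only tallies the three facets described in the post. How (or whether)
    they are combined into a single number is an open assumption.
    """
    counts = {"on_team": 0, "used": 0, "faced": 0}
    for b in battles:
        if suspect in b.player_team:
            counts["on_team"] += 1
        if suspect in b.player_used:
            counts["used"] += 1
        if suspect in b.opponent_used:
            counts["faced"] += 1
    return counts
```

Note that a player who never puts the suspect on a team can still accumulate a nonzero `faced` count, which is the distinction between "experience" and "using a pokemon" that the post is drawing.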

And don't give me that bullshit, "Well, you won't disclose exactly what is in SEXP, so I don't really know what is in it." That is NOT an excuse to continue to proliferate the lie that "SEXP = Using a pokemon". That reasoning is basically saying, "Well, I don't know the full truth. So I will intentionally make false statements and encourage others to believe it. And until you disclose the full truth, how do we really know that my statements AREN'T true?" If anyone out there thinks that kind of reasoning is good, logical thought -- then I will no longer dignify your arguments with a response.

So, I have given a clear explanation of what we mean when we refer to "experience", regardless of what other lies you have heard to the contrary. Which brings me to the real issue that needs to be clarified:

"Does a player need experience with a pokemon in order to participate in deciding its tiering status?"

The current Smogon tiering leaders believe the answer to that question is a resounding "YES". All of our current processes and procedures are based around the notion that tiering should be decided by people with three qualifications:
1) They are smart
2) They are skilled battlers
3) They have experience with the pokemon being tested
There is no way to 100% accurately measure those three characteristics.

We try to measure #1 by reading a submitted paragraph and analyzing the quality of the arguments and reasoning. It's subjective as hell, I know. And it can be gamed, by getting someone else to write your paragraph, or copying from someone else. But, since we can't rely on people submitting IQ scores or anything like that -- it's pretty much the best thing we have. If you have better ideas about how to ensure our tiering voters are smart, I'm interested to hear it. If you disagree that voters should be smart -- that's fine too. But, be aware that we probably aren't going to change that goal any time soon.

We try to measure #2 with ladder ratings and deviation scores. It's an imperfect measurement, I know. If anyone has bothered to look at the details of Glicko2, and the way we use it in Shoddy Battle -- it's really not a great way to represent battle skill for competitive pokemon. But, it's the best we have right now. When we get X-ACT's GLIXARE system in place, maybe that will improve. But, for now, we are using the Glicko2 ratings to identify skilled battlers. If you disagree that voters should be skilled battlers -- that's fine too. But, be aware that we probably aren't going to change that goal any time soon.

Which brings us to #3. Prior to gathering SEXP data -- there was no way to measure experience (as defined above). We could only hope that smart people with good ratings would actually go out and experience (as defined above) the suspects being tested. However, we knew there was actually no way to ensure that with our previous process. And based on some rumors and innuendo, we were suspicious that some "qualified voters" actually had little to no "experience" with the pokemon being tested. In our opinion, that is a very bad thing.

Many pokemon players have deep-seated preconceived notions about certain pokemon. Maybe it developed from their ingame play. Maybe it comes from their theorymon impressions they got when they first saw the pokemon's dex entry. Maybe these players have heard lots of rumors and believed them. Whatever the reason, we know that some players will develop strongly held beliefs about a pokemon's competitive relevance -- without actually experiencing that pokemon very much in actual competitive play. We knew this was true prior to developing SEXP -- but we had no way to prove it objectively.

With SEXP, we have an objective measurement. Is that measurement a perfect representation of "experience" with a pokemon? No, far from it. In fact, SEXP is no more accurate than the other measurements we use for determining voter qualifications. We know that paragraphs are a shitty way to measure intelligence -- but it's the best measurement we have right now. Glicko2 ratings are a shitty way to measure pokemon battle skill -- but it's the best measurement we have right now. And SEXP is a shitty way to measure experience -- but it's the best measurement we have right now.

I know many people who object to SEXP are simply upset that the data is not public. You argue, "Well I don't know if SEXP is good or bad. Only the admins have seen it." Yeah, well... get over it.

We don't disclose the exact reasons for rejecting every voter's paragraph logic either -- and no one has ever complained about that at all. For all you know, Jumpman may be rejecting paragraphs based on whether he likes the username of the battlers. All you have is a vague description of the general criteria Jumpman SAYS they use in evaluating paragraphs. Beyond that, you simply have to trust that paragraph evaluators are doing a decent job.

We have rejected voters for cheating their battles, and I have never disclosed the algorithms we use for detecting cheaters. For all you know, we may be rejecting voters based on bad data. If you ask the cheaters, they will certainly deny any wrongdoing, and claim that our information is erroneous. Ultimately, you have to trust that we are doing a decent job with cheat detection.

SEXP is no different than some of these other unpublished aspects to voter qualification. Yet for some reason, you people have started flag-waving about "admin secrets", as if it's some big cover-up. And personally I'm sick of it. I'm sick of people ignoring published statements about what "suspect experience" means. And I'm sick of people beating conspiracy drums about the things that aren't published. Suspect EXP is a useful tool, that's all. It's not a perfect tool, but nothing in the tiering process is perfect.
 

Cathy

Banned deucer.
If anyone has bothered to look at the details of Glicko2, and the way we use it in Shoddy Battle -- it's really not a great way to represent battle skill for competitive pokemon. But, it's the best we have right now. When we get X-ACT's GLIXARE system in place, maybe that will improve. [...] Glicko2 ratings are a shitty way to measure pokemon battle skill -- but it's the best measurement we have right now.
GLIXARE is just a different rating estimate; it is not a different rating system. In particular, it wouldn't affect how rating and deviation have been used throughout this process at all, since it doesn't affect the computation of those quantities. What are the problems with glicko2 that you allude to? It seems like a good rating system to me.
 

Hipmonlee

Have a nice day
Second, you can only guess or hope that they might have done better with the Suspect in an attempt to help your argument. But do you honestly think these two players or anyone else wouldn't use Latios or any Suspect if they felt it would help them not only win, but to gain the SEXP that I even then had hinted you get by using the Suspect? Do you think they would intentionally do a disservice to themselves and their chances of qualifying by not using a Suspect they really felt would give them a competitive advantage? Do you think Stathakis would have risked missing the mark by half a deviation point if he knew using Latios on his team would grant him a definite competitive advantage?
Sure, I mean I don't know them, but they could be lazy, they could battle under some kind of code of honour that prevents them from using pokemon that they believe to be uber, they could have wanted a challenge. It's not hard to come up with reasons for this. I could also go into depth explaining situations where a broken pokemon would be a disadvantage to you when you are laddering, but I think that would be getting away from the topic at hand.

The point is you were guessing and hoping that their reason for not using the suspect was for competitive advantage, when that could very easily not be the case.

And they qualified comfortably because they still actually gained a lot of SEXP despite not using Latios, which contrasts both with the overriding fear that you can't get SEXP if you don't use them and the assumption that people who are inclined to think a Suspect is BL/Uber will be inclined to use it.
So why were you using them as an example of the benefits of SEXP? This is my point: in this case it seems to be trivial.

Wrong. I've presented this to you this way because saying more would divulge more details about the SEXP metrics than I want. Even regardless of that, I specifically said that "I know all of this because of SEXP." You can call it semantics all you want, but I purposely didn't say "I know this by looking at their SEXP" or "their SEXP reflected this phenomenon". Can you conceive of any reason anyone would have to look at anyone's separate win-loss ratios with and without a Suspect? Why would Doug, the only person privy to such numbers before I had the idea of Suspect Experience, have any cause to dig through millions of cells of battle data to look at something like that?
Isn't that what you have done several times in this thread? I would think a person would do it to check if a person is lying about their usage in their submittal.

Are you guys finally getting the idea that maybe I actually have thought about the metrics of the Suspect EXP formula for a long, long time, and that maybe the rest of you haven't thought of things like why I would possibly be able to reference any player's separate win-loss ratios with and without a Suspect at a glance, and why I think that might matter?
I don't trust your judgement, Jumpman. How could I, when you and I disagree about basically everything? A lot of things both of us have thought about over long, long periods of time. Why would I assume I would agree with you now? Especially when all the evidence seems to point to the fact that SEXP is a bad idea.


You take issue with it because you don't know what it is. As was belabored on IRC 15 hours ago, there are reasons Doug and I and now the UU Suspect Test heads are going to keep the formula private, among them the ability to easily game the system by knowing the metrics. The popular rebuttal to that has been "any metric that can be gamed is a poor metric", to which I would remind our community that every single Suspect Test ever has been more easily gamed without SEXP than with.

The opposition to the concept of SEXP seems to be forgetting how easy it was for people to actually vote. When we did bold voting as the only requirement for Wobbuffet and Deoxys-S, one could easily vote without having played on the Ladder with or against these Suspects, a concern that's completely addressed by SEXP. When Rating/Dev was the only requirement for Garchomp, one could easily vote without having any experience with or against Garchomp (which was kind of impossible since Garchomp wasn't actually on the Suspect Ladder for that month but my point stands), a concern that's completely addressed by SEXP. When a combination of both bold voting and Rating/Dev was implemented for Deoxys-S again and Shaymin-S, one could still vote without having picked up concrete experience on the Suspect, a concern that's completely addressed by SEXP. SEXP has improved upon the major flaws of every single iteration of the Suspect Test Process.
Let's not use the Shaymin test as an example because that was a disaster anyway. There was no way any voter had insufficient experience of Garchomp or Deoxys. They were really everywhere at the point they were tested.

Wrong. The point of the Suspect Test is to canvass the opinion of those who experience the Suspect. Otherwise we would still be using bold voting and Rating/Deviation as our only requirements. We want our ruleset to be based on people who know what they're talking about, not people who are good at laddering.
But if you are good at laddering you must know what you are talking about. Using a suspect a lot doesn't have anything to do with competence. When you ladder in a metagame where the suspect exists, your battles are affected by that suspect even when it is not used in your battle.

For instance, consider stall teams and Shaymin. There is really not much reason to use Shaymin on a stall team. So if you encourage people to use Shaymin, you are effectively encouraging people to not use stall. In doing so you are biasing the suspect test result away from the likely result of a stall player's vote.

Even if the SEXP concept was hidden, you are still biasing the list of final voters away from people who use stall teams, by adding a number of battlers at the end of the test, all of whom, we can assume, don't use stall.

The way you and everyone have thrown around bias with a pejorative connotation has me and Doug baffled. Yes, technically, we are biased. Biased towards the opinions of those who experience Suspects, and not those who can ladder up without using or seeing the Suspect. Hell, I'm also biased when reading voting submissions. Biased towards people who can demonstrate a grasp of the English Language and the fundamentals of a convincing, logical argument. You guys are regarding a "bias" towards those who we objectively see are experiencing the suspect more than others like it is inherently a bad thing.
The problem isn't that you are biased; the problem is that people who use a pokemon are going to be more likely to vote it uber. Or perhaps not; perhaps the majority of people battle with a code of conduct that prevents them from using ubers, in which case people who use a suspect are going to be more likely to vote OU. It doesn't matter. What does matter is that you are changing the voter pool based on a criterion that is likely to affect how people vote. There is no reason I can think of that people who speak English are more likely to vote one way or another. A bias towards English speakers exhibited by you will not bias the outcome of the vote. A bias toward people who battle a certain way will.

I won't keep going into finer details of experience. Hopefully, you get my point. I mention it here again (we've said it many times before) because Jumpman and I have continually referred to "experience" with relation to SEXP, and yet many people in this thread and other places continue to imply that SEXP = "Using a pokemon". And because these implications are being repeated over and over by people (CIM, SDS, etc -- I'm looking at you) -- members of the community reading these discussions now accept it as fact that "SEXP = Using a pokemon". Well, it is not true.
Well, perhaps you can assure me that this isn't an issue, but if a person uses the suspect on their team, then they must have more uses of the suspect than a battler with similar results who didn't use it.

Whether or not you have a minimum number of uses requirement, or some kind of logarithmic multiplier, or whatever the hell you use, at some point along the line you will be adding battlers who used the suspect while not adding other battlers with similar records who did not use it. Otherwise, it seems to me that suspect experience is completely trivial. I.e., if you aren't adding people who used the suspect but not similar-level battlers who aren't using the suspect, then you aren't doing anything with SEXP except checking for liars, which you have said is not what SEXP is for. This is why I am talking about you adding voters who used the suspect.

In doing so you are adding a number of voters to the voter pool, based on a criterion that will have some correlation to how these voters will vote. This is adding a bias, and this is what I have a problem with.
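
As a rough arithmetic sketch of this argument (not of the actual SEXP process): if the added voters vote uber at a different rate than the base pool, the pooled result moves; if they vote at the same rate, it does not. Every number below is made up for illustration, and whether such a correlation really exists is exactly what is being disputed in this thread. The `pooled_share` helper is hypothetical.

```python
from fractions import Fraction


def pooled_share(groups):
    """Overall share of 'uber' votes across groups.

    groups: list of (number_of_voters, share_voting_uber) pairs, with the
    share given as a Fraction so the arithmetic stays exact.
    """
    total_voters = sum(n for n, _ in groups)
    uber_votes = sum(n * share for n, share in groups)
    return uber_votes / total_voters


# Hypothetical base pool: 100 voters, half of whom vote uber.
base = [(100, Fraction(1, 2))]

# Hypothetical added group: 20 voters, 70% of whom vote uber.
skewed = base + [(20, Fraction(7, 10))]

# Hypothetical added group that votes exactly like the base pool.
neutral = base + [(20, Fraction(1, 2))]

print(pooled_share(base))     # 1/2
print(pooled_share(skewed))   # 8/15
print(pooled_share(neutral))  # 1/2
```

With these invented figures the skewed addition moves the result from 50% to about 53%; the size of any real shift depends entirely on how many voters are added and how strongly the criterion correlates with the vote.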

We don't disclose the exact reasons for rejecting every voter's paragraph logic either -- and no one has ever complained about that at all.
Just to clarify, I am even more opposed to voter paragraphs than I am to suspect experience. I don't complain because I have given up.

SEXP is no different than some of these other unpublished aspects to voter qualification. Yet for some reason, you people have started flag-waving about "admin secrets", as if it's some big cover-up. And personally I'm sick of it. I'm sick of people ignoring published statements about what "suspect experience" means. And I'm sick of people beating conspiracy drums about the things that aren't published. Suspect EXP is a useful tool, that's all. It's not a perfect tool, but nothing in the tiering process is perfect.
SEXP is different to cheat detection at least, because it seems like it is a bad idea regardless of how you are implementing it.

The only time SEXP seems to have any significance is when you have a test where a large number of people don't use the suspect at all but then go on to vote uber. It seems like SEXP is unnecessary in this case while you have the paragraph test. The obvious result of a test like this one is that the pokemon will drop to OU unless people have a very good argument it should be uber. That seems to be a case of the test working perfectly. Honestly, I really can't imagine a situation where SEXP is necessary.

Have a nice day.
 

cim

happiness is such hard work
Hipmonlee basically touched on everything I wanted to add (keep up the good fight kind sir ^_^), but one thing.

None of the UU judges know the SEXP formula. How are they supposed to know anything about the number they get back and what it means? It would be like looking at a cube and saying it is 28139 big, when you don't know units or if they're referring to volume or surface area or whatnot. I don't see how any of them can be comfortable judging people based on a number that can't mean anything to them other than the relative size of the numbers to each other, which would be at best poor.
 

Hipmonlee

Have a nice day
Just thinking about it: if there was a situation where the majority of voters were following a code of conduct, then there might be a case for having some kind of suspect experience to address an existing bias.

However, I don't think that that is the case, at least not often enough to make some kind of suspect experience qualification justifiable. Even if the suspect experience is a minor adjustment, the fact that people will change their teams because of it means its impact will be exaggerated to way beyond the point of any existing biases.

Also, I still think it would be a far better idea to just explain to people where you think this bias exists and trust them to test in good faith (something that hasn't been tried yet, to my knowledge, or was only tried in the Garchomp and Deoxys votes, which seem to have been the most convincing results thus far).

Honestly, all this complexity just seems to be a massive overreaction to 5krunners voting, and yet, I don't even think it's a particularly big deal. Yes, you will always have some noise in a vote from people voting for shitty reasons, but why not just accept that and try to have big enough voter pools to counteract that? It would just be so much easier for everyone.

Have a nice day.
 

Jumpman16

np: Michael Jackson - "Mon in the Mirror" (DW mix)
Sure, I mean I don't know them, but they could be lazy, they could battle under some kind of code of honour that prevents them from using pokemon that they believe to be uber, they could have wanted a challenge. It's not hard to come up with reasons for this. I could also go into depth explaining situations where a broken pokemon would be a disadvantage to you when you are laddering, but I think that would be getting away from the topic at hand.
You're reaching and you know it. I could say the same things about people who will only use a Suspect because it's a Suspect and it will definitely give them a competitive advantage, but that is pointless and baseless conjecture. The point is that Chris is wrong to say that people who feel a pokemon is uber or BL will use it, and you're silly to defend him when you know I'm right.

The point is you were guessing and hoping that their reason for not using the suspect was for competitive advantage, when that could very easily not be the case.
What difference does it make, Hip? Besides the fact that you are grasping at straws to think of reasons that the two didn't use Latios that are actually more reasonable than the possibility that they felt they had an advantage by not using Latios, "the point is that Chris is wrong to say that people who feel a pokemon is uber or BL will use it, and you're silly to defend him when you know I'm right."

So why were you using them as an example of the benefits of SEXP? This is my point: in this case it seems to be trivial.
They were both able to amass a very respectable amount of SEXP without using Latios. Aside from Doug's subtle warning that people should stop bitching about the formula, you probably should have read his post before responding to me for this reason alone. Doug clearly stated: 'And don't give me that bullshit, "Well, you won't disclose exactly what is in SEXP, so I don't really know what is in it." That is NOT an excuse to continue to proliferate the lie that "SEXP = Using a pokemon".' This has been the main gripe about SEXP aside from the recent questioning of its integrity.

Isn't that what you have done several times in this thread? I would think a person would do it to check if a person is lying about their usage in their submittal.
I'm talking specifically about "separate win-loss ratios with and without a Suspect", not using SEXP to catch cheaters. Did you really not realize that I would bring this up to underline one of the many other uses SEXP and how it is calculated has?
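
For readers wondering what "separate win-loss ratios with and without a Suspect" could look like in practice, here is a minimal Python sketch. It is only an illustration of the idea of splitting a record by suspect usage; it is not the SEXP calculation, which is not published, and the input format and the `split_record` name are assumptions.

```python
def split_record(battles, suspect):
    """Split one player's win-loss record by whether the suspect was on their team.

    battles: iterable of (team, won) pairs, where team is a set of Pokemon
    names and won is True or False. This input shape is hypothetical.
    Returns {"with": (wins, losses), "without": (wins, losses)}.
    """
    record = {"with": [0, 0], "without": [0, 0]}
    for team, won in battles:
        key = "with" if suspect in team else "without"
        # Index 0 holds wins, index 1 holds losses.
        record[key][0 if won else 1] += 1
    return {key: tuple(value) for key, value in record.items()}
```

The same split could be keyed on whether the opponent used the suspect instead, which is the comparison mentioned earlier in the thread for the Latios test.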

I don't trust your judgement, Jumpman. How could I, when you and I disagree about basically everything? A lot of things both of us have thought about over long, long periods of time. Why would I assume I would agree with you now? Especially when all the evidence seems to point to the fact that SEXP is a bad idea.
This is an incredibly foolish thing to say, Hip. Besides the fact that there is no way you've thought about SEXP as long as I have, which makes your comparison moot, you have indirectly stated that you do not trust Doug's judgment in allowing SEXP to be used for OU, UU and CAP for eight months now. I honestly don't care if you agree with me, but to come out and state that you don't trust my judgment is beyond appalling.

You may want to read the things I'm about to list as back-patting, but I assure you I have no remote desire to toot my own horn to prove my superiority or brilliance. So that being said: I have read the submissions of hundreds of people in a year of the Suspect Test, and decided who should vote and who should not for every test but Garchomp's, which was just Rating/Dev. I spent seven months encouraging discussion on and shaping the characteristics of uber that Tangerine finalized, characteristics that both OU and UU now follow. I have started and led the discussion on how the tests should be conducted, which allowed Jabba to come up with his fantastic ideas that are now Stages 1 and 2, and I then came up with Stage 3 by myself. And I thought of the idea of SEXP in a talk with Aeolus, and then took it as my own responsibility to follow through on the idea and make the SEXP formula myself before showing it to Doug.

If you don't trust my judgment, Hip, how on earth can you think that our tiers now are anything but a sham? Why would you even waste your time associating with the Suspect Test Process or anything to do with Policy on competitive pokemon? What you've just said extends far beyond the aim of using SEXP for the UU tests—you have just stated that you are not behind anything in competitive pokemon that I have touched. You realize how serious that is, right?

Let's not use the Shaymin test as an example because that was a disaster anyway. There was no way any voter had insufficient experience of Garchomp or Deoxys. They were really everywhere at the point they were tested.
Garchomp's test, as I stated, was conducted without Garchomp on the Suspect Ladder. To say that there's no way anyone would have had insufficient experience with it is at once false. There are many, many people who stopped playing Standard because they didn't want to play OU with Garchomp in it, but were eager to play the Suspect Ladder not only to play OU without Garchomp but in order to vote. DXS had been in standard for a long time, but you are purposely downplaying how easily SEXP makes it so that we do not have to have played with a Suspect on the Standard Ladder for 4-5 months before everyone gains experience on it.

The Skymin test failed in part because there was no metric in place to ensure the voters had sufficiently experienced it. To throw that whole test out because ratings weren't reset without saying one way or another whether the presence of SEXP—the point of the debate—would have helped it is kind of cheap. But fine, we don't have to use the Skymin test as an example, I listed it for completeness's sake. Are you going to continue to ignore other points I make that I can only therefore assume you concede?

But if you are good at laddering you must know what you are talking about. Using a suspect a lot doesn't have anything to do with competence. When you ladder in a metagame where the suspect exists, your battles are affected by that suspect even when it is not used in your battle.
I don't know how many times I have to state that this is fundamentally wrong. I'm actually pretty sure most people don't agree that mere ladder prowess dictates that someone knows what they're talking about regarding a Suspect. It's so wrong I'm not even going to waste my time explaining it again.

For instance, consider stall teams and Shaymin. There is really not much reason to use Shaymin on a stall team. So if you encourage people to use Shaymin, you are effectively encouraging people to not use stall. In doing so you are biasing the suspect test result away from the likely result of a stall player's vote.
Or a stall player can use a stall team that beats Skymin pretty easily, which isn't out of the realm of possibility at all with how worn down Skymin gets with Stealth Rock, Sandstorm and a simple Calm Zapdos. Considering that SR was the #3 most used move, and Zapdos was #4 in usage on the Standard Ladder where Skymin was tested, ahead of Blissey at #5, and that Tyranitar was #9, this is a reasonable "theorymon" that's actually steeped in fact. And since this stall player is actually winning the battles, he's gaining a comparable amount of SEXP to the people who are inclined, for whatever reason, to use Skymin but lose to his stall team.

And again, you say "stall player" just like last year when you said that you didn't like DXS because it didn't let you use offensive teams. This is competitive pokemon, and on the Suspect Ladder, you are going to have to adapt to win even if you prefer one style of play. That doesn't really have anything to do with SEXP, though, so whatever.

Even if the sexp concept was hidden, then you are still biasing the list of final voters away from people who use stall teams, by adding a number of battlers at the end of the test, all of whom, we can assume, don't use stall.
No, you can't assume that, for reasons I have already stated. Even if your assumption that stall players would never use Skymin were correct, this does not explain why players who don't use the Suspect are able to make Upper Requirement numbers as well as amass a lot of SEXP. The people who make the SEXP cutoff had to actually have a decent record in battles that featured the Suspect in order for Aeolus and me to place stock in their vote.

The problem isnt that you are biased, the problem is that people who use a pokemon are going to be more likely to vote it uber. Or perhaps not, perhaps the majority of people battle with a code of conduct that prevents them from using ubers, in which case people who use a suspect are going to be more likely to vote OU. It doesnt matter, what does matter is you are changing the voter pool based on a criteria that is likely to affect how people vote. There is no reason I can think of that people who speak English are more likely to vote one way or another. A bias towards english speakers exhibited by you will not bias the outcome of the vote. A bias toward people who battle a certain way will.
Good thing Doug went out of his way to explain once and for all that using a Suspect is not the be-all and end-all of SEXP before you posted. And you missed my point about the grasp of the English language, as there are several voters on either side of the uber/ou line who have a lot of trouble forming convincing arguments, or at least following the directions that Aeolus and I put forth after every test, which means I am much more likely to throw them out. I didn't mean literal English speakers.

None of the UU judges know the SEXP formula.
They all know the SEXP formula, as Doug and I spoke with the UU heads about it yesterday. Are you guys trying to set some kind of shitty assumption record or something? You ignore the entirety of the post where I respond to you directly to post one single small issue you have that's based on an assumption that is flat out wrong? Really?
 

cim

happiness is such hard work
is a Contributor Alumnusis a Smogon Media Contributor Alumnus
They all know the SEXP formula, as Doug and I spoke with the UU heads about it yesterday. Are you guys trying to set some kind of shitty assumption record or something? You ignore the entirety of the post where I respond to you directly to post one single small issue you have that's based on an assumption that is flat out wrong? Really?
I absolutely don't see the need for disrespect when I clearly didn't ignore the post, but merely said that Hipmonlee addressed all of my major points. Frankly because you berated me in #is for saying "me too" all the time I thought it would be nice not to repeat someone else's points, but apparently there's no way to be in the right and avoid attacks here. I'm rather insulted by your demeanor and the way you're treating me for what's honestly a good faith disagreement and an honest attempt to find the best possible solution.

As of the previous day the UU heads did not know about the post. I'm sorry you think that not assuming things changed in a 24 hour period is a "shitty assumption", when there was absolutely no way for me to find out and you were beyond adamant about not letting anyone touch the formula with a ten foot pole a day prior. Honestly there's disagreement and I've made every attempt in this thread to stay 100% civil but I really do not wish to debate anything when I get berated as a response to posting a point (no matter how "bad" you may see it as).
 

Jumpman16

np: Michael Jackson - "Mon in the Mirror" (DW mix)
is a Site Content Manager Alumnusis a Top Team Rater Alumnusis a Battle Simulator Admin Alumnusis a Smogon Discord Contributor Alumnusis a Researcher Alumnusis a Top Tiering Contributor Alumnusis a Top Contributor Alumnusis an Administrator Alumnus
Don't post shitty assumptions that comprise 100% of your post and you won't be "berated". It's that simple.

And I don't even know why you would think that Doug and I wouldn't eventually bring them up to speed before SEXP would be used in the UU test anyway. That's an even worse assumption than us possibly not having told them what SEXP was yesterday, and an insult to the intelligence, foresight, and flexibility of both Doug and myself.
 

cim

happiness is such hard work
is a Contributor Alumnusis a Smogon Media Contributor Alumnus
Now that's a stretch and you know it. You can't be actually insulted by that or have gone "wow Chris, what nerve he has to disrespect us to assume we wouldn't do that" when the post was worded more or less to ask of the UU judges how they can have an opinion on using the formula rather than in any way calling you guys inept, incompetent, or whatever. You're just trying to cop out.

I really don't want to be a part of a community or discussion or whatever where it's okay for staff to berate and insult someone because they're "wrong". It's that simple.
 

Hipmonlee

Have a nice day
is a Community Contributoris a Senior Staff Member Alumnusis a Smogon Discord Contributor Alumnusis a Tiering Contributor Alumnusis a Top Contributor Alumnusis a Battle Simulator Moderator Alumnusis a Four-Time Past WCoP Champion
You're reaching and you know it. I could say the same things about people who will only use a Suspect because it's a Suspect and it will definitely give them a competitive advantage, but that is pointless and baseless conjecture. The point is that Chris is wrong to say that people who feel a pokemon is uber or BL will use it, and you're silly to defend him when you know I'm right.
Whether he was wrong or not to say that doesn't change his point. Ok yes he didn't use semantically flawless english, but you haven't actually addressed his point, which is that you are artificially encouraging the use of suspects. You have only addressed his semantics. I'm trying to explain his point (at least as I see it), not defend him.

They were both able to amass a very respectable amount of SEXP without using Latios. Aside from Doug's subtle warning that people should stop bitching about the formula, you probably should have read his post before responding to me for this reason alone. Doug clearly stated: 'And don't give me that bullshit, "Well, you won't disclose exactly what is in SEXP, so I don't really know what is in it." That is NOT an excuse to continue to proliferate the lie that "SEXP = Using a pokemon".' This has been the main gripe about SEXP aside from the recent questioning of its integrity.
I never said that.

I'm talking specifically about "separate win-loss ratios with and without a Suspect", not using SEXP to catch cheaters. Did you really not realize that I would bring this up to underline one of the many other uses SEXP and how it is calculated has?
I dont care about the many other uses, I care about the main use, which as far as I can tell is distorting our vote results.

This is an incredibly foolish thing to say, Hip. Besides the fact that there is no way you've thought about SEXP as long as I have, which makes your comparison moot, you have stated that you indirectly do not trust Doug's judgment by allowing SEXP to be used for OU, UU and Cap for eight months now. I honestly don't care if you agree with me, but to come out and state that you don't trust my judgment is beyond appalling.
I dont even know Doug, I've spoken to him like maybe twice in the whole time he has been at smogon.

What I mean here is the fact that I know you have thought about something for a long, long time doesnt mean I should accept it as a good idea. Perhaps I didnt choose the right words, but honestly I dont think I implied any more than what I just explained. If you think I did, then my word choice was poor.

Also to say I dont trust someone's judgement doesnt mean that I think everything they have done is wrong. I dont agree with a lot of your decisions regarding the tiering process, you know that. So, when you make a new decision, I dont see why I would trust it just because you have thought about it a long, long time, when in the past, you have made decisions after thinking about them a long, long time that I have disagreed with.

If you don't trust my judgment, Hip, how on earth can you think that our tiers now are anything but a sham? Why would you even waste your time associating with the Suspect Test Process or anything to do with Policy on competitive pokemon? What you've just said extends far beyond the aim of using SEXP for the UU tests—you have just stated that you are not behind anything in competitive pokemon that I have touched. You realize how serious that is, right?
No you have misunderstood me. I dont need to trust your judgement to support those things. I can make judgements on them myself. And this isnt a case of me saying "you need to make the sexp formula public" honestly, I dont really care about that. What I care about is the fact that it exists in the first place.

chaos suggested I make a new thread where I recommend going back to the basics with this testing process. So I am going to do that. But I was halfway through this post and I thought I should clarify the point about judgement, and it seems like a waste to delete the rest of it.

Have a nice day.
 

X-Act

np: Biffy Clyro - Shock Shock
is a Site Content Manager Alumnusis a Programmer Alumnusis a Smogon Discord Contributor Alumnusis a Top Researcher Alumnusis a Top CAP Contributor Alumnusis a Top Tiering Contributor Alumnusis a Top Contributor Alumnusis a Smogon Media Contributor Alumnusis an Administrator Alumnus
If anyone has bothered to look at the details of Glicko2, and the way we use it in Shoddy Battle -- it's really not a great way to represent battle skill for competitive pokemon. But, it's the best we have right now. When we get X-ACT's GLIXARE system in place, maybe that will improve. But, for now, we are using the Glicko2 ratings to identify skilled battlers.
If you're using the Glicko2 ratings and not the conservative rating estimates, then those are good ways to represent battling skills. Between the two lines below is an explanation of how this representation works:

---------------------------------------

Forget rating deviations for now, and just assume that they are zero (meaning the ratings are perfectly certain, which is impossible in practice, but this is basically what the Elo rating system assumes, on which the Glicko and Glicko-2 systems are based). If someone is rated, say, 1600, then his or her partial score Q is equal to 10^(1600/400) = 10^4 = 10000. Another player rated, say, 1500, would have partial score 10^(1500/400) = 10^3.75 = 5623.41.

What do these partial scores represent? They represent the relative chances by which a player is expected to win against another player. In our example above, the 1600 player has 10000/5623.41 = 1.778 times the chance of winning that the 1500 player has. Conversely, the 1500 player has 5623.41/10000 = 0.562 times the chance of winning that the 1600 player has. So, in 2778 games against the 1500 player, the 1600 player is expected to win 1778 of them and lose 1000 of them.

This means that the 1600 player is expected to win 1.778 / (1.778 + 1) = 1.778 / 2.778 = 0.64 of the games, or 64%, played against the 1500 player, while the 1500 player is expected to win 0.562 / (0.562 + 1) = 0.562 / 1.562 = 0.36 of the games, or 36%, played against the 1600 player. (Note that 64% + 36% = 100%.)
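
The arithmetic above can be checked with a few lines of Python. This is only a minimal sketch under the same assumption of zero rating deviations; the names `partial_score` and `win_share` are made up for illustration and do not come from Shoddy Battle or from Glickman's papers.

```python
def partial_score(rating: float) -> float:
    """Partial score Q = 10^(rating / 400), as described above."""
    return 10 ** (rating / 400)


def win_share(rating_a: float, rating_b: float) -> float:
    """Expected fraction of games player A wins against player B.

    Computed as Q_A / (Q_A + Q_B), which is the same as turning the
    ratio Q_A / Q_B into a share of all games played.
    """
    q_a = partial_score(rating_a)
    q_b = partial_score(rating_b)
    return q_a / (q_a + q_b)


if __name__ == "__main__":
    print(partial_score(1600))             # Q for the 1600 player
    print(partial_score(1500))             # Q for the 1500 player
    print(round(win_share(1600, 1500), 2)) # about 0.64
    print(round(win_share(1500, 1600), 2)) # about 0.36
```

The two shares always add up to 1, which matches the 64% + 36% = 100% note above.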

So basically the ratings in the Elo (and Glicko and Glicko-2) systems are on a logarithmic scale, since, to compare one player's chance of winning with another's, we take the base-10 antilog of each rating divided by 400 (Elo probably multiplied the rating by 400 so that it is not a small decimal), and then divide these two numbers by each other.

As we saw above, a difference of 100 in two players' ratings signifies that the higher rated player is expected to win 64% of the games played against the lower rated one, whether their ratings are 1500 and 1600, 2000 and 2100, or 800 and 900. This percentage can be found more quickly using a separate formula thus:

Probability of winning = 1 / (1 + 10^(difference in ratings / 400))

(Note that to find probability that Player A wins against Player B, when you calculate the difference in ratings, the rating of Player A is subtracted from the rating of Player B, not vice-versa.)
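
As a rough illustration of the shortcut formula, here is a small Python sketch. The function name `win_probability` is hypothetical, and rating deviations are again ignored, so this is the plain Elo expectation rather than the full Glicko-2 calculation.

```python
def win_probability(rating_a: float, rating_b: float) -> float:
    """Probability that player A beats player B.

    The difference is taken as B's rating minus A's rating, as the
    note above specifies, so a higher-rated A gets a negative
    exponent and a probability above 0.5.
    """
    difference = rating_b - rating_a
    return 1 / (1 + 10 ** (difference / 400))


if __name__ == "__main__":
    # A 100-point gap gives the same result wherever it sits on the scale.
    for high, low in [(1600, 1500), (2100, 2000), (900, 800)]:
        print(high, low, round(win_probability(high, low), 2))
```

Only the difference between the two ratings enters the formula, which is why the 100-point gap gives about 64% in all three cases.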

This is essentially the same formula used by Glickman in his Glicko and Glicko-2 rating systems, except that he also introduced a measure of uncertainty in the ratings, called the rating deviation, which skews the probabilities produced by the formula above depending on how certain the two ratings are.

-------------------------------------------------

As far as I'm concerned, the only reason why the ratings of the players produced by the Glicko-2 system aren't reflecting the true capabilities of the players is that the players have alt accounts. This is the only thing that is hindering us from having a good measure of a player's playing prowess. You know that I have always advocated that every player have only one account, and I'll repeat it again. I know that this will once again fall on relatively deaf ears, but, if you really want a rating system that truly reflects the capabilities of a player, you need to ban the use of alt accounts.
 

Hipmonlee

Have a nice day
is a Community Contributoris a Senior Staff Member Alumnusis a Smogon Discord Contributor Alumnusis a Tiering Contributor Alumnusis a Top Contributor Alumnusis a Battle Simulator Moderator Alumnusis a Four-Time Past WCoP Champion
That concern assumes people always battle to the best of their abilities as well, doesn't it? Which prevents them from testing on the ladder, which is a big part of what the ladder is used for. Particularly the suspect ladders, which were basically created for that single purpose.

Have a nice day.
 

DougJustDoug

Knows the great enthusiasms
is a Site Content Manageris a Top Artistis a Programmeris a Forum Moderatoris a Top CAP Contributoris a Battle Simulator Admin Alumnusis a Smogon Discord Contributor Alumnusis a Top Tiering Contributor Alumnusis an Administrator Alumnus
I really don't want to get too far into a debate about our ratings, but since Colin and X-Act have both mentioned my quote about Glicko ratings, I will clarify it.

I said:
If anyone has bothered to look at the details of Glicko2, and the way we use it in Shoddy Battle -- it's really not a great way to represent battle skill for competitive pokemon.
I bolded part of my quote, because I don't think it was clear that it is very important to my overall criticism of our ratings. That bold part is a vague reference to the fact that Shoddy Battle allows players to make alts, which undercuts the fundamental assumption that you are rating an identifiable player over time. I think the current system of alternate accounts on Shoddy significantly degrades the quality of our ratings. I have talked about this many times in the past. X-Act mentioned this exact same gripe at the end of his post, so I know I am not the only one who thinks our current ratings have huge problems.

And, as I understand it, the GLIXARE ratings estimate has been geared to allow ratings estimates to change more quickly, which will hopefully remove one of the motivations for players to make alts. That is the reason I said this:
When we get X-ACT's GLIXARE system in place, maybe that will improve.
I hope this clarifies what I meant.
 

X-Act

np: Biffy Clyro - Shock Shock
is a Site Content Manager Alumnusis a Programmer Alumnusis a Smogon Discord Contributor Alumnusis a Top Researcher Alumnusis a Top CAP Contributor Alumnusis a Top Tiering Contributor Alumnusis a Top Contributor Alumnusis a Smogon Media Contributor Alumnusis an Administrator Alumnus
That concern assumes people always battle to the best of their abilities as well, doesn't it? Which prevents them from testing on the ladder, which is a big part of what the ladder is used for. Particularly the suspect ladders, which were basically created for that single purpose.

Have a nice day.
Well, there are two facets to testing. One is to test a game to see if it is balanced. The other is to test a strategy to see if it works in an already balanced game. Let me give an example. Test One would be like testing a variant of chess in which pawns can also capture the piece in front of them, to see if this makes the game of chess more balanced. Test Two would be like testing a new variation to the Sicilian defense in chess. In test one, we alter the rules and we test them for balance; in test two, we leave the rules as they are and we test a strategy to see if it works.

Tests like test two should never be done on the ladder in which people are assumed to always battle to the best of their abilities. A competitive chess player doesn't test a new variation in a tournament; he tests and studies it in private.

In the case of us testing Pokemon to determine their tiers, that is more akin to test one, though. In this case, we aren't playing the 'true' game, but we're still determining whether the game is 'true' or not. In the case of UU, the true UU metagame hasn't been determined yet, so actually nobody has been playing the real UU yet. Hence the ratings only show a player's prowess in the current testing metagame only (assuming no multiple accounts are used). As soon as the metagame changes by removing and/or adding new Pokemon, the ratings should automatically be reset, to reflect that we are now testing a completely different metagame, where the previous ratings don't mean anything.
 

Jumpman16

np: Michael Jackson - "Mon in the Mirror" (DW mix)
is a Site Content Manager Alumnusis a Top Team Rater Alumnusis a Battle Simulator Admin Alumnusis a Smogon Discord Contributor Alumnusis a Researcher Alumnusis a Top Tiering Contributor Alumnusis a Top Contributor Alumnusis an Administrator Alumnus
Whether he was wrong or not to say that doesn't change his point. Ok yes he didn't use semantically flawless english, but you haven't actually addressed his point, which is that you are artificially encouraging the use of suspects. You have only addressed his semantics. I'm trying to explain his point (at least as I see it), not defend him.
He already contradicted himself on that "point" on IRC three nights ago. When I asked why SEXP actually wouldn't be better in UU than in OU since in UU you actually won't know the Suspects until after the SEXP is used, Chris said that it would affect the actual metagame and people will still know what is going to be nominated and voted a suspect and therefore it would "contaminate the data". I said that this is the way it's done now in OU Suspect Tests, and both he and SevenDeadlySins said that the UU test is different. When I asked how, Chris said "because in uu you don't know the suspects", which I immediately called out as a contradiction and SDS shortly agreed (relevant because he's also debating against me in addition to, if anything, being biased towards supporting Chris).

No matter what, the UU metagame is going to be artificially altered or whatever you want to call it, because you don't need the looming spectre of SEXP to prompt people to whisper and talk about what's going to be nominated and what's probably going to be a suspect. I made my point about starting from the beginning with the "six deadly suspects" for a reason—if you want to see retroactive SEXP numbers on those pokemon in UU, you will see just how much people were influenced to use these pokemon without it having anything to do with the influence of SEXP.

And Hip, you are the only one who maintains that the way we run the OU Suspect Test process with Rating/Dev and bold voting is still bad regardless of SEXP, so this isn't even worth my time again until you, instead of just making a laundry list of flaws that need to be cleaned by someone, clearly flesh out, as I specifically asked in my "Bold Voting and Rating/Deviation requirements—A happy medium?" thread, what you would actually do to improve the entire process that hasn't already been addressed by me, Aeolus, X-Act and Doug.

I never said that.
Come on, Hip, these are your own words:

You are adding a group of players to the voter pool who werent able to make the rating/deviation requirements based on the fact that they used the suspect in all of their battles
You said that on this very page, less than 27 hours prior to this denial. Did you choose your words poorly again or are you really that forgetful? Either way I'm tired of arguing with you on this, because you're not only making me repeat myself but also making me repeat yourself, and I have better things to do than to copy/paste people's own contradictions to make my own points.

I dont care about the many other uses, I care about the main use, which as far as I can tell is distorting our vote results.
Doug and I have belabored this point enough. Besides, you stated last year that people were throwing matches because their deviation was too low. X-Act, Doug and I addressed this by not only doing away with the Upper Requirement but adding a group of more qualified voters through SEXP.

I dont even know Doug, I've spoken to him like maybe twice in the whole time he has been at smogon.

What I mean here is the fact that I know you have thought about something for a long, long time doesnt mean I should accept it as a good idea. Perhaps I didnt choose the right words, but honestly I dont think I implied any more than what I just explained. If you think I did, then my word choice was poor.
You don't "know" me either, and you and I have a terrible history so you should know better than to say "I don't trust your judgment" to me without wondering how I will perceive it. To say that is considerably disrespectful, and I already told you in private last year that you should not address me if you don't feel you can respect me. You decided to disregard this and pick another public fight with me, and that's stupid. Especially because you got so incredibly offended, when I suggested last year that your refusal to offer any improvements indicated pessimism with the Bold Voting process, that you resorted to name-calling for the second time. You either have a dangerously bad memory or you are deliberately instigating.

Also to say I dont trust someone's judgement doesnt mean that I think everything they have done is wrong. I dont agree with a lot of your decisions regarding the tiering process, you know that. So, when you make a new decision, I dont see why I would trust it just because you have thought about it a long, long time, when in the past, you have made decisions after thinking about them a long, long time that I have disagreed with.
Disagreement on a specific issue is a lot different than directly questioning someone's judgment, they aren't even close to being the same thing. This entire forum runs on establishing ideas for policies, and then determining points of disagreement, explaining why there is a disagreement, and working together in order to polish the ideas. If everybody agreed with everything Jabba or Tangerine or Doug or I suggested, we would just have a bunch of followers. Not trusting someone's judgment means that you categorically do not trust any ideas or thoughts they have on policy.

No you have misunderstood me. I dont need to trust your judgement to support those things. I can make judgements on them myself. And this isnt a case of me saying "you need to make the sexp formula public" honestly, I dont really care about that. What I care about is the fact that it exists in the first place.
No, what you really care about is why we are still bold voting in the first place. Doug and I have made painfully clear that SEXP is an objective measure that helps vote talliers with the completely subjective voter paragraph process. So regardless of not knowing what exactly the formula is, it is impossible for you to be on board with SEXP if it's aiding a process you disagree with even more, which is why I wonder why you waste your time fighting a battle in a war you don't even want to participate in.
chaos suggested I make a new thread where I recommend going back to the basics with this testing process. So I am going to do that. But I was halfway through this post and I thought I should clarify the point about judgement, and it seems like a waste to delete the rest of it.

Have a nice day.
Even though you already started to do this in November and then chose not to respond to my responses, I'm guessing you'll post this time. I seriously hope that you won't make me copy/paste those responses when you do.
 

cim

happiness is such hard work
is a Contributor Alumnusis a Smogon Media Contributor Alumnus
He already contradicted himself on that "point" on IRC three nights ago. When I asked why SEXP actually wouldn't be better in UU than in OU since in UU you actually won't know the Suspects until after the SEXP is used, Chris said that it would affect the actual metagame and people will still know what is going to be nominated and voted a suspect and therefore it would "contaminate the data". I said that this is the way it's done now in OU Suspect Tests, and both he and SevenDeadlySins said that the UU test is different. When I asked how, Chris said "because in uu you don't know the suspects", which I immediately called out as a contradiction and SDS shortly agreed (relevant because he's also debating against me in addition to, if anything, being biased towards supporting Chris).
Just wanted to point out that this was one of several reasons the UU test is different, and it's not really fair to cherry pick one of the times I misspeak in an IRC chat about the test when I have a long history of not debating well in a live format in order to parade around a "win" you had on me to prove a point to Hipmonlee. Pointing out one of the admittedly numerous times I spoke hastily or did not think words through in order to try and convey that I'm obviously a fool with less understanding of basic logic than you (out of context at that) is at best unfair.

Oh, and

No matter what, the UU metagame is going to be artificially altered or whatever you want to call it, because you don't need the looming spectre of SEXP to prompt people to whisper and talk about what's going to be nominated and what's probably going to be a suspect. I made my point about starting from the beginning with the "six deadly suspects" for a reason—if you want to see retroactive SEXP numbers on those pokemon in UU, you will see just how much people were influenced to use these pokemon without it having anything to do with the influence of SEXP.


Yeah, except that's natural influence, not artificial. There's a difference between "hey man, why don't you try out babiritar it totally takes people by surprise" and "Oh no, everyone hates Tyranitar, it's going to be a Suspect, I better use it so I can vote OU / Uber since I've never / always had a problem with it". If you really want to claim that the natural progression of a metagame is just as unhealthy for it as mandating that Pokémon be used, a lot, in it, then go ahead, but I doubt anyone would buy that.

The rest is addressed at Hip and frankly he's a better poster than me so whatever.
 

Caelum

qibz official stalker
is a Site Content Manager Alumnusis a Community Leader Alumnusis a Smogon Discord Contributor Alumnusis a Tiering Contributor Alumnusis a Top Contributor Alumnusis a Smogon Media Contributor Alumnusis a Battle Simulator Moderator Alumnus
Just a random update.

Random family stuff came up and I got busy. Voting threads will be up sometime on Friday. Just wanted to say that so people didn't think I died and I didn't know where else to put that anyway.
 
