
March Stats

Discussion in 'Policy Review' started by Antar, Apr 9, 2017.

  1. Antar

    Antar
    is a Battle Server Administrator, is a Programmer, is a Super Moderator, is a Community Contributor
    Official Data Miner

    Joined:
    Feb 17, 2010
    Messages:
    3,885
    In case you haven't been following, the PS server issues mean I can't run my stats scripts on the server, and copying them off is slow as balls.

    If we wait to get all the logs in a place where I can process them, we're looking at a delivery date for the March stats some time in May, even if I just do the usage-based tiers.

    There is a shitty alternative though: running stats over a subsample of the month's battles, maybe 10-100k battles for each tier?

    That might only take a few days to copy over and should process in a few hours.

    The primary issue is going to be figuring out a way to have the subsample be at least somewhat random, but I think that's just a matter of researching functionality in rsync.

    Thoughts?

    Usage stats usually aren't within a tenth of a percentage point of the cutoff, so someone who's taken Stats 101 should be able to tell me what sample size you need to have high confidence in the outcome.
  2. Disjunction

    Disjunction Cool Essence
    is a member of the Site Staff, is a Smogon Social Media Contributor, is a Forum Moderator, is a Community Contributor, is a Tiering Contributor, is a Contributor to Smogon
    Moderator

    Joined:
    Mar 22, 2014
    Messages:
    1,403
    I think the sampling idea is good, but, as anyone who has taken Stats 101 should know, it runs the risk of being a very biased representation of any changes that happen. For the purpose of forming NU, this is probably fine considering our metagame will dramatically change come May regardless of samples, but I don't want this to end up misrepresenting stats for OU, UU, and RU and, consequently, dropping some awful creatures that make the existing councils do any more work than they already need to.

    I think if we went down this route then we should have a higher cutoff for drops and rises to prevent any serious damage sampling error could cause. If people are comfortable with the cutoff we've been using for the quick drops, then I believe that would be the easiest and most recognizable bar we could use.

    I don't have much of an opinion on the sample size, either, as I feel we are slightly in the dark on the process, resources, etc. that are being used here. You mentioned in the usage stats thread that just OU alone would take the next month, but I don't know what that means in regard to UU/RU stats considering they are significantly smaller in size. Would RU's full set of battles be fine to run, sitting at ~100K battles in Feb? Do we have to only run half of UU's battles because they're at 200K? Obviously a bigger sample is better, but we don't want to go so big that we make the stats delay another month anyhow.
    Based Loser, laum and TJ like this.
  3. Bughouse

    Bughouse Like ships in the night, you're passing me by
    is a member of the Site Staff, is a Forum Moderator Alumnus, is a CAP Contributor Alumnus, is a Tiering Contributor Alumnus, is a Contributor Alumnus

    Joined:
    May 28, 2010
    Messages:
    5,578
    What's your capacity to truly pull a random sample? Can you at the very least control for the day the battle occurred on? That's the biggest enemy of randomness here, since trends come and go.
  4. Zarel

    Zarel Not a Yuyuko fan
    is a member of the Site Staff, is a Battle Server Administrator, is a Programmer, is a Pokemon Researcher, is an Administrator
    Creator of PS

    Joined:
    Aug 16, 2011
    Messages:
    3,593
    Sampling bias isn't really that hard to deal with. If you know your sample method, you know what biases it introduces.

    Time of day and date range are the main worries, so a sample such as "every third day" is going to be reasonably free of bias.
    thesecondbest, wishes, nv and 7 others like this.
  5. quziel

    quziel I simulate Pottery
    is a Pre-Contributor

    Joined:
    Nov 23, 2015
    Messages:
    225
    Sorry to interject, but I tried to work out some minimum values for the number of battles needed to estimate the usage percentage at various confidence levels.

    Assumptions

    1. Usage stats around 3.41% are most important for tiering
    2. Margin of error of 0.1% is acceptable
    3. Underlying distribution is large enough to be approximately normal
    4. Confidence level Z values are << n

    Methods

    · Modifying confidence interval for population proportion; most important part of the formula is the difference between upper and lower bounds
    · Choosing population proportion of interest to be p = 0.0341
    · Varying confidence levels of 90% (z* = 1.645), 95% (z* = 1.96), and 99% (z* = 2.575)

    Equations

    [equation image]

    Mistaken Equations

    [equation image]


    Variable definitions

    z* = z-critical value for the chosen confidence level (relates to area under the normal distribution; here 1.645 for 90%, 1.96 for 95%, and 2.575 for 99%)

    p = sample proportion (as we care most about usage near drop points, I'm setting it to 3.41% usage, or p = 0.0341)

    Basically, I wanted to figure out a confidence interval for the true population proportion, so I messed with the formula a bit.

    Mistakes made

    Did a couple of numerical calculations, and found the following N values for Confidence intervals of 90%, 95%, and 99% respectively (sorry, unsure how to make a table).

    Conf. Level | N

    90% | 912
    95% | 126500
    99% | 218500

    These numbers are only approximate, and could vary a bit, but should give a vague idea of the number of battles needed to achieve accuracy within 0.1% and the confidence we can put into those values.


    Edit: It seems that there was a mistake or two in the equations I was using that was only apparent at "low" N. I am currently redoing stats and will update the post once done.

    There's a large chance that I could have gotten some of my working wrong, so if anyone more knowledgeable in statistics could check my work, that would be wonderful. Also, if anyone knows how to properly format the Conf. Level / Min. Battles into a table, could you please PM me?
    Last edited: Apr 10, 2017
    Snou, wishes, Watchog and 15 others like this.
  6. Antar

    Antar
    Zarel, "every third day" is still WAAY too much. I was thinking more along the lines of "battles whose numbers end in 00" (that should give ~20k OU matches). I was also hoping there'd be an easy way to randomize rsync's order of transfer, but if there is one, I've yet to find it.
    icameron, Ununhexium and Plancklength like this.
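    A minimal sketch of the "battles whose numbers end in 00" criterion. It assumes, hypothetically, that each log file name ends with the battle number followed by an extension (e.g. `battle-ou-123400.log.json`); the real naming scheme on the PS server may differ, and `select_sample` is an illustrative helper rather than part of any existing stats script.

```python
import re

# Hypothetical file-name pattern: "<anything>-<battle number>.<extension>".
# The actual log naming on the server is an assumption here.
BATTLE_NUMBER = re.compile(r"-(\d+)\.[^/]*$")


def select_sample(filenames, suffix="00"):
    """Return the file names whose battle number ends in `suffix`."""
    selected = []
    for name in filenames:
        match = BATTLE_NUMBER.search(name)
        if match and match.group(1).endswith(suffix):
            selected.append(name)
    return selected


if __name__ == "__main__":
    # Made-up names purely for illustration.
    names = [f"battle-ou-{n}.log.json" for n in range(1000, 2000)]
    sample = select_sample(names)
    print(len(sample), "of", len(names), "files selected")
```

    If battle numbers are handed out sequentially (also an assumption), this keeps roughly 1 in 100 battles, spread over the whole month. The resulting list could be written to a file and passed to rsync's `--files-from` option instead of relying on wildcards.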
  7. Zarel

    Zarel Not a Yuyuko fan
    Antar, yes, "battles ending in 00" is a good criterion and is not going to cause any sampling biases I can think of. Be mindful of sample size, but most people overestimate the sample size necessary for a good approximation.
  8. phantom

    phantom
    is a member of the Site Staff, is a Forum Moderator, is a Community Contributor, is a Contributor to Smogon, is a Tiering Contributor Alumnus
    RU Co-Leader

    Joined:
    Mar 3, 2013
    Messages:
    1,244
    I don't agree with using a subsample of the stats. If the full stats aren't there, then there shouldn't be any tier shifts for this month or an NU alpha ladder. I don't see the particular urgency in using a fraction of the stats for this, which are likely to differ from the full stats and thus throw UU/RU out of whack with whatever tier shifts result from it. I would understand the need to do this if NU was going into beta this month, but it's not, and NU alpha is not an official tier. Meanwhile, a tier shift with potentially volatile stats will negatively affect tiers that are actually official and have been tiering. I don't think a messy ladder for an unofficial tier is worth putting official tiers in a potentially bad spot. The solution that seems best is to skip over stats for the month, allow RU to exit beta with what it currently has, let the tier shifts that were supposed to occur this month happen the next, and allow NU to skip over alpha and go into beta the following month as well. This would keep mostly everything on course, including not pushing PU back a month, while keeping the other tiers in a stable position instead of gambling with a small sample of stats that could result in some pretty wacky tier shifts.
  9. Raseri

    Raseri Giratina-Addicted
    is a Tutor Alumnus, is a Site Staff Alumnus, is a Battle Server Admin Alumnus, is a Super Moderator Alumnus, is a Community Contributor Alumnus, is a Researcher Alumnus, is a Tiering Contributor Alumnus, is a Contributor Alumnus, is a Smogon Media Contributor Alumnus

    Joined:
    Aug 4, 2007
    Messages:
    6,241
    I would like to use a subsample of the stats. I trust our resident smart people Zarel and Antar on how to best handle it. quziel's confidence intervals look promising too, but I don't know stats very well.

    Either way, the NU playerbase would really appreciate some sort of way to play.
    Hootie, avocado, lolbro and 3 others like this.
  10. Zarel

    Zarel Not a Yuyuko fan
    You have an unusually huge gap between the 90% and 95% confidence levels, so I'm recalculating them.

    upload_2017-4-9_23-28-34.png

    is the approximation function for a Bernoulli sample. I'm not exactly sure what formula you're using instead, but yours looks more complex than necessary.

    Our target p-hat is 0.0341. We want a margin of error of 0.1%, so we want our confidence interval to be p-hat ± 0.001.

    plugging in our numbers, we get

    upload_2017-4-9_23-34-20.png
    upload_2017-4-9_23-37-25.png

    z is 1.644 at 90%, 1.959 at 95%, 2.575 at 99%

    Confidence level | N

    90% | 89,021
    95% | 126,402
    99% | 218,394

    So we got the same numbers for 95% and 99% confidence intervals, but I assume you messed up your 90% confidence interval calculation.
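    The equation images in this post did not survive, but the table is reproduced exactly by the standard normal-approximation margin of error for a proportion, z * sqrt(p-hat * (1 - p-hat) / n) = 0.001, solved for n. The sketch below assumes that is the formula in use; `required_sample_size` is an illustrative name, and the result is rounded to the nearest whole battle because that is what matches the table (rounding up would be the stricter choice).

```python
def required_sample_size(z, p_hat=0.0341, margin=0.001):
    """Battles needed so that z * sqrt(p_hat * (1 - p_hat) / n) equals `margin`.

    Assumes the normal approximation for a Bernoulli proportion.
    """
    return round(z ** 2 * p_hat * (1 - p_hat) / margin ** 2)


if __name__ == "__main__":
    # z values as quoted in the post above.
    for level, z in [("90%", 1.644), ("95%", 1.959), ("99%", 2.575)]:
        print(level, required_sample_size(z))
```

    With the z values quoted above this gives 89,021, 126,402 and 218,394 battles.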
  11. quziel

    quziel I simulate Pottery
    It seems that I forgot to include the (1 + z*^2/n) in the divisor, and I also forgot a square within the root in the numerator. After recalculating, my values agree with yours. Sorry for the mistake, and thanks for catching it.
  12. Sam

    Sam why not seize the pleasure at once?
    is a Battle Server Administrator, is a Tiering Contributor, is an Administrator, is a Community Contributor Alumnus
    Admin Extraordinaire

    Joined:
    Nov 9, 2011
    Messages:
    1,276
    We can use the sampled RU stats to do NU Alpha since it's nothing official (provided they aren't completely whack but it seems that likely won't be the case?). I do agree with Spirit though, we shouldn't actually tier with the sampled stats.
  13. Antar

    Antar
    Honestly, if we had 99% confidence that sampled stats would mirror full stats, it would be stupid not to just switch to sampled stats moving forward, even once the new server is up and running. Imagine usage stats coming out on the 1st or 2nd instead of the 7th-10th. Can you really tell me that wouldn't be worth it?
    avocado, Freeroamer, Calista and 17 others like this.
  14. Anty

    Anty dawn breaks
    is a Site Staff Alumnus, is a Team Rater Alumnus, is a Forum Moderator Alumnus, is a Community Contributor Alumnus, is a Tiering Contributor Alumnus, is a Contributor Alumnus, is a Smogon Media Contributor Alumnus

    Joined:
    Feb 8, 2013
    Messages:
    2,936
    Is there a sort of sample trial you could do with the previous February stats? Like compare the "battles ending in 00" sample stats with the actual ones to see how close they are (you could chi-square test it, I think, or just look and compare). That would be the best way to convince people.
    Calista, HJAD, Aaronboyer and 15 others like this.
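    A rough sketch of the chi-square comparison suggested above: the observed counts from a sample are compared with the counts expected if the sample followed the full month's proportions. The counts in the example are made up for illustration. Usage counts are not independent categories (each team carries six Pokemon), so the statistic should be read as a rough indicator under that simplifying assumption rather than a strict test.

```python
def chi_square_statistic(sample_counts, full_proportions):
    """Pearson chi-square statistic of sample counts against full-data proportions.

    `sample_counts[i]` is how often category i appears in the sample;
    `full_proportions[i]` is its share in the full data (shares sum to 1).
    """
    total = sum(sample_counts)
    statistic = 0.0
    for observed, proportion in zip(sample_counts, full_proportions):
        expected = total * proportion
        statistic += (observed - expected) ** 2 / expected
    return statistic


if __name__ == "__main__":
    # Hypothetical two-category example: 30 vs 70 observed, 25% vs 75% expected.
    print(chi_square_statistic([30, 70], [0.25, 0.75]))
```

    The statistic would then be compared against a chi-square critical value with (number of categories - 1) degrees of freedom; a small value suggests the sample tracks the full stats closely.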
  15. Antar

    Antar
    In a perfect world, yeah. But I'm having enough trouble pulling March's logs. When it takes so long to just list and count the files that I have to just give up, I don't see pulling a validation sample first on another month's logs.

    But if you're talking after we move to the new server, then sure. I'll check then.

    Keep in mind: this is not the first time I've calculated stats with partial months. For a few of the Nintendo competitions I've calculated stats early. I've even calculated usage-based tiers early once or twice to give tier leaders a heads-up about upcoming changes. These have nonrandom samples, missing sometimes a full week of data, and very rarely has anything changed.
  16. Honko

    Honko he of many honks
    is a member of the Site Staff, is a Programmer, is a Contributor to Smogon

    Joined:
    Dec 6, 2009
    Messages:
    1,367
    If we wanna be extra conservative about not making unnecessary tier shifts, while still using a small enough sample of the data to get stats out in a reasonable amount of time, we could always temporarily adjust the cutoffs for rises/drops. Like if you can be 99% confident with a confidence interval of +/- 0.5%, then we only allow rises for >3.91% and drops for <2.91%.

    I don't personally think that's necessary, but it's an option to unblock things for this month if there's strong opposition to trusting the sampled stats.
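    The widened cutoffs can be sketched as follows, assuming the 3.41% cutoff and the +/- 0.5% interval from the example above. `allowed_shift` and its parameters are illustrative names, not part of any existing tiering script.

```python
def allowed_shift(usage_pct, currently_in_tier, margin=0.5, cutoff=3.41):
    """Decide whether a sampled usage figure justifies a tier shift.

    `usage_pct` is the sampled usage in percent, `currently_in_tier` says
    whether the Pokemon already belongs to the tier, and `margin` is the
    half-width of the confidence interval in percentage points.
    """
    if not currently_in_tier and usage_pct > cutoff + margin:
        return "rise"  # clearly above the cutoff even after sampling error
    if currently_in_tier and usage_pct < cutoff - margin:
        return "drop"  # clearly below the cutoff even after sampling error
    return "no change"  # too close to the cutoff to trust a sampled figure


if __name__ == "__main__":
    print(allowed_shift(4.2, currently_in_tier=False))
    print(allowed_shift(3.1, currently_in_tier=True))
```

    Anything that lands between 2.91% and 3.91% simply stays where it is until full stats are available.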
  17. Antar

    Antar
    Well, so far the question is moot because I haven't even managed to pull a subsampled set of files off the server. Apparently rsync doesn't intelligently handle complex wildcards, so I'm going to have to write a decently complicated script.
  18. Peef Rimgar

    Peef Rimgar Other guys'll just feed ya lies! I'll take ya to MICKEY DS!
    is a Pre-Contributor

    Joined:
    Apr 12, 2014
    Messages:
    1,571
    Honestly I would rather just skip this month of stats and put all resources to making sure the same mess doesn't happen next month than to pull a sample for a month that's near half over anyways. By the time this got done it wouldn't even be worth the time you put into writing the script.
    sedertz and Arifeen like this.
  19. Zarel

    Zarel Not a Yuyuko fan
    We are going to have this problem for April as well; it's too late to make sure it won't happen again...

    Now, May, I can relatively safely guarantee it won't be a problem for, if it's any consolation. :p
    Based Loser, Ernesto, Platyp and 3 others like this.
  20. Peef Rimgar

    Peef Rimgar Other guys'll just feed ya lies! I'll take ya to MICKEY DS!
    So then we're in the position where we can't advance tiering reliably for another month and a half, without a guarantee of it the next month? I feel like we're now at the point of "what other tiering option can we take" instead of "how do we fix the stats" then. We can't just do nothing for that long, and the sample stats, even at 99% confidence (which may still be impossible to take), would be questionable. Looking into short term viability tiering or something has to be a consideration at this stage.
    Raseri and Khaytra like this.
  21. Freeroamer

    Freeroamer The greatest story of them all.

    Joined:
    Jul 28, 2014
    Messages:
    1,144
    I think you're being very harsh, particularly when it's been explained that the problem is the sheer mass of the amount of battles being played, hardly a realistic problem to expect to go away...

    No one wants the issue to be solved via sampled stats, but it's a perfectly reasonable alternative to not taking the stats at all. Viability tiering is also an option, of course, but it's my belief that if Smogon didn't initially tier using what their more experienced members believed to be viable, there's a good reason why they wouldn't consider it next to something we could have 95-99% confidence in. Bias seems particularly likely here, as does the poor PR that could result from such a decision.

    e: @ below, apologies if my reasoning is wrong, haven't done stats in awhile, be nice if you could tell us what's wrong instead of a one liner though.
    Last edited: Apr 12, 2017
  22. Texas Cloverleaf

    Texas Cloverleaf meh
    is a Smogon Social Media Contributor Alumnus, is a Forum Moderator Alumnus, is a Community Contributor Alumnus, is a Tiering Contributor Alumnus, is a Contributor Alumnus, is a Smogon Media Contributor Alumnus, is a Battle Server Moderator Alumnus

    Joined:
    Oct 23, 2009
    Messages:
    11,010
    I have the feeling a number of people commenting itt haven't taken stats courses recently
  23. Peef Rimgar

    Peef Rimgar Other guys'll just feed ya lies! I'll take ya to MICKEY DS!
    I'd agree that I'm being very harsh, I just feel like this is a totally avoidable position that we're in. I understand the situation is beyond anyone's control as well. So, why not take the step now and avoid it?
    Assuming you mean the people commenting that the 99% confidence is "iffy", I do understand that it would probably be just fine, but if we're using a debatable and subjective system at that point anyways, why not change it to one that can get working quicker and isn't dependent?
  24. Bughouse

    Bughouse Like ships in the night, you're passing me by
    There are reviews of large government programs funneling billions of dollars around in the US (possibly trillions in the world) that are based on statistical samples... This is not me talking out of my ass - I literally work on one of these at my job. We can do this because well-designed statistical samples work.

    I think we can use them when necessary on a Pokemon site (and frankly wouldn't be opposed to replacing full population stats with sample stats if they can be done much more quickly).
    blarajan, GMars, acidphoenix and 57 others like this.
  25. UltiMario

    UltiMario Out of Obscurity
    is a Pokemon Researcher

    Joined:
    Aug 11, 2009
    Messages:
    1,518
    For real though, even if we used "games ending in 00" (20k games, way below the sample size for our 90% confidence level) we'd still have a better relative sample size than the samples used to predict the outcomes of elections and by companies to make financial decisions.

    I remember working with my brother once on analysis of a government survey that was supposed to represent specific income information for the entirety of the United States. The sample size was like 13k people. Surveys to analyse political opinions among the populace of countries often fall in the 1-2k range. Hopefully this gives an idea of how silly people are being over samples for Pokemon stats.

    Assuming I did my math right, 20k games in OU with our current stats gets us to about 90% confidence with a +/- 0.55% margin of error. Just change the cutoffs for rises/drops as Honko suggested if you all are THAT paranoid about screwing up the metagame over sample size.
    Last edited: Apr 14, 2017
    yuruuu, avocado, GMars and 24 others like this.
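    For comparison, the normal-approximation formula from Zarel's post can be turned around to give the margin of error for a fixed sample size. This sketch assumes z = 1.645 for 90% confidence; `margin_of_error` is an illustrative name. The +/- 0.55% figure quoted above may rest on different assumptions about p-hat, so the numbers need not match exactly.

```python
import math


def margin_of_error(n, z=1.645, p_hat=0.0341):
    """Half-width of the normal-approximation confidence interval for a proportion."""
    return z * math.sqrt(p_hat * (1 - p_hat) / n)


if __name__ == "__main__":
    # Usage near the 3.41% cutoff: roughly 0.21 percentage points.
    print(f"{margin_of_error(20_000) * 100:.2f}%")
    # Worst case p-hat = 0.5: roughly 0.58 percentage points.
    print(f"{margin_of_error(20_000, p_hat=0.5) * 100:.2f}%")
```

    Under these assumptions, a 20k-battle sample is at least as tight as the +/- 0.55% estimate for Pokemon near the cutoff, which fits with widening the rise/drop cutoffs by about half a percentage point.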
