Making Battle Logs Publicly Available

Discussion in 'The Policy Review' started by Antar, Nov 2, 2011.

  1. Antar

    Antar Self-anointed Czar of LC UU
    is a Battle Server Administrator, is a Programmer, is a Super Moderator, is a Community Contributor
    Official Data Miner

    Joined:
    Feb 17, 2010
    Messages:
    3,168
    I'll be brief.

    I'd like to be able to upload each month's battle logs to some file hosting site for people to be able to pore over, either to double-check my findings or to look at new and interesting things. Compressed, we're talking about ~300MB per month. Megaupload can handle that.

    My concern, however, is privacy. It's one thing for trusted Smogon staff members to be allowed access to logs of every single battle by every single trainer who logs onto the server. It's another thing to give everyone the same right. It's hard for me to think of ways people would abuse the system--things might be different if there were no team preview--but I'm not very creative in thinking up these things.

    So? What do people think?
  2. eric the espeon

    eric the espeon maybe I just misunderstood
    is a Forum Moderator Alumnus, is a Researcher Alumnus, is a CAP Contributor Alumnus, is a Tiering Contributor Alumnus, is a Contributor Alumnus

    Joined:
    Aug 7, 2007
    Messages:
    3,694
    I think something along the lines of this was brought up before. There are some pretty big advantages for people wanting to use the data for other fun things, and there are ways to mostly solve the privacy issues. A combination of anonymizing usernames to make it awkward to track down a specific foe's teams (perhaps giving a rating score rather than a username, to aid people wanting to make ranked stats) and delaying release of each month's stats for a few weeks to allow team turnover should almost entirely clear those worries. I'm in favor, so long as it's not going to be practical for someone to code something that mines the data easily and picks out or predicts exactly the foe's team based on their username or revealed pokes mid-battle. Looking forward to the stats that can come out of it.
  3. Antar

    Each trainer actually has a unique ID that can be used. Only problem is that anonymizing the data would involve edits to the source code.

    I like this idea. Maybe upload each month's battle logs with the next month's usage stats.

    PMmed to me:
    Is this really an issue? Are there tournaments that stretch across multiple months? If so, do said tourneys stretch across more than TWO months?

    Keep in mind, unless otherwise specified, spectators are allowed to watch any battle and essentially record their own log. If the concern is access to full moveset data, those logs are now being generated, but they're stored separately, and I can keep those private.
  4. david stone

    david stone Fast-moving, smart, sexy and alarming.
    is a Site Staff Alumnus, is a Smogon IRC AOp Alumnus, is a Programmer Alumnus, is a Super Moderator Alumnus, is a Researcher Alumnus, is a Contributor Alumnus, is a Battle Server Moderator Alumnus

    Joined:
    Aug 3, 2005
    Messages:
    5,150
    If we do this, we should probably remove all commentary from the logs. People probably assume that if it's just two people talking in a battle, the conversation will remain private between those two people.
  5. eric the espeon

    Or perhaps editing the logs before publicizing them?

    Yeah, that's the idea.

    People who really care about keeping their team a secret should probably battle on alts when testing that team, and this would do little to change that. So long as there is no username->team link, it's hard to get much data useful to a specific match. However, I have had a thought for a possibly very fun use of the data: being able to play ladder matches without having to build or pick a team, by making a program which randomly selects a team used by someone else in the tier before each match.

    And agreeing with david stone.
  6. Antar

    Again, isn't this information available to spectators?

    As with anonymizing, removing chats from logs will likely involve making significant changes to the PO source.

    Actually, I could probably come up with an anonymizing script that works on raw battle logs, but it'll be very processor- and time-consuming.
  7. david stone

    It is available to spectators, yes, but it's one thing to have to join all battles and log it yourself (with all people in the battle knowing that there are spectators there and who those spectators are). It's quite another thing to give that information to everyone.
  8. Antar

    Oh hey, I never replied to this.

    Okay, based on the feedback I'm getting, I won't release the logs until I've come up with an "anonymizing" script. It's gonna be obnoxious, but it's definitely doable. Don't expect anything soon, though.
  9. david stone

    Do you have an idea for how you'll anonymize, and what degree of anonymity you are trying to get?

    As an example of the kind of pitfalls anonymization can have, if you give each user a unique ID, this allows looking at a player's overall performance, which could theoretically be useful (but I don't imagine it being that useful). However, it also means that if you can identify a user in any battle, you can identify all of that user's battles.

    The unique ID solution also has the problem of generating that ID. You can't do anything like a hash of the user name (as I could then just hash any user that I want to investigate). It seems like the best solution to this is probably to generate a long, random string that you use per data set (or per user, but that's not essential with solid encryption like AES), and have the user ID be the encryption of the user name with that long, random string as the key. You could then discard this randomly generated key and there would be no way to find out who any user is (with as high a level of certainty as you can possibly get), and there is no risk of collision (because encrypted results are guaranteed to be unique, unlike a hash).
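    A minimal sketch of the "random key, then throw it away" idea. Note the swap: Python's standard library has no AES, so this uses a keyed hash (HMAC-SHA256) instead of encryption. That keeps the property that nobody can recompute an ID from a guessed username without the key, but unlike encryption it is not strictly collision-free (a collision on 64 bits is merely very unlikely). The function names and the lowercasing of usernames are assumptions made for illustration.

```python
import hashlib
import hmac
import secrets


def make_pseudonymizer():
    """Return a function mapping usernames to opaque IDs under a throwaway key."""
    # Random per-data-set key. It lives only in this closure and is never
    # written anywhere, so it is gone once the anonymizing run finishes.
    key = secrets.token_bytes(32)

    def pseudonym(username: str) -> str:
        # Assumption: usernames are case-insensitive, so normalize first.
        data = username.lower().encode("utf-8")
        # Keyed digest: hashing a guessed username without the key gives
        # a different value, so IDs cannot be looked up by outsiders.
        return hmac.new(key, data, hashlib.sha256).hexdigest()[:16]

    return pseudonym


pseudonym = make_pseudonymizer()
print(pseudonym("Antar"))
print(pseudonym("Antar") == pseudonym("antar"))
```

    Within one run the same username always maps to the same ID, which is what allows monthly tracking. A second call to `make_pseudonymizer()` draws a fresh key, so IDs from different data sets cannot be joined.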

    What I would recommend for maximum anonymity is also the easiest to implement: "Player1" and "Player2".

    Ideally, each battle log would include both players' ratings at the start of the battle so we can generate weighted stats with it and also measure the performance of our rating system. Most rating systems (such as Elo and Glicko, I don't know what PO uses) allow you to calculate the odds that a particular player beats another player, given only their rating. We could see how accurate this is, or if certain groups of players tend to be overrated or underrated.
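    As a rough illustration of that accuracy check, here is the standard Elo expected-score formula plus a simple mean-squared-error measure (a Brier score) over battles. This assumes plain Elo on the usual 400-point scale; Glicko and whatever PO actually uses have different formulas. The `(rating_p1, rating_p2, p1_won)` tuple shape is a hypothetical stand-in for what a log with ratings would provide, and the sample numbers are made up.

```python
def elo_expected(rating_a: float, rating_b: float) -> float:
    """Probability that A beats B under the standard Elo logistic curve."""
    return 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / 400.0))


def brier_score(battles) -> float:
    """Mean squared error between predicted and actual results.

    Lower is better; always predicting 50% scores 0.25.
    """
    battles = list(battles)
    total = 0.0
    for rating_p1, rating_p2, p1_won in battles:
        predicted = elo_expected(rating_p1, rating_p2)
        actual = 1.0 if p1_won else 0.0
        total += (predicted - actual) ** 2
    return total / len(battles)


# Made-up sample data: (rating_p1, rating_p2, p1_won)
sample = [(1500, 1500, True), (1700, 1500, True), (1400, 1600, True)]
print(elo_expected(1700, 1500))
print(brier_score(sample))
```

    Grouping the battles by rating band before scoring would show whether particular groups of players are systematically overrated or underrated.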
  10. Antar

    Each trainer already has a unique ID. The server uses it in naming the log files. The idea would be to go through and remove all lines of chat and replace each occurrence of the trainer's name with his/her ID number (I'm brainstorming now what to do about pathological cases, namely trainers with pokemon names and pokemon with trainer names).

    Naturally. But, in most cases, the same thing can be accomplished by just looking at the pokemon on the trainer's team.

    Talk about overkill.

    This is, of course, doable, but prevents the generation of "1337" stats and similar stats that use only a segment of the player base.

    This would involve modifications to the source code that, frankly, I've been meaning to make anyway and that I should have made on the last revision. It will mean a slight difference in what "1337" stats mean, since, right now, I look at end-of-month stats (the idea being you're looking at all battles done by a given trainer, not just after they finished laddering).



    Bottom line, though: is it really worth going to all this trouble to protect anonymity? Seriously, I'm asking. Because I see only a limited amount of mischief anyone could accomplish by analyzing all battles done by a certain player a month in the past, and if you make the process more difficult by introducing a unique ID, I can't imagine anyone going through the trouble. But, then again, I'm the trusting sort.
  11. Aldaron

    Aldaron wu tang clan ain't nothin to fuck wit
    is a Tournament Director, is a Battle Server Administrator, is a Smogon IRC SOP, is a Tiering Contributor, is a Contributor to Smogon, is an Administrator
    OU and IRC Leader

    Joined:
    Aug 5, 2007
    Messages:
    4,490
    Doesn't obi's suggestion to just change every name to player 1 and player 2 (arbitrarily for each battle) and then remove comments remove any issues of anonymity we might have?

    Btw, I definitely do think it is important to do as much as we can to protect player anonymity, because a large part of winning in fourth and fifth gen singles is stylistic surprise.
  12. Antar

    Yes, but it reduces the number of analyses that can be performed. Forgetting 1337 stats (I actually have an idea about that), I was thinking at one point about looking at a separate weighting along the lines of "one alt, one vote," where you'd count up usage across all teams used by a given username and weight the contribution of all usages associated with that alt so that the sum total was only 6. This would not be possible if every battle was recorded as between "Player 1" and "Player 2."
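    One way to read the "one alt, one vote" weighting, as a sketch: every appearance of a pokemon on an alt's teams is scaled so that the alt's contributions add up to 6 no matter how many battles it played. The `(alt, team)` input shape and the sample teams are assumptions for illustration, not the format of the real logs.

```python
from collections import Counter, defaultdict


def one_alt_one_vote(team_records):
    """Weighted usage where every alt contributes a total weight of 6.

    `team_records` is an iterable of (alt, team) pairs, one per battle side.
    """
    # Raw appearance counts, kept separately for each alt.
    per_alt = defaultdict(Counter)
    for alt, team in team_records:
        per_alt[alt].update(team)

    usage = Counter()
    for counts in per_alt.values():
        total = sum(counts.values())
        for species, n in counts.items():
            # Scale so that this alt's weights sum to exactly 6.
            usage[species] += 6.0 * n / total
    return usage


records = [
    ("alt_a", ["Garchomp", "Rotom-W", "Ferrothorn", "Latios", "Tyranitar", "Scizor"]),
    ("alt_a", ["Garchomp", "Rotom-W", "Ferrothorn", "Latios", "Tyranitar", "Scizor"]),
    ("alt_b", ["Politoed", "Rotom-W", "Ferrothorn", "Tornadus", "Starmie", "Dugtrio"]),
]
usage = one_alt_one_vote(records)
print(usage["Garchomp"], usage["Rotom-W"])
```

    Here `alt_a` played twice with the same team and `alt_b` played once, yet both contribute the same total weight. The grouping key is the alt, which is exactly what disappears if every battle is recorded as "Player 1" vs. "Player 2".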

    Fair enough.
  13. Chou Toshio

    Chou Toshio @Fighting Necktie
    is a Forum Moderator, is a Community Contributor, is an Artist Alumnus, is a Contributor Alumnus, is a Smogon Media Contributor Alumnus, is a Battle Server Moderator Alumnus
    Moderator

    Joined:
    Aug 16, 2007
    Messages:
    8,242
    Why don't we only make battles from the ladder public? Then we can skip any issues with tournaments. We could just make the logs public 2 weeks after they're collected. For the most part, at that point, people will have changed their teams or their teams will have lost edge anyway.
  14. david stone

    Well, that's where my suggestion of using AES comes in. You can encrypt the username and include that at the top of the battle in the form of Player1 = Aerw9sfdjlk45, Player2 = CXx09j345nsd. Then if you want to get all of the stats about a particular player, you can search for all battles in which that encrypted unique ID is present. But as I said, this has the drawback of creating a unique token for a player that allows all of their battles to be tracked. Without that token, you cannot learn very much more than you already know.

    Pokemon Online does give a player ID, but every time you reconnect, you get the "next" player ID available (and when the server restarts, the player IDs go back to 1). This does not provide monthly tracking, unless there is yet another player ID hidden somewhere.

    I'm not saying that we need to go one way or the other. I'm just laying out the right way to do it with maximum anonymity (only Player1 and Player2) and the way to maximize anonymity while still allowing monthly tracking (the AES-encrypted user name, with a randomly generated token used as the key, where that key is not saved or given out). If we add player ratings to the logs, then we won't even need to track who is who to determine the '1337 stats', because the rating would be built right into the log.
  15. Surgo

    Surgo goes to eleven
    is a Site Staff Alumnus, is a Smogon IRC AOp Alumnus, is a Programmer Alumnus, is a Contributor Alumnus, is an Administrator Alumnus

    Joined:
    Jul 10, 2006
    Messages:
    3,635
    This is excellent. I was actually directed to ask you about getting exactly this data, without knowing you had already proposed uploading it!

    Basically, I am attempting to create a pokemon-playing robot that doesn't really have any intelligence but just uses data like this to make decisions. This is exactly what I need.

    No source code modification is necessary, all the anonymization is just a sed script away.
  16. Antar

    Dammit! I was really hoping no one was going to actually want the logs so we could bypass the whole issue. Ah well... so much for laziness.

    This is a really cool and ambitious project. You will theoretically have issues with players who name their pokemon things like "Latios used Surf," but I've already gotten to work on a "matchup matrix," which analyzes what happens when pokemon x goes against pokemon y. I've been meaning to post the results, but presenting the data is kinda hard, since it's four-dimensional, and I've been busy with other projects (like work).

    This is true unless you want the player rankings as well.



    Thanks to your post I'll prioritize trying to figure out the anonymization stuff. I think I'm just going to make a ruling here and now:

    Fully anonymized battle logs ("Player 1 vs. Player 2") will be available one month after I post the usage stats (so, in other words, November's logs will be available January 1). I will also make non-anonymized logs available individually, upon request, and only to people who have been vetted (probably only Smogon staff).

    Not sure what month I'll start with. It depends on how busy I am in the next two weeks.
  17. Antar

    As much as I wanted to keep my post count at 666, I knew I couldn't do it forever.

    I've written an anonymizer script. It's gonna need some testing, but so far it seems to successfully remove trainer names from the battle logs (but not pokemon nicknames).

    Everything I've written so far is up on my shiny new github repo! Feel free to take a look and, if you have anything to add, feel free to contribute!
  18. david stone

    You could also release a torrent of these logs to minimize server strain now that megaupload doesn't exist.
  19. Antar

    Then I have to seed the torrent.

    Anonymized logs have continued to be in the back of my mind, but seeing as how no one's been bugging me for them, I haven't made it a high priority.
  20. Aldaron

    You've posted nearly 1200 times in 11 months...impressive antar, impressive.
  21. david stone

    I guess I should bug you for them.

    I decided that it would probably be better if I had the ability to analyze the logs myself rather than having to bug you to write a script every time I have a new idea for Technical Machine.
  22. Honko

    Honko
    is a Programmer, is a Tiering Contributor, is a Contributor to Smogon

    Joined:
    Dec 6, 2009
    Messages:
    1,172
    I bugged you for them several months ago via PM. Consider yourself re-bugged!
  23. Antar

    If you bugged me several months ago, that would've been right in the middle of the PO<->PS transition, which is why I didn't register it.

    Okay, so in this new PS era, we have some different issues:

    • PS logs contain a LOT more information than PO logs, namely FULL MOVESET INFORMATION. With PO logs, reverse-engineering someone's team from a log would've been a lot of work and with weird sets would have either involved a lot of guesswork or a lot of calcs. With PS logs, it's all explicitly included.
    • Unrated battles are not logged, so the privacy of tournament stuff is no longer an issue
    • I don't have access to trainers' unique numerical IDs, so I would either have to create my own alt<->ID lookup table (which isn't HARD) or just have everything be p1 vs. p2, which has the issue of not being able to do the "1 alt, 1 vote" system or "1337" stats (not an issue until Zarel stops resetting the damn ladders).
    • PS is *a lot* more active than Smogon PO ever was. Uncompressed, September's logs are 10.4 GB in size. This is going to make distribution a pain (I really don't want to be seeding a torrent that large from my home computer, and our firewall on the PS server is too restrictive to allow torrenting). Edit: Compressed (bzip2) it's down to 800 MB, which is too large for the free filehosting sites I know, and it's STILL a bit large for me to want to seed it continuously.

    So what should I be doing to anonymize PS logs?

    I'm thinking,
    1. Replace alts with unique numerical ID, with the alt<->ID lookup table being something my script generates, but which is not published
    2. Replace all nicknames with species names
    3. Remove all chat

    I could also quite easily remove the moveset data as well, but I'd prefer to give people who want it access to that data. Thoughts?
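    The three steps above can be sketched as follows. The log format here is a simplified, hypothetical pipe-delimited layout (`|player|…`, `|c|…`, `|switch|…`, `|move|…`) chosen for illustration; real PS logs may be structured differently. It also assumes singles and that names and nicknames never contain `|`. `Anonymizer` and its methods are made-up names.

```python
import itertools


class Anonymizer:
    """Anonymize battle logs in a simplified pipe-delimited format."""

    def __init__(self):
        self._ids = {}                   # alt -> number; this table stays private
        self._next = itertools.count(1)

    def alt_id(self, alt: str) -> str:
        key = alt.lower()
        if key not in self._ids:
            self._ids[key] = next(self._next)
        return f"user{self._ids[key]}"

    def anonymize(self, log_lines):
        species = {}                     # (slot, nickname) -> species, per battle
        out = []
        for line in log_lines:
            parts = line.split("|")
            kind = parts[1] if len(parts) > 1 else ""
            if kind == "c":
                continue                 # step 3: drop chat lines entirely
            if kind == "player" and len(parts) > 3:
                parts[3] = self.alt_id(parts[3])      # step 1: alt -> ID
            elif kind == "switch" and len(parts) > 3:
                slot, nick = parts[2].split(": ", 1)
                species[(slot, nick)] = parts[3]
            # Step 2: rewrite "slot: nickname" fields only, never free text,
            # so a nickname like "Latios used Surf" cannot confuse the parse.
            for i, field in enumerate(parts):
                if ": " in field:
                    slot, nick = field.split(": ", 1)
                    if (slot, nick) in species:
                        parts[i] = f"{slot}: {species[(slot, nick)]}"
            out.append("|".join(parts))
        return out


battle = [
    "|player|p1|SomeAlt",
    "|player|p2|OtherAlt",
    "|switch|p1a: Fluffy|Arcanine|100/100",
    "|switch|p2a: Latios used Surf|Starmie|100/100",
    "|c|SomeAlt|gl hf",
    "|move|p2a: Latios used Surf|Surf|p1a: Fluffy",
]
for line in Anonymizer().anonymize(battle):
    print(line)
```

    Reusing one `Anonymizer` across a whole month keeps each alt's ID stable between battles, which is what the "1 alt, 1 vote" and "1337" stats need. Leaving the moveset lines untouched is a separate decision from these three steps.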
  24. david stone

    The move set data is one of the most important things for me to have, so that I can generate move set stats similar to team mate stats (but far more in-depth).

    If you torrent, it wouldn't just be you uploading, though. I would upload to at least a ratio of 1.5. At the very least, people who were downloading it would also be uploading at the same time. I don't know what the average ratio would be for seeding (excluding you from the stats).

    As for compression, you may want to look into 7zip compression. My understanding is that it compresses text better than most other compression algorithms. http://www.codinghorror.com/blog/2009/02/file-compression-in-the-multi-core-era.html
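    A quick way to compare the two on a sample before committing to either, using only Python's standard library. Python's `lzma` module writes the `.xz` container rather than a `.7z` archive, but LZMA is the algorithm family behind 7-Zip's default method, so the sizes should be indicative. The sample text below is a made-up stand-in for a battle log, so the numbers it prints say nothing about the real logs.

```python
import bz2
import lzma


def compare_compression(data: bytes) -> dict:
    """Return raw and compressed sizes in bytes under bzip2 and LZMA."""
    return {
        "raw": len(data),
        "bzip2": len(bz2.compress(data, compresslevel=9)),
        "lzma": len(lzma.compress(data, preset=9)),
    }


# Stand-in for a battle log: repetitive, line-oriented text.
sample = (
    "|move|p1a: Arcanine|Flare Blitz|p2a: Starmie\n"
    "|switch|p2a: Starmie|Starmie|100/100\n"
).encode("utf-8") * 5000

print(compare_compression(sample))
```

    Running this on a real day's worth of logs, and timing each call, would give an actual size-versus-speed comparison instead of a rough estimate.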

    Removing chat should also reduce the size of the logs a bit.
  25. Antar

    I like it, too, and I'd prefer not to remove it. The issue is whether it makes it too easy to counterteam and flat-out steal highly successful teams. I feel that putting a 1-3 month delay on releasing anonymized logs solves this problem.

    Re: Torrenting

    I get the sense that there are very few people who would actually be interested in getting these logs. I think in that case that the best way to distribute is via p2p transfer (ssh, skype, ftp) on a case-by-case basis. Basically, you ask me for the logs, I send them to you, and you agree to send them to another person if someone else wants them.

    Yeah, I'm aware. But bzip2 and 7zip are order-of-magnitude comparable (and I was going for a rough estimate of compressed file size), and it's also quite a bit faster.

    A minuscule amount. Most battles have no chat, and those that do have minimal amounts of chat, especially compared to the length of the logs.
