Making Battle Logs Publicly Available

I'll be brief.

I'd like to be able to upload each month's battle logs to some file hosting site for people to be able to pore over, either to double-check my findings or to look at new and interesting things. Compressed, we're talking about ~300MB per month. Megaupload can handle that.

My concern, however, is privacy. It's one thing for trusted Smogon staff members to be allowed access to logs of every single battle by every single trainer who logs onto the server. It's another thing to give everyone the same right. It's hard for me to think of ways people would abuse the system--things might be different if there were no team preview--but I'm not very creative in thinking up these things.

So? What do people think?
 

eric the espeon

maybe I just misunderstood
is a Forum Moderator Alumnus, a Researcher Alumnus, a Top CAP Contributor Alumnus, a Tiering Contributor Alumnus, and a Top Contributor Alumnus
I think something along the lines of this was brought up before. There are some pretty big advantages for people wanting to use the data for other fun things, and ways to mostly solve privacy issues. A combination of anonymizing usernames to make it awkward to track down specific foe's teams (perhaps giving a rating score rather than username to aid people wanting to make ranked stats) and perhaps delaying release of each month's stats for a few weeks to allow team turnover should almost entirely clear those worries. In favor, so long as it's not going to be practical for someone to code something which mines the data easily and picks out/predicts exactly the foe's team based on their username/revealed pokes mid battle. Looking forward to the stats that can come out of it.
 
A combination of anonymizing usernames to make it awkward to track down specific foe's teams (perhaps giving a rating score rather than username to aid people wanting to make ranked stats)
Each trainer actually has a unique ID that can be used. Only problem is that anonymizing the data would involve edits to the source code.

and perhaps delaying release of each month's stats for a few weeks to allow team turnover should almost entirely clear those worries
I like this idea. Maybe upload each month's battle logs with the next month's usage stats.

PMmed to me:
Lady Salamence said:
I do not have PR Access, so I'd like to show a problem with the idea.

Tournaments
If we implement said idea, and allow anyone to view these battles, then someone could find out their opponent's team, assuming that people don't test teams before using them. I hope, if this idea is implemented, there is some way to get around this problem.
Is this really an issue? Are there tournaments that stretch across multiple months? If so, do said tourneys stretch across more than TWO months?

Keep in mind that, unless the battlers specify otherwise, spectators are allowed to watch any battle and essentially record their own log. If the concern is access to full moveset data, those logs are now being generated, but they're stored separately, and I can keep those private.
 

obi

formerly david stone
is a Site Content Manager Alumnus, a Programmer Alumnus, a Senior Staff Member Alumnus, a Smogon Discord Contributor Alumnus, a Researcher Alumnus, a Top Contributor Alumnus, and a Battle Simulator Moderator Alumnus
If we do this, we should probably remove all commentary from the logs. People probably assume that if it's just two people in a battle talking, that it will remain private information between those two people.
 

eric the espeon

maybe I just misunderstood
Each trainer actually has a unique ID that can be used. Only problem is that anonymizing the data would involve edits to the source code.
Or perhaps editing the logs before publicizing them?

I like this idea. Maybe upload each month's battle logs with the next month's usage stats.
Yea, that's the idea.

Is this really an issue? Are there tournaments that stretch across multiple months? If so, do said tourneys stretch across more than TWO months?
People who really care about keeping their team a secret should probably battle on alts when testing that team, and this would do little to change that. So long as there is no username->team link, it's hard to get much data useful to a specific match. However I have had a thought for a possible very fun use for the data, being able to play ladder matches without having to build or pick a team, by making a program which randomly selects a team used by someone else in the tier before each match.

And agreeing with david stone.
 
If we do this, we should probably remove all commentary from the logs. People probably assume that if it's just two people in a battle talking, that it will remain private information between those two people.
again, isn't this information available to spectators?

As with anonymizing, removing chats from logs will likely involve making significant changes to the PO source.

Actually, I could probably come up with an anonymizing script that works on raw battle logs, but it'll be very processor- and time-consuming.
 

obi

formerly david stone
It is available to spectators, yes, but it's one thing to have to join all battles and log it yourself (with all people in the battle knowing that there are spectators there and who those spectators are). It's quite another thing to give that information to everyone.
 
Oh hey, I never replied to this.

Okay, based on the feedback I'm getting, I won't release the logs until I've come up with an "anonymizing" script. It's gonna be obnoxious, but it's definitely doable. Don't expect anything soon, though.
 

obi

formerly david stone
Do you have an idea for how you'll anonymize, and what degree of anonymity you are trying to get?

As an example of the kind of pitfalls anonymization can have, if you give each user a unique ID, this allows looking at a player's overall performance, which could theoretically be useful (but I don't imagine it being that useful). However, it also means that if you can identify a user in any battle, you can identify all of that user's battles.

The unique ID solution also has the problem of generating that ID. You can't do anything like a hash of the user name (as I could then just hash any user that I want to investigate). It seems like the best solution to this is probably to generate a long, random string that you use per data set (or per user, but that's not essential with solid encryption like AES), and have the user ID be the encryption of the user name with that long, random string as the key. You could then discard this randomly generated key and there would be no way to find out who any user is (with as high level of certainty as you can possibly get), and there is no risk of collision (because encrypted results are guaranteed to be unique, unlike a hash).
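A minimal Python sketch of the throwaway-key idea above. Python's standard library has no AES, so this swaps in HMAC-SHA256 as the keyed function; that is a substitution, not the AES scheme itself. Unlike encryption, a truncated HMAC can in principle collide, though at 64 bits that is very unlikely for a player base of this size. The function names and the lowercase normalization are assumptions for illustration.

```python
import hashlib
import hmac
import secrets


def make_pseudonymizer():
    """Return a function mapping usernames to opaque IDs under a throwaway key."""
    # Random per-data-set key. It lives only in memory, so once this process
    # exits nobody (including the publisher) can recompute the mapping.
    key = secrets.token_bytes(32)

    def pseudonym(username: str) -> str:
        # Lowercasing is an assumption about how the server compares names.
        digest = hmac.new(key, username.lower().encode("utf-8"), hashlib.sha256)
        return digest.hexdigest()[:16]  # 64 bits of the keyed digest

    return pseudonym


pseudonym = make_pseudonymizer()
# Same user, same ID within this data set; a new key gives unrelated IDs.
print(pseudonym("Alice") == pseudonym("alice"))
```

Because the key is never published, hashing a username you want to investigate gets you nowhere, which is the objection to a plain hash.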

What I would recommend for maximum anonymity is also the easiest to implement: "Player1" and "Player2".

Ideally, each battle log would include both players' ratings at the start of the battle so we can generate weighted stats with it and also measure the performance of our rating system. Most rating systems (such as Elo and Glicko, I don't know what PO uses) allow you to calculate the odds that a particular player beats another player, given only their ratings. We could see how accurate this is, or if certain groups of players tend to be overrated or underrated.
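As a rough illustration of that accuracy check, here is a sketch using the standard Elo expected-score formula. Whether PO actually uses Elo is unknown, and the `(rating_a, rating_b, a_won)` tuple layout is a hypothetical input format, not something the logs are known to provide.

```python
def elo_expected_score(rating_a: float, rating_b: float) -> float:
    """Standard Elo prediction of the probability that player A beats player B."""
    return 1.0 / (1.0 + 10.0 ** ((rating_b - rating_a) / 400.0))


def calibration(battles, num_buckets: int = 10):
    """Group battles by predicted win chance and report the observed win rate.

    battles: iterable of (rating_a, rating_b, a_won) tuples (assumed layout).
    Returns {bucket_index: (observed_win_rate, battle_count)}.
    """
    buckets = {}
    for rating_a, rating_b, a_won in battles:
        predicted = elo_expected_score(rating_a, rating_b)
        index = min(int(predicted * num_buckets), num_buckets - 1)
        wins, total = buckets.get(index, (0, 0))
        buckets[index] = (wins + int(a_won), total + 1)
    return {i: (wins / total, total) for i, (wins, total) in sorted(buckets.items())}
```

If the rating system is well calibrated, the observed win rate in each bucket should sit close to that bucket's predicted range; a bucket that is consistently off would point to over- or underrated groups.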
 
Do you have an idea for how you'll anonymize, and what degree of anonymity you are trying to get?
Each trainer already has a unique ID. The server uses it in naming the log files. The idea would be to go through and remove all lines of chat and replace each incidence of the trainer's name with his/her ID number (I'm brainstorming now what to do about pathological cases, namely trainers with pokemon names and pokemon with trainer names).

However, it also means that if you can identify a user in any battle, you can identify all of that user's battles.
Naturally. But, in most cases, the same thing can be accomplished by just looking at the pokemon on the trainer's team.

The unique ID solution also has the problem of generating that ID. You can't do anything like a hash of the user name (as I could then just hash any user that I want to investigate). It seems like the best solution to this is probably to generate a long, random string that you use per data set (or per user, but that's not essential with solid encryption like AES), and have the user ID be the encryption of the user name with that long, random string as the key. You could then discard this randomly generated key and there would be no way to find out who any user is (with as high level of certainty as you can possibly get), and there is no risk of collision (because encrypted results are guaranteed to be unique, unlike a hash).
Talk about overkill.

What I would recommend for maximum anonymity is also the easiest to implement: "Player1" and "Player2".
This is, of course, doable, but prevents the generation of "1337" stats and similar stats that use only a segment of the player base.

Ideally, each battle log would include both players' ratings at the start of the battle so we can generate weighted stats with it and also measure the performance of our rating system.
This would involve modifications to the source code that, frankly, I've been meaning to make anyway and that I should have made on the last revision. It will mean a slight difference in what "1337" stats mean, since, right now, I look at end-of-month stats (the idea being you're looking at all battles done by a given trainer, not just after they finished laddering).



Bottom line, though: is it really worth going to all this trouble to protect anonymity? Seriously, I'm asking. Because I see only a limited amount of mischief anyone could accomplish by analyzing all battles done by a certain player a month in the past, and if you make the process more difficult by introducing a unique ID, I can't imagine anyone going through the trouble. But, then again, I'm the trusting sort.
 

Aldaron

geriatric
is a Tournament Director Alumnus, a Battle Simulator Admin Alumnus, a Smogon Discord Contributor Alumnus, a Top Tiering Contributor Alumnus, a Top Contributor Alumnus, and an Administrator Alumnus
Doesn't obi's suggestion to just change every name to player 1 and player 2 (arbitrarily for each battle) and then remove comments remove any issues of anonymity we might have?

Btw, I definitely do think it is important to do as much as we can to protect player anonymity, because a large part of winning in 4th and 5th gen singles is stylistic surprise.
 
Doesn't obi's suggestion to just change every name to player 1 and player 2 (arbitrarily for each battle) and then remove comments remove any issues of anonymity we might have?
Yes, but it reduces the number of analyses that can be performed. Forgetting 1337 stats (I actually have an idea about that), I was thinking at one point about looking at a separate weighting along the lines of "one alt, one vote," where you'd count up usage across all teams used by a given username and weight the contribution of all usages associated with that alt so that the sum total was only 6. This would not be possible if every battle was recorded as between "Player 1" and "Player 2."
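A sketch of how that "one alt, one vote" weighting could be computed, assuming the logs have already been reduced to `(alt, species)` pairs with one entry per team slot per battle. That input shape and the function name are hypothetical.

```python
from collections import Counter, defaultdict


def one_alt_one_vote(appearances, team_size: int = 6):
    """Weight usage so that every alt contributes exactly `team_size` in total.

    appearances: iterable of (alt, species), one entry per team slot per battle.
    """
    per_alt = defaultdict(Counter)
    for alt, species in appearances:
        per_alt[alt][species] += 1

    weighted = Counter()
    for counts in per_alt.values():
        total = sum(counts.values())
        for species, n in counts.items():
            # Scale this alt's raw counts so they sum to team_size.
            weighted[species] += team_size * n / total
    return weighted
```

An alt that ladders a thousand battles with one team then counts for no more than an alt that plays once, which is the point of the weighting. It needs a stable per-alt identifier, so it cannot be computed from logs that only say "Player 1" and "Player 2".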

Btw, I definitely do think it is important to do as much as we can to protect player anonymity, because a large part of winning in 4th and 5th gen singles is stylistic surprise.
Fair enough.
 

Chou Toshio

Over9000
is an Artist Alumnus, a Forum Moderator Alumnus, a Community Contributor Alumnus, a Contributor Alumnus, a Top Smogon Media Contributor Alumnus, and a Battle Simulator Moderator Alumnus
Why don't we only make battles from the ladder public? Then we can skip any issues with tournaments. We could just make the logs public 2 weeks after they're collected. For the most part, at that point, people will have changed their teams or their teams will have lost edge anyway.
 

obi

formerly david stone
Well, that's where my suggestion of using AES comes in. You can encrypt the username and include that at the top of the battle in the form of Player1 = Aerw9sfdjlk45, Player2 = CXx09j345nsd. Then if you want to get all of the stats about a particular player, you can search for all battles in which that encrypted unique ID is present. But as I said, this has the drawback of creating a unique token for a player that allows all of their battles to be tracked. Without that token, you cannot learn very much more than you already know.

Pokemon Online does give a player ID, but every time you reconnect, you get the "next" player ID available (and when the server restarts, the player IDs go back to 1). This does not provide monthly tracking, unless there is yet another player ID hidden somewhere.

I'm not saying that we need to go one way or the other. I'm just laying out the way to do it with maximum anonymity (only Player1 and Player2) and the way to maximize anonymity while still allowing monthly tracking (encrypting the user name with AES, using a randomly generated token as the key, where that key is not saved or given out). If we add player ratings to the logs, then we won't even need to track who is who to determine the '1337 stats', because the rating would be built right into the log.
 

Surgo

goes to eleven
is a Smogon Discord Contributor, a Site Content Manager Alumnus, a Programmer Alumnus, a Top Contributor Alumnus, and an Administrator Alumnus
This is excellent. I was actually directed to ask you about getting exactly this data, without knowing you had already proposed uploading it!

Basically, I am attempting to create a pokemon-playing robot that doesn't really have any intelligence but just uses data like this to make decisions. This is exactly what I need.

No source code modification is necessary, all the anonymization is just a sed script away.
 
This is excellent. I was actually directed to ask you about getting exactly this data, without knowing you had already proposed uploading it!
Dammit! I was really hoping no one was going to actually want the logs so we could bypass the whole issue. Ah well... so much for laziness.

Basically, I am attempting to create a pokemon-playing robot that doesn't really have any intelligence but just uses data like this to make decisions. This is exactly what I need.
This is a really cool and ambitious project. You will theoretically have issues with players who name their pokemon things like "Latios used Surf," but I've already gotten to work on a "matchup matrix," which analyzes what happens when pokemon x goes against pokemon y. I've been meaning to post the results, but presenting the data is kinda hard, since it's four-dimensional, and I've been busy with other projects (like work).

No source code modification is necessary, all the anonymization is just a sed script away.
This is true unless you want the player rankings as well.



Thanks to your post I'll prioritize trying to figure out the anonymization stuff. I think I'm just going to make a ruling here and now:

Fully anonymized battle logs ("Player 1 vs. Player 2") will be available one month after I post the usage stats (so, in other words, November's logs will be available January 1). I will also make non-anonymized logs available individually, upon request, and only to people who have been vetted (probably only Smogon staff).

Not sure what month I'll start with. It depends on how busy I am in the next two weeks.
 
As much as I wanted to keep my post count at 666, I knew I couldn't do it forever.

I've written an anonymizer script. It's gonna need some testing, but so far it seems to successfully remove trainer names from the battle logs (but not pokemon nicknames).

Everything I've written so far is up on my shiny new github repo! Feel free to take a look and, if you have anything to add, feel free to contribute!
 

obi

formerly david stone
You could also release a torrent of these logs to minimize server strain now that megaupload doesn't exist.
 
Then I have to seed the torrent.

Anonymized logs have continued to be in the back of my mind, but seeing as how no one's been bugging me for them, I haven't made it a high priority.
 

Aldaron

geriatric
As much as I wanted to keep my post count at 666, I knew I couldn't do it forever.

I've written an anonymizer script. It's gonna need some testing, but so far it seems to successfully remove trainer names from the battle logs (but not pokemon nicknames).

Everything I've written so far is up on my shiny new github repo! Feel free to take a look and, if you have anything to add, feel free to contribute!
You've posted nearly 1200 times in 11 months...impressive antar, impressive.
 

obi

formerly david stone
I guess I should bug you for them.

I decided that it would probably be better if I had the ability to analyze the logs myself rather than having to bug you to write a script every time I have a new idea for Technical Machine.
 
If you had bugged me several months ago, that would've been right in the middle of the PO<->PS transition, which is why I didn't register it.

Okay, so in this new PS era, we have some different issues:

  • PS logs contain a LOT more information than PO logs, namely FULL MOVESET INFORMATION. With PO logs, reverse-engineering someone's team from a log would've been a lot of work and with weird sets would have either involved a lot of guesswork or a lot of calcs. With PS logs, it's all explicitly included.
  • Unrated battles are not logged, so the privacy of tournament stuff is no longer an issue
  • I don't have access to trainers' unique numerical IDs, so I would either have to create my own alt<->ID lookup table (which isn't HARD) or just have everything be p1 vs. p2, which has the issue of not being able to do the "1 alt, 1 vote" system or "1337" stats (not an issue until Zarel stops resetting the damn ladders).
  • PS is *a lot* more active than Smogon PO ever was. Uncompressed, September's logs are 10.4 GB in size. This is going to make distribution a pain (I really don't want to be seeding a torrent that large from my home computer, and our firewall on the PS server is too restrictive to allow torrenting). Edit: Compressed (bzip2) it's down to 800 MB, which is too large for the free filehosting sites I know, and it's STILL a bit large for me to want to seed it continuously.

So what should I be doing to anonymize PS logs?

I'm thinking,
  1. Replace alts with unique numerical ID, with the alt<->ID lookup table being something my script generates, but which is not published
  2. Replace all nicknames with species names
  3. Remove all chat
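
The three steps above could be sketched roughly as follows. This assumes pipe-delimited lines loosely modeled on the Pokémon Showdown battle protocol (`|player|`, `|c|`, `|switch|`); the format of the logs actually stored on the server may differ, and `anonymize` and `alt_ids` are hypothetical names.

```python
def anonymize(lines, alt_ids):
    """Yield anonymized copies of one battle's pipe-delimited log lines.

    alt_ids: dict mapping alt -> numeric ID, shared across battles so an alt
    keeps the same ID all month. The table itself is never published.
    """
    nicknames = {}  # "p1a: Nickname" -> "p1a: Species", rebuilt for each battle

    for line in lines:
        parts = line.split("|")
        kind = parts[1] if len(parts) > 1 else ""

        if kind == "c":  # step 3: drop chat lines entirely
            continue
        if kind == "player" and len(parts) > 3:
            # step 1: replace the alt with its numeric ID
            alt = parts[3].lower()
            parts[3] = str(alt_ids.setdefault(alt, len(alt_ids) + 1))
        elif kind in ("switch", "drag") and len(parts) > 3:
            # step 2: learn which species hides behind each nickname
            slot = parts[2].split(":", 1)[0]
            species = parts[3].split(",", 1)[0]
            nicknames[parts[2]] = f"{slot}: {species}"

        # Only whole fields are rewritten, so a nickname can never be
        # confused with a move or species name inside another field.
        yield "|".join(nicknames.get(p, p) for p in parts)
```

Matching on whole fields sidesteps the "Latios used Surf" nickname problem, at the cost of missing nicknames embedded in longer fields, which a real script would still need to handle.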

I could also quite easily remove the moveset data as well, but I'd prefer to give people who want it access to that data. Thoughts?
 

obi

formerly david stone
The move set data is one of the most important things for me to have, so that I can generate move set stats similar to team mate stats (but far more in-depth).

If you torrent, it wouldn't just be you uploading, though. I would upload to at least a ratio of 1.5. At the very least, people who were downloading it would also be uploading at the same time. I don't know what the average ratio would be for seeding (excluding you from the stats).

As for compression, you may want to look into 7zip compression. My understanding is that it compresses text better than most other compression algorithms. http://www.codinghorror.com/blog/2009/02/file-compression-in-the-multi-core-era.html

Removing chat should also reduce the size of the logs a bit.
 
The move set data is one of the most important things for me to have, so that I can generate move set stats similar to team mate stats (but far more in-depth).
I like it, too, and I'd prefer not to remove it. The issue is whether it makes it too easy to counterteam and flat-out steal highly successful teams. I feel that putting a 1-3 month delay on releasing anonymized logs solves this problem.

Re: Torrenting

I get the sense that there are very few people who would actually be interested in getting these logs. I think in that case that the best way to distribute is via p2p transfer (ssh, skype, ftp) on a case-by-case basis. Basically, you ask me for the logs, I send them to you, and you agree to send them to another person if someone else wants them.

As for compression, you may want to look into 7zip compression. My understanding is that it compresses text better than most other compression algorithms. http://www.codinghorror.com/blog/2009/02/file-compression-in-the-multi-core-era.html
Yeah, I'm aware. But bzip2 and 7zip are order-of-magnitude comparable (and I was going for a rough estimate of compressed file size), and bzip2 is also quite a bit faster.
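
For anyone who wants to check the size difference on a sample of logs, a quick sketch with Python's standard library. Note the assumption: the `lzma` module writes `.xz` rather than a `.7z` archive, so it only stands in for 7zip's LZMA-family compression and the numbers will not match 7zip exactly.

```python
import bz2
import lzma


def compare_compression(data: bytes) -> dict:
    """Return raw and compressed sizes in bytes for bzip2 and LZMA (xz)."""
    return {
        "raw": len(data),
        "bzip2": len(bz2.compress(data, 9)),           # highest bzip2 level
        "lzma": len(lzma.compress(data, preset=9)),    # highest xz preset
    }
```

Running it on a single day's logs rather than the whole month should be enough to see whether the gap is worth the extra compression time.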

Removing chat should also reduce the size of the logs a bit.
A minuscule amount. Most battles have no chat, and those that do have minimal amounts of chat, especially compared to the length of the logs.
 
