[personal profile] jrtom
From [livejournal.com profile] fdmts: A Face Is Exposed for AOL Searcher No. 4417749

In essence, AOL recently released 20 million anonymized search queries. However, it turns out that it's not that hard to figure out who someone is based on what they're searching for, as the article details.
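
To make the risk concrete, here's a minimal sketch (Python; the queries are invented, and the numeric ID is just the one from the article's headline) of how identifying clues pile up once you group a "fully anonymized" log by its per-user ID:

```python
# Minimal sketch: why swapping a user's name for a numeric ID isn't enough.
# The queries below are invented; the ID comes from the article's headline.
from collections import defaultdict

search_log = [
    (4417749, "landscapers in a small georgia town"),
    (4417749, "homes sold in my subdivision last year"),
    (4417749, "numb fingers"),
    (1234567, "best pizza near campus"),
]

# Group every query under its pseudonymous ID...
queries_by_user = defaultdict(list)
for anon_id, query in search_log:
    queries_by_user[anon_id].append(query)

# ...then flag queries that narrow down who the searcher is.
# (A real attempt would use gazetteers, name lists, phone books, etc.;
# this just checks for a few hand-picked location-ish words.)
LOCATION_HINTS = {"town", "subdivision", "georgia"}

for anon_id, queries in queries_by_user.items():
    hints = [q for q in queries if LOCATION_HINTS & set(q.split())]
    if hints:
        print(f"ID {anon_id}: {len(hints)} queries with location clues")
        for q in hints:
            print("  " + q)
```

Nothing here is sophisticated; the point is just that the per-user ID preserves exactly the linkage that lets the clues accumulate.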

I deal with this sort of issue on a continuing basis as part of my profession (among other things, I do research on learning models for social network analysis). In some cases, the data is inarguably public: no one really minds if I analyze the network defined by the "writes-a-paper-with" relation. But in other cases, it's been drilled into the heads of researchers--supposedly--that anonymization is required in order to release data, and often in order to get it in the first place.

The problem, of course, is that anonymization clearly isn't sufficient in this case.

It's a tricky problem; we can't do research if we don't have data to work with, and there are valuable things that can be learned from such data that _don't_ involve violating people's privacy. I guess the question is, if it's necessary to collect such data in the first place, and to study it, is there anything in addition to anonymization that can be done to prevent this sort of 'reverse-engineering' of someone's identity? (Obviously AOL shouldn't have released the data publicly in the first place…but the point is that by current standards they probably thought that it wouldn't do any harm because it was anonymized.) Aggregating it isn't the answer, because then you lose much of the information that made the data valuable in the first place.

*ponder*

HIPAA

Date: 9 August 2006 23:10 (UTC)
From: [identity profile] fdmts.livejournal.com
One might be able to learn from the world of medical research, including HIPAA (the Health Insurance Portability and Accountability Act).

In this world, all medical records are, by default, private. Releasing them in any form, without authorization, is a federal offense. It is the job of the institution to justify (by federal audit) that they are adequately protecting patient information. It is the job of the researcher to justify (by application to an institutional review board) that their intended use will not get the institution's license pulled.

It works pretty well. "Trust me, I'm a scientist" is never enough. Usually, researchers must detail exactly what data they are going to extract, and how that data will be anonymized. It requires researchers to state explicitly what they are studying, and how, ahead of time.

Re: HIPAA

Date: 9 August 2006 23:36 (UTC)
From: [identity profile] jrtom.livejournal.com
I expect that you have more information about this, from one side at least, than I do; I haven't talked to Dad about HIPAA much.

That said, as far as I can tell, HIPAA has resulted in the instant creation of documents used by medical organizations that basically ask you to accept that the medical organization in question can type your info on a sheet of paper and sail it out the window as a paper airplane if they feel like it. I exaggerate somewhat...but I recently ran into a real gem of the genre when I went for a company-sponsored health screening which was run by J. Random Medical Organization. They handed me a 4-page document (which we had to sign to acknowledge that we'd received--most people didn't even bother to ask for their copy) which detailed the many, many ways in which they could use this information without further consultation.

Food for thought: researchers in academia have to get human study proposals past something called an IRB (Institutional Review Board). Anonymization of the subjects' names is a given, but that's just a start, and if the IRB goes thumbs down, the study doesn't happen.

On the other hand, companies collecting data on their customers don't, as far as I can tell, have to do a damned thing except wave their hands about what use they might make of the data, and bury it deep in a EULA or equivalent. That is: there's effectively no oversight that the company doesn't provide itself.

Anyway, the social network research community has had discussions about the fact that anonymization is close to useless if the network has enough information in it--and my understanding is that IRBs haven't really caught up with this problem yet. Clearly this is another example of the same problem: there's more information in these data sets than people had previously realized, or at least been able to extract easily.

Something I meant to mention in my original post: the Enron email data set. If you're not aware of it, it basically consists of the email folders (or at least parts of them) of quite a number of people, covering a period of some years. It's now widely available on the Internet, in several different versions (depending on what means of cleaning it up, resolving entities, etc. you prefer). While it seems to be pretty well established that some of the people at the top were running some shady scams, it includes the email of thousands of people who probably had nothing to do with this. And it's _not_ anonymized.

This data set has been used by all sorts of research groups who do data mining, especially those (like me) who are interested in data sets from which one can derive social networks. No one uses names in their papers or talks...with the exception of people like Ken Lay and Jeff Skilling...who got dissected years before the trial ever happened.

Another example: I've had access to a second email data set; this one is based only on email server log data, so there's no content or headers or anything...but you can learn a surprising amount about an organization just from looking at who's talking to whom, and when.
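
As a rough illustration of that idea, here's a minimal sketch assuming each log row is a (timestamp, sender, recipient) triple; the addresses are hypothetical, and networkx is just one convenient graph library:

```python
# Minimal sketch of "who talks to whom, and when" from server logs alone.
# Each log row is assumed to be (timestamp, sender, recipient); the
# addresses are hypothetical, and networkx is just one convenient library.
from collections import Counter
import networkx as nx

log_rows = [
    ("2001-05-01T09:12", "a@example.com", "b@example.com"),
    ("2001-05-01T09:15", "a@example.com", "c@example.com"),
    ("2001-05-01T17:40", "b@example.com", "a@example.com"),
    ("2001-05-02T08:03", "c@example.com", "a@example.com"),
    ("2001-05-02T08:10", "c@example.com", "b@example.com"),
]

# Count how often each ordered (sender, recipient) pair appears...
edge_weights = Counter((s, r) for _, s, r in log_rows)

# ...and turn those counts into a weighted, directed graph.
g = nx.DiGraph()
for (sender, recipient), count in edge_weights.items():
    g.add_edge(sender, recipient, weight=count)

# Even with no message content, simple measures start to expose the
# organization's structure: e.g. who sits at the center of the traffic.
for person, score in sorted(nx.degree_centrality(g).items(),
                            key=lambda kv: -kv[1]):
    print(f"{person}: {score:.2f}")
```

With a real log you'd presumably also slice by time of day and week, which is where the "and when" part starts to matter.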

All this rambling aside, where I was really going was this: let's suppose that you think that it's ethically OK to do research on people's email, or email patterns, or search logs. What other measures--in addition to anonymization--should be taken on such data sets so as to protect the subjects' privacy? Or do we simply need to depend on the ethical behavior of the people studying them if the data set is sufficiently rich in the first place?

Re: HIPAA

Date: 10 August 2006 01:26 (UTC)
From: [identity profile] fdmts.livejournal.com
let's suppose that you think that it's ethically OK to do research on people's email, or email patterns, or search logs. What other measures--in addition to anonymization--should be taken on such data sets so as to protect the subjects' privacy? Or do we simply need to depend on the ethical behavior of the people studying them if the data set is sufficiently rich in the first place?

Privacy is an interesting one. I don't think that we have any *right* to privacy, but we certainly *expect* it. That will change in the next 20 years. You and I are at the forefront of that wave. We think a little bit about putting our diaries online. My little sister thinks it's totally normal. My mother thinks it's odd.

Which other measures? Good question. I think that specific identifying details should be "blurred." "Town A," "Town B," and so on. Blonde, brunette, and redhead are generic enough to leave in. "Riverside RI" is a small place, especially if your dataset consists of 30 to 40 year olds who drive Jeeps.
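
One minimal way to implement that kind of blurring (a sketch, with made-up records and place names) is to map each specific detail to a generic label, consistently across the whole data set:

```python
# Minimal sketch of the "Town A" / "Town B" blurring idea: replace each
# specific place name with a generic label, consistently across the data.
# The records and place names here are made up for illustration.
from string import ascii_uppercase

records = [
    "subject lives in Riverside RI and drives a Jeep",
    "subject commutes from Riverside RI to Providence",
    "subject is a redhead from Providence",
]

PLACES = ["Riverside RI", "Providence"]          # quasi-identifiers to blur
labels = (f"Town {c}" for c in ascii_uppercase)  # Town A, Town B, ...
pseudonyms = {}                                  # place -> consistent label

def blur(text):
    for place in PLACES:
        if place in text:
            if place not in pseudonyms:
                pseudonyms[place] = next(labels)
            text = text.replace(place, pseudonyms[place])
    return text

for record in records:
    print(blur(record))
```

Consistency is the interesting design choice: give the same town a different label each time and you break the linkage that made the data worth studying; give it the same label every time and some of the re-identification risk comes back with it.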

Dunno. Need to think about it more.

(no subject)

Date: 10 August 2006 13:51 (UTC)
From: [identity profile] gwyd.livejournal.com
I'm rather glad you posted. I had a very realistic dream last night about hypgnosis throwing you a wake after you got run over by a drunk driver.

(no subject)

Date: 10 August 2006 16:08 (UTC)
From: [identity profile] jrtom.livejournal.com
Nope, still here. :)

Although if someone were going to throw me a wake, I can think of few more qualified than [livejournal.com profile] hypgnosis. :)
