From fdmts: A Face Is Exposed for AOL Searcher No. 4417749
In essence, AOL recently released 20 million anonymized search queries. However, it turns out that it's not that hard to figure out who someone is based on what they're searching for, as the article details.
I deal with this sort of issue on a continuing basis as part of my profession (among other things, I do research on learning models for social network analysis). In some cases, the data is inarguably public: no one really minds if I analyze the network defined by the "writes-a-paper-with" relation. But in other cases, it's been drilled into the heads of researchers--supposedly--that anonymization is required in order to release data, and often in order to get it in the first place.
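(Just to make that concrete, here is roughly what the "writes-a-paper-with" relation looks like as data. The papers and names below are made up, not from any real dataset; it's only a sketch of the construction.)

    from itertools import combinations
    from collections import Counter

    # Toy list of papers with entirely made-up author names.
    papers = [
        ["Alice", "Bob", "Carol"],
        ["Alice", "Dave"],
        ["Bob", "Carol"],
    ]

    # The "writes-a-paper-with" relation: one edge per coauthor pair,
    # weighted by how many papers the pair shares.
    coauthor_edges = Counter()
    for authors in papers:
        for a, b in combinations(sorted(set(authors)), 2):
            coauthor_edges[(a, b)] += 1

    for (a, b), weight in sorted(coauthor_edges.items()):
        print(f"{a} -- {b}: {weight} shared paper(s)")

No one's privacy is threatened by a graph like that; the trouble starts with data that only stays harmless as long as the pseudonyms hold.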
The problem, of course, is that anonymization clearly isn't sufficient in this case.
It's a tricky problem; we can't do research if we don't have data to work with, and there are valuable things that can be learned from such data that _don't_ involve violating people's privacy. I guess the question is: if it's necessary to collect such data in the first place, and to study it, is there anything in addition to anonymization that can be done to prevent this sort of 'reverse-engineering' of someone's identity? (Obviously AOL shouldn't have released the data publicly in the first place… but the point is that by current standards they probably thought it wouldn't do any harm because it was anonymized.) Aggregating it isn't the answer, because then you lose much of the information that made the data valuable in the first place.
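(To make the failure mode concrete: an anonymized ID still ties every query by the same person together, so a single identifying query exposes the entire history. A toy sketch, with the log and the "identifying" terms entirely invented:)

    from collections import defaultdict

    # Toy pseudonymized query log -- user IDs and queries are invented.
    log = [
        ("user_001", "knitting patterns"),
        ("user_001", "plumbers in riverside ri"),
        ("user_001", "jane q. public myspace"),
        ("user_002", "red sox schedule"),
    ]

    # Naive heuristic: a query naming a small town or a person narrows down
    # who the pseudonym is; once it does, every other query by that
    # pseudonym is exposed along with it.
    identifying_terms = {"riverside ri", "jane q. public"}

    queries_by_user = defaultdict(list)
    for user, query in log:
        queries_by_user[user].append(query)

    for user, queries in queries_by_user.items():
        hints = [q for q in queries if any(t in q for t in identifying_terms)]
        if hints:
            print(f"{user}: {len(hints)} identifying queries expose all "
                  f"{len(queries)} queries")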
*ponder*
Re: HIPAA
Date: 10 August 2006 01:26 (UTC)
Privacy is an interesting one. I don't think that we have any *right* to privacy, but we certainly *expect* it. That will change in the next 20 years. You and I are at the forefront of that wave. We think a little bit about putting our diaries online. My little sister thinks it's totally normal. My mother thinks it's odd.
Which other measures? Good question. I think that specific identifying details should be "blurred": "Town A," "Town B," and so on. Blonde, brunette, and redhead are generic enough to leave in. "Riverside, RI" is a small place, especially if your dataset consists of 30- to 40-year-olds who drive Jeeps.
Dunno. Need to think about it more.
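(A quick sketch of that kind of blurring: map each distinct town to a generic label and leave the generic attributes alone. The records below are invented, just for illustration.)

    # Toy records -- invented, only to illustrate the "Town A" / "Town B" idea.
    records = [
        {"town": "Riverside, RI",  "hair": "blonde",   "car": "Jeep"},
        {"town": "Riverside, RI",  "hair": "brunette", "car": "Jeep"},
        {"town": "Providence, RI", "hair": "redhead",  "car": "Civic"},
    ]

    # Give each distinct town a generic label the first time it appears, so
    # analyses that only need "same town or not" still work on blurred data.
    town_labels = {}

    def blur_town(town):
        if town not in town_labels:
            town_labels[town] = "Town " + chr(ord("A") + len(town_labels))
        return town_labels[town]

    blurred = [dict(record, town=blur_town(record["town"])) for record in records]
    for record in blurred:
        print(record)

Of course, that only blurs the fields you thought to blur; the Jeep-driving 30-somethings problem is still there if enough generic attributes line up.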