Understanding Data Privacy in the Netflix Prize: What Went Wrong?
In this post, we’d like to share a story that shows how a seemingly harmless dataset, with no obvious personally identifiable information (PII), ended up becoming a serious privacy concern. We hope it illustrates how thorny the issue of data privacy is and why it is such an important area of research today.
The content of this post is informed by the book “The Ethical Algorithm,” by Michael Kearns and Aaron Roth.
What is anonymous data?
This is a great question because, on the surface, there seems to be a fairly good common-sense answer: data without PII. Nevertheless, real-world experience has shown data practitioners that it’s not that simple.
The question of data privacy is especially relevant in medical research, because the data needed to perform investigations is usually highly sensitive information about individual patients that should not be freely shared. Yet being able to share this information between institutions may be the only way to get the benefits of large-scale data science and to find important correlations that can help treat patients and even save lives.
In the 1990s, in the state of Massachusetts, a government agency decided to release data summarizing hospital visits for every state employee. To protect the patients’ identities, they removed the obvious PII, such as names, addresses, and social security numbers, and kept a few fields they deemed useful for summarizing the data: zip code, birthdate, and sex.
Latanya Sweeney, then a PhD student at MIT, wanted to prove that this anonymization process was not enough, so she purchased the voter rolls for the city of Cambridge, which included the zip code, birthdate, and sex of every voter in the city. Just by joining the two datasets on these three features, she was able to pinpoint the governor’s medical records: only one record in the medical dataset had exactly the same values for all three. Of course, sex, birthdate, and zip code cannot individually identify someone, but taken together as a combined set of features, Sweeney estimated, they uniquely identify about 87% of the US population.
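To make the mechanics concrete, here is a minimal sketch of such a linkage attack, using entirely made-up records, names, and column names (the real attack was done on actual hospital and voter-roll data; this is only an illustration of the join):

```python
# Minimal sketch of a linkage attack on quasi-identifiers.
# All records, names, and column names below are hypothetical.
import pandas as pd

# "Anonymized" hospital-visit summaries: no name, address, or SSN,
# but zip code, birthdate, and sex are kept.
hospital = pd.DataFrame([
    {"zip": "02138", "birthdate": "1951-02-14", "sex": "M", "diagnosis": "hypertension"},
    {"zip": "02139", "birthdate": "1972-03-12", "sex": "F", "diagnosis": "asthma"},
])

# Public voter roll: real names attached to the same three attributes.
voters = pd.DataFrame([
    {"name": "Pat Example", "zip": "02138", "birthdate": "1951-02-14", "sex": "M"},
    {"name": "Sam Sample",  "zip": "02139", "birthdate": "1980-01-01", "sex": "F"},
])

# A plain join on the three quasi-identifiers re-attaches a name to an
# "anonymous" medical record whenever that combination is unique.
reidentified = hospital.merge(voters, on=["zip", "birthdate", "sex"])
print(reidentified[["name", "diagnosis"]])
```

Only the first record matches here, but that is exactly the point: when a combination of innocuous-looking attributes is unique, one ordinary join is all it takes.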
One could respond to this case by saying, “Ok, so let’s just coarsen these features even more,” or even “drop zip code, birthdate, and sex altogether.” Both would help, of course, but as we will see, even much less obvious information can identify you uniquely and with high confidence.
The Netflix competition
In 2006, Netflix launched a now-famous competition in which it shared approximately 100 million individual movie ratings, together with the date of each rating, for roughly 500,000 users. All other user data was removed and replaced with a unique numeric ID, used only to tell which ratings belonged to the same user. The competition was aimed at letting the public experiment with new techniques and find a recommendation algorithm that improved on the accuracy of Netflix’s own system at the time by 10%. The prize was set at USD $1,000,000.
Compared to the example of the medical records, Netflix had been very careful not to include any data that could identify a user: no zip code, no birthdate, and of course no names or personal IDs. Nevertheless, only a couple of weeks after the release, another PhD student, Arvind Narayanan, announced that he and his advisor, Vitaly Shmatikov, had been able to connect many of the unique IDs in the Netflix dataset to real people by cross-referencing another publicly available dataset: the movie ratings on the IMDb site, where many users post publicly under their own names.
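To give a feel for how such cross-referencing can work, here is a deliberately simplified sketch of the idea (not Narayanan and Shmatikov’s actual algorithm): score each public profile against an anonymous rating record by counting movies rated similarly at similar times, and report the best-scoring candidate. All records, usernames, and tolerance values below are hypothetical.

```python
# Simplified sketch of matching an anonymous rating record against
# public profiles. Not the published attack; just the core intuition.
from datetime import date

# Hypothetical anonymous record: movie title -> (rating, date of rating).
anon_record = {
    "Movie A": (5, date(2005, 3, 1)),
    "Movie B": (1, date(2005, 3, 4)),
    "Movie C": (4, date(2005, 6, 20)),
}

# Hypothetical public profiles (e.g., reviews posted under a real name).
public_profiles = {
    "alice": {"Movie A": (5, date(2005, 3, 2)), "Movie C": (4, date(2005, 6, 21))},
    "bob":   {"Movie B": (3, date(2004, 1, 10))},
}

def score(anon, public, rating_tol=1, day_tol=14):
    """Count movies whose rating and date roughly agree between records."""
    matches = 0
    for movie, (rating, rated_on) in anon.items():
        if movie in public:
            pub_rating, pub_date = public[movie]
            if abs(pub_rating - rating) <= rating_tol and abs((pub_date - rated_on).days) <= day_tol:
                matches += 1
    return matches

scores = {name: score(anon_record, profile) for name, profile in public_profiles.items()}
best = max(scores, key=scores.get)
print(scores)                 # {'alice': 2, 'bob': 0}
print("best match:", best)    # 'alice' stands out clearly
```

Because most people’s rating histories are highly distinctive, even a handful of overlapping, roughly dated ratings is often enough to make one candidate stand out far above the rest.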
It’s tempting to say that if these people are posting their reviews publicly on IMDb, then linking them back can hardly be a violation of privacy, but here again we see how complicated this issue is: a person might have publicly reviewed a few of the movies they watched on IMDb while privately rating others on Netflix, and linking the two could expose far more sensitive information, such as political inclination or even sexual orientation. This is why movie rental records in the US are protected by the Video Privacy Protection Act (passed in 1988), which makes video rental providers liable for up to USD $2,500 in damages per customer whose records are released.
This ended in a big lawsuit, the cancellation of a planned follow-up competition, and the dataset being withdrawn from further distribution. Of course, another problem here is that once a dataset has been released, it can’t really be made “unavailable” again: Netflix can stop sharing it, but whoever downloaded it still has a copy.
The Netflix competition story shows how hard it is to truly anonymize data: whoever shares the data would need to know in advance every other source of information available today (or in the future!) that, in combination, could de-anonymize it. That seems like an impossible task, and Cynthia Dwork (one of the inventors of a technique called Differential Privacy) puts it bluntly: “anonymized data isn’t,” meaning the data is either not really anonymized, or so much has been removed from it that it isn’t data anymore.
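To hint at the flavor of Differential Privacy mentioned above, here is a minimal sketch of one of its basic building blocks, the Laplace mechanism applied to a simple counting query. The dataset, the predicate, and the epsilon value are all hypothetical, and a real deployment requires careful sensitivity and privacy-budget analysis; this is only meant to show the idea of answering queries with calibrated noise instead of releasing raw records.

```python
# Minimal sketch of the Laplace mechanism for a counting query.
# Hypothetical data and parameters; not a production-ready implementation.
import numpy as np

rng = np.random.default_rng(0)

def private_count(values, predicate, epsilon=0.5):
    """Return a noisy count of records satisfying `predicate`.

    Adding or removing one record changes a count by at most 1
    (sensitivity 1), so Laplace noise with scale 1/epsilon gives
    epsilon-differential privacy for this single query.
    """
    true_count = sum(1 for v in values if predicate(v))
    return true_count + rng.laplace(loc=0.0, scale=1.0 / epsilon)

# Hypothetical records: ages of patients in a dataset.
ages = [34, 41, 29, 62, 57, 45, 38]
print(private_count(ages, lambda age: age > 40))  # a noisy answer near 4
```

The noise is the price of privacy: each answer is a little less accurate, which previews the trade-off discussed below.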
Some thoughts
We hope this story helps show that a seemingly simple task like anonymizing a dataset is actually a very hard problem to solve. We have only scratched the surface of data privacy in this article; hopefully, we will share future articles on existing approaches that tackle this problem with clever algorithms, either at data-collection time or by carefully redacting information from individual records.
An important rule of thumb is that none of the existing techniques for protecting data privacy comes for free. One way to see this: if we add ethical goals directly into our algorithms as constraints on the optimization objective, that will usually come at some cost to the business metric, for example accuracy.
Of course, “traditional” solutions, such as laws and regulations around data and algorithms, are also a very important part of improving our data privacy, but we should not forget how we, as data practitioners, can help by designing social constraints directly into our algorithms.
Originally published on the Emi Labs Tech - Ravens by Nico Ortez, one of the original Emi team members. (Slight modifications have been made to improve readability.)