It looks like Netflix might be spending more than $1 million on a recent campaign to improve its recommendation engine. The movie rental company recently held a contest that successfully improved its recommendations by more than 10%. But now a closeted lesbian is suing the company for invasion of privacy, saying that she could have been outed because Netflix shared data that wasn't quite so anonymous.
While her claims may be spurious, this could have legal implications for the ways user information is shared and stored online.
The Doe v. Netflix lawsuit alleges that film preferences are personal information that Netflix does not properly protect. From the filing:
"Jane Doe, a lesbian, who does not want her sexuality nor interests in gay and lesbian themed films broadcast to the world, seeks anonymity in this action.
To some, renting a movie such as “Brokeback Mountain” or even “The Passion of the Christ” can be a personal issue that they would not want published to the world.
The Brokeback Mountain Factor is described thusly: Our secrets, great or small, can now without our knowledge hurtle around the globe at the speed of light, preserved indefinitely for future recall in the electronic limbo of computer memories. These technological and economic changes in turn have made legal barriers more essential to the preservation of our privacy."
The problem arose when Netflix sent the information to contestants competing to improve Netflix's recommendation engine and win $1 million earlier this year.
According to Wired:
"In order to get a better movie recommendation algorithm, the online DVD rental company gave more than 50,000 Netflix Prize contestants two massive datasets. The first included 100 million movie ratings, along with the date of the rating, a unique ID number for the subscriber, and the movie info. Based on this data from 480,000 customers, contestants had to come up with a recommendation algorithm that could predict 10 percent better than Netflix how those same subscribers rated other movies."
But given that information, it turns out to be pretty easy to identify individual Netflix users. A few weeks into the contest, two University of Texas researchers, Arvind Narayanan and Vitaly Shmatikov, were able to identify Netflix customers (and their political leanings and sexual orientation) by comparing supposedly anonymous Netflix reviews with ones posted publicly on IMDb.
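To see why this kind of cross-referencing works, here is a minimal sketch in Python. It is not the researchers' actual method, just a simplified illustration of the idea: a handful of ratings with roughly matching scores and dates can point to one record in the "anonymous" set. The movie titles, subscriber IDs, username, and tolerance thresholds are all made up for the example.

```python
from datetime import date

# Hypothetical "anonymized" release: subscriber ID -> {movie: (rating, date rated)}
anonymized = {
    101: {
        "Movie A": (5, date(2005, 3, 1)),
        "Movie B": (2, date(2005, 3, 4)),
        "Movie C": (4, date(2005, 4, 10)),
    },
    102: {
        "Movie A": (3, date(2005, 1, 15)),
        "Movie D": (5, date(2005, 2, 2)),
    },
}

# Hypothetical public reviews posted under a username on another site
public = {
    "film_fan_42": {
        "Movie A": (5, date(2005, 3, 2)),
        "Movie C": (4, date(2005, 4, 11)),
    },
}


def match_score(anon_record, public_record, max_rating_gap=1, max_day_gap=3):
    """Count movies where rating and date roughly agree (assumed tolerances)."""
    score = 0
    for movie, (pub_rating, pub_date) in public_record.items():
        if movie not in anon_record:
            continue
        anon_rating, anon_date = anon_record[movie]
        close_rating = abs(anon_rating - pub_rating) <= max_rating_gap
        close_date = abs((anon_date - pub_date).days) <= max_day_gap
        if close_rating and close_date:
            score += 1
    return score


def best_match(public_record, anonymized_records):
    """Return the subscriber ID with the highest score, plus all scores."""
    scores = {
        sid: match_score(rec, public_record)
        for sid, rec in anonymized_records.items()
    }
    return max(scores, key=scores.get), scores


best, scores = best_match(public["film_fan_42"], anonymized)
print(best, scores)
```

In this toy data, the public reviewer lines up with subscriber 101 on two movies and with subscriber 102 on none, so the "anonymous" ID 101 is the likely match. Once that link is made, every other rating under ID 101 is tied to the public username as well.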
According to the lawsuit, "the Brokeback Mountain factor" could be combined with other user information that can be used by marketing companies to target and categorize consumers.
The suit seeks more than $2,500 in damages for each of more than 2 million Netflix customers, which could break Netflix's contest dreams pretty quickly.
In addition to the monetary request, the suit seeks to stop Netflix from launching a second contest to improve its recommendation engine. For this contest, the company is set to release customer information such as ZIP codes, ages, genders and movie ratings. Although user names will be replaced with ID numbers, it won't be hard to identify individuals.
Paul Ohm, writing on the blog of Princeton's Center for Information Technology Policy, identifies the problem like this:
"Researchers have known for more than a decade that gender plus ZIP code plus birthdate uniquely identifies a significant percentage of Americans (87% according to Latanya Sweeney's famous study.) True, Netflix plans to release age not birthdate, but simple arithmetic shows that for many people in the country, gender plus ZIP code plus age will narrow their private movie preferences down to at most a few hundred people. Netflix needs to understand the concept of "information entropy": even if it is not revealing information tied to a single person, it is revealing information tied to so few that we should consider this a privacy breach."
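The "simple arithmetic" Ohm mentions can be roughed out in a few lines of Python. This is a back-of-the-envelope sketch, not his calculation: the population, ZIP code count, and age range below are round figures assumed for illustration, and the math pretends people are spread evenly across every combination, which they are not.

```python
import math


def anonymity_set_size(population, zip_codes, genders, ages):
    """Average number of people sharing one (ZIP, gender, age) combination,
    assuming an even spread across all combinations."""
    return population / (zip_codes * genders * ages)


# Round figures assumed for illustration only
population = 300_000_000  # rough US population
zip_codes = 40_000        # rough number of ZIP codes
genders = 2               # as recorded in the data set
ages = 80                 # distinct ages in years

size = anonymity_set_size(population, zip_codes, genders, ages)

# "Information entropy" view: bits needed to single out one person,
# bits given away by the three attributes, and bits still left over.
bits_total = math.log2(population)
bits_revealed = math.log2(zip_codes * genders * ages)
bits_left = math.log2(size)

print(f"average group size: {size:.1f} people")
print(f"bits needed: {bits_total:.1f}, revealed: {bits_revealed:.1f}, "
      f"left: {bits_left:.1f}")
```

Under these assumptions the average group comes out to about 47 people. Real ZIP codes vary widely in size, so some groups would be much larger and some would contain a single person, but the order of magnitude fits Ohm's "at most a few hundred people."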
While this specific lawsuit may be seen as overblown or opportunistic, it's not hard to see how other supposedly anonymous data may get companies into trouble online.
The problem lies not really in revealing what movies people watch online (though whoever helped Crash become the most popular Netflix film should be embarrassed), but in how different data sets, when combined, can reveal much more than individuals would like to share on the internet.