It looks like Netflix might be spending more than $1 million on a recent campaign to improve its recommendation engine. The movie rental company recently held a contest that successfully improved its recommendations by more than 10%. But now a closeted lesbian is suing the company for invasion of privacy, saying that she could have been outed because the data Netflix shared wasn’t quite so anonymous.

While her claims may be spurious, this could have legal implications for the ways user information is shared and stored online.

The Doe v. Netflix lawsuit alleges that film preferences are personal information that Netflix does not properly protect. From the filing:

“Jane Doe, a lesbian, who does not want
her sexuality nor interests in gay and lesbian themed films broadcast
to the world, seeks anonymity in this action.

To some, renting a movie such as
“Brokeback Mountain” or even “The Passion of the Christ” can be a
personal issue that they would not want published to the world.

The Brokeback Mountain Factor is
described thusly: Our secrets, great or small, can now without our
knowledge hurtle around the globe at the speed of light, preserved
indefinitely for future recall in the electronic limbo of computer
memories. These technological and economic changes in turn have made
legal barriers more essential to the preservation of our privacy.”

The problem arose when Netflix sent the information to contestants competing to improve Netflix’s recommendation engine and win $1 million earlier this year.

According to Wired:

“In order to get a better movie recommendation algorithm, the online DVD rental company gave more than 50,000 Netflix Prize
contestants two massive datasets. The first included 100 million movie
ratings, along with the date of the rating, a unique ID number for the
subscriber, and the movie info. Based on this data from 480,000
customers, contestants had to come up with a recommendation algorithm
that could predict 10 percent better than Netflix how those same
subscribers rated other movies.”

But given that information, it turns out to be pretty easy to find out about Netflix’s users. A few weeks into the contest, two University of Texas researchers, Arvind Narayanan and Vitaly Shmatikov, were able to identify Netflix customers (and their political leanings and sexual orientation) by comparing supposedly anonymous Netflix ratings with ones posted on IMDb.
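To see why this kind of cross-referencing works, here is a minimal Python sketch of a linkage attack. It is not the researchers’ actual method; the subscriber record, the reviewer names, the dates and the three-day matching window are all made up for illustration. The idea is simply that a handful of movies rated on roughly the same dates in both places is enough to single someone out.

```python
from datetime import date

# Hypothetical "anonymous" record: movie title -> date the subscriber rated it.
anon_subscriber = {
    "Brokeback Mountain": date(2005, 12, 20),
    "Crash": date(2005, 6, 2),
    "The Passion of the Christ": date(2004, 9, 14),
}

# Hypothetical public review profiles: reviewer name -> {movie title: review date}.
public_profiles = {
    "reviewer_a": {
        "Brokeback Mountain": date(2005, 12, 21),
        "Crash": date(2005, 6, 3),
        "The Passion of the Christ": date(2004, 9, 14),
    },
    "reviewer_b": {
        "Crash": date(2005, 8, 30),
        "Brokeback Mountain": date(2006, 2, 1),
    },
}


def match_score(anon_ratings, public_reviews, max_days=3):
    """Count movies present in both records whose dates differ by at most max_days."""
    score = 0
    for movie, rated_on in anon_ratings.items():
        reviewed_on = public_reviews.get(movie)
        if reviewed_on is not None and abs((rated_on - reviewed_on).days) <= max_days:
            score += 1
    return score


def best_match(anon_ratings, profiles, max_days=3):
    """Return the public profile that overlaps most with the anonymous record."""
    scores = {name: match_score(anon_ratings, reviews, max_days)
              for name, reviews in profiles.items()}
    name = max(scores, key=scores.get)
    return name, scores[name]


if __name__ == "__main__":
    print(best_match(anon_subscriber, public_profiles))
```

In this toy data, the anonymous record lines up with one public profile on all three movies and with the other on none, so the ID number effectively becomes a name.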

According to the lawsuit, “the Brokeback Mountain factor” could be combined with other user information, which marketing companies could then use to target and categorize consumers.

The suit seeks more than $2,500 in damages for each of more than 2 million Netflix customers, or more than $5 billion in total, which could break Netflix’s contest dreams pretty quickly.

In addition to the monetary request, the suit seeks to stop Netflix from launching a second contest to improve its recommendation engine. For this contest, the company is set to release customer information such as ZIP codes, ages and genders along with movie ratings. Although user names will be replaced with ID numbers, it won’t be hard to identify individuals.

Paul Ohm, writing on the blog of Princeton’s Center for Information Technology Policy, identifies the problem like this:

“Researchers have known for more than a decade that gender plus ZIP code
plus birthdate uniquely identifies a significant percentage of
Americans (87% according to Latanya Sweeney’s famous study.) True,
Netflix plans to release age not birthdate, but simple arithmetic shows
that for many people in the country, gender plus ZIP code plus age will
narrow their private movie preferences down to at most a few hundred
people. Netflix needs to understand the concept of “information
entropy”: even if it is not revealing information tied to a single
person, it is revealing information tied to so few that we should
consider this a privacy breach.”
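The “simple arithmetic” can be sketched in a few lines of Python. This is a back-of-the-envelope estimate, not Ohm’s own calculation: the ZIP code population of 30,000, the 80 distinct ages, the two gender values, the even spread of people across those buckets and the rough U.S. population of 300 million are all assumptions chosen for illustration.

```python
import math


def expected_group_size(zip_population, n_genders=2, n_ages=80):
    """Average number of people sharing one (ZIP, gender, age) combination,
    assuming residents are spread evenly across genders and ages."""
    return zip_population / (n_genders * n_ages)


def identifying_bits(group_size, total_population):
    """Bits of information needed to narrow total_population down to group_size."""
    return math.log2(total_population / group_size)


if __name__ == "__main__":
    group = expected_group_size(30_000)  # hypothetical ZIP code with 30,000 residents
    print(group)                         # 187.5
    print(identifying_bits(group, 300_000_000))  # a little over 20 bits
```

Under those assumptions, knowing gender, ZIP code and age shrinks the pool of candidates from the whole country to fewer than 200 people, which is in line with the “at most a few hundred” in the quote. A smaller ZIP code or one more known detail shrinks it further.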

While this specific lawsuit may be seen as overblown or opportunistic, it’s not hard to see how other supposedly anonymous data may get companies into trouble online.

The problem lies not really in revealing what movies people watch online (though whoever helped Crash become the most popular Netflix film should be embarrassed), but in how different data sets, when combined, can reveal much more than individuals would like to share on the internet.