You are what you say: Privacy risks of public mentions
Dan Frankowski et al., University of Minnesota, SIGIR 2006
Presenter: Chun-Yuan Teng, Natural Language Processing Lab, National Taiwan University
Motivation
• “Public data” + “Private data” + “IR Algorithm” = Privacy risk
Example of privacy risk
• Privacy risk: linking datasets with overlapping users
• “blog” + “purchase history” = “someone”
• Ex: 吳若權 (Wu Ruo-Quan, a Taiwanese author) or 紫微斗數 (Zi Wei Dou Shu, a form of Chinese astrology)
Examples of privacy encroachment
• People are judged by their preferences
• Ratings linked to forum mentions of porn?
Research questions
• Risks of dataset release
  • What are the risks to user privacy when releasing a dataset?
• Altering the dataset
  • How can dataset owners alter the dataset they release to preserve user privacy?
• Self defense
  • How can users protect their own privacy?
Experimental setup
• Ratings
  • Large
  • 140K users: max 6K ratings, average 90, median 33
  • 9K movies: max 49K ratings, average 1,403, median 207
  • 12.6M ratings
• Forum mentions
  • Small
  • 133 forum posters
  • 1,685 different movies
  • 3,828 movie mentions
RQ1: Risks of dataset release
• How do we evaluate the risks?
• Which algorithms pose a risk?
K-anonymity & K-identification
• K-anonymity (in cryptography)
  • Sweeney: “A dataset release provides k-anonymity protection if the information for each person contained in data cannot be distinguished from k-1 individuals in the data”
• K-identification
  • K-identification is a measure of how well an algorithm can narrow each user in a dataset to one of k users in another dataset
K-identification (cont.)
• Likely list
  • u1, s1
  • u2, s2
  • u3, s3 (t)
  • u4, s4
  • …
• Above, t is 3-identified, also 4-identified, 5-identified, etc., but NOT 2-identified
• We know target user t in ratings data, too
• t is k-identified if at position k or higher on the likely list.
• In paper, k=1,5,10,100. We’ll talk about 1-identification, because it’s the scariest.
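The likely-list check above can be sketched in a few lines of Python. This is a minimal illustration, not the authors' code: the function names are hypothetical, and the tie rule (users with the same score as t count against the attacker) is an assumption the slide does not state.

```python
from typing import Dict


def identification_rank(scores: Dict[str, float], target: str) -> int:
    """Position of the target on the likely list (1 = top).

    Assumption: ties are resolved pessimistically, so the rank is the
    number of users whose score is at least the target's score.
    """
    target_score = scores[target]
    return sum(1 for s in scores.values() if s >= target_score)


def is_k_identified(scores: Dict[str, float], target: str, k: int) -> bool:
    """True if the target sits at position k or higher on the likely list."""
    return identification_rank(scores, target) <= k


# Toy likely list mirroring the slide: the target t is u3, in third place.
likely_list = {"u1": 0.9, "u2": 0.7, "u3": 0.5, "u4": 0.1}
```

With this toy list, `is_k_identified(likely_list, "u3", 3)` holds while `is_k_identified(likely_list, "u3", 2)` does not, matching the slide's example.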
An observation about the data
• Rarely rated items may be good indicators
Algorithms to identify users
• Set Intersection algorithm
• TF-IDF algorithm
• Scoring algorithm
• Scoring algorithm with ratings
Set Intersection algorithm
Set Intersection algorithm
• Find users who rated EVERY movie the target user mentioned
• They all get the same likelihood score
• Ignores the rating value entirely
• RESULT: 1-identification rate: 7%
• MEANING: 7% of the time there was one user at the top of the likely list, and it was the target user
• Room for improvement
  • For a target user with many mentions, no candidate may remain
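A minimal Python sketch of the set intersection idea, assuming the ratings data is held as a dictionary from user ID to the set of movies that user rated. The data layout and the function names are illustrative assumptions, not taken from the paper.

```python
from typing import Dict, List, Set


def set_intersection_candidates(
    ratings: Dict[str, Set[str]], mentions: Set[str]
) -> List[str]:
    """Users who rated every mentioned movie; rating values are ignored."""
    return sorted(u for u, rated in ratings.items() if mentions <= rated)


def is_1_identified(
    ratings: Dict[str, Set[str]], mentions: Set[str], target: str
) -> bool:
    """True if the target is the only candidate left."""
    return set_intersection_candidates(ratings, mentions) == [target]


# Toy ratings data (hypothetical users and movies).
toy_ratings = {
    "u1": {"A", "B", "C"},
    "u2": {"A", "B"},
    "u3": {"B", "C", "D"},
}
```

Mentions `{"A", "C"}` leave only `u1`, a widely rated movie such as `{"B"}` leaves everyone, and `{"A", "D"}` leaves no candidate at all, which is the failure mode noted for targets with many mentions.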
TF-IDF algorithm
• Score each user by similarity to the target user. Score more highly if
  • User has rated more mentions of target
  • User has rated mentions of rarely rated movies
• For us: “word” is a movie, “document” (bag of words) is a user
• Score is cosine similarity to the target user
• RESULTS: 1-identification rate of 20% (compared to 7% from Set Intersection)
• Room for improvement
  • Over-weights any mention for a ratings user who has rated few movies: high-scoring users have 4 ratings and 1 mention
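The TF-IDF scoring can be sketched as follows. The slide does not give the exact weighting, so this sketch assumes binary term frequency and `idf = log(N / df)`, where `df` is the number of users who rated a movie; all names are hypothetical.

```python
import math
from typing import Dict, Set


def idf_weights(ratings: Dict[str, Set[str]]) -> Dict[str, float]:
    """idf per movie: log(number of users / number of users who rated it)."""
    n_users = len(ratings)
    df: Dict[str, int] = {}
    for rated in ratings.values():
        for movie in rated:
            df[movie] = df.get(movie, 0) + 1
    return {movie: math.log(n_users / count) for movie, count in df.items()}


def tfidf_scores(
    ratings: Dict[str, Set[str]], mentions: Set[str]
) -> Dict[str, float]:
    """Cosine similarity between the target's mentions and each user's ratings."""
    idf = idf_weights(ratings)
    # Movies nobody rated get weight 0 (an assumption of this sketch).
    target_vec = {m: idf.get(m, 0.0) for m in mentions}
    target_norm = math.sqrt(sum(w * w for w in target_vec.values()))
    scores: Dict[str, float] = {}
    for user, rated in ratings.items():
        dot = sum(idf[m] * target_vec[m] for m in rated if m in target_vec)
        norm = math.sqrt(sum(idf[m] ** 2 for m in rated))
        scores[user] = dot / (norm * target_norm) if norm and target_norm else 0.0
    return scores


# Toy data: "B" is rated by everyone, so it carries no weight.
toy = {"u1": {"A", "B"}, "u2": {"B", "C"}, "u3": {"B"}, "u4": {"B", "C", "D"}}
toy_scores = tfidf_scores(toy, {"A", "B"})
```

In the toy data `u1` has only two ratings, one of them the rare mention `A`, and ends up with a similarity of 1. This is the over-weighting of users with few ratings that the slide lists as room for improvement.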
Scoring algorithm
• Emphasizes mentions of rarely-rated movies, de-emphasizes number of ratings a user has
• A user who has rated a mention is 10-20 times more likely to be the target user than one who has not
Examples
• Example
  • Target user t mentioned A, B, C, rated 20, 50, 1000 times (from 10,000 users)
  • User u1 rated A, user u2 rated B, C
  • u1 score: 0.9981 * 0.05 * 0.05 = 0.0025
  • u2 score: 0.05 * 0.9501 * 0.9001 = 0.043
  • u2 more likely to be target t
• Rating a mention is good, rare even better
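The arithmetic in this example can be reproduced with a short Python sketch. The sub-score formula used here, `1 - (count - 1) / n_users` for a rated mention and a flat 0.05 for an unrated one, is an assumption inferred from the 0.9981 and 0.9001 factors on the slide; the slide itself does not state it, and the function names are hypothetical.

```python
from typing import Dict, Set

UNRATED_SUBSCORE = 0.05  # factor the slide uses when a mention was not rated


def subscore(
    user_rated: Set[str], movie: str, rating_counts: Dict[str, int], n_users: int
) -> float:
    """Assumed sub-score: close to 1 for a rated rare movie, 0.05 if unrated."""
    if movie in user_rated:
        return 1.0 - (rating_counts[movie] - 1) / n_users
    return UNRATED_SUBSCORE


def score(
    user_rated: Set[str],
    mentions: Set[str],
    rating_counts: Dict[str, int],
    n_users: int,
) -> float:
    """Product of sub-scores over every movie the target mentioned."""
    result = 1.0
    for movie in sorted(mentions):
        result *= subscore(user_rated, movie, rating_counts, n_users)
    return result


counts = {"A": 20, "B": 50, "C": 1000}  # rating counts from the slide
mentions = {"A", "B", "C"}
u1_score = score({"A"}, mentions, counts, 10_000)
u2_score = score({"B", "C"}, mentions, counts, 10_000)
```

Under this assumed formula `u1_score` comes out at about 0.0025, as on the slide. For B with 50 ratings the formula gives a factor of 0.9951 rather than the 0.9501 shown (0.9501 would correspond to 500 ratings), so `u2_score` is about 0.045 instead of 0.043. The ordering is unchanged: u2 is still the more likely target.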
Scoring algorithm with ratings
• Same as the scoring algorithm above
• Adds a threshold so that the rating value can be used as a feature
Percent of k-identified
RQ2: Altering the dataset
• Perturbation: change rating values
  • Rating values are not needed for identification
• Generalization: group items
  • Dataset becomes less useful
• Suppression: hide data
  • Used in the following experiments
RQ2: Altering the dataset
• We won’t modify forum data; users wouldn’t like it. Focus on ratings data
• Rarely-rated items are identifying
• IDEA: Release a ratings dataset suppressing all “rarely-rated” items
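The suppression idea can be sketched in Python as a filter over the ratings data. The data layout (user to movie to rating value) and the `min_count` cutoff are hypothetical; the slide does not say how "rarely-rated" is defined.

```python
from collections import Counter
from typing import Dict


def suppress_rare_items(
    ratings: Dict[str, Dict[str, float]], min_count: int
) -> Dict[str, Dict[str, float]]:
    """Return a copy of the ratings without movies rated fewer than min_count times."""
    counts = Counter(
        movie for user_ratings in ratings.values() for movie in user_ratings
    )
    return {
        user: {m: r for m, r in user_ratings.items() if counts[m] >= min_count}
        for user, user_ratings in ratings.items()
    }


# Toy dataset: "rare" is rated by a single user and gets dropped.
raw = {
    "u1": {"popular": 4.0, "rare": 5.0},
    "u2": {"popular": 3.5},
    "u3": {"popular": 2.0},
}
released = suppress_rare_items(raw, min_count=2)
```

Here `released` keeps every user's rating of `popular` and drops the single rating of `rare`. The input dictionary is left unmodified.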
RQ2: Altering the dataset
RQ3: Self defense
• The question is how users can protect their own privacy
• Suppression: suppress mentions of rarely-rated movies
  • May not be accepted by users
• Misdirection
Suppression
• Not significant if more than 20%
Misdirection
• Mentioning popular items is more effective
• When a popular item is mentioned, more users have their score increased
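One way a user might pick items for misdirection is sketched below in Python. It assumes misdirection means mentioning items the user has not rated, which the slide does not spell out, and it simply takes the most frequently rated ones; the function name and data are hypothetical.

```python
from typing import Dict, List, Set


def misdirection_mentions(
    user_rated: Set[str], rating_counts: Dict[str, int], n: int
) -> List[str]:
    """The n most-rated movies the user has NOT rated (ties broken by name)."""
    unrated = [m for m in rating_counts if m not in user_rated]
    unrated.sort(key=lambda m: (-rating_counts[m], m))
    return unrated[:n]


popularity = {"A": 20, "B": 50, "C": 1000, "D": 700}
decoys = misdirection_mentions({"A", "C"}, popularity, n=2)
```

For a user who rated A and C, the decoys are D and then B. Because D has many raters, mentioning it raises the score of many other users at once.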
Conclusion
• A new problem in IR
  • Interesting and hard
• Hard to preserve privacy
  • You need to suppress a large amount of data