Exploring Linkability of User Reviews. Mishari Almishari and Gene Tsudik, Computer Science Department, University of California, Irvine. malmisha,gts@ics.uci.edu
Increasing Popularity of Reviewing Sites • Yelp, more than 39M visitors and 15M reviews in 2010
[Figure labels: Category, Rating]
How Does Privacy Apply to Reviews? • Traceability • Linkability of Ad Hoc Reviews • Linkability of Several Accounts
Contribution • Extensive study to measure privacy/linkability in user reviews • Propose models that adequately identify authors
IR: Identified Record • AR: Anonymous Record • [Figure: records labeled IR or AR]
Top-X Linkability • Anonymous Record (AR) size: 1, 5, 10, 20, …, 60 • X: 1 and 10 • Matching Model • Identified Record (IR) size
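The Top-X measure above can be illustrated with a minimal sketch. It assumes the matching model returns, for each AR, a list of candidate authors ordered from best to worst match; the function name `top_x_linkability` and the toy data are hypothetical and do not come from the slides.

```python
def top_x_linkability(rankings, true_authors, x):
    """Fraction of ARs whose true author appears among the first x ranked IRs."""
    hits = sum(
        1
        for ar_id, ranked in rankings.items()
        if true_authors[ar_id] in ranked[:x]
    )
    return hits / len(rankings)


# Toy data (hypothetical): each AR maps to IR owners, best match first.
rankings = {
    "ar1": ["alice", "bob", "carol"],
    "ar2": ["carol", "bob", "alice"],
    "ar3": ["bob", "alice", "carol"],
    "ar4": ["bob", "carol", "alice"],
}
true_authors = {"ar1": "alice", "ar2": "bob", "ar3": "carol", "ar4": "bob"}

print(top_x_linkability(rankings, true_authors, 1))  # 0.5
print(top_x_linkability(rankings, true_authors, 2))  # 0.75
```

Top-1 counts only exact first-place matches, while Top-10 in the slides would count an AR as linked if its author is anywhere in the first ten candidates.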
Dataset • 1 Million Reviews • 2000 Users • More than 300 reviews per user
Methodology • Naïve Bayesian Model • Kullback-Leibler Model • Symmetric Version
Naïve Bayesian (NB) • Input: Anonymous Record (AR), Identified Records (IRs) • Output: list of IRs sorted in decreasing order
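A minimal sketch of NB ranking over letter unigrams follows. It assumes add-one (Laplace) smoothing and log-probabilities, neither of which is stated in the slides; the names `letter_counts`, `nb_log_score`, and `rank_by_nb` and the sample texts are hypothetical.

```python
import math
from collections import Counter

ALPHABET = "abcdefghijklmnopqrstuvwxyz"


def letter_counts(text):
    """Count lowercase ASCII letters; everything else is ignored."""
    return Counter(c for c in text.lower() if c in ALPHABET)


def nb_log_score(ar_counts, ir_counts, alpha=1.0):
    """log P(AR | IR) under a unigram model.

    ASSUMPTION: add-alpha smoothing so unseen letters get nonzero probability.
    """
    total = sum(ir_counts.values()) + alpha * len(ALPHABET)
    return sum(
        n * math.log((ir_counts[t] + alpha) / total)
        for t, n in ar_counts.items()
    )


def rank_by_nb(ar_text, irs):
    """Return IR owners sorted by decreasing NB score (best match first)."""
    ar = letter_counts(ar_text)
    scores = {
        name: nb_log_score(ar, letter_counts(text))
        for name, text in irs.items()
    }
    return sorted(scores, key=scores.get, reverse=True)


# Hypothetical identified records and one anonymous record.
irs = {
    "alice": "zzzz zzzz pizza pizza",
    "bob": "aaaa banana salad",
}
print(rank_by_nb("pizza jazz", irs))  # ['alice', 'bob']
```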
Kullback-Leibler Divergence (KLD) • Input: Anonymous Record (AR), Identified Records (IRs) • Output: list of IRs sorted in increasing order
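A minimal sketch of KLD ranking over letter unigrams follows. The slides mention a symmetric version without giving its form; this sketch assumes the common choice D(P||Q) + D(Q||P), and it assumes add-one smoothing to avoid division by zero. The names `letter_distribution`, `kld`, `sym_kld`, and `rank_by_kld` and the sample texts are hypothetical.

```python
import math
from collections import Counter

ALPHABET = "abcdefghijklmnopqrstuvwxyz"


def letter_distribution(text, alpha=1.0):
    """Smoothed letter distribution. ASSUMPTION: add-alpha smoothing."""
    counts = Counter(c for c in text.lower() if c in ALPHABET)
    total = sum(counts.values()) + alpha * len(ALPHABET)
    return {t: (counts[t] + alpha) / total for t in ALPHABET}


def kld(p, q):
    """Kullback-Leibler divergence D(P || Q) in nats."""
    return sum(p[t] * math.log(p[t] / q[t]) for t in p)


def sym_kld(p, q):
    """ASSUMPTION: symmetric version taken as D(P||Q) + D(Q||P)."""
    return kld(p, q) + kld(q, p)


def rank_by_kld(ar_text, irs):
    """Return IR owners sorted by increasing divergence (best match first)."""
    ar = letter_distribution(ar_text)
    dist = {
        name: sym_kld(ar, letter_distribution(text))
        for name, text in irs.items()
    }
    return sorted(dist, key=dist.get)


# Hypothetical identified records and one anonymous record.
irs = {
    "alice": "zzzz zzzz pizza pizza",
    "bob": "aaaa banana salad",
}
print(rank_by_kld("pizza jazz", irs))  # ['alice', 'bob']
```

Unlike NB, a smaller value means a closer match, which is why the list is sorted in increasing order.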
Tokens • Unigram: 'a', …, 'z' • Digram: 'aa', 'ab', …, 'zz' • Rating: 1, 2, 3, 4, 5 • Category: Restaurant, Beauty and Spa, Education
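The lexical tokens above can be extracted with a short sketch. It assumes text is lowercased, non-letters are dropped, and digrams are adjacent letter pairs that do not cross word boundaries; the slides do not specify these details, and the function names are hypothetical.

```python
import re
import string
from collections import Counter


def unigram_counts(text):
    """Count single letters 'a'..'z' in the review text."""
    return Counter(c for c in text.lower() if c in string.ascii_lowercase)


def digram_counts(text):
    """Count adjacent letter pairs 'aa'..'zz'.

    ASSUMPTION: pairs are taken within each run of letters only.
    """
    counts = Counter()
    for word in re.findall(r"[a-z]+", text.lower()):
        counts.update(word[i:i + 2] for i in range(len(word) - 1))
    return counts


review = "Great food!"
print(unigram_counts(review)["o"])   # 2
print(digram_counts(review)["oo"])   # 1

# Size of each lexical token space: 26 unigrams versus 26 * 26 digrams.
print(len(string.ascii_lowercase), len(string.ascii_lowercase) ** 2)  # 26 676
```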
NB - Unigram • Size 60: LR 83% (Top-1), LR 96% (Top-10)
KLD - Unigram • Size 60: LR 83% (Top-1), LR 96% (Top-10)
NB - Digram • Size 20: LR 97% (Top-1) • Size 10: LR 88% (Top-1)
KLD - Digram • Size 60: LR 99% (Top-1) • Size 30: LR 75% (Top-1)
Combining Tokens • In the NB model: straightforward • P(Rating|IR), P(Category|IR) • But for KLD? • Weighted average
Weighted Average for KLD • First, combine rating and category (weight 0.5) • Second, combine non-lexical and lexical (weight 0.997/0.97 for unigram/digram)
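The two-step weighted average for KLD can be sketched as follows. The slides give the weights (0.5, then 0.997 for unigram or 0.97 for digram) but not which term each weight multiplies; this sketch assumes the second-step weight applies to the lexical distance, and all names are hypothetical.

```python
def weighted_average(a, b, weight):
    """Return weight * a + (1 - weight) * b."""
    return weight * a + (1.0 - weight) * b


def combined_distance(d_rating, d_category, d_lexical,
                      w_nonlex=0.5, w_final=0.997):
    """Combine three per-token-type KLD values into one distance.

    Step 1: average rating and category distances (weight 0.5).
    Step 2: mix the lexical and non-lexical distances.
    ASSUMPTION: w_final multiplies the lexical distance; the slides do not
    say which side the 0.997/0.97 weight belongs to.
    """
    non_lexical = weighted_average(d_rating, d_category, w_nonlex)
    return weighted_average(d_lexical, non_lexical, w_final)


# Illustrative distances only, not values from the study.
print(combined_distance(0.2, 0.4, 0.1))                  # unigram weight
print(combined_distance(0.2, 0.4, 0.1, w_final=0.97))    # digram weight
```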
Rating, Category, and Unigram - NB • Gain up to 20% • Size 30: 60% to 80% • Size 60: 83% to 96%
Rating, Category, and Unigram - KLD • Gain up to 12% • Size 40: 68% to 80% • Size 60: 83% to 92%
Restricted IR - NB • Affected by IR size
Restricted IR - KLD • Performed better for smaller IRs • Size 20 or less: improved • The rest: comparable
[Figure: Anonymous Records (ARs), Matching Model, Identified Records (IRs)]
MatchAll - Restricted • Gain up to 16% • Size 30: from 74% to 90%
MatchAll - Full • Gain up to 23% • Size 20: from 35% to 55%
Changing it to: • + Review Length • 0.5
Results - Improvement (3) • Gain up to 5% • Size 10: 89% to 92% • Size 7: 79% to 84%
Discussion • Implications • Cross-Referencing • Review Spam • Non-Prolific Users • Gradually become prolific • IR of 20: linkability around 70% • Anonymous Record Size • Linkability is high even for small ARs (92% for AR of 10) • AR of 60 is only 20% of the minimum user contribution
Discussion (cont.) • Unigram Token • Very comparable for larger ARs • Entails fewer resources in the attack: 26 vs. 676 tokens
Future Directions • Further improvement for small ARs • Other Probabilistic Models • Using Stylometry • Exploring Linkability in Other Preference Databases • More than one AR for different users: explore further
Conclusion • Extensive Study to Assess Linkability of User Reviews • For a large set of users • Using very simple features • Users are very exposed even with simple features and a large number of authors