210 likes | 293 Views
Turning Privacy Leaks into Floods : Surreptitious Discovery of Social Network Friendships. Michael T. Goodrich Univ. of California, Irvine joint w/ Arthur U. Asuncion. Problem Definition. Discover the friendships. Problem Definition. Discover the friendships. Leveraging Information Leaks.
E N D
Turning Privacy Leaks into Floods: Surreptitious Discovery of Social Network Friendships Michael T. Goodrich Univ. of California, Irvine joint w/ Arthur U. Asuncion
Problem Definition • Discover the friendships
Problem Definition • Discover the friendships
Leveraging Information Leaks • Leak: Friendship list can be viewed by friends-of-friends. This allows: • Given two people, X and Y, we can tell whether X and Y have a friend in common. • Leverage: We use this to discover the friends list for members of the network
Abstracting the Problem • Viewed abstractly, we are trying to learn binary attribute vectors.
Group Testing • Input:n items, numbered 0,1, …, n-1, at most d of which are defective. • Output: the indices of the defective items. • Items can be grouped into subsets, each of which can be tested to see it contains a defective item or not. • Goal: minimize the total number of tests • Original problem: Testing blood samples.
Testing Schemes • Non-adaptive: All tests must be done in parallel • Adaptive: Tests can be done sequentially • Adaptive is easier, but our framework requires a non-adaptive approach
Facebook Application • Each member has a “vector” of friendships • For any member M, the system returns a bit for whether M has a friend in common with the attacker, even if M restricts this information to friends-of-friends • We can use non-adaptive scheme to learn friendship relationships in any sub-community in Facebook.
DNA Application • DNA sequences are stored in a database, D. • For any sequence Q, the database returns a score for how close Q is to each sequence in D • We form a binary vector w.r.t. places where mutations happen relative to a reference string R • We can use non-adaptive scheme to learn DNA strings in D.
Netflix Application • Movie ratings vectors are stored in a database, D. • For any vector V, the database returns a score for how close V is to each vector in the database • We can form a binary attribute vector for movies • We can use non-adaptive scheme to learn ratings vectors in D.
n t M Matrix View of Testing • A non-adaptive testing regimen can be viewed as a t x n binary matrix M: • M[i,j] = 1 if and only if test i includes item j • M is d-disjunct if the Boolean sum of any d columns does not contain any other column. • An item is defective iff all its tests are positive • M is d-separable if the Boolean sums of each set of at most d columns are distinct (harder analysis algorithm)
Randomized Approach • Use a randomized approach motivated by Bloom filtering. • Construct a matrix M, but relax requirements • Given a set D of d columns in M and a column j, say j is distinguishable from D if there is a row i such that M[i,j]=1 but M[i,j’]=0 for each j’ in D. • M is D-distinguishable if, for a particular collection D of subsets, the matrix M will find them distinguishable.
Constructing the Matrix • Given t (set in the analysis), let M be a 2t x n matrix defined randomly: • For each column j, choose t/d rows of M at random and set these entries to 1. • that is, we “inject” j into those t/d tests
Technique for Social Networks • Insert a small set of network members • Form connections with random network members • Test common-friends condition for the fictional members Image from http://www.politicsforum.org/images/flame_warriors/flame_53.php
Exploiting Sparse Data Sets Histogram of differences from R: Table of sizes, lengths, and differences from R:
Number of Tests Needed in Theory 1st column: To clone entire database with high probability 2nd column: To clone sparsest 50% of database with high probability 3rd column: To clone entire database with probability 1
Different Choices for “d” • Tradeoff: • The smaller the “d”, the faster we can recover sparse vectors • With very small “d”, it can take a long time to recover the vectors that are not so sparse. • But most vectors are sparse so we generally want a pretty small “d” Attack on a Netflix user who has rated 98 movies. With smaller “d”, the rate of convergence is faster.
Here we vary “d” on the x-axis and we plot the mean and median number of tests required across the vectors in the database. Different choices for “d”
More tests are needed for vectors which are further from the reference R (but note most vectors are close to R). We also see the tradeoff between various “d” Distance from R
Thresholding Behavior • There are critical values of our estimated value for d:
Conclusion and Future Work • We have presented a way to turn privacy leaks into floods, with a number of applications: • Social networks • DNA databases • Ratings vectors • Future work: extend our approach to non-binary vectors (e.g., friends and foes)