DE-ANONYMIZING SOCIAL NETWORKS Ilan Komargodski Foundations of Privacy
Overview • Introduction • Related Work • Model and Definitions • The Attack • Experiments • Conclusion
Why Publish Social Network Data? (Why even bother?) • Research (academic and government data-mining): sociology, economics, epidemiology, security • Advertising partners • Third-party applications • Aggregation projects
What is a Social Network Graph? A social network graph consists of nodes (individuals), edges (relations) and information associated with each node and edge. Attributes attached to nodes can be very sensitive (e.g. name, credit card number). Note: The existence of an edge between two nodes can also be very sensitive (e.g. “had a romantic relationship”).
Social Network Operator Needs to publish network information that doesn’t reveal any “personal data”. [Figure: nodes labeled with random identifiers such as W25LO and YCX38; labels: Anonymization, Anonymity, Privacy]
Related Work • Backstrom et al. [2] present two active attacks on edge privacy, assuming that the adversary is able to change the network before its release. The adversary creates O(log N) nodes, which are enough to re-identify O(log² N) users. • Disadvantages: • Limited to online social networks (and difficult even there) • No control over incoming edges • Easy to detect (the pattern stands out)
Data Release Model • Select subset of nodes and subsets of node and edge attributes to be released • Compute the induced subgraph on Vsan • Remove some edges and add fake edges (perturbation) • Summary: a sanitized subset of nodes and edges with the corresponding attributes.
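The release steps above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function name `sanitize`, the fraction parameters and the uniform random choices are hypothetical and not taken from the slides, and attribute selection is left out.

```python
import random


def sanitize(nodes, edges, keep_fraction, remove_fraction, add_fraction, seed=0):
    """Hypothetical sketch of the release model (attributes omitted).

    nodes: iterable of node ids; edges: set of directed (u, v) pairs.
    Returns (v_san, e_san).
    """
    rng = random.Random(seed)
    nodes = sorted(nodes)

    # 1. Select the subset of nodes to release (Vsan).
    v_san = set(rng.sample(nodes, int(len(nodes) * keep_fraction)))

    # 2. Induced subgraph: keep only edges with both endpoints in Vsan.
    induced = {(u, v) for (u, v) in edges if u in v_san and v in v_san}

    # 3a. Perturbation: remove a fraction of the real edges.
    removed = set(rng.sample(sorted(induced), int(len(induced) * remove_fraction)))
    e_san = induced - removed

    # 3b. Perturbation: add fake edges that were not in the induced subgraph.
    candidates = sorted(
        (u, v) for u in v_san for v in v_san if u != v and (u, v) not in induced
    )
    n_add = min(int(len(induced) * add_fraction), len(candidates))
    e_san |= set(rng.sample(candidates, n_add))

    return v_san, e_san
```

Sorting before sampling keeps the output reproducible for a fixed `seed`; a real operator would of course choose what to release by policy rather than uniformly at random.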
The Attacker Model • Many types of attackers (government, marketing, spammers, etc.) • Usually have access to a different network Saux whose membership partially overlaps with S • Saux might be extracted from S itself (automatic crawling, malicious third-party application) • Aggregation projects • Collusion with an operator of a different network [Figure: diagrams of how S and Saux can overlap]
Aux. Information Def. • Saux: a graph Gaux = (Vaux, Eaux) and a set of probability distributions AuxX and AuxY, one for each attribute of every node in Vaux and each attribute of every edge in Eaux. For example: P[X = “friendship”] = 0.8, P[X = “contact”] = 0.2 • Seeds: the attacker possesses detailed information about a very small number of members of S. • How?
Re-identification Algorithm Def. • Re-identification algorithm: a probabilistic mapping µ: Vsan × Vaux → [0, 1], where µ(vaux, vsan) is the probability that vaux is mapped to vsan. • Privacy breach: for a node vaux in Vaux, let µREAL(vaux) = vsan. We say that the privacy of vsan is breached w.r.t. adversary Adv and privacy parameter δ if, for some private attribute X: Adv[X, vaux, X[vaux]] - Aux[X, vaux, X[vaux]] > δ
Success of an Attack • Fraction of nodes re-identified: a fairly meaningless metric on its own, because it treats all nodes as equally important. • Instead, weight each node in proportion to its importance in the network, e.g. degree centrality, where each node is weighted in proportion to its degree.
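A degree-weighted success rate of this kind might be computed as follows. This is a minimal sketch: the function name `weighted_success` and the dictionary-based inputs are assumptions made for illustration, not the presenter's code.

```python
def weighted_success(mapping, truth, degree):
    """Degree-weighted fraction of correctly re-identified nodes.

    mapping: dict aux node -> sanitized node produced by the attack
    truth:   dict aux node -> sanitized node (ground truth)
    degree:  dict aux node -> degree, used as the importance weight
    """
    total = sum(degree[v] for v in truth)
    if total == 0:
        return 0.0
    # Only nodes mapped to their true counterpart contribute their weight.
    correct = sum(degree[v] for v in truth if mapping.get(v) == truth[v])
    return correct / total
```

For example, re-identifying one hub of degree 6 out of three nodes with degrees 6, 3 and 1 scores 0.6 under this weighting, although only a third of the nodes were re-identified.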
Seed Identification (graphs: TARGET and AUXILIARY) • Input: • A clique of k nodes known to be common to both graphs • Degree of each of these nodes (in AUXILIARY) • Number of common neighbors for each pair of these nodes (in AUXILIARY) • Algorithm: • Search TARGET for a unique k-clique that has: • matching degrees (within some error factor) • matching common-neighbor counts • Outputs µs, a partial mapping. • Complexity: • Exponential in k. BUT: • If the degree is bounded by d, the complexity is O(n·d^(k-1)). • Heavily input dependent (high running time => large number of matches)
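The clique search can be sketched as a simple backtracking procedure. This is a sketch under assumptions, not the authors' implementation: the name `identify_seeds`, the undirected adjacency-set representation and the relative error test `close` are all hypothetical choices made for illustration.

```python
def identify_seeds(adj, aux_degrees, aux_common, eps=0.1):
    """Search TARGET for a unique k-clique matching the AUXILIARY statistics.

    adj:         dict node -> set of neighbours in TARGET (undirected)
    aux_degrees: aux_degrees[i] is the degree of the i-th clique node in AUXILIARY
    aux_common:  aux_common[(i, j)], i < j, is the number of common neighbours
                 of clique nodes i and j in AUXILIARY
    eps:         relative error factor tolerated when comparing counts
    Returns a list whose i-th entry is the TARGET node matched to the i-th
    clique node, or None if no match or more than one match is found.
    """
    k = len(aux_degrees)

    def close(observed, expected):
        return abs(observed - expected) <= eps * expected

    matches = []

    def extend(partial):
        i = len(partial)
        if i == k:
            matches.append(list(partial))
            return
        # A candidate must be adjacent to every node chosen so far (clique),
        # so after the first node there are at most d candidates per level.
        if i == 0:
            candidates = set(adj)
        else:
            candidates = set.intersection(*(adj[u] for u in partial))
        for v in candidates:
            if v in partial or not close(len(adj[v]), aux_degrees[i]):
                continue
            if all(
                close(len(adj[u] & adj[v]), aux_common[(j, i)])
                for j, u in enumerate(partial)
            ):
                partial.append(v)
                extend(partial)
                partial.pop()

    extend([])
    return matches[0] if len(matches) == 1 else None
```

Returning `None` when several cliques match reflects the requirement that the k-clique be unique: an ambiguous seed would poison the later propagation stage.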
Propagation Algorithm • Input: two graphs and a partial seed mapping • Output: deterministic 1-1 mapping • Each iteration starts with the accumulated mapping, picks an unmapped node u in V1 and computes a score for each unmapped node v in V2. • Important: The algorithm finds new mappings using the topological structure of the network and the feedback from previous mappings.
Propagation Algorithm Cont. • Eccentricity: measures how much an item in a set X “stands out” from the rest. Defined by: ecc(X) = (max(X) - max2(X)) / σ(X), where max and max2 are the highest and second-highest values in X and σ is the standard deviation. • Edge Directionality: mapping scores for nodes u and v are computed separately for incoming and outgoing edges (and then summed). • Node Degrees: the plain score is biased in favor of high-degree nodes => divide by the square root of the degree (similar to cosine similarity). • Revisiting Nodes: as the algorithm progresses, the number of mapped nodes increases and the error rate decreases, so already-mapped nodes are revisited. • Reverse Match: every match is checked in both directions.
func eccentricity(items):                                   # Eccentricity
    return (max(items) - max2(items)) / std_dev(items)

func update_scores_by_direction(lgraph, rgraph, mapping, lnode, scores, direction = '→' or '←'):
    for edge (lnbr direction lnode) in lgraph.edges:
        if lnbr not in mapping: continue
        rnbr = mapping[lnbr]
        for edge (rnbr direction rnode) in rgraph.edges:
            if rnode in mapping.image: continue
            scores[rnode] += (1 / rnode.direction_degree) ^ 0.5   # Node Degrees

func update_scores(lgraph, rgraph, mapping, lnode):         # Edge Directionality
    scores = [0 for rnode in rgraph.nodes]   # initialize scores
    update_scores_by_direction(lgraph, rgraph, mapping, lnode, scores, '→')   # in
    update_scores_by_direction(lgraph, rgraph, mapping, lnode, scores, '←')   # out
    return scores

func find_match(lgraph, rgraph, mapping, node, theta):
    scores[node] = update_scores(lgraph, rgraph, mapping, node)
    if eccentricity(scores[node]) < theta: return NULL
    return the node from rgraph.nodes that maximizes scores[node]

func propagation_step(lgraph, rgraph, mapping):
    for lnode in lgraph.nodes:
        rnode = find_match(lgraph, rgraph, mapping, lnode, theta)
        if rnode == NULL: continue
        reverse_match = find_match(rgraph, lgraph, invert(mapping), rnode, theta)   # Reverse Match
        if reverse_match == lnode:
            mapping[lnode] = rnode

mapping = seed_mapping
while not convergence do:                                   # Revisiting Nodes
    propagation_step(lgraph, rgraph, mapping)
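For readers who want to run the propagation stage, the pseudocode can be turned into Python roughly as follows. This is a sketch under assumptions rather than the authors' code: the `DiGraph` class, the function names, the default `theta` and `max_rounds`, and the use of the population standard deviation are illustrative choices. One deliberate deviation is labeled in the code: a node's own current target is not counted as taken, so that already-mapped nodes can be revisited.

```python
from collections import defaultdict
from math import sqrt
from statistics import pstdev


class DiGraph:
    """Minimal directed graph: successor and predecessor sets per node."""

    def __init__(self, edges):
        self.succ = defaultdict(set)
        self.pred = defaultdict(set)
        self.nodes = set()
        for u, v in edges:
            self.succ[u].add(v)
            self.pred[v].add(u)
            self.nodes.update((u, v))


def eccentricity(values):
    """(max - second max) / standard deviation; 0 if it is undefined."""
    values = sorted(values, reverse=True)
    if len(values) < 2:
        return 0.0
    sd = pstdev(values)
    return (values[0] - values[1]) / sd if sd > 0 else 0.0


def match_scores(lg, rg, mapping, lnode):
    scores = {r: 0.0 for r in rg.nodes}
    # Assumption: targets held by OTHER nodes are excluded, but lnode's own
    # current target stays eligible so that revisiting can confirm it.
    taken = {r for l, r in mapping.items() if l != lnode}
    # First tuple: incoming edges of lnode; second tuple: outgoing edges.
    for l_nbrs, r_next, r_deg in ((lg.pred, rg.succ, rg.pred),
                                  (lg.succ, rg.pred, rg.succ)):
        for lnbr in l_nbrs[lnode]:
            if lnbr not in mapping:
                continue
            for rnode in r_next[mapping[lnbr]]:
                if rnode in taken:
                    continue
                # Normalize by the square root of the directional degree.
                scores[rnode] += 1 / sqrt(len(r_deg[rnode]))
    return scores


def find_match(lg, rg, mapping, lnode, theta):
    scores = match_scores(lg, rg, mapping, lnode)
    if not scores or eccentricity(scores.values()) < theta:
        return None
    return max(scores, key=scores.get)


def propagate(lg, rg, seeds, theta=0.5, max_rounds=20):
    """Repeat propagation steps until no mapping changes (theta must be > 0)."""
    mapping = dict(seeds)
    for _ in range(max_rounds):
        changed = False
        for lnode in sorted(lg.nodes):
            rnode = find_match(lg, rg, mapping, lnode, theta)
            if rnode is None:
                continue
            inverse = {r: l for l, r in mapping.items()}
            # Reverse match: rnode must pick lnode when the roles are swapped.
            if find_match(rg, lg, inverse, rnode, theta) != lnode:
                continue
            if mapping.get(lnode) != rnode:
                mapping[lnode] = rnode
                changed = True
        if not changed:
            break
    return mapping
```

A call such as `propagate(DiGraph(edges_aux), DiGraph(edges_san), seed_mapping)` returns the extended mapping; `edges_aux`, `edges_san` and `seed_mapping` are placeholder names for the caller's data.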
Complexity (d1 and d2 are bounds on the node degrees in V1 and V2) • Without revisiting nodes and reverse matches: O(|E1|d2) • Revisiting, assuming that a node v is revisited only if the number of already-mapped neighbors of v has increased: O(|E1|d1d2) • Reverse mapping: O((|E1| + |E2|)d1d2)
Experiments • In both networks, the API exposes: • Mandatory username • Optional name • Optional location
Ground Truth • A pair of nodes is matched if there is an exact match in username or name, or if the score δ(name, username, location) is high enough. • RESULT: 27,000 mappings, called µ(G). • The seed set is 150 pairs of randomly selected mappings from µ(G), each of degree > 80 in Gaux.
[Figure: Seed Identification. Node overlap: 25%, edge overlap: 50%]
[Figure: Node Re-Identification, Effect of Noise. Node overlap: 25%, number of seeds: 50]
Experiment Results • 30.8% of the mappings were re-identified correctly • 57% were not identified • 12.1% were identified incorrectly. Of these incorrect mappings: • 41% were mapped to nodes at distance 1 from the true mapping • 55% were mapped to nodes with the same geographic location • 27% were completely erroneous
Conclusion • In reality, anonymized graphs are released with some attributes => de-anonymization is even easier! • Social networks grow • Overlap between social networks grows • Auxiliary info is much richer • Anonymity is not sufficient for privacy when dealing with social networks!
Questions and Discussion Thank You!
Bibliography • [1] Arvind Narayanan and Vitaly Shmatikov. De-anonymizing Social Networks. • [2] Lars Backstrom, Cynthia Dwork and Jon M. Kleinberg. Wherefore art thou r3579x?: Anonymized social networks, hidden patterns, and structural steganography.