De-anonymizing Social Networks

Presenter: Lijie Zhang Advisor: Weining Zhang

  1. De-anonymizing Social Networks Presenter: Lijie Zhang Advisor: Weining Zhang

  2. Outline • Motivation • Attack Model • De-anonymization Algorithm • Experiments • Conclusions

  3. Motivation • Social network (SN) owner publishes graph data for sharing • Academic and government data-mining: phone call networks • Advertising: • Third-party applications: 550,000 Facebook applications • Private information on SNs: • Node attributes: node degree in a sexual network • Edge presence: a single call, romantic relationship

  4. Motivation • The SN owner publishes an anonymized graph: • Nodes have no identifying attributes • Propose a model to identify nodes from the anonymized graph: • Re-identification: learn the entity to which a node belongs • Entity: an account, a real person, a group, or an organization

  6. Model – Social Network • Social network S: • A directed graph G = (V, E) • A set of node attributes X, e.g. name, telephone number • A set of edge attributes Y, e.g. type of relationship • Attribute values are treated as drawn from a discrete domain
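The model above can be sketched as a small data structure. This is a minimal illustration only; the class name `SocialNetwork` and its field and method names are assumptions made for the example and do not come from the slides.

```python
from dataclasses import dataclass, field


@dataclass
class SocialNetwork:
    """Directed graph G = (V, E) with discrete node attributes X and edge attributes Y."""
    nodes: set = field(default_factory=set)          # V
    edges: set = field(default_factory=set)          # E, as (source, target) pairs
    node_attrs: dict = field(default_factory=dict)   # X: node -> {attribute: value}
    edge_attrs: dict = field(default_factory=dict)   # Y: edge -> {attribute: value}

    def add_node(self, v, **attrs):
        self.nodes.add(v)
        self.node_attrs.setdefault(v, {}).update(attrs)

    def add_edge(self, u, v, **attrs):
        # Endpoints are added implicitly so E never refers to unknown nodes.
        self.add_node(u)
        self.add_node(v)
        self.edges.add((u, v))
        self.edge_attrs.setdefault((u, v), {}).update(attrs)

    def out_degree(self, v):
        return sum(1 for (src, _) in self.edges if src == v)

    def in_degree(self, v):
        return sum(1 for (_, dst) in self.edges if dst == v)


s = SocialNetwork()
s.add_node("alice", name="Alice", phone="555-0100")
s.add_edge("alice", "bob", relationship="friend")
s.add_edge("carol", "bob", relationship="colleague")
```

Keeping attributes in plain dictionaries mirrors the slide's assumption that attribute values come from a discrete domain.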

  7. Model – Data Release • Release a sanitized subset of the nodes and edges in S • Computation: • Vsan: subset of V • Xsan: subset of X including sensitive attributes • Ysan: subset of Y including sensitive attributes • Published attributes by themselves are insufficient for re-identification • Compute the induced subgraph on Vsan • Remove some edges and add fake edges
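The release steps above (induced subgraph, edge removal, fake edges) can be sketched as follows. This is an illustrative sketch under stated assumptions, not the procedure of any particular network operator: the function name `sanitize`, the parameters `drop_frac` and `n_fake`, the uniform random choice of edges, and the final relabeling to integer identifiers are all assumptions made for the example.

```python
import random


def sanitize(nodes, edges, keep_nodes, drop_frac, n_fake, seed=0):
    """Return (anonymized edge set, label map) for a sanitized release."""
    rng = random.Random(seed)
    v_san = sorted(set(keep_nodes) & set(nodes))       # Vsan, a subset of V
    kept = set(v_san)

    # Induced subgraph: keep only edges with both endpoints in Vsan.
    induced = sorted((u, v) for (u, v) in set(edges) if u in kept and v in kept)
    induced_set = set(induced)

    # Remove a fraction of the real edges (assumed: chosen uniformly at random).
    dropped = set(rng.sample(induced, int(drop_frac * len(induced))))

    # Add fake edges between pairs that were not connected originally.
    candidates = [(u, v) for u in v_san for v in v_san
                  if u != v and (u, v) not in induced_set]
    fake = rng.sample(candidates, min(n_fake, len(candidates)))

    e_san = (induced_set - dropped) | set(fake)

    # Strip identifiers: replace each node with a random integer label.
    ids = list(range(len(v_san)))
    rng.shuffle(ids)
    label = dict(zip(v_san, ids))
    return {(label[u], label[v]) for (u, v) in e_san}, label


nodes = ["a", "b", "c", "d", "e"]
edges = [("a", "b"), ("b", "c"), ("c", "d"), ("d", "a"), ("a", "e")]
anon_edges, label = sanitize(nodes, edges, ["a", "b", "c", "d"],
                             drop_frac=0.5, n_fake=1, seed=1)
```

The returned `label` map plays the role of the ground truth that the attacker later tries to recover; a real release would of course not publish it.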

  8. Model – Attacker • Purpose: extract sensitive information about specific individuals from anonymized SN graphs • Attacker’s knowledge • Aggregate auxiliary information • Individual auxiliary information

  9. Aggregate auxiliary information • Large-scale information from other data sources and social networks whose membership overlaps with the target network Ssan • Gaux = (Vaux, Eaux) • AuxX and AuxY: probability distributions of each node attribute in Vaux and each edge attribute in Eaux, respectively (prior knowledge).

  10. Individual auxiliary information • Identifiable details about a small number of individuals from the target network Ssan and possibly relationships between them

  11. Model – Breaching Privacy • Extract sensitive information about specific individuals from Ssan • Re-identify nodes from the target SN Ssan • Re-identification: find a mapping μ between a node in Vaux and a node in Vsan • μG: ground truth mapping • Succeeds on a node vaux if μ(vaux) = μG(vaux)

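  12. Model – Breaching Privacy • Re-identification algorithm: • Input: Ssan and Saux • Output: a probabilistic mapping μ̃, where μ̃(vaux, vsan) is the probability that vaux maps to vsan • Mapping adversary: outputs a probability distribution over attribute values computed from μ̃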

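  13. Model – Breaching Privacy • Privacy breach: the privacy of vsan is breached w.r.t. adversary Adv and privacy parameter δ if vaux is the true counterpart of vsan under the ground-truth mapping μG and, for some attribute X and value x, the adversary's belief Adv[X, vaux, x] exceeds the prior Aux[X, vaux, x] by more than δ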

  14. Model – Measuring Success of an Attack • Let Vmapped = {v ∈ Vaux : μG(v) ≠ ⊥}, where μG is the ground-truth mapping. The success rate of a de-anonymization algorithm outputting a probabilistic mapping μ̃, w.r.t. a centrality measure ν, is the probability that μ sampled from μ̃ maps a node v to μG(v) if v is selected from Vmapped according to ν
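As a rough illustration of this measure, the sketch below computes a centrality-weighted success rate for the simpler case of a single deterministic mapping, where each node is either mapped correctly or not. The function name `success_rate`, the dictionary representation, and the use of `None` for "no counterpart" are assumptions made for the example.

```python
def success_rate(mapping, ground_truth, centrality):
    """Centrality-weighted fraction of nodes mapped to their true counterpart.

    mapping:      dict, auxiliary node -> target node proposed by the attack
    ground_truth: dict, auxiliary node -> true target node (None if it has none)
    centrality:   dict, auxiliary node -> non-negative weight
    """
    # Only nodes that actually have a counterpart can be mapped correctly.
    mapped = [v for v, target in ground_truth.items() if target is not None]
    total = sum(centrality[v] for v in mapped)
    if total == 0:
        return 0.0
    correct = sum(centrality[v] for v in mapped
                  if mapping.get(v) == ground_truth[v])
    return correct / total


truth = {"a": 1, "b": 2, "c": None}
attack = {"a": 1, "b": 3}
weights = {"a": 3.0, "b": 1.0, "c": 10.0}
rate = success_rate(attack, truth, weights)   # node "c" is ignored: no counterpart
```

Weighting by centrality means that correctly re-identifying a few highly connected nodes counts for more than re-identifying many peripheral ones.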

  16. De-anonymization Algorithm • Seed identification: apply individual auxiliary information • Propagation: apply aggregate auxiliary information

  17. Algorithm - Seed Identification • Input: • The target graph • A clique of k nodes which are present in both the auxiliary and the target graphs • The degree values of the k nodes • (k choose 2) pairs of common-neighbor counts • Error parameter ε • Output: a unique k-clique in the target graph with matching (within a factor of 1 ± ε) node degrees and common-neighbor counts.
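The search described above can be sketched as a brute-force backtracking procedure. This is a simplified illustration, not the presenters' implementation: it assumes an undirected graph stored as an adjacency dictionary, and the names `find_seed_clique` and `within` are hypothetical.

```python
def within(actual, expected, eps):
    """True if `actual` lies within a factor of 1 +/- eps of `expected`."""
    return (1 - eps) * expected <= actual <= (1 + eps) * expected


def find_seed_clique(adj, degrees, common, eps):
    """Search the target graph for a unique k-clique matching the auxiliary data.

    adj:     dict, node -> set of neighbours (undirected graph, an assumption)
    degrees: list of k expected degrees, one per seed position
    common:  dict, (i, j) with i < j -> expected common-neighbour count
    Returns the matching tuple of target nodes, or None if there are 0 or >1 matches.
    """
    k = len(degrees)
    matches = []

    def extend(partial):
        i = len(partial)
        if i == k:
            matches.append(tuple(partial))
            return
        for v in adj:
            if v in partial or not within(len(adj[v]), degrees[i], eps):
                continue
            # v must be adjacent to every chosen node (clique) and share
            # the expected number of neighbours with each of them.
            if all(v in adj[u] and within(len(adj[u] & adj[v]), common[(j, i)], eps)
                   for j, u in enumerate(partial)):
                extend(partial + [v])

    extend([])
    return matches[0] if len(matches) == 1 else None


adj = {
    "a": {"b", "c", "d", "e"},
    "b": {"a", "c", "f"},
    "c": {"a", "b"},
    "d": {"a"},
    "e": {"a"},
    "f": {"b"},
}
common = {(0, 1): 1, (0, 2): 1, (1, 2): 1}
exact = find_seed_clique(adj, [4, 3, 2], common, eps=0.0)
loose = find_seed_clique(adj, [4, 3, 2], common, eps=0.5)
```

With `eps=0.0` the degree sequence pins down a single triangle; with `eps=0.5` several orderings of the same triangle satisfy the constraints, so the sketch reports no unique match. The search is exponential in k, which is tolerable only because k is small.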

  18. Algorithm - Propagation • Inputs: G1, G2, and a partial seed mapping μS between them • Output: a mapping μ • Iteratively find new mappings using the topological structure of the network and the feedback from previously constructed mappings.

  19. Algorithm - Propagation

      function propagationStep(lgraph, rgraph, mapping)
        for lnode in lgraph.nodes:
          scores[lnode] = matchScores(lgraph, rgraph, mapping, lnode)
          if eccentricity(scores[lnode]) < theta: continue
          rnode = (pick node from rgraph.nodes where
                   scores[lnode][node] = max(scores[lnode]))

          scores[rnode] = matchScores(rgraph, lgraph, invert(mapping), rnode)
          if eccentricity(scores[rnode]) < theta: continue
          reverse_match = (pick node from lgraph.nodes where
                           scores[rnode][node] = max(scores[rnode]))
          if reverse_match != lnode: continue

          mapping[lnode] = rnode
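A runnable sketch of the propagation step follows. Several parts are assumptions made for illustration, since the slide does not define them: graphs are undirected adjacency dictionaries, `match_scores` simply counts already-mapped neighbours that a candidate shares with the node being matched, `eccentricity` uses the gap between the two best scores divided by the population standard deviation, nodes that are already mapped are skipped, and `propagate` repeats the step until nothing changes.

```python
from statistics import pstdev


def eccentricity(values):
    """Gap between the two best scores, in units of standard deviation (assumed form)."""
    ordered = sorted(values, reverse=True)
    if len(ordered) < 2:
        return 0.0
    sigma = pstdev(ordered)
    return 0.0 if sigma == 0 else (ordered[0] - ordered[1]) / sigma


def match_scores(lgraph, rgraph, mapping, lnode):
    """Score each unmapped rgraph node by shared, already-mapped neighbours."""
    scores = {r: 0 for r in rgraph}
    taken = set(mapping.values())
    for lnbr in lgraph[lnode]:
        if lnbr not in mapping:
            continue
        for rnode in rgraph[mapping[lnbr]]:
            if rnode not in taken:
                scores[rnode] += 1
    return scores


def propagation_step(lgraph, rgraph, mapping, theta):
    for lnode in lgraph:
        if lnode in mapping:          # simplification: never remap a node
            continue
        scores = match_scores(lgraph, rgraph, mapping, lnode)
        if eccentricity(scores.values()) < theta:
            continue
        rnode = max(scores, key=scores.get)

        # Reverse check: rnode must also pick lnode as its best match.
        inverse = {r: l for l, r in mapping.items()}
        rscores = match_scores(rgraph, lgraph, inverse, rnode)
        if eccentricity(rscores.values()) < theta:
            continue
        if max(rscores, key=rscores.get) != lnode:
            continue
        mapping[lnode] = rnode
    return mapping


def propagate(lgraph, rgraph, seeds, theta=0.5):
    """Repeat propagation steps until no new mapping is added."""
    mapping = dict(seeds)
    while True:
        before = len(mapping)
        propagation_step(lgraph, rgraph, mapping, theta)
        if len(mapping) == before:
            return mapping


left = {1: {2, 3}, 2: {1, 3, 5}, 3: {1, 2, 4}, 4: {3}, 5: {2}}
right = {"a": {"b", "c"}, "b": {"a", "c", "e"}, "c": {"a", "b", "d"},
         "d": {"c"}, "e": {"b"}}
result = propagate(left, right, {1: "a", 2: "b"}, theta=0.5)
```

In this toy example the two graphs are isomorphic and two seeds are enough to recover the remaining three nodes; real graphs are noisy, which is why the eccentricity threshold and the reverse check are there to reject ambiguous matches.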

  20. Algorithm - Propagation • Eccentricity: measures how much the best-scoring candidate node “stands out” from the remaining candidates. • A match is rejected if the eccentricity of the set of mapping scores is below a threshold (theta in the pseudocode).
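A small worked example may help make the threshold test concrete. The sketch assumes the definition used in [1], (max − second max) / σ over the set of scores; the use of the population standard deviation and the convention of returning 0 when all scores are equal are assumptions made here.

```python
from statistics import pstdev


def eccentricity(scores):
    """(highest - second highest) / standard deviation of the scores."""
    ordered = sorted(scores, reverse=True)
    if len(ordered) < 2:
        return 0.0
    sigma = pstdev(ordered)           # population standard deviation (assumption)
    if sigma == 0:
        return 0.0                    # all scores equal: nothing stands out
    return (ordered[0] - ordered[1]) / sigma


theta = 0.5                           # hypothetical threshold value
clear_winner = eccentricity([2, 1, 0, 0, 0])   # gap 1, sigma 0.8
tie_at_top = eccentricity([1, 1, 0, 0])        # gap 0
accept_first = clear_winner >= theta
accept_second = tie_at_top >= theta
```

The first score set has a clear best candidate and passes the threshold; the second has a tie for first place, so its eccentricity is zero and the match is rejected.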

  21. Algorithm - Propagation • Complexity: O((|E1|+|E2|)d1d2) • d1: a bound on the degree of the nodes in V1 • d2: a bound on the degree of the nodes in V2

  23. Experiments – Data Sets • Twitter, Flickr, LiveJournal:

  24. Experiments – Seed Identification • Evaluate the feasibility of seed identification by measuring how much auxiliary information is needed to identify a unique node in the target graph. • LiveJournal graph: auxiliary and target • Construct 4-cliques, and treat a 4-clique in the target graph as a match as long as each degree and common-neighbor count matches within a factor of 1 ± ε

  25. Experiments – Seed Identification

  26. Experiments – Propagation • Evaluate the robustness against perturbation and seed selection • Pairs of subgraphs (V1, V2), each with over 100,000 nodes, drawn from a real-world SN • One serves as the auxiliary SN, the other as the target SN • Perturbation strategy: the two subgraphs overlap in 25% of their nodes and 50% of their edges
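The overlap figures above can be checked for a given pair of subgraphs with a short sketch. It assumes that overlap is measured as a Jaccard coefficient and that edge overlap is computed only over edges between shared nodes; the slide does not state either choice, and the function names are hypothetical.

```python
def jaccard(a, b):
    """Size of the intersection divided by the size of the union."""
    a, b = set(a), set(b)
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def overlap(nodes1, edges1, nodes2, edges2):
    """Return (node overlap, edge overlap) for two subgraphs."""
    common = set(nodes1) & set(nodes2)
    # Restrict each edge set to edges whose endpoints both graphs contain.
    e1 = {(u, v) for (u, v) in edges1 if u in common and v in common}
    e2 = {(u, v) for (u, v) in edges2 if u in common and v in common}
    return jaccard(nodes1, nodes2), jaccard(e1, e2)


nodes1 = {1, 2, 3, 4, 5}
nodes2 = {4, 5, 6, 7, 8}
edges1 = {(4, 5), (5, 4), (1, 2)}
edges2 = {(4, 5), (6, 7)}
node_overlap, edge_overlap = overlap(nodes1, edges1, nodes2, edges2)
```

In this toy pair, 2 of 8 distinct nodes are shared (25%) and 1 of the 2 distinct edges between shared nodes appears in both graphs (50%).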

  27. Evaluate the robustness against perturbation and seed selection

  28. Experiments – Propagation • Mapping between two real-world social networks: Flickr and Twitter • Finding ground truth: • Exact matches in either the username or the name field • 27,000 mappings • Human inspection puts the ground-truth error rate under 5%.

  29. Mapping between two real-world social networks • Seeds: 150 pairs of nodes selected from • Results: • 30.8% of the mappings were re-identified correctly, 12.1% were identified incorrectly, and 57% were not identified. • 41% of the incorrectly identified mappings (5% overall) were mapped to nodes which are at a distance 1 from the true mapping. • 55% of the incorrectly identified mappings (6.7% overall) were mapped to nodes where the same geographic location was reported. • The above two categories overlap; of all the incorrect mappings, only 27% (or 3.3% overall) fall into neither category and are completely erroneous.
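The "overall" percentages on this slide follow from multiplying each share of the incorrect mappings by 12.1%. The sketch below reproduces that arithmetic and, by inclusion-exclusion, estimates how large the overlap between the two categories is. The overlap figure is derived here from the rounded numbers on the slide and is not reported in the source.

```python
incorrect = 12.1                       # % of all mappings identified incorrectly

near_true = 0.41 * incorrect           # mapped to a node at distance 1 from the true one
same_location = 0.55 * incorrect       # mapped to a node reporting the same location
neither = 0.27 * incorrect             # completely erroneous

# Inclusion-exclusion: |A or B| = |A| + |B| - |A and B|
either = incorrect - neither
both = near_true + same_location - either

summary = {
    "near_true": round(near_true, 1),
    "same_location": round(same_location, 1),
    "neither": round(neither, 1),
    "both": round(both, 1),
}
```

The first three values match the 5%, 6.7% and 3.3% quoted on the slide; the derived overlap comes to roughly 2.8% of all mappings, or about 23% of the incorrect ones.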

  30. Conclusions • Anonymity is not sufficient for privacy when dealing with social networks. • Successful re-identification is shown to be feasible based solely on the network topology, even assuming that the target graph is completely anonymized.

  31. Reference • [1] Arvind Narayanan and Vitaly Shmatikov, “De-anonymizing Social Networks”, IEEE Symposium on Security and Privacy, 2009.

