De-anonymizing Social Networks. Arvind Narayanan, Vitaly Shmatikov. Presented by Bill Hanczaryk.
The Goal • Demonstrate the feasibility of large-scale, passive de-anonymization of real-world social networks. • Introduce social networks • Review the current state of data sharing in social networks • Formally define PRIVACY and ANONYMITY • Develop a re-identification algorithm for anonymized social networks • Conduct a concrete experiment showing how the de-anonymization algorithm works.
Social Networks • 47% of online adults use social networking sites • 73% of teens and young adults are members of at least one social network • Facebook: • 400+ million active users • More than 1.5 million local businesses have active Pages on Facebook • The average user spends more than 55 minutes per day on Facebook
Social Networks • Twitter: • 24+ million unique visitors per month • 11 percent (33.88 million) of US online adults use Twitter • Approximately 50 million tweets are sent per day, about 600 per second • Unique visitors per month, as of February 2010: • Facebook: 133,623,529 • MySpace: 50,615,444 • Twitter: 23,573,178 • LinkedIn: 15,475,890
Social Networks • There is a huge disparity between what members of a social network think is private and what is actually private. • The business model of social networks is founded upon the sharing of member information. • To alleviate some of the privacy concerns, information is anonymized when given to a third party. • Anonymized = names, demographics, and other identifiable attributes are suppressed (but not removed)
Social Networks • Anonymity != Privacy • Combining anonymized social-network data with some auxiliary information can result in a breach of privacy and the de-anonymization of a network. • This is a problem, and the practices of sharing social-network data need to be seriously re-evaluated.
How is Data Shared? • To whom: • Academic and government data mining • Public, but must be requested and documented • Advertising • Third-party applications
How is Data Shared? • Other scenarios: • Aggregation • Combining data from multiple social networks • P2P networks (not anonymized at the network level) • Think photographs and facial recognition software
Previous Work • Privacy Properties • Social network = nodes, edges (relationships between nodes), and information associated with each node and each edge • Information about nodes should satisfy some level of privacy • Most social networks make relationships between nodes public by default (few users change this setting)
Previous Work Edges of social networks may also have attributes with sensitive information:
De-anonymizing Attacks • What has been done before: • Active Attacks • Attempting to create new edges to targeted users • From a faux node called a “sybil” • Creates a pattern of links among new accounts with the goal of making it stand out in an anonymized graph
De-anonymizing Attacks • Problems with active attacks: • Not feasible to stage on a large scale • Restricted to online social networks (OSNs) • Creating, verifying, and maintaining thousands of fake nodes is either too expensive or impossible. • The attacker has no control over the edges coming into the sybil nodes. • Many OSN sites require a mutual link before any information is shared. • An active attack can still be useful in identifying a starting point for a passive attack (later).
Defenses to protect Privacy • Previous works: • Encrypting profiles when they are sent to the server and client-side decrypting when a profile is viewed. • Problems • Constant reverse-engineering • Severely cripples functionality as almost any action would require a server to manipulate encrypted data
Defenses to protect Privacy • Previous works: • Anonymity • Randomized tokens represent users rather than actual identifiers such as names or profiles. • Problems • Anonymity is mistakenly thought of as privacy • Anonymous graphs can be coupled with auxiliary information to re-identify users
Defenses to protect Privacy • Previous works: • K-anonymity • There must exist automorphisms of the graph that map each of k nodes to one another • Problems • Technique only works when # of relationships is low • Imposes an arbitrary restriction on the amount of auxiliary information an adversary has access to.
Auxiliary Information • Auxiliary information is global in nature • Many social networking sites overlap one another • Facebook, Myspace, Twitter, etc. (correlate) • Can be used for large-scale re-identification • Feedback based attack • Re-identification of some nodes provides the attacker with even more auxiliary information
Model – Social Network • A social network S consists of • A directed graph G = (V,E) • A set of attributes X for each node in V and a set of attributes Y for each edge in E • Attributes for nodes: (e.g. name, telephone #) • Attributes for edges: (e.g. type of relationship)
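The model on this slide can be written down as a small data structure. This is only a sketch: the class name `SocialNetwork`, its field names, and the sample people are made up for illustration and do not come from the talk.

```python
from dataclasses import dataclass, field


@dataclass
class SocialNetwork:
    """S = (G, X, Y): a directed graph plus node and edge attributes."""
    nodes: set = field(default_factory=set)          # V
    edges: set = field(default_factory=set)          # E, as (source, target) pairs
    node_attrs: dict = field(default_factory=dict)   # X: node -> {attribute: value}
    edge_attrs: dict = field(default_factory=dict)   # Y: (source, target) -> {attribute: value}

    def add_node(self, v, **attrs):
        self.nodes.add(v)
        self.node_attrs.setdefault(v, {}).update(attrs)

    def add_edge(self, u, v, **attrs):
        # Directed: adding (u, v) does not add (v, u).
        self.add_node(u)
        self.add_node(v)
        self.edges.add((u, v))
        self.edge_attrs.setdefault((u, v), {}).update(attrs)

    def out_neighbors(self, v):
        return {t for (s, t) in self.edges if s == v}


# Hypothetical sample data.
s = SocialNetwork()
s.add_node("alice", name="Alice", phone="555-0100")
s.add_node("bob", name="Bob")
s.add_edge("alice", "bob", relationship="colleague")

print(s.out_neighbors("alice"))  # {'bob'}
print(s.out_neighbors("bob"))    # set()
```

Keeping X and Y separate from the edge set mirrors how a release can strip or limit attributes while leaving the graph itself intact.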
Model - Data Release • As we said before, data is most commonly released to: • Advertisers • Generally given access to the entire graph G in an anonymized form and a limited # of attributes • Application Developers • Generally given access to only a sub-graph based on user opt-in and most or all attributes within • Researchers • May receive the entire graph or sub-graph and a limited set of non-identifying attributes
Data Sanitization • Data sanitization is changing the graph structure in some way to make re-identification attacks harder. • Most approaches rely on simple removal of identifiers • Others inject random noise into the graph • As we said with k-anonymization, trying to make different nodes look the same is not realistic. • One suggested technique is link prediction
Model - Threat • Owners of social networks release anonymized network graphs to commercial partners and researchers alike. • THE BIG QUESTION: Can sensitive information about specific individuals be extracted from these graphs?
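The two simplest sanitization styles named above, identifier removal and random noise, might look roughly like this. It is a minimal sketch under assumptions: the function `sanitize`, its `noise_fraction` parameter, and the sample edge list are hypothetical, and real releases may perturb the graph quite differently.

```python
import random


def sanitize(edges, noise_fraction=0.0, seed=None):
    """Return (sanitized_edges, token_of) for a list of directed (u, v) edges.

    Identifiers are replaced by random integer tokens. Optionally a fraction
    of the edges is removed and the same number of random edges is added
    (an assumed, very simple noise model).
    """
    rng = random.Random(seed)
    nodes = sorted({n for edge in edges for n in edge})
    tokens = list(range(len(nodes)))
    rng.shuffle(tokens)
    token_of = dict(zip(nodes, tokens))  # kept secret by the data owner

    sanitized = {(token_of[u], token_of[v]) for u, v in edges}

    n_noise = int(len(sanitized) * noise_fraction)
    for edge in rng.sample(sorted(sanitized), n_noise):
        sanitized.discard(edge)
    if len(tokens) >= 2:
        for _ in range(n_noise):
            u, v = rng.sample(tokens, 2)  # two distinct tokens, no self-loops
            sanitized.add((u, v))
    return sanitized, token_of


# Hypothetical sample data.
edges = [("alice", "bob"), ("bob", "carol"), ("carol", "alice"), ("alice", "carol")]
sanitized, token_of = sanitize(edges, seed=1)
noisy, _ = sanitize(edges, noise_fraction=0.5, seed=1)
```

With `noise_fraction=0.0` the names are gone but every degree and every edge pattern is unchanged, which is exactly the structure a topology-based attack relies on.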
Model - Attacker • 4 main categories of attackers 1. Global surveillance • Adversary has access to a large auxiliary network with the objective of collecting as much detailed information about as many individuals as possible 2. Abusive marketing • An unethical adversary that would de-anonymize an anonymized graph to target aggressive advertising to specific individuals
Model - Attacker 3. Phishing and spamming • Adversary would de-anonymize a user and craft a highly individualized, believable message phishing for even more sensitive information 4. Targeted de-anonymization • This adversary would target a specific user (includes stalkers, investigators, colleagues, employers, neighbors)
Model - Attacker • Assume an attacker has access to an anonymized, sanitized target network S_san and also to a different network S_aux whose members partially overlap with those of S_san. • This is a very real and plausible assumption: Facebook -> Myspace or Twitter -> Flickr • Even with an extensive auxiliary network S_aux, de-anonymizing the target network S_san is extremely difficult.
Aggregate Auxiliary Information • The auxiliary information may include relationships between entities: • S_aux as a graph G_aux = (V_aux, E_aux) • And a set of probability distributions Aux_X and Aux_Y, one for each attribute of every node in V_aux and each attribute of every edge in E_aux (the distributions represent the adversary’s knowledge of the corresponding attribute value)
Individual Auxiliary Information • Assume also that the attacker possesses thorough information about a very small number of nodes on the target network S_san. • The attacker should be able to identify whether those members are also members of his auxiliary network S_aux. • Question at hand: can this information be used in any way to learn sensitive information about other members of S_san?
PRIVACY • How do we determine what a privacy breach is? • The definition of what constitutes a privacy breach varies not only from network to network but from individual to individual. • For the sake of the research, an operational approach is taken, focusing on node re-identification (determining the identity of a node).
Anonymity • We said earlier anonymity != privacy BUT… • Anonymity is still necessary to achieve any sufficient level of privacy • De-anonymizing a network almost guarantees that privacy has been breached • Ground truth – mapping between the nodes of V_aux and the nodes of V_san that represent the same entity • Node re-identification – recognizing correctly that a given node in the anonymized network belongs to the same entity as a node in the attacker’s auxiliary network.
Re-Identification Algorithm • Definition 1: A node re-identification algorithm takes as input S_san and S_aux and produces a probabilistic mapping μ̃: V_san × (V_aux ∪ {⊥}) → [0, 1], where μ̃(v_san, v_aux) is the probability that v_aux maps to v_san.
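One way to picture Definition 1 is as a nested table of probabilities. This is a sketch under assumptions: the node names and numbers are invented, `None` stands in for ⊥ (read here as "no counterpart in the auxiliary network"), and requiring each row to sum to 1 is an added assumption rather than part of the definition as stated on the slide. The table is indexed in the order of the domain, anonymized node first.

```python
NO_MATCH = None  # stands in for ⊥

# mu[v_san][v_aux] = probability that anonymized node v_san is auxiliary node v_aux
# (hypothetical values)
mu = {
    "n1": {"alice": 0.7, "bob": 0.2, NO_MATCH: 0.1},
    "n2": {"alice": 0.1, "bob": 0.3, NO_MATCH: 0.6},
}


def is_valid(mu, tol=1e-9):
    """Check that every value lies in [0, 1] and every row sums to 1 (assumed)."""
    return all(
        all(0.0 <= p <= 1.0 for p in row.values())
        and abs(sum(row.values()) - 1.0) <= tol
        for row in mu.values()
    )


def most_likely(mu):
    """Collapse the probabilistic mapping to each node's most probable match."""
    return {v_san: max(row, key=row.get) for v_san, row in mu.items()}


print(is_valid(mu))     # True
print(most_likely(mu))  # {'n1': 'alice', 'n2': None}
```

In this toy table "n2" is most likely absent from the auxiliary network, which is why the codomain includes ⊥ at all.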
Mapping Adversary • Definition 2: A mapping adversary corresponding to a probabilistic mapping μ̃ outputs a probability distribution calculated as follows:
Measuring Success of an attack • Success cannot be measured simply by the number of nodes de-anonymized. • Many nodes on social networks are inactive or are linked to very few other nodes. • Instead, a weight is assigned to each affected node in proportion to its importance to the network.
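The difference between counting nodes and weighting them can be seen in a few lines. This is a sketch: the slide does not say how importance is measured, so node degree is used here purely as an assumed stand-in, and all names and numbers are hypothetical.

```python
def weighted_success(mapping, ground_truth, weight):
    """Fraction of total node weight that was re-identified correctly.

    mapping:      anonymized node -> claimed auxiliary node
    ground_truth: anonymized node -> true auxiliary node
    weight:       anonymized node -> importance (degree here, an assumption)
    """
    total = sum(weight[v] for v in ground_truth)
    if total == 0:
        return 0.0
    correct = sum(
        weight[v] for v, truth in ground_truth.items() if mapping.get(v) == truth
    )
    return correct / total


# Hypothetical data: two hubs and two nearly isolated nodes.
ground_truth = {"n1": "alice", "n2": "bob", "n3": "carol", "n4": "dave"}
mapping = {"n1": "alice", "n2": "bob", "n3": "dave"}  # n3 is wrong, n4 is unmapped
degree = {"n1": 50, "n2": 30, "n3": 1, "n4": 1}

unweighted = sum(mapping.get(v) == t for v, t in ground_truth.items()) / len(ground_truth)
weighted = weighted_success(mapping, ground_truth, degree)

print(unweighted)  # 0.5
print(weighted)    # 80/82, roughly 0.976
```

Half of the nodes were re-identified, yet they carry almost all of the weight, so the plain count understates how much of the network was exposed.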
De-anonymization • Two Stages • Seed Identification • The attacker identifies a small group of “seed” nodes which are present in both the anonymous target graph and the attacker’s auxiliary graph, and maps them to each other • Propagation • A self-reinforcing process in which the seed mapping is extended to new nodes using only the topology of the network, and the new mapping is fed back to the algorithm. • The result is a large mapping between subgraphs of the auxiliary and target networks which re-identifies (de-anonymizes) the mapped nodes.
Seed Identification • Inputs: • Target graph • k seed nodes in the auxiliary graph • k degree values • (k choose 2) common-neighbor counts, one per pair of seeds • Error parameter ε • Algorithm • Searches the target graph for a unique k-clique with matching node degrees and common-neighbor counts • If found • Maps the nodes in the clique to the corresponding auxiliary graph nodes • Else • Failure is reported • Brute force with exponential run-time • Turns out not to be much of a problem
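The search described on this slide can be sketched as a brute-force loop over ordered k-tuples. Several things here are assumptions made for illustration: the graph is treated as undirected, ε is read as a relative tolerance, and the function names, graph, and seed names are hypothetical.

```python
from itertools import combinations, permutations


def close(actual, expected, eps):
    """True if `actual` is within a relative tolerance eps of `expected` (assumed reading of ε)."""
    return abs(actual - expected) <= eps * expected


def identify_seeds(target, seeds, degrees, common, eps=0.0):
    """Brute-force search for the unique k-clique in `target` matching the seeds.

    target:  dict node -> set of neighbors (undirected, an assumption)
    seeds:   list of k auxiliary node names
    degrees: list of k expected degrees, aligned with `seeds`
    common:  dict (i, j) with i < j -> expected common-neighbor count
    Returns {target node: auxiliary seed}, or None if there is no unique match.
    """
    k = len(seeds)
    pairs = list(combinations(range(k), 2))
    found = []
    for cand in permutations(target, k):
        if not all(close(len(target[c]), d, eps) for c, d in zip(cand, degrees)):
            continue
        # Every pair must be adjacent (a clique) with a matching common-neighbor count.
        if all(
            cand[j] in target[cand[i]]
            and close(len(target[cand[i]] & target[cand[j]]), common[(i, j)], eps)
            for i, j in pairs
        ):
            found.append(cand)
    if len(found) != 1:
        return None  # no match, or an ambiguous one: report failure
    return dict(zip(found[0], seeds))


# Hypothetical anonymized target graph.
target = {
    1: {2, 3, 4, 5},
    2: {1, 3, 4},
    3: {1, 2},
    4: {1, 2, 6},
    5: {1},
    6: {4},
}
seeds = ["alice", "bob", "carol"]
common = {(0, 1): 2, (0, 2): 1, (1, 2): 1}

result = identify_seeds(target, seeds, [4, 3, 2], common)
print(result)  # {1: 'alice', 2: 'bob', 3: 'carol'}
```

The loop visits every ordered k-tuple of target nodes, which is where the exponential cost in k comes from; the degree check discards most candidates before the pairwise tests run.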
Propagation • Inputs • Two graphs, G1 = (V1,E1) and G2 = (V2,E2), and a partial seed mapping between the two • Algorithm • Finds new mappings using the topological structure of the network and the feedback from previously constructed mappings • Each iteration starts with the accumulated list of mapped pairs between V1 and V2 • Pick an arbitrary unmapped node u in V1 and compare it to each unmapped node v in V2 • If enough of u's already-mapped neighbors are mapped to neighbors of v (within the error margin) • A mapping between u and v is added to the list • The process repeats as more nodes get mapped
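The propagation loop can be sketched as follows. This is a simplification under stated assumptions, not the authors' exact procedure: graphs are treated as undirected, and the slide's "error margin" is replaced by a simple rule that accepts a candidate only if its score reaches `min_score` and strictly beats the runner-up. The function name, parameter, and graphs are hypothetical.

```python
def propagate(g1, g2, seed_map, min_score=2):
    """Extend a seed mapping from g1 nodes to g2 nodes using topology only.

    g1, g2:   dict node -> set of neighbors (undirected, an assumption)
    seed_map: dict g1 node -> g2 node
    """
    mapping = dict(seed_map)
    changed = True
    while changed:  # keep iterating while new pairs are being added
        changed = False
        used = set(mapping.values())
        for u in g1:
            if u in mapping:
                continue
            # Images in g2 of u's already-mapped neighbors.
            images = {mapping[n] for n in g1[u] if n in mapping}
            # Score each unmapped candidate v by how many of those images it neighbors.
            scores = {v: len(images & g2[v]) for v in g2 if v not in used}
            if not scores:
                continue
            ranked = sorted(scores.values(), reverse=True)
            best = ranked[0]
            runner_up = ranked[1] if len(ranked) > 1 else 0
            if best >= min_score and best > runner_up:
                v = max(scores, key=scores.get)
                mapping[u] = v
                used.add(v)
                changed = True
    return mapping


# Hypothetical graphs: g2 is g1 with the names replaced by numbers.
g1 = {
    "a": {"b", "c", "d"},
    "b": {"a", "c", "d"},
    "c": {"a", "b", "e"},
    "d": {"a", "b", "e"},
    "e": {"c", "d"},
}
g2 = {
    1: {2, 3, 4},
    2: {1, 3, 4},
    3: {1, 2, 5},
    4: {1, 2, 5},
    5: {3, 4},
}
seed_map = {"a": 1, "b": 2, "c": 3}

full = propagate(g1, g2, seed_map)
print(full)  # {'a': 1, 'b': 2, 'c': 3, 'd': 4, 'e': 5}
```

Node "e" has only one mapped neighbor at the start, so it becomes identifiable only after "d" is mapped. That is the feedback effect: each new pair supplies evidence for the next.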
Actual Experiment Data from 3 social networking sites
Actual Experiment • Seed Generation • Live Generation as the Target Network • Seed generation is shown to be feasible in practice using only node degrees and common-neighbor counts
Actual Experiment • Propagation
Results • De-anonymization success is defined as an exact match of either the username or the name field. • 30.8% mapped correctly • Only 12.1% mapped incorrectly • 41% of those mapped incorrectly were within distance 1 of the true mapping • 55% of those mapped incorrectly were still mapped to a node in the same geographic location
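A little arithmetic puts these percentages in context. The sketch assumes that both figures share the same denominator and that every remaining node was simply left unmapped; the slide does not state this explicitly.

```python
correct = 30.8     # % mapped correctly (from the slide)
incorrect = 12.1   # % mapped incorrectly (from the slide)

mapped = correct + incorrect
unmapped = 100.0 - mapped               # assumed: the remainder was left unmapped
precision = correct / mapped            # share of produced mappings that are right
near_miss = 0.41 * incorrect            # wrong, but within distance 1 of the truth

print(f"mapped:    {mapped:.1f}%")      # mapped:    42.9%
print(f"unmapped:  {unmapped:.1f}%")    # unmapped:  57.1%
print(f"precision: {precision:.1%}")    # precision: 71.8%
print(f"near miss: {near_miss:.1f}%")   # near miss: 5.0%
```

Under those assumptions, roughly seven out of ten mappings the algorithm commits to are correct.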
Summary and Conclusion • Anonymity != Privacy in social networks • Anonymized graphs sold to commercial partners can be de-anonymized using auxiliary information to reveal sensitive information • Clearly, the data-sharing business practices of social networks, the security of the information, or public awareness needs a major overhaul