Connecting Users across Social Media Sites: A Behavioral-Modeling Approach. Reza Zafarani and Huan Liu, Data Mining and Machine Learning Laboratory (DMML), Arizona State University. KDD 2013, Chicago, Illinois.
How hard can it be to identify an individual across sites? Privacy experts claim advertisers know a lot about people. Can they stop showing you the same repetitive ads across sites?
Huan Liu: More information about individuals. Many social media sites; partial information; complementary information; better user profiles. Connectivity is not available. Consistency in information availability. Can we connect individuals across sites?
Can we verify that the information provided across sites belongs to the same individual?
Human behavior generates information redundancy. Information shared across sites provides a behavioral fingerprint. • Behavioral Modeling • Minimum Information. MOBIUS: MOdeling Behavior for Identifying Users across Sites
Identification Function. Minimum information available on ALL sites: usernames. Prior Usernames ({jsmith, john.s}), Candidate Username (john.smith)
Learning framework: Behavior 1 ... Behavior n each generate information redundancy, which is captured via Feature Set 1 ... Feature Set n. The feature sets and the data feed the learning framework, which produces the identification function.
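The flow from behaviors to feature sets can be sketched in code. This is only an illustration under stated assumptions: the two feature functions (`length_gap`, `char_overlap`) and the helper `feature_vector` are hypothetical names chosen here, not features confirmed by the slides. Each function looks at the prior usernames and the candidate username and returns one number; the resulting vector is what a supervised learner would be trained on.

```python
from typing import Callable, List, Sequence

# A feature maps (prior usernames, candidate username) to one number.
Feature = Callable[[Sequence[str], str], float]


def length_gap(priors: Sequence[str], candidate: str) -> float:
    """Hypothetical feature: gap between candidate length and mean prior length."""
    mean_len = sum(len(p) for p in priors) / len(priors)
    return abs(len(candidate) - mean_len)


def char_overlap(priors: Sequence[str], candidate: str) -> float:
    """Hypothetical feature: Jaccard overlap of the character sets."""
    prior_chars = {ch for p in priors for ch in p}
    cand_chars = set(candidate)
    union = prior_chars | cand_chars
    return len(prior_chars & cand_chars) / len(union) if union else 0.0


# One entry per modeled behavior; more features can be appended later.
FEATURES: List[Feature] = [length_gap, char_overlap]


def feature_vector(priors: Sequence[str], candidate: str) -> List[float]:
    """Assemble the vector a learning framework would consume."""
    return [feature(priors, candidate) for feature in FEATURES]
```

Calling `feature_vector(["jsmith", "john.s"], "john.smith")` with the slide's example usernames yields one row of training data; the label (same individual or not) would come from the dataset.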
Time and Memory Limitation: 59% of individuals use the same username
Knowledge Limitation: Identifying individuals by their vocabulary size. Alphabet size is correlated to language: शमंतकुमार -> Shamanth Kumar
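A minimal sketch of how alphabet-related signals might be computed from usernames. The function names `alphabet_size` and `scripts_used` are illustrative assumptions, and reading the script from the first word of the Unicode character name is a convenience heuristic, not the method described in the talk.

```python
import unicodedata
from typing import Iterable, Set


def alphabet_size(usernames: Iterable[str]) -> int:
    """Number of distinct characters used across a set of usernames."""
    return len({ch for name in usernames for ch in name})


def scripts_used(username: str) -> Set[str]:
    """Rough script detection for letters and combining marks.

    Heuristic (an assumption): the first word of the Unicode character
    name, e.g. 'LATIN' or 'DEVANAGARI', stands in for the script.
    """
    scripts = set()
    for ch in username:
        # Keep letters (L*) and marks (M*); skip digits and punctuation.
        if unicodedata.category(ch)[0] in "LM":
            name = unicodedata.name(ch, "")
            if name:
                scripts.add(name.split()[0])
    return scripts
```

For the slide's example, `scripts_used("शमंतकुमार")` reports a Devanagari username while `scripts_used("john.smith")` reports a Latin one, which is the kind of language hint the slide refers to.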
Typing Patterns: QWERTY keyboard (variants: AZERTY, QWERTZ), e.g. QWER1234; DVORAK keyboard, e.g. AOEUISNTH. Keyboard type impacts your usernames
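One way to turn the keyboard observation into a number is to measure how often consecutive characters sit on the same keyboard row. The sketch below is an assumption about how such a feature could look; `same_row_ratio` and the simplified row tables (letters and digits only) are hypothetical and not taken from the talk.

```python
from typing import Dict, List

# Simplified layouts: digits and letters only, punctuation keys omitted.
QWERTY_ROWS = ["1234567890", "qwertyuiop", "asdfghjkl", "zxcvbnm"]
DVORAK_ROWS = ["1234567890", "pyfgcrl", "aoeuidhtns", "qjkxbmwvz"]


def _row_index(rows: List[str]) -> Dict[str, int]:
    """Map each key to the index of the row it sits on."""
    return {ch: i for i, row in enumerate(rows) for ch in row}


def same_row_ratio(username: str, rows: List[str]) -> float:
    """Fraction of consecutive key pairs typed on the same row.

    Pairs containing a character missing from the layout are skipped.
    """
    index = _row_index(rows)
    keys = [index[ch] for ch in username.lower() if ch in index]
    if len(keys) < 2:
        return 0.0
    same = sum(1 for a, b in zip(keys, keys[1:]) if a == b)
    return same / (len(keys) - 1)
```

With these tables, the slide's example `AOEUISNTH` stays entirely on the DVORAK home row, so its ratio is higher under `DVORAK_ROWS` than under `QWERTY_ROWS`, hinting at the layout its owner types on.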
Habits - old habits die hard: adding prefixes/suffixes, abbreviating, swapping or adding/removing characters (e.g. Nametag and Gateman). Usernames come from a language model
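The modification habits listed above (prefixes and suffixes, swapped characters, added or removed characters) map naturally onto string-similarity checks. The following is a sketch under assumptions: the helper names are hypothetical, and plain Levenshtein distance is used as a stand-in for whatever distance measures the authors actually employed.

```python
from typing import Sequence, Tuple


def levenshtein(a: str, b: str) -> int:
    """Edit distance counting insertions, deletions and substitutions."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # delete from a
                            curr[j - 1] + 1,      # insert into a
                            prev[j - 1] + cost))  # substitute or match
        prev = curr
    return prev[-1]


def is_affix_variant(prior: str, candidate: str) -> bool:
    """True if one name is the other plus a prefix or a suffix."""
    short, long_ = sorted((prior, candidate), key=len)
    return short != long_ and (long_.startswith(short) or long_.endswith(short))


def is_rearrangement(prior: str, candidate: str) -> bool:
    """True if both names use exactly the same letters in a different order."""
    a, b = prior.lower(), candidate.lower()
    return a != b and sorted(a) == sorted(b)


def habit_features(priors: Sequence[str], candidate: str) -> Tuple[int, bool]:
    """Smallest edit distance to any prior, and whether any prior is an affix variant."""
    min_dist = min(levenshtein(p, candidate) for p in priors)
    has_affix = any(is_affix_variant(p, candidate) for p in priors)
    return min_dist, has_affix
```

For example, `is_rearrangement("Nametag", "Gateman")` holds because the two names are anagrams, and `habit_features(["jsmith", "john.s"], "john.smith")` picks up that `john.s` is a prefix of the candidate.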
Experiment Setup. Previous Methods: • Zafarani and Liu, 2009 • Perito et al., 2011. Baselines: • Exact Username Match • Substring Match • Patterns in Letters. Data: 200,000 instances (50% class balance), 414 features
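The first two baselines are simple enough to sketch directly. The slide does not say how they were implemented, so the details here are assumptions: matching is case-insensitive, and the substring baseline fires when either name contains the other. The function names are hypothetical.

```python
from typing import Sequence


def exact_match_baseline(priors: Sequence[str], candidate: str) -> bool:
    """Predict 'same individual' only if the candidate equals a prior username."""
    cand = candidate.lower()
    return any(p.lower() == cand for p in priors)


def substring_match_baseline(priors: Sequence[str], candidate: str) -> bool:
    """Predict 'same individual' if the candidate and any prior contain one another."""
    cand = candidate.lower()
    return any(p.lower() in cand or cand in p.lower() for p in priors)
```

On the earlier example, `exact_match_baseline(["jsmith", "john.s"], "john.smith")` rejects the candidate, while `substring_match_baseline` accepts it because `john.s` occurs inside `john.smith`.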
Diminishing Returns for Adding More Usernames
Conclusions + Future Work. Information shared across sites acts as a behavioral fingerprint. Discover applications of connecting users across sites. Human behavior results in information redundancy. A methodology for connecting individuals across sites: • A behavioral modeling approach • Uses minimum information across sites • Allows for integration of additional behaviors when required. Incorporating features indigenous to specific sites