Large-Scale Entity-Based Online Social Network Profile Linkage
Background & Motivation • Footprints in different social networks. • User identification in social analysis. • Privacy & security • Commercial & government applications
Outline • Problem definition • Related work • Approach • Experiment • Conclusion & future work
Problem Definition • Terminology • Identity: Person • Profile/User: Your footprint on social media • Profile Linkage: Link your footprints together • Input & Output • Input: profiles of one site as QUERY and profiles of the other site as TARGET. • Output: all pairs of profiles classified as matched.
Characteristics of profiles • Name (semi-structured vs. structured) • {“given name”: “haochen”, “family name”: “zhang”} • name: zhanghaochen • Semi-structured schema • Incompleteness & missing attributes • Privacy policy • Virtual identification • Free text description • Bio, About me, Tags • Multilingualism
Multilingualism • Top 5 languages in the Facebook dataset • English • Portuguese • Spanish • Chinese • French • Most frequent tokens in different languages • chris, john, michael • chen, wang, lee • carlos, garcia, daniel • sergey, olga, alexander • About 70% of users use English • 7.2% of users register under different locales • Transliteration • 昊辰 => Haochen
Related work • String similarity metrics • name matching (Jaro-Winkler, edit distance) • VSM (TF-IDF) • Pair-wise comparison • schema matching for different prototypes • unsupervised vs. supervised • Indexing technique • blocking • canopy
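As an illustration of the name-matching metrics listed on this slide, the following is a minimal sketch of Jaro-Winkler similarity in plain Python. It follows the commonly used definition (prefix bonus capped at 4 characters, scaling factor 0.1); the function names and the default `prefix_scale` are assumptions for illustration, not taken from the slides.

```python
def jaro(s1: str, s2: str) -> float:
    """Jaro similarity in [0, 1]; 1.0 means identical strings."""
    if s1 == s2:
        return 1.0
    len1, len2 = len(s1), len(s2)
    if len1 == 0 or len2 == 0:
        return 0.0
    # Characters count as matching only inside this sliding window.
    window = max(max(len1, len2) // 2 - 1, 0)
    matched1 = [False] * len1
    matched2 = [False] * len2
    matches = 0
    for i in range(len1):
        lo = max(0, i - window)
        hi = min(len2, i + window + 1)
        for j in range(lo, hi):
            if not matched2[j] and s1[i] == s2[j]:
                matched1[i] = matched2[j] = True
                matches += 1
                break
    if matches == 0:
        return 0.0
    # Count matched characters that appear in a different order.
    out_of_order = 0
    k = 0
    for i in range(len1):
        if matched1[i]:
            while not matched2[k]:
                k += 1
            if s1[i] != s2[k]:
                out_of_order += 1
            k += 1
    transpositions = out_of_order // 2
    return (matches / len1 + matches / len2
            + (matches - transpositions) / matches) / 3.0


def jaro_winkler(s1: str, s2: str, prefix_scale: float = 0.1) -> float:
    """Jaro similarity boosted for a shared prefix of up to 4 characters."""
    base = jaro(s1, s2)
    prefix = 0
    for a, b in zip(s1[:4], s2[:4]):
        if a != b:
            break
        prefix += 1
    return base + prefix * prefix_scale * (1.0 - base)


if __name__ == "__main__":
    # Compare two hypothetical renderings of the same name.
    print(jaro_winkler("zhanghaochen", "haochenzhang"))
```

The prefix bonus rewards names that agree at the start, so a pair such as `zhanghaochen` and `haochenzhang` gets no boost even though the tokens are the same. This is one reason token-level features are useful alongside string similarity.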
Entity-based representation • Extraction • Attribute => Entity • Named-entity recognition • Language detector • Regular expression • Involved entities • Username • Name • Location • Organization • URL • Language • Country • Gender • Birth • Tokenization • General representation • Microsoft Word Breaker • Preparation for canopy
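A rough sketch of the extraction step, assuming a profile arrives as a dictionary with `username`, `name` and `bio` fields (hypothetical field names). The regular expressions are simple stand-ins: the slides mention named-entity recognition, a language detector and Microsoft Word Breaker, none of which is reproduced here.

```python
import re

# Illustrative patterns only; the patterns used in the original work are not given.
URL_PATTERN = re.compile(r"https?://[^\s,;)]+", re.IGNORECASE)
TOKEN_PATTERN = re.compile(r"[^\W_]+")  # runs of letters/digits, Unicode-aware


def extract_urls(text: str) -> list:
    """Return URLs found in free text, dropping a trailing full stop."""
    return [url.rstrip(".") for url in URL_PATTERN.findall(text)]


def tokenize(text: str) -> list:
    """Lower-cased word tokens; a stand-in for a real word breaker."""
    return [tok.lower() for tok in TOKEN_PATTERN.findall(text)]


def to_entities(profile: dict) -> dict:
    """Map raw profile attributes to a small entity-based representation."""
    bio = profile.get("bio", "")
    return {
        "username": tokenize(profile.get("username", "")),
        "name": tokenize(profile.get("name", "")),
        "url": extract_urls(bio),
        # Tokens of the free text with URLs removed, usable for canopy pruning.
        "all_tokens": tokenize(URL_PATTERN.sub(" ", bio)),
    }


if __name__ == "__main__":
    example = {
        "username": "zhang_haochen",
        "name": "Haochen Zhang",
        "bio": "Student. Blog: http://example.com/me.",
    }
    print(to_entities(example))
```

Location, organization, language, country, gender and birth would need dedicated extractors; they are left out because the slides do not describe how they were obtained.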
Classification: features • String similarity (Jaro-Winkler similarity) • Username, Name • Token similarity (n-gram, IDF) • Username, Name, Location, URL, Organization • All tokens • Enumeration identity • Language, Country
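The token-similarity features could be computed along these lines. This is a sketch under assumptions: character trigrams compared with the Dice coefficient, and an IDF-weighted cosine over token sets with a smoothed IDF. The slides name "n-gram" and "IDF" but do not give the exact formulas, so every function here is illustrative.

```python
import math
from collections import Counter


def char_ngrams(text: str, n: int = 3) -> list:
    """Character n-grams of a lower-cased string, padded with '#'."""
    pad = "#" * (n - 1)
    padded = pad + text.lower() + pad
    return [padded[i:i + n] for i in range(len(padded) - n + 1)]


def ngram_similarity(a: str, b: str, n: int = 3) -> float:
    """Dice coefficient over the two n-gram sets (assumed choice of measure)."""
    if not a or not b:
        return 0.0
    grams_a, grams_b = set(char_ngrams(a, n)), set(char_ngrams(b, n))
    return 2.0 * len(grams_a & grams_b) / (len(grams_a) + len(grams_b))


def build_idf(documents: list) -> dict:
    """Smoothed inverse document frequency for every token in the corpus."""
    doc_freq = Counter()
    for tokens in documents:
        doc_freq.update(set(tokens))
    n_docs = len(documents)
    return {tok: math.log((1 + n_docs) / (1 + df)) + 1.0
            for tok, df in doc_freq.items()}


def idf_cosine(tokens_a: list, tokens_b: list, idf: dict,
               default_idf: float = 1.0) -> float:
    """Cosine similarity of two token sets, each token weighted by its IDF."""
    set_a, set_b = set(tokens_a), set(tokens_b)
    if not set_a or not set_b:
        return 0.0

    def weight(tok):
        return idf.get(tok, default_idf)

    dot = sum(weight(t) ** 2 for t in set_a & set_b)
    norm_a = math.sqrt(sum(weight(t) ** 2 for t in set_a))
    norm_b = math.sqrt(sum(weight(t) ** 2 for t in set_b))
    return dot / (norm_a * norm_b)
```

With IDF weighting, sharing a frequent token such as `chen` contributes less to the score than sharing a rarer one, which matters given the token frequencies shown on the multilingualism slide.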
Classification: learning • Supervised Learning • SVM • Naïve Bayes • C4.5 • AdaBoost with C4.5 • Problems • Imbalance between positive instances and negative instances.
Experimental dataset • Data sources • Google+ • Twitter • Facebook • 20,000+ profiles for each social network and 10,000+ matched pairs
Experiment on artificial dataset • Balanced dataset with equal numbers of positive instances and randomly selected negative instances, so matched links and unmatched links are quite different from each other. • Name features are the most important features, but they largely mirror each other. • Country extracted from URL may be deceptive.
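A balanced set of this kind could be assembled roughly as follows: keep every known matched pair as a positive and draw the same number of random non-matching (query, target) pairs as negatives. The function name, the fixed seed and the rejection-sampling loop are assumptions for illustration; the slides do not describe the sampling procedure.

```python
import random


def balanced_training_set(matched_pairs, query_ids, target_ids, seed=0):
    """Return [((query_id, target_id), label)] with as many 0s as 1s.

    Assumes every matched pair is drawn from query_ids x target_ids.
    """
    rng = random.Random(seed)  # fixed seed so the sample is reproducible
    queries = sorted(set(query_ids))
    targets = sorted(set(target_ids))
    positives = set(matched_pairs)

    # Never ask for more negatives than actually exist.
    available = len(queries) * len(targets) - len(positives)
    wanted = min(len(positives), max(available, 0))

    negatives = set()
    while len(negatives) < wanted:
        pair = (rng.choice(queries), rng.choice(targets))
        if pair not in positives:
            negatives.add(pair)

    return ([(pair, 1) for pair in sorted(positives)]
            + [(pair, 0) for pair in sorted(negatives)])
```

Randomly drawn negatives rarely resemble the positives, which fits the slide's remark that matched and unmatched links are easy to tell apart in this setting.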
Experiment on overall dataset • Imbalanced dataset: the ratio of positive to negative instances can be 1:100 even after pruning with canopy. • The remaining candidate pairs are more similar to each other, which hurts performance. • Pruning failures and excessive reliance on name features hurt recall.
Parameter tuning • A greater threshold brings in more candidates that interfere with the classifier. • A smaller threshold prunes more matched links by mistake.
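The trade-off can be seen in a simplified canopy-style pruning step. The slides do not say how the threshold is defined; the sketch below assumes it is an upper bound on a cheap distance (here, Jaccard distance over tokens), which is one reading consistent with "a greater threshold brings in more candidates". It uses a single loose threshold per query profile rather than full two-threshold canopy clustering, and all names are hypothetical.

```python
from collections import defaultdict


def jaccard_distance(tokens_a, tokens_b) -> float:
    """Cheap distance in [0, 1]; 0 means identical token sets."""
    set_a, set_b = set(tokens_a), set(tokens_b)
    if not set_a or not set_b:
        return 1.0
    return 1.0 - len(set_a & set_b) / len(set_a | set_b)


def candidate_pairs(query: dict, target: dict, threshold: float) -> list:
    """Keep (query_id, target_id) pairs whose cheap distance is <= threshold.

    query and target map profile ids to token lists. An inverted index
    restricts comparisons to targets sharing at least one token, so pairs
    with no token in common are never produced, whatever the threshold.
    """
    index = defaultdict(set)
    for target_id, tokens in target.items():
        for tok in set(tokens):
            index[tok].add(target_id)

    pairs = []
    for query_id, tokens in query.items():
        reachable = set()
        for tok in set(tokens):
            reachable |= index.get(tok, set())
        for target_id in sorted(reachable):
            if jaccard_distance(tokens, target[target_id]) <= threshold:
                pairs.append((query_id, target_id))
    return pairs


if __name__ == "__main__":
    query = {"q1": ["haochen", "zhang"]}
    target = {"t1": ["zhang", "haochen"], "t2": ["zhang", "wei"]}
    print(candidate_pairs(query, target, threshold=0.2))  # tight: fewer pairs
    print(candidate_pairs(query, target, threshold=0.7))  # loose: more pairs
```

Only the pairs that survive this step are passed to the classifier, so the threshold directly controls both the candidate count and the risk of discarding a true match before classification.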
Conclusion • We have investigated characteristics of social network user profiles. • We proposed a supervised approach with canopy to solve the large-scale profile linkage task. • The approach is shown to be both effective and efficient, and its run time and complexity can be controlled.
Future work • Investigate more deeply the characteristics of profiles in different locales. • Improve learning techniques. • Automatic semi-structured profile comparison or schema mapping. • Improve the approach for the web people search task.
Web People Search • Search via a search engine. • query by username or tokens • 3 * 2 * 8 * 2 = 96 queries, 675 candidates, 63 ground truth • Evaluation (classifier trained on the overall training set) • SVM: P=0.30, R=0.77, F1=0.43 and Accuracy=0.81 • AdaBoosted C4.5: P=0.11, R=0.96, F1=0.21 and Accuracy=0.31 • Usernames or names that are too similar make it hard to classify unmatched instances correctly.
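For reference, the reported metrics relate to each other as sketched below, using the standard definitions of precision, recall and F1. The confusion-matrix counts behind the slide are not given, so the helper that works from counts is shown only for completeness.

```python
def f1_score(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall (0.0 when both are zero)."""
    if precision + recall == 0:
        return 0.0
    return 2.0 * precision * recall / (precision + recall)


def precision_recall_f1(tp: int, fp: int, fn: int):
    """Precision, recall and F1 from true/false positive and false negative counts."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall, f1_score(precision, recall)


if __name__ == "__main__":
    # SVM row from the slide: P=0.30, R=0.77.
    print(round(f1_score(0.30, 0.77), 2))
```

Plugging in the SVM row gives an F1 of about 0.43, which agrees with the slide. For the AdaBoosted C4.5 row the rounded P and R give about 0.20 rather than 0.21; a gap of that size is consistent with P and R having been rounded before reporting.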