Large-Scale Entity-Based Online Social Network Profile Linkage


Background & Motivation • Footprints in different social networks • User identification in social analysis • Privacy & security • Commercial & government applications

Outline • Problem definition • Related work • Approach • Experiment • Conclusion & future work

Problem Definition • Terminology • Identity: Person • Profile/User: Your footprint on social media • Profile Linkage: Link your footprints together • Input & Output • Input: profiles of one site as QUERY and profiles of the other site as TARGET. • Output: all profile pairs classified as matches.

Characteristics of profiles • Name (structured vs. semi-structured) • {“given name”: “haochen”, “family name”: “zhang”} • name: zhanghaochen • Semi-structured schema • Incompleteness & missing attributes • Privacy policy • Virtual identification • Free-text description • Bio, About me, Tags • Multilingualism

Multilingualism • Top 5 languages in the Facebook dataset • English • Portuguese • Spanish • Chinese • French • Most frequent tokens in different languages • chris, john, michael • chen, wang, lee • carlos, garcia, daniel • sergey, olga, alexander • About 70% of users use English • 7.2% of users register under different locales • Transliteration • 昊辰 => Haochen

Related work • String similarity metrics • Name matching (Jaro-Winkler, edit distance) • VSM (TF-IDF) • Pair-wise comparison • Schema matching for different prototypes • Unsupervised vs. supervised • Indexing techniques • Blocking • Canopy
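To make the name-matching metric above concrete, here is a minimal sketch of the standard Jaro-Winkler similarity in plain Python. The function names and the defaults `prefix_scale=0.1` and `max_prefix=4` are the conventional choices for this metric, not settings taken from the slides.

```python
def jaro(s1: str, s2: str) -> float:
    """Jaro similarity in [0, 1]; 1.0 means identical strings."""
    if s1 == s2:
        return 1.0
    len1, len2 = len(s1), len(s2)
    if len1 == 0 or len2 == 0:
        return 0.0
    # Characters match only if they are equal and within this window.
    window = max(max(len1, len2) // 2 - 1, 0)
    matched1 = [False] * len1
    matched2 = [False] * len2
    matches = 0
    for i, ch in enumerate(s1):
        lo = max(0, i - window)
        hi = min(len2, i + window + 1)
        for j in range(lo, hi):
            if not matched2[j] and s2[j] == ch:
                matched1[i] = matched2[j] = True
                matches += 1
                break
    if matches == 0:
        return 0.0
    # Count matched characters that appear in a different order.
    k = 0
    out_of_order = 0
    for i in range(len1):
        if matched1[i]:
            while not matched2[k]:
                k += 1
            if s1[i] != s2[k]:
                out_of_order += 1
            k += 1
    t = out_of_order / 2
    return (matches / len1 + matches / len2 + (matches - t) / matches) / 3


def jaro_winkler(s1: str, s2: str, prefix_scale: float = 0.1, max_prefix: int = 4) -> float:
    """Jaro similarity boosted for strings that share a common prefix."""
    base = jaro(s1, s2)
    prefix = 0
    for a, b in zip(s1[:max_prefix], s2[:max_prefix]):
        if a != b:
            break
        prefix += 1
    return base + prefix * prefix_scale * (1.0 - base)
```

The prefix boost is why this metric suits usernames and personal names, where differences tend to appear toward the end of the string.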

Overview of approach

Entity-based representation • Extraction • Attribute => Entity • Named-entity recognition • Language detector • Regular expression • Involved entities • Username • Name • Location • Organization • URL • Language • Country • Gender • Birth • Tokenization • General representation • Microsoft Word Breaker • Preparation for canopy
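The attribute-to-entity step can be sketched roughly as follows. This is only an illustration under stated assumptions: the attribute names in `ATTRIBUTE_TO_ENTITY`, the regular expressions, and the simple `\w+` tokenizer are hypothetical stand-ins (the slides mention named-entity recognition, a language detector, and Microsoft Word Breaker, none of which is reproduced here).

```python
import re
from urllib.parse import urlparse

URL_RE = re.compile(r"https?://[^\s,;]+", re.IGNORECASE)
TOKEN_RE = re.compile(r"\w+")

# Hypothetical mapping from raw profile attributes to entity types.
ATTRIBUTE_TO_ENTITY = {
    "username": "Username",
    "name": "Name",
    "location": "Location",
    "employer": "Organization",
}


def tokenize(text):
    """Lowercased word tokens; a stand-in for a real word breaker."""
    return [t.lower() for t in TOKEN_RE.findall(text)]


def extract_entities(profile):
    """Turn a dict of string attributes into {entity_type: set_of_tokens}."""
    entities = {}
    for attribute, value in profile.items():
        if not value:
            continue  # incomplete profiles: skip missing attributes
        entity_type = ATTRIBUTE_TO_ENTITY.get(attribute)
        if entity_type:
            entities.setdefault(entity_type, set()).update(tokenize(value))
        # URLs can appear in any attribute, including free text such as a bio.
        for url in URL_RE.findall(value):
            host = urlparse(url).netloc.lower()
            if host:
                entities.setdefault("URL", set()).add(host)
    return entities
```

Keeping every entity type as a set of tokens gives all profiles the same general representation, whatever schema the source site uses.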

Canopy: design
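As a generic illustration of canopy-style candidate pruning, and not necessarily the design shown on this slide, the sketch below builds an inverted index over TARGET tokens and keeps only targets that share enough tokens with a QUERY profile. The parameters `min_shared` and `max_df` are hypothetical and do not correspond to any threshold named in the slides.

```python
from collections import defaultdict


def build_index(targets):
    """Map each token to the ids of TARGET profiles that contain it."""
    index = defaultdict(set)
    for profile_id, tokens in targets.items():
        for token in tokens:
            index[token].add(profile_id)
    return index


def canopy_candidates(query_tokens, index, min_shared=1, max_df=1000):
    """Return ids of targets sharing at least `min_shared` usable tokens.

    Tokens found in more than `max_df` targets are skipped: they are too
    common to discriminate and would flood the candidate set.
    """
    shared = defaultdict(int)
    for token in set(query_tokens):
        postings = index.get(token, ())
        if len(postings) > max_df:
            continue
        for profile_id in postings:
            shared[profile_id] += 1
    return {pid for pid, count in shared.items() if count >= min_shared}
```

Only the pairs that survive this cheap test need to be scored by the more expensive classifier, which is what keeps pair-wise comparison tractable at scale.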

Canopy: efficiency

Classification: features • String similarity (Jaro-Winkler similarity) • Username, Name • Token similarity (n-gram, IDF) • Username, Name, Location, URL, Organization • All tokens • Enumeration identity • Language, Country
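The token-similarity and enumeration-identity features could be computed along these lines. This is a sketch under assumptions: the smoothed IDF formula, the character-trigram Jaccard score, and the profile keys (`name`, `tokens`, `language`, `country`) are illustrative choices, and the string-similarity features are left out to keep the example short.

```python
import math
from collections import Counter


def char_ngrams(text, n=3):
    """Set of padded character n-grams; robust to missing word boundaries."""
    if not text:
        return set()
    pad = "#" * (n - 1)
    padded = f"{pad}{text.lower()}{pad}"
    return {padded[i:i + n] for i in range(len(padded) - n + 1)}


def jaccard(a, b):
    """Plain set overlap in [0, 1]."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def idf_table(token_sets):
    """Smoothed inverse document frequency for every token in the corpus."""
    n_docs = len(token_sets)
    df = Counter(t for tokens in token_sets for t in set(tokens))
    return {t: math.log((1 + n_docs) / (1 + c)) + 1.0 for t, c in df.items()}


def idf_overlap(a, b, idf, default=1.0):
    """Jaccard overlap where rare tokens weigh more than common ones."""
    union = a | b
    if not union:
        return 0.0
    weight = lambda tokens: sum(idf.get(t, default) for t in tokens)
    return weight(a & b) / weight(union)


def same_value(a, b):
    """Enumeration identity: 1.0 only if both values are known and equal."""
    return 1.0 if a is not None and a == b else 0.0


def pair_features(query, target, idf):
    """Feature dict for one (QUERY, TARGET) profile pair."""
    return {
        "name_ngram": jaccard(char_ngrams(query["name"]), char_ngrams(target["name"])),
        "token_idf": idf_overlap(query["tokens"], target["tokens"], idf),
        "same_language": same_value(query.get("language"), target.get("language")),
        "same_country": same_value(query.get("country"), target.get("country")),
    }
```

Character n-grams let a semi-structured name such as `zhanghaochen` still overlap with a structured `Haochen Zhang`, while IDF weighting keeps frequent tokens like `john` or `chen` from dominating the score.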

Classification: learning • Supervised Learning • SVM • Naïve Bayes • C4.5 • AdaBoost with C4.5 • Problems • Imbalance between positive instances and negative instances.

Dataset of experiment • Data source • Google+ • Twitter • Facebook • 20,000+ profiles for each social network and 10,000+ matched pairs

Experiment on artificial dataset • Balanced dataset with equal numbers of positive instances and randomly selected negative instances; in this setting matched links and unmatched links are quite different from each other. • Name features are the most important features, although they largely mirror each other. • Country extracted from URL may be deceptive.
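A balanced set of this kind, with all positives plus an equal number of randomly chosen negatives, can be assembled with a few lines of Python. The function name, the `ratio` parameter, and the fixed `seed` are assumptions made for the sketch.

```python
import random


def undersample_negatives(pairs, labels, ratio=1.0, seed=0):
    """Keep every positive pair and a random sample of negative pairs.

    `ratio` is the number of negatives kept per positive (1.0 = balanced).
    Labels are 1 for matched pairs and 0 for unmatched pairs.
    """
    rng = random.Random(seed)
    positives = [i for i, y in enumerate(labels) if y == 1]
    negatives = [i for i, y in enumerate(labels) if y == 0]
    keep = min(len(negatives), int(round(len(positives) * ratio)))
    chosen = sorted(positives + rng.sample(negatives, keep))
    return [pairs[i] for i in chosen], [labels[i] for i in chosen]
```

Randomly drawn negatives are mostly easy cases, which is consistent with the observation that matched and unmatched links look very different in this dataset.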

Experiment on overall dataset • Imbalanced dataset: the ratio of positive to negative instances can be 1:100 even after pruning with canopy. • More similar pending pairs hurt performance. • Failed pruning and excessive reliance on name features hurt recall.

Parameter tuning • A greater threshold admits more candidates, which interfere with the classifier. • A smaller threshold prunes more matched links by mistake.

Efficiency

Conclusion • We investigated the characteristics of social network user profiles. • We proposed a supervised approach with canopy to solve the large-scale profile linkage task. • The approach proved to be both effective and efficient, while its run time and complexity can be controlled.

Future work • Investigate more deeply the characteristics of profiles in different locales. • Improve learning techniques. • Automatic semi-structured profile comparison or schema mapping. • Improve the approach for the web people search task.

Theta vs. Corpus size

Web People Search • Search via a search engine. • Query by username or tokens • 3 * 2 * 8 * 2 = 96 queries, 675 candidates, 63 ground truth • Evaluation (classifier trained on the overall training set) • SVM: P=0.30, R=0.77, F1=0.43, Accuracy=0.81 • AdaBoosted C4.5: P=0.11, R=0.96, F1=0.21, Accuracy=0.31 • Usernames or names are too similar for unmatched instances to be classified correctly.
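For reference, the reported measures relate to each other as in the short sketch below. The function names are illustrative, and the confusion-matrix counts behind the slide's figures are not given, so only the F1 relationship can be checked against the reported P and R.

```python
def f1_from(precision, recall):
    """Harmonic mean of precision and recall."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


def evaluate(tp, fp, fn, tn):
    """Precision, recall, F1 and accuracy from confusion-matrix counts."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    total = tp + fp + fn + tn
    accuracy = (tp + tn) / total if total else 0.0
    return {
        "precision": precision,
        "recall": recall,
        "f1": f1_from(precision, recall),
        "accuracy": accuracy,
    }
```

With the SVM figures, `f1_from(0.30, 0.77)` rounds to 0.43, matching the slide. With the AdaBoosted C4.5 figures, the rounded inputs give about 0.20 rather than 0.21, a gap that rounding of P and R is enough to explain. The low accuracy of the boosted model alongside its high recall reflects many unmatched candidates being labeled as matches.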

Thank you
