Social Network Extraction of Academic Researchers

Jie Tang, Duo Zhang, and Limin Yao
Tsinghua University
Oct. 29th 2007

Outline
• Motivation
• Related Work
• Problem Description
• Our Approach
• Experimental Results
• Summary

Motivation
• More and more online social networks are becoming available
  • e.g., YouTube.com, Facebook.com, etc.
• However, these social networks are usually separate from one another
• A question arises: can we automatically build an integrated social network from the separate ones?
• As a case study: how can we automatically build a social network for the academic community?
  • ArnetMiner.org

Motivating Example
[Figure: example researcher profile for Ruud Bolle, showing contact information, educational history, academic services, and publications, with labeled relations such as affiliation, position (Professor), and coauthor]

Motivating Example
• Two key issues:
  • How to accurately extract the researcher profile information from the Web?
  • How to integrate the information from different sources?

Related Work – Person Profiling
• Profile Information Extraction
  • E.g., Yu et al. (2005), resume IE
  • Alani et al. (2003), Artequakt system
• Contact Information Extraction
  • E.g., Kristjansson et al. (2004), interactive extraction
  • Balog and Rijke (2006), heuristic rules
• Information Extraction Methods
  • E.g., HMM (Ghahramani, 1997)
  • MEMM (McCallum, 2000)
  • CRFs (Lafferty, 2001)

Related Work – Name Disambiguation
• Unsupervised Methods
  • Hierarchical clustering, K-way spectral clustering, etc.
  • E.g. Han (2005), Mann (2003), Tan (2006)
• Supervised Methods
  • Support Vector Machines, Naïve Bayes, etc.
  • E.g. Han (2004)
• Graph-based Approach
  • Random Walk, etc.
  • E.g. Bekkerman (2005), Malin (2005), Minkov (2006)

Researcher Social Network Extraction
• 70.60% of the researchers have at least one homepage or an introducing page
  • 85.6% from universities, 14.4% from companies
  • 71.9% are homepages, 28.1% are introducing pages
  • 40% are in lists and tables, 60% are natural language text
• A large number of person names are ambiguous
  • Even within the authors’ lab, three graduates are named “Yi Li”
• 70% moved at least once

Markov Random Field
• Special cases:
  • Conditional Random Fields
  • Hidden Markov Random Fields
• Markov Property:
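The formula for the Markov property is not preserved in this transcript. For orientation, the standard local Markov property of a Markov random field can be written as below. The notation is an assumption and may differ from the slide: S is the set of all sites and N_i is the set of neighbors of site i.

```latex
P\left(Y_i \mid Y_{S \setminus \{i\}}\right) = P\left(Y_i \mid Y_{N_i}\right)
```

That is, given its neighbors, each variable is conditionally independent of all other variables in the field.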

CRFs
• Green nodes are hidden variables
• Purple nodes are observations
[Figure: a linear-chain CRF over the observed tokens “He is a Professor at UIUC”; each token has the candidate tags ADR, AFF, POS, and OTH]
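To make the tagging picture concrete, here is a minimal sketch of how the best tag sequence could be decoded in a linear-chain model with Viterbi search. The scores are made up for illustration and are not taken from the slides; a trained CRF would compute them from weighted features. The function name `viterbi` and the score tables are assumptions.

```python
from typing import Dict, List

TAGS = ["ADR", "AFF", "POS", "OTH"]  # tag set shown on the slide


def viterbi(emission: List[Dict[str, float]],
            transition: Dict[str, Dict[str, float]]) -> List[str]:
    """Return the highest-scoring tag sequence of a linear chain.

    emission[t][tag] is the local score of `tag` at position t,
    transition[prev][cur] is the score of moving from `prev` to `cur`.
    """
    best = [dict(emission[0])]   # best[t][tag]: best score of a path ending in tag
    back = []                    # back[t-1][tag]: predecessor tag on that path
    for t in range(1, len(emission)):
        scores, pointers = {}, {}
        for cur in TAGS:
            prev = max(TAGS, key=lambda p: best[-1][p] + transition[p][cur])
            scores[cur] = best[-1][prev] + transition[prev][cur] + emission[t][cur]
            pointers[cur] = prev
        best.append(scores)
        back.append(pointers)
    # Follow the back-pointers from the best final tag.
    path = [max(TAGS, key=lambda tag: best[-1][tag])]
    for pointers in reversed(back):
        path.append(pointers[path[-1]])
    return path[::-1]


# Hypothetical scores: every token mildly prefers OTH, except for two tokens.
tokens = ["He", "is", "a", "Professor", "at", "UIUC"]
emission = [{"ADR": 0.0, "AFF": 0.0, "POS": 0.0, "OTH": 1.0} for _ in tokens]
emission[3]["POS"] = 2.0   # "Professor" looks like a position
emission[5]["AFF"] = 2.0   # "UIUC" looks like an affiliation
transition = {p: {c: 0.0 for c in TAGS} for p in TAGS}  # no transition preference

decoded = viterbi(emission, transition)
print(list(zip(tokens, decoded)))
```

With all transition scores set to zero the decoder just picks the best local tag per token; non-zero transition scores are what let neighboring tags influence each other.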

Processing Flow for Profiling
[Figure: processing flow. (1) Preprocessing: determine the tokens in the input documents (standard word, special word, image token, term, punctuation mark). (2) Tagging: assign tags. (3) Model learning: feature definitions, learning a CRF model. Training uses labeled data; at test time, a unified tagging model produces the tagging results.]

Token Definitions
• Standard word: words in natural language
• Special word: several general ‘special words’, e.g. email address, IP address, URL, date, number, money, percentage, unnecessary tokens (e.g. ‘===’ and ‘###’), etc.
• Image token: e.g. <IMAGE src="defaul3.jpg" alt=""/>
• Term: base NP, like “Computer Science”
• Punctuation marks: period, question mark, and exclamation mark
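The token types above could be recognized roughly as in the following sketch. The regular expressions and the function name `token_type` are illustrative assumptions, not the authors' rules. Dates are left out, and detecting terms (base noun phrases) would need a chunker, so terms are not handled here.

```python
import re

# Hypothetical patterns for a few of the "special word" kinds.
SPECIAL_PATTERNS = [
    ("email", re.compile(r"^[\w.+-]+@[\w-]+(\.[\w-]+)+$")),
    ("url", re.compile(r"^(https?://|www\.)\S+$", re.IGNORECASE)),
    ("ip", re.compile(r"^\d{1,3}(\.\d{1,3}){3}$")),
    ("percentage", re.compile(r"^\d+(\.\d+)?%$")),
    ("money", re.compile(r"^\$\d+(\.\d+)?$")),
    ("number", re.compile(r"^\d+(\.\d+)?$")),
    ("unnecessary", re.compile(r"^[=#*_-]{3,}$")),
]
IMAGE_PATTERN = re.compile(r"^<IMAGE\b[^>]*/?>$", re.IGNORECASE)
PUNCTUATION = {".", "?", "!"}


def token_type(token: str) -> str:
    """Classify one pre-split token into a coarse token type."""
    if IMAGE_PATTERN.match(token):
        return "image"
    if token in PUNCTUATION:
        return "punctuation"
    for name, pattern in SPECIAL_PATTERNS:
        if pattern.match(token):
            return "special:" + name
    return "standard"


for tok in ["Professor", "jtang@example.edu", "===", "85.6%",
            '<IMAGE src="defaul3.jpg" alt=""/>', "."]:
    print(tok, "->", token_type(tok))
```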

Possible Tag Assignment

Feature Definition
• Content features
  • Standard word
    • Word features: whether the current token is a word
    • Morphological features: whether the word is capitalized
  • Image token
    • Image size: the size of the image
    • Image height/width ratio: the value of height/width; for a person photo this value is often larger than 1
    • Image format: JPG or BMP
    • Image color: the number of “unique colors” used in the image and the number of bits used per pixel, i.e. 32, 24, 16, 8, 1
    • Face recognition: whether the current image contains a person's face
    • Image filename: whether the filename contains (partially) the researcher name
    • Image “ALT”: whether the “alt” of the image contains (partially) the researcher name
    • Image positive keywords: “myself”, “biology”
    • Image negative keywords: “ads”, “banner”, “logo”
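Several of the image-token features above can be computed from the image tag alone. The sketch below is one possible reading, not the authors' implementation: the function name `image_features`, the partial-name test (any part of the name appears as a substring), and the choice to search keywords in both filename and alt text are all assumptions. Color counting and face recognition need the image data and are omitted.

```python
POSITIVE_KEYWORDS = ("myself", "biology")      # from the slide
NEGATIVE_KEYWORDS = ("ads", "banner", "logo")  # from the slide


def image_features(width, height, filename, alt, researcher_name):
    """Return a dict of simple features for one image token."""
    name_parts = [part.lower() for part in researcher_name.split()]
    fname = filename.lower()
    alt_text = alt.lower()
    searched = fname + " " + alt_text  # assumption: keywords may occur in either
    return {
        "size": width * height,
        "height_width_ratio": height / width if width else 0.0,
        "format": fname.rsplit(".", 1)[-1] if "." in fname else "",
        "filename_has_name": any(part in fname for part in name_parts),
        "alt_has_name": any(part in alt_text for part in name_parts),
        "positive_keyword": any(k in searched for k in POSITIVE_KEYWORDS),
        "negative_keyword": any(k in searched for k in NEGATIVE_KEYWORDS),
    }


photo = image_features(120, 160, "jie_tang.jpg", "", "Jie Tang")
banner = image_features(468, 60, "banner_ads.gif", "logo", "Jie Tang")
print(photo)
print(banner)
```

In this toy input the portrait-shaped image has a height/width ratio above 1 and a filename containing the name, while the banner has a ratio well below 1 and hits the negative keywords.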

Feature Definition • Pattern features • Term features

Our Method for Name Disambiguation
• A hidden Markov Random Field model
• Observable variables X represent publications
• Hidden variables Y represent the labels of publications
• Constraints define the dependencies over hidden variables

Objective Function
[Equations not preserved in this transcript: a quantity to maximize and the corresponding objective to minimize]

Constraint Definition

Parameterized Distance Function
• We define our distance function as follows: [equation not preserved in this transcript]
• We can see that A actually maps each vector x_i into a new space, i.e. A^{1/2} x_i
• To simplify the problem, we define A as a diagonal matrix
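As an illustration of what a distance parameterized by a diagonal matrix A can look like, the sketch below uses a cosine-style form, D(x, y) = 1 - x^T A y / (||x||_A ||y||_A) with ||x||_A = sqrt(x^T A x). This specific form is an assumption, since the slide's formula is missing; it is chosen because it agrees with the remark that A maps each vector to A^{1/2} x_i. All function names are hypothetical.

```python
import math


def param_distance(x, y, a):
    """Assumed form: 1 - x^T A y / (||x||_A ||y||_A), A = diag(a)."""
    dot = sum(ai * xi * yi for ai, xi, yi in zip(a, x, y))
    norm_x = math.sqrt(sum(ai * xi * xi for ai, xi in zip(a, x)))
    norm_y = math.sqrt(sum(ai * yi * yi for ai, yi in zip(a, y)))
    if norm_x == 0 or norm_y == 0:
        return 1.0  # convention for zero vectors in this sketch
    return 1.0 - dot / (norm_x * norm_y)


def cosine_distance(x, y):
    """Plain cosine distance, i.e. the case A = identity."""
    return param_distance(x, y, [1.0] * len(x))


def map_to_new_space(x, a):
    """Apply A^{1/2} to x for diagonal A."""
    return [math.sqrt(ai) * xi for ai, xi in zip(a, x)]


x, y, a = [1.0, 2.0, 0.0], [2.0, 1.0, 1.0], [4.0, 1.0, 0.25]
direct = param_distance(x, y, a)
mapped = cosine_distance(map_to_new_space(x, a), map_to_new_space(y, a))
print(direct, mapped)  # the two values agree up to rounding
```

Under this form, the parameterized distance equals the ordinary cosine distance computed after mapping both vectors with A^{1/2}, and a diagonal A reduces to one non-negative weight per feature.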

EM Framework
• Initialization
  • Use constraints to generate the initial k clusters
• E-Step
• M-Step
  • Update cluster centroids
  • Update parameter matrix A
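The loop structure of the framework can be sketched as follows. This is a deliberately simplified, hard-assignment (k-means style) version under stated assumptions: the initial centroids are simply given rather than derived from constraints, the E-step ignores constraint penalties, the update of A is left out, and the distance is an assumed cosine-style form with diagonal A. It is not the authors' algorithm, only the alternation of the two steps.

```python
import math


def distance(x, y, a):
    """Assumed cosine-style distance with diagonal weights a."""
    dot = sum(ai * xi * yi for ai, xi, yi in zip(a, x, y))
    nx = math.sqrt(sum(ai * xi * xi for ai, xi in zip(a, x)))
    ny = math.sqrt(sum(ai * yi * yi for ai, yi in zip(a, y)))
    return 1.0 if nx == 0 or ny == 0 else 1.0 - dot / (nx * ny)


def em_cluster(points, initial_centroids, a, iterations=10):
    """Alternate assignment (E-step) and centroid update (M-step)."""
    centroids = [list(c) for c in initial_centroids]
    labels = [0] * len(points)
    for _ in range(iterations):
        # E-step (simplified): assign each point to its nearest centroid.
        labels = [
            min(range(len(centroids)), key=lambda k: distance(p, centroids[k], a))
            for p in points
        ]
        # M-step (partial): recompute each centroid as the mean of its members.
        for k in range(len(centroids)):
            members = [p for p, label in zip(points, labels) if label == k]
            if members:
                centroids[k] = [sum(col) / len(members) for col in zip(*members)]
        # Updating the parameter matrix A would go here; omitted in this sketch.
    return labels, centroids


points = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9]]
labels, centroids = em_cluster(points, [[1.0, 0.0], [0.0, 1.0]], a=[1.0, 1.0])
print(labels, centroids)
```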

Profiling Experiments
• Dataset
  • 1K researchers from ArnetMiner.org
• Baselines
  • Amilcare
  • Support Vector Machines
  • Unified_NT (CRFs without transition features)
• Evaluation measures
  • Precision, Recall, F1
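For reference, the three evaluation measures can be computed as in this short sketch. Representing extracted profile items as (property, value) pairs and requiring an exact match are assumptions made for the example; the slides do not state the matching criterion.

```python
def precision_recall_f1(predicted, gold):
    """Compute precision, recall, and F1 over two collections of items."""
    predicted, gold = set(predicted), set(gold)
    true_positives = len(predicted & gold)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    return precision, recall, 2 * precision * recall / (precision + recall)


# Hypothetical extraction output versus hand-labeled items.
predicted = [("position", "Professor"), ("affiliation", "UIUC"), ("email", "x")]
gold = [("position", "Professor"), ("affiliation", "UIUC"),
        ("phone", "y"), ("address", "z")]
p, r, f1 = precision_recall_f1(predicted, gold)
print(p, r, f1)
```

Here two of three predictions are correct and two of four gold items are found, so precision is 2/3, recall is 1/2, and F1 is their harmonic mean, 4/7.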

Profiling Results (5-fold cross validation)
[Results table not preserved in this transcript; the only surviving value is 83.37]

Contribution of Features

Disambiguation Experiments
• Data Sets:
  • Abbreviated Name dataset
  • Real Name dataset

Experiment Setup
• Baseline Method
  • Unsupervised Hierarchical Clustering Method
• Measurement

Disambiguation Results

Contribution of Different Constraints

How Profiling and Disambiguation Help Expert Finding
• Expert finding by using a PageRank-based method
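The slides do not detail the PageRank-based method, so the following is only a generic sketch of PageRank by power iteration on a small directed graph. The graph, the damping factor of 0.85, and the function name `pagerank` are assumptions; how the authors build the graph from profiles and publications is not stated here.

```python
def pagerank(graph, damping=0.85, iterations=50):
    """Power-iteration PageRank.

    graph maps every node to the list of nodes it links to; each link
    target must also appear as a key.
    """
    nodes = list(graph)
    n = len(nodes)
    rank = {v: 1.0 / n for v in nodes}
    for _ in range(iterations):
        new_rank = {v: (1.0 - damping) / n for v in nodes}
        for v in nodes:
            # A node without out-links spreads its rank over all nodes.
            targets = graph[v] or nodes
            share = damping * rank[v] / len(targets)
            for target in targets:
                new_rank[target] += share
        rank = new_rank
    return rank


# Hypothetical link graph between three researchers.
graph = {"A": ["C"], "B": ["C"], "C": ["A"]}
ranks = pagerank(graph)
print(sorted(ranks.items(), key=lambda item: -item[1]))
```

In this toy graph, C receives links from both A and B and ends up ranked highest, while B, which nobody links to, keeps only the baseline score.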

Summary
• Investigated the problem of researcher social network extraction
• Proposed a unified approach to profiling and a constraint-based probabilistic model for name disambiguation
• Experimental results show that our approaches outperform the baseline methods
• Applying the approaches to expert finding yields a significant improvement in performance

Thanks! Q&A
HP: http://keg.cs.tsinghua.edu.cn/persons/tj/
Online Demo: http://arnetminer.org
