350 likes | 439 Views
Social Network Extraction of Academic Researchers. Jie Tang, Duo Zhang, and Limin Yao Tsinghua University Oct. 29 th 2007. Outline. Motivation Related Work Problem Description Our Approach Experimental Results Summary. Motivation. More and more online social networks become available
E N D
Social Network Extraction of Academic Researchers Jie Tang, Duo Zhang, and Limin Yao Tsinghua University Oct. 29th 2007
Outline • Motivation • Related Work • Problem Description • Our Approach • Experimental Results • Summary
Motivation • More and more online social networks become available • e.g., YouTube.com, Facebook.com, etc. • However, the social networks are usually separated • A question arises: can we build a integrated social network from the separated ones automatically? • As a case study, how to build an social network automatically for academic community? • ArnetMiner.org
Motivating Example 2 Contact Information 1 Educational history 1 Academic services Publications 1 2 UIUC coauthor affiliation 1 Ruud Bolle Publication #3 Professor 2 2 position Publication #5 coauthor
Motivating Example 2 Contact Information 1 • Two key issues: • How to accurately extract the researcher profile information from the Web? • How to integrate the information from different sources? Educational history 1 Academic services Publications 1 2 UIUC coauthor affiliation 1 Ruud Bolle Publication #3 Professor 2 2 position Publication #5 coauthor
Outline • Motivation • Related Work • Problem Description • Our Approach • Experimental Results • Summary
Related Work – Person Profiling • Profile Information Extraction • E.g., Yu et al. (2005), resume IE • Alani et al. (2003), Artequakt system • Contact Information Extraction • E.g., Kristjansson et al. (2004), Interactive extraction • Balog and Rijke (2006), Heuristic rules • Information Extraction Methods • E.g., HMM (Ghahramani, 1997), • MEMM (McCallum, 2000), • CRFs (Lafferty, 2001)
Related Work – Name Disambiguation • Unsupervised Methods • Hierarchy clustering, K-way spectral clustering, etc. • E.g. Han (2005), Mann (2003), Tan (2006) • Supervised Methods • Support Vector Machines, Naïve Bayes, etc. • E.g. Han (2004) • Graph-based Approach • Random Walk, etc. • E.g. Bekkerman (2005), Malin (2005), Minkov (2006)
Outline • Motivation • Related Work • Problem Description • Our Approach • Experimental Results • Summary
Researcher Social Network Extraction 70.60% of the researchers have at least one homepage or an introducing page 85.6% from universities 14.4% from companies 71.9% are homepages 28.1% are introducing pages 40% are in lists and tables 60% are natural language text There are a large number of person names having the ambiguity problem Even 3 “Yi Li” graduated the author’s lab 70% moved at least one time
Outline • Motivation • Related Work • Problem Description • Our Approach • Experimental Results • Summary
Markov Random Field Special Cases: - Conditional Random Fields - Hidden Markov Random Fields Markov Property:
CRFs - Green nodes are hidden vars, - Purple nodes are observations ADR ADR … … … … AFF AFF AFF AFF AFF AFF POS POS POS POS POS POS OTH OTH OTH OTH OTH OTH Professor is UIUC a at He
Processing Flow for Profiling 1 Preprocessing 2 Tagging Train Standard word Determine Tokens Special word Assigning tags Image Token Term Inputted docs Document Punc. mark Labeling data Test Model Learning 3 Feature definitions Learning a CRF model A unified tagging model Tagging results Labeled data
Token Definitions Standard word Words in natural language Special word Including several general ‘special words’ e.g. email address, IP address, URL, date, number, money, percentage, unnecessary tokens (e.g. ‘===’ and ‘###’), etc. Image token <IMAGE src="defaul3.jpg" alt=""/> base NP, like “Computer Science” Term Punctuation marks Including period, question mark, and exclamation mark
Feature Definition • Content features Standard Word Word features Whether the current token is a word Whether the word is capitalized Morphological features Image Token Image size The size of the image The value of height/width. The value of a person photo is often larger than 1 Image height/width ratio Image format JPG or BMP The number of the “unique color” used in the image and the number of bits used for per pixel, i.e. 32,24,16,8,1 Image color Whether the current image contains a person face Face recognition Whether the filename contains (partially) the researcher name Image filename Whether the “alt” of the image contains (partially) the researcher name Image “ALT” Image positive keywords “myself”, “biology” Image negative keywords “ads”, “banner”, “logo”
Feature Definition • Pattern features • Term features
Our Method to Name Disambiguation • A hidden Markov Random Field model • Observable Variables X represent publications • Hidden Variables Y represent the labels of publications • Constraints define the dependencies over hidden variables
Objective Function maximize 2 1 2 minimize
Parameterized Distance Function • We define our distance function as follows: where • We can see that actually maps each vector xi into another new space, i.e. A1/2xi • To simplify our question, we define A as a diagonal matrix
EM Framework • Initialization • use constraints to generate initial k clusters • E-Step • M-Step • Update cluster centroid • Update parameter matrix A
Outline • Motivation • Related Work • Problem Description • Our Approach • Experimental Results • Summary
Profiling Experiments • Dataset • IK researchers from ArnetMiner.org • Baseline • Amilcare • Support Vector Machines • Unified_NT (CRFs without transition features) • Evaluation measures • Precision, Recall, F1
Disambiguation Experiments • Data Sets: Abbreviated Name dataset Real Name dataset
Experiment Setup • Baseline Method Unsupervised Hierarchical Clustering Method • Measurement
How Profiling and Disambiguation Help Expert Finding • Expert finding by using a PageRank-based method
Outline • Motivation • Related Work • Problem Description • Our Approach • Experimental Results • Summary
Summary • Investigated the problem of researcher social network extraction • Proposed a unified approach to perform profiling and a constraint-based probabilistic model to name disambiguation • Experimental results show that our approaches outperform the baseline methods • When applying it to expert finding, we obtain a significant improvement on performances
Thanks! Q&A HP: http://keg.cs.tsinghua.edu.cn/persons/tj/ Online Demo: http://arnetminer.org