ArnetMiner – Extraction and Mining of Academic Social Networks

Jie Tang (1), Jing Zhang (1), Limin Yao (1), Juanzi Li (1), Li Zhang (2), and Zhong Su (2)
(1) Knowledge Engineering Group, Dept. of Computer Science and Technology, Tsinghua University
(2) IBM, China Research Lab
August 25th, 2008

Motivation
• “The information need is not only about publication…”
• “Academic search is treated as document search, but ignores semantics”
However, we are not satisfied with …

Examples – Expertise search
• When starting work on a new research topic, or brainstorming for novel ideas:
• Who are the experts in this field?
• What are the top conferences in the field?
• What are the best papers?
• What are the top research labs?
(Asked by Researcher A)

Examples – Citation network analysis
• How can one gain an in-depth understanding of the research field?
(Asked by Researcher B)

Examples – Conference Suggestion
• To which conference should we submit the paper?
(Asked by Researcher C; inputs: authors, content)

Examples – Reviewer Suggestion
• Who are the best-matching reviewers for each paper?
(Asked by the KDD Committee; inputs: conference, paper content)

Topic Browser

Academic Network Extraction in ArnetMiner = Researcher Profiling + Name Disambiguation

Motivating Example
[Figure: a researcher profile for Ruud Bolle assembled from two sources (marked 1 and 2). Labels in the figure: Contact Information, Educational history, Academic services, Publications, position (Professor), affiliation, UIUC, coauthor, Publication #3, Publication #5.]
• Two key issues:
• How to accurately extract the researcher profile information from the Web?
• How to integrate the information from different sources?

Researcher Network Extraction
• 70.60% of the researchers have at least one homepage or introducing page
• 85.6% are from universities; 14.4% are from companies
• 71.9% are homepages; 28.1% are introducing pages
• 40% are in lists and tables; 60% are natural language text
• A large number of person names are ambiguous
• The 300 most common male names are used by 1 billion+ people (78.74%) in the USA
• Even 3 people named “Yi Li” graduated from the author’s lab
• 70% moved at least one time

Our Approach Picture – based on Markov Random Field
• Markov Property:
• Special cases:
• Conditional Random Fields (Researcher Profiling)
• Hidden Markov Random Fields (Name Disambiguation)
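The formula for the Markov property did not survive on this slide. As a reminder, the standard textbook statement (not necessarily the notation the slide used) is that each variable depends on the rest of the field only through its neighbors, where N(i) denotes the neighbors of node i in the graph:

```latex
P(y_i \mid y_j,\ j \neq i) \;=\; P(y_i \mid y_j,\ j \in N(i))
```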

CRFs
• Green nodes are hidden variables
• Purple nodes are observations
[Figure: linear-chain CRF over the tokens “He is a Professor at UIUC”; each token has the candidate labels ADR, AFF, POS, and OTH.]
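To make the labeling picture concrete, here is a minimal sketch of how a linear-chain model picks the best label sequence with Viterbi decoding. The `emission` and `transition` scores are hand-set, hypothetical values chosen for illustration; a trained CRF would compute them from learned feature weights, and nothing here reflects ArnetMiner's actual parameters.

```python
LABELS = ["POS", "AFF", "ADR", "OTH"]  # tag names taken from the slide


def viterbi(tokens, emission, transition):
    """Return the highest-scoring label sequence for a linear-chain model."""
    # best[t][y]: best score of any labeling of tokens[:t + 1] that ends in y
    best = [{y: emission(tokens[0], y) for y in LABELS}]
    back = []  # back[t - 1][y]: best previous label when token t gets label y
    for tok in tokens[1:]:
        scores, pointers = {}, {}
        for y in LABELS:
            prev = max(LABELS, key=lambda p: best[-1][p] + transition(p, y))
            scores[y] = best[-1][prev] + transition(prev, y) + emission(tok, y)
            pointers[y] = prev
        best.append(scores)
        back.append(pointers)
    # Follow the back-pointers from the best final label.
    path = [max(LABELS, key=lambda y: best[-1][y])]
    for pointers in reversed(back):
        path.append(pointers[path[-1]])
    return path[::-1]


# Hypothetical scores: not learned, only for illustration.
def emission(token, label):
    if token == "Professor":
        return 2.0 if label == "POS" else 0.0
    if token == "UIUC":
        return 2.0 if label == "AFF" else 0.0
    return 1.0 if label == "OTH" else 0.0


def transition(prev, cur):
    return 0.5 if prev == cur else 0.0  # mild preference for label continuity


tokens = ["He", "is", "a", "Professor", "at", "UIUC"]
labels = viterbi(tokens, emission, transition)
print(list(zip(tokens, labels)))
```

With these toy scores, "Professor" is tagged POS, "UIUC" is tagged AFF, and the remaining tokens are tagged OTH.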

Token Definitions

Feature Definition
• Content features
Standard word token:
• Word features: whether the current token is a word
• Morphological features: whether the word is capitalized
Image token:
• Image size: the size of the image
• Image height/width ratio: the value of height/width; the value for a person photo is often larger than 1
• Image format: JPG or BMP
• Image color: the number of “unique colors” used in the image and the number of bits used per pixel, i.e. 32, 24, 16, 8, 1
• Face recognition: whether the current image contains a person's face
• Image filename: whether the filename contains (partially) the researcher name
• Image “ALT”: whether the “alt” of the image contains (partially) the researcher name
• Image positive keywords: “myself”, “biology”
• Image negative keywords: “ads”, “banner”, “logo”
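As a rough illustration of the image-token features listed above, the sketch below turns image metadata into a feature dictionary. The function name, its parameters, and the substring-matching rule are assumptions made for this example; face detection is passed in as a flag because it would come from an external detector, and the color features are omitted because they require decoding the image.

```python
POSITIVE_KEYWORDS = ("myself", "biology")    # keywords quoted on the slide
NEGATIVE_KEYWORDS = ("ads", "banner", "logo")


def image_token_features(width, height, filename, alt, researcher_name,
                         has_face=False):
    """Build a feature dict for one image token (assumes width > 0)."""
    name_parts = [part.lower() for part in researcher_name.split()]
    file_text = filename.lower()
    alt_text = alt.lower()
    return {
        "size": width * height,
        # A person photo is often taller than wide, so this is often > 1.
        "height_width_ratio": height / width,
        "format": file_text.rsplit(".", 1)[-1] if "." in file_text else "",
        "has_face": has_face,  # supplied by a hypothetical face detector
        # "Partially contains" is approximated by matching any name part.
        "filename_has_name": any(p in file_text for p in name_parts),
        "alt_has_name": any(p in alt_text for p in name_parts),
        "positive_keyword": any(k in file_text or k in alt_text
                                for k in POSITIVE_KEYWORDS),
        "negative_keyword": any(k in file_text or k in alt_text
                                for k in NEGATIVE_KEYWORDS),
    }


features = image_token_features(
    width=120, height=160, filename="ruud_bolle.jpg",
    alt="Photo of Ruud Bolle", researcher_name="Ruud Bolle")
print(features)
```

Plain substring matching is crude (for example, "ads" also matches "downloads"), so a real extractor would likely tokenize filenames first.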

Profiling Experiments
• Dataset
• 1,000 researchers from ArnetMiner.org
• Baselines
• Amilcare
• Support Vector Machines
• Unified_NT (CRFs without transition features)
• Evaluation measures
• Precision, Recall, F1
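For reference, the three evaluation measures can be computed as in the sketch below. The (field, value) pairs are made-up illustration data, not results from the experiments, and exact-match scoring of extracted fields is an assumption.

```python
def precision_recall_f1(predicted, gold):
    """Score a set of extracted items against a gold-standard set."""
    predicted, gold = set(predicted), set(gold)
    true_positives = len(predicted & gold)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


# Hypothetical extracted profile fields versus hand-labeled ones.
predicted = {("position", "Professor"), ("affiliation", "UIUC"),
             ("email", "someone@example.org")}
gold = {("position", "Professor"), ("affiliation", "UIUC"),
        ("phone", "555-0100"), ("address", "Urbana, IL")}

precision, recall, f1 = precision_recall_f1(predicted, gold)
print(f"P={precision:.3f} R={recall:.3f} F1={f1:.3f}")
```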

Profiling Results—5-fold cross validation 83.37

Name Disambiguation Proposal of a semi-supervised framework

Our Method to Name Disambiguation • A hidden Markov Random Field model • Observable Variables X represent publications • Hidden Variables Y represent the labels of publications • Paper relationships define the dependencies over hidden variables

Objective Function
[Equations not recoverable from the extraction: the slide showed one quantity to maximize and one quantity to minimize.]

Relationship Definition

Parameterized Distance Function
• We define the distance function as follows (Basu, 04): $D(x_i, x_j) = 1 - \frac{x_i^T A x_j}{\|x_i\|_A \, \|x_j\|_A}$, where $\|x_i\|_A = \sqrt{x_i^T A x_i}$
• We can see that A actually maps each vector $x_i$ into another new space, i.e. $A^{1/2} x_i$
• To simplify our question, we define A as a diagonal matrix
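A minimal sketch of such a distance, assuming it takes the parameterized cosine form D(x_i, x_j) = 1 - x_i^T A x_j / (||x_i||_A ||x_j||_A) with ||x||_A = sqrt(x^T A x), and with A diagonal as stated above. The function name and the list-based representation of A's diagonal are choices made for this example.

```python
import math


def parameterized_cosine_distance(x_i, x_j, a_diag):
    """1 - cosine similarity after weighting dimension d by a_diag[d].

    a_diag holds the diagonal of A; entries are assumed non-negative.
    """
    dot = sum(a * u * v for a, u, v in zip(a_diag, x_i, x_j))
    norm_i = math.sqrt(sum(a * u * u for a, u in zip(a_diag, x_i)))
    norm_j = math.sqrt(sum(a * v * v for a, v in zip(a_diag, x_j)))
    if norm_i == 0 or norm_j == 0:
        return 1.0  # convention for this sketch: zero vectors are far apart
    return 1.0 - dot / (norm_i * norm_j)


# Raising the weight of the first dimension pulls these vectors closer.
plain = parameterized_cosine_distance([1, 1], [1, 0], [1, 1])
weighted = parameterized_cosine_distance([1, 1], [1, 0], [4, 1])
print(plain, weighted)
```

Because A is diagonal, weighting is equivalent to rescaling each dimension by the square root of its weight before taking an ordinary cosine distance.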

EM Framework • Initialization • use constraints to generate initial k clusters • E-Step • M-Step • Update cluster centroid • Update parameter matrix A
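The loop below is a simplified sketch of this kind of EM procedure, not the authors' implementation. Assumptions: initial centroids are supplied by the caller (standing in for the constraint-based initialization), the E-step is a greedy assignment that adds a fixed `penalty` for each related paper placed in a different cluster, and the update of the parameter matrix A is omitted, so A stays fixed.

```python
import math


def distance(x, y, a_diag):
    """Parameterized cosine distance with a diagonal weight matrix."""
    dot = sum(a * u * v for a, u, v in zip(a_diag, x, y))
    nx = math.sqrt(sum(a * u * u for a, u in zip(a_diag, x)))
    ny = math.sqrt(sum(a * v * v for a, v in zip(a_diag, y)))
    return 1.0 if nx == 0 or ny == 0 else 1.0 - dot / (nx * ny)


def em_cluster(points, centroids, related, a_diag, penalty=0.5, max_iter=10):
    """points: paper vectors; related: dict mapping i to related paper ids."""
    centroids = [list(c) for c in centroids]  # do not mutate the caller's list
    labels = [None] * len(points)
    for _ in range(max_iter):
        # E-step: greedily assign each paper to its cheapest cluster.
        new_labels = list(labels)
        for i, x in enumerate(points):
            def cost(k):
                violated = sum(1 for j in related.get(i, ())
                               if new_labels[j] is not None
                               and new_labels[j] != k)
                return distance(x, centroids[k], a_diag) + penalty * violated
            new_labels[i] = min(range(len(centroids)), key=cost)
        # M-step: recompute each centroid as the mean of its members.
        for k in range(len(centroids)):
            members = [points[i] for i, l in enumerate(new_labels) if l == k]
            if members:
                centroids[k] = [sum(col) / len(members)
                                for col in zip(*members)]
        if new_labels == labels:
            break  # assignments are stable
        labels = new_labels
    return labels, centroids


# Toy data: paper 4 is tied (e.g. by a coauthor relation) to papers 2 and 3.
points = [[1, 0], [0.9, 0.1], [0, 1], [0.1, 0.9], [0.6, 0.5]]
related = {4: {2, 3}, 2: {4}, 3: {4}}
labels, centroids = em_cluster(points, [[1, 0], [0, 1]], related, [1, 1])
print(labels)
```

On its own, paper 4 is slightly closer to the first initial centroid; the relationship penalty is what places it with papers 2 and 3.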

Disambiguation Experiments • Data set:

Our Approach vs. Baseline

Contribution of Relationships

Distribution Analysis
(1) All methods can achieve good performance
(2) Our method can achieve good performance
(3) Our method obtains acceptable results, but still needs further improvement

Modeling the Academic Network and Applications

The Academic Network
• Academic Network
• Heterogeneous objects:
• Paper, Person, Conf./Journal
• Relationships:
• Conf./Journal publishes paper
• Paper cites paper
• Person writes paper
• Person is PC member of Conf./Journal
• Person is coauthor of person
• Challenges:
• How to model the heterogeneous objects in a unified approach?
• How to apply the modeling approach to different applications?

Modeling the Academic Network
[Figure: three models, ACT1, ACT2, and ACT3, relating authors, topics, words, and conferences.]

Generative Story of ACT1 Model
Example paper: “Latent Dirichlet Co-clustering” by Shafiei and Milios. Abstract: “We present a generative model for clustering documents and terms. Our model is a four hierarchical bayesian model. We present efficient inference techniques based on Markov Chain Monte Carlo. We report results in document modeling, document and terms clustering …”
[Figure: generative process. The authors Shafiei and Milios are each linked to topics (NLP, IR, DM, ML); each topic has a word distribution P(w|z) and a conference distribution P(c|z); generated words include “clustering” and “inference”, and generated conferences include NIPS and ICDM.]
• One topic: P(w|z): mining 0.23, clustering 0.19, classification 0.17, …; P(c|z): ICDM 0.23, KDD 0.19, …
• Another topic: P(w|z): model 0.23, learning 0.19, boost 0.17, …; P(c|z): ICML 0.23, NIPS 0.19, …

ACT Model 1
Generative process: [Figure: ACT1 model over authors, topic, words, and conference.]
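The generative steps are not legible on this slide. The sketch below shows one plausible reading of ACT1, assuming that for every word position an author is picked uniformly from the paper's authors, a topic is drawn from that author's topic distribution, and then both a word and a conference stamp are drawn from that topic. The topic names "DM" and "ML" and all values in `theta` are hypothetical; the word and conference probabilities reuse the truncated lists shown in the talk's example, which `random.choices` treats as relative weights.

```python
import random


def draw(dist, rng):
    """Sample one key from a dict of (possibly unnormalized) weights."""
    items = list(dist)
    return rng.choices(items, weights=[dist[i] for i in items], k=1)[0]


def generate_paper_act1(authors, n_words, theta, phi, psi, rng):
    """theta[x]: topics of author x; phi[z]: words; psi[z]: conferences."""
    tokens = []
    for _ in range(n_words):
        x = rng.choice(authors)   # author responsible for this word
        z = draw(theta[x], rng)   # topic drawn from that author's topics
        w = draw(phi[z], rng)     # word drawn from the topic
        c = draw(psi[z], rng)     # conference stamp drawn from the same topic
        tokens.append((x, z, w, c))
    return tokens


# Hypothetical author-topic weights; topic names are labels for this example.
theta = {"Shafiei": {"DM": 0.7, "ML": 0.3}, "Milios": {"DM": 0.4, "ML": 0.6}}
phi = {"DM": {"mining": 0.23, "clustering": 0.19, "classification": 0.17},
       "ML": {"model": 0.23, "learning": 0.19, "boost": 0.17}}
psi = {"DM": {"ICDM": 0.23, "KDD": 0.19}, "ML": {"ICML": 0.23, "NIPS": 0.19}}

tokens = generate_paper_act1(["Shafiei", "Milios"], 5, theta, phi, psi,
                             random.Random(0))
for token in tokens:
    print(token)
```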

ACT Model 2
Generative process: [Figure: ACT2 model over authors, words, and conference.]

ACT Model 3
Generative process: [Figure: ACT3 model over authors, words, and conference.]

Applications Association search Expertise search Researcher interests Hot topic on a conference Topic browser

Expertise Search • Calculate the relevance of query q and different objects (i.e., papers, authors, and conferences) • E.g.,
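The example formula is missing from this slide. One way to score a paper d against a query q with an author-topic style model is a query likelihood, P(q|d) = product over query words w of P(w|d), with P(w|d) = sum over authors x and topics z of P(w|z) P(z|x) P(x|d). The sketch below assumes that form with P(x|d) uniform over the paper's authors; the parameter values are hypothetical, and any combination with a plain language model is left out.

```python
def p_word_given_paper(word, paper_authors, theta, phi):
    """P(w|d) = sum_x sum_z P(w|z) * P(z|x) * P(x|d), with uniform P(x|d)."""
    p_author = 1.0 / len(paper_authors)
    return sum(p_author * theta[x][z] * phi.get(z, {}).get(word, 0.0)
               for x in paper_authors for z in theta[x])


def relevance(query_words, paper_authors, theta, phi):
    """Query likelihood: product of P(w|d) over the query words."""
    score = 1.0
    for word in query_words:
        score *= p_word_given_paper(word, paper_authors, theta, phi)
    return score


# Hypothetical parameters; topic names are labels for this example only.
theta = {"Shafiei": {"DM": 0.7, "ML": 0.3}, "Milios": {"DM": 0.4, "ML": 0.6}}
phi = {"DM": {"mining": 0.23, "clustering": 0.19, "classification": 0.17},
       "ML": {"model": 0.23, "learning": 0.19, "boost": 0.17}}

joint = relevance(["clustering"], ["Shafiei", "Milios"], theta, phi)
solo_a = relevance(["clustering"], ["Shafiei"], theta, phi)
solo_b = relevance(["clustering"], ["Milios"], theta, phi)
print(joint, solo_a, solo_b)
```

Authors and conferences could be ranked the same way by swapping in their own distributions; that extension is not shown here.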

Expertise Search Results
• Arnetminer data:
• 14,134 authors
• 10,716 papers
• 1,434 confs/journals
• Evaluation measures: pooled relevance + human judgement
• Baselines:
• Language Model (LM)
• LDA
• Author Topic (AT)

ArnetMiner Today

ArnetMiner Today
• Arnetminer data:
• > 0.5M researcher profiles
• > 2M papers
• > 8M citation relationships
• > 4K conferences
• Visits come from more than 165 countries
• Visits increase continuously by about 20% per month
• Currently, more than 1,500 unique-IP visits per day
• Top 10 countries: 1. USA, 2. China, 3. Germany, 4. India, 5. UK, 6. Canada, 7. Japan, 8. France, 9. Taiwan, 10. Italy

Person Search Basic Info. Research Interests Social Network Publications

Expertise Search
Finding experts, expertise conferences, and expertise papers for “data mining”

Association Search
Finding associations between persons
• High efficiency
• Top-K associations
Usage:
• To find a partner
• To find a person with the same interests

Survey Paper Finding Survey papers

Topic Browser 200 topics have been discovered automatically from the academic network

Acknowledgements • National Science Foundation of China (NSFC) • National 985 Funding • Chinese Young Faculty Research Funding • Minnesota-China Collaboration Project • IBM CRL • Tsinghua-Google Joint Research Project • National Foundation Science Research (973)

Thanks! Q&A & Demo
HP: http://keg.cs.tsinghua.edu.cn/persons/tj/
Online URL: http://arnetminer.org
If you want to know more technical details, please come to our poster session tomorrow night.
