ArnetMiner – Extraction and Mining of Academic Social Networks. 1 Jie Tang, 1 Jing Zhang, 1 Limin Yao, 1 Juanzi Li, 2 Li Zhang, and 2 Zhong Su 1 Knowledge Engineering Group, Dept. of Computer Science and Technology Tsinghua University 2 IBM, China Research Lab August 25 th 2008.
Motivation • “The information need is not only about publications …” • “Academic search is treated as document search and ignores semantics” • However, we are not satisfied with …
Examples – Expertise search • When starting work on a new research topic, or brainstorming for novel ideas (Researcher A): • Who are the experts in this field? • What are the top conferences in the field? • What are the best papers? • What are the top research labs?
Examples – Citation network analysis • How can Researcher B gain an in-depth understanding of the research field?
Examples – Conference Suggestion • Given a paper's authors and content, to which conference should Researcher C submit it?
Examples – Reviewer Suggestion • Given the conference (e.g., the KDD committee) and each paper's content, who are the best-matching reviewers for each paper?
Academic Network Extraction in ArnetMiner = Researcher Profiling + Name Disambiguation
Motivating Example [Figure: a researcher profile assembled from two sources (1 and 2): contact information, educational history, academic services, and publications, with affiliation (UIUC), position (Professor), and a coauthor (Ruud Bolle) linked to Publication #3 and Publication #5]
Motivating Example • Two key issues: • How to accurately extract the researcher profile information from the Web? • How to integrate the information from different sources?
Researcher Network Extraction • 70.60% of the researchers have at least one homepage or introduction page - 85.6% are from universities, 14.4% from companies - 71.9% are homepages, 28.1% are introduction pages - 40% are in lists and tables, 60% are natural language text • A large number of person names are ambiguous - The 300 most common male names are used by 1 billion+ people (78.74%) in the USA - Even in the author's lab, 3 graduates are named “Yi Li” • 70% moved at least one time
Our Approach – based on Markov Random Fields • Relies on the Markov property • Special cases: - Conditional Random Fields (Researcher Profiling) - Hidden Markov Random Fields (Name Disambiguation)
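For reference, the local Markov property of a Markov random field is usually written as below. This is the standard textbook statement, not a formula taken from the slide, and the notation (label variables Y_i, neighborhood N(i) in the dependency graph) is our own assumption.

```latex
P\bigl(Y_i \mid Y_j,\ j \neq i\bigr) \;=\; P\bigl(Y_i \mid Y_j,\ j \in N(i)\bigr)
```

In words: given its neighbors, a variable is independent of everything else in the graph.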
CRFs - Green nodes are hidden variables - Purple nodes are observations [Figure: linear-chain CRF over the tokens “He is a Professor at UIUC”, with candidate labels ADR, AFF, POS, and OTH for each token]
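To make the tagging picture concrete, here is a minimal sketch of Viterbi decoding over a linear chain, the usual way to pick the best label sequence once a CRF has been trained. The scores below (`toy_emit`, `TOY_TRANS`) are hand-set and purely hypothetical; a real CRF learns them from data, and nothing here is taken from the ArnetMiner implementation.

```python
LABELS = ["POS", "AFF", "ADR", "OTH"]

def viterbi(tokens, emit, trans, labels=LABELS):
    """Return the highest-scoring label sequence for a token chain.

    emit(token, label) gives a per-token score; trans maps
    (previous_label, label) to a transition score (0.0 if absent).
    """
    if not tokens:
        return []
    # best[t][y]: best score of any labeling of tokens[:t+1] ending in y
    best = [{y: emit(tokens[0], y) for y in labels}]
    back = []  # back[t][y]: best previous label when token t+1 gets y
    for tok in tokens[1:]:
        scores, ptrs = {}, {}
        for y in labels:
            prev = max(labels, key=lambda p: best[-1][p] + trans.get((p, y), 0.0))
            scores[y] = best[-1][prev] + trans.get((prev, y), 0.0) + emit(tok, y)
            ptrs[y] = prev
        best.append(scores)
        back.append(ptrs)
    # Follow back-pointers from the best final label
    path = [max(labels, key=lambda y: best[-1][y])]
    for ptrs in reversed(back):
        path.append(ptrs[path[-1]])
    return path[::-1]

def toy_emit(token, label):
    """Hypothetical hand-set observation scores for the slide's sentence."""
    if token == "Professor":
        return 2.0 if label == "POS" else 0.0
    if token == "UIUC":
        return 2.0 if label == "AFF" else 0.0
    return 1.0 if label == "OTH" else 0.0

# Hypothetical transition score: plain text tends to follow plain text
TOY_TRANS = {("OTH", "OTH"): 0.5}

tokens = ["He", "is", "a", "Professor", "at", "UIUC"]
tags = viterbi(tokens, toy_emit, TOY_TRANS)
```

With these toy scores, `tags` labels "Professor" as POS, "UIUC" as AFF, and every other token as OTH.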
Feature Definition
• Content features
• Standard word tokens
  - Word features: whether the current token is a word
  - Morphological features: whether the word is capitalized
• Image tokens
  - Image size: the size of the image
  - Image height/width ratio: the value of height/width; for a person photo it is often larger than 1
  - Image format: JPG or BMP
  - Image color: the number of unique colors used in the image and the number of bits per pixel, i.e. 32, 24, 16, 8, 1
  - Face recognition: whether the current image contains a person's face
  - Image filename: whether the filename contains (partially) the researcher name
  - Image “ALT”: whether the “alt” of the image contains (partially) the researcher name
  - Image positive keywords: “myself”, “biology”
  - Image negative keywords: “ads”, “banner”, “logo”
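A few of the features listed above can be sketched as plain functions. This is an illustrative sketch only: the function names, the dictionary keys, and the substring matching rule are assumptions, and the color and face-recognition features are left out because they need an image library.

```python
POSITIVE_KEYWORDS = ("myself", "biology")       # as listed on the slide
NEGATIVE_KEYWORDS = ("ads", "banner", "logo")   # as listed on the slide

def word_features(token):
    """Content features for a standard word token."""
    return {
        "is_word": token.isalpha(),
        "is_capitalized": token[:1].isupper(),
    }

def image_features(width, height, filename, alt, researcher_name):
    """Content features for an image token (width and height in pixels, > 0)."""
    name_parts = [part.lower() for part in researcher_name.split()]
    fname, alt_text = filename.lower(), alt.lower()
    text = fname + " " + alt_text
    return {
        "size": width * height,
        # Person photos are often taller than wide, so the ratio exceeds 1
        "height_width_ratio": height / width,
        "format": fname.rsplit(".", 1)[-1] if "." in fname else "",
        # "Partially contains": any part of the name appears as a substring
        "filename_has_name": any(p in fname for p in name_parts),
        "alt_has_name": any(p in alt_text for p in name_parts),
        "has_positive_keyword": any(k in text for k in POSITIVE_KEYWORDS),
        "has_negative_keyword": any(k in text for k in NEGATIVE_KEYWORDS),
    }

photo = image_features(120, 160, "jie_tang.jpg", "Photo of Jie Tang", "Jie Tang")
logo = image_features(300, 60, "logo.png", "", "Jie Tang")
```

Here `photo` has a height/width ratio above 1 and matches the researcher name in both filename and alt text, while `logo` triggers only the negative-keyword feature.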
Profiling Experiments • Dataset • 1,000 researchers from ArnetMiner.org • Baseline • Amilcare • Support Vector Machines • Unified_NT (CRFs without transition features) • Evaluation measures • Precision, Recall, F1
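The three evaluation measures can be computed as follows. This is a minimal sketch assuming that predictions and gold annotations are represented as sets of (token, label) pairs; that representation and the sample data are illustrative, not taken from the experiments.

```python
def precision_recall_f1(predicted, gold):
    """Precision, recall, and F1 between two sets of extracted items."""
    true_positives = len(predicted & gold)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1

# Hypothetical example: one spurious AFF prediction
predicted = {("Professor", "POS"), ("UIUC", "AFF"), ("at", "AFF")}
gold = {("Professor", "POS"), ("UIUC", "AFF")}
p, r, f1 = precision_recall_f1(predicted, gold)
```

In this example two of three predictions are correct and both gold items are found, so precision is 2/3, recall is 1, and F1 is 0.8.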
Name Disambiguation • Proposal of a semi-supervised framework
Our Method for Name Disambiguation • A hidden Markov Random Field model • Observable variables X represent publications • Hidden variables Y represent the labels of publications • Paper relationships define the dependencies among the hidden variables
Objective Function • Maximize: [equation not recovered] • Minimize: [equation not recovered]
Parameterized Distance Function • The distance function follows Basu (04) and is parameterized by a matrix A • A effectively maps each vector xi into a new space, i.e. A^(1/2) xi • To simplify the problem, we define A as a diagonal matrix
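One distance of this kind from the semi-supervised clustering literature is the parameterized cosine distance, D(xi, xj) = 1 - (xi^T A xj) / (||xi||_A ||xj||_A) with ||x||_A = sqrt(x^T A x). We assume, without confirmation from the slide, that this is the intended form; it is at least consistent with the remark that A maps xi to A^(1/2) xi. The sketch below implements it for a diagonal A stored as a list of weights; the function name is hypothetical.

```python
import math

def param_cosine_distance(xi, xj, a_diag):
    """1 - cosine similarity of xi and xj after scaling by a diagonal matrix A.

    a_diag holds the non-negative diagonal entries of A, one per dimension.
    """
    dot = sum(a * u * v for a, u, v in zip(a_diag, xi, xj))
    norm_i = math.sqrt(sum(a * u * u for a, u in zip(a_diag, xi)))
    norm_j = math.sqrt(sum(a * v * v for a, v in zip(a_diag, xj)))
    if norm_i == 0 or norm_j == 0:
        return 1.0  # convention for this sketch: a zero vector is maximally distant
    return 1.0 - dot / (norm_i * norm_j)

same = param_cosine_distance([1, 0], [1, 0], [1, 1])
orthogonal = param_cosine_distance([1, 0], [0, 1], [1, 1])
# Setting a weight to zero makes the second dimension irrelevant
reweighted = param_cosine_distance([1, 1], [1, 0], [1, 0])
```

Because A is diagonal, each entry acts as a weight on one feature dimension, which is what makes learning A equivalent to learning feature importance.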
EM Framework • Initialization: use constraints to generate the initial k clusters • E-Step • M-Step: - Update cluster centroids - Update parameter matrix A
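The loop above can be sketched in simplified form. Several parts are assumptions rather than the authors' method: initial clusters are taken as groups connected by must-link constraints, the E-step assigns each publication to its nearest centroid and ignores the pairwise constraint penalties a full HMRF would include, and the update of A is omitted, so `a_diag` stays fixed. All names are hypothetical.

```python
import math

def distance(xi, xj, a_diag):
    """Parameterized cosine distance with a diagonal A (assumed form)."""
    dot = sum(a * u * v for a, u, v in zip(a_diag, xi, xj))
    ni = math.sqrt(sum(a * u * u for a, u in zip(a_diag, xi)))
    nj = math.sqrt(sum(a * v * v for a, v in zip(a_diag, xj)))
    return 1.0 if ni == 0 or nj == 0 else 1.0 - dot / (ni * nj)

def initial_clusters(n, must_links):
    """Group point indices connected by must-link pairs (union-find)."""
    parent = list(range(n))
    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]  # path halving
            i = parent[i]
        return i
    for i, j in must_links:
        parent[find(i)] = find(j)
    groups = {}
    for i in range(n):
        groups.setdefault(find(i), []).append(i)
    return sorted(groups.values(), key=len, reverse=True)

def centroid(points):
    """Component-wise mean of a non-empty list of vectors."""
    return [sum(p[d] for p in points) / len(points) for d in range(len(points[0]))]

def em_cluster(xs, must_links, k, a_diag, iters=10):
    """Return a cluster label per point; may use fewer than k clusters."""
    # Initialization: the k largest constraint-connected groups seed the centroids
    seeds = initial_clusters(len(xs), must_links)[:k]
    centroids = [centroid([xs[i] for i in group]) for group in seeds]
    labels = [0] * len(xs)
    for _ in range(iters):
        # E-step (simplified): nearest centroid under the current distance
        labels = [
            min(range(len(centroids)), key=lambda c: distance(x, centroids[c], a_diag))
            for x in xs
        ]
        # M-step: update centroids; the update of matrix A is omitted here
        for c in range(len(centroids)):
            members = [xs[i] for i, lab in enumerate(labels) if lab == c]
            if members:
                centroids[c] = centroid(members)
    return labels

# Hypothetical 2-D feature vectors for four publications
xs = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9]]
labels = em_cluster(xs, must_links=[(0, 1), (2, 3)], k=2, a_diag=[1.0, 1.0])
```

On this toy input the first two publications end up in one cluster and the last two in another.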
Disambiguation Experiments • Data set:
Distribution Analysis (1) All methods achieve good performance (2) Our method achieves good performance (3) Our method obtains acceptable results but still needs further improvement
The Academic Network • Heterogeneous objects: - Paper, Person, Conf./Journal • Relationships: - Conf./Journal publishes paper - Paper cites paper - Person writes paper - Person is PC member of Conf./Journal - Person is coauthor of person • Challenges: - How to model the heterogeneous objects in a unified approach? - How to apply the modeling approach to different applications?
Modeling the Academic Network [Figure: plate diagrams of three models, ACT1, ACT2, and ACT3, relating words, authors, topics, and conferences]
Generative Story of ACT1 Model • Example paper: “Latent Dirichlet Co-clustering”, Shafiei and Milios: “We present a generative model for clustering documents and terms. Our model is a four-level hierarchical Bayesian model. We present efficient inference techniques based on Markov Chain Monte Carlo. We report results in document modeling, document and terms clustering …” • Each author (Shafiei, Milios) has a distribution over topics such as NLP, IR, ML, and DM • Each topic has a conference distribution P(c|z) and a word distribution P(w|z), e.g. - P(c|z): ICDM 0.23, KDD 0.19, … with P(w|z): mining 0.23, clustering 0.19, classification 0.17, … - P(c|z): ICML 0.23, NIPS 0.19, … with P(w|z): model 0.23, learning 0.19, boost 0.17, …
ACT Model 1 • Generative process: [Figure: ACT1 plate diagram over authors, topics, words, and conferences]
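The slide's own description of the generative process is not available here, so the following is a sketch of how we understand ACT1 from the surrounding slides: for each word position, an author is picked from the paper's authors, a topic is drawn from that author's topic distribution, and then a word and a conference are drawn from that topic's P(w|z) and P(c|z). The per-author topic mixtures in `THETA` are invented for illustration; `PHI` and `PSI` reuse the truncated top entries shown on the previous slide, so they act as unnormalized weights.

```python
import random

def draw(dist, rng):
    """Sample one key from a {outcome: weight} dictionary."""
    outcomes = list(dist)
    return rng.choices(outcomes, weights=[dist[o] for o in outcomes], k=1)[0]

def generate_paper(authors, theta, phi, psi, n_words, seed=0):
    """Generate (author, topic, word, conference) tuples for one paper."""
    rng = random.Random(seed)
    tokens = []
    for _ in range(n_words):
        x = rng.choice(authors)   # author responsible for this word (uniform)
        z = draw(theta[x], rng)   # topic from that author's topic mixture
        w = draw(phi[z], rng)     # word from P(w|z)
        c = draw(psi[z], rng)     # conference from P(c|z)
        tokens.append((x, z, w, c))
    return tokens

# Hypothetical author-topic mixtures (not from the slides)
THETA = {
    "Shafiei": {"DM": 0.7, "ML": 0.3},
    "Milios": {"DM": 0.4, "ML": 0.6},
}
# Truncated top entries from the slide, used as sampling weights
PHI = {
    "DM": {"mining": 0.23, "clustering": 0.19, "classification": 0.17},
    "ML": {"model": 0.23, "learning": 0.19, "boost": 0.17},
}
PSI = {
    "DM": {"ICDM": 0.23, "KDD": 0.19},
    "ML": {"ICML": 0.23, "NIPS": 0.19},
}

tokens = generate_paper(["Shafiei", "Milios"], THETA, PHI, PSI, n_words=5)
```

Inference runs this story in reverse: given observed words, authors, and conferences, it estimates the three sets of distributions.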
ACT Model 2 • Generative process: [Figure: ACT2 plate diagram over authors, words, and conferences]
ACT Model 3 • Generative process: [Figure: ACT3 plate diagram over authors, words, and conferences]
Applications • Association search • Expertise search • Researcher interests • Hot topics of a conference • Topic browser
Expertise Search • Calculate the relevance of query q and different objects (i.e., papers, authors, and conferences) • E.g.,
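As one illustration of such a relevance score, a topic model can score an author a against a query q as P(q|a) = prod over w in q of sum over z of P(w|z) P(z|a). This is a common query-likelihood form and an assumption on our part, not the formula from the slide; the function names and toy distributions are hypothetical, and no smoothing is applied, so an unseen query word gives a score of zero.

```python
def topic_relevance(query_words, author, theta, phi):
    """Query likelihood of an author under a topic model (no smoothing)."""
    score = 1.0
    for w in query_words:
        # P(w|a) = sum over topics z of P(w|z) * P(z|a)
        score *= sum(phi[z].get(w, 0.0) * p_z for z, p_z in theta[author].items())
    return score

def rank_authors(query_words, theta, phi):
    """Authors sorted from most to least relevant to the query."""
    return sorted(
        theta,
        key=lambda a: topic_relevance(query_words, a, theta, phi),
        reverse=True,
    )

# Hypothetical toy distributions
theta = {"A": {"DM": 0.8, "ML": 0.2}, "B": {"DM": 0.1, "ML": 0.9}}
phi = {
    "DM": {"mining": 0.23, "clustering": 0.19},
    "ML": {"model": 0.23, "learning": 0.19},
}
ranking = rank_authors(["mining"], theta, phi)
```

For the query "mining", author A (mostly DM) ranks ahead of author B (mostly ML). The same pattern extends to papers and conferences by swapping in the corresponding distributions.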
Expertise Search Results • ArnetMiner data: 14,134 authors, 10,716 papers, 1,434 conferences/journals • Evaluation measures: pooled relevance + human judgement • Baselines: - Language Model (LM) - LDA - Author Topic (AT)
ArnetMiner Today • ArnetMiner data: > 0.5M researcher profiles, > 2M papers, > 8M citation relationships, > 4K conferences • Visits come from more than 165 countries • Visits have been increasing continuously, by about 20% per month • Currently more than 1,500 unique-IP visits per day • Top 10 countries: 1. USA 2. China 3. Germany 4. India 5. UK 6. Canada 7. Japan 8. France 9. Taiwan 10. Italy
Person Search • Basic info • Research interests • Social network • Publications
Expertise Search • Finding experts, expertise conferences, and expertise papers for “data mining”
Association Search • Finding associations between persons - High efficiency - Top-K associations • Usage: - To find a partner - To find a person with the same interests
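A simple way to picture an association between two persons is a shortest chain of coauthors linking them. The sketch below finds one such chain with breadth-first search. It is a simplification under our own assumptions: it returns a single shortest path rather than the top-K associations, and the graph and names are hypothetical.

```python
from collections import deque

def shortest_association(graph, source, target):
    """Return one shortest coauthor chain from source to target, or None."""
    if source == target:
        return [source]
    parent = {source: None}  # also serves as the visited set
    queue = deque([source])
    while queue:
        person = queue.popleft()
        for coauthor in graph.get(person, ()):
            if coauthor in parent:
                continue
            parent[coauthor] = person
            if coauthor == target:
                # Walk the parent links back to the source
                path = [coauthor]
                while parent[path[-1]] is not None:
                    path.append(parent[path[-1]])
                return path[::-1]
            queue.append(coauthor)
    return None

# Hypothetical coauthor graph as an adjacency list
coauthors = {
    "A": ["B"],
    "B": ["A", "C"],
    "C": ["B", "D"],
    "D": ["C"],
    "E": [],
}
chain = shortest_association(coauthors, "A", "D")
```

Here `chain` is the path A, B, C, D, and a query for an unconnected person such as E returns `None`.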
Survey Paper • Finding survey papers
Topic Browser 200 topics have been discovered automatically from the academic network
Acknowledgements • National Science Foundation of China (NSFC) • National 985 Funding • Chinese Young Faculty Research Funding • Minnesota-China Collaboration Project • IBM CRL • Tsinghua-Google Joint Research Project • National Foundation Science Research (973)
Thanks! Q&A & Demo • HP: http://keg.cs.tsinghua.edu.cn/persons/tj/ • Online URL: http://arnetminer.org • If you want to know more technical details, please come to our poster session tomorrow night.