ACML 2010 Tutorial: Web People Search: Person Name Disambiguation and Other Problems
Hiroshi Nakagawa: Introduction, Feature Extraction (Phrase Extraction)
Minoru Yoshida: Feature Extraction (Information Extraction Approach)
(University of Tokyo)
Contents • Introduction • Feature Extraction • Feature Weighting / Similarity Calculation • Clustering • Evaluation Issues
Introduction • Motivation • Problem Settings • Differences from other problems • History
Motivation
A study of the query logs of the AllTheWeb and AltaVista search sites gives an idea of the relevance of the people search task: 11-17% of the queries were composed of a person name with additional terms, and 4% were identified simply as person names (Artiles+, 2009 WePS2).
• Web search for person names: over 10% of all queries
• The "same-name" problem in person name search
• When different real-world entities have the same name, the reference from the name to the entity can be ambiguous.
• Many different persons having the same name (e.g., John Smith)
• Persons having the same name as a famous one (e.g., Bill Gates)
• Difficult to access the target person: with ordinary search engines, it is tough to find a Bill Gates who is not the Microsoft founder! The famous person dominates the results.
Problem in People Search
• Query → Search engine → Results
• Which pages are about which persons?
Person Name Clustering
• Query → Search engine → Search result → Clusters of Web pages
• Each page in a cluster refers to the same entity.
Sample System
• Query = Ichiro Suzuki (a famous Japanese baseball player)
• The system displays keywords about the person and documents about the same person.
Output Example (Ichiro Suzuki)
• Clusters found: a painter, a dentist, a lawyer
• The name is used as an example because the baseball player Ichiro is so famous.
Introduction • Motivation • Problem Settings • Differences from other problems • History
Problem Setting
• Given: a set of Web pages returned from a search engine when entering a person name query
• Goal: to cluster the Web pages
• One cluster for one entity
• Possibly with related information (e.g., a biography and/or related words)
Another usage: if a person has many aspects, like scientist and poet, these aspects are grouped together, making it easy to grasp who he/she is.
Example: Sakai Shuichi
Shuichi Sakai is a professor at the University of Tokyo in the field of computer architecture: these pages are about his books on computer architecture. He is a Japanese poet, too: these pages are about his collections of poems.
Example: Famous car maker "TOYOTA"
These pages are about TOYOTA's retailer network. These pages are about TOYOTA HOME, a house builder and one of the TOYOTA group's enterprises.
Introduction • Motivation • Problem Settings • Differences from other problems • History
Difference from Other Tasks
• Cluster documents for the same person
• Difficult to use training data for other person names
• The number of persons (clusters) is unknown to the system, but it is an exact number in the real world
WSD: Word Sense Disambiguation
• "I was strolling along the bank." / "Do you use a bank card there?" / "Did you go to the bank?"
• Which sense of "bank" is meant in each sentence?
Challenges
• Noisy Web data
• Light linguistic tools
• POS taggers, stemmers, NE taggers
• Pattern-based information extraction
• (1) Heavy, sophisticated NLP tools such as HPSG parsers are not suitable for the purpose; (2) the system should work at a tolerable speed, so lightweight tools are needed
• How to use "training data"
• Most systems use an unsupervised clustering approach
• Some systems assume "background knowledge"
• How to determine K (the number of clusters)
• Remember: this K does not depend on the user's intention but is exact and fixed in real use. Different from usual clustering!
Introduction • Motivation • Problem Settings • Differences from other problems • History
History
• Background tasks: (Word Sense Disambiguation), (Coreference Resolution)
• 1998: Cross-document coreference resolution [Bagga+, 98] – naive VSM
• 2003: Disambiguation for Web search results [Mann+, 03] – biographic data
• 2007: Web People Search Workshop (WePS) [Artiles+, 07][Artiles+, 09]
History
• Web People Search Workshop
• 1st: SemEval-2007
• 2nd: WWW-2009 – Document Clustering, Attribute Extraction
• 3rd: CLEF-2010 (Conference on Multilingual and Multimodal Information Access Evaluation), 20-23 September 2010, Padua – Document Clustering & Attribute Extraction, Organization Name Disambiguation
Contents • Introduction • Feature Extraction • Feature Weighting / Similarity Calculation • Clustering • Evaluation Issues
Main Steps • Preprocessing • Feature extraction • Feature weighting / Similarity calculation • Clustering • (Related Information Extraction)
Preprocessing
• Filter out useless pages ("junk pages")
• the name is matched, but the matched string doesn't refer to a person (e.g., a company name); alphabetically ordered name-list pages are also filtered out (Ono+, 08)
• Data cleaning (see the sketch below)
• HTML tag removal
• Sentence (snippet) extraction
• Coreference resolution (used by Bagga+) – in fact, a very difficult NLP task
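As a rough illustration of the data-cleaning step, here is a minimal Python sketch, assuming the beautifulsoup4 package; the window size, the regex-based matching, and the sample page are illustrative choices, not part of any WePS system.

```python
# Minimal sketch of data cleaning: strip HTML tags, then pull out snippets
# (character windows around the person name). Window size is arbitrary here.
import re
from bs4 import BeautifulSoup

def clean_html(html: str) -> str:
    """Remove tags and collapse whitespace."""
    text = BeautifulSoup(html, "html.parser").get_text(separator=" ")
    return re.sub(r"\s+", " ", text).strip()

def extract_snippets(text: str, name: str, window: int = 100) -> list:
    """Return character windows around each occurrence of the name."""
    snippets = []
    for m in re.finditer(re.escape(name), text, flags=re.IGNORECASE):
        start = max(0, m.start() - window)
        snippets.append(text[start:m.end() + window])
    return snippets

page = "<html><body><p>Ichiro Suzuki is a painter based in Kyoto.</p></body></html>"
print(extract_snippets(clean_html(page), "Ichiro Suzuki"))
```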
Junk Page Filtering
• SVM-based classification (Wan+, 05)
• features
• simple lexical features (words related or not related to the person name)
• stylistic features (fonts/tags, e.g., how many and which words are in bold font)
• query-relevant features (words next to the query)
• linguistic features (NE counts, e.g., how many person, organization, and location names appear)
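The slide names the feature groups of (Wan+, 05) but not their implementation. Below is a hedged sketch with scikit-learn, in which each feature group is reduced to a single toy count; the feature names and training data are invented for illustration.

```python
# Hedged sketch of SVM-based junk-page filtering in the spirit of (Wan+, 05).
# Real systems would use much richer lexical, stylistic, query-relevant,
# and NE-count features than these toy counts.
from sklearn.svm import LinearSVC

def page_features(page: dict) -> list:
    return [
        page["num_person_words"],   # lexical: words related to the person name
        page["num_bold_words"],     # stylistic: words in bold font
        page["num_next_to_query"],  # query-relevant: words adjacent to the query
        page["num_person_nes"],     # linguistic: person-NE count
    ]

# Toy training data: 1 = genuine person page, 0 = junk (e.g., company-name match).
pages = [
    {"num_person_words": 12, "num_bold_words": 3, "num_next_to_query": 5, "num_person_nes": 4},
    {"num_person_words": 0,  "num_bold_words": 9, "num_next_to_query": 0, "num_person_nes": 0},
]
labels = [1, 0]

clf = LinearSVC().fit([page_features(p) for p in pages], labels)
print(clf.predict([page_features(pages[0])]))  # -> [1]
```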
Feature Extraction
• How to characterize each name appearance
• The name itself cannot be used for disambiguation!
• Each name appearance can be characterized by its contexts.
• Possible contexts
• Surrounding words, adjacent strings, syntactically related words, etc.
• Which to use?
Basic Approach
• Use all words in documents
• Or snippets (texts around the name)
• Or titles/summaries (first sentence, etc.)
• Use the TFIDF weighting scheme (a minimal sketch follows)
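A minimal sketch of this basic approach, using scikit-learn's TfidfVectorizer to turn toy pages into TFIDF-weighted word vectors; the documents are invented for illustration.

```python
# Represent each page (or snippet) as a TFIDF-weighted bag of words.
from sklearn.feature_extraction.text import TfidfVectorizer

docs = [
    "Ichiro Suzuki exhibited his oil paintings in a Kyoto gallery.",
    "Ichiro Suzuki opened a dental clinic near Osaka station.",
]
vectorizer = TfidfVectorizer(stop_words="english")
vectors = vectorizer.fit_transform(docs)  # one row per page

print(vectors.shape)                            # (2, vocabulary size)
print(vectorizer.get_feature_names_out()[:5])   # first few vocabulary words
```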
Problem
• There exist:
• relatively useful features and relatively useless features (especially for person name disambiguation)
• Useful: NEs, biography, noun phrases, etc.
• Useless: general words, boilerplate, etc.
• How to distinguish useful features from the others
• How to weight each feature
Named Entities
• Documents about Bill Gates, with related organization names and related person names highlighted
Noun Phrases
• Documents about Bill Gates, with related key words highlighted
Other Words
• Documents about Bill Gates: among the remaining words, some are more important than others
Extracting Useful Features
• Thresholding, based on a score related to our purpose (TFIDF etc.)
• Tool-based approach
• POS tagging, NE tagging
• Information Extraction approach (described later by Yoshida)
• Meta-data approach
• Link structures, meta tags
Thresholding
• Calculate TFIDF scores of words
• Discard the words with low TFIDF scores (see the sketch below)
• Unigrams, bigrams, even N-grams can be used (Chen+, 09), where the Google 5-gram corpus (built from 1T words) is used to calculate the TFIDF scores
• Other scores: log-likelihood ratio, mutual information, KL-divergence, etc.
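A sketch of the thresholding step, reusing the toy documents from the earlier TFIDF sketch; the cutoff value is arbitrary, and note that (Chen+, 09) instead derive the statistics from the Google 5-gram corpus.

```python
# Keep only words whose maximum TFIDF score across the collection
# exceeds a cutoff; everything else is discarded as a feature.
from sklearn.feature_extraction.text import TfidfVectorizer

docs = [
    "Ichiro Suzuki exhibited his oil paintings in a Kyoto gallery.",
    "Ichiro Suzuki opened a dental clinic near Osaka station.",
]
vectorizer = TfidfVectorizer(stop_words="english")
tfidf = vectorizer.fit_transform(docs).toarray()

threshold = 0.3                        # illustrative cutoff
keep = tfidf.max(axis=0) > threshold   # boolean mask over the vocabulary
kept_words = vectorizer.get_feature_names_out()[keep]
print(sorted(kept_words))
```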
Tool-Based Approach
• Available tools:
• POS tagging (high-performance POS taggers have been developed for many languages; for Western languages, stemmers are also available)
• NE extraction (sophisticated) vs. bigrams/N-grams (unsophisticated but simple)
• Keyword extraction (a middle ground between NE extraction and bigrams/N-grams)
Part of Speech (POS) Tagging
• Detect the grammatical categories of the words
• Nouns, verbs, prepositions, adverbs, adjectives, …
• Typically nouns are used as features
• Noun phrases can be extracted with some simple rules
• Many available tools (e.g., TreeTagger)
Example: "William Henry "Bill" Gates III (born October 28, 1955) is an American business magnate, philanthropist, …" – the noun, verb, determiner, and adjective tokens are each tagged with their category.
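A small sketch with NLTK (one possible tool; the slide mentions TreeTagger): tag the example sentence and keep the noun tokens. The tokenizer and tagger resources must be downloaded first via nltk.download, and the exact output depends on the tagger.

```python
# POS-based feature extraction: tag the sentence, keep nouns (tags NN*).
# Requires: nltk.download("punkt"); nltk.download("averaged_perceptron_tagger")
import nltk

sentence = ('William Henry "Bill" Gates III (born October 28, 1955) '
            "is an American business magnate, philanthropist")
tokens = nltk.word_tokenize(sentence)
tagged = nltk.pos_tag(tokens)  # list of (word, POS tag) pairs

nouns = [word for word, tag in tagged if tag.startswith("NN")]
print(nouns)  # e.g., ['William', 'Henry', 'Gates', 'III', ...]
```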
Named Entity (NE) Extraction
• Find "proper names" in texts
• e.g., names of persons, organizations, locations, …
• Includes time expressions in many cases
• Many available tools (Stanford NER, OpenNLP, ESpotter, …)
Example: "William Henry "Bill" Gates III (born October 28, 1955) is an American business magnate, philanthropist, …" – the name is tagged PERSON and the birth date DATE.
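A similar sketch with spaCy, one of many available NER tools (the slide lists Stanford NER, OpenNLP, and ESpotter); it assumes the en_core_web_sm model has already been downloaded.

```python
# NE extraction with spaCy. Requires the small English model:
#   python -m spacy download en_core_web_sm
import spacy

nlp = spacy.load("en_core_web_sm")
doc = nlp('William Henry "Bill" Gates III (born October 28, 1955) '
          "is an American business magnate, philanthropist")

for ent in doc.ents:
    # Expect something like: William Henry "Bill" Gates III PERSON / October 28, 1955 DATE
    print(ent.text, ent.label_)
```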
Key Phrase Extraction
• Noun phrases consisting of 2 or more words
• Likely to be topic-related concepts
• Term-extraction tool "Gensen" (Nakagawa+, 05)
• Noun phrases are scored by "term-likelihood"
• Topic-related terms get higher scores
Example: "Gates held the positions of CEO and chief software architect, and remains the largest individual shareholder …" – two key phrases in the sentence receive term-likelihood scores (45.2 and 22.4).
Gensen (言選) Web Score
From a corpus we extract the compound terms: 信息処理 (information processing), 計算機処理能力 (computer processing capacity), 処理段階 (processing step), 信息処理学会 (information processing society).
For the simple noun W = 処理 (processing):
• L(W) = (# of distinct left-adjacent words: 信息 information, 計算機 computer → 2) + 1 = 3
• R(W) = (# of distinct right-adjacent words: 能力 capacity, 段階 step, 学会 society → 3) + 1 = 4
• LR(W) = 3 × 4 = 12
Calculation of LR and FLR
• Compound word: W = w1 … wn, where each wi is a simple noun.
• L(wi) = (# of distinct words appearing on the left side of wi) + 1
• R(wi) = (# of distinct words appearing on the right side of wi) + 1
• The score LR of a compound word W = w1 … wn (like 信息処理学会), normalized by length, is defined as:

  LR(W) = \left( \prod_{i=1}^{n} L(w_i)\, R(w_i) \right)^{\frac{1}{2n}}

• Example: LR(信息処理) = [L(信息) × R(信息) × L(処理) × R(処理)]^{1/4}, i.e., LR(information processing) = [L(info.) × R(info.) × L(proc.) × R(proc.)]^{1/4}
Calculation of LR and FLR (cont.)
• F(W) is the independent frequency of the compound word W, where "independent" means that W is not part of a longer compound word.
• Then FLR(W), the score used to rank term candidates, is defined as:

  FLR(W) = F(W) × LR(W)

• Example: FLR(信息処理) = F(信息処理) × [L(信息) × R(信息) × L(処理) × R(処理)]^{1/4}
• F(W) has an effect similar to TF; thus, the bigger the corpus, the more F(W) contributes to FLR(W).
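A toy implementation of the LR and FLR scores defined above; the "corpus" is a hand-made list of compound-term occurrences (with the Japanese example translated to English) chosen so the counts match the 処理 (processing) example: L = 3, R = 4.

```python
# Toy FLR scorer. Each occurrence is a whole compound (a tuple of simple
# nouns), so counting tuples gives the "independent" frequency F(W):
# "information processing" inside "information processing society" does
# not count, because that occurrence is a different tuple.
from collections import Counter
from math import prod

occurrences = [
    ("information", "processing"),
    ("computer", "processing", "capacity"),
    ("processing", "step"),
    ("information", "processing", "society"),
    ("information", "processing"),  # a second independent occurrence
]

left, right = {}, {}
for occ in occurrences:
    for i, w in enumerate(occ):
        if i > 0:
            left.setdefault(w, set()).add(occ[i - 1])
        if i < len(occ) - 1:
            right.setdefault(w, set()).add(occ[i + 1])

def L(w):  # distinct left-adjacent nouns + 1
    return len(left.get(w, set())) + 1

def R(w):  # distinct right-adjacent nouns + 1
    return len(right.get(w, set())) + 1

def LR(term):  # geometric mean over 2n factors (length-normalized)
    n = len(term)
    return prod(L(w) * R(w) for w in term) ** (1 / (2 * n))

F = Counter(occurrences)  # independent frequency of each compound

def FLR(term):
    return F[term] * LR(term)

print(L("processing"), R("processing"))               # 3 4, as on the slide
print(round(FLR(("information", "processing")), 2))   # 2 * 24**0.25 ~= 4.43
```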
Example of term extraction by Gensen Web: English article: SVM on Wikipedia
Support vector machines (SVMs) are a set of related supervised learning methods that analyze data and recognize patterns, used for classification and regression analysis. The original SVM algorithm was invented by Vladimir Vapnik and the current standard incarnation (soft margin) was proposed by Corinna Cortes and Vladimir Vapnik[1]. The standard SVM is a non-probabilistic binary linear classifier, i.e. it predicts, for each given input, which of two possible classes the input is a member of. Since an SVM is a classifier, then given a set of training examples, each marked as belonging to one of two categories, an SVM training algorithm builds a model that predicts whether a new example falls into one category or the other. Intuitively, an SVM model is a representation of the examples as points in space, mapped so that the examples of the separate categories are divided by a clear gap that is as wide as possible. New examples are then mapped into that same space and predicted to belong to a category based on which side of the gap they fall on. … Another approach is to use an interior point method that uses Newton-like iterations to find a solution of the Karush-Kuhn-Tucker conditions of the primal and dual problems.[10] Instead of solving a sequence of broken down problems, this approach directly solves the problem as a whole. To avoid solving a linear system involving the large kernel matrix, a low rank approximation to the matrix is often used to use the kernel trick.
Extracted terms (term score)
Top 1-17: hyperplane 116.65; margin 109.54; SVM 74.08; vector 56.12; point 52.85; support vector 49.34; training data 48.12; data 47.83; problem 44.27; space 44.09; data point 38.01; classifier 30.59; classification 29.58; optimization problem 26.05; set 25.30; support vector machine 24.66; kernel 21.00
Top 18-38: set of point 20.73; linear classifier 19.99; maximum-margin hyperplane 19.92; example 19.60; one 17.32; Vladimir Vapnik 15.87; parameter 14.70; linear SVM 14.40; training set 14.00; optimization 13.42; model 12.25; training vector 12.04; support vector classification 11.70; two classe 11.57; normal vector 11.38; kernel trick 11.22; maximum margin classifier 11.22
Top 408-426 (last): Vandewalle 1.00; derive 1.00; it 1.00; Leisch 1.00; 2.3 1.00; H1 1.00; c 1.00; Hornik 1.00; mean 1.00; testing 1.00; transformation 1.00; unconstrained 1.00; homogeneous 1.00; need 1.00; learner 1.00; grid-search 1.00; convex 1.00; See 1.00; trade 1.00; …
Contents • Introduction • Feature Extraction • Feature Weighting / Similarity Calculation • Clustering • Evaluation Issues