Finding Authoritative People from the Web

Finding Authoritative People from the Web Masanori Harada, Shin-ya Sato, Kazuhiro Kazama {harada,sato,kazama}@ingrid.org NTT Network Innovation Labs.

Contents • Motivation • Why study finding people? • Examples • Approach • Extract personal names on the web • Find relevant people using a search engine • Results • Performance evaluation • Summary and future plans

Background As the web is connected to the real world, we can: • Find real-world things by searching the web. • Understand the real world by investigating the web (and vice versa). the real world ... connections the web searching

Objective Find authoritative people for all sorts of topics by extending a web search engine • Why find people? • Once people have been found, many other things (e.g. books) can be retrieved using digital libraries • What is authoritative? • People mentioned in many web pages with regard to a queried topic

Screenshot Relevant personal names Relationships Relevant web pages

Example (1) subject to people • “digital libraries” (1007 pages) • Possible application: book finder • Using library catalogs, it could suggest relevant books written by these authoritative people • Shigeo Sugimoto Univ. Library Information Science • Koichi Tabata Univ. Library Information Science • Jun Adachi National Institute of Informatics • Takeo Yamamoto National Institute of Informatics • Hiroyuki Taya National Diet Library

Example (2) thing to people • “Spirited Away” (35,936 pages) • Possible application: movie recommender • Using movie databases, it could suggest movies which share key people for any queried topic • Hayao Miyazaki director • Bunta Sugawara voice actor • Mari Natsuki voice actress • Yumi Kimura singer of theme song • Joe Hisaishi composer

Example (3) person to people • “Masanori Harada” (205 pages) • Possible application: social networking • Unlike social networking services like orkut, there is no need to enter relationships manually • Masanori Harada me • Shin-ya Sato co-author of this paper • Kazuhiro Kazama co-author of this paper • Kent Tamura web search researcher • Isao Asai web search researcher

NEXAS Named entity extraction and association search • Associate an entity and a web page by extracting names identifying the entity. • Find entities associated with top-ranked web pages retrieved for a query. A More relevant (authoritative) More relevant (authoritative) ・・・ B web search Less relevant Less relevant C Irrelevant

Extracting personal names • Web data • 52 million Japanese web pages collected in July 2003 • Japanese personal name extractor • Extracts only full names • Assumes a full name can identify a person • Big name dictionaries enables accurate extraction • Precision: 93.5%, Recall 85.3%

Personal names on the web • Personal names appear frequently on the web • 6.6M unique names extracted from 52M web pages • 1/4 of web pages contain full names • Celebrities appear >10 thousand times • singers, actors, sports stars, novelists, politicians, etc. • Most names appear only a few times • Name frequency indicates popularity • But number of pages is easily affected by automatically generated texts and spams • Number of servers is less affected • Authoritative people are mentioned in many servers

Procedure for finding people 1. Find web pages using a full-text search engine 2. List personal names extracted from top T relevant and authoritative web pages 3. Calculate relevance scores and output top k relevant people, who • Appear frequently on top-ranked web pages • Do not appear frequently on irrelevant web pages

Calculating relevance score Scoring functions: df, sf, dfidf,and sfidf df = document frequency within top T pages sf = server frequency within top T pages + 0.01df idf = log (N / fp) N = number of pages in the collection fp= document frequency sf : Alleviates effects of generated texts and spams idf : Weight for moderating generally famous person

Performance evaluation • Compared 4 scoring functions for varying T • Precision: average of scores of top k people • Judged if a person was relevant (Score: 1) or not (Score: 0) by searching for the personal name using Google • 45 simple topics were used • 15 musical instruments players or not? • 15 sports players or not? • 15 information technologies experts or not?

Precision of scoring functions • sf is very effective • idf is fairly effective sfidf sf dfidf df Precision Number of people evaluated

Precision with varying T • More web pages do not necessary yield better results sfidf T =500 T =1000 T =200 T =2000 T =5000 T =100 Precision Number of people evaluated

Future work • Apply for other languages, especially English • Can we distinguish different “John Smith”? • Find books, companies, shops, etc. • By extracting ISBNs, domain names, places, etc. • Analyze co-occurrences as a social network • Online demo is available athttp://valhalla.ingrid.org:28080/throughout the conference • * Japanese fonts and Java2 plug-in are required

Precison vs. result size • Too-specific or too-general topics are difficult. sfidf T=1000 Precision of top 10 people Number of pages retrieved for a topic “databases” “compiler theory”

Popular Japanese names • singers, actors, sports starts, novelists, politicians, etc. Table: Top 10 most frequent names

Related work • ReferralWeb [Kautz 1997] • Finds experts around the user by searching the web • Tested only with computer science topics • Web Question Answering [Kwok 2001][Brill 2001] • Retrieve one exact answer to a long, complete natural language question • Our contributions • Observed the distribution of personal names on the web • Extended a web search engine so that it accurately finds relevant people for all sorts of queries.

Common failures • Too specific topics • Too general topics • Name extraction errors • Falsely extracted non-names • Missed (not extracted) names • Historical/Fictional characters • Celebrities • Popular names often appear without regard to a specific topic

Numbers • Number of web pages 52,302,805 • Number of web servers 664,139 • Number of pages w/ names 13,922,012 • Number of name occurrences 117,091,977 • Number of unique names 6,161,805 • Total size of web pages 450GB • Size of inverted index 113GB • Size of dictionaries • Family names 21,141 • Personal names 12,130 • Full names 19,675

Topics for the experiment • Musical instruments • violin, cello, trumpet, clarinet, harp, percussion, synthesizer, ocarina, accordion, contrabass, pipe organ, and marimba • Sports • soccer, baseball, marathon, swimming, rugby, volleyball, basketball, boxing, badminton, ice hockey, speed skating, fencing, lacrosse, pole vault, and discus throw • Information technology terms • databases, java, information retrieval, XML, IPv6, speech recognition, P2P, data mining, machine translation, complexity theory, web search engines, probabilistic reasoning, simulated annealing, compiler theory, and randomized algorithms

Name extraction method • Procedure 1. Remove HTML tags. 2. Using a morphological analyzer, split each sentence into morphemes and assign Part-Of-Speech tags. 3. Extract <family name><personal name> sequences. • Performance improved by enriching dictionaries 17k families + 12k personals + 2k popular full names Precision 78.4%, Recall 75.0% 21k families + 40k personals + 19k popular full names Precision 93.5%, Recall 85.3%

The same-name problem • Not very serious when we query a subject • Different people having the same name rarely exist in a specific area • Japanese family/personal names are diverse • Still, not a few people share the same name • Solutions (under consideration) • Classifies web pages by topic • Analyze a social network around the name (Different people have different friends)

Popularity and frequency • Personal name frequencies indicate web users' interest in celebrities. • The number of pages is prone to be affected by automatically generated pages and spams. • The number of servers is better to find popular people.

Finding Authoritative People from the Web