Named Entity Disambiguation: A Hybrid Statistical and Rule-based Incremental Approach
Hien Nguyen* (Ton Duc Thang University, Vietnam)
Tru Cao (Ho Chi Minh City University of Technology, Vietnam)
Semantic Web Group (VN-KIM), Faculty of Computer Science & Engineering, Ho Chi Minh City University of Technology
*Email: hien@tut.edu.vn
Outline • Introduction • Wikipedia • Algorithm • Experimental results • Concluding remarks
Introduction: Named Entities
• Named entities (NEs) include people, organizations, locations, dates, times, monetary amounts, measures, percentages, etc.
• Example: “Ms. Washington's candidacy is being championed by several powerful lawmakers including her boss, Chairman John Dingell (D., Mich.) of the House Energy and Commerce Committee.”
Introduction: Problem
• Different NEs may have the same name.
• “John McCarthy has been a staple of the Ultimate Fighting Championship since its second event on March 11, 1994.” John McCarthy → John McCarthy (referee)
• “John McCarthy, professor of computer science at Stanford University, who developed LISP.” John McCarthy → John McCarthy (computer scientist)
• “John McCarthy, Britain's longest-held hostage in Lebanon, has been set free after more than five years in captivity.” John McCarthy → John McCarthy (journalist)
Introduction: Motivation
• Web searches
  • Queries about named entities (NEs) constitute a significant portion of popular web queries (Bunescu et al., EACL 2006).
  • About 30% of search engine queries include person names (R. Guha et al., WWW 2004).
  • Named entity disambiguation can improve the effectiveness of web search results for popular named entities.
• Web-based information extraction
  • Identifying NEs precisely in web pages can improve accuracy in IE tasks (e.g. extracting relationships between NEs).
• Question answering
  • Identifying NEs precisely in questions can improve the accuracy of answers.
Introduction: NE Disambiguation
• Mapping entity names in a text to the actual entities in a KB of discourse (e.g. Wikipedia).
• An ambiguous entity name may not occur in the KB at all.
• An ambiguous entity name may occur in the KB but refer to a named entity outside the KB.
• An ambiguous entity name may refer to two or more named entities in the KB.
Introduction: NE disambiguation
• “But much like the first presidential debate held two weeks ago in Oxford, Mississippi, a draw for Obama would be considered a win.”
Introduction: NE disambiguation
• “Gamsakhurdia is seen as a national hero by those who mourn him”
• “Zviad Gamsakhurdia, Georgia's first president after independence from the USSR, has been buried in the capital Tbilisi 14 years after his death.”
NE disambiguation
• “John McCarthy, 'great man' of computer science, wins major award”
Introduction: Approach
• Disambiguation based on context
  • Co-occurring entity names
  • Co-occurring NE identifiers
  • Tokens in a context window centered at the name under consideration
• Disambiguation based on a KB
  • We view instances in the KB as carrying two kinds of information: attributes and relations
  • We represent those instances by their attributes and relations
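A minimal sketch of the "tokens in a context window" feature, assuming a simple word tokenizer and a symmetric window; the function name `window_tokens` and the default window size are illustrative and not taken from the slides.

```python
import re

def window_tokens(text, name, size=5):
    """Return up to `size` tokens on each side of the first occurrence of `name`."""
    tokens = re.findall(r"\w+", text.lower())
    name_tokens = re.findall(r"\w+", name.lower())
    n = len(name_tokens)
    if n == 0:
        return []
    for i in range(len(tokens) - n + 1):
        if tokens[i:i + n] == name_tokens:
            left = tokens[max(0, i - size):i]      # tokens before the name
            right = tokens[i + n:i + n + size]     # tokens after the name
            return left + right
    return []  # the name does not occur in the text

sentence = ("John McCarthy, professor of computer science at Stanford "
            "University, who developed LISP.")
context = window_tokens(sentence, "John McCarthy", size=3)
```

The name itself is left out of the window so that only the surrounding words contribute to the context.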
Introduction: Approach
• Text containing ambiguous names
  • All keywords in the text window centred on the ambiguous name
  • The whole text, extended with the page titles of the previously identified NEs
• Wikipedia article
  • Entity page titles
  • Redirecting page titles
  • Category labels
  • Hyperlink labels
• The two sides are matched using heuristics + TF-IDF vector similarity
Wikipedia
• Wikipedia is a free encyclopedia written collaboratively by a global community of more than 150,000 volunteers
• These volunteers have contributed more than 11 million articles in 265 languages
• More than 275 million people visit the Wikipedia site every month
• 2,697,848 articles in the English version (accessed Jan 14th, 2009)
Wikipedia – Pages & Titles [figure: page titles (IDs)]
Wikipedia – Pages & Titles [figure: disambiguation text]
Wikipedia – Category [figure: category]
Wikipedia – Redirect pages [figure: redirect page titles]
Wikipedia – Hyperlinks [figure: hyperlinks]
Algorithm
• Hybrid statistical and rule-based incremental algorithm:
  • Rule-based NE disambiguation
  • Utilizing Wikipedia disambiguation texts. E.g. in “… Rockville, Maryland …”, the disambiguation text “Maryland” helps identify that Rockville is an area in Maryland.
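The disambiguation-text rule can be sketched roughly as follows. This is an assumption-laden illustration, not the authors' implementation: it assumes candidate titles follow the "Name, Qualifier" or "Name (qualifier)" patterns, and that a candidate is accepted only when exactly one qualifier occurs in the text. The helper names `split_title` and `disambiguate_by_qualifier` and the candidate list are hypothetical.

```python
import re

def split_title(title):
    """Split a Wikipedia-style title into (name, disambiguation text)."""
    m = re.fullmatch(r"(.+?)\s*\((.+)\)", title)    # e.g. "John McCarthy (referee)"
    if m:
        return m.group(1), m.group(2)
    if ", " in title:                               # e.g. "Rockville, Maryland"
        name, qualifier = title.split(", ", 1)
        return name, qualifier
    return title, ""

def disambiguate_by_qualifier(text, name, candidate_titles):
    """Return the only candidate whose disambiguation text occurs in `text`, else None."""
    lowered = text.lower()
    hits = []
    for title in candidate_titles:
        base, qualifier = split_title(title)
        if base.lower() == name.lower() and qualifier and qualifier.lower() in lowered:
            hits.append(title)
    return hits[0] if len(hits) == 1 else None      # stay undecided on 0 or 2+ hits

# Illustrative candidate list (assumed, not from the slides).
candidates = ["Rockville, Maryland", "Rockville, Connecticut", "Rockville, Indiana"]
choice = disambiguate_by_qualifier(
    "The office is in Rockville, Maryland, near Washington.", "Rockville", candidates)
```

The plain substring test is deliberately naive; a stricter variant would require the qualifier to appear as whole words or within a short distance of the name.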
Algorithm
• Rule-based NE disambiguation (cont.)
  • Exploiting coreference relationships between referents: propagation of the identified NE, if any, along its coreference chain. E.g. “On Thursday morning, Sen. Barack Obama warned supporters not to get "cocky," while a few hours later McCain pledged to Pennsylvania voters he would erase Obama's lead by Election Day.”
  • Extension of the whole text with the Wikipedia entity page titles of the identified NEs
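A rough sketch of the two steps on this slide, under simplifying assumptions: coreference chains are given as lists of mention strings (a real system would track mention positions), and an identified entity is propagated only when a chain has a single resolved entity. The function names and the data layout are hypothetical.

```python
def propagate_along_chains(chains, resolved):
    """Copy an identified entity to every mention in its coreference chain.

    chains:   list of coreference chains, each a list of mention strings
    resolved: dict mapping already-identified mentions to entity page titles
    """
    result = dict(resolved)
    for chain in chains:
        entities = {resolved[m] for m in chain if m in resolved}
        if len(entities) == 1:                  # propagate only if the chain is unambiguous
            entity = next(iter(entities))
            for mention in chain:
                result.setdefault(mention, entity)
    return result

def extend_text(text, resolved):
    """Append the page titles of the identified entities to the text."""
    titles = sorted(set(resolved.values()))
    if not titles:
        return text
    return text + " " + " ".join(titles)

chains = [["Sen. Barack Obama", "Obama"], ["McCain"]]
resolved = propagate_along_chains(chains, {"Sen. Barack Obama": "Barack Obama"})
extended = extend_text("Obama's lead by Election Day.", resolved)
```

Here "Obama" inherits the entity identified for "Sen. Barack Obama", while "McCain" stays unresolved for the later statistical stage.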
Algorithm
• After the rule-based stage, the remaining ambiguous names are resolved by matching the whole-text vector against the candidate Wikipedia entity pages
• The extracted context surrounding ambiguous names
  • All keywords in the text window centred on the ambiguous name
  • The whole text, extended with the page titles of the previously identified NEs
• Wikipedia article
  • Entity page titles
  • Redirecting page titles
  • Category labels
  • Hyperlink labels
• The two sides are compared by TF-IDF vector similarity
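The statistical stage can be illustrated with a small, dependency-free TF-IDF and cosine-similarity ranking. This is only a sketch: the slides do not specify the weighting variant, so a smoothed IDF computed over the context plus the candidates is assumed here, and the candidate feature texts are made-up stand-ins for titles, redirects, categories, and hyperlink labels.

```python
import math
import re
from collections import Counter

def tokenize(text):
    return re.findall(r"\w+", text.lower())

def tfidf_vectors(docs):
    """Turn token lists into sparse TF-IDF dicts (smoothed IDF, an assumed variant)."""
    n = len(docs)
    df = Counter(term for doc in docs for term in set(doc))
    vectors = []
    for doc in docs:
        tf = Counter(doc)
        vectors.append({t: c * (math.log((1 + n) / (1 + df[t])) + 1.0)
                        for t, c in tf.items()})
    return vectors

def cosine(u, v):
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    norm = (math.sqrt(sum(w * w for w in u.values()))
            * math.sqrt(sum(w * w for w in v.values())))
    return dot / norm if norm else 0.0

def rank_candidates(context, candidates):
    """candidates: dict title -> feature text. Returns titles, most similar first."""
    titles = list(candidates)
    docs = [tokenize(context)] + [tokenize(candidates[t]) for t in titles]
    vecs = tfidf_vectors(docs)
    scores = {t: cosine(vecs[0], v) for t, v in zip(titles, vecs[1:])}
    return sorted(titles, key=scores.get, reverse=True)

# Hypothetical feature texts for two candidate pages.
candidates = {
    "John McCarthy (referee)":
        "John McCarthy referee Ultimate Fighting Championship mixed martial arts",
    "John McCarthy (computer scientist)":
        "John McCarthy computer scientist Stanford University LISP artificial intelligence",
}
ranking = rank_candidates(
    "professor of computer science at Stanford who developed LISP", candidates)
```

In a fuller pipeline the context string would be the window keywords plus the text extended with previously identified page titles, so each resolved name adds evidence for the names that remain.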
Experimental results • Experiments: 10 news articles from CNN in the Travel, Entertainment, World, World Business, and Americas sections
Experimental results • D1 is obtained after running GATE • D2 is obtained after GATE’s errors are corrected
Experimental results • We measure accuracy as the number of correct assignments of NEs in the text to Wikipedia NEs, divided by the total number of assignments
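The accuracy measure above amounts to a simple ratio. A minimal sketch follows, assuming that both the system output and the reference annotation are dictionaries from mention identifiers to Wikipedia page titles; the mention identifiers and assignments below are invented for illustration and are not the reported results.

```python
def accuracy(predicted, gold):
    """Fraction of predicted mention -> Wikipedia title assignments that are correct."""
    if not predicted:
        return 0.0
    correct = sum(1 for mention, entity in predicted.items()
                  if gold.get(mention) == entity)
    return correct / len(predicted)

# Invented example data (not from the experiments).
gold = {"m1": "Barack Obama", "m2": "John McCain",
        "m3": "Oxford, Mississippi", "m4": "Tbilisi"}
predicted = {"m1": "Barack Obama", "m2": "John McCain",
             "m3": "Oxford", "m4": "Tbilisi"}
score = accuracy(predicted, gold)
```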
Experimental results • Results:
Concluding remarks
• The proposed method is a hybrid, incremental process that uses previously identified NEs and related terms co-occurring with ambiguous names in a text for entity disambiguation
• Work under investigation:
  • Disambiguating cases where ambiguous names occur in a KB but refer to named entities outside the KB.
Thanks for your attention
VN-KIM Group: http://www.cse.hcmut.edu.vn/vn-kim/
Contact author: hien@tut.edu.vn or nthien97@yahoo.com