ACLCLP-IR2010 Workshop. Web-Scale Knowledge Discovery and Population from Unstructured Data. Heng Ji, Computer Science Department, Queens College and the Graduate Center, City University of New York, hengji@cs.qc.cuny.edu. December 3, 2010.
Outline
• Motivation of Knowledge Base Population (KBP)
• KBP2010 Task Overview
• Data Annotation and Analysis
• Evaluation Metrics
• A Glance at Evaluation Results
• CUNY-BLENDER team @ KBP2010
• Discussions and Lessons
• Preview of KBP2011
• Cross-lingual (Chinese-English) KBP
• Temporal KBP
Limitations of Traditional IE/QA Tracks
• Traditional Information Extraction (IE) Evaluations (e.g. the Message Understanding Conference / Automatic Content Extraction programs)
  • Most IE systems operate on one document at a time; MUC-style Event Extraction hit the 60% ‘performance ceiling’
  • Look back at the initial goal of IE
    • Create a database of relations and events from the entire corpus
    • Within-doc/within-sentence IE was an artificial constraint to simplify the task and evaluation
• Traditional Question Answering (QA) Evaluations
  • Limited efforts on disambiguating entities in queries
  • Limited use of relation/event extraction in answer search
The Goal of KBP
• Hosted by the U.S. NIST since 2009 and supported by the DOD; coordinated by Heng Ji and Ralph Grishman in 2010; 55 teams registered and 23 teams participated
• Our Goal
  • Bridge the IE and QA communities
  • Promote research in discovering facts about entities and expanding a knowledge source
• What’s New & Valuable
  • Extraction at large scale (> 1 million documents)
  • Using a representative collection (not selected for relevance)
  • Cross-document entity resolution (extending the limited effort in ACE)
  • Linking the facts in text to a knowledge base
  • Distant (and noisy) supervision through Infoboxes
  • Rapid adaptation to new relations
  • Support for multi-lingual information fusion (KBP2011)
  • Capturing temporal information (KBP2011)
• All of these raise interesting and important research issues
KBP Setup
• Knowledge Base (KB): attributes (a.k.a. “slots”) derived from Wikipedia infoboxes are used to create the reference KB
• Source Collection: a large corpus of newswire and web documents (>1.3 million docs) is provided for systems to discover information to expand and populate the KB
Entity Linking: Create Wiki Entry? NIL Query = “James Parsons”
Entity Linking Task Definition
• Involves three entity types: Person, Geo-political, Organization
• Regular entity linking: names must be aligned to entities in the KB; systems can use Wikipedia texts
• Optional entity linking: without using Wikipedia texts; systems can use Infobox values
• Query example:
<query id="EL000304">
  <name>Jim Parsons</name>
  <docid>eng-NG-31-100578-11879229</docid>
</query>
Slot Filling: Create Wiki Infoboxes?
<query id="SF114">
  <name>Jim Parsons</name>
  <docid>eng-WL-11-174592-12943233</docid>
  <enttype>PER</enttype>
  <nodeid>E0300113</nodeid>
  <ignore>per:date_of_birth per:age per:country_of_birth per:city_of_birth</ignore>
</query>
School Attended: University of Houston
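The query format shown above is plain XML, so it can be read with a standard parser. The following is a minimal sketch, assuming each query arrives as a standalone `<query>` element; the function name `parse_query` and the returned dictionary layout are illustrative choices, not part of the KBP specification.

```python
import xml.etree.ElementTree as ET

# The slot filling query from the slide, reproduced as a string.
QUERY_XML = """<query id="SF114">
  <name>Jim Parsons</name>
  <docid>eng-WL-11-174592-12943233</docid>
  <enttype>PER</enttype>
  <nodeid>E0300113</nodeid>
  <ignore>per:date_of_birth per:age per:country_of_birth per:city_of_birth</ignore>
</query>"""


def parse_query(xml_text):
    """Parse one <query> element into a dict (hypothetical helper).

    Fields missing from the XML (e.g. <enttype> in entity linking
    queries) come back as None; <ignore> becomes a list of slot names.
    """
    root = ET.fromstring(xml_text)
    return {
        "id": root.get("id"),
        "name": root.findtext("name"),
        "docid": root.findtext("docid"),
        "enttype": root.findtext("enttype"),
        "nodeid": root.findtext("nodeid"),
        "ignore": (root.findtext("ignore") or "").split(),
    }


query = parse_query(QUERY_XML)
```

The same helper accepts the shorter entity linking query, in which case `enttype`, `nodeid` and `ignore` are simply empty.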
Data Annotation Overview
• Source collection: about 1.3 million newswire docs and 500K web docs, plus a few transcribed-speech docs
Entity Linking Inter-Annotator Agreement [Figure: agreement among Annotator 1, Annotator 2 and Annotator 3]
Slot Filling Human Annotation Performance
Evaluation assessment of LDC Hand Annotation
• Why is the precision only 70%?
• 32 responses were judged as inexact and 200 as wrong answers
• A third annotator’s assessment of 20 answers marked as wrong: 65% incorrect; 15% correct; 20% uncertain
• Some annotated answers are not explicitly stated in the document
• … some require a little world knowledge and reasoning
• Ambiguities and underspecification in the annotation guideline
• Confusion about acceptable answers
• Updates to KBP2010 annotation guideline for assessment
Slot Filling Annotation Bottleneck
• The overlap rates between two participant annotators in the community are generally lower than 30%
• Does adding more human annotators help? No
Can Amazon Mechanical Turk Help?
• Given a query q, an answer a and a supporting context sentence, the Turk annotator should judge whether the answer is Y: correct; N: incorrect; U: unsure
• Result distribution for 1690 instances
Why is Annotation so hard for Non-Experts?
• Even for all-agreed cases, some annotations are incorrect…
• Require quality control
• Training difficulties
Evaluation Metrics
Entity Linking Scoring Metric
• Micro-averaged Accuracy (official metric): mean accuracy across all queries
• Macro-averaged Accuracy: mean accuracy across all KB entries
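The difference between the two averages can be sketched in a few lines. This is only an illustration under stated assumptions: each result is a `(query_id, gold_entry, predicted_entry)` triple, the IDs are hypothetical, and NIL is treated as one more entry when grouping for the macro average (the official scorer may handle NIL differently).

```python
from collections import defaultdict


def micro_accuracy(results):
    """Mean accuracy across all queries."""
    hits = sum(1 for _, gold, pred in results if gold == pred)
    return hits / len(results)


def macro_accuracy(results):
    """Mean of per-entry accuracies, grouping queries by gold KB entry."""
    per_entry = defaultdict(list)
    for _, gold, pred in results:
        per_entry[gold].append(gold == pred)
    entry_scores = [sum(hits) / len(hits) for hits in per_entry.values()]
    return sum(entry_scores) / len(entry_scores)


# Hypothetical results: three queries about entry E1, one about E2.
results = [
    ("q1", "E1", "E1"),
    ("q2", "E1", "E1"),
    ("q3", "E1", "NIL"),
    ("q4", "E2", "E2"),
]
micro = micro_accuracy(results)   # 3 of 4 queries correct
macro = macro_accuracy(results)   # average of 2/3 (E1) and 1.0 (E2)
```

Micro-averaging lets frequently queried entries dominate the score, while macro-averaging weights every KB entry equally.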
Slot Filling Scoring Metric
• Each response is rated as correct, inexact, redundant, or wrong (credit is given only for correct responses)
• Redundancy: (1) response vs. KB; (2) among responses: build equivalence classes and give credit to only one member of each class
• Correct = # (non-NIL system output slots judged correct)
• System = # (non-NIL system output slots)
• Reference = # (single-valued slots with a correct non-NIL response) + # (equivalence classes for all list-valued slots)
• Standard Precision (Correct / System), Recall (Correct / Reference) and F-measure
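The three counts defined above feed the standard formulas directly. A minimal sketch, assuming the counts have already been produced by assessment; the function name and the example numbers are made up for illustration.

```python
def slot_filling_scores(correct, system, reference):
    """Return (precision, recall, F1) from the three KBP counts.

    correct   : non-NIL system output slots judged correct
    system    : all non-NIL system output slots
    reference : single-valued slots with a correct non-NIL response
                plus equivalence classes for all list-valued slots
    """
    precision = correct / system if system else 0.0
    recall = correct / reference if reference else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


# Hypothetical counts, not taken from the evaluation.
p, r, f = slot_filling_scores(correct=30, system=60, reference=100)
```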
Evaluation Results
Top-10 Regular Entity Linking Systems
• <0.8 correlation between overall and non-NIL performance
Human/System Entity Linking Comparison (subset of 200 queries)
• Average among three annotators
System Overview [Diagram; components in extraction order: Query, Query Expansion, External KBs, Pattern Matching, Freebase, IE, QA, Wikipedia, Answer Filtering, Answer Validation, Text Mining, Cross-System & Cross-Slot Reasoning, Statistical Answer Re-ranking, Priority-based Combination, Inexact & Redundant Answer Removal, Answer Validation, Answers]
IE Pipeline
• Apply ACE Cross-document IE (Ji et al., 2009)
• Mapping ACE to KBP, examples:
Pattern Learning Pipeline
• Selection of query-answer pairs from Wikipedia Infoboxes
  • split into two sets
• Pattern extraction
  • For each {q,a} pair, generalize patterns by entity tagging and regular expressions, e.g. <q> died at the age of <a>
• Pattern assessment
  • Evaluate and filter based on matching rate
• Pattern matching
  • Combine with coreference resolution
• Answer filtering based on entity type checking, dictionary checking and dependency parsing constraint filtering
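The pattern extraction and matching steps can be illustrated with plain regular expressions. This is a simplified sketch, not the system's implementation: it generalizes only the text between the query and the answer, skips entity tagging and coreference, and uses a crude answer regex (a number or a capitalized sequence). All names and sentences are hypothetical.

```python
import re


def generalize(sentence, q, a):
    """Turn a sentence containing q followed by a into a '<q> ... <a>' pattern."""
    qi = sentence.find(q)
    ai = sentence.find(a)
    if qi < 0 or ai < qi + len(q):
        return None  # q or a missing, or a does not follow q
    return "<q>" + sentence[qi + len(q):ai] + "<a>"


def compile_pattern(pattern, query_name):
    """Instantiate a pattern for a new query and compile it to a regex."""
    regex = ""
    for part in re.split(r"(<q>|<a>)", pattern):
        if part == "<q>":
            regex += re.escape(query_name)
        elif part == "<a>":
            # Assumption: answers are numbers or capitalized word sequences.
            regex += r"(?P<answer>\d+|[A-Z]\w*(?: [A-Z]\w*)*)"
        else:
            regex += re.escape(part)
    return re.compile(regex)


# Learn a pattern from one hypothetical {q, a} pair ...
pattern = generalize("John Smith died at the age of 87 in Boston.", "John Smith", "87")

# ... and apply it to a new query name in unseen text.
matcher = compile_pattern(pattern, "Mary Jones")
match = matcher.search("Reports say Mary Jones died at the age of 92 on Friday.")
answer = match.group("answer") if match else None
```

In a fuller pipeline the held-out half of the Infobox pairs would be used to measure each pattern's matching rate before it is kept.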
QA Pipeline
• Apply an open-domain QA system, OpenEphyra (Schlaefer et al., 2007)
• Relevance metric related to PMI and CCP
• Answer pattern probability: P(q, a) = P(q NEAR a), where NEAR means within the same sentence boundary
• Limited by occurrence-based confidence and recall issues
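One simple way to read P(q NEAR a) is as the fraction of sentences in which the query and the answer co-occur. The sketch below assumes that reading, exact string matching, and a naive punctuation-based sentence splitter; none of these details are confirmed by the slides, and the example documents are invented.

```python
import re


def split_sentences(text):
    """Naive sentence splitter on ., ! and ? (an assumption for this sketch)."""
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]


def near_probability(q, a, documents):
    """Fraction of sentences that contain both q and a."""
    total = 0
    both = 0
    for doc in documents:
        for sentence in split_sentences(doc):
            total += 1
            if q in sentence and a in sentence:
                both += 1
    return both / total if total else 0.0


# Hypothetical two-document corpus with four sentences in total.
docs = [
    "Acme was founded in 1976. Acme makes anvils.",
    "In 1976 it rained. Acme grew.",
]
prob = near_probability("Acme", "1976", docs)
```

Because the estimate depends purely on co-occurrence counts, rare but correct answers receive low confidence, which matches the limitation noted on the slide.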
More Queries and Fewer Answers
• Query template expansion
  • Generated 68 question templates for organizations and 68 for persons
  • Who founded <org>?
  • Who established <org>?
  • <org> was created by who?
• Query name expansion
  • Wikipedia redirect links
• Heuristic rules for answer filtering
  • Format validation
  • Gazetteer-based validation
  • Regular-expression-based filtering
  • Structured data identification and answer filtering
Motivation of Statistical Re-Ranking
• Union and voting are too sensitive to the performance of baseline systems
• Union guarantees highest recall
  • requires comparable performance
• Voting
  • assumes more frequent answers are more likely true (FALSE)
• Priority-based combination
  • voting with weights
  • assumes system performance does not vary by slot (FALSE)
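The three baseline combination strategies can be contrasted on a toy example. This is a sketch under assumptions: each pipeline returns a list of answer strings for one slot, and the system names, answers and weights below are invented for illustration.

```python
from collections import Counter


def union(outputs):
    """Keep every distinct answer from every system (highest recall)."""
    merged = []
    for answers in outputs.values():
        for answer in answers:
            if answer not in merged:
                merged.append(answer)
    return merged


def vote(outputs):
    """Pick the answer proposed by the largest number of systems."""
    counts = Counter(a for answers in outputs.values() for a in set(answers))
    return counts.most_common(1)[0][0] if counts else None


def priority_vote(outputs, weights):
    """Voting with one fixed weight per system, regardless of slot."""
    scores = Counter()
    for system, answers in outputs.items():
        for answer in set(answers):
            scores[answer] += weights[system]
    return max(scores, key=scores.get) if scores else None


# Hypothetical outputs of three pipelines for one slot.
outputs = {"ie": ["1976"], "qa": ["1969"], "pattern": ["1969"]}
weights = {"ie": 0.7, "qa": 0.2, "pattern": 0.3}

all_answers = union(outputs)
voted = vote(outputs)
weighted = priority_vote(outputs, weights)
```

Here plain voting and weighted voting disagree, and neither can adapt when a system is strong on one slot and weak on another, which is the gap the statistical re-ranker is meant to close.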
Statistical Re-Ranking
• Maximum Entropy (MaxEnt) based supervised re-ranking model to re-rank candidate answers for the same slot
• Features
  • Baseline Confidence
  • Answer Name Type
  • Slot Type X System
  • Number of Tokens X Slot Type
  • Gazetteer constraints
  • Data format
  • Context sentence annotation (dependency parsing, …)
  • …
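A MaxEnt re-ranker scores each candidate with a weighted sum of features and normalizes with a softmax over the candidates for the same slot. The sketch below shows only the scoring step with three of the listed features; the feature names, the hand-set weights and the candidate values are assumptions for illustration, and training the weights from annotated data is omitted.

```python
import math


def features(candidate):
    """Build a sparse feature dict for one candidate answer (illustrative)."""
    n_tokens = len(candidate["answer"].split())
    return {
        "baseline_conf": candidate["confidence"],
        "slot=%s|sys=%s" % (candidate["slot"], candidate["system"]): 1.0,
        "slot=%s|ntokens=%d" % (candidate["slot"], n_tokens): 1.0,
    }


def rerank(candidates, weights):
    """Return (answer, probability) pairs sorted by MaxEnt score."""
    scores = [
        sum(weights.get(name, 0.0) * value for name, value in features(c).items())
        for c in candidates
    ]
    z = sum(math.exp(s) for s in scores)  # softmax normalizer
    ranked = sorted(zip(candidates, scores), key=lambda pair: pair[1], reverse=True)
    return [(c["answer"], math.exp(s) / z) for c, s in ranked]


# Hypothetical candidates and hand-set weights (a trained model would learn these).
candidates = [
    {"answer": "1996", "slot": "org:founded", "system": "qa", "confidence": 0.6},
    {"answer": "1976", "slot": "org:founded", "system": "ie", "confidence": 0.2},
]
weights = {"baseline_conf": 1.0, "slot=org:founded|sys=ie": 1.5}
ranked = rerank(candidates, weights)
```

With these weights the low-confidence candidate from the system that is assumed reliable for this slot moves to the top, which is the kind of correction a slot-by-system feature makes possible.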
MLN-based Cross-Slot Reasoning
• Motivation
  • each slot is often dependent on other slots
  • can construct new ‘revertible’ queries to verify candidate answers
    • X is per:children of Y → Y is per:parents of X
    • X was born on date Y → age of X is approximately (the current year – Y)
• Use Markov Logic Networks (MLN) to encode cross-slot reasoning rules
  • Heuristic inferences are highly dependent on the order of applying rules
  • MLN can
    • add a weight to each inference rule
    • integrate soft rules and hard rules
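The two example rules can be written as small checks. This sketch is deliberately not a Markov Logic Network: it only shows the deterministic core of each rule, whereas an MLN would attach a learned weight to each rule and perform joint inference. Function names, the tolerance and the example values are hypothetical.

```python
# Inverse slot pairs used to build a "reverted" query for verification.
INVERSE_SLOTS = {"per:children": "per:parents", "per:parents": "per:children"}


def inverse_fact(subject, slot, value):
    """Given (subject, slot, value), return the implied reverse fact, if any."""
    inverse = INVERSE_SLOTS.get(slot)
    if inverse is None:
        return None
    return (value, inverse, subject)


def age_consistent(birth_year, age, current_year, tolerance=1):
    """Check that age is approximately current_year - birth_year."""
    return abs((current_year - birth_year) - age) <= tolerance


# Hypothetical candidate facts.
reverted = inverse_fact("Julia Roberts", "per:children", "Hazel")
ok = age_consistent(birth_year=1967, age=39, current_year=2007)
bad = age_consistent(birth_year=1967, age=25, current_year=2007)
```

A candidate whose reverted query finds no support, or whose age and birth date disagree, would be penalized rather than discarded outright once the rules carry soft weights.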
Error Analysis on Supervised Model
• Error types: classification errors, spurious errors, missing errors
• Name error examples: <PER>Faisalabad</PER>'s <PER>Catholic Bishop</PER> <PER>John Joseph</PER>, who had been campaigning against the law, shot himself in the head outside a court in Sahiwal district when the judge convicted Christian Ayub Masih under the law in 1998.
• Nominal missing error examples: supremo/shepherd/prophet/sheikh/Imam/overseer/oligarchs/Shiites…
• Intuitions of using lexical knowledge discovered from n-grams: each person has a Gender (he, she…) and is Animate (who…)
Motivations of Using Web-scale Ngrams (Ji and Lin, 2009) • Data is Power • Web is one of the largest text corpora: however, web search is slooooow (if you have a million queries). • N-gram data: compressed version of the web • Already proven to be useful for language modeling • Google N-gram: 1 trillion token corpus
died in (a|an) _ accident car 13966, automobile 2954, road 1892, auto 1650, traffic 1549, tragic 1480, motorcycle 1399, boating 823, freak 733, drowning 438, vehicle 417, hunting 304, helicopter 289, skiing 281, mining 254, train 250, airplane 236, plane 234, climbing 231, bus 208, motor 198, industrial 187, swimming 180, training 170, motorbike 155, aircraft 152, terrible 137, riding 136, bicycle 132, diving 127, tractor 115, construction 111, farming 107, horrible 105, one-car 104, flying 103, hit-and-run 99, similar 89, racing 89, hiking 89, truck 86, farm 81, bike 78, mine 75, carriage 73, logging 72, unfortunate 71, railroad 71, work-related 70, snowmobile 70, mysterious 68, fishing 67, shooting 66, mountaineering 66, highway 66, single-car 63, cycling 62, air 59, boat 59, horrific 56, sailing 55, fatal 55, workplace 50, skydiving 50, rollover 50, one-vehicle 48, <UNK> 48, work 47, single-vehicle 47, vehicular 45, kayaking 43, surfing 42, automobile 41, car 40, electrical 39, ATV 39, railway 38, Humvee 38, skating 35, hang-gliding 35, canoeing 35, 0000 35, shuttle 34, parachuting 34, jeep 34, ski 33, bulldozer 31, aviation 30, van 30, bizarre 30, wagon 27, two-vehicle 27, street 27, glider 26, " 25, sawmill 25, horse 25, bomb-making 25, bicycling 25, auto 25, alcohol-related 24, snowboarding 24, motoring 24, early-morning 24, trucking 23, elevator 22, horse-riding 22, fire 22, two-car 21, strange 20, mountain-climbing 20, drunk-driving 20, gun 19, rail 18, snowmobiling 17, mill 17, forklift 17, biking 17, river 16, motorcyle 16, lab 16, gliding 16, bonfire 16, apparent 15, aeroplane 15, testing 15, sledding 15, scuba-diving 15, rock-climbing 15, rafting 15, fiery 15, scooter 14, parachute 14, four-wheeler 14, suspicious 13, rodeo 13, mountain 13, laboratory 13, flight 13, domestic 13, buggy 13, horrific 12, violent 12, trolley 12, three-vehicle 12, tank 12, sudden 12, stupid 12, speedboat 12, single 12, jousting 12, ferry 12, airplane 12, unrelated 11, 
transporter 11, tram 11, scuba 11, common 11, canoe 11, skateboarding 10, ship 10, paragliding 10, paddock 10, moped 10, factory 10
Gender Discovery from Ngrams
• Discovery patterns (Bergsma et al., 2005, 2008)
  • (tag=N.*|word=[A-Z].*) tag=CC.* (word=his|her|its|their)
  • (tag=N.*|word=[A-Z].*) tag=V.* (word=his|her|its|their)
  • …
• If a mention indicates male or female with high confidence, it’s likely to be a person mention
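Once n-grams matching the discovery patterns have been collected, gender statistics reduce to counting which possessive pronoun follows each noun. The sketch below assumes the n-grams have already been filtered by the patterns, and maps his/her/its/their to M/F/N/PL; the input format, the label names and the counts are illustrative assumptions.

```python
from collections import defaultdict

# Assumed mapping from possessive pronoun to gender/number property.
PRONOUN_TO_PROPERTY = {"his": "M", "her": "F", "its": "N", "their": "PL"}


def gender_counts(ngrams):
    """Aggregate pronoun frequencies per noun.

    ngrams: iterable of (tokens, frequency), where tokens[0] is the noun
    and tokens[-1] is the word in the pronoun position.
    """
    counts = defaultdict(lambda: defaultdict(int))
    for tokens, freq in ngrams:
        prop = PRONOUN_TO_PROPERTY.get(tokens[-1].lower())
        if prop is not None:
            counts[tokens[0]][prop] += freq
    return {noun: dict(props) for noun, props in counts.items()}


# Hypothetical pattern matches with made-up frequencies.
matches = [
    (("John", "lost", "his"), 120),
    (("John", "and", "her"), 5),
    (("John", "saw", "the"), 9),   # no possessive pronoun, ignored
]
stats = gender_counts(matches)
```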
Animacy Discovery from Ngrams
• Discovery patterns: count the relative pronoun after nouns
  • not (tag=(IN|[NJ].*) tag=[NJ].* (? (word=,)) (word=who|which|where|when)
• If a mention indicates animacy with high confidence, it’s likely to be a person mention
Unsupervised Mention Detection Using Gender and Animacy Statistics
• Candidate mention detection
  • Name: capitalized sequence of <=3 words; filter stop words, nationality words, dates, numbers and title words
  • Nominal: un-capitalized sequence of <=3 words without stop words
• Margin Confidence Estimation
  • Confidence(candidate, Male/Female/Animate) = [freq(best property) – freq(second best property)] / freq(second best property)
  • A candidate is accepted when this confidence exceeds a threshold
• Full Matching: John Joseph (M:32)
• Composite Matching: Ayub (M:87) Masih (M:117)
• Relaxed Matching: Mahmoud (M:159 F:13) Hamadan (N:19) Salim (F:13 M:188) Qawasmi (M:0 F:0)
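The margin confidence can be computed directly from per-property frequencies such as the ones listed above. A minimal sketch: the handling of a zero second-best frequency (infinite confidence if the best is positive, zero otherwise) is an assumption, since the slides do not say how that case is treated.

```python
def margin_confidence(freqs):
    """(best - second best) / second best over property frequencies.

    freqs: dict such as {"M": 159, "F": 13}.
    """
    ranked = sorted(freqs.values(), reverse=True)
    if not ranked:
        return 0.0
    best = ranked[0]
    second = ranked[1] if len(ranked) > 1 else 0
    if second == 0:
        # Assumption: unopposed evidence is maximally confident,
        # no evidence at all gives zero confidence.
        return float("inf") if best > 0 else 0.0
    return (best - second) / second


# Frequencies from the relaxed matching examples on the slide.
mahmoud = margin_confidence({"M": 159, "F": 13})
salim = margin_confidence({"F": 13, "M": 188})
qawasmi = margin_confidence({"M": 0, "F": 0})
```

A token such as Qawasmi, with no gender evidence at all, gets zero confidence and would fall below any positive threshold.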
Mention Detection Performance
• Apply the parameters optimized on the dev set directly to the blind test set
• Blind test on 50 ACE05 newswire documents, with 555 person name mentions and 900 person nominal mentions
Impact of Statistical Re-Ranking
• 5-fold cross-validation on training set
• Mitigates the impact of errors produced by scoring based on co-occurrence (slot type x sys feature)
• e.g. the query “Moro National Liberation Front” and the answer “1976” did not have a high co-occurrence, but the answer was bumped up by the re-ranker based on the slot type feature org:founded
Impact of Cross-Slot Reasoning
• Brian McFadden | per:title | singers | “She had two daughter with one of the MK’d Westlife singers, Brian McFadden, calling them Molly Marie and Lilly Sue”
Slot-Specific Analysis
• A few slots account for a large fraction of the answers: per:title, per:employee_of, per:member_of, and org:top_members/employees account for 37% of correct responses
• For a few slots, delimiting the exact answer is difficult, and the result is ‘inexact’ slot fills: per:charges, per:title (“rookie driver”; “record producer”)
• For a few slots, equivalent-answer detection is important to avoid redundant answers; per:title again accounts for the largest number of cases, e.g., “defense minister” and “defense chief” are equivalent
Why KBP is more difficult than ACE
• Cross-sentence Inference: non-identity coreference (per:children)
  • Lahoud is married to an Armenian and the couple have three children. Eldest son Emile Emile Lahoud was a member of parliament between 2000 and 2005.
• Cross-slot Inference (per:children)
  • People Magazine has confirmed that actress Julia Roberts has given birth to her third child a boy named Henry Daniel Moder. Henry was born Monday in Los Angeles and weighed 8? lbs. Roberts, 39, and husband Danny Moder, 38, are already parents to twins Hazel and Phinnaeus who were born in November 2006.
Cross-lingual Entity Linking Query = “吉姆.帕森斯”
Cross-lingual Slot Filling Other family: Todd Spiewak Query = “James Parsons”
Cross-lingual Slot Filling
• Two Possible Strategies
  1. Entity Translation (ET) + Chinese KBP
  2. Machine Translation (MT) + English KBP
• Stimulate Research on
  • Information-aware Machine Translation
  • Translation-aware Information Extraction
  • Foreign Language KBP, Cross-lingual Distant Learning