InfoXtract Location Normalization: A Hybrid Approach to Geographic References in Information Extraction
May 31, 2003, Edmonton, Alberta
NAACL-HLT Workshop on the Analysis of Geographic References
Huifeng Li, Rohini K. Srihari, Cheng Niu, and Wei Li
Cymfony Inc.
Contents
• Overview of Information Extraction System: InfoXtract
• Introduction of Location Normalization (LocNZ)
• Task of LocNZ
• Problems and Proposed Method
• Algorithm for LocNZ
• Experimental Evaluation
• Future Work
Overview of InfoXtract
• InfoXtract produces the following information objects from a text
  • Named Entities (NEs) - “Bill Gates, chairman of Microsoft …”
  • Correlated Entities (CEs) - “Bill Gates, chairman of Microsoft …”
  • Subject-Verb-Object (SVO) triples - Both syntactic and semantic forms of the structures
  • Entity Profiles - Profiles for entity types such as people and organizations
  • General Events (GEs) - Domain-independent events
    • Event: argument structures centering around a verb with the associated information
    • “who did what to whom, when (or how often) and where”
  • Predefined Events (PEs) - Domain-specific events
• System components: NLP and machine learning integrated into IE
  • POS tagging
  • Shallow and deep parsing
  • Named Entity tagging
    • Combining supervised and unsupervised machine learning techniques
  • Concept-based analysis
    • Word sense disambiguation
  • Location / Time normalization
  • Co-reference analysis
  • Entity Profile fusion
  • Event extraction, fusion and linking
InfoXtract Architecture
[Figure: InfoXtract architecture diagram. Recoverable labels: Document Processor (Tokenizer, Source Document, Zoned Text, Tokenlist); Linguistic Processor(s) (Lexicon Lookup, POS Tagging, Named Entity Detection, Time Normalization, Location Normalization, Shallow Parsing, Deep Parsing, Relationship Detection, Alias/Coreference Linking, Pragmatic Filtering, Profile/Event Merge, Profile/Event Linking); Knowledge Resources (Lexicon Resources, Grammars, Language models); outputs NE, CE, SVO, CO, GE, PE as a formatted XML document; interfaces labeled HTTP and CORBA. Legend: Grammar Module, Procedure or Statistical Model, Hybrid Module.]
Introduction of Location Normalization
• Task of location normalization (LocNZ)
  • Identify the correct sense of an ambiguous location named entity
  (1) Decide whether a location name is a city, a province or a country
    • Supports the NE tagger in deciding the sub-tag: New York (NeLoc) => New York (NeLoc, NeCty)
  (2) Decide which city, state or country a city, island or state belongs to
    • 18 states have a city named Boston
    • Boston => Alabama, Arkansas, Massachusetts, Missouri, …
• The result of LocNZ can be used to
  (1) Support event extraction, merging and event visualization: indicate where the event occurred
  (2) Support profile generation: provide location information for a person or an organization
  (3) Support question answering: provide location area for document categorization
Event and Profile Generation
• Event Template: argument structures centering around a verb with the associated information
• Profile Template: presents the subject's most noteworthy characteristics and achievements

<PersonProfile 001> ::
  Name: Julian Werner Hill
  Position: Research chemist
  Age: 91
  Birth-place: <LocationProfile 100>
  Affiliation: Du Pont Co.
  Education: MIT

<LocationProfile 100> ::
  Name: St. Louis
  State: Missouri
  Country: United States of America
  Zipcode: 63101
  Latitude: 38.634616
  Longitude: -90.191313
  Related_profiles: <PersonProfile 001>

Input: Alvin Karloff was replaced by John Doe as CEO of ABC at New York last month.

<General Event id=200> :
  key verb: replace
  who: John Doe
  whom-what: Alvin Karloff
  complement: CEO of ABC
  when: last month
  where: <LocationProfile 101>
Event Visualization
• The result of LocNZ indicates where an event occurred

Event type: <Die: Event 200>
  Who: <Julian Werner Hill: PersonProfile 001>
  When: 1996-01-07
  Where: <LocationProfile 103>
  Preceding_event: <hospitalize: Event 260>
  Subsequent_event: <bury: Event 250>

[Figure: event visualization of the event above. Predicate: Die; Who: Julian Werner Hill; When: 1996-01-07; Where: <LocationProfile 103>]
Problems in Location Normalization
• Difference between LocNZ and general WSD
  • Selection restriction is not sufficient
    • WSD: verb sense tagging relies mainly on co-occurrence constraints of semantic structures, Verb-Subject and Verb-Object in particular
    • LocNZ: depends primarily on the co-occurrence of related location entities in the same discourse (text)
  • Fewer clues in a text than for verb and noun sense disambiguation
    • ‘located in’ can only indicate that ‘San Francisco’ is a location
    • Example: The Golden Gate Bridge is located in San Francisco
• Lack of sources for default senses of location names
  • The Tipster Gazetteer provides only a small part of the default senses
• Little previous research on solving LocNZ
Major Types of Ambiguities
• City versus country and state name ambiguity
  • Canada (CITY) Kansas (PROVINCE 1) United States (COUNTRY)
  • Canada (CITY) Kentucky (PROVINCE 1) United States (COUNTRY)
  • Canada (COUNTRY)
  • New York state versus New York city
• Same city name in different provinces
  • 33 Washington entries in the Gazetteer
  • Washington (CITY) Arkansas (PROVINCE 1) United States (COUNTRY)
  • Washington (CITY) California (PROVINCE 1) United States (COUNTRY)
  • Washington (CITY) Connecticut (PROVINCE 1) United States (COUNTRY)
  • Washington (CITY) District of Columbia (PROVINCE 1) United States (COUNTRY)
  • Washington (CITY) Georgia (PROVINCE 1) United States (COUNTRY)
  • Washington (CITY) Illinois (PROVINCE 1) United States (COUNTRY)
  • Washington (CITY) Indiana (PROVINCE 1) United States (COUNTRY)
  • Washington (CITY) Iowa (PROVINCE 1) United States (COUNTRY)
  • Washington (CITY) Kansas (PROVINCE 1) United States (COUNTRY)
  • Washington (CITY) Kentucky (PROVINCE 1) United States (COUNTRY)
  • … …
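The gazetteer entries above can be read as records of the form name, type, enclosing province, enclosing country. A minimal sketch of looking up candidate senses follows, assuming a toy in-memory gazetteer that holds only a few entries quoted on this slide; the `Sense` record and the function names are hypothetical, not InfoXtract's API.

```python
from typing import NamedTuple, Optional


class Sense(NamedTuple):
    name: str
    kind: str                 # "CITY", "PROVINCE" or "COUNTRY"
    province: Optional[str]   # enclosing state/province, if any
    country: Optional[str]    # enclosing country, if any


# Toy gazetteer (assumption): only a handful of entries copied from the slide.
GAZETTEER = [
    Sense("Canada", "CITY", "Kansas", "United States"),
    Sense("Canada", "CITY", "Kentucky", "United States"),
    Sense("Canada", "COUNTRY", None, None),
    Sense("Washington", "CITY", "Arkansas", "United States"),
    Sense("Washington", "CITY", "District of Columbia", "United States"),
]


def candidate_senses(name):
    """Return every gazetteer sense whose name matches, ignoring case."""
    return [s for s in GAZETTEER if s.name.lower() == name.lower()]


def is_ambiguous(name):
    """A location name needs normalization when it has more than one sense."""
    return len(candidate_senses(name)) > 1


canada = candidate_senses("Canada")
```

A name with a single sense needs no further processing; every other name is passed on to the disambiguation steps.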
Example of Text with Location Names
CNN news: http://www.cnn.com/2003/WEATHER/02/19/winter.storm.delays.ap/index.html
• A traveler gets the bad news as he looks at the departures list that shows all canceled flights at the Philadelphia International Airport.
MIAMI (AP) -- Travelers heading to and from the Northeast faced continued uncertainty Tuesday, even as airports in the mid-Atlantic region began slowly digging themselves out from one of the worst winter storms on record. ……
No flights left Florida for Baltimore-Washington International Airport until Tuesday afternoon. That airport was one of the hardest-hit by the storm, with a snowfall total of 28 inches.
Rosanna Blum, 38, of Hunt Valley, Maryland, had a confirmed seat on a Miami to Baltimore flight Tuesday afternoon, but still wasn't optimistic that she'd actually have the chance to use it. ……
Theresa York, from Maryland, works the phones at Miami Airport as she tries to find a flight back home. ……
"It's surreal," said Dawn Shuford, 35, as she reclined against her suitcase in a darkened hallway at BWI. She'd been trying since Sunday morning to get home to Seattle.
The Washington area's two other airports, Reagan National and Dulles, also had limited service.
Marty Legrow, from Connecticut, rests on her suitcase at Ronald Reagan National Airport in Washington.
Philadelphia International Airport resumed operations Tuesday but still expected to cancel about one-third of its flights. Flights slowly resumed at New York's LaGuardia, Kennedy and Newark airports, and Boston's Logan, where more than 2 feet of snow fell, had one runway open. ……
Margie D'Onofrio, 48, of King Of Prussia, Pennsylvania, and a travel companion left the Bahamas on Sunday, hoping to fly back to Philadelphia. They made it to Miami, and D'Onofrio said she did not expect to be home anytime Tuesday. ……
Passengers camped out overnight at many airports.
Many fliers called ahead Tuesday and weren't clogging airports unnecessarily, Orlando International Airport spokeswoman Carolyn Fennell said.
Our Previous Method [Li et al. 2002]
(1) Lexical grammar processing with local context
  • Identify City or State
    • City of Buffalo; New York State
  • Disambiguate the meaning of a word
    • e.g. Williamsville, New York, USA
    • e.g. Brussels, Belgium
  • Propagate the analysis result within the text where it appears
    • One sense per discourse (Gale, Yarowsky et al, 1992)
(2) Construct a graph and calculate the maximum weight spanning tree with Kruskal's algorithm, considering global information
  • Node: location name sense
  • Edge: similarity weight between two location name senses
  • Calculate similarities between locations in the graph by referring to a predefined similarity table
  • Choose the maximum weight spanning tree that reflects the most probable location senses in the document
(3) Default sense application
  • If the similarity value is lower than a threshold, apply default senses
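Step (2) relies on Kruskal's algorithm, which sorts every edge by weight before building the tree. A minimal sketch of a maximum weight spanning tree over a small graph follows; the node labels and weights are invented for illustration, since the predefined similarity table is not reproduced in these slides, and the sketch does not show how senses are read off the resulting tree.

```python
def kruskal_max_spanning_tree(nodes, edges):
    """Kruskal's algorithm for a maximum weight spanning tree.

    nodes: iterable of hashable node labels
    edges: list of (weight, u, v) tuples for an undirected graph
    Returns the chosen edges in the order they were added.
    """
    parent = {n: n for n in nodes}

    def find(x):
        # Union-find lookup with path halving.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    tree = []
    # The full edge list is sorted up front, heaviest edge first.
    for weight, u, v in sorted(edges, key=lambda e: e[0], reverse=True):
        root_u, root_v = find(u), find(v)
        if root_u != root_v:          # the edge joins two separate components
            parent[root_u] = root_v
            tree.append((weight, u, v))
    return tree


# Hypothetical graph: labels and weights are illustrative only.
nodes = ["a", "b", "c", "d"]
edges = [(5, "a", "b"), (1, "a", "c"), (4, "b", "c"), (2, "c", "d")]
tree = kruskal_max_spanning_tree(nodes, edges)
```

The sort over all edges dominates the cost when each location name contributes many sense nodes.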
Problems of Previous Method
• MST calculation requires sorting all the weighted edges
  • When there are many locations and each location has over 20 senses, the number of edges grows quickly, edge sorting takes a long time, and the weight values are not distinctive enough
• Solution: adopted Prim’s algorithm for MST combined with heuristics
  • If a location has a country sense, select that sense as the default sense of that location (heuristic 1)
  • If a location has province or capital senses, select that sense as the default sense after local context application (heuristic 2)
• The number of location mentions and the distance between them are taken into account
  • The previous method could not reflect these factors
  • Assign weights to the sense nodes in the constructed graph
  • Choose the node with the maximum weight
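For comparison with the edge-sorting approach, Prim's algorithm grows the tree outward from one node and only examines edges adjacent to the tree so far. The sketch below is a generic maximum weight variant using Python's `heapq`; the graph and its weights are invented for illustration, and the heuristics listed above are not included.

```python
import heapq


def prim_max_spanning_tree(graph, start):
    """Prim's algorithm for a maximum weight spanning tree.

    graph: dict mapping node -> {neighbor: weight} for an undirected graph
    Returns a list of (u, v, weight) edges in the order they were added.
    """
    visited = {start}
    # heapq is a min-heap, so weights are negated to pop the heaviest edge first.
    heap = [(-w, start, v) for v, w in graph[start].items()]
    heapq.heapify(heap)
    tree = []
    while heap and len(visited) < len(graph):
        neg_w, u, v = heapq.heappop(heap)
        if v in visited:
            continue                  # edge would close a cycle
        visited.add(v)
        tree.append((u, v, -neg_w))
        for nxt, w in graph[v].items():
            if nxt not in visited:
                heapq.heappush(heap, (-w, v, nxt))
    return tree


# Hypothetical graph: labels and weights are illustrative only.
graph = {
    "a": {"b": 5, "c": 1},
    "b": {"a": 5, "c": 4},
    "c": {"a": 1, "b": 4, "d": 2},
    "d": {"c": 2},
}
tree = prim_max_spanning_tree(graph, "a")
```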
Weight Calculation Table 1: Impact weight of Sense2 on Sense1
Weight Assigned to Sense Nodes
[Figure: candidate sense nodes for each location name]
• Canada {Kansas, Kentucky, Country}
• Vancouver {British Columbia, Washington, Port in USA, Port in Canada}
• Charlottetown {Prov in USA, New York City, …}
• Toronto {Ontario, New South Wales, Illinois, …}
• New York {Prov in USA, New York City, …}
• Quebec {city in Quebec, Quebec Prov, Connecticut, …}
• Prince Edward Island {Island in Canada, Island in South Africa, Province in Canada}
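One way to picture the weighting of sense nodes is sketched below. Everything numeric here is an assumption: the impact values stand in for Table 1, which is not reproduced in these slides, and the formula (impact times mention count, divided by one plus token distance) is only one plausible way to combine mention counts and distance, not the authors' formula.

```python
from collections import namedtuple

Sense = namedtuple("Sense", "name kind province country")


def impact(s1, s2):
    """Hypothetical impact of sense s2 on sense s1 (placeholder for Table 1)."""
    if s2.kind == "PROVINCE" and s2.name == s1.province:
        return 3.0                    # s2 is the state/province containing s1
    if s1.province is not None and s1.province == s2.province:
        return 2.0                    # both lie in the same state/province
    if s1.country is not None and s1.country == s2.country:
        return 1.0                    # both lie in the same country
    return 0.0


def sense_weights(target, mentions, senses):
    """Weight each candidate sense of `target` by support from other names.

    mentions: dict name -> list of token positions in the document
    senses:   dict name -> list of candidate Sense records
    """
    weights = {}
    for s1 in senses[target]:
        total = 0.0
        for other, positions in mentions.items():
            if other == target:
                continue
            # Closest pair of mentions of the two names.
            distance = min(abs(p - q) for p in mentions[target] for q in positions)
            best = max((impact(s1, s2) for s2 in senses[other]), default=0.0)
            # Assumed combination: more mentions and shorter distance weigh more.
            total += best * len(positions) / (1 + distance)
        weights[s1] = total
    return weights


# Illustrative document: "Boston" at tokens 10 and 40, "Massachusetts" at token 12.
mentions = {"Boston": [10, 40], "Massachusetts": [12]}
senses = {
    "Boston": [
        Sense("Boston", "CITY", "Massachusetts", "United States"),
        Sense("Boston", "CITY", "Alabama", "United States"),
    ],
    "Massachusetts": [Sense("Massachusetts", "PROVINCE", None, "United States")],
}
weights = sense_weights("Boston", mentions, senses)
best = max(weights, key=weights.get)
```

Here the Massachusetts sense of Boston receives the larger weight because a nearby mention names its enclosing state.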
Modified Algorithm
• Look up the location gazetteer to associate candidate senses with each location NE;
• If a location has a country sense, select that sense as the default sense of that location (heuristic);
• Call the pattern matching sub-module for local patterns like “Williamsville, New York, USA”;
• Apply the ‘one sense per discourse’ principle for each disambiguated location name to propagate the selected sense to its other mentions within the document;
• Apply the default sense heuristic for a location with province or capital senses;
• Call Prim’s algorithm in the discourse sub-module to resolve the remaining ambiguities;
• If the difference between the sense with the maximum weight and the sense with the next largest weight is equal to or lower than a threshold, choose the default sense of that name from the lexicon. Otherwise, choose the sense with the maximum weight as output.
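A minimal sketch of three of these steps follows: local pattern matching, one sense per discourse, and the final threshold test. It assumes senses are identified simply by their enclosing province name and that the discourse weights have already been computed; the function names, the regular expression, the weights and the threshold value are all hypothetical.

```python
import re


def match_local_pattern(text, name, provinces):
    """Return the province that directly follows the name, as in 'City, State'."""
    for province in provinces:
        pattern = rf"\b{re.escape(name)},\s*{re.escape(province)}\b"
        if re.search(pattern, text):
            return province
    return None


def choose_by_margin(weights, default, threshold):
    """Use the default sense when the top two weights differ by <= threshold."""
    ranked = sorted(weights.items(), key=lambda kv: kv[1], reverse=True)
    if len(ranked) > 1 and ranked[0][1] - ranked[1][1] <= threshold:
        return default
    return ranked[0][0]


def normalize(text, candidates, weights, defaults, threshold=0.5):
    """Resolve each ambiguous name to one province for the whole document.

    Storing one decision per name applies 'one sense per discourse':
    every mention of the name shares the selected sense.
    """
    resolved = {}
    for name, provinces in candidates.items():
        hit = match_local_pattern(text, name, provinces)
        if hit is not None:
            resolved[name] = hit
        else:
            resolved[name] = choose_by_margin(weights[name], defaults[name], threshold)
    return resolved


# Illustrative inputs: the weights and default senses are invented.
text = "He moved from Williamsville, New York, to Boston. Boston was cold."
candidates = {
    "Williamsville": ["New York", "Illinois"],
    "Boston": ["Massachusetts", "Alabama"],
}
weights = {
    "Williamsville": {"New York": 1.0, "Illinois": 1.0},
    "Boston": {"Massachusetts": 0.9, "Alabama": 1.0},
}
defaults = {"Williamsville": "New York", "Boston": "Massachusetts"}
result = normalize(text, candidates, weights, defaults)
```

In this example "Williamsville" is settled by the local pattern, while "Boston" falls back to its default sense because the two weights are too close to separate.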
Discussion
• Note: Columns 5-9 used the default sense heuristics
• Local patterns (Col-4) alone contribute 12% to the overall performance
• Proper use of default senses and the heuristics (Col-5) can achieve close to 90%
• Prim’s algorithm (Col-7) is clearly better than the previous method using Kruskal’s algorithm (Col-6), by 13%
• But neither method alone outperforms default senses
• When all three types of evidence are used, the new hybrid method reaches the 96% performance shown in Col-9
Future Work
• Extend the scope of location normalization
  • Extend processing scope
    • Physical structures: famous buildings, bridges, airports, lakes, street names, …
  • Extend gazetteer
• Introduce more context information for disambiguation
• Upgrade default meaning assignment