Analysis of geographic references András Kornai, Beth Sundheim

HLT/NAACL03 workshop 31 May 2003 Analysis of geographic references András Kornai, Beth Sundheim

Thanks to • Sponsors: AQUAINT, TIDES • Program committee: Doug Appelt, Merrick Lex Berman, Sean Boisen, Quintin Congdon, Jim Cowie, Doug Jones, Linda Hill, George Wilson • Conference support: Ed Hovy, James Allen, Steven Abney, Dragomir Radev, Ali Hakim, Dekang Lin

Program • 19 papers submitted, 12 accepted • 2 invited speakers • 2 discussion periods • Authors asked to email presentation to geowkshp@kornai.com by end of day

Changes • Afternoon invited speaker: Jerry Hobbs (ISI) replaces Randy Flynn (NIMA) • Paper presentation ordering: Li et al. swapped with Manov et al. (9:30am vs. 12:10pm) • Additional workshop event: Linda Hill (UCSB) poster during breaks

Workshop goals • Exchange information on work in the analysis and grounding of place names and other forms of geographic reference • Informally assess state of art in handling various aspects of the problem • Identify ways to follow up on workshop as a community

External resources • Diversity across projects: • ADL, Tipster, NIMA/USGS, UN-LOCODE, TGN, GB Historical GIS, web, … • Integrated resources: • KIM KB (Manov et al.), named entity word list in InfoXtract, extended multi-gazetteer MetaCarta db, … • Net result – how happy are we with current resources and integration solutions? • With coverage of named places, richness of information, utility for NLP analysis as well as for grounding references? • With using a named entity finder as an analysis preprocessor?

Entity finding in text • Some systems (for now) entirely manual • Semi-automated (with human review) • Fully automated • FS template matching • (Weighted) rule-based • HMM-based • Confidence-based

Disambiguation • What do we mean? • Discrimination between names of places and other types of names • Disambiguation of place reference by location of place • Disambiguation of place reference by type of place • How well do current techniques work, and what hard problems remain? • Relative difficulty given texts about U.S., detailed location references, historical texts • Relation to general word sense disambiguation problem • Use of non-local descriptive references, coreference, … • Co-occurrence of names with non-spatial clue terms (“San Francisco” and “earthquake”)
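
The last bullet (co-occurrence of names with non-spatial clue terms) can be illustrated with a minimal sketch. Everything here is an assumption made for illustration: the `Candidate` class, the entry ids, and the clue-term sets are hypothetical and do not come from any system presented at the workshop.

```python
from dataclasses import dataclass, field


@dataclass
class Candidate:
    """One hypothetical gazetteer entry for an ambiguous name."""
    entry_id: str
    clue_terms: set = field(default_factory=set)


def rank_by_clues(context_tokens, candidates):
    """Rank candidates by how many of their clue terms occur in the context."""
    context = {t.lower() for t in context_tokens}
    scored = [(len(c.clue_terms & context), c.entry_id) for c in candidates]
    # Highest overlap first; ties are broken by entry id for determinism.
    return sorted(scored, key=lambda s: (-s[0], s[1]))


# Illustrative clue terms only; a real system would learn or curate these.
candidates = [
    Candidate("san-francisco-california", {"earthquake", "bay", "california"}),
    Candidate("san-francisco-cordoba", {"argentina", "cordoba"}),
]
text = "An earthquake shook San Francisco and the Bay Area".split()
ranking = rank_by_clues(text, candidates)
```

A plain overlap count is the simplest possible scoring; weighting the clue terms or combining them with local-context evidence would be natural refinements.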

Disambiguation (2) Observations from Nov. ’02 name annotation round: • Among the 80% of name instances for which a gazetteer linkage could be made, evidence from local context was enough to identify the corresponding gazetteer entry in over 75% of cases • This augurs well for successful automation • No gazetteer linkage could be made for 20% of all name instances – either the name did not appear in the gazetteer at all (majority), or it appeared there in the wrong sense • This lack of gazetteer coverage presents a significant challenge

Failure modes (1) • Lack of complete match on name • St. Petersburg – no variant in gazetteer with “St[.]” • Multiple acceptable entries • [the] Crimea – one for “regions”, one for “capes” • Transliteration differences • Sheremetyevo -> Sheremet’yevo • Belarus -> Byelarus • Mismatch on feature type • Simferopol, Vladikavkaz – “capital” in doc, but not in gazetteer
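
Several of these mismatches are purely orthographic, so a lookup key that normalizes names before comparison may recover some of them. The sketch below is an assumption, not a method described in the slides: `normalize_name` and its rules are hypothetical, and it deliberately does not attempt true transliteration variants such as Belarus vs. Byelarus.

```python
import re
import unicodedata


def normalize_name(name):
    """Reduce a place name to a comparison key (illustrative rules only)."""
    # Decompose accented characters and drop the combining marks.
    decomposed = unicodedata.normalize("NFKD", name)
    key = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    key = key.lower()
    # Drop apostrophes used as transliteration marks (Sheremet'yevo).
    key = re.sub(r"['’`]", "", key)
    # Expand the abbreviation "St" / "St." to "saint".
    key = re.sub(r"\bst\b\.?", "saint", key)
    # Drop a leading article ("the Crimea").
    key = re.sub(r"^the\s+", "", key)
    # Collapse remaining punctuation and whitespace runs to single spaces.
    key = re.sub(r"[^a-z0-9]+", " ", key).strip()
    return key
```

Note the trade-offs: expanding "St." to "saint" would be wrong for street names, and the final step discards any characters outside a-z and 0-9, so this key is only suitable for names already in Latin script.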

Failure modes (2) • Many matching entries, but no clear winner • Prigorodny – 16 hits on Prigorod (many in Russia) • No entry for general places • Asia – no entry in gazetteer • Variant name missing from entry • America – no match in gazetteer (i.e., not a listed variant) • Name in doc matches wrong entry in gazetteer • The Heavenly Ski Resort – exactly matches entry with BUILDING feature, but correct entry is under Heavenly Valley Ski Area (with LOCALE feature in USGS GNIS and “sports facilities” feature in ADL gazetteer)
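
One way to make these failure modes visible in a system is to have the lookup step report an explicit outcome rather than silently picking an entry. The following is a minimal sketch under stated assumptions: the `Linkage` outcomes, the `link` function, the toy gazetteer layout, and the entry ids are all hypothetical.

```python
from enum import Enum


class Linkage(Enum):
    NO_ENTRY = "no entry"
    UNIQUE = "unique"
    AMBIGUOUS = "ambiguous"


def link(name, gazetteer, expected_feature=None):
    """Return (outcome, entries) for a name against a toy gazetteer.

    gazetteer maps a name to a list of (entry_id, feature_type) pairs.
    """
    entries = gazetteer.get(name, [])
    if expected_feature is not None:
        preferred = [e for e in entries if e[1] == expected_feature]
        # Fall back to all entries when none has the expected feature type.
        entries = preferred or entries
    if not entries:
        return Linkage.NO_ENTRY, []
    if len(entries) == 1:
        return Linkage.UNIQUE, entries
    return Linkage.AMBIGUOUS, entries


# Toy data with made-up entry ids; "Asia" is deliberately absent.
toy_gazetteer = {
    "Crimea": [("toy-crimea-1", "regions"), ("toy-crimea-2", "capes")],
}
no_entry = link("Asia", toy_gazetteer)
ambiguous = link("Crimea", toy_gazetteer)
resolved = link("Crimea", toy_gazetteer, expected_feature="regions")
```

A feature-type hint resolves the Crimea case but not the Heavenly Ski Resort case, where the exact name match itself points to the wrong entry.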

Foreign language • Example: TIDES surprise language exercise • Challenge: Develop resources and NLP tools for a foreign language in a month (June) • Can’t expect to find an existing placename gazetteer for this language • This language is likely to have a non-western script; ease of transliteration unpredictable

Community • Offerings from SPAWAR Systems Center: • Annotated corpora available to those with licenses for source texts, along with annotation protocol • “Modernized” (with respect to diacritics) Tipster gazetteer available upon request • Call for papers: • Special issue of TALIP journal on temporal and spatial information processing (Editors: Mani, Pustejovsky, Sundheim) • Submissions due December 1 – think about it!

Tagging • Finding the entity in text • Disambiguation • Type assignment • Grounding • Linking to unique gazetteer entry • Assigning coordinates
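
The three tagging steps can be sketched as successive passes that fill in one record per mention. This is only an illustration of the decomposition: the `Tag` fields, the function names, the one-entry toy gazetteer, its entry id, and the approximate coordinates are assumptions, and each step is reduced to a trivial lookup.

```python
from dataclasses import dataclass
from typing import Optional, Tuple


@dataclass
class Tag:
    text: str                                     # surface string found in text
    start: int                                    # character offset of mention
    feature_type: Optional[str] = None            # set by disambiguation
    entry_id: Optional[str] = None                # set by grounding
    coords: Optional[Tuple[float, float]] = None  # (lat, lon), set by grounding


# Hypothetical entry: name -> (entry id, feature type, approximate lat/lon).
TOY_GAZETTEER = {
    "Simferopol": ("toy-001", "populated place", (44.95, 34.10)),
}


def find(text):
    """Step 1: find known names by exact substring search."""
    tags = []
    for name in TOY_GAZETTEER:
        pos = text.find(name)
        if pos != -1:
            tags.append(Tag(name, pos))
    return tags


def disambiguate(tag):
    """Step 2: assign a type (trivial here, one entry per name)."""
    tag.feature_type = TOY_GAZETTEER[tag.text][1]
    return tag


def ground(tag):
    """Step 3: link to a unique gazetteer entry and assign coordinates."""
    entry_id, _, coords = TOY_GAZETTEER[tag.text]
    tag.entry_id, tag.coords = entry_id, coords
    return tag


tags = [ground(disambiguate(t)) for t in find("Flights to Simferopol resumed.")]
```

Keeping the unfilled fields as `None` lets a partially processed mention (found but not yet grounded) be represented explicitly.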

Annotation standards • Example: Automatic Content Extraction (ACE) • XML-based • Levels: mentions (instances), entities, inter-entity relations • Types of mentions: names, nominals (descriptive references), pronouns • Entity categories wrt places: LOCATION, FACILITY, GEOPOLITICAL ENTITY (GPE) • Each category has defined subtypes (new) • Scheme allows for metonymic usage and fuzzy meaning • Software tools to support manual annotation, output format transformation, annotation lookup and review • Entity and relation schemes could/should be elaborated further over time

Volume and pressure

Conclusions • Procedural input sought from participants: shall we summarize at the end? • Who is we: • Organizers? • Session chairs? • Committee members? • Panel?
