Place Expressions: Use of Gazetteer DB in Annotation
Beth Sundheim
SPAWAR Systems Center, San Diego
beth.sundheim@navy.mil
DRAFT – not for public release
What is being annotated?
• Two pertinent efforts
  • AQUAINT study (completed):
    • A text mention of a place name is annotated with the unique ID of a gazetteer entry corresponding to the mention’s intended sense
  • ACE END task (in planning stages):
    • An ACE entity of type LOC or GPE is annotated with the corresponding gazetteer ID from the external END DB
    • If no corresponding entry exists, the entity is annotated with the ID of a new DB entry that captures the entity’s info on the place
Notes:
• AQUAINT study was manual annotation. The gazetteer is the Integrated Gazetteer Data Base (IGDB), which merges 4 source gazetteers
• ACE END (Entity Normalization and Disambiguation) involves 2 of 3 place types (omits FAC) and a non-overlapping subset of IGDB
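The link-or-create rule described for the ACE END task can be sketched as follows. This is only an illustration under stated assumptions: the class and function names (`GazetteerEntry`, `SeedDB`, `annotate`), the name-matching lookup, and the `choose` callback that stands in for the annotator's choice of intended sense are all hypothetical and are not part of the END DB or the planned annotation tool.

```python
from dataclasses import dataclass, field


@dataclass
class GazetteerEntry:
    """Hypothetical, minimal stand-in for one gazetteer DB entry."""
    entry_id: int
    primary_name: str
    alt_names: set = field(default_factory=set)


class SeedDB:
    """Hypothetical in-memory stand-in for the external END DB."""

    def __init__(self, entries):
        self.entries = {e.entry_id: e for e in entries}
        self._next_id = max(self.entries, default=0) + 1

    def candidates(self, name):
        # Assumption: candidates are found by case-insensitive name match.
        key = name.casefold()
        return [
            e for e in self.entries.values()
            if key == e.primary_name.casefold()
            or key in {a.casefold() for a in e.alt_names}
        ]

    def add_entry(self, name):
        # Create a new entry capturing what the document says about the place.
        entry = GazetteerEntry(self._next_id, name)
        self.entries[entry.entry_id] = entry
        self._next_id += 1
        return entry


def annotate(entity_type, name, db, choose=lambda cands: cands[0]):
    """Return the gazetteer ID for a LOC/GPE entity, or None for other types.

    `choose` stands in for the human decision about the intended sense.
    """
    if entity_type not in {"GPE", "LOC"}:
        return None  # FAC and non-place entities are outside the task
    cands = db.candidates(name)
    entry = choose(cands) if cands else db.add_entry(name)
    return entry.entry_id
```

With a seed DB holding entries 12345 (Russian Federation, alternate name Russia) and 67890 (Moscow), `annotate("GPE", "Russia", db)` returns 12345, a FAC entity returns `None`, and an unknown LOC name gets the ID of a newly created entry.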
Example
Text: The Russian Interior Minister announced today that over two and a half tons of explosives have been seized in various parts of Russia since the explosion in a square in downtown Moscow last Tuesday. A Muscovite bomb expert said that the explosion, in an underpass beneath Pushkin Square in that city, was caused by a 1.3 kg TNT time bomb. … It was officially announced that the explosion in central Moscow last Tuesday resulted in 120 casualties.
Output: Six ACE “place” entities; 2 GPEs (not the FAC for Pushkin Sq.) are included in the END task (have named mention in doc):
• Russia entity seed db attribute -> IGDB place entry #12345 (primary name = Russian Federation)
• Moscow entity seed db attribute -> IGDB place entry #67890 (primary name = Moscow)
ACE END Status (no stats yet!)
• Task parameters decided
  • Language: English
  • Domain/Genre: news (for pilot annotation, at least)
  • Corpora: ACE (LDC-provided)
• Seed DB construction is nearing completion
• An initial annotation tool (Callisto-based) is being prepared
• Exploratory pilot annotation is planned
• No funding has yet been identified to support production annotation
Annotation Stats from AQUAINT
• Ground truth data in form of 18,900 annotated names in topically and geographically diverse corpora. ITA between 2 annotators on portion of the data:
  • 95.3% F-measure agreement on “link-or-no-link” decision
  • 87%-99% agreement on “which-link”, depending on gazetteer
• No stats on annotation speed (would be misleading anyway, since annotators did more than just annotate the linkage)
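The slide does not say how the two agreement figures were computed. One plausible reading, sketched here as an assumption rather than the study's actual method, treats one annotator as the reference for an F-measure over the link-or-no-link decision, then measures which-link agreement only on mentions that both annotators linked. The function name and the dictionary input format are hypothetical.

```python
def link_agreement(ann_a, ann_b):
    """Compare two annotators' mention -> gazetteer ID maps (None = no link).

    Returns (f_measure on link-or-no-link, agreement on which-link).
    """
    mentions = ann_a.keys() & ann_b.keys()
    linked_a = {m for m in mentions if ann_a[m] is not None}
    linked_b = {m for m in mentions if ann_b[m] is not None}
    both = linked_a & linked_b

    # Assumption: annotator A is the reference, annotator B the response.
    precision = len(both) / len(linked_b) if linked_b else 0.0
    recall = len(both) / len(linked_a) if linked_a else 0.0
    total = precision + recall
    f_measure = 2 * precision * recall / total if total else 0.0

    # Which-link: of the mentions both linked, how many got the same ID.
    same = sum(1 for m in both if ann_a[m] == ann_b[m])
    which_link = same / len(both) if both else 0.0
    return f_measure, which_link
```

Because the F-measure is symmetric in precision and recall, swapping which annotator counts as the reference does not change the result.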
Gazetteer DB Annotation Uses
• Cross-doc IE and QA are major drivers (gazetteer DB provides attributes that help determine coreference and containment relations across a corpus); should also be useful for multi-document summarization
• Any real-life application that requires geospatial grounding of textual place entities
• Advanced user interaction in Q&A
• Note: Not necessarily just for English docs/questions
• Note: Similar points could be made re use of DBs of organization/person/artifact/etc. entities!
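To illustrate the cross-document use, the sketch below groups mentions from different documents that carry the same gazetteer ID (coreference) and checks containment by walking a parent link. The function names, the tuple input format, and the `parent_of` mapping are assumptions made for the example; the slides do not say which attributes the gazetteer DB exposes for containment.

```python
from collections import defaultdict


def corefer_across_docs(annotations):
    """Group (doc_id, mention, gazetteer_id) triples by gazetteer ID.

    Mentions sharing an ID are treated as coreferent across the corpus.
    """
    clusters = defaultdict(list)
    for doc_id, mention, gazetteer_id in annotations:
        if gazetteer_id is not None:  # unlinked mentions are skipped
            clusters[gazetteer_id].append((doc_id, mention))
    return dict(clusters)


def is_contained_in(child_id, ancestor_id, parent_of):
    """Check containment by following hypothetical parent links upward."""
    seen = set()
    current = parent_of.get(child_id)
    while current is not None and current not in seen:
        if current == ancestor_id:
            return True
        seen.add(current)  # guard against cycles in the parent data
        current = parent_of.get(current)
    return False
```

For instance, if "Russia" in one document and "Russian Federation" in another both carry ID 12345, they land in the same cluster, and a parent link from 67890 to 12345 would record that Moscow is contained in Russia.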