Corpora and Evaluation Tools for Multilingual Named Entity Grammar Development

Corpora and Evaluation Tools for Multilingual Named Entity Grammar Development Christian Bering, Witold Drożdżyński, Gregor Erbach, Clara Guasch, Petr Homola, Sabine Lehmann, Hong Li, Hans-Ulrich Krieger, Jakub Piskorski, Ulrich Schäfer, Atsuko Shimada, Melanie Siegel, Feiyu Xu, Dorothee Ziegler-Eisele DFKI GmbH, LT Lab, Saarbrücken Saarland UniversityComputational Linguistics Dept, Saarbrücken Acrolinx GmbH, Berlin

Outline • Motivation • SPROUT – shallow processing toolkit • Multilingual NE grammar development • Shared output structures • Shared token classes • Shared grammars • Multilingual NE corpora • Evaluation tool

Motivation • Named Entity Recognition is fundamental to a number of information management applications (search engines, question answering, text mining …) • Many of these applications deal with different languages • Development of multilingual named entity grammars, supported by BMBF in the projects WHITEBOARD and COLLATE, and by the EU in the project AIRFORCE

Challenges in multilingual NER • Different alphabets, character sets and character encodings • Different tokenization conventions • Different time and currency formats • Different representations of proper names • Identical (New York, George Bush, IBM) • Different for some languages (London vs. Londres, Firenze vs. Florence vs. Florenz, NATO vs. OTAN, München vs. Munich vs. Monaco)

SProUT - Objectives • platform for the development of multilingual and domain adaptive shallow text processing and information extraction systems • trade-off between efficiency and expressiveness • modularity (fine-grained modeling of linguistic components into clear-cut modules) • portability and industrial standards

LINGUISTIC PROCESSING RESOURCES LEXICAL RESOURCES INPUT DATA JTFS STREAM OFTEXT ITEMS …. [..] [..] [..] …. SHALLOW GRAMMAR EXTENDED OPTIMIZED FST REPRES. STRUCTURED OUTPUT DATA REGULAR COMPILER FINITE-STATE TOOLKIT SHALLOW GRAMMAR INTERPRETER System Architecture G R A M M A R D E V E L O P M E N T E N V I R O N M E N T O N L I N E P R O C E S S I N G

System Components • linguistic processing resources • tokenizer (easily adaptable for indo-european languages) • gazetteer • morphology component (8 languages) • named entity recognition (6 languages) • core tools • JTFS • FSM toolkit • regular compiler • shallow grammar interpreter • tries for NLP processing

TFS and TFS-XML • TFS as data interchange format in SProUT • unification and subsumption check as basic operations for evaluation • compact XML encoding of typed feature structures (following TEI-SGML) • exchange format for linguistic resources: • grammars • feature structure tree banks • exchange format for visualization

TFS-XML: Example <FS type="pred_argument"> <F name="PRED"> <FS type=„übernehmen"/> </F> <F name="AGENT"><FS coref="1" type="argument"> <F name="NAME"> <FS type="Maria_Müller"/> </F> </FS> </F> <F name="THEME"><FS coref="2" type="argument"> <F name="NOM"> <FS type="Vorsitz"/> </F> </FS> </F> </FS>

Morphological Resources Indo-European language resources • English 200,000 entries (Mmorph (Multext)) • German 830,000 entries (Mmorph (Multext)) • French 225,000 entries (Mmorph (Multext)) • Spanish 570,000 entries (Mmorph (Multext)) • Italian 330,000 entries (Mmorph (Multext)) • Czech 600,000 entries (Institue of Formal and Applied Linguistics in Prague) Asian language resources • Chinese Shanxi-Tokenizer • Japanese ChaSen

Architecture • Mmorph fullform lexica are stored as trie • external modules (Asian and Czech) are integrated via Client/Server Tokeniser ChaSen Czech Shanxi Mmorph Parser

A SProUT Grammar Rule (XTDL) *

Unification Fully Unified Structure Extended Rule Structure After Match Matched input structure

Title of Slid COLLATE, Scientific Advisory Board Meeting, Saarland University, 22 November 2002

Multilingual Named Entity Grammars • Languages • English, French, German, Spanish • Chinese, Japanese • Grammar Style • MUC-7/MET-2 named entity classes with some variations • ENAMEX: person, location, organisation • TIMEX: time point, time span (instead of date, time) • NUMEX: percentage, money • Named entity types with internal attribute-value structures, e.g., span := timex & [FROM point, TO point ].

Multilingual NE grammar development Our approach • Shared output structures • Shared token classes • Shared grammars

Shared Output Structures • The grammars for all six languages produce the same, semantically oriented output structures, defined in TDL ne_type := sign & [DESCRIPTOR string]. enamex := ne_type. ne-person := enamex & [TITLE list-of-strings, GIVEN_NAME list-of-strings, SURNAME list-of-strings, P-POSITION list-of-strings, NAME-SUFFIX string]. ne-location := enamex & [LOCTYPE loc-type, LOCNAME string]. loc-type :< atom. river := loc-type. continent := loc-type. country := loc-type. province := loc-type. city := loc-type.

Shared Token Classes • A single set of token classes is used for the European languages NATURAL_NUMBER 12344 FLOATING_POINT_NUMBER 123,43 NUMBER_PERCENT_COMPOUND 34,4% NUMBER_DOT_COMPOUND 234.345.545. NUMBER_WORD_COMPOUND 2,4-fachen DIGIT_SLASH_COMPOUND 12/01/1998 DIGIT_DASH_COMPOUND 12-01-1998 DIGIT_COLON_COMPOUND 15:13 ALL_CAPS_WORD ABC LOWERCASE_WORD tokenization FIRST_CAPITAL_WORD Microsoft MIXED_WORD_FIRST_CAPITAL GmbH MIXED_WORD_FIRST_LOWER dKK

Shared Grammars • SPROUT supports re-use and extension of grammars • This feature has been used for the development of multilingual parallel grammars for English, Spanish and French • Common parts of the grammar for different languages (e.g. date formats like „20.10.2003“) are stored in one file, and combined with the language-specific parts of the grammars (for structures like „20 de octubre del 2003“) • Common proper names such as „Amsterdam“ are stored in generic gazeteer, while language-specific names such as „Brussels“, „Bruxelles“, „Bruselas“ are stored in language-specific lists

Advantages of shared grammars • Grammars are more easily re-usable and extendible • Consistency is improved, as changes must only be made in one place for shared structures • Grammar development is more efficient, and less time-consuming and error-prone • The same methodology has been applied for combining general-language grammars with domain-specific grammars

Re-use of corpora • We use NE-annotated corpora for grammar development and evaluation of grammars • Special-purpose annotation of corpora is only feasible for large-scale evaluations such as MUC, but exceeds the resources of most application-oriented projects • Corpora from other projects are re-used in order to save labour and have larger evaluation resources • There may be mismatches between corpus annotation and grammar output

Multilingual NE corpora • English corpora from the MUC7 evaluations • Japanese and Chinese corpora annotated according to MUC7 conventions • German corpora annotated in the COLLATE project with a superset of MUC7 annotations • German, English, French and Spanish texts annotated with Named Entities, from Joint Research Centre • Spanish data from the CoNLL-2002 Language-Independent NER task • English and French corpora from the business domain annotated with named entities according to the MUC7 guidelines within our project

Issues with re-use of corpora • The corpora contain differences in • Annotation format • Types of named entities annotated • Attributes used to describe each NE • Superficial differences in annotation format are handled by conversion to XML • Differences in the content of the annotation are not handled by modification of the corpora, but rather by making our evaluation tool more flexible

semantic relations named entities + coreference Structure of Annotated Articles <Firmenmeldung Annotator=“…” ID=“…” Status=“…”> <teiHeader> <fileDesc> <titleStmt> <author>…</author> </titleStmt> <publicationStmt> <publisher>…</publisher> <pubPlace>…</pubPlace> <date>…</date> </publicationStmt> <sourceDesc> <bibl> <agency>…</agency> <page>…</page> <topic>…</topic> <domain>…</domain> </bibl> </sourceDesc> </fileDesc> </teiHeader> <sourceText>… </sourceText> … <text> … </text> </Firmenmeldung>

Annotation of Semantic Relations Robert Bosch GmbH, Stuttgart: Der Kfz-Zulieferkonzern übernimmt zum 1. Januar die van Doorne's Transmissie b. v., Tilburg. Das niederländische Unternehmen, das im letzten Jahr mit 220 Mitarbeitern einen Umsatz von 45 Millionen DM erzielte, entwickelt stufenlose auto- matische Automobilgetriebe (CVT = Continuously Variable Transmission) und produziert Komponenten für CVT. <Firma Branche="Kfz-Zulieferkonzern" Firma="Robert Bosch" Rechtsform="GmbH" Sitz="Stuttgart"/><Firma Firma="van Doorne's Transmissie" Land="NL" Rechtsform="b. v." Sitz="Tilburg"/> <Beschaeftigung Firma="van Doorne's Transmissie" Mitarbeiter="220"/> <Umsatz Betrag="45 Mill." Firma="van Doorne's Transmissie" Waehrung="DEM"/> <Uebernahme Kaeufer="Robert Bosch" Objekt="van Doorne's Transmissie"/> • acquisition • company • corporateStructure • dividends • newBusiness • offer • occupation • premiumIncome • profit • relocation • revenue • turnover

Annotation of Named Entities Robert Bosch GmbH, Stuttgart: Der Kfz-Zulieferkonzern übernimmt zum 1. Januar die van Doorne's Transmissie b. v., Tilburg. Das niederländische Unternehmen, das im letzten Jahr mit 220 Mitarbeitern einen Umsatz von 45 Millionen DM erzielte, entwickelt stufenlose auto- matische Automobilgetriebe (CVT = Continuously Variable Transmission) und produziert Komponenten für CVT. <NE Organisation="Robert Bosch GmbH">Robert Bosch GmbH</NE> ,<NE Ort="Stuttgart">Stuttgart </NE> : Der Kfz-Zulieferkonzern übernimmt zum <NE Zeit="01.01."> 1. Januar</NE> die <NE Organisation="van Doorne's Transmissie b. v.">van Doorne's Transmissie</NE> , <NE Ort="Tilburg">Tilburg</NE> . • function • location • money • number • ordinalNumber • organization • percent • personName • productName • scaleUnit • time

Annotation of Coreference LM Ericsson AB, Stockholm: Der schwedische Elektronikkonzern hat … <expid="101">LM Ericsson AB</exp>, Stockholm: <expid="102"><ptrsrc="101"/> Der schwedische Elektronikkonzern</exp> hat … • 3rd person personal pronouns • 3rd person possessive pronouns and determiners • demonstrative pronouns and determiners • indefinite pronouns and determiners • anaphoric and cataphoric adverbs • elliptical nominal phrases • anaphoric and cataphoric nominal phrases

Cooperation: Annotation of FR TIGER: syntactic annotation LLX: FrameNet annotation <s id="s37"><graph root="s37_503"><terminals> <t id="s37_1" word="Ausgerechnet" pos="ADJD" morph="--" /> <t id="s37_2" word="Iggy" pos="NE" morph="Masc.Nom.Sg" /> <t id="s37_3" word="Pop" pos="NE" morph="*.Nom.Sg" /> <t id="s37_4" word="verkörpert" pos="VVFIN" morph="3.Sg.Pres.Ind" /> <t id="s37_5" word="gesanglich" pos="ADJD" morph="Pos" />...</terminals><nonterminals> <nt id="s37_500" cat="MPN"> <edge label="PNC" idref="s37_2"/> <edge label="PNC" idref="s37_3"/> </nt> <nt id="s37_501" cat="NP"> <edge label="NK" idref="s37_6"/> <edge label="NK" idref="s37_7"/> </nt>...</nonterminals></graph></s> <REQUEST><SPKR> SPD </SPKR> <FEE> fordert </FEE><ADD> Koalition </ADD> <MSG> zu Gespr"ach "uberReform </MSG> <FEE> auf </FEE>. </REQUEST><CONVERSATION>SPD fordert <INTLC-1> Koalition </INTLC-1> zu<FEE> Gespr"ach </FEE> <TOPIC> "uber Reform </TOPIC>auf. </CONVERSATION>

Cooperation: Multi-layer Annotation TIGER: syntactic annotation <Firmenmeldung Annotator="keku" ID="SZ_401" Status="1"> <teiHeader> <fileDesc> <titleStmt> <author/> </titleStmt> <publicationStmt> <publisher>SZ</publisher> <date>1995-03-31</date> </publicationStmt> <sourceDesc> <bibl> <agency>vwd</agency><page>22</page> <topic>Wirtschaft</topic> <domain>Firmenmeldungen</domain></bibl> </sourceDesc> </fileDesc></teiHeader> <sourceText>Datev eG, NÃ¼rnberg: Der EDV-Dienstleister fÃ¼r Steuerberater hat 1994 den Umsatz laut vorlÃ¤ufigen Zahlen um 5% auf rund 980 Mill. DM gesteigert. Die Anzahl der Mitarbeiter ist auf 4605 (4474) BeschÃ¤ftigte gestiegen, die Zahl der Genossenschaftsmitglieder zog auf 34246 (33551) an. Die Investitionen von 115 (93) Mill. DM haben sich in erster Linie auf die Modernisierung derGroÃrechner, den PC-Bereich sowie auf ein automatisches Versandlager konzentriert.</sourceText> <Firma Branche1="EDV-Dienstleister fÃ¼r Steuerberater" Firma="Datev eG" Sitz1="NÃ¼rnberg" Rechtsform="eG"/> <Umsatz Firma="Datev eG" Differenz="5%" Trend="plus" Betrag1="980 Mill." Waehrung1="DEM" Beschreibung1="rund" Zeit="1994"/> <Beschaeftigung Firma="Datev eG" Trend="plus" Mitarbeiter1_alt="4474" Mitarbeiter1_neu="4605" Zeit="1994"/> <text> <NE Organisation="Datev eG">Datev eG</NE>, <NE Ort="NÃ¼rnberg">NÃ¼rnberg</NE>: Der EDV-Dienstleister fÃ¼r Steuerberater hat <NE Zeit="1994">1994</NE> den Umsatz laut vorlÃ¤ufigen Zahlen um <NE Prozentzahl="5%">5%</NE> auf <NE Geld="rund 980 Mill. DEM">rund 980 Mill. DM</NE> gesteigert. Die Anzahl der Mitarbeiter ist auf <NE Zahl="4605">4605</NE> (<NE Zahl="4474">4474</NE>) BeschÃ¤ftigte gestiegen, die Zahl der Genossenschaftsmitglieder zog auf <NE Zahl="34246">34246</NE> (<NE Zahl="33551">33551</NE>) an. Die Investitionen von <NE Geld="115 (93) Mill. DEM">115 (93) Mill. DM</NE> haben sich in erster Linie auf die Modernisierung der GroÃrechner, den PC-Bereich sowie auf ein automatisches Versandlager konzentriert. </text></Firmenmeldung> COLLATE: semantic annotation <s id="s37"><graph root="s37_503"><terminals> <t id="s37_1" word="Ausgerechnet" pos="ADJD" morph="--" /> <t id="s37_2" word="Iggy" pos="NE" morph="Masc.Nom.Sg" /> <t id="s37_3" word="Pop" pos="NE" morph="*.Nom.Sg" /> <t id="s37_4" word="verkörpert" pos="VVFIN" morph="3.Sg.Pres.Ind" /> <t id="s37_5" word="gesanglich" pos="ADJD" morph="Pos" />...</terminals><nonterminals> <nt id="s37_500" cat="MPN"> <edge label="PNC" idref="s37_2"/> <edge label="PNC" idref="s37_3"/> </nt> <nt id="s37_501" cat="NP"> <edge label="NK" idref="s37_6"/> <edge label="NK" idref="s37_7"/> </nt>...</nonterminals></graph></s> LLX: FrameNet annotation <REQUEST><SPKR> SPD </SPKR> <FEE> fordert </FEE><ADD> Koalition </ADD> <MSG> zu Gespr"ach "uberReform </MSG> <FEE> auf </FEE>. </REQUEST><CONVERSATION>SPD fordert <INTLC-1> Koalition </INTLC-1> zu<FEE> Gespr"ach </FEE> <TOPIC> "uber Reform </TOPIC>auf. </CONVERSATION> => multi-layer annotated language resource

Evaluation Tool: jTaCo • Evaluates grammars wrt. an annotated corpus • Removes annotations from corpus, and feeds unannotated text to grammar • Compares grammar output with original annotated texts • Produces detailed statistics, evaluation scores, and diagnostic output

Configuration of jTaCo jTaCo can be configured to deal with various problems in evaluating grammars wrt. a corpus: • Use of different classes of NE, or different granularities (e.g. organization and subclasses company, university etc.) • Declaration of class equivalence and subclass relationships. • Extent of NE may be different (CEO Bill Gates vs. Bill Gates) • Left or right boundary may be mismatched. Size of allowable mismatch can be specified for each NE class. • Markup of corpus may be textually oriented (XML tags) while grammar output may be a different datastructure (e.g. semantics encoded in feature structure) • No general solution is possible. In case of SPROUT, feature structures are linked with input tokens, so that a correspondence can be established (under development).

Architecture of jTaCo

Conclusion • We discussed a fundamental problem in re-using heterogeneously annotated corpora for multilingual grammar development • With increasing availability of annotated corpora, re-use becomes attractive and cost-effective • We described methods and tools for re-using annotated corpora for development and evaluation of NE grammars

Corpora and Evaluation Tools for Multilingual Named Entity Grammar Development

Corpora and Evaluation Tools for Multilingual Named Entity Grammar Development

Presentation Transcript

Named Entity Recognition and Transliteration for 50 Languages

Named Entity Recognition

Multilingual Tools

Named Entity Classification

Named Entity Recognition

CS544: Named Entity Recognition and Classification

ACCURAT: Metrics for the evaluation of comparability of multilingual corpora

Biomedical Named Entity Recognition

Named Entity Recognition

Parallel Corpora for Multilingual Ontology Learning

Term Informativeness for Named Entity Detection

Named Entity Recognition

Named Entity Tagging

Named Entity Discovery from Multilingual Corpora

Using WordNet Predicates for Multilingual Named Entity Recognition

Using Corpora and Evaluation Tools

Unsupervised Models for Named Entity Classifcation

NAMED ENTITY RECOGNITION

Named Entity Extraction

Named Entity Recognition

Named Entity Tagging