Corpora and Evaluation Tools for Multilingual Named Entity Grammar Development
Christian Bering, Witold Drożdżyński, Gregor Erbach, Clara Guasch, Petr Homola, Sabine Lehmann, Hong Li, Hans-Ulrich Krieger, Jakub Piskorski, Ulrich Schäfer, Atsuko Shimada, Melanie Siegel, Feiyu Xu, Dorothee Ziegler-Eisele
DFKI GmbH, LT Lab, Saarbrücken; Saarland University, Computational Linguistics Dept, Saarbrücken; Acrolinx GmbH, Berlin
Outline • Motivation • SProUT – shallow processing toolkit • Multilingual NE grammar development • Shared output structures • Shared token classes • Shared grammars • Multilingual NE corpora • Evaluation tool
Motivation • Named Entity Recognition is fundamental to a number of information management applications (search engines, question answering, text mining …) • Many of these applications deal with different languages • Development of multilingual named entity grammars, supported by BMBF in the projects WHITEBOARD and COLLATE, and by the EU in the project AIRFORCE
Challenges in multilingual NER • Different alphabets, character sets and character encodings • Different tokenization conventions • Different time and currency formats • Different representations of proper names • Some names are identical across languages (New York, George Bush, IBM) • Others differ from language to language (London vs. Londres, Firenze vs. Florence vs. Florenz, NATO vs. OTAN, München vs. Munich vs. Monaco)
SProUT - Objectives • platform for the development of multilingual and domain adaptive shallow text processing and information extraction systems • trade-off between efficiency and expressiveness • modularity (fine-grained modeling of linguistic components into clear-cut modules) • portability and industrial standards
System Architecture [Figure: system architecture diagram, divided into a grammar development environment and online processing. Labelled components: input data (stream of text items), linguistic processing resources, lexical resources, JTFS, finite-state toolkit, regular compiler, shallow grammar, extended optimized FST representation, shallow grammar interpreter, structured output data.]
System Components • linguistic processing resources • tokenizer (easily adaptable for Indo-European languages) • gazetteer • morphology component (8 languages) • named entity recognition (6 languages) • core tools • JTFS • FSM toolkit • regular compiler • shallow grammar interpreter • tries for NLP processing
TFS and TFS-XML • TFS as data interchange format in SProUT • unification and subsumption check as basic operations for evaluation • compact XML encoding of typed feature structures (following TEI-SGML) • exchange format for linguistic resources: • grammars • feature structure tree banks • exchange format for visualization
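The two basic operations named above can be illustrated with a minimal sketch. It makes strong simplifying assumptions that are not taken from the slides: feature structures are plain nested dictionaries, atoms are strings that unify only when equal, and neither a type hierarchy nor coreference is modelled. The function names `unify` and `subsumes` are illustrative and are not the JTFS API.

```python
def unify(a, b):
    """Unify two structures; return the result, or None on a clash."""
    if isinstance(a, str) and isinstance(b, str):
        # Assumption: atoms unify only if identical (no type hierarchy).
        return a if a == b else None
    if isinstance(a, dict) and isinstance(b, dict):
        result = dict(a)
        for feature, value in b.items():
            if feature in result:
                merged = unify(result[feature], value)
                if merged is None:
                    return None  # clash below this feature
                result[feature] = merged
            else:
                result[feature] = value
        return result
    return None  # an atom never unifies with a complex structure


def subsumes(general, specific):
    """True if every constraint in `general` also holds in `specific`."""
    if isinstance(general, str) or isinstance(specific, str):
        return general == specific
    return all(
        feature in specific and subsumes(value, specific[feature])
        for feature, value in general.items()
    )


# Hypothetical rule constraint and input structure.
rule = {"SURNAME": "Müller"}
token = {"GIVEN_NAME": "Maria", "SURNAME": "Müller"}
unified = unify(rule, token)
```

In this sketch the rule constraint subsumes the input structure, so unification succeeds and yields the more specific structure; a conflicting SURNAME value would make `unify` return `None`.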
TFS-XML: Example
<FS type="pred_argument">
  <F name="PRED"> <FS type="übernehmen"/> </F>
  <F name="AGENT">
    <FS coref="1" type="argument">
      <F name="NAME"> <FS type="Maria_Müller"/> </F>
    </FS>
  </F>
  <F name="THEME">
    <FS coref="2" type="argument">
      <F name="NOM"> <FS type="Vorsitz"/> </F>
    </FS>
  </F>
</FS>
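As a rough illustration of how such an encoding can be consumed, the sketch below reads the example (written with straight quotes) using Python's standard `xml.etree.ElementTree` and turns it into nested dictionaries. The dictionary layout and the helper name `fs_to_dict` are assumptions made for this example; they are not part of SProUT.

```python
import xml.etree.ElementTree as ET

TFS_XML = """
<FS type="pred_argument">
  <F name="PRED"><FS type="übernehmen"/></F>
  <F name="AGENT"><FS coref="1" type="argument">
    <F name="NAME"><FS type="Maria_Müller"/></F>
  </FS></F>
  <F name="THEME"><FS coref="2" type="argument">
    <F name="NOM"><FS type="Vorsitz"/></F>
  </FS></F>
</FS>
""".strip()


def fs_to_dict(fs):
    """Convert an <FS> element into a nested dict (hypothetical layout)."""
    node = {"type": fs.get("type")}
    if fs.get("coref") is not None:
        node["coref"] = fs.get("coref")
    features = {}
    for f in fs.findall("F"):      # direct <F> children only
        value = f.find("FS")       # each feature holds one <FS> value here
        if value is not None:
            features[f.get("name")] = fs_to_dict(value)
    if features:
        node["features"] = features
    return node


tfs = fs_to_dict(ET.fromstring(TFS_XML))
```

The `coref` attributes are kept as plain strings; resolving them into shared substructures is left out of this sketch.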
Morphological Resources Indo-European language resources • English 200,000 entries (Mmorph (Multext)) • German 830,000 entries (Mmorph (Multext)) • French 225,000 entries (Mmorph (Multext)) • Spanish 570,000 entries (Mmorph (Multext)) • Italian 330,000 entries (Mmorph (Multext)) • Czech 600,000 entries (Institute of Formal and Applied Linguistics in Prague) Asian language resources • Chinese Shanxi-Tokenizer • Japanese ChaSen
Architecture • Mmorph fullform lexica are stored as a trie • external modules (Asian and Czech) are integrated via client/server [Figure: component diagram with the labels Tokeniser, ChaSen, Czech, Shanxi, Mmorph, Parser]
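Storing a fullform lexicon as a trie can be sketched as follows. This is a plain character trie written for illustration only; the class names, the analysis format and the two lexicon entries are invented for the example and do not reflect the actual SProUT data structures or Mmorph entries.

```python
class TrieNode:
    __slots__ = ("children", "analyses")

    def __init__(self):
        self.children = {}   # next character -> TrieNode
        self.analyses = []   # readings for the word form ending here


class FullformTrie:
    def __init__(self):
        self.root = TrieNode()

    def insert(self, form, analysis):
        """Add one reading for a full word form."""
        node = self.root
        for ch in form:
            node = node.children.setdefault(ch, TrieNode())
        node.analyses.append(analysis)

    def lookup(self, form):
        """Return all readings stored for exactly this word form."""
        node = self.root
        for ch in form:
            node = node.children.get(ch)
            if node is None:
                return []
        return list(node.analyses)


# Hypothetical entries: (lemma, description).
lexicon = FullformTrie()
lexicon.insert("Haus", ("Haus", "noun, singular"))
lexicon.insert("Häuser", ("Haus", "noun, plural"))
```

Word forms with a common prefix share nodes, so lookup cost depends on the length of the word form rather than on the number of entries.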
Unification [Figure: unification example with three panels: matched input structure; extended rule structure after match; fully unified structure]
COLLATE, Scientific Advisory Board Meeting, Saarland University, 22 November 2002
Multilingual Named Entity Grammars • Languages • English, French, German, Spanish • Chinese, Japanese • Grammar Style • MUC-7/MET-2 named entity classes with some variations • ENAMEX: person, location, organisation • TIMEX: time point, time span (instead of date, time) • NUMEX: percentage, money • Named entity types with internal attribute-value structures, e.g., span := timex & [FROM point, TO point].
Multilingual NE grammar development Our approach • Shared output structures • Shared token classes • Shared grammars
Shared Output Structures • The grammars for all six languages produce the same, semantically oriented output structures, defined in TDL
ne_type := sign & [DESCRIPTOR string].
enamex := ne_type.
ne-person := enamex & [TITLE list-of-strings,
                       GIVEN_NAME list-of-strings,
                       SURNAME list-of-strings,
                       P-POSITION list-of-strings,
                       NAME-SUFFIX string].
ne-location := enamex & [LOCTYPE loc-type,
                         LOCNAME string].
loc-type :< atom.
river := loc-type.
continent := loc-type.
country := loc-type.
province := loc-type.
city := loc-type.
Shared Token Classes • A single set of token classes is used for the European languages
Token class                  Example
NATURAL_NUMBER               12344
FLOATING_POINT_NUMBER        123,43
NUMBER_PERCENT_COMPOUND      34,4%
NUMBER_DOT_COMPOUND          234.345.545.
NUMBER_WORD_COMPOUND         2,4-fachen
DIGIT_SLASH_COMPOUND         12/01/1998
DIGIT_DASH_COMPOUND          12-01-1998
DIGIT_COLON_COMPOUND         15:13
ALL_CAPS_WORD                ABC
LOWERCASE_WORD               tokenization
FIRST_CAPITAL_WORD           Microsoft
MIXED_WORD_FIRST_CAPITAL     GmbH
MIXED_WORD_FIRST_LOWER       dKK
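To make the token classes concrete, here is a small classifier sketch. The class names come from the slide, but the regular expressions, their ordering and the fallback class `OTHER` are assumptions chosen so that the slide's examples are classified as shown; they are not the actual SProUT tokenizer definitions.

```python
import re

# Numeric and compound classes, tried in order (patterns are assumptions).
NUMERIC_CLASSES = [
    ("NATURAL_NUMBER", re.compile(r"\d+")),
    ("FLOATING_POINT_NUMBER", re.compile(r"\d+[.,]\d+")),
    ("NUMBER_PERCENT_COMPOUND", re.compile(r"\d+(?:[.,]\d+)?%")),
    ("NUMBER_DOT_COMPOUND", re.compile(r"\d+(?:\.\d+)+\.?")),
    ("NUMBER_WORD_COMPOUND", re.compile(r"\d+(?:[.,]\d+)?-[^\W\d_]+")),
    ("DIGIT_SLASH_COMPOUND", re.compile(r"\d+(?:/\d+)+")),
    ("DIGIT_DASH_COMPOUND", re.compile(r"\d+(?:-\d+)+")),
    ("DIGIT_COLON_COMPOUND", re.compile(r"\d+(?::\d+)+")),
]
LETTERS_ONLY = re.compile(r"[^\W\d_]+")  # Unicode letters, no digits


def classify(token):
    """Return the token class name for a single token string."""
    for name, pattern in NUMERIC_CLASSES:
        if pattern.fullmatch(token):
            return name
    if LETTERS_ONLY.fullmatch(token):
        if token.isupper():
            return "ALL_CAPS_WORD"
        if token.islower():
            return "LOWERCASE_WORD"
        if token[0].isupper():
            if token[1:].islower():
                return "FIRST_CAPITAL_WORD"
            return "MIXED_WORD_FIRST_CAPITAL"
        return "MIXED_WORD_FIRST_LOWER"
    return "OTHER"  # hypothetical fallback, not a class from the slide


classes = [classify(t) for t in ["12/01/1998", "GmbH", "34,4%"]]
```

Because the classes depend only on character shape, the same definitions can serve several languages that share a script, which is the point of the shared token classes.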
Shared Grammars • SProUT supports re-use and extension of grammars • This feature has been used for the development of multilingual parallel grammars for English, Spanish and French • Common parts of the grammar for different languages (e.g. date formats like "20.10.2003") are stored in one file, and combined with the language-specific parts of the grammars (for structures like "20 de octubre del 2003") • Common proper names such as "Amsterdam" are stored in a generic gazetteer, while language-specific names such as "Brussels", "Bruxelles", "Bruselas" are stored in language-specific lists
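The layering of shared and language-specific resources can be sketched in a few lines. Regular expressions and dictionaries stand in for SProUT grammar files and gazetteers here, so this shows only the organisation, not the SProUT formalism; all names and patterns are assumptions made for the example.

```python
import re
from collections import ChainMap

# Shared resources, used by every language.
SHARED_GAZETTEER = {"Amsterdam": "city"}
SHARED_DATE_PATTERNS = [re.compile(r"\d{1,2}\.\d{1,2}\.\d{4}")]

# Language-specific additions.
LANGUAGE_GAZETTEERS = {
    "en": {"Brussels": "city"},
    "fr": {"Bruxelles": "city"},
    "es": {"Bruselas": "city"},
}
LANGUAGE_DATE_PATTERNS = {
    "es": [re.compile(r"\d{1,2} de [a-z]+ del? \d{4}")],
}


def resources_for(lang):
    """Combine shared and language-specific resources for one language."""
    # Language-specific entries take precedence over shared ones.
    gazetteer = ChainMap(LANGUAGE_GAZETTEERS.get(lang, {}), SHARED_GAZETTEER)
    patterns = SHARED_DATE_PATTERNS + LANGUAGE_DATE_PATTERNS.get(lang, [])
    return gazetteer, patterns


def is_date(text, lang):
    _, patterns = resources_for(lang)
    return any(p.fullmatch(text) for p in patterns)


spanish_gazetteer, _ = resources_for("es")
```

A change to the shared date pattern takes effect for every language at once, while the Spanish pattern stays invisible to the English and French configurations.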
Advantages of shared grammars • Grammars are more easily re-usable and extensible • Consistency is improved, as changes to shared structures need to be made in only one place • Grammar development is more efficient: less time-consuming and less error-prone • The same methodology has been applied for combining general-language grammars with domain-specific grammars
Re-use of corpora • We use NE-annotated corpora for grammar development and evaluation of grammars • Special-purpose annotation of corpora is only feasible for large-scale evaluations such as MUC, but exceeds the resources of most application-oriented projects • Corpora from other projects are re-used in order to save labour and have larger evaluation resources • There may be mismatches between corpus annotation and grammar output
Multilingual NE corpora • English corpora from the MUC-7 evaluations • Japanese and Chinese corpora annotated according to MUC-7 conventions • German corpora annotated in the COLLATE project with a superset of MUC-7 annotations • German, English, French and Spanish texts annotated with named entities, from the Joint Research Centre • Spanish data from the CoNLL-2002 Language-Independent NER task • English and French corpora from the business domain annotated with named entities according to the MUC-7 guidelines within our project
Issues with re-use of corpora • The corpora contain differences in • Annotation format • Types of named entities annotated • Attributes used to describe each NE • Superficial differences in annotation format are handled by conversion to XML • Differences in the content of the annotation are not handled by modification of the corpora, but rather by making our evaluation tool more flexible
Structure of Annotated Articles
<Firmenmeldung Annotator="…" ID="…" Status="…">
  <teiHeader>
    <fileDesc>
      <titleStmt> <author>…</author> </titleStmt>
      <publicationStmt> <publisher>…</publisher> <pubPlace>…</pubPlace> <date>…</date> </publicationStmt>
      <sourceDesc>
        <bibl> <agency>…</agency> <page>…</page> <topic>…</topic> <domain>…</domain> </bibl>
      </sourceDesc>
    </fileDesc>
  </teiHeader>
  <sourceText>… </sourceText>
  …
  <text> … </text>
</Firmenmeldung>
[Slide callouts: semantic relations; named entities + coreference]
Annotation of Semantic Relations
Robert Bosch GmbH, Stuttgart: Der Kfz-Zulieferkonzern übernimmt zum 1. Januar die van Doorne's Transmissie b. v., Tilburg. Das niederländische Unternehmen, das im letzten Jahr mit 220 Mitarbeitern einen Umsatz von 45 Millionen DM erzielte, entwickelt stufenlose automatische Automobilgetriebe (CVT = Continuously Variable Transmission) und produziert Komponenten für CVT.
<Firma Branche="Kfz-Zulieferkonzern" Firma="Robert Bosch" Rechtsform="GmbH" Sitz="Stuttgart"/>
<Firma Firma="van Doorne's Transmissie" Land="NL" Rechtsform="b. v." Sitz="Tilburg"/>
<Beschaeftigung Firma="van Doorne's Transmissie" Mitarbeiter="220"/>
<Umsatz Betrag="45 Mill." Firma="van Doorne's Transmissie" Waehrung="DEM"/>
<Uebernahme Kaeufer="Robert Bosch" Objekt="van Doorne's Transmissie"/>
• acquisition • company • corporateStructure • dividends • newBusiness • offer • occupation • premiumIncome • profit • relocation • revenue • turnover
Annotation of Named Entities
Robert Bosch GmbH, Stuttgart: Der Kfz-Zulieferkonzern übernimmt zum 1. Januar die van Doorne's Transmissie b. v., Tilburg. Das niederländische Unternehmen, das im letzten Jahr mit 220 Mitarbeitern einen Umsatz von 45 Millionen DM erzielte, entwickelt stufenlose automatische Automobilgetriebe (CVT = Continuously Variable Transmission) und produziert Komponenten für CVT.
<NE Organisation="Robert Bosch GmbH">Robert Bosch GmbH</NE>, <NE Ort="Stuttgart">Stuttgart</NE>: Der Kfz-Zulieferkonzern übernimmt zum <NE Zeit="01.01.">1. Januar</NE> die <NE Organisation="van Doorne's Transmissie b. v.">van Doorne's Transmissie</NE>, <NE Ort="Tilburg">Tilburg</NE>.
• function • location • money • number • ordinalNumber • organization • percent • personName • productName • scaleUnit • time
Annotation of Coreference
LM Ericsson AB, Stockholm: Der schwedische Elektronikkonzern hat …
<exp id="101">LM Ericsson AB</exp>, Stockholm: <exp id="102"><ptr src="101"/>Der schwedische Elektronikkonzern</exp> hat …
• 3rd person personal pronouns • 3rd person possessive pronouns and determiners • demonstrative pronouns and determiners • indefinite pronouns and determiners • anaphoric and cataphoric adverbs • elliptical nominal phrases • anaphoric and cataphoric nominal phrases
Cooperation: Annotation of FR
TIGER: syntactic annotation
<s id="s37"><graph root="s37_503">
  <terminals>
    <t id="s37_1" word="Ausgerechnet" pos="ADJD" morph="--" />
    <t id="s37_2" word="Iggy" pos="NE" morph="Masc.Nom.Sg" />
    <t id="s37_3" word="Pop" pos="NE" morph="*.Nom.Sg" />
    <t id="s37_4" word="verkörpert" pos="VVFIN" morph="3.Sg.Pres.Ind" />
    <t id="s37_5" word="gesanglich" pos="ADJD" morph="Pos" />...
  </terminals>
  <nonterminals>
    <nt id="s37_500" cat="MPN"> <edge label="PNC" idref="s37_2"/> <edge label="PNC" idref="s37_3"/> </nt>
    <nt id="s37_501" cat="NP"> <edge label="NK" idref="s37_6"/> <edge label="NK" idref="s37_7"/> </nt>...
  </nonterminals>
</graph></s>
LLX: FrameNet annotation
<REQUEST><SPKR> SPD </SPKR> <FEE> fordert </FEE><ADD> Koalition </ADD> <MSG> zu Gespr"ach "uberReform </MSG> <FEE> auf </FEE>. </REQUEST>
<CONVERSATION>SPD fordert <INTLC-1> Koalition </INTLC-1> zu<FEE> Gespr"ach </FEE> <TOPIC> "uber Reform </TOPIC>auf. </CONVERSATION>
Cooperation: Multi-layer Annotation
COLLATE: semantic annotation
<Firmenmeldung Annotator="keku" ID="SZ_401" Status="1">
  <teiHeader>
    <fileDesc>
      <titleStmt> <author/> </titleStmt>
      <publicationStmt> <publisher>SZ</publisher> <date>1995-03-31</date> </publicationStmt>
      <sourceDesc>
        <bibl> <agency>vwd</agency><page>22</page> <topic>Wirtschaft</topic> <domain>Firmenmeldungen</domain></bibl>
      </sourceDesc>
    </fileDesc>
  </teiHeader>
  <sourceText>Datev eG, Nürnberg: Der EDV-Dienstleister für Steuerberater hat 1994 den Umsatz laut vorläufigen Zahlen um 5% auf rund 980 Mill. DM gesteigert. Die Anzahl der Mitarbeiter ist auf 4605 (4474) Beschäftigte gestiegen, die Zahl der Genossenschaftsmitglieder zog auf 34246 (33551) an. Die Investitionen von 115 (93) Mill. DM haben sich in erster Linie auf die Modernisierung der Großrechner, den PC-Bereich sowie auf ein automatisches Versandlager konzentriert.</sourceText>
  <Firma Branche1="EDV-Dienstleister für Steuerberater" Firma="Datev eG" Sitz1="Nürnberg" Rechtsform="eG"/>
  <Umsatz Firma="Datev eG" Differenz="5%" Trend="plus" Betrag1="980 Mill." Waehrung1="DEM" Beschreibung1="rund" Zeit="1994"/>
  <Beschaeftigung Firma="Datev eG" Trend="plus" Mitarbeiter1_alt="4474" Mitarbeiter1_neu="4605" Zeit="1994"/>
  <text> <NE Organisation="Datev eG">Datev eG</NE>, <NE Ort="Nürnberg">Nürnberg</NE>: Der EDV-Dienstleister für Steuerberater hat <NE Zeit="1994">1994</NE> den Umsatz laut vorläufigen Zahlen um <NE Prozentzahl="5%">5%</NE> auf <NE Geld="rund 980 Mill. DEM">rund 980 Mill. DM</NE> gesteigert. Die Anzahl der Mitarbeiter ist auf <NE Zahl="4605">4605</NE> (<NE Zahl="4474">4474</NE>) Beschäftigte gestiegen, die Zahl der Genossenschaftsmitglieder zog auf <NE Zahl="34246">34246</NE> (<NE Zahl="33551">33551</NE>) an. Die Investitionen von <NE Geld="115 (93) Mill. DEM">115 (93) Mill. DM</NE> haben sich in erster Linie auf die Modernisierung der Großrechner, den PC-Bereich sowie auf ein automatisches Versandlager konzentriert. </text>
</Firmenmeldung>
TIGER: syntactic annotation (the same <s id="s37"> sentence graph as on the preceding slide)
LLX: FrameNet annotation (the same <REQUEST> and <CONVERSATION> frames as on the preceding slide)
=> multi-layer annotated language resource
Evaluation Tool: jTaCo • Evaluates grammars with respect to an annotated corpus • Removes annotations from the corpus, and feeds the unannotated text to the grammar • Compares grammar output with the original annotated texts • Produces detailed statistics, evaluation scores, and diagnostic output
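The evaluation cycle described above can be sketched as follows. This is not jTaCo's implementation: the inline `<NE Class="...">` markup follows the COLLATE examples shown earlier, while the function names, the exact-match comparison and the use of precision, recall and F1 as scores are assumptions made for illustration.

```python
import re

# Inline markup as in the COLLATE examples: <NE Class="value">text</NE>
NE_TAG = re.compile(r'<NE (\w+)="[^"]*">(.*?)</NE>', re.S)


def strip_annotations(annotated):
    """Return (plain text, gold spans) for an inline-annotated string.

    Each span is (start, end, class) with offsets into the plain text.
    """
    parts, spans = [], set()
    pos = offset = 0
    for m in NE_TAG.finditer(annotated):
        before = annotated[pos:m.start()]
        parts.append(before)
        offset += len(before)
        inner = m.group(2)
        spans.add((offset, offset + len(inner), m.group(1)))
        parts.append(inner)
        offset += len(inner)
        pos = m.end()
    parts.append(annotated[pos:])
    return "".join(parts), spans


def score(gold, predicted):
    """Exact-match precision, recall and F1 over span sets."""
    true_positives = len(gold & predicted)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return precision, recall, f1


annotated = ('<NE Organisation="Datev eG">Datev eG</NE>, '
             '<NE Ort="Nürnberg">Nürnberg</NE>: Der EDV-Dienstleister')
plain, gold = strip_annotations(annotated)

# Hypothetical grammar output: second entity found, but with the wrong class.
predicted = {(0, 8, "Organisation"), (10, 18, "Organisation")}
precision, recall, f1 = score(gold, predicted)
```

In a real run the plain text would be passed to the grammar, and `predicted` would be derived from the grammar output rather than written by hand.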
Configuration of jTaCo jTaCo can be configured to deal with various problems in evaluating grammars with respect to a corpus: • Corpus and grammar may use different classes of NE, or different granularities (e.g. organization and subclasses company, university etc.) • Class equivalence and subclass relationships can be declared. • Extent of NE may be different (CEO Bill Gates vs. Bill Gates) • Left or right boundary may be mismatched. The size of the allowable mismatch can be specified for each NE class. • Markup of the corpus may be textually oriented (XML tags) while grammar output may be a different data structure (e.g. semantics encoded in a feature structure) • No general solution is possible. In the case of SProUT, feature structures are linked with input tokens, so that a correspondence can be established (under development).
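A configurable comparison of this kind might look like the sketch below. The slide does not give jTaCo's configuration format, so everything here is an assumption: the table names, the choice to measure boundary slack in characters, the example values, and the decision to accept a match when either class is an ancestor of the other.

```python
# Hypothetical configuration tables.
EQUIVALENT = {"organisation": "organization", "Ort": "location"}
SUBCLASS_OF = {"company": "organization", "university": "organization"}
# Allowed (left, right) boundary slack per class, in characters (assumed unit).
TOLERANCE = {"person": (4, 0)}


def normalize(cls):
    """Map a class name to its declared equivalent, if any."""
    return EQUIVALENT.get(cls, cls)


def ancestors(cls):
    """The class itself followed by its superclasses."""
    chain = [cls]
    while chain[-1] in SUBCLASS_OF:
        chain.append(SUBCLASS_OF[chain[-1]])
    return chain


def class_compatible(a, b):
    a, b = normalize(a), normalize(b)
    return a in ancestors(b) or b in ancestors(a)


def spans_match(gold, predicted):
    """Compare two (start, end, class) spans under the configuration."""
    g_start, g_end, g_cls = gold
    p_start, p_end, p_cls = predicted
    if not class_compatible(g_cls, p_cls):
        return False
    left, right = TOLERANCE.get(normalize(g_cls), (0, 0))
    return abs(g_start - p_start) <= left and abs(g_end - p_end) <= right


# "CEO Bill Gates": corpus marks "Bill Gates" (4-14), grammar marks all (0-14).
title_case = spans_match((4, 14, "person"), (0, 14, "person"))
```

Keeping these settings outside the corpora means the annotated texts stay unmodified, in line with the approach of adapting the evaluation tool rather than the corpus.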
Conclusion • We discussed a fundamental problem in re-using heterogeneously annotated corpora for multilingual grammar development • With increasing availability of annotated corpora, re-use becomes attractive and cost-effective • We described methods and tools for re-using annotated corpora for development and evaluation of NE grammars