
Foundations of Language Science and Technology - Corpus Linguistics - Silvia Hansen-Schirra


Presentation Transcript


  1. Foundations of Language Science and Technology - Corpus Linguistics - Silvia Hansen-Schirra

  2. Outline • TiGer Treebank • TiGer Search

  3. The TiGer Treebank
     TIGER: LinguisTic Interpretation of a GERman Corpus
     • Project partners: Institute of Natural Language Processing (IMS) in Stuttgart, Institut für Germanistik in Potsdam, Department of Computational Linguistics and Phonetics in Saarbrücken
     • German treebanks: Verbmobil Corpus (only spoken language), NEGRA Corpus and Tuebingen Treebank (only 20,000 sentences)
     • The need for a large and comprehensive German treebank:
       - Data for the testing and training of statistically based methods in natural language processing
       - Basis for empirical language research
     • TIGER Corpus:
       - First release (mid 2003): 40,000 sentences of newspaper text (Frankfurter Rundschau, full articles)
       - Second release (Christmas 2005): 50,000 sentences
       - Together with the 20,000 NEGRA sentences, comparable to the Penn Treebank in size (1.5 million words)

  4. TiGer: Levels of annotation
     • Node labels: phrase categories
     • Edge labels: syntactic functions
     • Crossing branches for discontinuous constituency types
     • Annotation on word level: part-of-speech, morphology, lemmata
     [Figure: syntax graph for "Im nächsten Jahr will die Regierung ihre Reformpläne umsetzen." with node labels S, VP, PP, NP and edge labels HD, SB, OC, MO, OA, AC, NK]
     Word-level annotation shown in the figure (word, part-of-speech, morphology, lemma):
       Im            APPRART  Dat              in
       nächsten      ADJA     Sup.Dat.Sg.Neut  nahe
       Jahr          NN       Dat.Pl.Neut      Jahr
       will          VMFIN    3.Sg.Pres.Ind    wollen
       die           ART      Nom.Sg.Fem       die
       Regierung     NN       Nom.Sg.Fem       Regierung
       ihre          PPOSAT   Acc.Pl.Masc      ihr
       Reformpläne   NN       Acc.Pl.Masc      Plan
       umsetzen      VVINF    Inf              umsetzen
       .             $.

  5. TiGer: Annotation method
     • Interactive tagging and parsing
     • Tagging: TnT (97% reliable); Parsing: Cascaded Markov Models (71% reliable); Morphology: TigerMorph
     • Independent annotation by 2 different annotators and comparison => consistency of corpus + improvement of annotation scheme
     • Annotation time: 10 minutes per sentence
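
The comparison step between two annotators could be sketched roughly as follows. This is only an illustration: the two annotations below are invented (the slide gives no such data), and the simple per-token agreement ratio is an assumed measure, not necessarily the one used in the TIGER project.

```python
# Hypothetical annotations of the same tokens by two annotators:
# token -> (edge label, parent node id). Invented for illustration.
annotator_a = {"Iggy": ("PNC", 500), "den": ("NK", 501), "gesanglich": ("MO", 503)}
annotator_b = {"Iggy": ("PNC", 500), "den": ("NK", 501), "gesanglich": ("MO", 501)}

# Tokens on which the two annotators disagree, with both analyses side by side.
differences = {
    token: (annotator_a[token], annotator_b[token])
    for token in annotator_a
    if annotator_a[token] != annotator_b[token]
}

# Share of tokens with identical edge label and parent (assumed measure).
agreement = 1 - len(differences) / len(annotator_a)

print(differences)
print(f"{agreement:.2f}")
```

Each disagreement found this way is either an annotation error to be corrected or a hint that the annotation scheme needs refinement, which are the two outcomes the slide names.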

  6. TiGer: Annotation formats
     • Corpus annotation and storage on the basis of a MySQL database
     • TIGER export format in a line-oriented and ASCII based format
     • Separate columns for words, part-of-speech tags, morphological information, edge labels and parent labels
     • Encoded meta-information on date, source etc.
     Example (export format):
       #BOS 37 3 863207489 1
       %% word       tag    morph            edge  parent
       Ausgerechnet  ADJD   --               MO    502
       Iggy          NE     Masc.Nom.Sg      PNC   500
       Pop           NE     *.Nom.Sg         PNC   500
       verkörpert    VVFIN  3.Sg.Pres.Ind    HD    503
       gesanglich    ADJD   Pos              MO    503
       den           ART    Def.Masc.Akk.Sg  NK    501
       Staatsanwalt  NN     Masc.Akk.Sg.*    NK    501
       .             $.     --               --    0
       #500          MPN    --               NK    502
       #501          NP     --               OA    503
       #502          NP     --               SB    503
       #503          S      --               --    0
       #EOS 37
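
To make the column layout concrete, here is a minimal Python sketch that reads the sentence from the slide. It assumes exactly five whitespace-separated columns per row and that only nonterminal rows start with "#"; the `Row` class and `parse_sentence` function are names made up for this sketch, and real export files may contain further columns or conventions not shown on the slide.

```python
from dataclasses import dataclass

# Sentence 37 from the slide, with the columns separated by whitespace.
EXPORT = """\
#BOS 37 3 863207489 1
%% word tag morph edge parent
Ausgerechnet ADJD -- MO 502
Iggy NE Masc.Nom.Sg PNC 500
Pop NE *.Nom.Sg PNC 500
verkörpert VVFIN 3.Sg.Pres.Ind HD 503
gesanglich ADJD Pos MO 503
den ART Def.Masc.Akk.Sg NK 501
Staatsanwalt NN Masc.Akk.Sg.* NK 501
. $. -- -- 0
#500 MPN -- NK 502
#501 NP -- OA 503
#502 NP -- SB 503
#503 S -- -- 0
#EOS 37
"""


@dataclass
class Row:
    form: str    # word form for terminals, "#<id>" for nonterminals
    tag: str     # part-of-speech tag or phrase category
    morph: str   # morphological information ("--" if none)
    edge: str    # edge label (syntactic function)
    parent: int  # id of the parent node; 0 means no parent


def parse_sentence(text):
    """Split one #BOS ... #EOS block into terminal and nonterminal rows."""
    terminals, nonterminals = [], {}
    for line in text.splitlines():
        if not line or line.startswith(("#BOS", "#EOS", "%%")):
            continue  # sentence delimiters and comment lines
        form, tag, morph, edge, parent = line.split()
        row = Row(form, tag, morph, edge, int(parent))
        if form.startswith("#"):
            nonterminals[int(form[1:])] = row  # assumption: "#" marks a node id
        else:
            terminals.append(row)
    return terminals, nonterminals


terminals, nonterminals = parse_sentence(EXPORT)
print([t.form for t in terminals if t.parent == 501])  # ['den', 'Staatsanwalt']
print(nonterminals[501].edge)                          # OA
```

The last two lines show how the parent column ties the levels together: the words "den" and "Staatsanwalt" both point to node 501, an NP that in turn attaches to the S node 503 with the function OA.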

  7. TiGer: Annotation formats
     • TIGER XML document is split up into header and body
     • Header contains meta-information on corpus name, date, author etc. and an annotation grammar
     • Body: directed acyclic graphs are used as the underlying data model to encode the linguistic annotation
     • Element terminals contains the following attributes: word, part-of-speech, morphological tag
     • Element nonterminals: information on phrase categories and syntactic functions
     Example (TIGER XML, same sentence as in the export format):
       <s id="s37">
         <graph root="s37_503">
           <terminals>
             <t id="s37_1" word="Ausgerechnet" pos="ADJD" morph="--" />
             <t id="s37_2" word="Iggy" pos="NE" morph="Masc.Nom.Sg" />
             <t id="s37_3" word="Pop" pos="NE" morph="*.Nom.Sg" />
             <t id="s37_4" word="verk&#x00f6;rpert" pos="VVFIN" morph="3.Sg.Pres.Ind" />
             <t id="s37_5" word="gesanglich" pos="ADJD" morph="Pos" />
             ...
           </terminals>
           <nonterminals>
             <nt id="s37_500" cat="MPN">
               <edge label="PNC" idref="s37_2"/>
               <edge label="PNC" idref="s37_3"/>
             </nt>
             <nt id="s37_501" cat="NP">
               <edge label="NK" idref="s37_6"/>
               <edge label="NK" idref="s37_7"/>
             </nt>
             ...
           </nonterminals>
         </graph>
       </s>
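
As a rough illustration of how the terminals and nonterminals elements relate, the following Python sketch reads a fragment of the body with the standard library's `xml.etree.ElementTree`. It contains only elements printed on the slide; the parts elided there ("...") are left out rather than guessed, so the fragment is deliberately incomplete (the root node `s37_503` is referenced but not defined in it).

```python
import xml.etree.ElementTree as ET

# Only elements printed on the slide are included; elided parts are omitted.
XML = """\
<s id="s37">
  <graph root="s37_503">
    <terminals>
      <t id="s37_1" word="Ausgerechnet" pos="ADJD" morph="--" />
      <t id="s37_2" word="Iggy" pos="NE" morph="Masc.Nom.Sg" />
      <t id="s37_3" word="Pop" pos="NE" morph="*.Nom.Sg" />
      <t id="s37_4" word="verk&#x00f6;rpert" pos="VVFIN" morph="3.Sg.Pres.Ind" />
      <t id="s37_5" word="gesanglich" pos="ADJD" morph="Pos" />
    </terminals>
    <nonterminals>
      <nt id="s37_500" cat="MPN">
        <edge label="PNC" idref="s37_2"/>
        <edge label="PNC" idref="s37_3"/>
      </nt>
    </nonterminals>
  </graph>
</s>
"""

sentence = ET.fromstring(XML)

# Map terminal ids to word forms so that edges can be resolved to words.
words = {t.get("id"): t.get("word") for t in sentence.iter("t")}

# For each nonterminal, list its outgoing edges as (label, child) pairs.
# Children that are not terminals in this fragment are shown by their id.
for nt in sentence.iter("nt"):
    children = [
        (edge.get("label"), words.get(edge.get("idref"), edge.get("idref")))
        for edge in nt.findall("edge")
    ]
    print(nt.get("cat"), children)  # MPN [('PNC', 'Iggy'), ('PNC', 'Pop')]
```

Note that the edges are stored on the parent node and point to their children by `idref`, whereas the export format stores the parent id on each child row.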

  8. TiGer: Annotation scheme
     • Uses a hybrid framework which combines advantages of dependency grammar and phrase structure grammar
     • Syntactic structures are rather flat and simple in order to reduce the potential for attachment ambiguities (e.g. the distinction between arguments and adjuncts is not expressed in the constituent structure, but encoded by means of syntactic functions)
     • Based on the NEGRA annotation scheme
     • Changes in TIGER: improvement of linguistic adequacy, extension of linguistic inventory
     • Cross-fertilization of corpus and annotation scheme: annotation and comparison => discrepancy between annotation scheme and data => changes in annotation scheme, test for operationalization

  9. TiGer: Query tool
     • TIGERSearch: query tool for treebanks using TIGER Query Language
     • TIGERRegistry: format conversions into TIGER XML and indexing of the annotated corpus
     • TIGER Graph Viewer: visualization of query results
     • TIGERin: Graphical User Interface to simplify complex queries and to improve accessibility of the query language

  10. TiGer: Query language

  11. TiGer: Query language
      Node level:
      • Nodes can be described by Boolean expressions over feature-value pairs
      • Query: [word="lacht" & pos="VVFIN"]
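
What the node-level query above selects can be mimicked in a few lines of Python. This is not how TIGERSearch evaluates queries; it is a sketch over invented tokens, with a hypothetical helper `matches` that covers only the conjunction ("&") case.

```python
# Hypothetical tokens, invented for illustration (not taken from the corpus).
tokens = [
    {"word": "Sie", "pos": "PPER"},
    {"word": "lacht", "pos": "VVFIN"},
    {"word": "laut", "pos": "ADJD"},
]


def matches(node, **constraints):
    """True if the node satisfies every feature=value constraint."""
    return all(node.get(feature) == value for feature, value in constraints.items())


# Counterpart of the query [word="lacht" & pos="VVFIN"]
hits = [t for t in tokens if matches(t, word="lacht", pos="VVFIN")]
print(hits)  # [{'word': 'lacht', 'pos': 'VVFIN'}]
```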

  12. TiGer: Query language
      Node relation level:
      • Descriptions of two or more nodes are combined by a relation
      • Query: [cat="NP"] >RC [cat="S"]
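
The relation in this query is direct dominance restricted to an edge label: an NP that directly dominates an S through an edge labelled RC. A small Python sketch of that check, over an invented graph and with a hypothetical helper `labeled_dominance`, might look like this:

```python
# Hypothetical mini-graph, invented for illustration.
nodes = {
    500: {"cat": "NP"},
    501: {"cat": "S"},
    502: {"cat": "NP"},
    503: {"cat": "S"},
}

# Edges as (parent id, edge label, child id).
edges = [
    (500, "RC", 501),  # NP --RC--> S: the pattern the query looks for
    (503, "SB", 502),  # S --SB--> NP: wrong direction and wrong label
    (503, "OA", 500),  # S --OA--> NP: wrong direction and wrong label
]


def labeled_dominance(parent_cat, label, child_cat):
    """Pairs (parent, child) where parent directly dominates child via `label`."""
    return [
        (parent, child)
        for parent, edge_label, child in edges
        if edge_label == label
        and nodes[parent].get("cat") == parent_cat
        and nodes[child].get("cat") == child_cat
    ]


# Counterpart of the query [cat="NP"] >RC [cat="S"]
print(labeled_dominance("NP", "RC", "S"))  # [(500, 501)]
```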

  13. TiGer: Query language
      Graph description level:
      • Boolean expressions over node relations are allowed (without negation)
      • Query: ([cat="S"] > [pos="PRELS"]) & ([cat="S"] > [pos="VVFIN"])
      • Variables can be used to express coreference of nodes or feature values
      • Query: (#n:[cat="S"] > [pos="PRELS"]) & (#n > [pos="VVFIN"])
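
The difference between the two queries is whether both relations must hold for the same S node. The sketch below illustrates this over an invented graph, under the reading that two separate `[cat="S"]` descriptions may bind different nodes while `#n` forces a single one; the helper `s_over` is hypothetical.

```python
# Hypothetical graph, invented for illustration.
cat = {1: "S", 2: "S"}                         # nonterminal id -> category
pos = {10: "PRELS", 11: "VVFIN", 12: "VVFIN"}  # token id -> part-of-speech tag
dominates = [(1, 10), (1, 11), (2, 12)]        # (parent id, child id), direct


def s_over(tag):
    """Ids of S nodes that directly dominate a token with the given POS tag."""
    return {p for p, c in dominates if cat.get(p) == "S" and pos.get(c) == tag}


# Without variables: each [cat="S"] may bind a different node, so every
# combination of an S over PRELS and an S over VVFIN is a match.
unbound = {(a, b) for a in s_over("PRELS") for b in s_over("VVFIN")}

# With #n: both relations must hold for one and the same S node.
bound = s_over("PRELS") & s_over("VVFIN")

print(sorted(unbound))  # [(1, 1), (1, 2)]
print(sorted(bound))    # [1]
```

Only node 1 dominates both a relative pronoun and a finite verb, so only it survives once the variable ties the two relations together.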

  14. For further information (downloads, papers etc.): http://www.coli.uni-sb.de/cl/projects/tiger
