Off-line (and On-line) Text Analysis for Computational Lexicography Hannah Kermes Algorithmische Syntax 21.12.2004
Motivation • maintenance of consistency and completeness within lexica • computer-assisted methods • lexical engineering • scalable lexicographic work process • processes reproducible on large amounts of text • statistical tools (PoS tagging etc.) and traditional chunkers do not provide enough information for corpus linguistic research • full parsers are not robust enough • need for analysis tools that meet the specific needs of corpus linguistic studies
Information needed • syntactic information • subcategorization patterns • semantic information • selectional preferences, collocations • synonyms • multi-word units • lexical classes • morphological information • case, number, gender • compounding and derivation
Requirements for the tool • it has to work on unrestricted text • shortcomings in the grammar should not lead to a complete failure to parse • no manual checking should be required • should provide a clearly defined interface • annotation should follow linguistic standards
Requirements for the annotation • head lemma • morpho-syntactic information • lexical-semantic information • structural and textual information • hierarchical representation
Hypothesis The better and more detailed the off-line annotation, the better and faster the on-line extraction. However, the more detailed the off-line annotation, the more complex the grammar, the more time-consuming and difficult the grammar development, and the slower the parsing process.
Three different dimensions • type of grammar • symbolic grammar • probabilistic grammar • type of grammar development • hand-written grammar • learning methods • depth of analysis • analysis on token level only • full parsing • partial parsing
Classical chunk definition • Abney 1991: The typical chunk consists of a single content word surrounded by a constellation of function words, matching a fixed template • Abney 1996: a non-recursive core of an intra-clausal constituent, extending from the beginning of the constituent to its head
Problems for extraction • Kübler and Hinrichs (2001): research on chunk parsing has "focused on the recognition of partial constituent structures at the level of individual chunks […], little or no attention has been paid to the question of how such partial analysis can be combined into larger structures for complete utterances."
An example
• [PC mit kleinen ], [PC über die Köpfe ] [NC der Apostel ] [NC gesetzten Flammen ]
  with small, above the heads the apostles set flames
• [PP mit [NP [AP kleinen ], [AP über [NP die Köpfe [NP der Apostel ] ] gesetzten ] Flammen ] ]
  with small, above the heads the apostles set flames
  'with small flames set above the heads of the apostles'
Problems for extraction • four NCs instead of only one NP • AN-pair: • gesetzten + Flammen • kleine + Flammen • NN-pair Köpfe + Apostel needs agreement information • VN-pair setzen + Flammen needs information about the deverbal character of gesetzten • a more complex analysis is needed • PCs and NCs need to be combined
Simple solution • PP → PC (PC|NC)* • theoretical motivation? • rule covers this particular example, other examples might need additional rules • rule is vague and largely underspecified • not very reliable • internal structure is mainly left opaque
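To make the objection concrete, here is a minimal Python sketch of the simple rule PP → PC (PC|NC)* applied to the flat chunk sequence of the example. The `(label, text)` pair representation and the function name `apply_simple_rule` are assumptions made for illustration; they are not part of any chunker described on the slides.

```python
import re

# Hypothetical flat chunker output for the example: (label, text) pairs.
chunks = [
    ("PC", "mit kleinen"),
    ("PC", "über die Köpfe"),
    ("NC", "der Apostel"),
    ("NC", "gesetzten Flammen"),
]

# The rule PP -> PC (PC|NC)* written as a regular expression over labels.
PP_RULE = re.compile(r"PC(?: (?:PC|NC))*")


def apply_simple_rule(chunks):
    """Group a leading PC and all following PCs/NCs into one flat PP."""
    labels = " ".join(label for label, _ in chunks)
    match = PP_RULE.match(labels)
    if match is None:
        return None
    covered = len(match.group(0).split())
    # The PP simply lists its chunks: no AP, no embedded NP, no attachment.
    return ("PP", chunks[:covered])


pp = apply_simple_rule(chunks)
print(pp[0], [label for label, _ in pp[1]])
```

The result is one PP covering four sibling chunks, so an extraction query still cannot tell that kleinen modifies Flammen or that der Apostel depends on Köpfe.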
Complex solution • NP → NC NCgen • PP → preposition NP • AP → PP adjective • NP → AP* noun
Complex solution • solution for this particular example only • large number of rules needed • rules have to be repeated for every instance of a complex phrase • in order to support extractions, the classic chunk concept has to be extended
Conclusion • Chunking: flat non-recursive structures; simple grammar; robust and efficient; non-ambiguous output • Full parsing: full hierarchical representation; complex grammar; not very robust; ambiguous output • YAC (the third approach named on the slide)
A recursive chunker for unrestricted German text • recursive chunker for unrestricted German text • fully automatic analysis • main goal: provide a useful basis for extraction of linguistic as well as lexicographic information from corpora
General aspects • based on a symbolic regular expression grammar • grammar rules written in CQP • basis: • tokenization • PoS-tagging • lemmatization • agreement information • tools named on the slide: Tree Tagger, IMSLex
A typical chunker • robust – works on unrestricted text • works fully automatically • does not provide full but partial analysis of text • no highly ambiguous attachment decisions are made
YAC goes beyond • extends the chunk definition of Abney • recursive embedding • post-head embedding • provides additional information about annotated chunks • head lemma • agreement information • lexical-semantic and structural properties
Extended chunk definition A chunk is a continuous part of an intra-clausal constituent, including recursion and pre-head as well as post-head modifiers, but no PP-attachment or sentential elements.
Technical Framework [diagram labels: Perl-Scripts, rule application, post-processing, lexicon, annotation of results, corpus, grammar rules]
Output formats • CQP format, used for: • interactive grammar development • parsing • extraction • an XML format, used for: • hierarchy building • extraction • data exchange
Advantages of the system • efficient work even with large corpora • modular query language • interactive grammar development • powerful post-processing of rules
Linguistic coverage • Adverbial phrases (AdvP) • schön stark (beautifully strong) • daher (from there); irgendwoher (from anywhere) • heim (home); querfeldein (cross-country) • innen (inside); überall (everywhere) • "sehr bald" (very soon) • jetzt (now); damals (at that time)
Linguistic coverage
• Adjectival phrases (AP)
  • möglich (possible)
  • schreiend lila (screamingly purple)
  • rund zwei Meter hohe
    around two meter high
  • über die Köpfe der Apostel gesetzten
    above the heads of the apostles set
    'set above the heads of the apostles'
Linguistic coverage
• Noun phrases (NP)
  • Oktober (October); er (he)
  • 4,9 Milliarden Euro (4.9 billion Euros)
  • "Frankensteins Fluch" ("Frankenstein's curse")
  • kleine, über die Köpfe der Apostel gesetzten Flammen
    small, above the heads of the apostles set flames
    'small flames set above the heads of the apostles'
Linguistic coverage
• Prepositional phrases (PP)
  • davon (thereof)
  • zwischen Basel und St. Moritz (between Basel and St. Moritz)
  • mit kleinen, über die Köpfe der Apostel gesetzten Flammen
    with small, above the heads of the apostles set flames
    'with small flames set above the heads of the apostles'
Linguistic coverage
• Verbal complexes (VC)
  • gemunkelt (rumored)
  • muß gerechnet werden
    has counted to be
    'has to be counted'
  • zu bekommen (to get)
  • bekommen zu haben
    gotten to have
    'to have gotten'
Linguistic coverage
• Clauses (CL)
  • … , daß selbst Ravel sich amüsiert hätte.
    … , that even Ravel himself enjoyed had.
    '… , that even Ravel would have enjoyed.'
  • … , die man in der griechischen Tragödie findet.
    … , which one in the Greek tragedy finds.
    '… , which one finds in the Greek tragedy.'
Linguistic coverage
• Clauses (CL)
  • … , Instrumente selbst zu bauen.
    … , instruments oneself to build.
    '… , to build instruments oneself.'
  • … , um einen Kaffee zu trinken.
    … , in order a coffee to drink.
    '… , in order to drink a coffee.'
Feature annotation • head lemma • morpho-syntactic information • lexical-semantic properties
Head lemma • lemma attribute at the head position • normally a single token • multi-word proper nouns have a multi-token head lemma • a separated verbal prefix is included in the head lemma of the VC: kommt … an → ankommen (arrive) • head lemma of PP: preposition:noun
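The two composite cases can be sketched in a few lines of Python. The function names and the lemma strings passed in are hypothetical; only the re-attachment of the separated prefix and the `preposition:noun` format come from the slide.

```python
def vc_head_lemma(verb_lemma, separated_prefix=None):
    """Head lemma of a verbal complex; a separated prefix is re-attached."""
    return (separated_prefix or "") + verb_lemma


def pp_head_lemma(preposition_lemma, noun_lemma):
    """Head lemma of a PP in the form preposition:noun."""
    return f"{preposition_lemma}:{noun_lemma}"


# "kommt ... an": verb lemma "kommen" plus separated prefix "an"
print(vc_head_lemma("kommen", "an"))
# "mit ... Flammen": preposition "mit", head noun lemma "Flamme"
print(pp_head_lemma("mit", "Flamme"))
```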
Morpho-syntactic information • intersection of the morpho-syntactic information of relevant elements • invariant elements are not considered • no guessing involved to solve ambiguities
Agreement Information den/|Akk:M:Sg:Def|Dat:F:Pl:Def|Dat:M:Pl:Def|Dat:N:Pl:Def| <ap_agr |Akk:F:Pl:Def|Akk:F:Pl:Ind|Akk:M:Pl:Def|Akk:M:Pl:Ind|Akk:M:Sg:Def|Akk:M:Sg:Ind|Akk:M:Sg:Nil|Akk:N:Pl:Def|Akk:N:Pl:Ind|Dat:F:Pl:Def|Dat:F:Pl:Ind|Dat:F:Pl:Nil|Dat:F:Sg:Def|Dat:F:Sg:Ind|Dat:M:Pl:Def|Dat:M:Pl:Ind|Dat:M:Pl:Nil|Dat:M:Sg:Def|Dat:M:Sg:Ind|Dat:N:Pl:Def|Dat:N:Pl:Ind|Dat:N:Pl:Nil|Dat:N:Sg:Def|Dat:N:Sg:Ind|Gen:F:Pl:Def|Gen:F:Pl:Ind|Gen:F:Sg:Def|Gen:F:Sg:Ind|Gen:M:Pl:Def|Gen:M:Pl:Ind|Gen:M:Sg:Def|Gen:M:Sg:Ind|Gen:M:Sg:Nil|Gen:N:Pl:Def|Gen:N:Pl:Ind|Gen:N:Sg:Def|Gen:N:Sg:Ind|Gen:N:Sg:Nil|Nom:F:Pl:Def|Nom:F:Pl:Ind|Nom:M:Pl:Def|Nom:M:Pl:Ind|Nom:N:Pl:Def|Nom:N:Pl:Ind|>vierten</ap_agr> <nc_agr |Akk:M:Sg:Def|Akk:M:Sg:Ind|Akk:M:Sg:Nil|Dat:M:Sg:Def|Dat:M:Sg:Ind|Dat:M:Sg:Nil|Nom:M:Sg:Def|Nom:M:Sg:Ind|Nom:M:Sg:Nil|>Platz</nc_agr>
Agreement Information den/|Akk:M:Sg:Def|Dat:F:Pl:Def|Dat:M:Pl:Def|Dat:N:Pl:Def| <ap_agr |Akk:F:Pl:Def|Akk:F:Pl:Ind|Akk:M:Pl:Def|Akk:M:Pl:Ind|Akk:M:Sg:Def|Akk:M:Sg:Ind|Akk:M:Sg:Nil|Akk:N:Pl:Def|Akk:N:Pl:Ind|Dat:F:Pl:Def|Dat:F:Pl:Ind|Dat:F:Pl:Nil|Dat:F:Sg:Def|Dat:F:Sg:Ind|Dat:M:Pl:Def|Dat:M:Pl:Ind|Dat:M:Pl:Nil|Dat:M:Sg:Def|Dat:M:Sg:Ind|Dat:N:Pl:Def|Dat:N:Pl:Ind|Dat:N:Pl:Nil|Dat:N:Sg:Def|Dat:N:Sg:Ind|Nom:F:Pl:Def|Nom:F:Pl:Ind|Nom:M:Pl:Def|Nom:M:Pl:Ind|Nom:N:Pl:Def|Nom:N:Pl:Ind|>vierten</ap_agr> <nc_agr |Akk:M:Sg:Def|Akk:M:Sg:Ind|Akk:M:Sg:Nil|Dat:M:Sg:Def|Dat:M:Sg:Ind|Dat:M:Sg:Nil|Nom:M:Sg:Def|Nom:M:Sg:Ind|Nom:M:Sg:Nil|>Platz</nc_agr>
Agreement Information den/|Akk:M:Sg:Def|Dat:F:Pl:Def|Dat:M:Pl:Def|Dat:N:Pl:Def| <ap_agr |Akk:F:Pl:Def|Akk:F:Pl:Ind|Akk:M:Pl:Def|Akk:M:Pl:Ind|Akk:M:Sg:Def|Akk:M:Sg:Ind|Akk:M:Sg:Nil|Akk:N:Pl:Def|Akk:N:Pl:Ind|Dat:F:Pl:Def|Dat:F:Pl:Ind|Dat:F:Pl:Nil|Dat:F:Sg:Def|Dat:F:Sg:Ind|Dat:M:Pl:Def|Dat:M:Pl:Ind|Dat:M:Pl:Nil|Dat:M:Sg:Def|Dat:M:Sg:Ind|Dat:N:Pl:Def|Dat:N:Pl:Ind|Dat:N:Pl:Nil|Dat:N:Sg:Def|Dat:N:Sg:Ind|>vierten</ap_agr> <nc_agr |Akk:M:Sg:Def|Akk:M:Sg:Ind|Akk:M:Sg:Nil|Dat:M:Sg:Def|Dat:M:Sg:Ind|Dat:M:Sg:Nil|>Platz</nc_agr>
Agreement Information <np_agr |Akk:M:Sg:Def|> den/|Akk:M:Sg:Def|Dat:F:Pl:Def|Dat:M:Pl:Def|Dat:N:Pl:Def| <ap_agr |Akk:F:Pl:Def|Akk:F:Pl:Ind|Akk:M:Pl:Def|Akk:M:Pl:Ind|Akk:M:Sg:Def|Akk:M:Sg:Ind|Akk:M:Sg:Nil|Akk:N:Pl:Def|Akk:N:Pl:Ind|Dat:F:Pl:Def|Dat:F:Pl:Ind|Dat:F:Pl:Nil|Dat:F:Sg:Def|Dat:F:Sg:Ind|Dat:M:Pl:Def|Dat:M:Pl:Ind|Dat:M:Pl:Nil|Dat:M:Sg:Def|Dat:M:Sg:Ind|Dat:N:Pl:Def|Dat:N:Pl:Ind|Dat:N:Pl:Nil|Dat:N:Sg:Def|Dat:N:Sg:Ind|>vierten</ap_agr> <nc_agr |Akk:M:Sg:Def|Akk:M:Sg:Ind|Akk:M:Sg:Nil|Dat:M:Sg:Def|Dat:M:Sg:Ind|Dat:M:Sg:Nil|>Platz</nc_agr> </np_agr> <np_agr |Akk:M:Sg:Def|>
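The step from the individual readings to the single `np_agr` value is a plain set intersection. The following Python sketch reproduces it for den vierten Platz; the function names are assumptions, and the reading list for vierten is deliberately abbreviated to a subset of the readings shown above.

```python
def parse_agr(value):
    """Split a '|'-delimited agreement string into a set of readings."""
    return {reading for reading in value.split("|") if reading}


def intersect_agr(*values):
    """Keep only the readings shared by all elements (no guessing)."""
    shared = set.intersection(*(parse_agr(v) for v in values))
    return "|" + "|".join(sorted(shared)) + "|"


DEN = "|Akk:M:Sg:Def|Dat:F:Pl:Def|Dat:M:Pl:Def|Dat:N:Pl:Def|"
# Abbreviated: a subset of the ap_agr readings listed for "vierten".
VIERTEN = "|Akk:M:Pl:Def|Akk:M:Sg:Def|Akk:M:Sg:Ind|Dat:F:Pl:Def|Dat:M:Sg:Def|"
PLATZ = "|Akk:M:Sg:Def|Akk:M:Sg:Ind|Akk:M:Sg:Nil|Dat:M:Sg:Def|Dat:M:Sg:Ind|Dat:M:Sg:Nil|"

np_agr = intersect_agr(DEN, VIERTEN, PLATZ)
print(np_agr)  # |Akk:M:Sg:Def|
```

If no reading is shared, the function returns an empty value (`||`) rather than picking a reading, in line with the principle that no guessing is involved.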
Lexical-semantic properties • important for parsing as well as for extraction • properties can be triggers for specific internal structures, functions, and usages • properties inherent in the corpus • PoS-tags: Johann/NE Sebastian/NE Bach/NE • text markers: "Wilhelm/NE Meisters/NN Lehrjahre/NN"
Lexical-semantic properties • properties determined by external knowledge sources (lexica, ontologies, word lists) • locality: hier (here); dort (there); Stuttgart • temporality: Jahr (year); damals (at that time) • derivation: gesetzten (set), deverbal adjective
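A word-list lookup of this kind can be sketched as follows. The dictionary layout, the property names used as keys, and the function name are assumptions for illustration; the entries are just the examples from the slide, not an actual lexicon.

```python
# Hypothetical word lists standing in for external knowledge sources.
WORD_LISTS = {
    "locality": {"hier", "dort", "Stuttgart"},
    "temporality": {"Jahr", "damals"},
    "deverbal": {"gesetzten"},
}


def lexical_properties(token):
    """Return the names of all word lists that contain the token."""
    return sorted(name for name, words in WORD_LISTS.items() if token in words)


print(lexical_properties("damals"))
print(lexical_properties("Flammen"))
```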
Lexical-semantic properties
• structural information
• complex embeddings
  • [AP [PP über die Köpfe der Apostel ] gesetzten ]
    above the heads of the apostles set
    'set above the heads of the apostles'
  • [AP [NP der "Inkatha"-Partei ] angehörenden ]
    to the Inkatha-party belonging
    'belonging to the Inkatha-party'
Other lexical-semantic properties
• VC with separated prefix: pref
  Er kommt an (he arrives)
• PP with contracted preposition and article: fus
  am Bahnhof (at the station)
• complex APs embedding PPs: pp
  über die Köpfe der Apostel gesetzten
  above the heads of the apostles set
  'set above the heads of the apostles'
• AP with deverbal adjectives: vder
Chunking process [diagram labels: Corpus, Lexicon, First Level, Second Level, Third Level]
First level • basic (non-recursive) chunks • chunks with specific internal structure • Ende September (end of September) • Jahre später (years later) • 21. Juli 2003 • Johann Sebastian Bach • lexical information is introduced • within the rules themselves • within the Perl-scripts
Advantages • specific rules do not interact with main parsing rules • additional (e.g. domain specific) rules can be included easily • main parsing rules can be kept simple • number of main parsing rules can be kept small
Second level • main parsing level • relatively simple and general rules • AP → AdvP? (PP|NP)* AC • NP → Determiner? Cardinal? AP* NC • PP → Preposition (NP|AdvP) • complex (recursive) structures are built in several iterations
Complexity of phrases
• complexity of phrases is achieved by the embedding of complex structures rather than by complex rules
• [NP eine [AP verständliche ] Sprache ]
  an understandable language
• [NP eine [AP für den Anwender verständliche ] Sprache ]
  a for the user understandable language
  'a language understandable for the user'
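The interplay of simple rules and repeated application can be illustrated with a small Python sketch. It encodes the three second-level rules (AP → AdvP? (PP|NP)* AC, NP → Determiner? Cardinal? AP* NC, PP → Preposition (NP|AdvP)) as regular expressions over category labels and rewrites the leftmost match until nothing changes. The control strategy (leftmost match first, one rewrite per iteration), the labels DET, CARD, PREP, and the node representation are assumptions of this sketch, not a description of how YAC or CQP applies its rules.

```python
import re

# Rules as regexes over space-terminated labels. DET, CARD and PREP are
# hypothetical abbreviations for Determiner, Cardinal and Preposition.
RULES = [
    ("AP", r"(?:AdvP )?(?:(?:PP|NP) )*AC "),
    ("NP", r"(?:DET )?(?:CARD )?(?:AP )*NC "),
    ("PP", r"PREP (?:NP|AdvP) "),
]
# The lookbehind forces every match to start at a label boundary.
COMPILED = [(cat, re.compile(r"(?<!\S)" + pattern)) for cat, pattern in RULES]


def leftmost_match(labels):
    """Return (start, end, category) of the leftmost rule match, or None."""
    text = "".join(label + " " for label in labels)
    best = None
    for cat, regex in COMPILED:
        m = regex.search(text)
        if m and (best is None or (m.start(), -m.end()) < best[0]):
            best = ((m.start(), -m.end()), m, cat)
    if best is None:
        return None
    _, m, cat = best
    # Every label ends in one space, so counting spaces yields label indices.
    return text[:m.start()].count(" "), text[:m.end()].count(" "), cat


def chunk(nodes):
    """Rewrite the leftmost match into a phrase until no rule applies."""
    nodes = list(nodes)
    while True:
        found = leftmost_match([label for label, _ in nodes])
        if found is None:
            return nodes
        start, end, cat = found
        nodes[start:end] = [(cat, nodes[start:end])]


def bracket(node):
    """Render a node in labelled-bracket notation; tokens print as words."""
    label, content = node
    if isinstance(content, str):
        return content
    return "[" + label + " " + " ".join(bracket(c) for c in content) + " ]"


simple = [("DET", "eine"), ("AC", "verständliche"), ("NC", "Sprache")]
nested = [("DET", "eine"), ("PREP", "für"), ("DET", "den"),
          ("NC", "Anwender"), ("AC", "verständliche"), ("NC", "Sprache")]
for tokens in (simple, nested):
    print(" ".join(bracket(n) for n in chunk(tokens)))
```

Both inputs are handled by the same three rules; the second one just takes more iterations, because the NP den Anwender and the PP around it must exist before the AP rule can embed them. A greedy strategy like this one will make wrong choices on other inputs, which is one reason agreement and lexical information matter for the real grammar.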