Text Analysis Meets Computational Lexicography Hannah Kermes
Motivation • maintenance of consistency and completeness within lexica • computer-assisted methods • lexical engineering • scalable lexicographic work process • processes reproducible on large amounts of text
Motivation • rising interest in using evidence derived from automatic syntactic analysis • statistical tools (PoS tagging etc.) and traditional chunkers do not provide enough information for corpus linguistic research • full parsers are not robust enough • need for analysis tools that meet the specific needs of corpus linguistic studies
Information needed • syntactic information: subcategorization patterns • semantic information: selectional preferences, collocations, MWL • morphological information: case, number, gender; compounding and derivation
Hypothesis • The better and more detailed the off-line annotation, the better and faster the on-line extraction. • However, the more detailed the off-line annotation, the more complex the grammar, the more time-consuming and difficult the grammar development, and the slower the parsing process.
Requirements for the tool • it has to work on unrestricted text • shortcomings in the grammar should not lead to a complete failure to parse • no manual checking should be required • should provide a clearly defined interface • annotation should follow linguistic standards
Requirements for the annotation • head lemma • morpho-syntactic information • lexical-semantic information • structural and textual information • hierarchical representation
Chunking vs. full parsing • Chunking: flat non-recursive structures; simple grammar; robust and efficient; non-ambiguous output • Full parsing: full hierarchical representation; complex grammar; not very robust; ambiguous output [Diagram label: YAC]
Problems for extraction • Kübler and Hinrichs (2001): "focused on the recognition of partial constituent structures at the level of individual chunks […], little or no attention has been paid to the question of how such partial analysis can be combined into larger structures for complete utterances."
Extended chunk definition • A chunk is a continuous part of an intra-clausal constituent, including recursion and pre-head as well as post-head modifiers, but no PP-attachment or sentential elements.
A classical chunker • robust – works on unrestricted text • works fully automatically • does not provide full but partial analysis of text • no highly ambiguous attachment decisions are made
YAC goes beyond • extends the chunk definition of Abney • provides additional information about annotated chunks
Technical framework - CQP • regular expression matching on token and annotation strings (e.g. .*jahr) • tests for membership in user-specific word lists • feature set operations • constraints to specify dependencies
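The four kinds of test listed on this slide can be illustrated outside CQP. The following is a minimal sketch in plain Python, not CQP syntax: the token records, the feature strings, the word list, and all function names are assumptions made up for this example.

```python
import re

# Hypothetical token records: word form, PoS tag, and a set of
# case:gender:number readings (the feature format is an assumption).
tokens = [
    {"word": "im", "pos": "APPRART", "feats": {"Dat:M:Sg", "Dat:N:Sg"}},
    {"word": "letzten", "pos": "ADJA", "feats": {"Dat:N:Sg", "Gen:N:Sg"}},
    {"word": "Halbjahr", "pos": "NN", "feats": {"Dat:N:Sg", "Nom:N:Sg", "Acc:N:Sg"}},
]

# Hypothetical user-specific word list of temporal nouns.
TEMPORAL_NOUNS = {"Jahr", "Halbjahr", "Monat"}


def matches_regex(token, pattern):
    """Regular expression test on the whole token string."""
    return re.fullmatch(pattern, token["word"]) is not None


def in_word_list(token, word_list):
    """Membership test against a user-specific word list."""
    return token["word"] in word_list


def shared_features(toks):
    """Feature set operation: intersect the readings of all tokens."""
    common = set(toks[0]["feats"])
    for tok in toks[1:]:
        common &= tok["feats"]
    return common


noun = tokens[2]
is_year_noun = matches_regex(noun, r".*jahr")     # pattern from the slide
is_temporal = in_word_list(noun, TEMPORAL_NOUNS)
agreement = shared_features(tokens)
# Constraint expressing a dependency: the tokens only form a chunk
# if at least one reading is shared by all of them.
is_valid_chunk = bool(agreement)
```

Here the intersection leaves a single shared reading, which is the kind of agreement check a chunk rule could use to reject sequences whose tokens cannot agree.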
Applying and processing rules [Diagram: corpus, grammar rules, lexicon; Perl scripts for rule application, post-processing, and annotation of results]
Advantages of the system • efficient work even with large corpora • modular query language • interactive grammar development • powerful post-processing of rules
Annotated chunk categories • Adverbial phrases (AdvP) • Adjectival phrases (AP) • Noun phrases (NP) • Prepositional phrases (PP) • Verbal complexes (VC) • Clauses (CL)
Additional information • head lemma • morpho-syntactic information • lexical-semantic properties
Other lexical-semantic properties • VC with separated prefix: pref, e.g. Er kommt an (he arrives) • PP with contracted preposition and article: fus, e.g. am Bahnhof (at the station) • complex APs embedding PPs: pp, e.g. über die Köpfe der Apostel gesetzten (placed above the heads of the apostles) • AP with deverbal adjectives: vder
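One way to picture the annotation attached to a chunk is as a small record. The sketch below is only an illustration: the class, its field names, the feature string, and the choice of the noun as head lemma of the PP are assumptions, not YAC's actual output format. Only the category label and the property name fus come from the slides.

```python
from dataclasses import dataclass, field


@dataclass
class Chunk:
    """Hypothetical record for one annotated chunk."""
    cat: str                                       # chunk category, e.g. "PP"
    text: str                                      # covered token string
    head_lemma: str                                # lemma of the chunk head
    agreement: set = field(default_factory=set)    # morpho-syntactic readings
    properties: set = field(default_factory=set)   # lexical-semantic properties


# "am Bahnhof": PP with contracted preposition and article (property "fus").
# Taking the noun as head lemma is an assumption made for this example.
pp = Chunk(
    cat="PP",
    text="am Bahnhof",
    head_lemma="Bahnhof",
    agreement={"Dat:M:Sg"},
    properties={"fus"},
)

has_fused_preposition = "fus" in pp.properties
```

A query over such records could then filter on properties directly, for instance selecting all PPs that carry fus.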
Chunking process [Diagram: lexicon; corpus; First Level, Second Level, Third Level]
Chunking process • First Level: lexical information is introduced; chunks with specific internal structure are built; non-recursive chunks are built • Second Level: main parsing level; complex (recursive) structures are built in several iterations • Third Level: the chunk hierarchy is built
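The three levels can be sketched as a toy cascade. This is a minimal Python illustration under stated assumptions, not the CQP/Perl implementation: the rule patterns, the node representation, the example phrase, and the reading of the third level as rendering the nested structure are all made up for this example.

```python
import re


def apply_rule(nodes, new_cat, pattern):
    """Wrap each left-to-right match of `pattern` in a new node.

    `pattern` is a regex over the space-terminated category sequence,
    so every category in it must be written with a trailing space.
    """
    regex = re.compile(pattern)
    out, i = [], 0
    while i < len(nodes):
        rest = "".join(n["cat"] + " " for n in nodes[i:])
        m = regex.match(rest)
        if m and m.end() > 0:
            k = m.group(0).count(" ")      # number of nodes covered
            out.append({"cat": new_cat, "children": nodes[i:i + k]})
            i += k
        else:
            out.append(nodes[i])
            i += 1
    return out


# Hypothetical rule sets; rule order matters within a level.
LEVEL1 = [("AdvP", r"ADV "), ("AP", r"ADJA ")]            # non-recursive
LEVEL2 = [("AP", r"AdvP AP "),                            # recursive AP
          ("NP", r"(ART )?(AP )*NN "),
          ("PP", r"APPR NP ")]


def chunk(tokens):
    nodes = [{"cat": pos, "word": word} for word, pos in tokens]
    for cat, pat in LEVEL1:                # first level: one pass
        nodes = apply_rule(nodes, cat, pat)
    while True:                            # second level: iterate to a fixpoint
        new_nodes = nodes
        for cat, pat in LEVEL2:
            new_nodes = apply_rule(new_nodes, cat, pat)
        if new_nodes == nodes:
            return nodes
        nodes = new_nodes


def to_brackets(node):
    """Third level (as assumed here): render the chunk hierarchy."""
    if "word" in node:
        return node["word"]
    inner = " ".join(to_brackets(c) for c in node["children"])
    return "[" + node["cat"] + " " + inner + "]"


tokens = [("mit", "APPR"), ("dem", "ART"), ("sehr", "ADV"),
          ("alten", "ADJA"), ("Haus", "NN")]
result = [to_brackets(n) for n in chunk(tokens)]
```

The second-level loop stops as soon as a full pass over the rules changes nothing, which is one simple way to realize "several iterations" without fixing their number in advance.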
Advantages • specific rules do not interact with main parsing rules • additional (e.g. domain specific) rules can be included easily • main parsing rules can be kept simple • number of main parsing rules can be kept small
Sample query: adjective + verb + finite clause • general pattern: VC Adjuncts* AP CL
Sample query: adjective + verb + finite clause • refined pattern: VC (AdvP|PP|NPtemp|CLrel)* APpred CLfin
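The refined pattern is in effect a regular expression over chunk labels. The sketch below shows that idea in plain Python rather than in CQP's query language; the function name and the example label sequences are assumptions, while the labels themselves are taken from the slide.

```python
import re

# Each label in the sequence is followed by one space, so the pattern
# is written the same way; (?<!\S) forces a match to start at a label.
QUERY = re.compile(r"(?<!\S)VC (?:(?:AdvP|PP|NPtemp|CLrel) )*APpred CLfin ")


def find_matches(cats):
    """Return (start, end) index pairs of label spans matching QUERY."""
    text = "".join(c + " " for c in cats)
    # Map the character offset of each label back to its list index.
    offsets, pos = {}, 0
    for i, c in enumerate(cats):
        offsets[pos] = i
        pos += len(c) + 1
    hits = []
    for m in QUERY.finditer(text):
        start = offsets[m.start()]
        hits.append((start, start + m.group(0).count(" ")))
    return hits


# Hypothetical label sequence for a clause such as
# "Es war schon immer klar, daß ...".
with_adjunct = ["NP", "VC", "AdvP", "APpred", "CLfin"]
# A plain NP between verb and adjective is not an allowed adjunct.
with_object = ["NP", "VC", "NP", "APpred", "CLfin"]

hits_adjunct = find_matches(with_adjunct)
hits_object = find_matches(with_object)
```

Restricting the adjunct slot to specific categories is what separates the refined pattern from the general one: intervening material is tolerated only if it cannot be an argument of the verb.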
Target data • predicative(-like) constructions: Es war klar, daß ... (It was clear, that ...) • ... with adverbial pronoun: Er ist davon überzeugt, daß ... (He is of it convinced, that ...) • ... with reflexive pronoun: Es zeigt sich deutlich, daß ... (It shows itself clear, that ...)
Target data • ... with infinitival clauses: Es ist möglich, ihn zu besuchen. (It is possible, him to visit.) • ... with clause in topicalized position: Daß ..., ist klar. (That ..., is clear.) Ihn zu besuchen, ist möglich. (Him to visit, is possible.)
Conclusion • recursive chunking is a workable compromise between depth of analysis and robustness • extracted data show correlations between: collocational preference; subcategorization frames; semantic classes of adjectives; and, to a certain extent, distributional preferences