Text Analysis Meets Computational Lexicography

Hannah Kermes

Motivation • maintenance of consistency and completeness within lexica • computer-assisted methods • lexical engineering • scalable lexicographic work process • processes reproducible on large amounts of text

Motivation • rising interest in using evidence derived from automatic syntactic analysis • statistical tools (PoS tagging etc.) and traditional chunkers do not provide enough information for corpus linguistic research • full parsers are not robust enough • need for analysis tools that meet the specific needs of corpus linguistic studies

Information needed • syntactic information: subcategorization patterns • semantic information: selectional preferences, collocations, MWL • morphological information: case, number, gender; compounding and derivation

A corpus linguistic approach

Hypothesis: The better and more detailed the off-line annotation, the better and faster the on-line extraction. However, the more detailed the off-line annotation, the more complex the grammar, the more time-consuming and difficult the grammar development, and the slower the parsing process.

Requirements for the tool • it has to work on unrestricted text • shortcomings in the grammar should not lead to a complete failure to parse • no manual checking should be required • should provide a clearly defined interface • annotation should follow linguistic standards

Requirements for the annotation • head lemma • morpho-syntactic information • lexical-semantic information • structural and textual information • hierarchical representation
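The annotation requirements above can be pictured as a small record type. The following is a minimal sketch and not the system's actual data format: the class name `Chunk`, its field names, and the choice of the preposition as the head of a PP are all assumptions made for illustration.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Set


@dataclass
class Chunk:
    """One annotated chunk; every field name here is an illustrative assumption."""
    category: str                       # e.g. "NP", "PP", "AP", "VC", "CL"
    tokens: List[str]                   # surface tokens covered by the chunk
    head_lemma: str                     # lemma of the chunk's head
    morph: Dict[str, str] = field(default_factory=dict)    # case, number, gender
    lexsem: Set[str] = field(default_factory=set)          # lexical-semantic properties
    children: List["Chunk"] = field(default_factory=list)  # embedded chunks

    def depth(self) -> int:
        """Number of nesting levels, counting this chunk as level 1."""
        return 1 + max((child.depth() for child in self.children), default=0)


# "am Bahnhof" (at the station): a PP embedding an NP.
np_chunk = Chunk("NP", ["Bahnhof"], "Bahnhof",
                 morph={"case": "dat", "number": "sg", "gender": "masc"})
# Assumption: the preposition is treated as the head of the PP.
pp_chunk = Chunk("PP", ["am", "Bahnhof"], "an",
                 lexsem={"fus"}, children=[np_chunk])
```

Keeping embedded chunks in `children` gives the hierarchical representation, while `morph` and `lexsem` carry the morpho-syntactic and lexical-semantic information per chunk.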

Chunking vs. full parsing • Chunking: flat non-recursive structures; simple grammar; robust and efficient; non-ambiguous output • Full parsing: full hierarchical representation; complex grammar; not very robust; ambiguous output [Diagram label: YAC]

A classical chunker • robust – works on unrestricted text • works fully automatically • does not provide full but partial analysis of text • no highly ambiguous attachment decisions are made

YAC goes beyond • extends the chunk definition of Abney • provides additional information about annotated chunks

Applying and processing rules [Diagram labels: corpus, grammar rules, lexicon, Perl scripts, rule application, post-processing, annotation of results]

Advantages of the system • efficient work even with large corpora • modular query language • interactive grammar development • powerful post-processing of rules

Annotated chunk categories • Adverbial phrases (AdvP) • Adjectival phrases (AP) • Noun phrases (NP) • Prepositional phrases (PP) • Verbal complexes (VC) • Clauses (CL)

Additional information • head lemma • morpho-syntactic information • lexical-semantic properties

Feature annotation

Some properties of NPs

Other lexical-semantic properties • VC with separated prefix (pref): Er kommt an (he arrives) • PP with contracted preposition and article (fus): am Bahnhof (at the station) • complex APs embedding PPs (pp): über die Köpfe der Apostel gesetzten (placed above the heads of the apostles) • AP with deverbal adjectives (vder)

Target data • predicative(-like) constructions: Es war klar, daß ... (It was clear, that ...) • ... with adverbial pronoun: Er ist davon überzeugt, daß ... (He is of it convinced, that ...) • ... with reflexive pronoun: Es zeigt sich deutlich, daß ... (It shows itself clear, that ...)

Target data • ... with infinitival clauses: Es ist möglich, ihn zu besuchen. (It is possible, him to visit.) • ... with clause in topicalized position: Daß ..., ist klar. (That ..., is clear.) Ihn zu besuchen, ist möglich. (Him to visit, is possible.)

Sample query • adjective + verb + finite clause: VC AP CL

Sample query • adjective + verb + finite clause: VC APpred CLfin

Sample query • adjective + verb + finite clause: VC Adjuncts* APpred CLfin

Sample query • adjective + verb + finite clause: VC (AdvP|PP|NPtemp|CLrel)* APpred CLfin
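The final form of the sample query reads like a regular expression over chunk categories. As a rough illustration, and not the query language the system uses, the pattern can be emulated with Python's `re` module on a sequence of chunk labels. The function name `contains_pattern` and the space-prefixed encoding of the labels are assumptions of this sketch.

```python
import re
from typing import List

# VC (AdvP|PP|NPtemp|CLrel)* APpred CLfin, written over space-prefixed labels.
# The trailing lookahead keeps "CLfin" from matching a longer label.
QUERY = re.compile(r" VC(?: (?:AdvP|PP|NPtemp|CLrel))* APpred CLfin(?= |$)")


def contains_pattern(labels: List[str]) -> bool:
    """Return True if the label sequence contains a match for the query."""
    encoded = "".join(" " + label for label in labels)
    return QUERY.search(encoded) is not None


# Label sequence for a sentence such as "Es war gestern klar, daß ..."
# (the labels are assigned by hand here for illustration).
example = ["NP", "VC", "AdvP", "APpred", "CLfin"]
found = contains_pattern(example)
```

The starred group plays the role of the optional adjuncts between the verbal complex and the predicative adjective, so a sequence with an intervening plain NP does not match.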

adjective + verb + finite clause

Topicalized finite clause • adjective + verb + finite clause: CLfin VC (AdvP|PP|NPtemp|CLrel)* APpred

adjective + verb + finite clause

adjective + verb + infinitival clause

low freq adj + verb + infin clause

low freq adj + verb + clause

Conclusion • recursive chunking is a workable compromise between depth of analysis and robustness • extracted data show a correlation between collocational preference, subcategorization frames, semantic classes of adjectives, and, to a certain extent, distributional preferences

Evaluation on automatic PoS-tags

Evaluation on ideal PoS-tags

Chunking process [Diagram labels: Corpus, Lexicon, First Level, Second Level, Third Level]

Chunking process • First Level: lexical information is introduced; chunks with specific internal structure are built; non-recursive chunks are built • Second Level: main parsing level; complex (recursive) structures are built in several iterations • Third Level: chunk hierarchy is built
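The level structure can be illustrated with a toy chunker: one pass that builds non-recursive chunks from PoS-tagged tokens, followed by repeated passes that build recursive structures until nothing changes. This is only a sketch under stated assumptions. The two rule sets, the tag names, and the functions `apply_rules` and `chunk` are hypothetical and far simpler than a real grammar.

```python
# A node is (label, children); a leaf is (pos_tag, [word]).
# Hypothetical rules: (parent label, sequence of child labels).
FIRST_LEVEL = [("NP", ["ART", "NN"]), ("NP", ["NN"])]
SECOND_LEVEL = [("PP", ["APPR", "NP"]), ("NP", ["NP", "PP"])]


def apply_rules(nodes, rules):
    """One left-to-right pass; at each position the first matching rule wins."""
    out, i = [], 0
    while i < len(nodes):
        for parent, pattern in rules:
            window = nodes[i:i + len(pattern)]
            if [label for label, _ in window] == pattern:
                out.append((parent, window))  # wrap matched nodes in a new chunk
                i += len(pattern)
                break
        else:
            out.append(nodes[i])              # no rule matched; keep the node
            i += 1
    return out


def chunk(tagged):
    """Build nested chunks from (word, tag) pairs."""
    nodes = [(tag, [word]) for word, tag in tagged]
    nodes = apply_rules(nodes, FIRST_LEVEL)       # non-recursive chunks
    while True:                                   # recursive structures, iterated
        new_nodes = apply_rules(nodes, SECOND_LEVEL)
        if new_nodes == nodes:                    # fixpoint: nothing left to build
            return nodes
        nodes = new_nodes


# "der Mann in dem Haus" (the man in the house)
tree = chunk([("der", "ART"), ("Mann", "NN"), ("in", "APPR"),
              ("dem", "ART"), ("Haus", "NN")])
```

In this sketch the nested tuples returned by `chunk` stand in for the chunk hierarchy. The second-level loop terminates because every rule application replaces two nodes with one.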

Rule blocks

Advantages • specific rules do not interact with main parsing rules • additional (e.g. domain specific) rules can be included easily • main parsing rules can be kept simple • number of main parsing rules can be kept small
