Off-line (and On-line) Text Analysis for Computational Lexicography Hannah Kermes
Introduction • Motivation • computational lexicography • corpus linguistics • Approaches to text analysis • symbolic vs. probabilistic approaches • hand-written vs. learned • on-line queries vs. chunking vs. full parsing • Requirements • for the extraction tool • for the corpus annotation • classical chunking
Motivation • maintenance of consistency and completeness within lexica • computer-assisted methods • lexical engineering • scalable lexicographic work process • processes reproducible on large amounts of text • statistical tools (PoS tagging etc.) and traditional chunkers do not provide enough information for corpus linguistic research • full parsers are not robust enough • need for analysis tools that meet the specific needs of corpus linguistic studies
Dictionaries • for human use • printed monolingual dictionaries • electronic dictionaries • machine-readable dictionaries for NLP applications
Printed monolingual dictionaries • intend to cover most important semantic and syntactic aspects • maintenance of consistency and completeness is a problem: • information is missing • entries are incomplete • information is not consistent • language changes have to be covered
Electronic dictionaries • enormous amounts of information can be stored in a compact format • search engines allow for easy and fast access to desired data • users can choose how much and what kind of information they are interested in • reference corpus as additional knowledge source
Machine readable dictionaries • NLP applications need detailed and consistent information about words • detailed morphological information • subcategorization frames of verbs, adjectives, nouns • specific syntactic information • selectional preferences • collocations • idiomatic usage
Information needed • syntactic information • subcategorization patterns • semantic information • selectional preferences, collocations • synonyms • multi-word units • lexical classes • morphological information • case, number, gender • compounding and derivation
Requirements for the tool • it has to work on unrestricted text • shortcomings in the grammar should not lead to a complete failure to parse • no manual checking should be required • should provide a clearly defined interface • annotation should follow linguistic standards
Requirements for the annotation • head lemma • morpho-syntactic information • lexical-semantic information • structural and textual information • hierarchical representation
Hypothesis The better and more detailed the off-line annotation, the better and faster the on-line extraction. However, the more detailed the off-line annotation, the more complex the grammar, the more time consuming and difficult the grammar development, and the slower the parsing process.
Three different dimensions • type of grammar • symbolic grammar • probabilistic grammar • type of grammar development • hand-written grammar • learning methods • depth of analysis • analysis on token level only • full parsing • partial parsing
Symbolic approaches • precise rules can be formulated • lexical knowledge can be included • results can be predicted and controlled • sometimes not sufficient to solve ambiguities • only phenomena which are explicit in the grammar can be dealt with
Unification-based grammars • usually complex grammars • model the hierarchical structure of language • handle attachment ambiguities • determine relations among constituents and their grammatical function • extensive use of lexical information • richness and complexity of rules not only solve ambiguities, but produce them as well • usually a large number of possible analyses
Context-free Grammars (CFG) • formal grammars consisting of a set of recursive rewriting rules • small and modular grammar • minimal interaction among rules • parsing process usually fast • covers only basic aspects of language • robustness rules are used to overcome shortcomings in the grammar
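The "set of recursive rewriting rules" can be made concrete with a toy recognizer. This is a minimal sketch, not part of the systems discussed: the grammar, the German example words, and the function names are all invented for illustration.

```python
# Minimal top-down recognizer for a toy context-free grammar.
# Grammar and example sentences are invented for illustration.
GRAMMAR = {
    "S":   [["NP", "VP"]],
    "NP":  [["Det", "N"], ["N"]],
    "VP":  [["V", "NP"], ["V"]],
    "Det": [["die"]],
    "N":   [["Flammen"], ["Apostel"]],
    "V":   [["brennen"]],
}

def parses(symbol, tokens):
    """Yield every remainder of `tokens` left after deriving `symbol`
    from some prefix of `tokens` (recursive rewriting)."""
    for expansion in GRAMMAR.get(symbol, []):
        rests = [tokens]
        for sym in expansion:
            rests = [r for rest in rests for r in parses(sym, rest)]
        yield from rests
    if symbol not in GRAMMAR:          # terminal: must match next token
        if tokens and tokens[0] == symbol:
            yield tokens[1:]

def recognize(sentence):
    """True if the whole sentence can be derived from S."""
    return any(rest == [] for rest in parses("S", sentence.split()))
```

The small, modular rule set and minimal interaction among rules keep the search cheap, but the grammar obviously covers only the phenomena it spells out, which is exactly the robustness trade-off the slide names.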
Probabilistic approaches • supervised or unsupervised training of rules • all possible analyses are produced • no need for comprehensive lexical or linguistic knowledge • rules can be left underspecified • depend on the training corpus • highly frequent phenomena are preferred over low-frequency phenomena
Probabilistic context-free grammar • CFG rules enriched with probabilities • make use of underspecification • not as fast as a plain CFG • special case: head-lexicalized context-free grammar • unsupervised • grammar rules are indexed by the lemma of the syntactic head • extraction is performed on the rule set rather than on the annotated corpus
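How "CFG rules enriched with probabilities" resolve an ambiguity can be sketched as follows. The rule probabilities and the PP-attachment example are invented numbers, not estimates from any real treebank, and only serve to show the mechanics: a derivation's probability is the product of the probabilities of the rules it uses.

```python
from math import prod  # Python 3.8+

# Hypothetical rule probabilities for one PP-attachment ambiguity.
# All numbers are invented for illustration.
RULE_PROB = {
    ("VP", ("V", "NP", "PP")): 0.3,   # PP attaches to the verb
    ("VP", ("V", "NP")):       0.7,
    ("NP", ("NP", "PP")):      0.2,   # PP attaches to the noun
    ("NP", ("Det", "N")):      0.8,
}

def derivation_prob(rules):
    """Probability of a derivation = product of its rule probabilities."""
    return prod(RULE_PROB[r] for r in rules)

verb_attach = [("VP", ("V", "NP", "PP")), ("NP", ("Det", "N"))]
noun_attach = [("VP", ("V", "NP")), ("NP", ("NP", "PP")),
               ("NP", ("Det", "N"))]

# The parser keeps the most probable derivation.
best = max([verb_attach, noun_attach], key=derivation_prob)
```

This also makes the training-corpus dependency visible: whichever attachment is more frequent in training wins, which is why highly frequent phenomena are preferred over low-frequency ones.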
Hand-written rules • good control of the rule system • negative evidence can be taken into account • depends heavily on the expertise of the grammar writer
Learning grammar rules • infer grammar from text corpora • extensional syntactic descriptions (annotations) are turned into intensional descriptions (rules) • optimal or suboptimal training data • new resources in the form of text corpora can be exploited • more or less independent of the knowledge of the grammar developer • depends heavily on the learning corpus • needs an annotated, well-balanced corpus
Memory-based learning • special case of learning • the most prominent example is data-oriented parsing (DOP) • fragments are stored and as such replace the grammar • language generation and analysis are performed by combining the memorized fragments • needs a structurally annotated corpus • the training corpus has great impact on the performance of the system • highly sensitive to suboptimal data • needs large storage capacity
Annotation on token level • usually a form of pattern matching • completely flexible • does not depend on previous syntactic analysis • easily adaptable to different text types • full syntactic analysis has to be performed by extraction queries • queries can become rather complex • often restricted to simple contexts
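Token-level extraction as "a form of pattern matching" can be illustrated by running a regular expression over a tagged token stream, roughly what a corpus query such as a CQP pattern over PoS attributes does. The encoding (`word/TAG`), the sentence, and the query are invented for illustration; the tags follow the STTS tagset used by German taggers.

```python
import re

# A corpus at token level: just words with PoS tags (STTS), no syntax.
# Example tokens are invented for illustration.
tagged = [("mit", "APPR"), ("kleinen", "ADJA"), ("gesetzten", "ADJA"),
          ("Flammen", "NN"), ("und", "KON"), ("die", "ART"),
          ("Köpfe", "NN"), ("der", "ART"), ("Apostel", "NN")]

# Encode each token as "word/TAG" so one regex can scan the sequence;
# the query itself has to encode all the syntax (here: a preposition,
# any number of attributive adjectives, then a noun).
text = " ".join(f"{w}/{t}" for w, t in tagged)
pattern = re.compile(r"\S+/APPR( \S+/ADJA)*( \S+/NN)")

match = pattern.search(text)
```

Because the "analysis" lives entirely in the query, the approach is completely flexible, but every extraction query has to redo the syntactic work, which is why such queries grow complex and stay restricted to simple contexts.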
Full Parsing • provides rich and detailed information about structures, relations and functions • extraction queries simply have to collect the annotated information • slow parsing speed • lack of robustness • depend heavily on prerequisite lexical information • ambiguous output
Chunking • relatively simple grammar rules • no need for extensive linguistic and lexicographic information • robust • usually non-hierarchical and non-recursive structures • annotated structures are simple and convey less information
Classical chunk definition • Abney 1991: The typical chunk consists of a single content word surrounded by a constellation of function words, matching a fixed template • Abney 1996: a non-recursive core of an intra-clausal constituent, extending from the beginning of the constituent to its head
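Abney's "fixed template" reading of a chunk can be sketched as a greedy matcher over PoS tags: function words and premodifiers up to, and ending at, the content-word head, with nothing after the head. The tag classes below are a drastic simplification invented for illustration, not Abney's actual templates.

```python
# Classical chunking per the Abney template: a chunk runs from the
# beginning of the constituent to its head, with no post-head material.
# Tag classes (STTS-based) are simplified for illustration.
PRE  = {"ART", "APPR", "ADJA"}    # function words and premodifiers
HEAD = {"NN", "NE", "VVFIN"}      # content-word heads

def chunk(tags):
    """Greedy left-to-right chunker; returns (start, end_exclusive) spans."""
    chunks, i = [], 0
    while i < len(tags):
        start = i
        while i < len(tags) and tags[i] in PRE:
            i += 1                              # collect pre-head material
        if i < len(tags) and tags[i] in HEAD:
            chunks.append((start, i + 1))       # close chunk at the head
            i += 1
        else:
            i = start + 1                       # no head found: skip token
    return chunks
```

For a tag sequence like APPR ADJA NN KON ART NN this yields two chunks and leaves the conjunction outside, which already hints at the extraction problem discussed below: the chunks are never combined into larger structures.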
State-of-the-art systems • CASS parser • finite-state cascades • flat, non-recursive structures • small lexicon (tag-fixes) • information about the head is given as an attribute • Conexor • symbolic constraint grammar parser • full-fledged grammar for English (ENGCG) • German: • simple, non-recursive structure • no lexical information available • head lemma indicated by a special tag
State-of-the-art systems • KaRoParse • top-down bottom-up parser • includes recursion • internal structure is flat and non-hierarchical • no agreement or lexical information • Schiehlen's chunker • symbolic context free grammar • recursion • no head lemma or lexical-semantic information • needs optimally tokenized text (including MWL recognition)
State-of-the-art systems • Chunkie • uses TnT-tagger to assign tree fragments to sequences of PoS-tags • recursion in pre-head position (maximal depth of three) • head lemma information, yet no agreement or lexical information • Cascaded Markov Models • stochastic context free grammar rules • several layers, each layer serving as input to the next • hierarchical phrases, including complex recursion • head lemma information, yet no agreement or lexical information
Problems for extraction • Kübler and Hinrichs (2001): "focused on the recognition of partial constituent structures at the level of individual chunks […], little or no attention has been paid to the question of how such partial analysis can be combined into larger structures for complete utterances."
An example • chunk analysis: [PC mit kleinen ], [PC über die Köpfe ] [NC der Apostel ] [NC gesetzten Flammen ] • full parse: [PP mit [NP [AP kleinen ], [AP über [NP die Köpfe [NP der Apostel ] ] gesetzten ] Flammen ] ] • gloss: mit 'with', kleinen 'small', über 'above', die Köpfe 'the heads', der Apostel 'of the apostles', gesetzten 'set', Flammen 'flames' • 'with small flames set above the heads of the apostles'
Problems for extraction • four NCs instead of only one NP • AN-pair: • gesetzten + Flammen • kleine + Flammen • NN-pair Köpfe + Apostel needs agreement information • VN-pair setzen + Flammen needs information about the deverbal character of gesetzten • a more complex analysis is needed • PCs and NCs need to be combined
Simple solution • PP → PC (PC|NC)* • theoretical motivation? • rule covers this particular example, other examples might need additional rules • rule is vague and largely underspecified • not very reliable • internal structure is mainly left opaque
Complex solution • NP → NC NCgen • PP → preposition NP • AP → PP adjective • NP → AP* noun
Complex solution • solution for this particular example only • large number of rules needed • rules have to be repeated for every instance of a complex phrase • in order to support extractions, the classic chunk concept has to be extended
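The "complex solution" rules combine chunks layer by layer. A minimal sketch of such cascaded rewriting, with the rule set hard-coded to this one example (and `AP* noun` simplified to a single AP), might look like this; categories and the driver function are invented for illustration.

```python
# Layered application of the "complex solution" rules: each pass rewrites
# a sequence of chunk categories until no rule fires any more.
# NCgen = genitive NC; the noun of "NP -> AP* noun" is an NC here.
RULES = [
    (("NC", "NCgen"), "NP"),    # NP -> NC NCgen
    (("APPR", "NP"),  "PP"),    # PP -> preposition NP
    (("PP", "ADJA"),  "AP"),    # AP -> PP adjective
    (("AP", "NC"),    "NP"),    # NP -> AP* noun (single AP case)
]

def parse(cats):
    """Rewrite a list of chunk categories bottom-up to a fixpoint."""
    cats, changed = list(cats), True
    while changed:
        changed = False
        for lhs, rhs in RULES:
            n = len(lhs)
            for i in range(len(cats) - n + 1):
                if tuple(cats[i:i + n]) == lhs:
                    cats[i:i + n] = [rhs]       # replace matched span
                    changed = True
                    break
            if changed:
                break
    return cats
```

On the example "über die Köpfe der Apostel gesetzten Flammen", chunked as APPR NC NCgen ADJA NC, the cascade collapses the sequence to a single NP. The point of the slide stands out clearly: every further complex phrase needs its own such rules, which is why the classic chunk concept has to be extended instead.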
Chunking vs. Full Parsing vs. YAC • full parsing: full hierarchical representation; complex grammar; not very robust; ambiguous output • chunking: flat, non-recursive structures; simple grammar; robust and efficient; non-ambiguous output • YAC: positioned between the two
Conclusion • recursive chunking workable compromise between depth of analysis and robustness • extracted data show correlation between • collocational preference • subcategorization frames • semantic classes of adjectives • to a certain extent distributional preferences
General Concept • a recursive chunker for unrestricted German text • technical framework • CWB • CQP • output formats • advantages of the architecture • general framework of YAC • linguistic coverage • feature annotation • chunking process
A recursive chunker for unrestricted German text • recursive chunker for unrestricted German text • fully automatic analysis • main goal: provide a useful basis for extraction of linguistic as well as lexicographic information from corpora
General aspects • based on a symbolic regular expression grammar • grammar rules written in CQP • basis: • tokenization • PoS-tagging • lemmatization (TreeTagger) • agreement information (IMSLex)
A typical chunker • robust – works on unrestricted text • works fully automatically • does not provide full but partial analysis of text • no highly ambiguous attachment decisions are made
YAC goes beyond • extends the chunk definition of Abney • recursive embedding • post-head embedding • provides additional information about annotated chunks • head lemma • agreement information • lexical-semantic and structural properties
Extended chunk definition A chunk is a continuous part of an intra-clausal constituent including recursion and pre-head as well as post-head modifiers but no PP-attachment, or sentential elements.
Technical Framework • [architecture diagram: corpus, grammar rules, and lexicon feed into rule application (CQP); Perl scripts drive rule application, post-processing, and annotation of results]
Technical framework - CQP • regular expression matching on token and annotation strings • tests for membership in user specific word lists • feature set operations • constraints to specify dependencies
Perl-Scripts • invocation of CQP • processing of the results • annotation of the results into the corpus
Postprocessing • values can be checked • values can be changed • values can be compared • range of structures can be changed
Output formats • CQP format, used for: • interactive grammar development • parsing • extraction • an XML format, used for: • hierarchy building • extraction • data exchange
Advantages of the system • efficient work even with large corpora • modular query language • interactive grammar development • powerful post-processing of rules
Linguistic coverage • Adverbial phrases (AdvP) • schön stark (beautifully strong) • daher (from there); irgendwoher (from anywhere) • heim (home); querfeldein (cross-country) • innen (inside); überall (everywhere) • sehr bald (very soon) • jetzt (now); damals (at that time)
Linguistic coverage • Adjectival phrases (AP) • möglich (possible) • schreiend lila (screamingly purple) • rund zwei Meter hohe (around two meters high) • über die Köpfe der Apostel gesetzten (set above the heads of the apostles) 'set above the heads of the apostles'