Language Technology (A Machine Learning Approach)
Antal van den Bosch & Walter Daelemans
antalb@uvt.nl walter.daelemans@ua.ac.be
http://ilk.uvt.nl/~antalb/ltua
Programme
• 24/2 [Antal] Intro ML for NLP & WEKA / Decision Trees
• 3/3 [Walter] ML for shallow parsing
• 10/3 [Antal] ML for morphology and phonology
• 17/3 [Antal] ML for information extraction
• 24/3 [Antal] ML for discourse
• 31/3 [Véronique] ML for coreference
• 21/4 [Antal] Memory & Representation
• 28/4 [Antal] Modularity / More Data
• 5/5 [Walter] ML for document classification
Text Applications and LT Components
• Text applications: OCR, Spelling Error Correction, Grammar Checking, Information Retrieval, Document Classification, Information Extraction, Summarization, Question Answering, Ontology Extraction and Refinement, Dialogue Systems, Machine Translation
• LT components: Lexical / Morphological Analysis, Tagging, Chunking, Grammatical Relation Finding, Syntactic Analysis, Word Sense Disambiguation, Named Entity Recognition, Semantic Analysis, Reference Resolution, Discourse Analysis, Meaning
Outline
• Shallow parsing
• Formalisms / shallow versus deep parsing
• POS tagging
• Chunking
• Relation-finding
• Demo: http://www.cnts.ua.ac.be/cgi-bin/jmeyhi/MBSP-instant-webdemo.cgi
• Assignment 1 upload service: http://www.cnts.ua.ac.be/cgi-bin/jmeyhi/MBSP-webdemo.cgi
Formalisms for Computational Linguistics
• Orthography: finite-state (spelling rules)
• Phonology: finite-state (text-to-speech)
• Morphology: finite-state (synthesis / analysis); context-free (compounds)
• Syntax: context-free + extensions (parsing)
• Semantics: FOPC / CD (interpretation)
• Pragmatics
Classes of grammars are differentiated by a number of restrictions on the type of production rule:
• Type-0 grammar (unrestricted rewrite system). Rules have the form α → β
• Type-1 grammar (context-sensitive). Rules are of the form α → β, where |α| ≤ |β|
• Type-2 grammar (context-free). Rules are of the form A → α, where α ≠ ε
• Type-3 grammar (regular, finite). Rules are of the form A → a or A → aB
• A grammar generates the strings of L(G); an automaton accepts the strings of L(M). Structure may be assigned as a side effect.
Why parsing?
• The structure of a sentence (partially) determines its meaning:
John eats pizza with a spoon
Name(John) ∧ Pizza(p) ∧ Spoon(s) ∧ Eat-action(a1) ∧ Agent(a1, John) ∧ Patient(a1, p) ∧ Instrument(a1, s)
[Parse tree: (S (NP John) (VP (VBZ eats) (NP (NN pizza)) (PNP (IN with) (NP (DT a) (NN spoon)))))]
The full-parsing setup: a grammar (e.g., HPSG, or a CFG with rules such as S → NP VP, NP → DET N, VP → V NP), a lexicon, and a search method (bottom-up / top-down, depth-first / breadth-first, backtracking, chart parsing) map an input string to a tree.
The problem with full parsing
• A vicious trade-off between coverage and ambiguity
• The larger the grammar (more coverage), the more spurious ambiguity
Shallow parsing
• Approximate the expressive power of CFG and feature-extended CFG by means of a cascade of simple transformations
• Advantages:
• deterministic (no recursion)
• efficient (1600 words per second vs. 1 word per second in a typical comparison)
• accurate
• robust (unrestricted text, partial solutions)
• can be learned
Shallow Parsing
• Steve Abney 1991 (FST): http://www.vinartus.net/spa/
• Ramshaw & Marcus 1995 (TBL)
• CoNLL shared tasks 1999, 2000, 2001: http://cnts.uia.ac.be/signll/shared.html
• JMLR special issue 2002: http://jmlr.csail.mit.edu/papers/special/shallow_parsing02.html
Cascade
• POS tagging
• NP chunking
• XP chunking
• Grammatical relation assignment
• Function assignment
• Parsing
Approaches
• Deductive: finite-state — CASS parser (Abney, 1991); rule-based — Fidditch (Hindle, 1994)
• Inductive: transformation rules — Ramshaw & Marcus, 1995; memory-based — Daelemans/Buchholz/Veenstra, 1999; Tjong Kim Sang, 2000
Abney (1991): CASS parser
• Chunk = maximal, continuous, non-recursive syntactic segment around a head
• Comparable to a morphologically complex word in synthetic languages
• Motivation:
• linguistic (incorporate syntactic restrictions)
• psycholinguistic
• prosodic (phonological phrases)
Levels and Transformations
Levels:
• words and their part-of-speech tags
• chunks (kernel NP, VP, AP, AdvP)
  NP → D? N* N
  VP → V-tns | Aux V-ing
• simple phrases (transforming embedding into iteration)
  PP → P NP
• complex phrases
  S → PP* NP PP* VP PP*
[Figure: the cascade applied to "The woman in the lab coat thought you were sleeping" — T1 maps the tag level L0 (D N P D N N V-tns Pro Aux V-ing) to the chunk level L1 (NPs and VPs), T2 builds the PP at level L2, and T3 yields S → NP PP VP at level L3]
Pattern = category + regular expression
• The regular expression is translated into an FSA
• For each T_i we take the union of the FSAs to construct a recognizer for level L_i
• In case of more than one end state for the same input, choose the longest match
• In case of blocking, advance one word
• "Easy-first parsing" (islands of certainty)
• Extensions: add features by incorporating actions into the FSAs
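As an illustration only (not Abney's actual implementation), a minimal sketch of this pattern-to-recognizer idea, using the two chunk rules from the Levels slide; the tag names and the space-separated tag encoding are assumptions of the sketch:

    import re

    # Toy Abney-style level-1 recognizer: each pattern is a category plus a
    # regular expression over POS tags (NP -> D? N* N, VP -> V-tns | Aux V-ing).
    PATTERNS = [
        ("NP", re.compile(r"(?:D )?(?:N )*N ")),
        ("VP", re.compile(r"(?:V-tns )|(?:Aux V-ing )")),
    ]

    def chunk(tags):
        out, i = [], 0
        while i < len(tags):
            rest = " ".join(tags[i:]) + " "
            best = None  # (length in tokens, category)
            for cat, pat in PATTERNS:
                m = pat.match(rest)
                if m:
                    n = len(m.group(0).split())
                    if best is None or n > best[0]:
                        best = (n, cat)
            if best:                      # longest match wins
                out.append((best[1], tags[i:i + best[0]]))
                i += best[0]
            else:                         # blocking: advance one word
                out.append(("-", [tags[i]]))
                i += 1
        return out

    print(chunk(["D", "N", "N", "V-tns", "D", "N"]))
    # -> [('NP', ['D', 'N', 'N']), ('VP', ['V-tns']), ('NP', ['D', 'N'])]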
MBLP cascade: shallow parsing as classification
• disambiguation: tagging, PP attachment
• segmentation: chunking, relation finding
Memory-Based Learning
• Basis: the k-nearest-neighbor algorithm (a minimal sketch follows below):
• store all examples in memory
• to classify a new instance X, look up the k examples in memory with the smallest distance D(X, Y) to X
• let each nearest neighbor vote with its class
• classify instance X with the class that has the most votes in the nearest-neighbor set
• Choices:
• similarity metric
• number of nearest neighbors (k)
• voting weights
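A minimal sketch of this nearest-neighbor core, assuming the simple overlap metric (the count of mismatching feature values) and unweighted majority voting; real memory-based learners such as TiMBL add feature weighting and more refined metrics on top of this. The stored instances are invented for the example:

    from collections import Counter

    def overlap_distance(x, y):
        # Number of mismatching symbolic feature values.
        return sum(a != b for a, b in zip(x, y))

    def knn_classify(memory, x, k=3):
        # memory: list of (feature tuple, class) examples stored verbatim.
        nearest = sorted(memory, key=lambda ex: overlap_distance(x, ex[0]))[:k]
        votes = Counter(label for _, label in nearest)   # one vote per neighbor
        return votes.most_common(1)[0][0]

    memory = [(("DT", "NN", "VBZ"), "NP"),
              (("DT", "JJ", "NN"), "NP"),
              (("MD", "VB", "VBN"), "VP")]
    print(knn_classify(memory, ("DT", "NN", "VBD")))     # -> NP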
Memory-Based Learning
Advantages:
• easy combination of different features
• robustness against overfitting
• fast training and tagging with IGTree
But:
• weighting does not look at feature correlations, and averages over all feature values
• no global optimization (yet)
• trade-off between speed and accuracy
POS tagging
• Assigning morpho-syntactic categories (parts of speech) to words in context
• Disambiguation: a combination of lexical and "local" contextual constraints
POS tagging: what for?
• shallow processing (abstraction from words: higher recall)
• basic disambiguation (choose one form: higher precision)
• robustness, coverage, speed
• good enough for many applications:
• text mining: information retrieval / extraction
• corpus queries (linguistic annotation)
• terminology acquisition
• text-to-speech
• spelling correction
POS: remaining errors
• the last 3–10% is hard:
• long-distance dependencies
• genuine ambiguities
• annotation errors
• unknown words
• not enough information in the features
• more features are needed, but this has an exponential effect on data sparseness
• generalization to general text is poor: 97% → 75%
• some languages: large tag sets & small corpora
Memory-Based POS Tagger
• Case base for known words. Features: tag-2, tag-1, lex(focus), word(top100)(focus), lex+1, lex+2 → POS tag
• Case base for unknown words. Features: tag-2, tag-1, pref, cap, hyp, num, suf1, suf2, suf3, lex+1, lex+2 → POS tag
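A sketch of building such known-word cases from one tagged sentence. The lex feature is simplified here to the set of tags a word can bear according to a lexicon, and the word(top100) feature is omitted; at tagging time the left-context tags would be the tagger's own previous decisions rather than gold tags. The function names and the toy lexicon are illustrative:

    def make_cases(words, tags, lexicon, pad="_"):
        # Features: tag-2, tag-1, lex(focus), lex+1, lex+2 -> POS tag of focus.
        def lex(w):
            return pad if w == pad else "-".join(sorted(lexicon.get(w, {"UNK"})))
        padded_words = words + [pad, pad]
        padded_tags = [pad, pad] + tags
        cases = []
        for i in range(len(words)):
            left = padded_tags[i:i + 2]                       # two left-context tags
            right = [lex(padded_words[i + 1]), lex(padded_words[i + 2])]
            cases.append((tuple(left + [lex(words[i])] + right), tags[i]))
        return cases

    lexicon = {"the": {"DT"}, "board": {"NN"}, "will": {"MD", "NN"}, "join": {"NN", "VB"}}
    for case in make_cases(["the", "board", "will", "join"], ["DT", "NN", "MD", "VB"], lexicon):
        print(case)
    # first case: (('_', '_', 'DT', 'NN', 'MD-NN'), 'DT')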
Memory-Based POS Tagger
• Experimental results [results table missing]
NP Chunking as tagging
[NP Pierre Vinken NP] , [NP 61 years NP] old , [VP will join VP] [NP the board NP] of [NP directors NP] as [NP a non-executive director NP] [NP Nov 29 NP] .
Pierre/I Vinken/I ,/O 61/I years/I old/O ,/O will/O join/O the/I board/I of/O directors/I as/O a/I non-executive/I director/I Nov/B 29/I ./O
• I = inside chunk
• O = outside chunk
• B = between chunks (first word of a chunk directly following another chunk)
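A sketch of deriving these I/O/B tags from bracketed chunks, under the convention shown above (B only for a chunk that directly follows another chunk, as in Nov/B); the (label, tokens) input format is invented for the example:

    def to_iob(groups):
        # groups: list of (label, tokens); label is e.g. "NP", or None for
        # material outside any chunk. Returns (token, tag) pairs.
        out, prev_was_chunk = [], False
        for label, tokens in groups:
            if label is None:
                out += [(t, "O") for t in tokens]
                prev_was_chunk = False
            else:
                out.append((tokens[0], "B" if prev_was_chunk else "I"))
                out += [(t, "I") for t in tokens[1:]]
                prev_was_chunk = True
        return out

    groups = [("NP", ["Pierre", "Vinken"]), (None, [","]), ("NP", ["61", "years"]),
              (None, ["old", ","]), ("NP", ["director"]), ("NP", ["Nov", "29"])]
    print(to_iob(groups))
    # ... ('director', 'I'), ('Nov', 'B'), ('29', 'I') — matching the example above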
Memory-Based XP Chunker
• Assigning non-recursive phrase brackets (base XPs) to phrases in context
• Convert NP, VP, ADJP, ADVP, PrepP, and PP brackets to classification decisions (I/O/B tags) (Ramshaw & Marcus, 1995)
• Features: POS-2, IOBtag-2, word-2, POS-1, IOBtag-1, word-1, POS(focus), word(focus), POS+1, word+1, POS+2, word+2 → IOB tag
Memory-Based XP Chunker
• Results (WSJ corpus) [results table missing]
• One-pass segmentation and chunking for all XPs
• Useful for information retrieval, information extraction, terminology discovery, etc.
Finding subjects and objects
• Problems:
• one sentence can have more than one subject/object in case of more than one VP
• one VP can have more than one subject/object in case of conjunctions
• one NP can be linked to more than one VP
• subject/verb or verb/object pairs can be discontinuous
Task Representation
• From tagged and chunked sentences, extract:
• distance from verb to head, in chunks
• number of VPs between verb and head
• number of commas between verb and head
• the verb and its POS
• two words/chunks of context to the left (word + POS)
• one word/chunk of context to the right
• the head itself
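A sketch of the three distance features from the list above, for one candidate verb/head pair, assuming the sentence has already been reduced to a list of chunk types in order; the representation and names are illustrative:

    def distance_features(chunk_types, verb_i, head_i):
        # chunk_types: one entry per chunk, e.g. ["NP", "COMMA", "VP", "NP"].
        lo, hi = sorted((verb_i, head_i))
        between = chunk_types[lo + 1:hi]
        return {"distance": hi - lo,                       # verb-to-head, in chunks
                "vps": sum(c == "VP" for c in between),    # VPs in between
                "commas": sum(c == "COMMA" for c in between)}

    print(distance_features(["NP", "COMMA", "NP", "COMMA", "VP", "NP"],
                            verb_i=4, head_i=0))
    # -> {'distance': 4, 'vps': 0, 'commas': 2}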
Memory-Based GR labeling
• Assigning labeled grammatical relation links between words in a sentence: GRs of the focus in relation to verbs (subject, object, location, …, none)
• Features:
• focus: prep, adv-func, word+1, word0, word-1, word-2, POS+1, POS0, POS-1, POS-2, chunk+1, chunk0, chunk-1, chunk-2
• verb: POS, word
• distance: words, VPs, commas
→ GR type
Memory-Based GR labeling
• Results (WSJ corpus): subjects 83%, objects 87%, locations 47%, time 63%
• Completes the shallow parser. Useful for, e.g., question answering, IE, etc.
[Architecture diagram: TEXT → shallow parsing → analyzed corpus / analyzed sentences → clustering and pattern matching, Q-A alignment, deletion rules, IE rules → applications: ontology extraction, question answering, summarization, bio-medical IE]
More about these projects: http://www.cnts.ua.ac.be/cnts/projects
[Pipeline diagram: Text In → Tokenizer (Perl) → MBSP (Perl), which calls an MBT POS-tagger server (with TiMBL servers for known and unknown words), a TiMBL lemmatizer server, a TiMBL phrase-chunker server, a TiMBL relation-finder server, and an MBT concept-tagger server; built on TiMBL 5.0 and MBT 2.0, http://ilk.uvt.nl/]
What should be in?
• shallow parsing (tagging, chunking, grammatical relations)
• semantic roles
• domain semantics (NER / concept tagging)
• Can negation, modality, and quantification be solved as classification tasks?
Conclusions
• Text mining tasks benefit from linguistic analysis (shallow understanding).
• Understanding can be formulated as a flexible heterarchy of classifiers.
• These classifiers can be trained on annotated corpora.
Tagging with WEKA
• Data: http://ilk.uvt.nl/~antalb/ltua/pos-eng-5000.data.arff
• Cases: a seven-word window slides over the text; the class is the POS tag of the middle word
_ _ _ The cafeteria remains closed → DT
_ _ The cafeteria remains closed PERIOD → NN
_ The cafeteria remains closed PERIOD _ → VBZ
The cafeteria remains closed PERIOD _ _ → JJ
cafeteria remains closed PERIOD _ _ _ → PERIOD
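A sketch of generating such cases from one tagged sentence (the window width and padding symbol follow the cases above; writing the surrounding ARFF header is left out):

    def window_cases(words, tags, width=3, pad="_"):
        # Slide a (2*width+1)-word window over the sentence; the class is
        # the POS tag of the centre word, as in the cases above.
        padded = [pad] * width + words + [pad] * width
        return [(padded[i:i + 2 * width + 1], tags[i]) for i in range(len(words))]

    words = ["The", "cafeteria", "remains", "closed", "PERIOD"]
    tags  = ["DT", "NN", "VBZ", "JJ", "PERIOD"]
    for feats, cls in window_cases(words, tags):
        print(" ".join(feats), cls)
    # reproduces the five cases listed above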
Assignment 1 • http://ilk.uvt.nl/~antalb/ltua/assignment1.html
WSJ MBSP output:
An/DT/I-NP/O/NP-SBJ-1/an online/NN/I-NP/O/NP-SBJ-1/online demo/VBZ/I-VP/O/VP-1/demo can/MD/I-VP/O/VP-1/can be/VB/I-VP/O/VP-1/be found/VBN/I-VP/O/VP-1/find here/RB/I-ADVP/O/O/here ;/:/O/O/O/; the/DT/I-NP/O/NP-SBJ-2/the page/NN/I-NP/O/NP-SBJ-2/page features/VBZ/I-VP/O/VP-2/feature more/JJR/I-NP/O/NP-OBJ-2/more references/NNS/I-NP/O/NP-OBJ-2/reference and/CC/I-NP/O/NP-OBJ-2/and explanations/NNS/I-NP/O/NP-OBJ-2/explanation of/IN/I-PP/B-PNP/O/of the/DT/I-NP/I-PNP/O/the tagsets/NNS/I-NP/I-PNP/O/tagset and/CC/I-NP/I-PNP/O/and color/NN/I-NP/I-PNP/O/color codes/NNS/I-NP/I-PNP/O/code used/VBN/I-VP/O/VP-3/use to/TO/I-VP/O/VP-3/to encode/VB/I-VP/O/VP-3/encode the/DT/I-NP/O/NP-OBJ-3/the syntactic/JJ/I-NP/O/NP-OBJ-3/syntactic information/NN/I-NP/O/NP-OBJ-3/information ././O/O/O/.
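Each token above packs six slash-separated fields: word, POS tag, chunk tag, PNP tag, relation, lemma (the field order is read off the example; a robust reader would also need to handle words that themselves contain slashes). A sketch of unpacking the format:

    def parse_mbsp(line):
        fields = ("word", "pos", "chunk", "pnp", "relation", "lemma")
        return [dict(zip(fields, token.split("/"))) for token in line.split()]

    sample = "An/DT/I-NP/O/NP-SBJ-1/an online/NN/I-NP/O/NP-SBJ-1/online"
    for tok in parse_mbsp(sample):
        print(tok["word"], tok["pos"], tok["relation"])
    # An DT NP-SBJ-1
    # online NN NP-SBJ-1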
BioMinT MBSP output:
<S cnt="s3">
  <NP rel="SBJ" of="s3_1">
    <W pos="DT">An</W>
    <W pos="NN">online</W>
  </NP>
  <VP id="s3_1">
    <W pos="NN">demo</W>
    <W pos="MD">can</W>
    <W pos="VB">be</W>
    <W pos="VBN">found</W>
  </VP>
  <ADVP>
    <W pos="RB">here</W>
  </ADVP>
  <W pos="colon">;</W>
  <NP rel="SBJ" of="s3_2">
    <W pos="DT">the</W>
    <W pos="JJ">page</W>
  </NP>
  <VP id="s3_2">
    <W pos="NNS">features</W>
  </VP>
  <NP rel="OBJ" of="s3_2">
    <W pos="RBR">more</W>
    <W pos="NNS">references</W>
    <W pos="CC">and</W>
    <W pos="NNS">explanations</W>
  </NP>
  <PNP>
    <PP>
      <W pos="IN">of</W>
    </PP>
    <NP>
      <W pos="DT">the</W>
      <W pos="NNS">tagsets</W>
      <W pos="CC">and</W>
      <W pos="NN">color</W>
      <W pos="VBZ">codes</W>
    </NP>
  </PNP>
  <VP id="s3_3">
    <W pos="VBN">used</W>
    <W pos="TO">to</W>
    <W pos="VB">encode</W>
  </VP>
  <NP rel="OBJ" of="s3_3">
    <W pos="DT">the</W>
    <W pos="JJ">syntactic</W>
    <W pos="NN">information</W>
  </NP>
  <W pos="period">.</W>
</S>