CPSC 503 Computational Linguistics
Lecture 3
Giuseppe Carenini
CPSC503 Winter 2008
• Subscribe to mailing list
• Some more Intros
• NLP@UBC
Introductions
• Your Name
• Previous experience in NLP?
• Why are you interested in NLP?
• Are you thinking of NLP as your main research area? If not, what else do you want to specialize in…
• Anything else…
NLP research at UBC
TOPICS
• Generation and Summarization of Evaluative Text (e.g., customer reviews)
• Summarization of conversations (emails, blogs, meetings)
PEOPLE: G. Carenini & R. Ng (Profs), G. Murray (Postdoc) + Students
SUPPORT: NSERC, Google, BObjects (now SAP), MSResearch
Linguistic Knowledge and the Formalisms (with associated Algorithms) used for it
• (English) Morphology: State Machines (no prob.)
  • Finite State Automata (and Regular Expressions)
  • Finite State Transducers
• Syntax: Rule systems (and prob. version), e.g., (Prob.) Context-Free Grammars
• Semantics: Logical formalisms (First-Order Logics)
• Pragmatics, Discourse and Dialogue: AI planners
Computational tasks in Morphology
• Recognition: recognize whether a string is an English/… word (FSA)
• Parsing/Generation: word <-> stem, class, lexical features; e.g., bought <-> buy +V +PAST-PART or buy +V +PAST
• Stemming: word -> stem
Today Sept 15
• Finite State Transducers (FSTs) and Morphological Parsing
• Stemming (Porter Stemmer)
FST definition
• Q: a finite set of states
• I, O: input and output alphabets (which may include ε)
• Σ: a finite alphabet of complex symbols i:o, with i ∈ I and o ∈ O
• Q0: the start state
• F: a set of accept/final states (F ⊆ Q)
• A transition relation δ that maps Q × Σ to 2^Q
E.g., |Q| = 3; I = {a, b, c, ε}; O = {a, b}; |Σ| = ?; 0 <= |δ| <= ?
FST can be used as…
• Translators: input one string from I, output another from O (or vice versa)
• Recognizers: input a string from I × O
• Generators: output a string from I × O
Terminology warning! E.g., if I = {a, b, c, ε}; O = {a, b}; …
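The definition and the translator/recognizer uses can be sketched in a few lines of Python. This is a minimal illustration, not a production FST library: the class name `FST`, the toy machine `PLURAL`, and its state names are hypothetical, the empty string stands in for ε, and the search assumes the machine has no cycle of ε-input arcs (otherwise it would not terminate).

```python
from dataclasses import dataclass

EPS = ""  # assumption: the empty string plays the role of ε


@dataclass
class FST:
    start: str
    finals: set
    delta: dict  # (state, (i, o)) -> set of next states

    def translate(self, text):
        """Return every output string the FST can produce while reading text."""
        outputs = set()
        agenda = [(self.start, 0, "")]  # (state, input position, output so far)
        while agenda:
            state, pos, out = agenda.pop()
            if pos == len(text) and state in self.finals:
                outputs.add(out)
            for (source, (i, o)), targets in self.delta.items():
                # An ε input symbol matches anywhere and consumes nothing.
                if source != state or not text.startswith(i, pos):
                    continue
                for target in targets:
                    agenda.append((target, pos + len(i), out + o))
        return outputs

    def invert(self):
        """Swap input and output sides (the 'or vice versa' direction)."""
        swapped = {(q, (o, i)): set(t) for (q, (i, o)), t in self.delta.items()}
        return FST(self.start, set(self.finals), swapped)

    def recognize(self, inp, out):
        """Recognizer use: accept or reject a pair of strings."""
        return out in self.translate(inp)


# Hypothetical toy machine for "cat" with lexical:surface arc labels.
PLURAL = FST(
    start="q0",
    finals={"q5"},
    delta={
        ("q0", ("c", "c")): {"q1"},
        ("q1", ("a", "a")): {"q2"},
        ("q2", ("t", "t")): {"q3"},
        ("q3", ("+N", EPS)): {"q4"},
        ("q4", ("+PL", "s")): {"q5"},
        ("q4", ("+SG", EPS)): {"q5"},
    },
)

print(PLURAL.translate("cat+N+PL"))       # lexical -> surface (generation)
print(PLURAL.invert().translate("cats"))  # surface -> lexical (parsing)
```

Storing δ as a dictionary from (state, i:o) to a set of states mirrors the 2^Q codomain in the definition, so nondeterminism needs no special handling.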
FST: inflectional morphology of plural
[Figure: FST with one branch for some regular nouns and one for some irregular nouns (including an o:i arc)]
Notes: a label X abbreviates X:X; arc labels read lexical:surface
Examples
• lexical: … / surface: m i c e
• lexical: c a t +N +PL / surface: …
Computational Morphology: Problems/Challenges
• Ambiguity: one word can correspond to multiple structures (more critical in morphologically richer languages)
• Spelling changes: may occur when two morphemes are combined, e.g., butterfly + -s -> butterflies
Ambiguity: more complex example
• What's the right parse for "unionizable"?
  • Union-ize-able
  • Un-ion-ize-able
• Each would represent a valid path through an FST for derivational morphology.
• Both Adj…
Dealing with Morphological Ambiguity
• Find all the possible outputs (all paths) and return them all (without choosing)
• Then use part-of-speech tagging to choose: look at the neighboring words
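"Return all paths" can be illustrated without a full FST by enumerating every way to split a surface word into known morphs. This is only a sketch under stated assumptions: the morph list `MORPHS` is hypothetical, it lists the surface form "iz" for -ize before -able (a spelling change the sketch does not model), and it ignores the morphotactic ordering constraints a real derivational FST would enforce.

```python
# Hypothetical surface morph inventory, for illustration only.
MORPHS = ["un", "ion", "union", "iz", "able"]


def segmentations(word, morphs):
    """Return every way to split word into a sequence of known morphs."""
    if not word:
        return [[]]  # one way to segment the empty string: no morphs
    results = []
    for morph in morphs:
        if word.startswith(morph):
            for rest in segmentations(word[len(morph):], morphs):
                results.append([morph] + rest)
    return results


for parse in segmentations("unionizable", MORPHS):
    print("-".join(parse))
```

Both analyses are returned; choosing between them is left to a later component such as a part-of-speech tagger.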
(2) Spelling Changes
When morphemes are combined inflectionally, the spelling at the boundaries may change.
• Examples
  • E-insertion: when -s is added to a word, -e is inserted if the word ends in -s, -z, -sh, -ch, -x (e.g., kiss, miss, waltz, bush, watch, rich, box)
  • Y-replacement: when -s or -ed is added to a word ending in -y, the -y changes to -ie or -i respectively (e.g., try, butterfly)
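The two rules can be written directly as suffix tests. The sketch below uses plain regular expressions rather than transducers; the function names `add_s` and `add_ed` are hypothetical, and it adds one assumption the slide does not state: Y-replacement is applied only when the -y follows a consonant (so "boy" would give "boys").

```python
import re


def add_s(stem):
    """Attach -s, applying E-insertion or Y-replacement where needed."""
    if re.search(r"(s|z|sh|ch|x)$", stem):
        return stem + "es"        # E-insertion: box -> boxes
    if re.search(r"[^aeiou]y$", stem):
        return stem[:-1] + "ies"  # Y-replacement: try -> tries
    return stem + "s"


def add_ed(stem):
    """Attach -ed, applying Y-replacement where needed."""
    if re.search(r"[^aeiou]y$", stem):
        return stem[:-1] + "ied"  # Y-replacement: try -> tried
    return stem + "ed"


for noun in ["kiss", "waltz", "bush", "watch", "box", "butterfly", "cat"]:
    print(noun, "->", add_s(noun))
```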
Solution: Multi-Tape Machines
• Add intermediate tape
• Use the output of one tape machine as the input to the next
• Add intermediate symbols
  • ^ morpheme boundary
  • # word boundary
Multi-Level Tape Machines
[Figure: FST-1 and FST-2 in sequence]
• FST-1 translates between the lexical and the intermediate level
• FST-2 handles the spelling changes (due to one rule) to the surface tape
FST-1 for inflectional morphology of plural (Lexical <-> Intermediate)
[Figure: FST-1, with a branch for some regular nouns (arc +PL:^s#) and a branch for some irregular nouns (arcs o:i, ε:s, ε:#, +PL:^)]
Example
• lexical: f o x +N +PL / intermediate: …
• lexical: m o u s e +N +PL / intermediate: …
FST-2 for E-insertion (Intermediate <-> Surface)
E-insertion: when -s is added to a word, -e is inserted if the word ends in -s, -z, -sh, -ch, -x
…as in fox^s# <-> foxes
[Figure: FST-2 for E-insertion, including a #:ε arc]
Examples
• intermediate: f o x ^ s # / surface: …
• intermediate: b o x ^ i n g # / surface: …
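The lexical -> intermediate -> surface cascade can be sketched with two small functions standing in for FST-1 and FST-2. This is a simplification under stated assumptions: the functions use string rewriting rather than real transducers, the names `fst1` and `fst2` are hypothetical, and only regular nouns and the single E-insertion rule are covered.

```python
import re


def fst1(lexical):
    """Lexical -> intermediate for regular nouns, e.g. fox+N+PL -> fox^s#."""
    if lexical.endswith("+N+PL"):
        return lexical[: -len("+N+PL")] + "^s#"
    if lexical.endswith("+N+SG"):
        return lexical[: -len("+N+SG")] + "#"
    raise ValueError("no analysis for " + lexical)


def fst2(intermediate):
    """Intermediate -> surface: apply E-insertion, then drop ^ and #."""
    # Insert e between a stem ending in s, z, sh, ch, x and the suffix ^s#.
    with_e = re.sub(r"(s|z|sh|ch|x)\^s#$", r"\1es#", intermediate)
    return with_e.replace("^", "").replace("#", "")


def generate(lexical):
    """Feed the output of FST-1 into FST-2."""
    return fst2(fst1(lexical))


print(fst1("fox+N+PL"), "->", generate("fox+N+PL"))
print("box^ing#", "->", fst2("box^ing#"))
```

Because the rule is anchored on the `^s#` context, an intermediate form such as box^ing# passes through FST-2 with only its boundary symbols removed.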
Where are we?
Final Scheme: Part 1
Final Scheme: Part 2
Intersection (FST1, FST2)
• States of FST1 and FST2: Q1 and Q2
• States of intersection: Q1 × Q2
• Transitions of FST1 and FST2: δ1, δ2
• Transitions of intersection: δ3
• For all i, j, n, m, a, b: δ3((q1_i, q2_j), a:b) = (q1_n, q2_m) iff
  • δ1(q1_i, a:b) = q1_n AND
  • δ2(q2_j, a:b) = q2_m
[Figure: an a:b arc from q1_i to q1_n and an a:b arc from q2_j to q2_m combine into an a:b arc from (q1_i, q2_j) to (q1_n, q2_m)]
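The construction of δ3 translates almost literally into code. The sketch below assumes deterministic transition tables stored as dictionaries from (state, (a, b)) to a next state; the function name `intersect` and the example tables are hypothetical. Start and final states are not built here; the usual choice is the pair of start states and the pairs of final states.

```python
def intersect(delta1, delta2):
    """Build δ3 over paired states: both machines must take the same a:b arc."""
    delta3 = {}
    for (q1, label1), n1 in delta1.items():
        for (q2, label2), n2 in delta2.items():
            if label1 == label2:
                delta3[((q1, q2), label1)] = (n1, n2)
    return delta3


# Hypothetical example tables.
d1 = {("p0", ("a", "b")): "p1", ("p1", ("c", "c")): "p0"}
d2 = {("r0", ("a", "b")): "r1", ("r1", ("x", "y")): "r0"}

print(intersect(d1, d2))
```

Only the a:b arc survives, because it is the only label on which both machines can move.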
Composition (FST1, FST2)
• States of FST1 and FST2: Q1 and Q2
• States of composition: Q1 × Q2
• Transitions of FST1 and FST2: δ1, δ2
• Transitions of composition: δ3
• For all i, j, n, m, a, b: δ3((q1_i, q2_j), a:b) = (q1_n, q2_m) iff
  • there exists c such that
  • δ1(q1_i, a:c) = q1_n AND
  • δ2(q2_j, c:b) = q2_m
[Figure: an a:c arc from q1_i to q1_n and a c:b arc from q2_j to q2_m combine into an a:b arc from (q1_i, q2_j) to (q1_n, q2_m)]
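Composition differs from intersection only in the matching condition: the output symbol of the first arc must equal the input symbol of the second. The sketch below makes the same assumptions as a dictionary-based transition table, with hypothetical names, and it does not treat ε specially (a full implementation has to let either machine move alone on ε). Targets are collected in sets because different intermediate symbols c can yield the same a:b label.

```python
def compose(delta1, delta2):
    """Build δ3: an a:c arc in FST1 and a c:b arc in FST2 give an a:b arc."""
    delta3 = {}
    for (q1, (a, c)), n1 in delta1.items():
        for (q2, (c_in, b)), n2 in delta2.items():
            if c == c_in:
                delta3.setdefault(((q1, q2), (a, b)), set()).add((n1, n2))
    return delta3


# Hypothetical example tables.
d1 = {("p0", ("a", "x")): "p1"}
d2 = {("r0", ("x", "b")): "r1", ("r0", ("y", "b")): "r2"}

print(compose(d1, d2))
```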
FSTs in Practice
• Install an FST package… (pointers)
• Describe your "formal language" (e.g., lexicon, morphotactics and rules) in a RegExp-like notation (pointer)
• Your specification is compiled into a single FST
Ref: "Finite State Morphology" (Beesley and Karttunen, 2003, CSLI Publications)
• Complexity/Coverage:
  • FSTs for the morphology of a natural language may have 10^5 to 10^7 states and arcs
  • Spanish (1996): 46 × 10^3 stems; 3.4 × 10^6 word forms
  • Arabic (2002?): 131 × 10^3 stems; 7.7 × 10^6 word forms
Other important applications of FST in NLP
From segmenting words into morphemes to…
• Tokenization:
  • finding word boundaries in text (?!) …maxmatch
  • finding sentence boundaries: punctuation… but "." is ambiguous; look at the example in Fig. 3.22
• Shallow syntactic parsing: e.g., find only noun phrases
• Phonological Rules… (Chpt. ?11?)
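The maxmatch idea for finding word boundaries is a greedy longest-match scan. The sketch below is written as a plain loop rather than a transducer; the function name `max_match` and both lexicons are hypothetical, and characters that start no known word are emitted as one-character tokens.

```python
def max_match(text, lexicon):
    """Greedy segmentation: always take the longest lexicon word at the cursor."""
    tokens = []
    i = 0
    while i < len(text):
        for j in range(len(text), i, -1):  # try the longest candidate first
            if text[i:j] in lexicon:
                break
        else:
            j = i + 1  # no known word starts here: emit a single character
        tokens.append(text[i:j])
        i = j
    return tokens


small = {"the", "table", "down", "there"}
print(max_match("thetabledownthere", small))

# A larger lexicon can make the greedy choice go wrong.
larger = small | {"theta", "bled", "own"}
print(max_match("thetabledownthere", larger))
```

The second call shows why the slide marks this task with "(?!)": greedy matching commits to "theta" and never recovers the intended reading.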
Computational tasks in Morphology
• Recognition: recognize whether a string is an English word (FSA)
• Parsing/Generation: word <-> stem, class, lexical features; e.g., bought <-> buy +V +PAST-PART or buy +V +PAST
• Stemming: word -> stem
Stemmer
• E.g., the Porter algorithm, which is based on a series of sets of simple cascaded rewrite rules:
  • (condition) S1 -> S2
  • ATIONAL -> ATE (relational -> relate)
  • (*v*) ING -> ε if the stem contains a vowel (motoring -> motor)
• Cascade of rules applied to: computerization
  • ization -> ize: computerize
  • ize -> ε: computer
• Errors occur:
  • organization -> organ, doing -> doe, university -> universe
Code freely available in most languages: Python, Java, …
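The cascade can be sketched with just the four rules shown on this slide. This is not the Porter stemmer: the real algorithm has many more rules and measure-based conditions, and the stage order, the names `STAGES` and `toy_stem`, and the "first matching rule per stage" policy are assumptions made for illustration.

```python
import re


def contains_vowel(stem):
    """The (*v*) condition: the stem contains a vowel."""
    return re.search(r"[aeiou]", stem) is not None


def always(stem):
    """Condition for rules that apply unconditionally."""
    return True


# Each stage is a list of (suffix, replacement, condition) rules.
STAGES = [
    [("ing", "", contains_vowel)],
    [("ational", "ate", always), ("ization", "ize", always)],
    [("ize", "", always)],
]


def apply_stage(word, rules):
    """Apply the first rule whose suffix matches and whose condition holds."""
    for suffix, replacement, condition in rules:
        if word.endswith(suffix):
            stem = word[: -len(suffix)]
            if condition(stem):
                return stem + replacement
    return word


def toy_stem(word):
    """Run the word through every stage in order (the cascade)."""
    for rules in STAGES:
        word = apply_stage(word, rules)
    return word


for w in ["computerization", "relational", "motoring", "sing"]:
    print(w, "->", toy_stem(w))
```

Feeding each stage's output to the next is the same pattern as composing transducers, one per stage.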
Stemming is mainly used in Information Retrieval
• Run a stemmer on the documents to be indexed
• Run a stemmer on users' queries
• Compute similarity between queries and documents (based on the stems they contain)
Seems to work especially well with smaller documents
Porter as an FST
• The original exposition of the Porter stemmer did not describe it as a transducer but…
  • Each stage is a separate transducer
  • The stages can be composed to get one big transducer
Next Time
• Read handout
  • Probability
  • Stats
  • Information theory
• Next Lecture:
  • Finish Chpt. 3, 3.10-11
  • Start Probabilistic Models for NLP (Chpt. 4, 4.1-4.2 and 5.9!)