1 / 31

CPSC 503 Computational Linguistics

CPSC 503 Computational Linguistics. Lecture 3 Giuseppe Carenini. NLP research at UBC. http://people.cs.ubc.ca/~rjoty/Webpage/. TOPICS Generation and Summarization of Evaluative Text (e.g., customer reviews) Summarization of conversations (emails, blogs, meetings)

vinsonj
Download Presentation

CPSC 503 Computational Linguistics

An Image/Link below is provided (as is) to download presentation Download Policy: Content on the Website is provided to you AS IS for your information and personal use and may not be sold / licensed / shared on other websites without getting consent from its author. Content is provided to you AS IS for your information and personal use only. Download presentation by click this link. While downloading, if for some reason you are not able to download a presentation, the publisher may have deleted the file from their server. During download, if you can't get a presentation, the file might be deleted by the publisher.

E N D

Presentation Transcript


  1. CPSC 503Computational Linguistics Lecture 3 Giuseppe Carenini CPSC503 Winter 2009

  2. NLP research at UBC http://people.cs.ubc.ca/~rjoty/Webpage/ TOPICS • Generation and Summarization of Evaluative Text (e.g., customer reviews) • Summarization of conversations (emails, blogs, meetings) • Subjectivity Detection, Domain Adaptation, Rhetorical Parsing PEOPLE: G. Carenini & R. Ng (Profs), G. Murray (Postdoc) + Students SUPPORT: NSERC, Google, BObjects(now SAP), COLLABORATIONS: MSResearch CPSC503 Winter 2009

  3. Formalisms and associated Algorithms Linguistic Knowledge • State Machines (no prob.) • Finite State Automata (and Regular Expressions) • Finite State Transducers (English) Morphology Syntax Rule systems (and prob. version) (e.g., (Prob.) Context-Free Grammars) Semantics Logical formalisms (First-Order Logics) Pragmatics Discourse and Dialogue AI planners CPSC503 Winter 2009

  4. Computational tasks in Morphology • Recognition: recognize whether a string is an English/… word (FSA) • Parsing/Generation: stem, class, lexical features …. word …. buy +V +PAST-PART bought e.g., buy +V +PAST • Stemming: stem word …. CPSC503 Winter 2009

  5. Today Sept 16 • Finite State Transducers (FSTs) and Morphological Parsing • Stemming (Porter Stemmer) CPSC503 Winter 2009

  6. FST definition (Recap.) • Q: a finiteset of states • I,O: input and an output alphabets (which may include ε) • Σ: a finite alphabet of complex symbols i:o, iI and oO • Q0: the start state • F: a set of accept/final states (FQ) • A transition relation δ that maps QxΣ to 2Q E.g., |Q| =3 ; I={a,b,c, ε} ; O={a,b}; |Σ|=?; 0 <= |δ| <= ? CPSC503 Winter 2009

  7. FST can be used as… • Translators: input one string from I, output another from O (or vice versa) • Recognizers: input a string from IxO • Generator: output a string from IxO Terminology warning! E.g., if I={a,b} ; O={a,b,ε}; …… CPSC503 Winter 2009

  8. FST: inflectional morphology of plural Some regular-nouns Notes: X -> X:X lexical:surface Some irregular-nouns o:i CPSC503 Winter 2009

  9. Examples lexical surface m i c e lexical c a t +N +PL surface CPSC503 Winter 2009

  10. Computational Morphology: Problems/Challenges • Ambiguity: one word can correspond to multiple structures (more critical in morphologically richer languages) • Spelling changes: may occur when two morphemes are combined e.g. butterfly + -s -> butterflies CPSC503 Winter 2009

  11. Ambiguity: more complex example • What’s the right parse for Unionizable? • Union-ize-able • Un-ion-ize-able • Each would represent a valid path through an FST for derivational morphology. • Both Adj…… CPSC503 Winter 2009

  12. Deal with Morphological Ambiguity • Find all the possible outputs (all paths) and return them all (without choosing) Then Part-of-speech tagging to choose…… look at the neighboring words CPSC503 Winter 2009

  13. (2) Spelling Changes When morphemes are combined inflectionally the spelling at the boundaries may change • Examples • E-insertion: when –s is added to a word, -e is inserted if word ends in –s, -z, -sh, -ch, -x (e.g., kiss, miss, waltz, bush, watch, rich, box) • Y-replacement: when –s or -ed are added to a word ending with a –y, -y changes to –ie or –i respectively (e.g., butterfly, try) CPSC503 Winter 2009

  14. Solution: Multi-Tape Machines • Add intermediate tape • Use the output of one tape machine as the input to the next • Add intermediate symbols • ^ morpheme boundary • # word boundary CPSC503 Winter 2009

  15. Multi-Level Tape Machines FST-1 FST-2 • FST-1 translates between the lexical and the intermediate level • FTS-2 handles the spelling changes (due to one rule) to the surface tape CPSC503 Winter 2009

  16. FST-1 for inflectional morphology of plural (Lexical <-> Intermediate ) Some regular-nouns +PL:^s# # # # Some irregular-nouns o:i ε:s ε:# +PL:^ CPSC503 Winter 2009

  17. Example lexical f o x +N +PL intemediate lexical m o u s e +N +PL intemediate CPSC503 Winter 2009

  18. FST-2 for E-insertion(Intermediate <-> Surface) E-insertion: when –s is added to a word, -e is inserted if word ends in –s, -z, -sh, -ch, -x …as in fox^s# <-> foxes #: ε CPSC503 Winter 2009

  19. Examples intermediate f o x ^ s # surface intermediate b o x ^ i n g # surface CPSC503 Winter 2009

  20. Where are we? # CPSC503 Winter 2009

  21. Final Scheme: Part 1 CPSC503 Winter 2009

  22. Final Scheme: Part 2 CPSC503 Winter 2009

  23. Intersection (FST1, FST2) = FST3 • States of FST1 and FST2 : Q1and Q2 • States of intersection: (Q1x Q2) • Transitions of FST1 and FST2 :δ1, δ2 • Transitions of intersection : δ3 • For all i,j,n,m,a,bδ3((q1i,q2j), a:b) = (q1n,q2m) iff • δ1(q1i, a:b) = q1n AND • δ2(q2j, a:b) = q2m a:b q1i q1n a:b a:b a:b (q1i,q2j) (q1n,q2m)  q2j q2m CPSC503 Winter 2009

  24. Composition(FST1, FST2) = FST3 • States of FST1 and FST2 : Q1 and Q2 • States of composition : Q1x Q2 • Transitions of FST1 and FST2 :δ1, δ2 • Transitions of composition : δ3 • For all i,j,n,m,a,bδ3((q1i,q2j), a:b) = (q1n,q2m) iff • There exists c such that • δ1(q1i, a:c) = q1n AND • δ2(q2j, c:b) = q2m a:c q1i q1n c:b a:b a:b  q2j q2m (q1i,q2j) (q1n,q2m) CPSC503 Winter 2009

  25. FSTs in Practice • Install an FST package…… (pointers) • Describe your “formal language” (e.g, lexicon, morphotactic and rules) in a RegExp-like notation (pointer) • Your specification is compiled in a single FST Ref: “Finite State Morphology” (Beesley and Karttunen, 2003, CSLI Publications) • Complexity/Coverage: • FSTs for the morphology of a natural language may have 105 – 107 states and arcs • Spanish (1996) 46x103 stems; 3.4 x 106word forms • Arabic (2002?) 131x103 stems; 7.7 x 106word forms CPSC503 Winter 2009

  26. Other important applications of FST in NLP From segmenting words into morphemes to… • Tokenization: • finding word boundaries in text (?!) …maxmatch • Finding sentence boundaries: punctuation… but . is ambiguous look at example in Fig. 3.22 • Shallow syntactic parsing: e.g., find only noun phrases • Phonological Rules…… (Chpt. 11) CPSC503 Winter 2009

  27. Computational tasks in Morphology • Recognition: recognize whether a string is an English word (FSA) • Parsing/Generation: stem, class, lexical features …. word …. buy +V +PAST-PART bought e.g., buy +V +PAST • Stemming: stem word …. CPSC503 Winter 2009

  28. Stemmer • E.g. the Porter algorithm, which is based on a series of sets of simple cascaded rewrite rules: • (condition) S1->S2 • ATIONAL  ATE (relational  relate) • (*v*) ING   if stem contains vowel (motoring  motor) • Cascade of rules applied to: computerization • ization -> -ize computerize • ize -> εcomputer • Errors occur: • organization  organ, university  universe Code freely available in most languages: Python, Java,… CPSC503 Winter 2009

  29. Stemming mainly used in Information Retrieval • Run a stemmer on the documents to be indexed • Run a stemmer on users queries • Compute similarity between queries and documents (based on stems they contain) Seems to work especially well with smaller documents CPSC503 Winter 2009

  30. Porter as an FST • The original exposition of the Porter stemmer did not describe it as a transducer but… • Each stage is a separate transducer • The stages can be composed to get one big transducer CPSC503 Winter 2009

  31. Next Time • Read handout • Probability • Stats • Information theory • Next Lecture: • finish Chpt 3, 3.10-11 • Start Probabilistic Models for NLP (Chpt. 4, 4.1 – 4.2 and 5.9!) CPSC503 Winter 2009

More Related