CSA2050 Introduction to Computational Linguistics Lecture 3 Examples
Course Contents CSA2050 - Lecture III: Examples
Outline
• Examples in the areas of
  • Tokenisation
  • Morphological Analysis
  • Tagging
  • Syntactic Analysis
Information Extraction
[Pipeline diagram; labels: raw text, tokenisation, tagged text, morphological analysis, syntactic analysis, named entity recognition]
Tokenisation
• The basic idea of tokenisation is to identify the basic tokens that are present in a text.
• Mostly, tokens are the same as words, but not always.
• Why should this be a problem?
  John’s car cost €10,000.00. “And it’s worth every penny”, he exclaimed.
Tokenisation Problems: Punctuation
• novel forms: .net, Micro$oft, :-)
• hyphenation:
  • linebreaks vs word-internal: e-mail, 898-0587
  • multi-word: the 90-cent-an-hour raise
  • confusion with dash
• apostrophes in contractions: we'll
• periods
  • part of names: Amazon.com
  • numerical expressions: $1.99
  • abbreviations, end of sentence, haplology
• commas: 1,000,000
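Several of the punctuation cases above can be handled with an ordered regular expression. The following is a minimal sketch, not the Xerox tokeniser: the pattern, its ordering, and the name `tokenise` are assumptions made for illustration, and it deliberately ignores abbreviations, line-break hyphenation, and emoticons.

```python
import re

# Alternatives are tried left to right, so the more specific token shapes
# must come before the generic "ordinary word" fallback.
TOKEN_PATTERN = re.compile(r"""
    \w+(?:[-.]\w+)+            # internal hyphens or periods: e-mail, Amazon.com, 898-0587
  | [$€£]?\d+(?:[.,]\d+)*      # numbers and prices: 1,000,000 and $1.99
  | \w+(?=['’]\w)              # the host word of a clitic: "we" in we'll
  | ['’]\w+                    # the clitic itself: 'll, ’s
  | \w+                        # ordinary words
  | \S                         # any other single non-space character
""", re.VERBOSE)

def tokenise(text):
    """Split text into tokens using the ordered pattern above."""
    return TOKEN_PATTERN.findall(text)

print(tokenise("we'll pay $1.99 at Amazon.com."))
print(tokenise("John’s car cost €10,000.00."))
```

Note that the sentence-final period is split off from Amazon.com only because no word character follows it; a missing space after a full stop would defeat this sketch, which is one reason real tokenisers need abbreviation lists and sentence boundary detection.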
Other Problems
• Token-internal whitespace: 898 0464
• Interaction: the New York-New Haven railroad
• Mixed language tokens : u
• Automated language guesser
• Token equivalence (when are two tokens the same?)
• Case-normalization.
• Sentence boundary detection.
• Inconsistency: database, data-base, data base
• Demo: Xerox tokeniser
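One simple way to approach token equivalence is to map every token to a comparison key. The sketch below assumes that case, hyphens, and internal whitespace are all irrelevant; the function names `normalise` and `equivalent` are hypothetical. This assumption is too strong in general (it would also merge "re-cover" with "recover"), so it illustrates the problem more than it solves it.

```python
import re

def normalise(token):
    """Map spelling variants of a token to a single comparison key."""
    # casefold() handles case-normalisation; the regex drops hyphens and whitespace.
    return re.sub(r"[-\s]+", "", token.casefold())

def equivalent(a, b):
    """Two tokens count as the same if their keys match."""
    return normalise(a) == normalise(b)

print(equivalent("database", "Data-base"))   # compared via the key "database"
print(equivalent("data base", "data-base"))
print(equivalent("898 0464", "898-0464"))
```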
Morphology
• Simple versus complex words: dog, dogs
• Complex words are formed by concatenation of morphemes.
• Morpheme: the smallest unit in a word that bears some meaning, such as dog and s.
Morphological Analysis
• Morphological analysis of a word involves a segmentation problem
• Segmentation: discovery of the component morphemes
  dogs → dog + s
  enlargement → en + large + ment
• Possible ambiguities:
  enlargement → enlarge + ment
  enlargement → en + largement
• Role of lexicon
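The role of the lexicon in segmentation can be sketched as follows. This is a toy illustration under strong assumptions: morphemes are plain substrings, there are no spelling changes, and the small `LEXICON` set is invented for the example. The function enumerates every way of splitting a word into listed pieces, so a split such as en + largement is rejected simply because "largement" is not listed.

```python
def segmentations(word, morphemes):
    """Return every way of splitting `word` into pieces found in `morphemes`."""
    if not word:
        return [[]]          # one way to segment the empty string: no pieces
    results = []
    for i in range(1, len(word) + 1):
        head = word[:i]
        if head in morphemes:
            # Keep this head and segment the remainder recursively.
            for rest in segmentations(word[i:], morphemes):
                results.append([head] + rest)
    return results

# Hypothetical lexicon for illustration only.
LEXICON = {"dog", "s", "en", "large", "enlarge", "ment"}

print(segmentations("dogs", LEXICON))
print(segmentations("enlargement", LEXICON))
```

With this lexicon "enlargement" still receives two segmentations, which shows that a lexicon narrows the search but does not remove ambiguity by itself.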
Morphological Analysis
John has a couple of rabbits
• rabbits → rabbit + s
• s indicates plural of noun rabbit
• Is this the only possibility?
Morphological Analysis
John rabbits on and on
• rabbits → rabbit + s
• s indicates 3rd person singular of verb rabbit
• The suffix “s” is a realisation of two entirely different morphemes.
• The morpheme is something more abstract than the string which realises it.
Morphological Analysis
[Diagram; labels: suffix world (-s, -a), morpheme world (+3S, +PL)]
Morphological Analysis
Input word: rabbits → Morphological Parser → Output analysis: rabbit N PL, rabbit V 3S
• Output is a string of morphemes
• Morpheme is employed in a loose sense that is useful for further processing
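The input/output behaviour of such a parser can be imitated with two small tables. This is only a sketch: `STEMS`, `SUFFIXES`, and `analyse` are hypothetical names, the entries are invented for the example, and real analysers such as ENGTWOL are built very differently.

```python
# Hypothetical stem lexicon: each stem lists the categories it can belong to.
STEMS = {"rabbit": {"N", "V"}, "dog": {"N"}, "leave": {"V"}}

# One suffix string can realise several morphemes.
SUFFIXES = {"s": [("N", "PL"), ("V", "3S")]}

def analyse(word):
    """Return (stem, category, feature) triples for every reading of `word`."""
    analyses = []
    for suffix, readings in SUFFIXES.items():
        if word.endswith(suffix):
            stem = word[: -len(suffix)]
            for category, feature in readings:
                # A reading survives only if the lexicon licenses the stem
                # in that category.
                if category in STEMS.get(stem, ()):
                    analyses.append((stem, category, feature))
    return analyses

print(analyse("rabbits"))
print(analyse("dogs"))
```

The word "rabbits" gets both the noun plural and the verb 3rd person singular reading, while "dogs" gets only the noun reading because the toy lexicon lists "dog" as a noun only.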
Morphological Analysis: ENGTWOL & Xerox
• Atro Voutilainen, Juha Heikkilä, Timo Järvinen and Lingsoft, Inc. 1993-1995
• ENGTWOL demo
• Xerox morphological analysis
Morphological Synthesis
Input: rabbit N PL, rabbit V 3S → Morphological Parser → Output word: rabbits
• Input is a string of morphemes
• Output is a word
Reversibility
Lookup
APPLY UP> left
left leave+Verb+PastBoth+123SP
left left+Adv
left left+Adj
left left+Noun+Sg
Lookdown
APPLY DOWN> leave+Adj
left
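Reversibility means that one and the same relation between surface forms and analyses serves both analysis and synthesis. The sketch below stores that relation as a plain list of pairs taken from the lookup output above; the function names `apply_up` and `apply_down` merely echo the prompts and are not the Xerox tools, which use finite-state transducers rather than a table.

```python
# Each pair relates a surface form to one lexical analysis.
RELATION = [
    ("left", "leave+Verb+PastBoth+123SP"),
    ("left", "left+Adv"),
    ("left", "left+Adj"),
    ("left", "left+Noun+Sg"),
]

def apply_up(surface):
    """Analysis: surface form -> all lexical analyses."""
    return [lexical for s, lexical in RELATION if s == surface]

def apply_down(lexical):
    """Synthesis: lexical analysis -> all surface forms."""
    return [s for s, l in RELATION if l == lexical]

print(apply_up("left"))
print(apply_down("left+Noun+Sg"))
```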
POS Tagging
• In POS tagging, the task is to assign the most appropriate morphosyntactic label from amongst those listed in the lexicon, given the context.
• John leaves presents.
• Proper Names
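For "John leaves presents", the lexicon offers two labels each for "leaves" and "presents", so context has to decide. The sketch below enumerates every candidate tag sequence and scores it with tag-to-tag preferences. The tag names, the lexicon entries, and above all the numbers in `TRANSITIONS` are hand-picked for illustration and are not estimated from any corpus.

```python
from itertools import product

# Labels the (hypothetical) lexicon lists for each word.
LEXICON = {
    "John": ["PropN"],
    "leaves": ["N", "V"],
    "presents": ["N", "V"],
}

# Invented preference scores for one tag following another.
TRANSITIONS = {
    ("START", "PropN"): 1.0,
    ("PropN", "V"): 0.8, ("PropN", "N"): 0.2,
    ("V", "N"): 0.7, ("V", "V"): 0.1,
    ("N", "V"): 0.5, ("N", "N"): 0.3,
}

def score(tags):
    """Multiply the preference scores along the tag sequence."""
    total = 1.0
    previous = "START"
    for tag in tags:
        total *= TRANSITIONS.get((previous, tag), 0.0)
        previous = tag
    return total

def tag(words):
    """Pick the highest-scoring sequence among all lexicon-licensed ones."""
    candidates = product(*(LEXICON[w] for w in words))
    return max(candidates, key=score)

print(tag(["John", "leaves", "presents"]))
```

Exhaustive enumeration grows exponentially with sentence length; it is used here only because it makes the idea of "most appropriate label given the context" easy to see.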
Semantic Tagging
• Named Entity Recognition
• Basic idea is to recognise and tag named entities and classify them as being of type
  • Persons
  • Locations
  • Organisations
• Named Entity Recognition - Demo
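The simplest form of named entity recognition is lookup in a list of known names (a gazetteer), preferring the longest match. The sketch below assumes the input is already tokenised; the `GAZETTEER` entries and the function name `tag_entities` are invented for the example, and a pure lookup cannot recognise names it has never seen.

```python
# Hypothetical gazetteer: token tuples mapped to entity types.
GAZETTEER = {
    ("John",): "PERSON",
    ("New", "York"): "LOCATION",
    ("New", "Haven"): "LOCATION",
    ("Xerox",): "ORGANISATION",
}
MAX_LEN = max(len(key) for key in GAZETTEER)

def tag_entities(tokens):
    """Return (entity text, type) pairs using longest-match lookup."""
    entities = []
    i = 0
    while i < len(tokens):
        # Try the longest possible span first, then shorter ones.
        for n in range(min(MAX_LEN, len(tokens) - i), 0, -1):
            span = tuple(tokens[i:i + n])
            if span in GAZETTEER:
                entities.append((" ".join(span), GAZETTEER[span]))
                i += n
                break
        else:
            i += 1           # no entity starts here; move on
    return entities

print(tag_entities("John moved from New York to Xerox".split()))
```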
Syntactic Analysis
• Problem: given a sentence and a grammar/lexicon, discover the assigned tree structure.
• XIP Parser Demo
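The parsing problem can be made concrete with a very small grammar. The sketch below is not how the XIP parser works: it is a naive exhaustive parser for binary rules, and the two rules and the lexicon are assumptions chosen to fit the sentence "John leaves presents" (for simplicity, bare nouns are listed directly as NP).

```python
# Hypothetical grammar with binary rules only.
GRAMMAR = {"S": [("NP", "VP")], "VP": [("V", "NP")]}

# Hypothetical lexicon: categories each word can have.
LEXICON = {"John": {"NP"}, "leaves": {"V", "NP"}, "presents": {"V", "NP"}}

def parse(symbol, words):
    """Return all trees rooted in `symbol` that span exactly `words`."""
    trees = []
    # Lexical case: a single word that the lexicon lists under this symbol.
    if len(words) == 1 and symbol in LEXICON.get(words[0], ()):
        trees.append((symbol, words[0]))
    # Rule case: try every split point for every binary rule.
    for left, right in GRAMMAR.get(symbol, []):
        for split in range(1, len(words)):
            for left_tree in parse(left, words[:split]):
                for right_tree in parse(right, words[split:]):
                    trees.append((symbol, left_tree, right_tree))
    return trees

for tree in parse("S", ["John", "leaves", "presents"]):
    print(tree)
```

Although "leaves" and "presents" are each ambiguous in the lexicon, the grammar admits only one tree for the whole sentence, which shows how syntactic analysis can also resolve tagging ambiguity.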