CLINT Tokenisation Introduction to Computational Linguistics
Information Food Chain
• Inference
• Knowledge Representation
• Meaning Extraction
• Semantic Relationships
• Chunking (noun phrases; verb phrases)
• Part of Speech Annotation
• Paragraph and sentence identification
• Tokenisation
• Raw Text
Start with a Corpus
• A corpus is an organised body of language material that is used as a basis for empirical studies.
• Corpora are classified according to
  • Representativeness
  • Medium
  • Language
  • Information Content
  • Structure
Examples of Corpora
• Project Gutenberg: public domain text resources. http://www.promo.net/pg
• Brown Corpus: a tagged corpus of about 1M words, put together at Brown University, 1960-70
• Penn Treebank: a corpus of parsed sentences based on text from the Wall Street Journal (WSJ)
• Canadian Hansards: a bilingual (English-French) corpus of the proceedings of the Canadian parliament
Low Level Issues
• Preprocessing: getting rid of junk such as whitespace, images, certain formatting information etc.
• Normalisation: deciding on standard character representations; adopting upper or lower case (or both)
• Tokenisation
Tokenisation
• Tokenisation is a process which divides input text into individual units called tokens.
• Tokens are normally taken to be indivisible by the next level of analysis, but they can be associated with various kinds of information.
• An example of such information is the type of the token: word, punctuation, number
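As a rough illustration of tokens that carry a type, the sketch below pairs each token with one of the three types named on the slide. The pattern names and the function `tokenise` are illustrative assumptions, not part of the course material, and the patterns are deliberately simplistic.

```python
import re
from typing import List, Tuple

# Hypothetical token types, following the slide's examples:
# word, number, punctuation.
TOKEN_PATTERN = re.compile(
    r"(?P<number>\d+(?:\.\d+)?)"   # integers and simple decimals
    r"|(?P<word>[^\W\d_]+)"        # runs of letters
    r"|(?P<punctuation>[^\w\s])"   # one non-space, non-alphanumeric character
)

def tokenise(text: str) -> List[Tuple[str, str]]:
    """Return (token, type) pairs; whitespace and underscores are skipped."""
    return [(m.group(), m.lastgroup) for m in TOKEN_PATTERN.finditer(text)]

print(tokenise("The cat ate 2 fish."))
```

Each pair keeps the token string intact for the next level of analysis, with the type attached as extra information.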
What counts as a word?
• Words are quite tricky to define
• The standard definition: a string of contiguous alphanumeric characters with space on either side; may include hyphens and apostrophes but no other punctuation marks (Kucera and Francis 1967)
• It is easy to find exceptions.
Problems Identifying Words
VfB Stuttgart scored twice in quick success-ion early in the second half on their way to a deserved 2-1 victory over Manchester United in the Champions League on Wednesday.
(example from Mary Dalrymple, University of London)
• VfB Stuttgart, Manchester United
• succession
• 2-1
• Wednesday
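The classic definition can be tried directly on the example sentence. The sketch below is one possible reading of that definition, assuming ASCII text only; the names `WORD` and `classify` are made up for illustration.

```python
import re

# One reading of the classic definition: whitespace-delimited chunks made
# only of alphanumerics, hyphens and apostrophes (ASCII only, an assumption).
WORD = re.compile(r"[A-Za-z0-9'-]+")

def classify(text):
    """Split on whitespace and separate chunks that fit the definition."""
    words, rejected = [], []
    for chunk in text.split():
        (words if WORD.fullmatch(chunk) else rejected).append(chunk)
    return words, rejected

sentence = ("VfB Stuttgart scored twice in quick success-ion early in the "
            "second half on their way to a deserved 2-1 victory over "
            "Manchester United in the Champions League on Wednesday.")
words, rejected = classify(sentence)
print(rejected)  # ['Wednesday.']
```

The definition accepts `success-ion` and `2-1` as single words, rejects `Wednesday.` because of the attached period, and has no way of treating `VfB Stuttgart` or `Manchester United` as single units.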
Problems Identifying Words: Problems Involving Spaces
• Lack of spaces between words
  Lebensversicherungsgesellschaftsangestellter (life insurance company employee)
  Ix-Xemx
• The presence of spaces may not indicate a word break
  Coca Cola; +356 21 456 457
Problems Involving Special Characters
• Words often include non-alphanumeric characters which are actually part of the word.
  $22.50; www.di-ve.com.mt; BSc. IT; :-)
• Words are often terminated by punctuation which is not part of the word.
• Sometimes, terminating punctuation is part of the word.
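One common way of coping with such cases is an ordered list of patterns in which the special cases are tried before the general word pattern. The sketch below is a minimal example under that assumption; the patterns are rough, hand-picked for the examples on the slide, and the name `SPECIAL` is illustrative.

```python
import re

# Special cases come first so that they win over the general word pattern.
SPECIAL = re.compile(r"""
    (?:www\.|https?://)[^\s,;]+      # web addresses (rough)
  | [$€£]\d+(?:\.\d+)?               # currency amounts such as $22.50
  | [:;]-?[()DP]                     # a few common emoticons
  | \w+(?:[-']\w+)*                  # words with internal hyphens or apostrophes
  | [^\w\s]                          # any other single punctuation character
""", re.VERBOSE)

def tokenise(text):
    """Return the tokens found by the ordered patterns above."""
    return SPECIAL.findall(text)

print(tokenise("It costs $22.50 at www.di-ve.com.mt :-)"))
```

A known weakness of this sketch is that a web address followed directly by a sentence-final period would swallow the period, which is an instance of the terminating-punctuation problem listed above.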
Periods
• In general, punctuation marks attach to words, and can be removed. However, there are special cases:
  • Most periods mark the end of a sentence
  • Others mark abbreviations, e.g. "e.g.", "Wash."
• Note that when an abbreviation occurs at the end of a sentence there is only one period.
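The period cases can be sketched with a small abbreviation list. This is only an illustration: the list `ABBREVIATIONS` is a hypothetical stand-in for a real abbreviation lexicon, and other punctuation is ignored.

```python
# Hypothetical abbreviation list; a real tokeniser would need a much
# larger lexicon or a statistical model.
ABBREVIATIONS = {"e.g.", "i.e.", "etc.", "Wash.", "Dr."}

def split_periods(text):
    """Detach a final period from a chunk unless the chunk is a known abbreviation."""
    tokens = []
    for chunk in text.split():
        if len(chunk) > 1 and chunk.endswith(".") and chunk not in ABBREVIATIONS:
            tokens.extend([chunk[:-1], "."])
        else:
            tokens.append(chunk)
    return tokens

print(split_periods("Most periods end a sentence."))
print(split_periods("He lives in Wash."))
```

In the second call the single period stays attached to `Wash.`, so no separate sentence-final token is produced. That is exactly the ambiguity the last bullet points out, and this sketch does not resolve it.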
Apostrophe
• English contractions such as won't or I'll count as one word according to the classic definition
• However, there are reasons for wanting two separate tokens, such as interaction with grammar rules (S → NP VP)
• Penn Treebank splits such contractions into two words.
Apostrophe
• This sometimes leaves odd words
  For example isn't yields is + n't
• 's is ambiguous
  • Abbreviation for is (he's strange)
  • Possessive (John's car)
• Word-final apostrophe is ambiguous
  • end of quotation
  • possessive of word ending in s
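A simplified sketch of contraction splitting in the spirit of the Penn Treebank convention is given below. It is not the Treebank's own tokeniser: the clitic list is an assumption, only the straight apostrophe is handled, and the ambiguity of 's is left unresolved.

```python
import re

# A small, assumed list of English clitics; the stem must be non-empty.
CLITIC = re.compile(r"^(.+?)(n't|'s|'ll|'re|'ve|'d|'m)$", re.IGNORECASE)

def split_contraction(word):
    """Split a word into [stem, clitic] if it ends in a known clitic."""
    m = CLITIC.match(word)
    return [m.group(1), m.group(2)] if m else [word]

for w in ["isn't", "won't", "I'll", "he's", "John's", "cat"]:
    print(w, split_contraction(w))
```

Note that `won't` comes out as `wo` + `n't`, one of the odd words mentioned above, and that `he's` and `John's` receive the same split even though the 's means different things.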
Exercise
• How is the apostrophe used in Maltese?
• How should a Maltese tokeniser deal with it?
Hyphen
• Issue: do sequences of words joined by hyphens count as one word or more?
• Typesetting hyphens (at end of line) and hyphens in measure phrases (35-year-old) are usually removed.
• Typesetting hyphens can be ambiguous
• Lexical hyphens are usually kept
  hi-fi
• Hyphens – standing alone – are used as punctuation.
• Texts are often inconsistent in usage of hyphens
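The treatment of typesetting hyphens can be sketched as follows. The set `LEXICAL_HYPHENS` is a hypothetical list of words whose hyphen should be kept; everything else that is split across a line break is simply joined, which is a simplifying assumption.

```python
import re

# Hypothetical list of words whose hyphen is lexical and must be kept.
LEXICAL_HYPHENS = {"hi-fi"}

def join_line_break_hyphens(text, lexical=LEXICAL_HYPHENS):
    """Rejoin words split by a hyphen at the end of a line."""
    def repl(m):
        left, right = m.group(1), m.group(2)
        hyphenated = f"{left}-{right}"
        # Keep the hyphen only when the hyphenated form is a known word.
        return hyphenated if hyphenated.lower() in lexical else left + right
    return re.sub(r"(\w+)-\s*\n\s*(\w+)", repl, text)

print(join_line_break_hyphens("in quick success-\nion early"))
print(join_line_break_hyphens("a new hi-\nfi system"))
```

The ambiguity noted above shows up as soon as a word is missing from the list: a lexical hyphen that happens to fall at a line end is then removed by mistake.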
Case
• Types vs. Tokens
• How many tokens are there in the following sentence?
  The cat chased the rat on the table
• How many types?
• Tokenisation should correctly identify word types, i.e.
  • Tokens of the same type should be identified with one another
  • Tokens of different types should be distinguished
• Case representation of ordinary words must be standardised.
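The counts for the example sentence can be checked with a few lines of code. The sketch assumes whitespace tokenisation; the function name `count_tokens_and_types` and the `fold_case` flag are illustrative.

```python
def count_tokens_and_types(sentence, fold_case=False):
    """Return (number of tokens, number of distinct types)."""
    tokens = sentence.split()
    if fold_case:
        tokens = [t.lower() for t in tokens]
    return len(tokens), len(set(tokens))

sentence = "The cat chased the rat on the table"
print(count_tokens_and_types(sentence))                  # (8, 7)
print(count_tokens_and_types(sentence, fold_case=True))  # (8, 6)
```

Without case folding, `The` and `the` count as different types, which is why the standardisation of case affects the type count.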
Case
• Heuristics
  • Map first character of a sentence to standard case
  • Map all words in titles to lowercase
• Problems
  • Identification of sentence boundaries
  • Identification of proper names
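The first heuristic, together with the proper-name problem, might look like this in code. The set `PROPER_NAMES` is a hypothetical stand-in for a real name list, and the input is assumed to be one sentence that has already been tokenised.

```python
# Hypothetical list of known proper names.
PROPER_NAMES = {"Manchester", "Stuttgart", "Wednesday"}

def fold_sentence_initial(tokens, proper_names=PROPER_NAMES):
    """Lowercase the first token of a sentence unless it is a known proper name."""
    if not tokens or tokens[0] in proper_names:
        return list(tokens)
    return [tokens[0].lower()] + list(tokens[1:])

print(fold_sentence_initial(["The", "cat", "sat"]))
print(fold_sentence_initial(["Manchester", "United", "won"]))
```

The heuristic is only as good as the sentence splitter and the name list behind it: a sentence-initial name that is missing from the list would be lowercased by mistake.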
Normalisation
• Character representations.
• Converting all letters to lower or upper case
• Removing punctuation
• Removing accent marks and other diacritics from letters
• Expanding abbreviations
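Three of these steps (case, punctuation, diacritics) can be sketched with the Python standard library. The function name `normalise` is illustrative, only ASCII punctuation is removed, and abbreviation expansion is left out.

```python
import string
import unicodedata

def normalise(text):
    """Strip combining diacritics, lowercase, and remove ASCII punctuation."""
    # NFD splits an accented letter into base letter plus combining mark.
    decomposed = unicodedata.normalize("NFD", text)
    no_marks = "".join(c for c in decomposed if not unicodedata.combining(c))
    lowered = no_marks.lower()
    return lowered.translate(str.maketrans("", "", string.punctuation))

print(normalise("Café Müller, naïve!"))  # cafe muller naive
```

Each step throws information away, so whether it is appropriate depends on the language and the task: stripping diacritics, for instance, can merge words that are distinct in the original text.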
Further Normalisation
• Stemming: are eats and eating different words?
• They are two different word forms that have the same stem, eat, but different suffixes, -s and -ing
• Stemming versus full morphological analysis.
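A toy suffix stripper shows the idea for the two suffixes on the slide. It is a deliberately naive sketch, not an established stemming algorithm, and the minimum-length check is an arbitrary assumption.

```python
def naive_stem(word):
    """Strip -ing or -s when enough of the word remains (toy rule)."""
    for suffix in ("ing", "s"):
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            return word[: -len(suffix)]
    return word

for w in ["eats", "eating", "eat", "sing", "running"]:
    print(w, "->", naive_stem(w))
```

Here `eats` and `eating` both map to `eat`, but `running` becomes `runn`, which illustrates why crude stemming is contrasted with full morphological analysis.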
Summary
• The tokenisation problem interacts with design decisions at different levels concerning
  • Handling of non-alphanumeric characters
  • Case
  • Punctuation
• Typically many of these problems are dealt with by hand-crafting special rules which match a particular case.
• Such rules are often built out of regular expressions.
Sources
• Manning and Schütze, Foundations of Statistical Natural Language Processing, MIT Press, 1999