CLINT

CLINT Tokenisation Introduction to Computational Linguistics

Information Food Chain
• Inference
• Knowledge Representation
• Meaning Extraction
• Semantic Relationships
• Chunking (noun phrases; verb phrases)
• Part of Speech Annotation
• Paragraph and sentence identification
• Tokenisation
• Raw Text

Start with a Corpus
• A corpus is an organised body of language material that is used as a basis for empirical studies.
• Corpora are classified according to
  • Representativeness
  • Medium
  • Language
  • Information Content
  • Structure

Examples of Corpora
• Project Gutenberg: public domain text resources. http://www.promo.net/pg
• Brown Corpus: a tagged corpus of about 1M words put together at Brown University, 1960-70
• Penn Treebank: a corpus of parsed sentences based on text from the WSJ
• Canadian Hansards: a bilingual (English-French) corpus of Canadian parliamentary proceedings.

Low Level Issues
• Preprocessing: getting rid of junk such as whitespace, images, certain formatting information etc.
• Normalisation: deciding on standard character representations; adopting upper or lower case (or both)
• Tokenisation

Tokenisation
• Tokenisation is a process which divides input text into individual units called tokens.
• Tokens are normally taken to be indivisible by the next level of analysis, but they can be associated with various kinds of information.
• An example of such information is the type of the token: word, punctuation, number
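
As a rough illustration of tokens carrying a type, the sketch below labels each token as word, number or punctuation. The pattern and the function name `tokenise` are assumptions made for this example, not a standard scheme, and the rules are deliberately crude.

```python
import re

# Hypothetical three-way token typing: number, word, punctuation.
TOKEN_RE = re.compile(r"""
    (?P<number>\d+(?:\.\d+)?)      # integers and simple decimals
  | (?P<word>[^\W\d_]+)            # runs of letters
  | (?P<punctuation>[^\w\s])       # any single non-space symbol
""", re.VERBOSE)

def tokenise(text):
    """Return a list of (token, type) pairs."""
    return [(m.group(), m.lastgroup) for m in TOKEN_RE.finditer(text)]

print(tokenise("The cat ate 2 fish."))
```

The type is simply the name of the alternative that matched, so adding a new token type means adding one more named group.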

What counts as a word?
• Words are quite tricky to define
• The standard definition: a string of contiguous alphanumeric characters with space on either side; may include hyphens and apostrophes but no other punctuation marks (Kucera and Francis 1967)
• It is easy to find exceptions.
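
One possible reading of this definition can be written as a regular expression. The sketch assumes ASCII letters and digits and allows hyphens and apostrophes only between alphanumeric runs; both are simplifying assumptions, and `is_word` is a name invented for the example.

```python
import re

# Alphanumeric runs, optionally joined by single hyphens or apostrophes.
WORD_RE = re.compile(r"[A-Za-z0-9]+(?:[-'][A-Za-z0-9]+)*")

def is_word(chunk):
    """True if a whitespace-delimited chunk fits the definition."""
    return WORD_RE.fullmatch(chunk) is not None

for chunk in "He won't pay $22.50 for a hi-fi .".split():
    print(chunk, is_word(chunk))
```

Chunks such as `$22.50` fail the test even though most readers would treat them as single words, which is the kind of exception the definition runs into.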

Problems Identifying Words
VfB Stuttgart scored twice in quick success-ion early in the second half on their way to a deserved 2-1 victory over Manchester United in the Champions League on Wednesday. (example from Mary Dalrymple, University of London)
• VfB Stuttgart, Manchester United
• succession
• 2-1
• Wednesday

Problems Identifying Words: Problems Involving Spaces
• Lack of spaces between words
  Lebensversicherungsgesellschaftsangestellter (life insurance company employee)
  Ix-Xemx
• The presence of spaces may not indicate a word break
  Coca Cola; +356 21 456 457
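
For the second problem, one simple approach is to split on whitespace and then rejoin chunks that a lexicon lists as a single unit. The sketch below handles two-word units only; the `MULTIWORD` set is a hypothetical stand-in for a real lexicon.

```python
# Hypothetical lexicon of two-word units; a real list would be far larger.
MULTIWORD = {("Coca", "Cola"), ("Manchester", "United")}

def merge_multiword(chunks):
    """Greedily join adjacent chunks that form a known two-word unit."""
    merged, i = [], 0
    while i < len(chunks):
        pair = tuple(chunks[i:i + 2])
        if pair in MULTIWORD:
            merged.append(" ".join(pair))
            i += 2
        else:
            merged.append(chunks[i])
            i += 1
    return merged

print(merge_multiword("I drink Coca Cola daily".split()))
```

Languages that write compounds without spaces need the opposite operation, splitting a chunk into its parts, which this sketch does not attempt.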

Problems Involving Special Characters
• Words often include non-alphanumeric characters which are actually part of the word.
  $22.50; www.di-ve.com.mt; BSc. IT; :-)
• Words are often terminated by punctuation which is not part of the word.
• Sometimes, terminating punctuation is part of the word.
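
A common way to keep such items whole is to try specific patterns before the general word pattern. The sketch below covers only the price, web address and emoticon examples; the patterns are illustrative assumptions and would need extending for real text.

```python
import re

# Specific patterns come first so they win over the general word pattern.
SPECIAL_RE = re.compile(r"""
    \$\d+(?:\.\d\d)?              # prices such as $22.50
  | www\.[\w-]+(?:\.[\w-]+)+      # bare web addresses
  | :-[()]                        # two simple emoticons
  | \w+                           # ordinary words
  | [^\w\s]                       # any other symbol on its own
""", re.VERBOSE)

def tokenise_special(text):
    """Return tokens, keeping prices, web addresses and emoticons intact."""
    return SPECIAL_RE.findall(text)

print(tokenise_special("It costs $22.50 at www.di-ve.com.mt :-)"))
```

Because alternatives are tried left to right, the order of the patterns is itself a design decision: moving `\w+` to the top would break the web address into pieces.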

Periods
• In general, punctuation marks attach to words, and can be removed. However, there are special cases:
  • Most periods mark the end of a sentence
  • Others mark abbreviations, e.g. "e.g.", "Wash."
• Note that when an abbreviation occurs at the end of a sentence there is only one period.
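
A minimal sketch of the usual treatment: detach a trailing period unless the chunk appears in an abbreviation list. The `ABBREVIATIONS` set is a small hypothetical sample, and the function names are invented for the example.

```python
# Hypothetical abbreviation list; real lists are longer and domain-specific.
ABBREVIATIONS = {"e.g.", "i.e.", "Wash.", "Dr.", "etc."}

def split_period(chunk):
    """Detach a trailing period unless the chunk is a known abbreviation."""
    if len(chunk) > 1 and chunk.endswith(".") and chunk not in ABBREVIATIONS:
        return [chunk[:-1], "."]
    return [chunk]

def tokenise_periods(text):
    tokens = []
    for chunk in text.split():
        tokens.extend(split_period(chunk))
    return tokens

print(tokenise_periods("He moved to Wash. last year."))
```

This sketch does not handle the single-period case: when a listed abbreviation ends the sentence, its period is kept with the abbreviation and no sentence-final token is produced.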

Apostrophe
• English contractions such as won't or I'll count as one word according to the classic definition
• However, there are reasons for wanting two separate tokens, such as interaction with grammar rules (S → NP VP)
• Penn Treebank splits such contractions into two words.
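
The sketch below splits contractions in the spirit of that convention, separating n't and a handful of apostrophe suffixes from the preceding word. The suffix list is an assumption chosen for the example and is not the full Penn Treebank rule set.

```python
import re

# Split before n't, or before an apostrophe suffix such as 'll, 's, 're.
CLITIC_RE = re.compile(r"^(\w+?)(n't|'(?:ll|s|re|ve|d|m))$", re.IGNORECASE)

def split_contraction(word):
    """Return [stem, clitic] for a contraction, else [word]."""
    m = CLITIC_RE.match(word)
    return [m.group(1), m.group(2)] if m else [word]

for w in ["won't", "I'll", "he's", "cat"]:
    print(w, split_contraction(w))
```

Splitting is purely string-based, so the first part of won't comes out as wo, and the sketch makes no attempt to decide what a separated 's stands for.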

Apostrophe
• This sometimes leaves odd words
  For example, isn't yields is + n't
• 's is ambiguous
  • Abbreviation for is (he's strange)
  • Possessive (John's car)
• Word-final apostrophe is ambiguous
  • end of quotation
  • possessive of word ending in s

Exercise
• How is the apostrophe used in Maltese?
• How should a Maltese tokeniser deal with it?

Hyphen
• Issue: do sequences of words joined by hyphens count as one word or more?
• Typesetting hyphens (at end of line) and hyphens in measure phrases (35-year-old) are usually removed.
• Typesetting hyphens can be ambiguous
• Lexical hyphens are usually kept
  hi-fi
• Hyphens – standing alone – are used as punctuation.
• Texts are often inconsistent in usage of hyphens
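
One heuristic for the ambiguity of end-of-line hyphens is to drop the hyphen only when the joined form is a known word. The `LEXICON` set and the function name below are hypothetical; in practice the result depends entirely on the coverage of the word list.

```python
# Hypothetical word list used to decide whether a hyphen is lexical.
LEXICON = {"succession", "hi-fi"}

def join_line_break(first, second):
    """Join two fragments where `first` ends with a line-final hyphen."""
    joined = first[:-1] + second
    if joined in LEXICON:
        return joined          # typesetting hyphen: drop it
    return first + second      # otherwise keep the hyphen

print(join_line_break("success-", "ion"))
print(join_line_break("hi-", "fi"))
```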

Case
• Types vs. Tokens
• How many tokens in the following sentence:
  The cat chased the rat on the table
• How many types?
• Tokenisation should correctly identify word types, i.e.
  • Tokens of the same type should be identified
  • Tokens of different type should be distinguished
• Case representation of ordinary words must be standardised.
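
The distinction can be checked mechanically. The sketch below counts tokens by splitting on whitespace and counts types twice, once treating The and the as different and once after lower-casing; splitting on whitespace is an assumption that only works here because the sentence has no punctuation.

```python
sentence = "The cat chased the rat on the table"

tokens = sentence.split()
types_case_sensitive = set(tokens)
types_case_folded = {t.lower() for t in tokens}

print("tokens:", len(tokens))
print("types (case-sensitive):", len(types_case_sensitive))
print("types (lower-cased):", len(types_case_folded))
```

The two type counts differ by one, which is why case has to be standardised before types are counted.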

Case
• Heuristics
  • Map first character of a sentence to standard case
  • Map all words in titles to lowercase
• Problems
  • Identification of sentence boundaries
  • Identification of proper names
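
The first heuristic might be sketched as follows, assuming the text has already been split into sentences and that a list of proper names is available. Both assumptions correspond to the problems listed, and `PROPER_NAMES` is a hypothetical sample.

```python
# Hypothetical list of proper names that must keep their capital letter.
PROPER_NAMES = {"Manchester", "Stuttgart", "Wednesday"}

def normalise_first_word(tokens):
    """Lower-case the sentence-initial token unless it is a known name."""
    if tokens and tokens[0] not in PROPER_NAMES:
        return [tokens[0].lower()] + tokens[1:]
    return list(tokens)

print(normalise_first_word(["The", "cat", "sat"]))
print(normalise_first_word(["Manchester", "United", "lost"]))
```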

Normalisation
• Character representations.
• Converting all letters to lower or upper case
• Removing punctuation
• Removing accent marks and other diacritics from letters
• Expanding abbreviations
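
Two of these steps, case conversion and diacritic removal, can be sketched with Python's standard `unicodedata` module. The approach decomposes each character and drops the combining marks; the function names are invented for the example, and letters that have no decomposed form are left unchanged.

```python
import unicodedata

def strip_diacritics(text):
    """Remove combining marks after canonical decomposition (NFD)."""
    decomposed = unicodedata.normalize("NFD", text)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))

def normalise(text):
    """Strip diacritics, then lower-case."""
    return strip_diacritics(text).lower()

print(normalise("Schütze"))
print(normalise("Café"))
```

Whether this step is desirable depends on the language: where a diacritic distinguishes two different words, removing it merges their types.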

Further Normalisation
• Stemming: are eats and eating different words?
• They are two different wordforms that have the same stem, eat, but different suffixes, -s and -ing
• Stemming versus full morphological analysis.
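
A deliberately naive suffix stripper shows both the idea and its limits. The suffix list and the minimum stem length of three are arbitrary assumptions made for this sketch; it is not an implementation of any published stemming algorithm.

```python
SUFFIXES = ("ing", "s")   # only the two suffixes from the example
MIN_STEM = 3              # arbitrary guard against stripping short words

def naive_stem(word):
    """Strip the first matching suffix if enough of the word remains."""
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= MIN_STEM:
            return word[: -len(suffix)]
    return word

for w in ["eats", "eating", "sing", "running"]:
    print(w, "->", naive_stem(w))
```

The result for running is runn, not run. Handling such cases properly requires knowledge of spelling rules and word structure, which is where full morphological analysis comes in.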

Summary
• The tokenisation problem interacts with design decisions at different levels concerning
  • Handling of non-alphanumeric characters
  • Case
  • Punctuation
• Typically many of these problems are dealt with by hand-crafting special rules which match a particular case.
• Such rules are often built out of regular expressions.

Sources
Foundations of Statistical Natural Language Processing, Manning and Schütze, MIT Press, 1999
