CS460/IT632 Natural Language Processing/Language Technology for the Web
Lecture 9 (03/02/06)
Prof. Pushpak Bhattacharyya, IIT Bombay
Dealing With Corpus

Sources of the Corpus
• Mainly in US & Europe
  • LDC – www.ldc.upenn.edu
  • ELRA – www.elra.info
  • Oxford Text Archive – www.ota.ahds.ac.uk
• Brown Corpus (1961) – American English (1 million words)
• British National Corpus (BNC) – British English (100 million words)

Dealing with Corpus - Challenges
• Word recognition (Tokenization)
• Sentence recognition

Problems in Tokenization
• Uppercase/lowercase
  • Uppercase may mark a proper noun, a sentence beginning, emphasis, or a title
  • I told you to SUBMIT the report. (emphasizer)
• Proper Nouns – Named entity detection
• Dates – they have no single standard format
  • 2/8/2006
  • 8 February 2006
  • 8-Feb-06
  • February 8, 2006
  • & many more
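
The four date formats listed above can be kept together as single tokens with a regular expression. The following is a minimal sketch, not a complete date recognizer: the names `MONTH`, `DATE_RE` and `find_dates` are illustrative, only English month names are covered, and the pattern finds date spans without deciding whether 2/8/2006 means 2 August or 8 February.

```python
import re

# English month names, full or three-letter form (an assumption of this sketch).
MONTH = (r"(?:Jan(?:uary)?|Feb(?:ruary)?|Mar(?:ch)?|Apr(?:il)?|May|Jun(?:e)?|"
         r"Jul(?:y)?|Aug(?:ust)?|Sep(?:tember)?|Oct(?:ober)?|Nov(?:ember)?|"
         r"Dec(?:ember)?)")

DATE_RE = re.compile(
    r"\b(?:"
    r"\d{1,2}/\d{1,2}/\d{2,4}"                # 2/8/2006
    r"|\d{1,2}-" + MONTH + r"-\d{2,4}"        # 8-Feb-06
    r"|\d{1,2}\s+" + MONTH + r"\s+\d{4}"      # 8 February 2006
    r"|" + MONTH + r"\s+\d{1,2},\s*\d{4}"     # February 8, 2006
    r")\b"
)

def find_dates(text):
    """Return every substring of text that matches one of the date patterns."""
    return [m.group(0) for m in DATE_RE.finditer(text)]

sample = "Due 2/8/2006, or 8 February 2006, or 8-Feb-06, or February 8, 2006."
print(find_dates(sample))
```

Running such a pattern before whitespace tokenization keeps "8 February 2006" as one token instead of three. The "& many more" formats would each need their own alternative.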

Problems in Tokenization (Contd.1)
• Phone Numbers – non-standard format
  • 25767718
  • 2576 7718
  • 22-25767718
  • 022-25767718
  • 022.25767718
  • 91-22-25767718
  • 01711380647 (UK format)
  • (44-171)8301007 (UK format)
  • +45 43 48606 (Denmark format)
  • (94-1)866854 (Sri Lanka format)
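
Because the formats vary so much, one pragmatic approach is to find any run of digits and common separators and then filter by digit count. The sketch below is a heuristic under stated assumptions: the 7 to 15 digit range and the names `PHONE_RE` and `find_phone_candidates` are choices made for this illustration, and two unrelated numbers separated only by a space would be merged into one candidate.

```python
import re

# Optional '+', optional '(', then digits mixed with spaces, dots, hyphens
# and parentheses, ending on a digit. The lookbehind stops a match from
# starting in the middle of a word or a longer number.
PHONE_RE = re.compile(r"(?<![\w+])\+?\(?\d[\d .()\-]*\d")

def find_phone_candidates(text, min_digits=7, max_digits=15):
    """Return substrings that look like phone numbers (heuristic only)."""
    candidates = []
    for match in PHONE_RE.finditer(text):
        digits = re.sub(r"\D", "", match.group(0))  # keep digits only
        if min_digits <= len(digits) <= max_digits:
            candidates.append(match.group(0))
    return candidates

print(find_phone_candidates("Call 022-25767718 or +45 43 48606 today."))
```

Each format on the slide is returned as a single candidate by this pattern. Telling a phone number apart from any other long number still needs context that a regular expression does not have.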

Problems in Tokenization (Contd.2)
• Periods (full stops) – a period can be a sentence delimiter or an abbreviation marker. Some examples:
  • U.N.O. stopped aid to Afghanistan.
  • Ex.
• Apostrophe – genitive or shortening device
  • Ram’s (genitive) brother isn’t (shortening) well today.
• Haplology – multiple roles played by a punctuation mark.
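
A common baseline for the period problem is to treat a period as a sentence boundary unless the token is a known abbreviation or a string of single-letter initials. The sketch below assumes a small, illustrative abbreviation list (`ABBREVIATIONS`) and whitespace-separated tokens; it is not a full sentence recognizer.

```python
import re

# Illustrative abbreviation list; a real system would use a much larger one.
ABBREVIATIONS = {"dr.", "mr.", "mrs.", "prof.", "ex.", "etc.", "e.g.", "i.e."}
# Single letters each followed by a period, e.g. U.N.O. or U.S.
INITIALISM_RE = re.compile(r"^(?:[A-Za-z]\.)+$")

def split_sentences(text):
    """Split text into sentences, skipping periods that mark abbreviations."""
    tokens = text.split()
    sentences, current = [], []
    for i, token in enumerate(tokens):
        current.append(token)
        if not token.endswith((".", "?", "!")):
            continue
        is_abbreviation = (token.lower() in ABBREVIATIONS
                           or INITIALISM_RE.match(token) is not None)
        has_next = i + 1 < len(tokens)
        if is_abbreviation and has_next:
            continue  # period belongs to the abbreviation, not a boundary
        sentences.append(" ".join(current))
        current = []
    if current:
        sentences.append(" ".join(current))
    return sentences

print(split_sentences("U.N.O. stopped aid to Afghanistan. Ram's brother isn't well today."))
```

This sketch deliberately fails on the haplology case: when an abbreviation ends a sentence, its single period plays both roles, and the rule above reads it only as an abbreviation marker.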
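
Precision & Recall (False Hit & False Miss)

$$\text{Precision} = \frac{\mathrm{size}(\text{Actual} \cap \text{Hypothesis})}{\mathrm{size}(\text{Hypothesis})}$$

$$\text{Recall} = \frac{\mathrm{size}(\text{Actual} \cap \text{Hypothesis})}{\mathrm{size}(\text{Actual})}$$

[Figure: Venn diagram of the Actual Set and the Hypothesis Set. The part of the Actual Set outside the overlap is the False Miss region; the part of the Hypothesis Set outside the overlap is the False Hit region.]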
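
The two ratios translate directly into set operations. The sketch below is illustrative: the function name `precision_recall` is hypothetical, and returning 0.0 for an empty set is a convention chosen here, since the ratio is undefined in that case.

```python
def precision_recall(actual, hypothesis):
    """Compute precision, recall, false hits and false misses for two sets."""
    actual, hypothesis = set(actual), set(hypothesis)
    hits = actual & hypothesis              # Actual ∩ Hypothesis
    # 0.0 for an empty denominator is a convention of this sketch.
    precision = len(hits) / len(hypothesis) if hypothesis else 0.0
    recall = len(hits) / len(actual) if actual else 0.0
    false_hits = hypothesis - actual        # proposed but not correct
    false_misses = actual - hypothesis      # correct but not proposed
    return precision, recall, false_hits, false_misses

# A tokenizer that wrongly splits the final period off "U.N.O.":
actual = {"U.N.O.", "stopped", "aid"}
hypothesis = {"U.N.O", ".", "stopped", "aid"}
precision, recall, false_hits, false_misses = precision_recall(actual, hypothesis)
print(precision, recall, false_hits, false_misses)
```

Here two of the four hypothesized tokens are correct, so precision is 2/4, and two of the three actual tokens are found, so recall is 2/3.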

Further Challenges
• Hyphen – can occur in a compound word or as a word continuer at a line break.
  • Play-mate (compound word)
  • I would like to watch the game play- fully. (word continuer: "playfully" split across lines)
• Conjoining & Compounding (Sandhii & Samaas)
  • Vidhyaalaya = vidhyaa + aalay (sandhii)
  • Raajaputra = raajaa + putra (samaas)

Multiword Recognition
• A multiword is a single token composed of words separated by blanks.
  • United Nations Organization (proper noun)
  • Golf club, cricket bat (common nouns)
• What kind of relationship holds between the words of a multiword or compound?
  • Raajaarshi = raajaa + rishi
    • Meaning – a king (raajaa) who also has the qualities of a saint (rishi)
  • Raajaputra = raajaa + putra
    • Meaning – son (putra) of the king (raajaa)
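
One simple way to recognize multiwords is a longest-match lookup against a lexicon of known multiword entries. The sketch below assumes such a lexicon is available; `MULTIWORDS` and `group_multiwords` are illustrative names, and the entries are the examples from these slides.

```python
# Illustrative multiword lexicon, stored as lowercase word tuples.
MULTIWORDS = {
    ("united", "nations", "organization"),
    ("golf", "club"),
    ("cricket", "bat"),
}
MAX_LEN = max(len(entry) for entry in MULTIWORDS)

def group_multiwords(tokens):
    """Merge runs of tokens that form a known multiword, longest match first."""
    result, i = [], 0
    while i < len(tokens):
        # Try the longest possible span first, down to two words.
        for n in range(min(MAX_LEN, len(tokens) - i), 1, -1):
            candidate = tuple(t.lower() for t in tokens[i:i + n])
            if candidate in MULTIWORDS:
                result.append(" ".join(tokens[i:i + n]))
                i += n
                break
        else:
            # No multiword starts here; keep the single token.
            result.append(tokens[i])
            i += 1
    return result

print(group_multiwords("He left his cricket bat at the golf club".split()))
```

A lexicon lookup only finds multiwords that are already listed. It says nothing about the relationship between the constituent words, which is the harder question raised above.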

Multiword (Contd.)
• There are NOUN + VERB combinations –
  • sit down
  • jamhaayii lenaa (to yawn)
  • gir paRnaa (to fall down)
• A typical multiword in German –
  • Donau|dampf|schiff|fahrts|gesellschafts|kapitäns|mützen|fabrikant
  • Danube|steam|ship|voyage|company|captain|cap|producer (Gloss in English)
NOTE: The character ‘|’ is inserted to show the different constituents of the compound word. The word itself is written without any ‘|’ in between.
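
Splitting a compound that is written without separators can be sketched as a lexicon-driven search with backtracking. This is a toy illustration, not a morphological analyzer: `LEXICON` and `split_compound` are hypothetical names, and the lexicon simply lists the constituents as they appear in the compound, linking elements included.

```python
# Toy lexicon: the surface constituents of the example compound.
LEXICON = {"donau", "dampf", "schiff", "fahrts", "gesellschafts",
           "kapitäns", "mützen", "fabrikant"}

def split_compound(word, lexicon):
    """Return a list of lexicon entries that concatenate to word, or None."""
    word = word.lower()

    def helper(start):
        if start == len(word):
            return []                         # consumed the whole word
        for end in range(len(word), start, -1):   # longest piece first
            piece = word[start:end]
            if piece in lexicon:
                rest = helper(end)
                if rest is not None:          # the remainder also splits
                    return [piece] + rest
        return None                           # no split from this position

    return helper(0)

word = "Donaudampfschifffahrtsgesellschaftskapitänsmützenfabrikant"
print("|".join(split_compound(word, LEXICON)))
```

The search backtracks when a long first piece leaves a remainder that cannot be split. With a realistic lexicon many words have several valid splits, so a real system also needs a way to rank them.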

Techniques for Parsing
• There are different techniques for parsing -
  • Top Down Parsing
  • Bottom Up (Bottom Up Chart Parsing)
  • Top Down & Bottom Up (Top Down Chart Parsing)
• Parsing can be deterministic or probabilistic.
• It can be based on phrase structure grammar or dependency grammar.