
Sources of the Corpus

CS460/IT632 Natural Language Processing/Language Technology for the Web, Lecture 9 (03/02/06), Prof. Pushpak Bhattacharyya, IIT Bombay. Dealing With Corpus. Sources of the Corpus: mainly in US & Europe. LDC – www.ldc.upenn.edu; ELRA – www.elra.info; Oxford Text Archive – www.ota.ahds.ac.uk


Presentation Transcript


  1. CS460/IT632 Natural Language Processing/Language Technology for the Web
     Lecture 9 (03/02/06)
     Prof. Pushpak Bhattacharyya, IIT Bombay
     Dealing With Corpus

  2. Sources of the Corpus
     • Mainly in US & Europe
       • LDC – www.ldc.upenn.edu
       • ELRA – www.elra.info
       • Oxford Text Archive – www.ota.ahds.ac.uk
     • Brown Corpus (1961) – American English (1 million words)
     • British National Corpus (BNC) – British English (100 million words)

  3. Dealing with Corpus - Challenges
     • Word recognition (Tokenization)
     • Sentence recognition

  4. Problems in Tokenization
     • Uppercase/lowercase
       • Uppercase may mark a proper noun, a sentence beginning, emphasis, or a title
       • I told you to SUBMIT the report. (emphasizer)
     • Proper Nouns – Named entity detection
     • Dates – they have a non-standard format
       • 2/8/2006
       • 8 February 2006
       • 8-Feb-06
       • February 8, 2006
       • & many more
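The date variants on this slide can be mapped to one canonical value by trying a list of known formats in turn. This is only a minimal sketch: the format list and the helper name `parse_date` are illustrative assumptions, and `2/8/2006` is assumed to be month-first so that it agrees with the other three examples (a day-first reading would give 2 August 2006, which is exactly the ambiguity the slide points at). Month names are matched in the English/C locale.

```python
from datetime import date, datetime

# Hypothetical format list covering the four examples on the slide.
# "%m/%d/%Y" assumes month-first order for 2/8/2006 (an assumption).
DATE_FORMATS = ["%m/%d/%Y", "%d %B %Y", "%d-%b-%y", "%B %d, %Y"]


def parse_date(text):
    """Return a date for the first format that matches, or None."""
    for fmt in DATE_FORMATS:
        try:
            return datetime.strptime(text, fmt).date()
        except ValueError:
            continue  # this format does not fit; try the next one
    return None


examples = ["2/8/2006", "8 February 2006", "8-Feb-06", "February 8, 2006"]
parsed = [parse_date(e) for e in examples]
```

Under these assumptions all four strings map to the same `date` object, so a tokenizer could emit a single normalized DATE token for any of them.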

  5. Problems in Tokenization (Contd.1)
     • Phone Numbers – non-standard format
       • 25767718
       • 2576 7718
       • 22-25767718
       • 022-25767718
       • 022.25767718
       • 91-22-25767718
       • 01711380647 (UK format)
       • (44-171)8301007 (UK format)
       • +45 43 48606 (Denmark format)
       • (94-1)866854 (Sri Lanka format)
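One way to cope with these variants is to keep a phone number together as a single token and then compare numbers on their digits alone. The sketch below is a rough illustration, not a complete phone grammar: the pattern `PHONE_RE`, the helpers `digits_only` and `same_subscriber`, and the 8-digit local length are all assumptions chosen to fit the slide's examples.

```python
import re

# Hypothetical pattern: optional "+", optional "(", then digits mixed with
# spaces, dots, hyphens and parentheses, ending in a digit.
PHONE_RE = re.compile(r"\+?\(?\d[\d\s().-]{5,}\d")


def digits_only(number):
    """Strip every non-digit character (spaces, dots, hyphens, brackets)."""
    return re.sub(r"\D", "", number)


def same_subscriber(a, b, local_len=8):
    """Compare two numbers on their last local_len digits (assumed length)."""
    return digits_only(a)[-local_len:] == digits_only(b)[-local_len:]


variants = ["25767718", "2576 7718", "22-25767718",
            "022-25767718", "022.25767718", "91-22-25767718"]
all_same = all(same_subscriber(variants[0], v) for v in variants)
found = PHONE_RE.search("Call 2576 7718 today.").group()
```

Comparing on a fixed-length suffix is a simplification; real numbering plans have variable-length area and subscriber parts.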

  6. Problems in Tokenization (Contd.2)
     • Periods (full stops) – a period can act as a sentence delimiter or mark an abbreviation. Some examples:
       • U.N.O. stopped aid to Afghanistan.
       • Ex.
     • Apostrophe – genitive or shortening device
       • Ram’s (genitive) brother isn’t (shortening) well today.
     • Haplology – multiple roles played by a single punctuation mark.
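The two roles of the period can be illustrated with a very small sentence splitter that consults an abbreviation list before treating a period as a delimiter. This is a sketch under stated assumptions: the set `ABBREVIATIONS` and the function `split_sentences` are hypothetical, and whitespace tokenization is used only to keep the example short.

```python
# Hypothetical abbreviation list; a real system would need a much larger one.
ABBREVIATIONS = {"U.N.O.", "Ex.", "Prof.", "etc."}


def split_sentences(text):
    """Split on tokens ending in . ? ! unless the token is a known abbreviation."""
    sentences, current = [], []
    for token in text.split():
        current.append(token)
        if token.endswith((".", "?", "!")) and token not in ABBREVIATIONS:
            sentences.append(" ".join(current))
            current = []
    if current:  # trailing words without a final delimiter
        sentences.append(" ".join(current))
    return sentences


text = "U.N.O. stopped aid to Afghanistan. Ram's brother isn't well today."
result = split_sentences(text)
```

The sketch does not handle haplology: when a sentence ends with an abbreviation, the single period plays both roles and this splitter misses the boundary.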

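  7. Precision & Recall (False Hit & False Miss)
     $$\text{Precision} = \frac{\text{size}(\text{Actual} \cap \text{Hypothesis})}{\text{size}(\text{Hypothesis})}$$
     $$\text{Recall} = \frac{\text{size}(\text{Actual} \cap \text{Hypothesis})}{\text{size}(\text{Actual})}$$
     [Figure: overlapping Actual Set and Hypothesis Set. False Miss is the part of the Actual Set outside the Hypothesis Set; False Hit is the part of the Hypothesis Set outside the Actual Set.]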
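The two ratios can be computed directly with set operations. In this sketch the function name `precision_recall` and the toy token sets are illustrative assumptions; returning 0.0 for an empty set is also a convention chosen here, not something the slide specifies.

```python
def precision_recall(actual, hypothesis):
    """Return (precision, recall, false_hits, false_misses) for two collections."""
    actual, hypothesis = set(actual), set(hypothesis)
    hits = actual & hypothesis            # Actual ∩ Hypothesis
    false_hits = hypothesis - actual      # proposed but not correct
    false_misses = actual - hypothesis    # correct but not proposed
    precision = len(hits) / len(hypothesis) if hypothesis else 0.0
    recall = len(hits) / len(actual) if actual else 0.0
    return precision, recall, false_hits, false_misses


# Toy example: a tokenizer wrongly splits "aid" into "ai" and "d".
actual = {"U.N.O.", "stopped", "aid"}
hypothesis = {"U.N.O.", "stopped", "ai", "d"}
precision, recall, false_hits, false_misses = precision_recall(actual, hypothesis)
```

Here two of the four hypothesized tokens are correct (precision 0.5) and two of the three actual tokens are found (recall 2/3).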

  8. Further Challenges
     • Hyphen – can occur in a compound word or as a word continuer.
       • Play-mate
       • I would like to watch the game play- fully.
     • Conjoining & Compounding (Sandhii & Samaas)
       • Vidhyaalaya = vidhyaa + aalay (sandhii)
       • Raajaputra = raajaa + putra (samaas)
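The word-continuer use of the hyphen can be undone by joining a hyphen that is immediately followed by a line break, while leaving in-line hyphens such as the one in "Play-mate" alone. This is a heuristic sketch; the function name `join_line_break_hyphens` is hypothetical.

```python
import re


def join_line_break_hyphens(text):
    """Remove a hyphen followed by a line break, rejoining the split word."""
    # (\w)-  : a word character followed by a hyphen
    # \s*\n\s* : the line break, with optional surrounding whitespace
    return re.sub(r"(\w)-\s*\n\s*(\w)", r"\1\2", text)


broken = "I would like to watch the game play-\nfully."
joined = join_line_break_hyphens(broken)
compound = join_line_break_hyphens("Play-mate")
```

The heuristic fails when a genuine compound hyphen happens to fall at the end of a line; resolving that case needs a lexicon lookup on both the joined and the hyphenated form.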

  9. Multiword Recognition
     • A multiword is a single token composed of words separated by blanks.
       • United Nations Organization (proper noun)
       • Golf club, cricket bat (common nouns)
     • What kind of relationship holds between the words of a multiword or compound?
       • Raajaarshi = raajaa + rishi
         Meaning – a king (raajaa) who also has the qualities of a saint (rishi)
       • Raajaputra = raajaa + putra
         Meaning – son (putra) of the king (raajaa)
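A simple way to recognize multiwords is a longest-match lookup against a lexicon of known multiword entries. The sketch below assumes such a lexicon exists; the set `MULTIWORDS` and the function `group_multiwords` are hypothetical and hold only the slide's examples.

```python
# Hypothetical multiword lexicon, stored as lowercase word tuples.
MULTIWORDS = {
    ("united", "nations", "organization"),
    ("golf", "club"),
    ("cricket", "bat"),
}
MAX_LEN = max(len(entry) for entry in MULTIWORDS)


def group_multiwords(tokens):
    """Merge the longest lexicon match starting at each position into one token."""
    result, i = [], 0
    while i < len(tokens):
        # Try the longest candidate first, down to two words.
        for n in range(min(MAX_LEN, len(tokens) - i), 1, -1):
            candidate = tuple(t.lower() for t in tokens[i:i + n])
            if candidate in MULTIWORDS:
                result.append(" ".join(tokens[i:i + n]))
                i += n
                break
        else:  # no multiword starts here; keep the single word
            result.append(tokens[i])
            i += 1
    return result


tokens = "He left his golf club at the United Nations Organization".split()
grouped = group_multiwords(tokens)
```

A lexicon lookup only finds listed entries; it says nothing about the semantic relationship between the parts, which is the second question the slide raises.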

  10. Multiword (Contd.)
     • There are NOUN + VERB combinations –
       • sit down
       • jamhaayii lenaa (to yawn)
       • gir paRnaa (to fall down)
     • A typical multiword in German –
       • Donau|dampf|schiff|fahrts|gesellschafts|kapitäns|mützen|fabrikant
       • Danube|steam|ship|voyage|company|captain|cap|producer (gloss in English)
     NOTE: The character ‘|’ is inserted to show the constituents of the compound word. The word itself is written without any ‘|’.
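Splitting a compound written without separators can be sketched as a lexicon-driven search with backtracking. The lexicon below is hypothetical, uses standard German spellings, and covers only the first part of the slide's example; `split_compound` is an illustrative name, not an established API.

```python
# Hypothetical lexicon of constituents (including the linking form "fahrts").
LEXICON = {"donau", "dampf", "schiff", "fahrts", "gesellschaft"}


def split_compound(word, lexicon):
    """Return one segmentation of word into lexicon entries, or None."""
    word = word.lower()
    if not word:
        return []
    # Try longer prefixes first and backtrack if the rest cannot be split.
    for end in range(len(word), 0, -1):
        prefix = word[:end]
        if prefix in lexicon:
            rest = split_compound(word[end:], lexicon)
            if rest is not None:
                return [prefix] + rest
    return None


parts = split_compound("Donaudampfschifffahrtsgesellschaft", LEXICON)
shown = "|".join(parts)
```

The plain recursion can take exponential time on long inputs; memoizing on the remaining suffix would be the usual remedy.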

  11. Techniques for Parsing
     • There are different techniques for parsing –
       • Top Down Parsing
       • Bottom Up (Bottom Up Chart Parsing)
       • Top Down & Bottom Up (Top Down Chart Parsing)
     • Parsing can be deterministic or probabilistic.
     • It can be based on phrase structure grammar or dependency grammar.
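As one concrete instance of bottom-up chart parsing, the sketch below uses the CYK algorithm over a toy phrase structure grammar in Chomsky normal form. The slide does not name a specific algorithm, so CYK is a stand-in chosen here, and the grammar, lexicon, and function name `cyk_recognize` are all illustrative assumptions.

```python
from itertools import product

# Toy grammar in Chomsky normal form (hypothetical):
# S -> NP VP, NP -> Det N, VP -> V NP
BINARY = {("NP", "VP"): {"S"}, ("Det", "N"): {"NP"}, ("V", "NP"): {"VP"}}
LEXICAL = {"the": {"Det"}, "dog": {"N"}, "cat": {"N"}, "saw": {"V"}}


def cyk_recognize(words, start="S"):
    """Return True if the toy grammar derives the word sequence."""
    n = len(words)
    if n == 0:
        return False
    # chart[i][j] holds the nonterminals that derive words[i..j] inclusive.
    chart = [[set() for _ in range(n)] for _ in range(n)]
    for i, word in enumerate(words):
        chart[i][i] = set(LEXICAL.get(word, ()))
    # Build longer spans bottom-up from shorter ones.
    for span in range(2, n + 1):
        for i in range(n - span + 1):
            j = i + span - 1
            for k in range(i, j):  # split point between the two children
                for left, right in product(chart[i][k], chart[k + 1][j]):
                    chart[i][j] |= BINARY.get((left, right), set())
    return start in chart[0][n - 1]


accepted = cyk_recognize("the dog saw the cat".split())
rejected = cyk_recognize("dog the saw".split())
```

This recognizer is deterministic and only answers yes or no; a probabilistic variant would store a score with each chart entry and keep the best-scoring derivation.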
