
CMCS723/LING723 Computational Linguistics I


Presentation Transcript


  1. CMCS723/LING723 Computational Linguistics I Language of the subconscious, by WildCherry - Saif Mohammad

  2. The instruction team • Instructor: Saif Mohammad • Co-instructor: Nitin Madnani • Coordinator: Professor Bonnie Dorr • Teaching Assistant: Sajib Dasgupta

  3. The instruction team • Instructor: Saif Mohammad • Co-instructor: Nitin Madnani • Coordinator: Professor Bonnie Dorr • Teaching Assistant: Sajib Dasgupta • Guest Lectures: • Bonnie Dorr • Philip Resnik • Doug Oard

  4. You (pre-requisites) • Competent programmers

  5. You (pre-requisites) • Competent programmers • Do not have to be linguists • Have high-school English behind you • Know parts of speech, syntactic parse trees, subject, object,… • Read material on word classes and context-free grammars from J&M chapters 5 and 12 for background

  6. Administrivia • Text: • Speech and Language Processing: An Introduction to Natural Language Processing, Computational Linguistics, and Speech Recognition, second edition (published in 2008), by Daniel Jurafsky and James H. Martin. • Course webpage: • http://www.umiacs.umd.edu/~saif/WebPages/CS723.htm • Class: • Wednesdays, 4 to 6:30 pm (5–10 min break in between)

  7. Course grade • Exams: 50% • midterm exam: 25% • final exam: 25% • Class assignments/projects: 45% • Assignment 1 through 4: 10%, 12.5%, 10%, 12.5% • Assignment 0: no credit • designed to calibrate programming skills • Class participation: 5% • Showing up for class, demonstrating preparedness, and contributing to class discussions.

  8. Out-of-class support • Office hours: • Saif: by appointment • Sajib: TA room 1112 • Mondays: 4 to 5:30 pm • Tuesdays: 2 to 3:30 pm • Forum: • https://forum.cs.umd.edu/forumdisplay.php?f=113

  9. Nitin’s Role • Focus on Statistical Models • HMMs, EM, N-gram LMs, TAGs (approx. 4 lectures) • Assignments • All written in Python/NLTK • Python/NLTK tutorial next week (show up!) • Assignment 0 (not for credit) • Purpose: Introspection and Practice • Try to solve problem 1 before tutorial next week, problem 2 after

  10. Nitin’s Role • Forums • Register unless already registered for another class • Preferred way to ask questions • Feel free to start discussion threads, if necessary • Subscribe to notifications!

  11. What is Computational Linguistics? • Study of computer processing, understanding, and generation of human languages • Interdisciplinary field • Linguistics, machine learning and artificial intelligence, statistics, cognitive science, psychology, and others • Common applications: • Machine translation, information retrieval, text summarization, question answering

  12. Overview and History of Computational Linguistics Professor Bonnie Dorr

  13. Introduction to Statistical Natural Language Processing

  14. Practical NLP system • Disambiguation decisions of word sense, word category, syntactic structure,… • Maximize coverage, minimize errors (false positives) • Robust • Generalize well

  15. Traditional NLP • AI approaches with deep understanding had hand-coded rules • Creating the rules is time-consuming • One may miss rules; sometimes there are too many rules to encode • May not scale to different domains • Brittle (e.g., metaphors: "I swallowed his story")

  16. Statistical NLP • Counting things • Determining patterns that occur in language use • Learn rules and patterns automatically from features • Statistical models are robust, generalize well, and behave gracefully when faced with less-than-perfect conditions

  17. Corpus-based NLP • Corpus: a collection of natural language documents • British National Corpus, Wall Street Journal, Google's web-indexed corpus, Switchboard corpus • Can we learn how language works from this text? • Look for patterns in the corpus

  18. Features of a corpus • Size • Balanced or domain-specific • Written or spoken • Raw or annotated (senses, POS tags, syntactic structure) • Electronically available or hard copy • Free to use, or one needs to pay for a license

  19. More corpora • Brown • Susanne • Penn Treebank • Canadian Hansards

  20. Other lexical resources • Dictionaries • Gloss, example sentence • Thesauri • categories, paragraphs, semicolon units • WordNet • synsets, gloss • hypernyms, holonyms, troponyms

  21. Getting our hands dirty.

  22. Let's pick up a book.

  23. What are the most frequent words? Tom Sawyer

  24. What are the most frequent words? Tom Sawyer
      word    freq    part of speech
      the     3332    determiner (article)
      and     2972    conjunction
      a       1775    determiner
      to      1725    preposition, verbal infinitive marker
      of      1440    preposition
      was     1161    auxiliary verb
      it      1027    (personal/expletive) pronoun
      in       906    preposition

  25. How many words are there? Tom Sawyer • Tokens: 71,370 • Types: 8,018 • Memory: half a megabyte • Average frequency of a word • # tokens / # types = 8.9
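The token/type bookkeeping on this slide is easy to sketch in plain Python (stdlib only, no NLTK needed); the text below is a toy stand-in for the novel, and the tokenization is deliberately crude:

```python
import re
from collections import Counter

# Toy stand-in for the novel's text; in practice you would read the
# whole book from a file.
text = "The band played and the crowd cheered and the band played on"

tokens = re.findall(r"\w+", text.lower())  # crude tokenization

counts = Counter(tokens)

n_tokens = len(tokens)         # running words ("tokens")
n_types = len(counts)          # distinct words ("types")
avg_freq = n_tokens / n_types  # average frequency of a word type

print(counts.most_common(3))   # most frequent words first
print(n_tokens, n_types, round(avg_freq, 2))
```

The same three numbers computed for the full Tom Sawyer text give the 71,370 / 8,018 / 8.9 figures on the slide.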

  26. The distribution of words Tom Sawyer
      word freq    # of word types with that freq
      1            3993
      2            1292
      3             664
      4             410
      5             243
      6             199
      7             172
      8             131
      9              82
      10             91
      11–50         540
      51–100         99
      > 100         102
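A frequency-of-frequency table like this one can be built with a second `Counter` over the first one's values; the per-word counts below are made up for illustration:

```python
from collections import Counter

# Hypothetical per-word counts; with real data this would be
# Counter(tokens) over the whole corpus.
counts = Counter({"the": 5, "and": 5, "a": 3, "to": 2, "of": 2, "huckleberry": 1})

# How many word types occur once, twice, three times, ...?
freq_of_freq = Counter(counts.values())
print(sorted(freq_of_freq.items()))  # [(1, 1), (2, 2), (3, 1), (5, 2)]
```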

  27. The distribution of words • Hapax legomena • word types that occur only once in the corpus

  28. The distribution of words • Hapax legomena • word types that occur only once in the corpus • Direct applications of simple word counts • cryptography, style of authorship • Indirectly, counts are used pervasively in NLP

  29. The distribution of words • Hapax legomena • word types that occur only once in the corpus • Direct applications of simple word counts • cryptography, style of authorship • Indirectly, counts are used pervasively in NLP • Why is statistical NLP difficult? • hard to predict much about the behavior of words that occur rarely (if at all)
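Picking the hapax legomena out of a word-count table is a one-liner; the tiny corpus here is just for illustration:

```python
from collections import Counter

# A tiny corpus for illustration
tokens = "to be or not to be that is the question".split()
counts = Counter(tokens)

# Hapax legomena: word types with frequency exactly 1
hapaxes = sorted(w for w, c in counts.items() if c == 1)
print(hapaxes)  # ['is', 'not', 'or', 'question', 'that', 'the']
```

In Tom Sawyer, the table above shows 3,993 of the 8,018 types (about half) are hapaxes, which is why rare words make statistical NLP hard.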

  30. Human Behavior and the Principle of Least Effort • The Principle of Least Effort: “people will act so as to minimize their probable average rate of work” • Evidence: • Underlying statistical distributions in language • Count up words in a corpus • List (rank) words in order of frequency

  31. Zipf’s law • frequency ∝ 1/rank • Example: • the 50th most common word should occur three times as often as the 150th • First observed by Estoup (1916) • there are a few very common words, a middling number of medium-frequency words, and many low-frequency words • both the speaker and the hearer are trying to minimize their effort

  32. Zipf’s law

  33. Zipf’s law, on regular (non-logarithmic) scales
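The rank-50 vs. rank-150 claim on slide 31 follows directly from the proportionality; a quick numeric check, assuming an idealized Zipfian distribution (the constant k is arbitrary):

```python
# Zipf's law says frequency is inversely proportional to rank:
# f(r) ≈ k / r for some corpus-dependent constant k.

def zipf_freq(rank, k=100_000):
    return k / rank

# The 50th most common word vs. the 150th: the ratio is 150/50 = 3,
# independent of k.
ratio = zipf_freq(50) / zipf_freq(150)
print(ratio)  # ~3.0
```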

  34. Other Zipf laws • # meanings ∝ √frequency ∝ 1/√rank • Length of a word ∝ 1/frequency

  35. Sets of strings • Often, we deal with the occurrence and frequencies of sets of strings • given a sentence with the word bank, did the words teller or tellers occur in the sentence? • how many times did the various forms of the word dissect (dissect, dissection, dissected, dissectible) occur in a book? • what are the different dates mentioned in a history book? • Regular expressions are a way of identifying sets of strings
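The set-of-strings queries above can be sketched with Python's `re` module; the example sentences here are made up for illustration:

```python
import re

# Did 'teller' or 'tellers' occur? One pattern covers the whole set.
sentence = "The bank tellers counted the cash while a teller smiled."
print(re.findall(r"\btellers?\b", sentence))  # ['tellers', 'teller']

# One expression covers several inflected forms of 'dissect'
book = "They dissected the frog; the dissection was careful."
print(re.findall(r"\bdissect\w*", book))  # ['dissected', 'dissection']
```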

  36. Regular Expressions and Automata

  37. Regular Expressions • A formula/notation in a special language that is used for specifying simple classes/sets of strings • Developed by Kleene (1956) • Regular expressions can be implemented by finite-state automata • Variations of automata • finite-state transducers and hidden Markov models • speech recognition and synthesis, machine translation, spell-checking, and IE

  38. Example REs olympics → olympics

  39. Example REs olympics → olympics a,…,d → a, b, c, d

  40. Example REs olympics → olympics a,…,d → a, b, c, d INFORMAL

  41. Example REs olympics → olympics [abcd] → a, b, c, d

  42. Example REs olympics → olympics [abcd] → a, b, c, d [a-d] → a, b, c, d

  43. Example REs olympics → olympics [abcd] → a, b, c, d [a-d] → a, b, c, d [Oo]lympics → Olympics, olympics

  44. Example REs olympics → olympics [abcd] → a, b, c, d [a-d] → a, b, c, d [Oo]lympics → Olympics, olympics [A-Z]9 → A9, B9, C9, …, M9, …, Z9

  45. Example REs olympics → olympics [abcd] → a, b, c, d [a-d] → a, b, c, d [Oo]lympics → Olympics, olympics [A-Z]9 → A9, B9, C9, …, M9, …, Z9 [^a-d] → e, f, …, z

  46. Example REs olympics → olympics [abcd] → a, b, c, d [a-d] → a, b, c, d [Oo]lympics → Olympics, olympics [A-Z]9 → A9, B9, C9, …, M9, …, Z9 [^a-d] → e, f, …, z yours|mine → yours, mine

  47. Regular expressions • Optional characters: ?, *, and +

  48. Regular expressions • Optional characters: ?, *, and + • ? (0 or 1) colou?r → color, colour

  49. Regular expressions • Optional characters: ?, *, and + • ? (0 or 1) colou?r → color, colour • * (0 or more) oo*h! → oh!, ooh!, oooh!, …
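These two operators behave the same way in Python's `re` module; a quick sanity check of the patterns from the slides:

```python
import re

# ? makes the preceding character optional (0 or 1 occurrences)
print(bool(re.fullmatch(r"colou?r", "color")))   # True
print(bool(re.fullmatch(r"colou?r", "colour")))  # True

# oo*h! requires one 'o', then 0 or more additional 'o's
print(bool(re.fullmatch(r"oo*h!", "oh!")))    # True
print(bool(re.fullmatch(r"oo*h!", "oooh!")))  # True
print(bool(re.fullmatch(r"oo*h!", "h!")))     # False
```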
