CMSC723/LING723 Computational Linguistics I • Language of the subconscious, by WildCherry • Saif Mohammad
The instruction team • Instructor: Saif Mohammad • Co-instructor: Nitin Madnani • Coordinator: Professor Bonnie Dorr • Teaching Assistant: Sajib Dasgupta • Guest Lectures: • Bonnie Dorr • Philip Resnik • Doug Oard
You (pre-requisites) • Competent programmers • Do not have to be linguists • Have high-school English behind you • Know parts of speech, syntactic parse trees, subject, object,… • Read material on word classes and context-free grammars from J&M chapters 5 and 12 for background
Administrivia • Text: • Speech and Language Processing: An Introduction to Natural Language Processing, Computational Linguistics, and Speech Recognition, second edition (published in 2008), by Daniel Jurafsky and James H. Martin. • Course webpage: • http://www.umiacs.umd.edu/~saif/WebPages/CS723.htm • Class: • Wednesdays, 4 to 6:30 pm (5-10 minute break in between)
Course grade • Exams: 50% • midterm exam: 25% • final exam: 25% • Class assignments/projects: 45% • Assignment 1 through 4: 10%, 12.5%, 10%, 12.5% • Assignment 0: no credit • designed to calibrate programming skills • Class participation: 5% • Showing up for class, demonstrating preparedness, and contributing to class discussions.
Out-of-class support • Office hours: • Saif: by appointment • Sajib: TA room 1112 • Mondays: 4 to 5:30 pm • Tuesdays: 2 to 3:30 pm • Forum: • https://forum.cs.umd.edu/forumdisplay.php?f=113
Nitin’s Role • Focus on Statistical Models • HMMs, EM, N-gram LMs, TAGs (approx. 4 lectures) • Assignments • All written in Python/NLTK • Python/NLTK tutorial next week (show up!) • Assignment 0 (not for credit) • Purpose: Introspection and Practice • Try to solve problem 1 before tutorial next week, problem 2 after
Nitin’s Role • Forums • Register unless already registered for another class • Preferred way to ask questions • Feel free to start discussion threads, if necessary • Subscribe to notifications!
What is Computational Linguistics? • Study of computer processing, understanding, and generation of human languages • Interdisciplinary field • Linguistics, machine learning and artificial intelligence, statistics, cognitive science, psychology, and others • Common applications: • Machine translation, information retrieval, text summarization, question answering
Overview and History of Computational Linguistics Professor Bonnie Dorr
Practical NLP system • Disambiguation decisions of word sense, word category, syntactic structure,… • Maximize coverage, minimize errors (false positives) • Robust • Generalize well
Traditional NLP • AI approaches aiming at deep understanding relied on hand-coded rules • Creating the rules is time-consuming • Rules may be missed; sometimes there are too many to encode • May not scale to different domains • Brittle, e.g. with metaphors: "I swallowed his story"
Statistical NLP • Counting things • Determining patterns that occur in language use • Features: • Learn rules, patterns automatically • Statistical models are robust, generalize well, and behave gracefully when faced with less-than-perfect conditions
Corpus-based NLP • Corpus: a collection of natural language documents • British National Corpus, Wall Street Journal, Google's web-indexed corpus, Switchboard corpus • Can we learn how language works from this text? • Look for patterns in the corpus
Features of a corpus • Size • Balanced or domain-specific • Written or spoken • Raw or annotated (senses, POS tags, structure) • Electronically available or hard copy • Free to use or requires a paid license
More corpora • Brown • Susanne • Penn Treebank • Canadian Hansards
Other lexical resources • Dictionaries • Gloss, example sentence • Thesauri • categories, paragraphs, semicolon units • WordNet • synsets, gloss • hypernyms, holonyms, troponyms
What are the most frequent words? (Tom Sawyer)
Word   Freq.   Use
the    3332    determiner (article)
and    2972    conjunction
a      1775    determiner
to     1725    preposition, verbal infinitive marker
of     1440    preposition
was    1161    auxiliary verb
it     1027    (personal/expletive) pronoun
in      906    preposition
How many words are there? Tom Sawyer • Tokens: 71,370 • Types: 8,018 • Memory: half a megabyte • Average frequency of a word • # tokens / # types = 8.9
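The token and type counts above can be reproduced with a few lines of Python. The following is a minimal sketch, assuming a naive tokenizer (lowercased runs of letters and apostrophes); the names `tokenize` and `corpus_stats` are illustrative and not from the course materials, and a different tokenizer would give somewhat different numbers from those on the slide.

```python
import re
from collections import Counter


def tokenize(text):
    """Naive tokenizer (an assumption): lowercase runs of letters/apostrophes."""
    return re.findall(r"[a-z']+", text.lower())


def corpus_stats(text, top_n=3):
    """Return token count, type count, average frequency, and the top words."""
    tokens = tokenize(text)
    counts = Counter(tokens)          # word type -> number of occurrences
    n_tokens = len(tokens)
    n_types = len(counts)
    avg_freq = n_tokens / n_types if n_types else 0.0
    return n_tokens, n_types, avg_freq, counts.most_common(top_n)


sample = "the cat sat on the mat and the dog sat too"
n_tokens, n_types, avg_freq, top = corpus_stats(sample)
print(n_tokens, n_types, round(avg_freq, 2), top)

# The slide's own arithmetic: 71,370 tokens over 8,018 types.
print(round(71370 / 8018, 1))
```

Reading a whole book from a file and passing its text to `corpus_stats` gives the same kind of summary as the slide.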
The distribution of words (Tom Sawyer)
Word frequency   Frequency of frequency
1                3993
2                1292
3                 664
4                 410
5                 243
6                 199
7                 172
8                 131
9                  82
10                 91
11–50             540
51–100             99
> 100             102
The distribution of words • Hapax legomena • word types that occur only once in the corpus • Direct applications of simple word counts • cryptography, style of authorship • Indirectly, counts are used pervasively in NLP • Why is statistical NLP difficult? • hard to predict much about the behavior of words that occur rarely (if at all)
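A frequency-of-frequencies table and the list of hapax legomena both fall out of a plain word count. This is a minimal sketch on a toy sentence, assuming whitespace tokenization; the function names are illustrative.

```python
from collections import Counter


def frequency_of_frequencies(tokens):
    """Map each frequency f to the number of word types that occur f times."""
    word_counts = Counter(tokens)
    return Counter(word_counts.values())


def hapax_legomena(tokens):
    """Return the word types that occur exactly once, sorted alphabetically."""
    word_counts = Counter(tokens)
    return sorted(w for w, c in word_counts.items() if c == 1)


tokens = "the cat sat on the mat and the dog sat too".split()
print(dict(frequency_of_frequencies(tokens)))
print(hapax_legomena(tokens))
```

In the Tom Sawyer counts, the row for frequency 1 (3993 types) is the number of hapax legomena, roughly half of the 8,018 types, which is why rare words dominate the vocabulary even though they account for few tokens.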
Human Behavior and the Principle of Least Effort • The Principle of Least Effort: “people will act so as to minimize their probable average rate of work” • Evidence: • Underlying statistical distributions in language • Count up words in a corpus • List (rank) words in order of frequency
Zipf’s law • frequency ∝ 1/rank • Example: • the 50th most common word should occur three times as often as the 150th • First observed by Estoup (1916) • there are a few very common words, a middling number of medium-frequency words, and many low-frequency words • Explanation: the speaker and the hearer are both trying to minimize their effort
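If frequency is proportional to 1/rank, then rank × frequency should be roughly constant, and the ratio of two frequencies is the inverse ratio of their ranks. The sketch below checks both on a toy token list built to follow the law closely; `zipf_ratio` and `rank_frequency_table` are illustrative names, and real corpora fit the law only approximately.

```python
from collections import Counter


def zipf_ratio(rank_a, rank_b):
    """Predicted ratio f(rank_a) / f(rank_b) when f is proportional to 1/rank."""
    return rank_b / rank_a


def rank_frequency_table(tokens):
    """Return (rank, word, freq, rank * freq) rows, most frequent word first."""
    ranked = Counter(tokens).most_common()
    return [(rank, word, freq, rank * freq)
            for rank, (word, freq) in enumerate(ranked, start=1)]


# The 50th word is predicted to be three times as frequent as the 150th.
print(zipf_ratio(50, 150))

# Toy data (an assumption): frequencies 6, 3, 2, 1 for ranks 1 to 4.
tokens = "a a a a a a b b b c c d".split()
for row in rank_frequency_table(tokens):
    print(row)
```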
Zipf’s law • [Figure: word frequency plotted against rank on regular (non-logarithmic) scales]
Other Zipf laws • # meanings ∝ √frequency ∝ 1/√rank • Length of a word ∝ 1/frequency
Sets of strings • Often, we deal with the occurrence and frequencies of sets of strings • Given a sentence with the word bank, did the words teller or tellers occur in the sentence? • How many times did the various forms of the word dissect (dissect, dissection, dissected, dissectible) occur in a book? • What are the different dates mentioned in a history book? • Regular expressions are a way of identifying sets of strings
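The three questions above can each be phrased as a regular expression. The following sketch uses Python's standard `re` module; the patterns and function names are illustrative assumptions, and "date" is deliberately simplified here to a four-digit year between 1000 and 2099.

```python
import re

# Either "teller" or "tellers", as a whole word.
TELLER = re.compile(r"\btellers?\b", re.IGNORECASE)
# The four forms of "dissect" listed above, and nothing else.
DISSECT = re.compile(r"\bdissect(?:ion|ed|ible)?\b", re.IGNORECASE)
# Simplified notion of a date (an assumption): a year from 1000 to 2099.
YEAR = re.compile(r"\b(?:1\d{3}|20\d{2})\b")


def mentions_teller(sentence):
    """True if the sentence contains teller or tellers."""
    return TELLER.search(sentence) is not None


def count_dissect_forms(text):
    """Count occurrences of dissect, dissection, dissected, dissectible."""
    return len(DISSECT.findall(text))


def years_mentioned(text):
    """Return the distinct years found in the text, in sorted order."""
    return sorted(set(YEAR.findall(text)))


print(mentions_teller("The bank teller smiled."))
print(count_dissect_forms("We dissect frogs; the dissection was dissected."))
print(years_mentioned("In 1492 and again in 1776, then 1492 once more."))
```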
Regular Expressions • A formula/notation in a special language for specifying simple classes/sets of strings • Developed by Kleene (1956) • Regular expressions can be implemented by finite-state automata • Variations of automata: • finite-state transducers and hidden Markov models • Used in speech recognition and synthesis, machine translation, spell-checking, and information extraction (IE)
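To make the link between regular expressions and finite-state automata concrete, here is a hand-built deterministic automaton for the language of the expression `oo*h!` (one or more o's, then h, then !). It is a sketch for illustration only: the state numbering and the names `TRANSITIONS` and `accepts` are assumptions, and real regex engines build such machines automatically.

```python
# Transition table: (current state, input symbol) -> next state.
TRANSITIONS = {
    (0, "o"): 1,   # the first, obligatory "o"
    (1, "o"): 1,   # any number of further "o"s
    (1, "h"): 2,
    (2, "!"): 3,
}
START_STATE = 0
ACCEPT_STATES = {3}


def accepts(string):
    """Run the automaton over the string; reject on any missing transition."""
    state = START_STATE
    for symbol in string:
        state = TRANSITIONS.get((state, symbol))
        if state is None:
            return False
    return state in ACCEPT_STATES


for candidate in ["oh!", "oooh!", "h!", "oh", "oh!!"]:
    print(candidate, accepts(candidate))
```

The table-driven design keeps the machine itself as data, so recognizing a different regular language only requires a different transition table.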
Example REs
RE             Matches
olympics       olympics
[abcd]         a, b, c, d
[a-d]          a, b, c, d
[Oo]lympics    Olympics, olympics
[A-Z]9         A9, B9, C9, …, M9, …, Z9
[^a-d]         e, f, …, z (any single character other than a, b, c, d)
yours|mine     yours, mine
Note: an informal listing such as a,…,d is written formally as the character class [abcd] or the range [a-d].
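These example expressions can be checked directly with Python's standard `re` module. The sketch below uses `re.fullmatch`, so a pattern must account for the whole string; the `EXAMPLES` table and the `full_match` helper are illustrative names.

```python
import re

# Pattern -> strings it should match in full, taken from the examples above.
EXAMPLES = {
    r"olympics": ["olympics"],
    r"[abcd]": ["a", "b", "c", "d"],
    r"[a-d]": ["a", "b", "c", "d"],
    r"[Oo]lympics": ["Olympics", "olympics"],
    r"[A-Z]9": ["A9", "M9", "Z9"],
    r"[^a-d]": ["e", "f", "z"],
    r"yours|mine": ["yours", "mine"],
}


def full_match(pattern, string):
    """True if the whole string (not just a substring) matches the pattern."""
    return re.fullmatch(pattern, string) is not None


for pattern, strings in EXAMPLES.items():
    for s in strings:
        print(pattern, s, full_match(pattern, s))

# A negated class matches ANY single character outside the set,
# so uppercase letters and digits match as well as e through z.
print(full_match(r"[^a-d]", "Z"), full_match(r"[^a-d]", "9"))
```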
Regular expressions • Optional characters: ?, *, and + • ? (0 or 1): colou?r matches color, colour • * (0 or more): oo*h! matches oh!, ooh!, oooh!, …
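The `?` and `*` operators behave the same way in Python's `re` module. This short sketch finds every match of the two example patterns in a piece of text; the sample sentences are made up for illustration.

```python
import re

COLOR = re.compile(r"colou?r")   # "u" is optional: 0 or 1 occurrences
OOH = re.compile(r"oo*h!")       # one "o" followed by 0 or more further "o"s

text = "The color of the sky and the colour of the sea."
print(COLOR.findall(text))

reactions = "oh! ooh! oooh! h!"
print(OOH.findall(reactions))
```

Note that `*` applies only to the character immediately before it, so `oo*h!` still requires at least one "o" and does not match "h!" on its own.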