Introduction & Tokenization Ling570 Shallow Processing Techniques for NLP September 28, 2011
Roadmap • Course Overview • Tokenization • Homework #1
Course Information • Course web page: • http://courses.washington.edu/ling570 • Syllabus: • Schedule and readings • Links to other readings, slides, links to class recordings • Slides posted before class, but may be revised • Catalyst tools: • GoPost discussion board for class issues • CollectIt Dropbox for homework submission and TA comments • Gradebook for viewing all grades
GoPost Discussion Board • Main venue for course-related questions, discussion • What not to post: • Personal, confidential questions; Homework solutions • What to post: • Almost anything else course-related • Can someone explain…? • Is this really supposed to take this long to run? • Key location for class participation • Post questions or answers • Your discussion space: Sanghoun & I will not jump in often
GoPost • Emily’s 5-minute rule: • If you’ve been stuck on a problem for more than 5 minutes, post to the GoPost! • Mechanics: • Please use your UW NetID as your user id • Please post early and often! • Don’t wait until the last minute • Notifications: • Decide how you want to receive GoPost postings
Email • Should be used only for personal or confidential issues • Grading issues, extended absences, other problems • General questions/comments go on GoPost • Please send email from your UW account • Include Ling570 in the subject • If you don’t receive a reply in 24 hours, please follow up
Homework Submission • All homework should be submitted through CollectIt • tar cvf hw1.tar hw1_dir • Homework due 11:45 Wednesdays • Late homework receives 10%/day penalty (incremental) • Most major programming languages accepted • C/C++/C#, Java, Python, Perl, Ruby • If you want to use something else, please check first • Please follow naming, organization guidelines in HW • Expect to spend 10-20 hours/week, including HW docs
Grading • Assignments: 90% • Class participation: 10% • No midterm or final exams • Grades in Catalyst Gradebook • TA feedback returned through CollectIt • Incomplete: only if all work is completed up to the last two weeks • UW policy
Recordings • All classes will be recorded • Links to recordings appear in syllabus • Available to all students, DL and in class • Please remind me to: • Record the meeting (look for the red dot) • Repeat in-class questions • Note: Instructor’s screen is projected in class • Assume that chat window is always public
Contact Info • Gina: Email: levow@uw.edu • Office hour: • Fridays: 10-11 (before Treehouse meeting) • Location: Padelford B-201 • Or by arrangement • Available by Skype or Adobe Connect • All DL students should arrange a short online meeting • TA: Sanghoun Song: Email: sanghoun@uw.edu • Office hour: Time: TBD, see GoPost • Location:
Online Option • Please check you are registered for correct section • CLMS online: Section A • State-funded: Section B • CLMS in-class: Section C • NLT/SCE online (or in-class): Section D • Online attendance for in-class students • Not more than 3 times per term (e.g. missed bus, ice) • Please enter meeting room 5-10 minutes before start of class • Try to stay online throughout class
Online Tip • If you see: • You are not logged into Connect. The problem is one of the following: the permissions on the resource you are trying to access are incorrectly set. Please contact your instructor/Meeting Host/etc. • you do not have a Connect account but need to have one. For UWEO students: • If you have just created your UW NetID or just enrolled in a course • ….. • Clear your cache, close and restart your browser
Course Prerequisites • Programming Languages: • Java/C++/Python/Perl/… • Operating Systems: Basic Unix/Linux • CS 326 (Data structures) or equivalent • Lists, trees, queues, stacks, hash tables, … • Sorting, searching, dynamic programming, … • Automata, regular expressions, … • Stat 391 (Probability and statistics): random variables, conditional probability, Bayes’ rule, …
Textbook • Jurafsky and Martin, Speech and Language Processing: An Introduction to Natural Language Processing, Computational Linguistics, and Speech Recognition, 2nd edition, 2008 • Available from UW Bookstore, Amazon, etc. • Reference: Manning and Schütze, Foundations of Statistical Natural Language Processing
Topics in Ling570 • Unit #1: Formal Languages and Automata (2-3 weeks) • Formal languages • Finite-state Automata • Finite-state Transducers • Morphological analysis • Unit #2: Ngram Language Models and HMMs • Ngram Language Models and Smoothing • Part-of-speech (POS) tagging: • HMM • Ngram
Topics in Ling570 • Unit #3: Classification (2-3 weeks) • Intro to classification • POS tagging with classifiers • Chunking • Named Entity (NE) recognition • Other topics (2 weeks) • Intro, tokenization • Clustering • Information Extraction • Summary
Roadmap • Motivation: • Applications • Language and Thought • Knowledge of Language • Cross-cutting themes • Ambiguity, Evaluation, & Multi-linguality • Course Overview
Motivation: Applications • Applications of Speech and Language Processing • Call routing • Information retrieval • Question-answering • Machine translation • Dialog systems • Spam tagging • Spell-, grammar-checking • Sentiment analysis • Information extraction…
Shallow vs Deep Processing • Shallow processing (Ling 570) • Usually relies on surface forms (e.g., words) • Less elaborate linguistic representations • E.g. Part-of-speech tagging; Morphology; Chunking • Deep processing (Ling 571) • Relies on more elaborate linguistic representations • Deep syntactic analysis (Parsing) • Rich spoken language understanding (NLU)
Shallow or Deep? • Applications of Speech and Language Processing • Call routing • Information retrieval • Question-answering • Machine translation • Dialog systems • Spam tagging • Spell-, grammar-checking • Sentiment analysis • Information extraction…
Language & Intelligence • Turing Test (Turing 1950): Operationalize intelligence • Two contestants: human, computer • Judge: human • Test: Interact via text questions • Question: Can you tell which contestant is human? • Crucially requires language use and understanding
Limitations of Turing Test • ELIZA (Weizenbaum 1966) • Simulates Rogerian therapist • User: You are like my father in some ways • ELIZA: WHAT RESEMBLANCE DO YOU SEE • User: You are not very aggressive • ELIZA: WHAT MAKES YOU THINK I AM NOT AGGRESSIVE... • Passes the Turing Test!! (sort of) • “You can fool some of the people....” • Simple pattern matching technique • Very shallow processing
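The "simple pattern matching" idea can be illustrated with a minimal Python sketch. The two rules and the fallback reply below are hypothetical, written only to mimic the exchange on the slide; they are not Weizenbaum's original script, and the second reply keeps the word "VERY" that the slide's transcript drops.

```python
import re

# Hypothetical rules for illustration only; not Weizenbaum's original script.
# Each rule pairs a surface pattern with a canned response template.
RULES = [
    (re.compile(r"you are like (.+)", re.IGNORECASE),
     "WHAT RESEMBLANCE DO YOU SEE"),
    (re.compile(r"you are (.+)", re.IGNORECASE),
     "WHAT MAKES YOU THINK I AM {0}"),
]


def respond(utterance):
    """Return the response of the first rule whose pattern matches."""
    text = utterance.strip().rstrip(".!?")
    for pattern, template in RULES:
        match = pattern.search(text)
        if match:
            # Echo the captured text back in upper case; no meaning is analyzed.
            return template.format(match.group(1).upper())
    return "PLEASE GO ON"  # assumed fallback when nothing matches


if __name__ == "__main__":
    print(respond("You are like my father in some ways"))
    print(respond("You are not very aggressive"))
```

Rule order matters here: the more specific "you are like" pattern has to be tried before the general "you are" pattern, which is typical of this kind of shallow matching.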
Turing Test Revived • “On the web, no one knows you’re a….” • Problem: ‘bots’ • Automated agents swamp services • Challenge: Prove you’re human • Test: Something human can do, ‘bot can’t • Solution: CAPTCHAs • Distorted images: trivial for human; hard for ‘bot • Key: Perception, not reasoning
Knowledge of Language • What does HAL (of 2001, A Space Odyssey) need to know to converse? • Dave: Open the pod bay doors, HAL. • HAL: I'm sorry, Dave. I'm afraid I can't do that. • Phonetics & Phonology (Ling 450/550) • Sounds of a language, acoustics • Legal sound sequences in words
Knowledge of Language • What does HAL (of 2001, A Space Odyssey) need to know to converse? • Dave: Open the pod bay doors, HAL. • HAL: I'm sorry, Dave. I'm afraid I can't do that. • Morphology (Ling 570) • Recognize, produce variation in word forms • Singular vs. plural: Door + sg -> door; Door + plural -> doors • Verb inflection: Be + 1st person, sg, present -> am
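The lemma-plus-features mappings on this slide can be sketched as a toy generator in Python. The feature labels ("sg", "pl", "1sg.pres") and the function name are assumptions made for this illustration, and the default plural rule deliberately ignores spelling changes; the course's own approach to morphology uses finite-state transducers, which this sketch does not attempt.

```python
# Toy morphological generator: maps (lemma, features) to a surface form.
# Feature labels are hypothetical names chosen for this sketch.

# Irregular forms must be listed explicitly.
IRREGULAR = {
    ("be", "1sg.pres"): "am",
}


def generate(lemma, features):
    """Produce a surface form for a lemma and a feature label."""
    if (lemma, features) in IRREGULAR:
        return IRREGULAR[(lemma, features)]
    if features == "sg":
        return lemma
    if features == "pl":
        # Default rule only; does not handle cases such as fox -> foxes.
        return lemma + "s"
    raise ValueError(f"no rule for {lemma!r} with features {features!r}")


if __name__ == "__main__":
    print(generate("door", "sg"))
    print(generate("door", "pl"))
    print(generate("be", "1sg.pres"))
```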
Knowledge of Language • What does HAL (of 2001, A Space Odyssey) need to know to converse? • Dave: Open the pod bay doors, HAL. • HAL: I'm sorry, Dave. I'm afraid I can't do that. • Part-of-speech tagging (Ling 570) • Identify word use in sentence • Bay (Noun) --- Not verb, adjective
Knowledge of Language • What does HAL (of 2001, A Space Odyssey) need to know to converse? • Dave: Open the pod bay doors, HAL. • HAL: I'm sorry, Dave. I'm afraid I can't do that. • Syntax • (Ling 566: analysis; Ling 570: chunking; Ling 571: parsing) • Order and group words in sentence • I’m I do, sorry that afraid Dave I can’t.
Knowledge of Language • What does HAL (of 2001, A Space Odyssey) need to know to converse? • Dave: Open the pod bay doors, HAL. • HAL: I'm sorry, Dave. I'm afraid I can't do that. • Semantics (Ling 571) • Word meaning: • individual (lexical), combined (compositional) • ‘Open’ : AGENT cause THEME to become open; • ‘pod bay doors’ : (pod bay) doors
Knowledge of Language • What does HAL (of 2001, A Space Odyssey) need to know to converse? • Dave: Open the pod bay doors, HAL. (request) • HAL: I'm sorry, Dave. I'm afraid I can't do that. (statement) • Pragmatics/Discourse/Dialogue (Ling 571, maybe) • Interpret utterances in context • Speech act (request, statement) • Reference resolution: I = HAL; that = ‘open doors’ • Politeness: I’m sorry, I’m afraid I can’t
Cross-cutting Themes • Ambiguity • How can we select among alternative analyses? • Evaluation • How well does this approach perform: • On a standard data set? • When incorporated into a full system? • Multi-linguality • Can we apply this approach to other languages? • How much do we have to modify it to do so?
Ambiguity • “I made her duck” • Means.... • I caused her to duck down • I made the (carved) duck she has • I cooked duck for her • I cooked the duck she owned • I magically turned her into a duck
Ambiguity: POS • “I made her duck” • Means.... • I caused her to duck down • I made the (carved) duck she has • I cooked duck for her • I cooked the duck she owned • I magically turned her into a duck • POS ambiguity: her = Poss or Pron; duck = V or N
Ambiguity: Syntax • “I made her duck” • Means.... • I made the (carved) duck she has • (VP (V made) (NP (POSS her) (N duck))) • I cooked duck for her • (VP (V made) (NP (PRON her)) (NP (N duck)))
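The two bracketings can be written down as nested tuples to make the point that the same word string supports different structures. This is a minimal Python sketch; the tuple encoding and the helper names `leaves` and `tagged_words` are assumptions for illustration, not a parser.

```python
# Each tree is (label, child, child, ...); a leaf child is a plain string.
POSSESSIVE_READING = (
    "VP", ("V", "made"), ("NP", ("POSS", "her"), ("N", "duck")))
DITRANSITIVE_READING = (
    "VP", ("V", "made"), ("NP", ("PRON", "her")), ("NP", ("N", "duck")))


def leaves(tree):
    """Return the words of a tree in left-to-right order."""
    _label, *children = tree
    words = []
    for child in children:
        if isinstance(child, str):
            words.append(child)
        else:
            words.extend(leaves(child))
    return words


def tagged_words(tree):
    """Return (tag, word) pairs from the nodes directly above the words."""
    label, *children = tree
    if len(children) == 1 and isinstance(children[0], str):
        return [(label, children[0])]
    pairs = []
    for child in children:
        pairs.extend(tagged_words(child))
    return pairs


if __name__ == "__main__":
    for tree in (POSSESSIVE_READING, DITRANSITIVE_READING):
        print(leaves(tree), tagged_words(tree))
```

Both trees yield the same three words; they differ in the tag assigned to "her" and in whether the verb takes one noun phrase or two.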
Ambiguity • Pervasive • Pernicious • Particularly challenging for computational systems • Problem we will return to again and again in class
Tokenization • Given input text, split into words or sentences • Tokens: words, numbers, punctuation • Example: • Sherwood said reaction has been "very positive." • Sherwood said reaction has been " very positive . " • Why tokenize?
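The split shown in the example can be approximated with a single regular expression. This is a minimal Python sketch under simple assumptions, not the tokenizer required for the homework: the pattern and the name `tokenize` are illustrative, and the rule breaks on inputs such as abbreviations ("U.S.") and decimals ("3.14"), which it splits at every period.

```python
import re

# A token is either a run of word characters (allowing internal hyphens or
# apostrophes, as in "can't") or a single non-space punctuation character.
TOKEN_PATTERN = re.compile(r"\w+(?:[-']\w+)*|[^\w\s]")


def tokenize(text):
    """Split text into word and punctuation tokens."""
    return TOKEN_PATTERN.findall(text)


if __name__ == "__main__":
    sentence = 'Sherwood said reaction has been "very positive."'
    print(" ".join(tokenize(sentence)))
```

Joining the tokens with spaces gives the space-separated form used on the slide, with each quotation mark and the period as separate tokens.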