560 likes | 728 Views
Ling 570. Introduction and Overview. Roadmap. Course Overview Tokenization Homework #1. Course Overview. Course Information. Course web page: http://courses.washington.edu/ling570 Syllabus: Schedule and readings Links to other readings, slides, links to class recordings
E N D
Ling 570 Introduction and Overview
Roadmap • Course Overview • Tokenization • Homework #1
Course Information • Course web page: • http://courses.washington.edu/ling570 • Syllabus: • Schedule and readings • Links to other readings, slides, links to class recordings • Slides posted before class, but may be revised • Catalyst tools: • GoPost discussion board for class issues • CollectItDropbox for homework submission and TA comments • Gradebook for viewing all grades
GoPost Discussion Board • Main venue for course-related questions, discussion • What not to post: • Personal, confidential questions; Homework solutions • What to post: • Almost anything else course-related • Can someone explain…? • Is this really supposed to take this long to run? • Key location for class participation • Post questions or answers • Your discussion space: Sanghoun & instructors will not jump in often
GoPost • Emily’s 5-minute rule: • If you’ve been stuck on a problem for more than 5 minutes, post to the GoPost! • Mechanics: • Please use your UW NetID as your user id • Please post early and often ! • Don’t wait until the last minute • Notifications: • Decide how you want to receive GoPost postings
Email • Should be used only for personal or confidential issues • Grading issues, extended absences, other problems • General questions/comments go on GoPost • Please send email from your UW account • Include Ling570 in the subject • If you don’t receive a reply in 24 hours, please follow-up
Homework Submission • All homework should be submitted through CollectIt • Tar cvf hw1.tar hw1_dir • Homework due 11:45 Tuesday(unless otherwise noted) • Late homework receives 10%/day penalty (incremental) • Most major programming languages accepted • C/C++/C#, Java, Python, Perl, Ruby • If you want to use something else, please check first • Please follow naming, organization guidelines in HW • Expect to spend 10-20 hours/week, including HW docs
Grading • Homeworks: 50% • Projects: 40% • Class participation: 10% • No midterm or final exams • Grades in Catalyst Gradebook • TA feedback returned through CollectIt • Incomplete: only if all work completed up last two weeks • UW policy
Broadcast • Class will (mostly) be taught in 99/1915 • Broadcast live to CSE 305 • Streamed to remote students (also on syllabus): • http://www.cs.washington.edu/events/webcast/ling570.asx • http://www.cs.washington.edu/events/webcast/ling570.wbv (with ink) • Requires the ConferenceXPWebViewersoftware to view • Other formats have no ink (but students can follow slides by downloading the Powerpoints)
Interacting • Live interaction between CSE 305 and 99/1915 • Remote students: chat channel over Skype • Please send friend request to: • ling570uw • Skype chat session will be started 5-10 mins. prior to class • Will be monitored the whole class
Recordings • All classes will be recorded • Links to recordings appear in syllabus • They’ll be recorded in multiple formats • Available to all students, DL and in class • Multiple formats supported for recordings • Ink may be only available inConference XP WebViewer
Contact Info • Will: Email: wilewis@microsoft.com • Chris: Email: chrisq@microsoft.com • Office hours: • Thursday: 9-11 • Location: Padelford • Or by arrangement (e.g., on MS campus) • TA: Sanghoun Song: Email: sanghoun@uw.edu • Office hour: Tuesday 9-11 • Location: Lewis Annex 110
Online Option • Please check you are registered for correct section • CLMS online: Section A • State-funded: Section B • CLMS in-class: Section C • NLT/SCE online (or in-class): Section D • Online attendance for in-class students • Not more than 3 times per term (e.g. missed bus, ice) • Please enter meeting room 5-10 before start of class • Try to stay online throughout class • Note: if attending on Microsoft Campus • If not a Microsofty, please arrive at 99 before 5:00 for admission to building
Course Prerequisites • Programming Languages: • Java/C++/Python/Perl/.. • Operating Systems: Basic Unix/linux • CS 326 (Data structures) or equivalent • Lists, trees, queues, stacks, hash tables, … • Sorting, searching, dynamic programming,.. • Automata, regular expressions,… • Stat 391 (Probability and statistics): random variables, conditional probability, Bayes’ rule, ….
Textbooks • Jurafsky and Martin, Speech and Language Processing: An Introduction to Natural Language Processing, Computational Linguistics, and Speech Recognition, 2nd edition, 2008 • Available from UW Bookstore, Amazon, etc • Manning and Schuetze, Foundations of Statistical Natural Language Processing
Roadmap • Motivation: • Applications • Language and Thought • Knowledge of Language • Cross-cutting themes • Ambiguity, Evaluation, & Multi-linguality • Course Overview
Motivation: Applications • Applications of Speech and Language Processing • Call routing • Information retrieval • Question-answering • Machine translation • Dialog systems • Spam tagging • Spell- , Grammar- checking • Sentiment Analysis • Information extraction….
Shallow vs Deep Processing • Shallow processing (Ling 570) • Usually relies on surface forms (e.g., words) • Less elaborate linguistic representations • E.g. Part-of-speech tagging; Morphology; Chunking
Shallow vs Deep Processing • Shallow processing (Ling 570) • Usually relies on surface forms (e.g., words) • Less elaborate linguistic representations • E.g. Part-of-speech tagging; Morphology; Chunking • Deep processing (Ling 571) • Relies on more elaborate linguistic representations • Deep syntactic analysis (Parsing) • Rich spoken language understanding (NLU)
Language & Intelligence • Turing Test: (1949) – Operationalize intelligence • Two contestants: human, computer • Judge: human • Test: Interact via text questions • Question: Can you tell which contestant is human? • Crucially requires language use and understanding
Knowledge of Language • What does HAL (of 2001, A Space Odyssey) need to know to converse? • Dave: Open the pod bay doors, HAL. • HAL: I'm sorry, Dave. I'm afraid I can't do that.
Knowledge of Language • What does HAL (of 2001, A Space Odyssey) need to know to converse? • Dave: Open the pod bay doors, HAL. • HAL: I'm sorry, Dave. I'm afraid I can't do that. • Phonetics & Phonology (Ling 450/550) • Sounds of a language, acoustics • Legal sound sequences in words
Knowledge of Language • What does HAL (of 2001, A Space Odyssey) need to know to converse? • Dave: Open the pod bay doors, HAL. • HAL: I'm sorry, Dave. I'm afraid I can't do that. • Morphology (Ling 570) • Recognize, produce variation in word forms • Singular vs. plural: Door + sg: -> door; Door + plural -> doors • Verb inflection: Be + 1st person, sg, present -> am
Knowledge of Language • What does HAL (of 2001, A Space Odyssey) need to know to converse? • Dave: Open the pod bay doors, HAL. • HAL: I'm sorry, Dave. I'm afraid I can't do that. • Part-of-speech tagging (Ling 570) • Identify word use in sentence • Bay (Noun) --- Not verb, adjective
Knowledge of Language • What does HAL (of 2001, A Space Odyssey) need to know to converse? • Dave: Open the pod bay doors, HAL. • HAL: I'm sorry, Dave. I'm afraid I can't do that. • Syntax • (Ling 566: analysis; Ling 570 – chunking; Ling 571- parsing) • Order and group words in sentence • I’m I do , sorry that afraid Dave I can’t.
Knowledge of Language • What does HAL (of 2001, A Space Odyssey) need to know to converse? • Dave: Open the pod bay doors, HAL. • HAL: I'm sorry, Dave. I'm afraid I can't do that. • Semantics (Ling 571) • Word meaning: • individual (lexical), combined (compositional) • ‘Open’ : AGENT cause THEME to become open; • ‘pod bay doors’ : (pod bay) doors
Knowledge of Language • What does HAL (of 2001, A Space Odyssey) need to know to converse? • Dave: Open the pod bay doors, HAL. (request) • HAL: I'm sorry, Dave. I'm afraid I can't do that. (statement) • Pragmatics/Discourse/Dialogue (Ling 571, maybe) • Interpret utterances in context • Speech act (request, statement) • Reference resolution: I = HAL; that = ‘open doors’ • Politeness: I’m sorry, I’m afraid I can’t
Cross-cutting Themes • Ambiguity • How can we select among alternative analyses? • Evaluation • How well does this approach perform: • On a standard data set? • When incorporated into a full system (e.g., E2E)? • Multi-linguality • Can we apply this approach to other languages? • How much do we have to modify it to do so?
Ambiguity • “Flying planes made her duck” • Means.... • Planes flying overhead caused her to duck down
Ambiguity • “Flying planes made her duck” • Means.... • Planes flying overhead caused her to duck down • While flying planes, she ducked
Ambiguity • “Flying planes made her duck” • Means.... • Planes flying overhead caused her to duck down • While flying planes, she ducked • Planes flying overhead made her a (carved) duck
Ambiguity • “Flying planes made her duck” • Means.... • Planes flying overhead caused her to duck down • While flying planes, she ducked • Planes flying overhead made her a (carved) duck • Planes flying overhead cooked a duck for her
Ambiguity • “Flying planes made her duck” • Means.... • Planes flying overhead caused her to duck down • While flying planes, she ducked • Planes flying overhead made her a (carved) duck • Planes flying overhead cooked a duck for her • Planes flying overhead cooked her duck
Ambiguity • “Flying planes made her duck” • Means.... • Planes flying overhead caused her to duck down • While flying planes, she ducked • Planes flying overhead made her a (carved) duck • Planes flying overhead cooked a duck for her • Planes flying overhead cooked her duck V Pron N Poss
Ambiguity: Syntax • “I made her duck” • Means.... • I made the (carved) duck she has • ((VP (V made) (NP (POSS her) (N duck)))
Ambiguity: Syntax • “…made her duck” • Means.... • …made the (carved) duck she has • ((VP (V made) (NP (POSS her) (N duck))) • …cooked duck for her • ((VP (V made) (NP (PRON her)) (NP (N (duck)))
Ambiguity • Pervasive • Pernicious • Particularly challenging for computational systems • Problem we will return to again and again in class
Tokenization • Given input text, split into words or sentences • Tokens: words, numbers, punctuation • Example: • Sherwood said reaction has been “very positive.” • Sherwood said reaction has been “ very positive . ” • Why tokenize?
Tokenization • Given input text, split into words or sentences • Tokens: words, numbers, punctuation • Example: • Sherwood said reaction has been “very positive.” • Sherwood said reaction has been “ very positive . ” • Why tokenize? • Identify basic units for downstream processing
Tokenization • Proposal 1: Split on white space • Good enough
Tokenization • Proposal 1: Split on white space • Good enough? No • Why not?
Tokenization • Proposal 1: Split on white space • Good enough? No • Why not? • Multi-linguality: • Languages without white space delimiters: Chinese, Japanese • Agglutinative languages (Hungarian, Korean) • meggazdagiithatnok • Compounding languages (German) • Lebensversicherungsgesellschaftsangestellter “Life insurance company employee” • Even with English, misses punctuation
Tokenization - again • Proposal 2: Split on white space and punctuation • For English • Good enough?
Tokenization - again • Proposal 2: Split on white space and punctuation • For English • Good enough? No • Problems:
Tokenization - again • Proposal 2: Split on white space and punctuation • For English • Good enough? No • Problems: Non-splitting punctuation • 1.23 1 . 23 X; 1,23 1 , 23 X • don’t don t X; E-mail E mail X, etc
Tokenization - again • Proposal 2: Split on white space and punctuation • For English • Good enough? No • Problems: Non-splitting punctuation • 1.23 1 . 23 X; 1,23 1 , 23 X • don’t don t X; E-mail E mail X, etc • Problems: no-splitting whitespace • Names: New York; Collocations: pick up
Tokenization - again • Proposal 2: Split on white space and punctuation • For English • Good enough? No • Problems: Non-splitting punctuation • 1.23 1 . 23 X; 1,23 1 , 23 X • don’t don t X; E-mail E mail X, etc • Problems: no-splitting whitespace • Names: New York; Collocations: pick up • What’s a word?
Sentence Segmentation • Similar issues • Proposal: Split on ‘.’ • Problems?