CS 124/LINGUIST 180 From Languages to Information. Dan Jurafsky Stanford University Introduction and Course Overview. What this course is about. Automatically extracting meaning and structure from: Natural language text Speech Web pages Social networks (and other networks)
CS 124/LINGUIST 180 From Languages to Information Dan Jurafsky Stanford University Introduction and Course Overview
What this course is about • Automatically extracting meaning and structure from: • Natural language text • Speech • Web pages • Social networks (and other networks) • Genome sequences
Commercial World • Lots of exciting stuff going on…
Information Extraction and Sentiment Analysis • http://www.bing.com/search?q=canon+powershot&go=&form=QBLH&qs=n • Sentiment analysis • Attribute detection • Relation extraction
Sentiment • Emotional Spell Check • New York Times “10 big ideas of 2010” • http://video.nytimes.com/video/2010/12/15/magazine/1248069422438/emotional-spell-check.html?scp=1&sq=emotional%20spell%20check&st=cse
Blog Analytics • Data-mining of blogs, discussion forums, message boards, user groups, and other forms of user generated media • Product marketing information • Political opinion tracking • Social network analysis • Buzz analysis (what’s hot, what topics are people talking about right now).
Livejournal.com: I, me, my on or after Sep 11, 2001 Cohn, Mehl, Pennebaker. 2004. Linguistic markers of psychological change surrounding September 11, 2001. Psychological Science 15, 10: 687-693. Graph from Pennebaker slides
September 11 LiveJournal.com study: We, us, our Cohn, Mehl, Pennebaker. 2004. Linguistic markers of psychological change surrounding September 11, 2001. Psychological Science 15, 10: 687-693. Graph from Pennebaker slides
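The pronoun trends on the two slides above rest on a simple measurement: the share of words in a text that are first-person singular or first-person plural pronouns. The following is only a rough sketch of that kind of count, not the procedure used in the cited study; the function name `pronoun_rate`, the tokenizer, and the sample entry are all assumptions made for illustration.

```python
import re

# Pronoun sets taken from the two slide titles.
FIRST_SINGULAR = {"i", "me", "my"}
FIRST_PLURAL = {"we", "us", "our"}


def pronoun_rate(text, pronouns):
    """Return the fraction of word tokens in `text` that are in `pronouns`."""
    # Crude tokenizer (an assumption): lowercase runs of letters and apostrophes.
    tokens = re.findall(r"[a-z']+", text.lower())
    if not tokens:
        return 0.0
    return sum(1 for t in tokens if t in pronouns) / len(tokens)


# Invented sample entry, not data from the study.
entry = "We held our breath. I called my sister and she told us to stay home."
print("I/me/my rate:  ", pronoun_rate(entry, FIRST_SINGULAR))
print("we/us/our rate:", pronoun_rate(entry, FIRST_PLURAL))
```

Computing these rates per day over a large set of journal entries would give curves of the kind the slides describe.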
Machine Translation • Helping human translators • Fully automatic Enter Source Text: 这 不过 是 一 个 时间 的 问题 . Translation from Stanford’s Phrasal: This is only a matter of time.
Google Translate • Fried ripe plantains: • http://laylita.com/recetas/2008/02/28/platanos-maduros-fritos/
Information Extraction • Email: Subject: curriculum meeting; Date: January 15, 2012; To: Dan Jurafsky; “Hi Dan, we’ve now scheduled the curriculum meeting. It will be in Gates 159 tomorrow from 10:00-11:30. -Chris” • Create new Calendar entry: Event: Curriculum mtg; Date: Jan-16-2012; Start: 10:00am; End: 11:30am; Where: Gates 159
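As a rough illustration of the email-to-calendar example, the sketch below pulls the room and the time range out of the message body with two regular expressions. It is a toy under stated assumptions, not how a real extraction system works: the function name `extract_meeting` and both patterns are hypothetical, and they only cover phrasing like that in this one email. Resolving "tomorrow" to a calendar date is left out.

```python
import re


def extract_meeting(email_body):
    """Extract a location and a start/end time from a short scheduling email."""
    # Assumed pattern: "in <Capitalized word> <room number>", e.g. "in Gates 159".
    where = re.search(r"\bin (?P<where>[A-Z][a-z]+ \d+)", email_body)
    # Assumed pattern: "<h:mm>-<h:mm>", e.g. "10:00-11:30".
    times = re.search(r"(?P<start>\d{1,2}:\d{2})-(?P<end>\d{1,2}:\d{2})", email_body)
    return {
        "where": where.group("where") if where else None,
        "start": times.group("start") if times else None,
        "end": times.group("end") if times else None,
    }


body = ("Hi Dan, we've now scheduled the curriculum meeting. "
        "It will be in Gates 159 tomorrow from 10:00-11:30.")
print(extract_meeting(body))
```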
Computational Biology: Finding Genes • [Figure: gene structure from 5’ to 3’, showing Exon 1, Intron 1, Exon 2, Intron 2, Exon 3, the splice sites, the start codon ATG, and the stop codon TAG/TGA/TAA] • Pictures from Serafim Batzoglou
Computational Biology: Comparing Sequences • Slide material from Serafim Batzoglou • Unaligned sequences:
AGGCTATCACCTGACCTCCAGGCCGATGCCC
TAGCTATCACGACCGCGGTCGATTTGCCCGAC
• Aligned sequences (“-” marks a gap):
-AGGCTATCACCTGACCTCCAGGCCGA--TGCCC---
TAG-CTATCAC--GACCGC--GGTCGATTTGCCCGAC
• Sequence comparison is key to • Finding genes • Determining function • Uncovering the evolutionary processes
Ambiguity • Resolving ambiguity is a crucial goal throughout string and language processing
Ambiguity • Find at least 5 meanings of this sentence: • I made her duck
Ambiguity • Find at least 5 meanings of this sentence: • I made her duck • I cooked waterfowl for her benefit (to eat) • I cooked waterfowl belonging to her • I created the (plaster?) waterfowl she owns • I caused her to quickly lower her head or body • I waved my magic wand and turned her into undifferentiated waterfowl
Ambiguity is Pervasive • I caused her to quickly lower her head or body • Syntactic category: “duck” can be a Noun or Verb • I cooked waterfowl belonging to her. • Syntactic category: “her” can be a possessive (“of her”) or dative (“for her”) pronoun • I made the (plaster) duck statue she owns • Word meaning: “make” can mean “create” or “cook”
Ambiguity is Pervasive • Grammar: “make” can be: • Transitive: (verb has a noun direct object) • I cooked [waterfowl belonging to her] • Ditransitive: (verb has 2 noun objects) • I made [her] (into) [undifferentiated waterfowl] • Action-transitive: (verb has a direct object + verb) • I caused [her] [to move her body]
Ambiguity is Pervasive: Phonetics!!!!! • I mate or duck • I’m eight or duck • Eye maid; her duck • Aye mate, her duck • I maid her duck • I’m aid her duck • I mate her duck • I’m ate her duck • I’m ate or duck • I mate or duck
Why else is natural language understanding difficult? • Non-standard English: Great job @justinbieber! Were SOO PROUD of what youve accomplished! U taught us 2 #neversaynever & you yourself should never give up either♥ • Segmentation issues: the New York-New Haven Railroad • Idioms: dark horse, get cold feet, lose face, throw in the towel • Neologisms: unfriend, Retweet, bromance • World knowledge: Mary and Sue are sisters. Mary and Sue are mothers. • Tricky entity names: Where is A Bug’s Life playing … Let It Be was recorded … … a mutation on the for gene … • But that’s what makes it fun!
Making progress on this problem… • The task is difficult! What tools do we need? • Knowledge about language • Knowledge about the world • A way to combine knowledge sources • How we generally do this: • probabilistic models built from language data • P(“maison” → “house”) high • P(“L’avocat général” → “the general avocado”) low • Luckily, rough text features can often do half the job.
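One simple way to get probabilities like these from language data is relative-frequency estimation: count how often each source word is paired with each target word, then normalize. The sketch below assumes word-aligned pairs are already available, which is itself a hard problem; the function name `estimate_translation_probs` and the toy pairs are invented for illustration and are not real corpus counts.

```python
from collections import Counter, defaultdict


def estimate_translation_probs(aligned_pairs):
    """Estimate P(target | source) by relative frequency over aligned word pairs."""
    counts = defaultdict(Counter)
    for source_word, target_word in aligned_pairs:
        counts[source_word][target_word] += 1

    probs = {}
    for source_word, target_counts in counts.items():
        total = sum(target_counts.values())
        probs[source_word] = {t: c / total for t, c in target_counts.items()}
    return probs


# Invented toy data: (French word, English word) pairs.
toy_pairs = [
    ("maison", "house"), ("maison", "house"), ("maison", "house"),
    ("maison", "home"),
    ("avocat", "lawyer"), ("avocat", "lawyer"),
    ("avocat", "avocado"),
]

probs = estimate_translation_probs(toy_pairs)
print(probs["maison"])
print(probs["avocat"])
```

With more data and context (for example, the neighboring word "général"), a model of this kind can prefer "lawyer" over "avocado" where it matters.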
Models • Finite state machines • Markov models • Alignment models • Genome alignment • Alignment of sentence in L1 to sentence in L2 • Alignment of text to speech • Vector space model of IR • Network models
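To make one of these models concrete, here is a minimal sketch of the vector space idea from information retrieval: each text becomes a vector of word counts, and similarity is the cosine of the angle between two vectors. Raw counts and whitespace tokenization are simplifying assumptions (real systems usually weight terms, for example with tf-idf), and the function name and sample texts are made up for the example.

```python
import math
from collections import Counter


def cosine_similarity(text_a, text_b):
    """Cosine similarity between the raw word-count vectors of two texts."""
    vec_a = Counter(text_a.lower().split())
    vec_b = Counter(text_b.lower().split())

    # Dot product over the words the two texts share.
    dot = sum(vec_a[w] * vec_b[w] for w in vec_a if w in vec_b)
    norm_a = math.sqrt(sum(c * c for c in vec_a.values()))
    norm_b = math.sqrt(sum(c * c for c in vec_b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


query = "fried plantains"
documents = ["fried ripe plantains recipe", "canon powershot camera review"]

# Rank the documents by similarity to the query, most similar first.
ranked = sorted(documents, key=lambda d: cosine_similarity(query, d), reverse=True)
print(ranked)
```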
Dynamic Programming • Don’t do the same work over and over. • Avoid this by building and making use of solutions to sub-problems that must be invariant across all parts of the space. • Minimum Edit Distance • The Viterbi Algorithm • Baum-Welch/Forward-Backward • (In parsing: CKY, Earley, charts, etc)
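Minimum edit distance is a compact example of the idea: the distance between two prefixes is computed once, stored in a table, and reused by every longer prefix. The sketch below assumes unit costs for insertion, deletion, and substitution (plain Levenshtein distance); other cost schemes, such as charging 2 for a substitution, give different numbers.

```python
def min_edit_distance(source, target):
    """Levenshtein distance with unit insert, delete, and substitute costs."""
    n, m = len(source), len(target)
    # dist[i][j] = edit distance between source[:i] and target[:j].
    dist = [[0] * (m + 1) for _ in range(n + 1)]

    for i in range(1, n + 1):
        dist[i][0] = i  # delete all i characters of the source prefix
    for j in range(1, m + 1):
        dist[0][j] = j  # insert all j characters of the target prefix

    for i in range(1, n + 1):
        for j in range(1, m + 1):
            substitution = 0 if source[i - 1] == target[j - 1] else 1
            dist[i][j] = min(
                dist[i - 1][j] + 1,                 # deletion
                dist[i][j - 1] + 1,                 # insertion
                dist[i - 1][j - 1] + substitution,  # match or substitution
            )
    return dist[n][m]


print(min_edit_distance("intention", "execution"))
```

Each table cell is filled once from three neighbors, so the running time is proportional to the product of the two string lengths rather than exponential in them.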
Machine Learning • Machine learning based classifiers that are trained to make decisions based on features extracted from the context • Simple Classifiers: • Naïve Bayes • Decision Trees • Sequence Models: • Hidden Markov Models • Maximum Entropy Markov Models • Conditional Random Fields
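As an illustration of the simplest classifier on this list, the following is a minimal multinomial Naïve Bayes sketch with add-one smoothing, using individual words as the features. The class name, the whitespace tokenizer, the choice to skip unseen words, and the four training reviews are all assumptions made for the example, not course-provided code or data.

```python
import math
from collections import Counter, defaultdict


class NaiveBayes:
    """Multinomial Naive Bayes over word features, with add-one smoothing."""

    def __init__(self):
        self.class_counts = Counter()            # documents seen per class
        self.word_counts = defaultdict(Counter)  # word counts per class
        self.vocab = set()

    def train(self, labeled_docs):
        for text, label in labeled_docs:
            self.class_counts[label] += 1
            for word in text.lower().split():
                self.word_counts[label][word] += 1
                self.vocab.add(word)

    def predict(self, text):
        total_docs = sum(self.class_counts.values())
        best_label, best_score = None, float("-inf")
        for label, n_docs in self.class_counts.items():
            # Log prior plus the sum of smoothed log likelihoods.
            score = math.log(n_docs / total_docs)
            total_words = sum(self.word_counts[label].values())
            for word in text.lower().split():
                if word not in self.vocab:
                    continue  # assumption: ignore words never seen in training
                count = self.word_counts[label][word]
                score += math.log((count + 1) / (total_words + len(self.vocab)))
            if score > best_score:
                best_label, best_score = label, score
        return best_label


# Invented toy training data.
classifier = NaiveBayes()
classifier.train([
    ("great camera love it", "pos"),
    ("excellent zoom great pictures", "pos"),
    ("terrible battery hate it", "neg"),
    ("poor zoom terrible pictures", "neg"),
])
print(classifier.predict("great pictures"))
print(classifier.predict("terrible battery"))
```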
Course logistics in brief • Instructor: Dan Jurafsky • TAs: Leon Lin, Robin Melnick, Evan Rosen, Alden Timme, Adam Vogel • Time: TuTh 9:30-10:45, Braunlec • Requirements: • Online Video Lectures with embedded quizzes • Homeworks: In Java or Python • Online Review Exercises • Final Exam • Class sessions: • Tuesdays: Discussions/Guest Lectures • Thursdays: Open group working hours
Overview of the course • http://cs124.stanford.edu