SIMS 290-2: Applied Natural Language Processing Marti Hearst August 30, 2004
Today • Motivation: SIMS student projects • Course Goals • Why NLP is difficult • How to solve it? Corpus-based statistical approaches • What we’ll do in this course
ANLP Motivation: SIMS Masters Projects • Breaking Story (2002) • Summarize trends in news feeds • Needs categories and entities assigned to all news articles http://dream.sims.berkeley.edu/newshound/ • BriefBank (2002) • System for entering legal briefs • Needs a topic category system for browsing http://briefbank.samuelsonclinic.org/ • Chronkite (2003) • Personalized RSS feeds • Needs categories and entities assigned to all web pages • Paparrazi (2004) • Analysis of blog activity • Needs categories assigned to blog content
Goals of this Course • Learn about the problems and possibilities of natural language analysis: • What are the major issues? • What are the major solutions? • How well do they work? • How do they work? (covered in less depth than in CS 295-4) • At the end you should: • Agree that language is subtle and interesting! • Feel some ownership over the algorithms • Be able to assess NLP problems • Know which solutions to apply when, and how • Be able to read papers in the field
Today • Motivation: SIMS student projects • Course Goals • Why NLP is difficult • How to solve it? Corpus-based statistical approaches • What we’ll do in this course
We’ve passed the year 2001, but we are not close to realizing the dream (or nightmare …)
Dave Bowman: “Open the pod bay doors, HAL” HAL 9000: “I’m sorry, Dave. I’m afraid I can’t do that.”
Why is NLP difficult? • Computers are not brains • There is evidence that much of language understanding is built into the human brain • Computers do not socialize • Much of language is about communicating with people • Key problems: • Representation of meaning • Language presupposes knowledge about the world • Language only reflects the surface of meaning • Language presupposes communication between people
Hidden Structure • English plural pronunciation • Toy + s → toyz ; add z • Book + s → books ; add s • Church + s → churchiz ; add iz • Box + s → boxiz ; add iz • Sheep + s → sheep ; add nothing • What about new words? • Bach + ‘s → Bachs ; why not Bachiz? Adapted from Robert Berwick's 6.863J
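The regular plural rule above can be written down directly. Here is a minimal Python sketch that approximates the phonology with spelling cues; the word lists and endings are illustrative only, since the real rule operates on sounds (which is exactly why "Bach" trips up a spelling-based version).

```python
# Rough sketch of the English plural pronunciation rule, using spelling as a
# stand-in for phonology (illustrative only; a real model needs pronunciations).
MEMORIZED = {"sheep": "sheep"}                    # exceptions are looked up, not computed
SIBILANT_ENDINGS = ("s", "z", "x", "ch", "sh")    # church, box -> add "iz"
VOICELESS_ENDINGS = ("p", "t", "k", "f")          # book -> add "s"

def plural_pronunciation(word):
    if word in MEMORIZED:                 # memorized forms win
        return MEMORIZED[word]
    if word.endswith(SIBILANT_ENDINGS):
        return word + "iz"
    if word.endswith(VOICELESS_ENDINGS):
        return word + "s"
    return word + "z"                     # default voiced ending: toy -> toyz

for w in ["toy", "book", "church", "box", "sheep"]:
    print(w, "->", plural_pronunciation(w))
```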
Language subtleties • Adjective order and placement • A big black dog • A big black scary dog • A big scary dog • A scary big dog • A black big dog • Antonyms • Which sizes go together? • Big and little • Big and small • Large and small • Large and little
World Knowledge is subtle • He arrived at the lecture. • He chuckled at the lecture. • He arrived drunk. • He chuckled drunk. • He chuckled his way through the lecture. • He arrived his way through the lecture. Adapted from Robert Berwick's 6.863J
Words are ambiguous(have multiple meanings) • I know that. • I know that block. • I know that blocks the sun. • I know that block blocks the sun. Adapted from Robert Berwick's 6.863J
Headline Ambiguity • Iraqi Head Seeks Arms • Juvenile Court to Try Shooting Defendant • Teacher Strikes Idle Kids • Kids Make Nutritious Snacks • British Left Waffles on Falkland Islands • Red Tape Holds Up New Bridges • Bush Wins on Budget, but More Lies Ahead • Hospitals are Sued by 7 Foot Doctors Adapted from Robert Berwick's 6.863J
The Role of Memorization • Children learn words quickly • As many as 9 words/day • Often need only one exposure to associate a meaning with a word • Can make mistakes, e.g., overgeneralization: “I goed to the store.” • Exactly how they do this is still under study
The Role of Memorization • Dogs can do word association too! • Rico, a border collie in Germany • Knows the names of each of 100 toys • Can retrieve items called out to him with over 90% accuracy. • Can also learn and remember the names of unfamiliar toys after just one encounter, putting him on a par with a three-year-old child. http://www.nature.com/news/2004/040607/pf/040607-8_pf.html
But there is too much to memorize! • establish • establishment: the establishment of the Church of England as the official state church • disestablishment • antidisestablishment • antidisestablishmentarian • antidisestablishmentarianism: a political philosophy that is opposed to the separation of church and state Adapted from Robert Berwick's 6.863J
Rules and Memorization • Current thinking in psycholinguistics is that we use a combination of rules and memorization • However, this is very controversial • Mechanism: • If there is an applicable rule, apply it • However, if there is a memorized version, that takes precedence (important for irregular words) • Artists paint “still lifes” • Not “still lives” • Past tense of • think → thought • blink → blinked • This is a simplification; for more on this, see Pinker’s “Words and Rules” and “The Language Instinct”.
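As a toy illustration of the "memorized form takes precedence, otherwise apply the rule" mechanism, here is a hedged Python sketch for past tense. The tiny exception list and the spelling rule are made up for the example, not a serious morphological analyzer.

```python
# Memorized irregular forms take precedence; otherwise apply the regular "+ed" rule.
# (Illustrative lexicon only; a real system needs a much larger exception list
# and proper spelling rules for consonant doubling, -y, silent -e, etc.)
IRREGULAR_PAST = {"think": "thought", "go": "went", "sing": "sang"}

def past_tense(verb):
    if verb in IRREGULAR_PAST:      # memorization wins: think -> thought
        return IRREGULAR_PAST[verb]
    if verb.endswith("e"):          # crude regular spelling adjustment
        return verb + "d"
    return verb + "ed"              # default rule: blink -> blinked

print(past_tense("think"))   # thought
print(past_tense("blink"))   # blinked
```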
Representation of Meaning • I know that block blocks the sun. • How do we represent the meanings of “block”? • How do we represent “I know”? • How does that differ from “I know that.”? • Who is “I”? • How do we indicate that we are talking about earth’s sun vs. some other planet’s sun? • When did this take place? What if I move the block? What if I move my viewpoint? How do we represent this?
How to tackle these problems? • The field was stuck for quite some time. • A new approach started around 1990 • Well, not really new, but the first time around, in the 50’s, they didn’t have the text, disk space, or GHz • Main idea: combine memorizing and rules • How to do it: • Get large text collections (corpora) • Compute statistics over the words in those collections • Surprisingly effective • Even better now with the Web
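The "compute statistics over the words in a large collection" step can be surprisingly short in code. A minimal sketch (the file name corpus.txt is a placeholder for whatever collection you use, and the regex tokenizer is deliberately crude):

```python
# Count word frequencies in a plain-text corpus (path is a placeholder).
from collections import Counter
import re

with open("corpus.txt", encoding="utf-8") as f:
    text = f.read().lower()

tokens = re.findall(r"[a-z']+", text)   # very crude tokenization
counts = Counter(tokens)

for word, freq in counts.most_common(10):
    print(f"{word}\t{freq}")
```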
Corpus-based Example: Pre-Nominal Adjective Ordering • Important for translation and generation • Examples: • big fat Greek wedding • fat Greek big wedding • Some approaches try to characterize this with semantic rules, e.g.: • Age < color, value < dimension • Data-intensive approaches • Assume adjective ordering is independent of the noun being modified • Compare how often you see {a, b} vs. {b, a} Keller & Lapata, “The Web as Baseline”, HLT-NAACL’04
Corpus-based Example: Pre-Nominal Adjective Ordering • Data-intensive approaches • Compare how often you see {a, b} vs. {b, a} • What happens when you encounter an unseen pair? • Shaw and Hatzivassiloglou ’99 use transitive closures • Malouf ’00 uses a back-off bigram model • P(<a,b>|{a,b}) vs. P(<b,a>|{a,b}) • He also uses morphological analysis, semantic similarity calculations, and positional probabilities • Keller and Lapata ’04 use just the very simple count-comparison algorithm • But they use the web as their training set • Gets 90% accuracy on 1000 sequences • As good as or better than the complex algorithms Keller & Lapata, “The Web as Baseline”, HLT-NAACL’04
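The very simple algorithm evaluated by Keller and Lapata boils down to comparing counts for the two orders. A sketch of that comparison, with invented counts standing in for the web hit counts they actually used:

```python
# Choose a pre-nominal adjective order by comparing how often each order occurs.
# The counts below are made up for illustration; Keller & Lapata used web counts.
order_counts = {
    ("big", "fat"): 12000,
    ("fat", "big"): 150,
}

def prefer_order(a, b, counts):
    ab = counts.get((a, b), 0)
    ba = counts.get((b, a), 0)
    # With no evidence either way, a real system backs off (e.g., Malouf's
    # back-off bigram model); here we just fall back to alphabetical order.
    if ab == ba:
        return tuple(sorted([a, b]))
    return (a, b) if ab > ba else (b, a)

print(prefer_order("fat", "big", order_counts))   # ('big', 'fat')
```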
Real-World Applications of NLP • Spelling Suggestions/Corrections • Grammar Checking • Synonym Generation • Information Extraction • Text Categorization • Automated Customer Service • Speech Recognition (limited) • Machine Translation • In the (near?) future: • Question Answering • Improving Web Search Engine results • Automated Metadata Assignment • Online Dialogs Adapted from Robert Berwick's 6.863J
NLP in the Real World • Synonym generation for • Suggesting advertising keywords • Suggesting search result refinement and expansion
What We’ll Do in this Course • Read research papers and tutorials • Use NLTK (the Natural Language Toolkit) to try out various algorithms • Some homework assignments will be NLTK exercises • Three mini-projects • Two involve a selected collection • The third is your choice; it can also be on the selected collection
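To give a feel for what the NLTK exercises look like, here is a small sketch. Note that NLTK's API has changed considerably since 2004; this reflects a current release and assumes the commented-out resources have been downloaded once.

```python
# A small taste of NLTK: tokenize a sentence and tag it with parts of speech.
import nltk

# nltk.download("punkt")                        # run once to fetch the tokenizer model
# nltk.download("averaged_perceptron_tagger")   # run once to fetch the POS tagger model

sentence = "I know that block blocks the sun."
tokens = nltk.word_tokenize(sentence)   # ['I', 'know', 'that', 'block', 'blocks', ...]
tagged = nltk.pos_tag(tokens)           # [('I', 'PRP'), ('know', 'VBP'), ...]
print(tagged)
```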
What We’ll Do in this Course • Adopt a large text collection • Use a wide range of NLP techniques to process it • Release the results for others to use
How to analyze a big collection? • Your ideas go here
Python • A terrific language • Interpreted • Object-oriented • Easy to interface to other things (web, DBMS, Tk) • Good stuff from: Java, Lisp, Tcl, Perl • Easy to learn • I learned it this summer by reading Learning Python • FUN! • A tiny, purely illustrative taste of the language follows
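A few readable lines are enough for the kind of quick text experiments this course relies on; the sentence below is just an illustrative input.

```python
# Split, normalize, and count the words of a sentence in a few readable lines.
sentence = "Open the pod bay doors HAL open the doors"
words = [w.lower() for w in sentence.split()]
freq = {w: words.count(w) for w in set(words)}
print(freq)   # e.g. {'open': 2, 'the': 2, 'doors': 2, ...}
```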