
Three kinds of web data that can help computers make better sense of human language

Presentation Transcript


  1. Three kinds of web data that can help computers make better sense of human language. Shane Bergsma, Johns Hopkins University, Fall 2011

  2. Computers that understand language “William Wilkinson’s ‘An Account of the Principalities of Wallachia and Moldavia’ inspired this author’s most famous novel.”

  3. Research Vision. Robust processing of human language requires knowledge beyond what's in small manually-annotated data sets. Derive meaning from real-world data:
     • Raw text on the web
     • Bilingual text (words plus their translations)
       (Part 1: Parsing noun phrases)
     • Visual data (labelled online images)
       (Part 2: Learning the meaning of words)

  4. Part 1: Parsing Noun Phrases (NPs). Google: What pages/ads should be returned for the query "washed baby carrots"?
     • [washed baby] carrots vs. washed [baby carrots]
       (carrots for washed babies vs. baby carrots that are washed)

  5. Training a parser via machine learning: "washed baby carrots" → PARSER (with weights w0) → "[washed baby] carrots" → TESTER → INCORRECT in training data

  6. Training a parser via machine learning: "washed baby carrots" → PARSER (with updated weights w1) → "washed [baby carrots]" → TESTER → CORRECT in gold standard. Training corpus: retired [science teacher], [social science] teacher, female [bus driver], [school bus] driver, zebra [hair straightener], alleged [Canadian lover], …
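
The loop these two slides depict is error-driven: parse with the current weights, compare against the gold standard, and update the weights on mistakes. Here is a minimal structured-perceptron-style sketch, assuming hypothetical parse and extract_features helpers; the talk's actual learning algorithm may differ:

    def train(examples, parse, extract_features, epochs=5):
        """Error-driven training: start from empty weights (w0) and
        nudge toward the gold parse on each mistake."""
        w = {}
        for _ in range(epochs):
            for np_string, gold in examples:
                guess = parse(np_string, w)           # PARSER
                if guess != gold:                     # TESTER: INCORRECT
                    # Reward features of the gold parse ...
                    for f, v in extract_features(np_string, gold).items():
                        w[f] = w.get(f, 0.0) + v
                    # ... and penalize features of the wrong guess.
                    for f, v in extract_features(np_string, guess).items():
                        w[f] = w.get(f, 0.0) - v
        return w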

  7. More data is better data: learning curves for a grammar correction task [Banko & Brill, 2001]

  8. Testing a parser on new data: "washed baby smell" → PARSER (with final weights wN) → "washed [baby smell]" → TESTER → INCORRECT (having seen washed [baby carrots] in training, the parser picks the wrong bracketing here)
     • Big Challenge: For parsing NPs, every word matters. Both parses are grammatical, so we can't generalize from "washed baby carrots" in training to "washed baby smell" at test time.
     • Solution: New sources of data

  9. English Data for Parsing

  10. Task: Parsing NPs with conjunctions
     (1) [dairy and meat] production
     (2) [sustainability] and [meat production]
     Does the implied phrase sound right? Yes: [dairy production] in (1); no: [sustainability production] in (2)
     • Our contributions: new semantic features from raw web text and a new approach to using bilingual data as soft supervision [Bergsma, Yarowsky & Church, ACL 2011]

  11. One Noun Phrase or Two: A Machine Learning Approach
     Input: "dairy and meat production" → features x
     x = (…, first-noun=dairy, …, second-noun=meat, …, first+second-noun=dairy+meat, …)
     h(x) = w∙x (predict one NP if h(x) > 0)
     • Set w via training on annotated training data using some machine learning algorithm (see the sketch below)
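
To make the slide's model concrete, a minimal sketch of the feature map and the linear score h(x) = w∙x; the feature names follow the slide, but the example weight is invented for illustration:

    def extract_features(first_noun, second_noun):
        """Map an NP like "dairy and meat production" to binary features."""
        return {
            "first-noun=" + first_noun: 1.0,
            "second-noun=" + second_noun: 1.0,
            "first+second-noun=" + first_noun + "+" + second_noun: 1.0,
        }

    def h(x, w):
        """Linear score w.x; predict one NP if h(x) > 0."""
        return sum(w.get(name, 0.0) * value for name, value in x.items())

    w = {"first+second-noun=dairy+meat": 2.3}        # hypothetical weight
    x = extract_features("dairy", "meat")
    print("one NP" if h(x, w) > 0 else "two NPs")    # -> one NP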

  12. Leveraging Web-Derived Knowledge
     [dairy and meat] production: if there is only one NP, then it is implicitly talking about "dairy production". Do we see this phrase occurring a lot on the web? [Yes]
     sustainability and [meat production]: if there is only one NP, then it is implicitly talking about "sustainability production". Do we see this phrase occurring a lot on the web? [No]
     • Classifier has features for these counts (sketched below)
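
A sketch of how such counts could enter the classifier, assuming a get_web_count lookup (a stand-in for querying the N-gram data introduced on the next slides) and an illustrative log-count feature:

    import math

    def implicit_phrase_features(first_noun, head, get_web_count):
        """Record how often the implicit phrase (e.g. "dairy
        production") occurs on the web."""
        count = get_web_count(first_noun + " " + head)
        if count > 0:
            return {"implicit-phrase-log-count": math.log(count)}
        return {"implicit-phrase-unseen": 1.0}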

  13. Search Engine Page Counts for NLP
     • Early web work: use an Internet search engine to get web counts [Keller & Lapata, 2003]
     • Problem: using a search engine is just too inefficient to get data on a large scale

  14. Google N-gram Data for NLP
     • Google N-gram Data [Brants & Franz, 2006]: N words in sequence plus their count on the web:
       dairy producers 22724
       dairy production 17704
       dairy professionals 204
       dairy profits 82
       dairy propaganda 15
       dairy protein 1268
     • A compressed version of all the text on the web
     • Enables new features/statistics for a range of tasks [Bergsma et al., ACL 2008, IJCAI 2009, ACL 2010, etc.]
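
The released files are plain text, one n-gram per line with a tab-separated count, as in the excerpt above. A minimal loader for a single (small) shard; the file name below is hypothetical:

    def load_ngram_counts(path):
        """Read one n-gram file into a dict of ngram -> count."""
        counts = {}
        with open(path, encoding="utf-8") as f:
            for line in f:
                ngram, count = line.rsplit("\t", 1)
                counts[ngram] = int(count)
        return counts

    # counts = load_ngram_counts("2gms/2gm-0042")  # hypothetical shard
    # counts.get("dairy production", 0)            # -> 17704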

  15. Features for Explicit Paraphrases
     [Figure: paraphrase patterns over the schematic NP "❶ and ❷ ❸", e.g. ❶=dairy, ❷=meat, ❸=production.] New paraphrases extending ideas in [Nakov & Hearst, 2005]

  16. [Pipeline figure:] Human-Annotated Data (small) provides Training Examples (motor and heating fuels; freedom and security agenda; conservation and good management; …) → Feature Vectors x1, x2, x3, x4, computed from the Raw Data (HUGE), i.e. the Google N-gram Data → Machine Learning → Classifier: h(x)

  17. Using Bilingual Data
     • Bilingual data: a rich source of paraphrases
       "dairy and meat production" → "producción láctea y cárnica"
     • Build a classifier which uses bilingual features
     • Applicable when we know the translation of the NP

  18. Bilingual "Paraphrase" Features [figure: patterns over the schematic NP "❶ and ❷ ❸" and its translation]

  19. Bilingual "Paraphrase" Features [figure: further patterns of the same form]

  20. [Pipeline figure:] Human-Annotated Data (small) provides the same Training Examples (motor and heating fuels; freedom and security agenda; conservation and good management) → Feature Vectors x1, x2, x3, x4, computed from the Translation Data, i.e. the Bilingual Data (medium) → Machine Learning → Classifier: h(xb)

  21. [Figure: co-training setup.] Bitext Examples (coal and steel money; North and South Carolina; rocket and mortar attacks; pollution and transport safety; business and computer science; insurrection and regime change; the environment and air transport; the Bosporus and Dardanelles straits) sit between two views: Training Examples + Features from Google Data → h(xm), and Training Examples + Features from Translation Data → h(xb)

  22. [Figure: co-training, step 1.] The examples labeled most confidently ("coal and steel money", "rocket and mortar attacks") are added to the Training Examples + Features from Translation Data, yielding an updated bilingual classifier h(xb)1; the remaining bitext examples (North and South Carolina; pollution and transport safety; business and computer science; insurrection and regime change; the environment and air transport; the Bosporus and Dardanelles straits) stay in the pool

  23. [Figure: co-training, step 2.] h(xb)1 in turn labels more examples ("business and computer science"; "the environment and air transport"; "the Bosporus and Dardanelles straits"), which are added to the Training Examples + Features from Google Data to train h(xm)1; North and South Carolina, pollution and transport safety, and insurrection and regime change remain in the pool. Co-Training: [Yarowsky '95], [Blum & Mitchell '98]
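
A minimal sketch of the co-training loop these three slides animate: each view labels the bitext examples it is most confident about, and those labeled examples are added to the other view's training set. The fit/predict/confidence classifier interface is a hypothetical stand-in, not the paper's exact procedure:

    def cotrain(h_m, h_b, labeled, unlabeled, rounds=10, k=2):
        """h_m: view over N-gram features x_m; h_b: view over
        translation features x_b."""
        train_m, train_b = list(labeled), list(labeled)
        pool = set(unlabeled)
        for _ in range(rounds):
            h_m.fit(train_m)
            h_b.fit(train_b)
            # Monolingual view labels examples for the bilingual view.
            picks_m = sorted(pool, key=h_m.confidence, reverse=True)[:k]
            train_b += [(x, h_m.predict(x)) for x in picks_m]
            pool -= set(picks_m)
            # Bilingual view labels examples for the monolingual view.
            picks_b = sorted(pool, key=h_b.confidence, reverse=True)[:k]
            train_m += [(x, h_b.predict(x)) for x in picks_b]
            pool -= set(picks_b)
        return h_m, h_b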

  24. [Chart: error rate (%) of the co-trained classifiers h(xb)i and h(xm)i across co-training iterations]

  25. [Chart: error rate (%) on the Penn Treebank (PTB), comparing an unsupervised system, classifiers trained on 800 PTB training examples, and h(xm)N co-trained from just 2 training examples]

  26. Part 1: Conclusion
     • Knowledge from large-scale monolingual corpora is crucial for parsing noun phrases
       • New paraphrase features
       • New way to use bilingual data as soft supervision to guide the use of monolingual features
     • Next steps: use bilingual data even when we don't know the translations to begin with
       • Infer translations jointly with syntax
       • i.e., beyond bitexts (~1B words), make use of huge (1T+ words) N-gram corpora in English, Spanish, French, …

  27. Part 2: Using visual data to learn the meaning of words
     • Large volumes of visual data also reveal word meaning (semantics), but in a language-universal way
     • Humans label their images as they post them online, providing the word-meaning link
     • There are lots of images to work with [from Facebook's Twitter feed]

  28. Part 2: Using visual data to learn the meaning of words
     Progress in the area of "lexical semantics"
     • Task #1: learning translations of words into foreign languages using visual data, e.g. "turtle" in English = "tortuga" in Spanish
     • Main contribution: a totally new approach to building bilingual dictionaries [Bergsma and Van Durme, IJCAI 2011]

  29. [Figure: English Web Images vs. Spanish Web Images for matched words: cockatoo/cacatúa, turtle/tortuga, candle/vela]

  30. Task #1: Bilingual Lexicon Induction
     • Why? Needed for automatic machine translation, cross-language information retrieval, etc.; human-compiled dictionaries/bitexts have poor coverage
     • How to do it with monolingual data only? Link words to information that is preserved across languages (clues to common meaning)

  31. Clues to Common Meaning: Spelling [Koehn & Knight 2002, many others]
     Works for cognates: natural/natural, radón/radon, higiénico/hygienic
     Fails otherwise: vela/candle (true translation, no spelling overlap); *calle/candle (similar spelling, wrong meaning)
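
A common way to quantify the spelling clue is string similarity (e.g. normalized edit distance, as in [Koehn & Knight 2002]). This sketch uses Python's difflib to show both the signal and the failure mode from the slide:

    from difflib import SequenceMatcher

    def spelling_sim(a, b):
        """Similarity in [0, 1] based on matching character blocks."""
        return SequenceMatcher(None, a, b).ratio()

    print(spelling_sim("radon", "radón"))   # high: true cognate
    print(spelling_sim("calle", "candle"))  # fairly high: false friend
    print(spelling_sim("vela", "candle"))   # low: true translation missed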

  32. Clues to Common Meaning: Images [figure: web images for calle, candle, vela]
     • Visual similarities between the candle and vela images: high contrast, black background, glowing flame

  33. Link words by web-based visual similarity
     Step 1: Retrieve online images via Google Image Search (in each language), 20 images for each word
     • Google is competitive with "hand-prepared datasets" [Fergus et al., 2005]

  34. Step 2: Create Image Feature Vectors: color histogram features
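
A minimal color-histogram feature extractor, assuming Pillow and NumPy; the bin count is illustrative, not necessarily the talk's configuration:

    import numpy as np
    from PIL import Image

    def color_histogram(path, bins=8):
        """Concatenated per-channel RGB histograms, L1-normalized."""
        img = np.asarray(Image.open(path).convert("RGB"))
        hists = [np.histogram(img[..., c], bins=bins, range=(0, 256))[0]
                 for c in range(3)]
        v = np.concatenate(hists).astype(float)
        return v / v.sum()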

  35. Step 2: Create Image Feature Vectors: SIFT keypoint features, using David Lowe's software [Lowe, 2004]
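
The talk used Lowe's original SIFT software; for illustration, an equivalent extraction with OpenCV's SIFT implementation:

    import cv2

    def sift_descriptors(path):
        """One 128-dimensional descriptor per detected keypoint."""
        img = cv2.imread(path, cv2.IMREAD_GRAYSCALE)
        sift = cv2.SIFT_create()
        keypoints, descriptors = sift.detectAndCompute(img, None)
        return descriptors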

  36. Step 3: Compute an Aggregate Similarity for Two Words
     [Figure: vector cosine similarities between image pairs (e.g. 0.33, 0.55, 0.19, 0.46); take the best match for each English image, then average over all English images]
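
A sketch of the aggregation, assuming each image is already a NumPy feature vector: cosine similarity between images, then for each English image take its best-matching foreign image and average those best scores over all English images:

    import numpy as np

    def cosine(u, v):
        return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

    def word_similarity(english_vecs, foreign_vecs):
        """Average over English images of the best foreign match."""
        best = [max(cosine(e, f) for f in foreign_vecs) for e in english_vecs]
        return sum(best) / len(best)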

  37. Output: Ranking of Foreign Translations by Aggregate Visual Similarities

  38. Experiments
     • 500-word lists in each language
     • Results on all pairs from German, English, Spanish, French, Italian, Dutch
     • Avg. Top-N Accuracy: how often is the correct answer among the top N most similar words?
     • Lots more details in the paper, including how we determine which words are 'physical objects'
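
A sketch of the Avg. Top-N Accuracy metric: the fraction of source words whose gold translation appears among the N visually most similar candidates (the rankings/gold interface here is hypothetical):

    def top_n_accuracy(rankings, gold, n):
        """rankings: word -> candidate translations sorted by
        aggregate visual similarity; gold: word -> correct translation."""
        hits = sum(1 for w, cands in rankings.items() if gold[w] in cands[:n])
        return hits / len(rankings)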

  39. Average Top-N Accuracy on 14 Language Pairs

  40. Task #2: Lexical Semantics from Images
     • Selectional Preference: Is noun X a plausible object for verb Y? Can you eat "migas"? Can you eat "carillon"? Can you eat "mamey"?
     • [Bergsma and Goebel, RANLP 2011]

  41. Conclusion
     • Robust NLP needs to look beyond human-annotated data to exploit large corpora
     • Size matters:
       • Most parsing systems are trained on 1 million words
       • We use: billions of words in bitexts; trillions of words of monolingual text; hundreds of billions of online images (at 1000 words per picture, that's 100 trillion words!)

  42. Questions + Thanks
     • Gold sponsors
     • Platinum sponsors (collaborators): Kenneth Church (Johns Hopkins), Randy Goebel (Alberta), Dekang Lin (Google), Emily Pitler (Penn), Benjamin Van Durme (Johns Hopkins) and David Yarowsky (Johns Hopkins)
