1 / 77

Towards Google-like Search on Spoken Documents with Zero Resources How to get something from nothing in a language that

Towards Google-like Search on Spoken Documents with Zero Resources How to get something from nothing in a language that you’ve never heard of. Speech/EE Meeting. Language/CS Meeting. Degree of difficulty. Some queries are harder than others More bits? (Long tail: long/infrequent)

marla
Download Presentation

Towards Google-like Search on Spoken Documents with Zero Resources How to get something from nothing in a language that

An Image/Link below is provided (as is) to download presentation Download Policy: Content on the Website is provided to you AS IS for your information and personal use and may not be sold / licensed / shared on other websites without getting consent from its author. Content is provided to you AS IS for your information and personal use only. Download presentation by click this link. While downloading, if for some reason you are not able to download a presentation, the publisher may have deleted the file from their server. During download, if you can't get a presentation, the file might be deleted by the publisher.

E N D

Presentation Transcript


  1. Towards Google-like Search on Spoken Documents with Zero ResourcesHow to get something from nothingin a language that you’ve never heard of Speech/EE Meeting Language/CS Meeting

  2. Degree of difficulty • Some queries are harder than others • More bits? (Long tail: long/infrequent) • Or less bits? (Big fat head: short/frequent) • Query refinement/Coping Mechanisms: • If query doesn’t work • Should you make it longer or shorter? • Solitaire Multi-Player Game: • Readers, Writers, Market Makers, etc. • Good Guys & Bad Guys: Advertisers & Spammers

  3. Spoken Web Search is Easy Compared to Dictation

  4. Entropy of Search Logs-How Big is the Web?- How Hard is Search? - With Personalization? With Backoff? Qiaozhu Mei†, Kenneth Church‡ † University of Illinois at Urbana-Champaign ‡ Microsoft Research

  5. Small How Bigis the Web?5B? 20B? More? Less? • What if a small cache of millions of pages • Could capture much of the value of billions? • Could a Big bet on a cluster in the clouds • Turn into a big liability? • Examples of Big Bets • Computer Centers & Clusters • Capital (Hardware) • Expense (Power) • Dev (Mapreduce, GFS, Big Table, etc.) • Sales & Marketing >> Production & Distribution • Goal: Maximize Sales (Not Inventory)

  6. Millions (Not Billions)

  7. Population Bound • With all the talk about the Long Tail • You’d think that the Web was astronomical • Carl Sagan: Billions and Billions… • Lower Distribution $$  Sell Less of More • But there are limits to this process • NetFlix: 55k movies (not even millions) • Amazon: 8M products • Vanity Searches: Infinite??? • Personal Home Pages << Phone Book < Population • Business Home Pages << Yellow Pages < Population • Millions, not Billions (until market saturates)

  8. It Will Take Decades to Reach Population Bound • Most people (and products) • don’t have a web page (yet) • Currently, I can find famous people • (and academics) • but not my neighbors • There aren’t that many famous people • (and academics)… • Millions, not billions • (for the foreseeable future)

  9. If there is a page on the web, And no one sees it, Did it make a sound? How big is the web? Should we count “silent” pages That don’t make a sound? How many products are there? Do we count “silent” flops That no one buys? Equilibrium: Supply = Demand

  10. Demand Side Accounting • Consumers have limited time • Telephone Usage: 1 hour per line per day • TV: 4 hours per day • Web: ??? hours per day • Suppliers will post as many pages as consumers can consume (and no more) • Size of Web: O(Consumers)

  11. How Big is the Web? • Related questions come up in language • How big is English? • Dictionary Marketing • Education (Testing of Vocabulary Size) • Psychology • Statistics • Linguistics • Two Very Different Answers • Chomsky: language is infinite • Shannon: 1.25 bits per character How many words do people know? What is a word? Person? Know?

  12. Chomskian Argument: Web is Infinite • One could write a malicious spider trap • http://successor.aspx?x=0 http://successor.aspx?x=1http://successor.aspx?x=2 • Not just academic exercise • Web is full of benign examples like • http://calendar.duke.edu/ • Infinitely many months • Each month has a link to the next

  13. How Bigis the Web?5B? 20B? More? Less? • More (Chomsky) • http://successor?x=0 • Less (Shannon) MSN Search Log 1 month  x18 More Practical Answer Comp Ctr ($$$$)  Walk in the Park ($) Millions (not Billions) Cluster in Cloud  Desktop  Flash

  14. Entropy (H) • Size of search space; difficulty of a task • H = 20  1 million items distributed uniformly • Powerful tool for sizing challenges and opportunities • How hard is search? • How much does personalization help?

  15. How Hard Is Search?Millions, not Billions • Traditional Search • H(URL | Query) • 2.8 (= 23.9 – 21.1) • Personalized Search • H(URL | Query, IP) • 1.2 (= 27.2 – 26.0) Personalization cuts H in Half!

  16. How Hard are Query Suggestions?The Wild Thing? C* Rice  Condoleezza Rice • Traditional Suggestions • H(Query) • 21 bits • Personalized • H(Query | IP) • 5 bits (= 26 – 21) Personalization cuts H in Half! Twice

  17. Conclusions: Millions (not Billions) • How Big is the Web? • Upper bound: O(Population) • Not Billions • Not Infinite • Shannon >> Chomsky • How hard is search? • Query Suggestions? • Personalization? • Cluster in Cloud ($$$$)  Walk-in-the-Park ($) • Goal: Maximize Sales (Not Inventory) Entropy is a great hammer

  18. Search Companies have massive resources (logs) • “Unfair” advantage: logs • Logs are the crown jewels • Search companies care so much about data collection that… • Toolbar: business case is all about data collection • The $64B question: • What (if anything) can we do without resources?

  19. Towards Google-like Search on Spoken Documents with Zero ResourcesHow to get something from nothingin a language that you’ve never heard of Speech/EE Meeting Language/CS Meeting

  20. Definitions ? • Towards: • Not there yet • Zero Resources: • No nothing (no knowledge of language/domain) • The next crisis will be where we are least prepared • No training data, no dictionaries, no models, no linguistics • Motivate a New Task: Spoken Term Discovery • Spoken Term Detection (Word Spotting): Standard Task • Find instances of spoken phrase in spoken document • Input: spoken phrase + spoken document • Spoken Term Discovery: Non-Standard task • Input: spoken document (without spoken phrase) • Output: spoken phrases (interesting intervals in document)

  21. The next crisis will be where we are least prepared ?

  22. RECIPE FOR REVOLUTION • 2 Cups of a long-standing leader • 1 Cup of an aggravated population • 1 Cup of police brutality • ½ teaspoon of video • Bake in YouTube for 2-3 months • Flambé with Facebook and sprinkle on some Twitter

  23. Longer is Easier • female • female encyclopedias /iy/ • male • male

  24. What makes an interval of speech interesting? • Cues from text processing: • Long (~ 1 sec such as “University of Pennsylvania”) • Repeated • Bursty (tf * IDF) • tf: lots of repetitions within documents • IDF: with relatively few repetitions across documents • Unique to speech processing: • Given-New: • First mention is articulated more carefully than subsequent • Dialog between two parties (A & B): multi-player game • A: utters an important phrase • B: what? • A: repeats the important phrase

  25. n2 Time & Space(Update: ASRU-2011) • But the constants are attractive • Sparsity • Redesigned algorithms to take advantage of sparsity • Median Filtering • Hough Transform • Line Segment Search • Approx Nearest Neighbors (ASRU-2012)

  26. NLP on Spoken Documents Without Automatic Speech Recognition Mark Dredze, Aren Jansen,Glen Coppersmith, Ken Church EMNLP-2010

  27. At least not for all NLP tasks We Don’t Need Speech Recognition To Process Speech

More Related