
Practical Language Modeling: A really good language model. Everything you know is wrong. Everything you know is right.

Presentation Transcript


    1. 1 Practical Language Modeling: A really good language model. Everything you know is wrong. Everything you know is right. Joshua Goodman, Microsoft Research

    2. 2 A bad language model

    3. 3 A bad language model

    4. 4 A bad language model

    5. 5 A bad language model

    6. 6 Introduction Language models are great fun and an excellent way to try machine learning. They are used for all sorts of things: speech recognition, converting Chinese phonetics to characters, machine translation, spelling correction, etc.

    7. 7 Introduction – continued We’ll concentrate on speech recognition: it’s what I’ve done, it’s the most common use, and it’s the use that has received the most research. Most of what I say here applies to other applications.

    8. 8 Overall Overview: This talk is like a movie. Movie plot: boy meets girl; boy loses girl; boy gets girl; credits/the making of the movie. Talk plot: a really good language model; why it’s useless (everything you learned is wrong); why it isn’t useless (everything you learned is hard); practical tips (the making of a really good language model).

    9. 9 Overview A really good language model: my recent research on how I combined all the techniques you’ve studied into one really good language model, using caching, clustering, smoothing, skipping, sentence mixture models, and 5-grams.

    10. 10 Everything you learned is wrong Speech recognizer and language model interact Speech recognizer mechanics make it very hard to implement these things in products

    11. 11 Everything you learned is right With some tricks, you can actually implement a lot of this stuff in a speech recognizer Need to make compromises, change things

    12. 12 Language model tricks How to make a language model work in practice Especially useful tricks for research – cheating sort of

    13. 13 The best language model ever I set out to build the best language model ever I just put together everything everyone else already did Proves you don’t have to be smart to do publishable research

    14. 14 Language model techniques Clustering Higher order n-grams Smoothing Caching Skipping Sentence Mixture Models Combination of all of them

    15. 15 Clustering: 9 techniques Let x, y, z be words and X, Y, Z be their clusters. P_IBM(z|xy) = P(Z|XY) × P(z|Z) + P(z|xy). P_fullIBM(z|xy) = P(Z|XY) × P(z|XYZ) + P(z|xy). P_predict(z|xy) = P(Z|xy) × P(z|xyZ). P_index(z|xy) = P(z|xXyY). P_indexpredict(z|xXyY) = P(Z|xXyY) × P(z|xXyYZ). P_fullibmcombine(z|xy) = (P(Z|xy) + P(Z|XY)) × (P(z|xyZ) + P(z|XYZ)). Here "+" between models is shorthand for interpolation, with the weights omitted.
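
As a concrete illustration of one of these decompositions, here is a minimal Python sketch of predictive clustering, P_predict(z|xy) = P(Z|xy) × P(z|xyZ). The two component models and the word-to-cluster map are assumed to be trained elsewhere; the names are placeholders, not code from the talk.

```python
def predictive_cluster_prob(z, x, y, cluster_of, p_cluster, p_word_in_cluster):
    """P(z | x y) decomposed as P(Z | x y) * P(z | x y Z), where Z = cluster(z).

    cluster_of: dict mapping each word to its cluster id (built by some word
                clustering algorithm, assumed given here)
    p_cluster(Z, x, y):            smoothed estimate of P(Z | x y)
    p_word_in_cluster(z, x, y, Z): smoothed estimate of P(z | x y Z)
    """
    Z = cluster_of[z]
    return p_cluster(Z, x, y) * p_word_in_cluster(z, x, y, Z)
```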

    16. 16

    17. 17 Higher order n-grams

    18. 18 Caching A unigram cache just uses P(z); a bigram cache uses P(z|y); a conditional bigram cache uses P(z|y) only if y is in the cache.
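
Below is a rough Python sketch of how a unigram cache and a conditional bigram cache might be interpolated with a static trigram. The static model, the interpolation weights, and the class layout are illustrative assumptions, not the talk's implementation.

```python
from collections import Counter, defaultdict

class CacheLM:
    def __init__(self, static_prob, lam_uni=0.1, lam_bi=0.1):
        self.static_prob = static_prob   # placeholder baseline: static_prob(z, x, y)
        self.lam_uni = lam_uni           # weight of the unigram cache P_cache(z)
        self.lam_bi = lam_bi             # weight of the conditional bigram cache P_cache(z | y)
        self.unigrams = Counter()
        self.bigrams = defaultdict(Counter)

    def observe(self, words):
        """Add recently recognized (ideally user-corrected) words to the cache."""
        for w in words:
            self.unigrams[w] += 1
        for prev, cur in zip(words, words[1:]):
            self.bigrams[prev][cur] += 1

    def prob(self, z, x, y):
        # Only trust the bigram cache P_cache(z | y) if y has been seen in the cache.
        lam_uni = self.lam_uni if self.unigrams else 0.0
        lam_bi = self.lam_bi if y in self.bigrams else 0.0
        p = (1.0 - lam_uni - lam_bi) * self.static_prob(z, x, y)
        if lam_uni:
            p += lam_uni * self.unigrams[z] / sum(self.unigrams.values())
        if lam_bi:
            p += lam_bi * self.bigrams[y][z] / sum(self.bigrams[y].values())
        return p
```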

    19. 19 Cache results

    20. 20 Skipping Trigram-like skipping: P(z|uvwxy) ≈ P(z|xy); P(z|uvwxy) ≈ P(z|w_y), e.g. P(shower|celebrate Mary’s baby); P(z|uvwxy) ≈ P(z|wx_). Interpolate all together: P(z|uvwxy) ≈ P(z|xy) + P(z|w_y) + P(z|wx_), with "+" again shorthand for interpolation.
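
A sketch of that interpolation in Python; the three component models and the weights are placeholders, since the talk does not give specific values.

```python
def skipping_prob(z, u, v, w, x, y, p_xy, p_w_y, p_wx, weights=(0.6, 0.2, 0.2)):
    """P(z | u v w x y) approximated by interpolating three trigram-like models.

    u and v appear only to emphasize the full five-word history; these
    trigram-like component models ignore them.
    p_xy(z, x, y)  -- ordinary trigram P(z | x y)
    p_w_y(z, w, y) -- skip the middle word: P(z | w _ y)
    p_wx(z, w, x)  -- skip the nearest word: P(z | w x _)
    """
    l1, l2, l3 = weights
    return l1 * p_xy(z, x, y) + l2 * p_w_y(z, w, y) + l3 * p_wx(z, w, x)
```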

    21. 21 Trigram skipping results

    22. 22 5-gram skipping results

    23. 23 Sentence Mixture Models Lots of different sentence types: Numbers (The Dow rose one hundred seventy three points) Quotations (Officials said “quote we deny all wrong doing ”quote) Mergers (AOL and Time Warner, in an attempt to control the media and the internet, will merge) Model each sentence type separately

    24. 24 Sentence Mixture Models Roll a die to pick sentence type s_k with probability σ_k. Probability of a sentence given s_k: ∏_i P(w_i | w_{i-2} w_{i-1} s_k). Probability of a sentence across types: Σ_k σ_k ∏_i P(w_i | w_{i-2} w_{i-1} s_k).

    25. 25 Sentence Model Smoothing Each topic model is smoothed with overall model. Sentence mixture model is smoothed with overall model (sentence type 0).
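
The Python sketch below computes a sentence's probability under a sentence mixture model, with each topic trigram smoothed by interpolation with the overall model as the two slides above describe. The component models, the σ values, and the 0.5 interpolation weight are placeholder assumptions.

```python
import math

def sentence_mixture_logprob(words, sigma, topic_trigram, global_trigram, lam=0.5):
    """log P(sentence) = log sum_k sigma[k] * prod_i P_k(w_i | w_{i-2} w_{i-1}).

    sigma[k]                  -- prior probability of sentence type k
    topic_trigram(k, z, x, y) -- topic-specific trigram P_k(z | x y) (placeholder)
    global_trigram(z, x, y)   -- overall trigram used to smooth each topic model
    Both component models are assumed smoothed, so they never return zero.
    """
    padded = ["<s>", "<s>"] + list(words)
    per_topic = []
    for k, prior in enumerate(sigma):
        logp = math.log(prior)
        for i in range(2, len(padded)):
            x, y, z = padded[i - 2], padded[i - 1], padded[i]
            p = lam * topic_trigram(k, z, x, y) + (1 - lam) * global_trigram(z, x, y)
            logp += math.log(p)
        per_topic.append(logp)
    m = max(per_topic)                       # log-sum-exp to avoid underflow
    return m + math.log(sum(math.exp(lp - m) for lp in per_topic))
```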

    26. 26 Sentence Clustering Same algorithm as word clustering Assign each sentence to a type, sk Minimize perplexity of P(z|sk ) instead of P(z|Y)

    27. 27 Topic Examples - 0 (Mergers and acquisitions) JOHN BLAIR &ERSAND COMPANY IS CLOSE TO AN AGREEMENT TO SELL ITS T. V. STATION ADVERTISING REPRESENTATION OPERATION AND PROGRAM PRODUCTION UNIT TO AN INVESTOR GROUP LED BY JAMES H. ROSENFIELD ,COMMA A FORMER C. B. S. INCORPORATED EXECUTIVE ,COMMA INDUSTRY SOURCES SAID .PERIOD INDUSTRY SOURCES PUT THE VALUE OF THE PROPOSED ACQUISITION AT MORE THAN ONE HUNDRED MILLION DOLLARS .PERIOD JOHN BLAIR WAS ACQUIRED LAST YEAR BY RELIANCE CAPITAL GROUP INCORPORATED ,COMMA WHICH HAS BEEN DIVESTING ITSELF OF JOHN BLAIR'S MAJOR ASSETS .PERIOD JOHN BLAIR REPRESENTS ABOUT ONE HUNDRED THIRTY LOCAL TELEVISION STATIONS IN THE PLACEMENT OF NATIONAL AND OTHER ADVERTISING .PERIOD MR. ROSENFIELD STEPPED DOWN AS A SENIOR EXECUTIVE VICE PRESIDENT OF C. B. S. BROADCASTING IN DECEMBER NINETEEN EIGHTY FIVE UNDER A C. B. S. EARLY RETIREMENT PROGRAM .PERIOD

    28. 28 Topic Examples - 1 (production, promotions, commas) MR. DION ,COMMA EXPLAINING THE RECENT INCREASE IN THE STOCK PRICE ,COMMA SAID ,COMMA "DOUBLE-QUOTE OBVIOUSLY ,COMMA IT WOULD BE VERY ATTRACTIVE TO OUR COMPANY TO WORK WITH THESE PEOPLE .PERIOD BOTH MR. BRONFMAN AND MR. SIMON WILL REPORT TO DAVID G. SACKS ,COMMA PRESIDENT AND CHIEF OPERATING OFFICER OF SEAGRAM .PERIOD JOHN A. KROL WAS NAMED GROUP VICE PRESIDENT ,COMMA AGRICULTURE PRODUCTS DEPARTMENT ,COMMA OF THIS DIVERSIFIED CHEMICALS COMPANY ,COMMA SUCCEEDING DALE E. WOLF ,COMMA WHO WILL RETIRE MAY FIRST .PERIOD MR. KROL WAS FORMERLY VICE PRESIDENT IN THE AGRICULTURE PRODUCTS DEPARTMENT .PERIOD RAPESEED ,COMMA ALSO KNOWN AS CANOLA ,COMMA IS CANADA'S MAIN OILSEED CROP .PERIOD YALE E. KEY IS A WELL -HYPHEN SERVICE CONCERN .PERIOD

    29. 29 Topic Examples - 2 (Numbers) SOUTH KOREA POSTED A SURPLUS ON ITS CURRENT ACCOUNT OF FOUR HUNDRED NINETEEN MILLION DOLLARS IN FEBRUARY ,COMMA IN CONTRAST TO A DEFICIT OF ONE HUNDRED TWELVE MILLION DOLLARS A YEAR EARLIER ,COMMA THE GOVERNMENT SAID .PERIOD THE CURRENT ACCOUNT COMPRISES TRADE IN GOODS AND SERVICES AND SOME UNILATERAL TRANSFERS .PERIOD COMMERCIAL -HYPHEN VEHICLE SALES IN ITALY ROSE ELEVEN .POINT FOUR %PERCENT IN FEBRUARY FROM A YEAR EARLIER ,COMMA TO EIGHT THOUSAND ,COMMA EIGHT HUNDRED FORTY EIGHT UNITS ,COMMA ACCORDING TO PROVISIONAL FIGURES FROM THE ITALIAN ASSOCIATION OF AUTO MAKERS .PERIOD INDUSTRIAL PRODUCTION IN ITALY DECLINED THREE .POINT FOUR %PERCENT IN JANUARY FROM A YEAR EARLIER ,COMMA THE GOVERNMENT SAID .PERIOD CANADIAN MANUFACTURERS' NEW ORDERS FELL TO TWENTY .POINT EIGHT OH BILLION DOLLARS (LEFT-PAREN CANADIAN )RIGHT-PAREN IN JANUARY ,COMMA DOWN FOUR %PERCENT FROM DECEMBER'S TWENTY ONE .POINT SIX SEVEN BILLION DOLLARS ON A SEASONALLY ADJUSTED BASIS ,COMMA STATISTICS CANADA ,COMMA A FEDERAL AGENCY ,COMMA SAID .PERIOD THE DECREASE FOLLOWED A FOUR .POINT FIVE %PERCENT INCREASE IN DECEMBER .PERIOD

    30. 30 Topic Examples – 3 (quotations) NEITHER MR. ROSENFIELD NOR OFFICIALS OF JOHN BLAIR COULD BE REACHED FOR COMMENT .PERIOD THE AGENCY SAID THERE IS "DOUBLE-QUOTE SOME INDICATION OF AN UPTURN "DOUBLE-QUOTE IN THE RECENT IRREGULAR PATTERN OF SHIPMENTS ,COMMA FOLLOWING THE GENERALLY DOWNWARD TREND RECORDED DURING THE FIRST HALF OF NINETEEN EIGHTY SIX .PERIOD THE COMPANY SAID IT ISN'T AWARE OF ANY TAKEOVER INTEREST .PERIOD THE SALE INCLUDES THE RIGHTS TO GERMAINE MONTEIL IN NORTH AND SOUTH AMERICA AND IN THE FAR EAST ,COMMA AS WELL AS THE WORLDWIDE RIGHTS TO THE DIANE VON FURSTENBERG COSMETICS AND FRAGRANCE LINES AND U. S. DISTRIBUTION RIGHTS TO LANCASTER BEAUTY PRODUCTS .PERIOD BUT THE COMPANY WOULDN'T ELABORATE .PERIOD HEARST CORPORATION WOULDN'T COMMENT ,COMMA AND MR. GOLDSMITH COULDN'T BE REACHED .PERIOD A MERRILL LYNCH SPOKESMAN CALLED THE REVISED QUOTRON AGREEMENT "DOUBLE-QUOTE A PRUDENT MANAGEMENT MOVE --DASH IT GIVES US A LITTLE FLEXIBILITY .PERIOD

    31. 31 Sentence results

    32. 32 Sentence results again

    33. 33 Putting it all together

    34. 34 Combining techniques Example: clustering. P_fullibmcombine(z|xy) = (P(Z|xy) + P(Z|XY)) × (P(z|xyZ) + P(z|XYZ)). Extend it with P_5-gram(z|vwxy): P_fullibmcombine-5gram(z|vwxy) = (P(Z|vwxy) + P(Z|VWXY)) × (P(z|vwxyZ) + P(z|VWXYZ)). Combination is a combination of concepts, not just interpolation.
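
A rough Python reading of the combined formula, treating the slide's "+" as interpolation; every component model, the cluster map, and the 0.5 weights are placeholder assumptions, not the talk's implementation.

```python
def fullibmcombine_5gram(z, history, cluster_of,
                         p_Z_words, p_Z_clusters, p_z_words, p_z_clusters,
                         lam=0.5, mu=0.5):
    """history = (v, w, x, y).  Computes
    (lam*P(Z|vwxy) + (1-lam)*P(Z|VWXY)) * (mu*P(z|vwxyZ) + (1-mu)*P(z|VWXYZ)),
    i.e. the slide's formula with '+' read as interpolation.  The four
    component models are assumed to be trained and smoothed elsewhere."""
    Z = cluster_of[z]
    HC = tuple(cluster_of[h] for h in history)             # (V, W, X, Y)
    p_cluster = lam * p_Z_words(Z, history) + (1 - lam) * p_Z_clusters(Z, HC)
    p_word = mu * p_z_words(z, history, Z) + (1 - mu) * p_z_clusters(z, HC, Z)
    return p_cluster * p_word
```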

    35. 35 Combination experiments Combine a model with skipping, caching, clustering, Kneser-Ney smoothing, 5-grams, and sentence mixtures; then try removing each of these individually.

    36. 36 Everything minus something

    37. 37 Everything together

    38. 38 Conclusion Caching is by far the most useful technique. Kneser-Ney smoothing is key: it always works best. Sentence mixture models have a lot of potential, especially as data sizes increase.

    39. 39 Overview A really good language model Why it’s useless (everything you learned is wrong) Why it isn’t useless (everything you learned is hard) Practical tips (the making of a really good language model)

    40. 40 Why everything you learned is wrong Speech recognizer and language model are tightly integrated Hard to put all these techniques into the speech recognizer in practice We need to examine speech recognizer mechanics to see why

    41. 41 How a speech recognizer works: THE Equation: choose the word sequence maximizing P(words | acoustics), which by Bayes’ rule is the word sequence maximizing P(acoustics | words) × P(words).

    42. 42 How a Speech Recognizer Works In practice, it’s not so simple Acoustic scoring and language model scoring are tightly integrated for thresholding Otherwise, we would need to consider ALL possible word sequences, not just likely ones

    43. 43 Speech recognizer mechanics Keep many hypotheses alive. Find acoustic and language model scores, e.g. P(acoustics | truth) = .3, P(truth | tell the) = .1; P(acoustics | soup) = .2, P(soup | smell the) = .01.

    44. 44 5-grams 5-grams have lower perplexity than trigrams.

    45. 45 Speech recognizer slowdowns The speech recognizer uses tricks (dynamic programming) to merge hypotheses. Trigram: hypotheses can be merged when their last two words match. Fivegram: hypotheses can only be merged when their last four words match, so far fewer hypotheses are merged.

    46. 46 Speech recognizer vs. n-gram The recognizer can threshold out bad hypotheses. A trigram works so much better than a bigram that thresholding improves and there is no slow-down; 4-grams and 5-grams start to become expensive.

    47. 47 Speech recognizer with language model In theory, the score is P(acoustics | words) × P(words). In practice, the language model is a better predictor (acoustic probabilities aren’t “real” probabilities), so the language model probability is raised to a power, the language model weight. In practice, insertions are also penalized with a per-word insertion penalty.
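
In log space this looks roughly like the sketch below; the particular weight and penalty values are typical illustrative numbers, not ones given in the talk.

```python
def hypothesis_score(log_acoustic, word_lm_logprobs, lm_weight=15.0, insertion_penalty=-0.5):
    """Practical combined score for one hypothesis, all terms in log space:
    log P(acoustics | words) + lm_weight * log P(words) + insertion_penalty * #words.
    lm_weight > 1 compensates for acoustic scores not being 'real' probabilities;
    the negative per-word penalty discourages inserting extra short words."""
    return (log_acoustic
            + lm_weight * sum(word_lm_logprobs)
            + insertion_penalty * len(word_lm_logprobs))
```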

    48. 48 Nasty details matter Many nasty details about speech recognizers affect language modeling Most speech recognizers use a phonetic tree Tree representation works badly with some language modeling techniques, especially clustering

    49. 49 Phonetic Tree Representation

    50. 50 Phonetic Tree Representation

    51. 51 Phonetic Tree Representation

    52. 52 Consider clustering methods IBM clustering: let x, y, z be words and X, Y, Z be clusters. P_ibm(z|xy) = λ P(z|xy) + (1-λ) P(Z|XY) × P(z|Z). For example, P_ibm(truth|tell the) = λ P(truth|tell the) + (1-λ) P(STATEMENT|COMMUNICATE ART.) × P(truth|STATEMENT).

    53. 53 Clustered Phonetic Tree Representation ???

    54. 54 Sentence Mixture Model in speech recognizer Keep hypotheses alive for each sentence type. With 5 sentence types, 5 times as much work

    55. 55 Count Cutoffs: Why smoothing doesn’t matter Most dictation systems are trained on billions of words of training data, which would use about a gigabyte; I don’t have a gigabyte. Solution: count cutoffs. Discard n-grams with fewer than k counts, e.g. keep Count(New York City) > 1000, discard Count(New York paper) = 2.
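
A count cutoff is just a filter over the n-gram counts, as in this small sketch (toy counts and an illustrative cutoff, mirroring the slide's example).

```python
from collections import Counter

def apply_count_cutoff(ngram_counts, k):
    """Keep only n-grams observed at least k times."""
    return Counter({ng: c for ng, c in ngram_counts.items() if c >= k})

# Toy example:
counts = Counter({("New", "York", "City"): 1500, ("New", "York", "paper"): 2})
pruned = apply_count_cutoff(counts, k=3)   # keeps ("New", "York", "City"), drops the other
```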

    56. 56 Count Cutoffs Smoothing only matters for small counts. For large C(XY), almost all smoothing algorithms assign nearly P(Y|X) ≈ C(XY)/C(X). So, with large count cutoffs, it doesn’t make much difference which smoothing technique you use.

    57. 57 Caching doesn’t matter either Caching gets huge wins in theory In practice, errors tend to get “locked in” User says: “Recognize speech” System hears: “Wreck a nice beach” User says: “Speech recognition” System hears: “Beach wreck ignition”

    58. 58 Solution in theory: Make users correct mistakes User says: “Recognize speech” System hears: “Wreck a nice beach” User says: “Change beach to speech” User says: “Speech recognition” System hears: “Speech wreck ignition”

    59. 59 Problems: User doesn’t notice mistake User doesn’t feel like correcting mistake now User says “change beach to speech” System hears “change beach to speak” User says “Who do I recognize” User says “Change who to whom” System is confused

    60. 60 Don’t forget size Space is also ignored: 5-grams, most clustering, sentence mixture models, and skipping models all use more space. Example: clustering. P_ibm(z|xy) = λ P(z|xy) + (1-λ) P(Z|XY) × P(z|Z) requires storing two models instead of one.

    61. 61 Why everything you know is wrong Speech recognition mechanics interact badly with many techniques including 5-grams, clustering, and sentence mixture models Cache errors get “locked in” Smoothing doesn’t matter with high count cutoffs In practice, the details are important

    62. 62 Overall Overview A really good language model Why it’s useless (everything you learned is wrong) Why it isn’t useless (everything you learned is hard) Practical tips (the making of a really good language model)

    63. 63 Why everything you learned is right Why everything you learned is right, just not always, and it’s harder than you thought it would be Prof. Ostendorf isn’t cruel Language modeling research isn’t useless I spend my own time working on these problems – I’m just bitter and cynical

    64. 64 Why smoothing is useful For large vocabulary dictation, we have billions of words for training, but most of it is newspaper text or encyclopedias, and not too many people are reporters or encyclopedia authors. We would kill for 1 million words of real data, and we would use all of it, or nearly so – very low count cutoffs. Good smoothing would help.

    65. 65 Why smoothing is useful For anything except dictation, the situation is even worse. Each new application needs its own language model training data: travel, weather, stocks, news, telephone-based email access – each language model is different. This requires painful, expensive transcription. Every piece of data is useful: we can’t afford high count cutoffs or bad smoothing.

    66. 66 Speed solution: Multiple pass decoding Speed is major problem for 5-grams, sentence mixture models, and some forms of clustering We can use a multiple pass approach First pass uses a normal trigram Recognizer outputs best 100 hypotheses These are then rescored

    67. 67 N-best re-recognition User says “Swear to tell the truth” Recognizer outputs “Swerve to smell the soup” (A=.0005, L=.001) “Swear to tell the soup” (A=.0003, L=.0001) “Swear to tell the truth” (A=.0002, L=.0001) “Swerve to smell the truth” (A=.0004, L=.00002) Language model rescores each hypothesis
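
A minimal sketch of that rescoring step, assuming the first pass hands back (words, acoustic log-probability) pairs and the big second-pass model exposes a log-probability function; the language model weight is illustrative.

```python
def rescore_nbest(hypotheses, second_pass_logprob, lm_weight=15.0):
    """Re-rank first-pass hypotheses with a better (slower) language model.

    hypotheses           -- list of (words, acoustic_logprob) pairs from the
                            trigram first pass (e.g. the 100-best list)
    second_pass_logprob  -- function: word list -> log P(words) under the big
                            model (5-grams, clustering, caches, ...)
    """
    best_words, best_score = None, float("-inf")
    for words, acoustic_logprob in hypotheses:
        score = acoustic_logprob + lm_weight * second_pass_logprob(words)
        if score > best_score:
            best_words, best_score = words, score
    return best_words
```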

    68. 68 N-best re-recognition problems What if the right hypothesis is not present? There are exponentially many hypotheses: “Swear to tell the truth about speech recognition” gives swear/swerve; to/2/two/too; tell/smell/bell; truth/soup/tooth; speech recognition/beach wreck ignition.

    69. 69 Lattice rescoring The recognizer outputs a lattice (word alternatives such as Swear/Swerve, to/too/2, smell/tell/spell, the, soup/truth/tooth). The first step is to expand the lattice so that it contains only trigram level dependencies…

    70. 70 Lattice with trigrams (The lattice expanded with trigram contexts, e.g. paths through “Swear to tell” and “Swerve to tell”.)

    71. 71 Lattice/n-best problems All rescoring has to be done at the end of recognition, which leads to “latency”: the time between when the user stops speaking and when he gets a response. Only fast techniques can be used for most apps. The recognizer output can change: “Swerve to smell the soup about beach…” → “Swear to tell the truth about speech…”. The right answer might not be anywhere in the lattice.

    72. 72 Lattice/n-best advantages Great for doing research! Recognizer and language model can be completely separate Very complex models can be used Used by some products

    73. 73 Clusters in practice IBM clustering is the hard way to do clustering: it interacts badly with the phonetic tree. Alternative: use a trigram that backs off to a bigram that backs off to P(Z|Y). By picking the right form for clustering, we can integrate it with the speech recognizer.

    74. 74 Phonetic Tree with backoff

    75. 75 Clustered phonetic tree with backoff

    76. 76 Conclusion: Everything you learned is useful, just hard Everything you learned is useful, but it’s much more work to use in practice than you thought. You need to pay careful attention to the speech recognizer or other application to integrate it.

    77. 77 Overview A really good language model Why it’s useless (everything you learned is wrong) Why it isn’t useless (everything you learned is hard) Practical tips (the making of a really good language model)

    78. 78 Practical tricks for language modeling Cheating experiments: great for research. Parameter optimization: use a general-purpose numerical search method, useful for language modeling and other techniques.

    79. 79 Cheating experiments: great for research Read through all of the test data Determine which counts you need Read through the training data Keep only what you need Saves huge amounts of memory

    80. 80 How to cheat Example: absolute discounting. The test data has P(tell|swear to). Find all examples of “swear to X”; compute C(swear to tell) and C(swear to), as well as how many distinct X follow “swear to”. You don’t need to store anything that doesn’t match one of these.
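
A rough Python sketch of the two passes, with illustrative function and variable names; the talk does not prescribe an implementation.

```python
from collections import Counter, defaultdict

def needed_contexts(test_sentences, n=3):
    """First pass over the *test* data: record every (n-1)-word context that
    will ever be queried, e.g. ("swear", "to") for P(tell | swear to)."""
    contexts = set()
    for sent in test_sentences:
        for i in range(len(sent) - n + 1):
            contexts.add(tuple(sent[i:i + n - 1]))
    return contexts

def filtered_counts(training_sentences, contexts, n=3):
    """Second pass over the training data: keep C(x y z), C(x y), and the set of
    distinct continuations only for contexts the test set actually uses.
    Everything else is discarded, which saves huge amounts of memory."""
    ngram = Counter()
    context_count = Counter()
    continuations = defaultdict(set)
    for sent in training_sentences:
        for i in range(len(sent) - n + 1):
            ctx, word = tuple(sent[i:i + n - 1]), sent[i + n - 1]
            if ctx in contexts:
                ngram[ctx + (word,)] += 1
                context_count[ctx] += 1
                continuations[ctx].add(word)
    return ngram, context_count, continuations
```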

    81. 81 Don’t really cheat! When you are reading the test data, it’s easy to really cheat by accident. Make sure you can run some version without cheating so you can check for same results

    82. 82 How to cheat – sampling counts Some smoothing techniques, like Katz smoothing, or some versions of Kneser-Ney, need the “counts of counts” – how many 1 counts, how many 2 counts, etc. Sample counts (not data): keep C(xyz) only if hash(xyz) mod s = 0 Saves lots of memory, but is more work
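
One way this might look in Python, using a stable hash so the same n-grams are kept on every run; the sampling rate s and the helper names are illustrative assumptions.

```python
import hashlib
from collections import Counter

def kept(ngram, s):
    """Deterministically keep roughly 1/s of all n-gram types."""
    digest = hashlib.md5(" ".join(ngram).encode("utf-8")).digest()
    return int.from_bytes(digest[:4], "big") % s == 0

def sampled_counts_of_counts(training_sentences, n=3, s=100):
    """Estimate the counts-of-counts histogram (how many n-grams occur once,
    twice, ...) from a 1/s sample of n-gram types, scaling back up by s.
    Saves memory (only sampled n-grams are stored) at the cost of hashing
    every n-gram and some sampling noise."""
    counts = Counter()
    for sent in training_sentences:
        for i in range(len(sent) - n + 1):
            ngram = tuple(sent[i:i + n])
            if kept(ngram, s):
                counts[ngram] += 1
    histogram = Counter()
    for c in counts.values():
        histogram[c] += s   # each sampled n-gram stands in for about s n-grams
    return histogram        # histogram[r] ~ number of distinct n-grams with count r
```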

    83. 83 Parameter Optimization Powell search is really useful: a general-purpose numerical search technique that needs no explicit gradients. It can optimize smoothing parameters, interpolation weights, etc. – any continuous variable of which perplexity is a continuous function. Don’t let your heldout set be too big – slow.

    84. 84 Parameter Optimization Anyone doing any research should have Powell’s algorithm (or similar) easily available; you’ll always need to optimize some parameters. It’s much easier than implementing special purpose routines, like EM. Local minima and other problems: all techniques reach local minima.
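
For example, SciPy's general-purpose minimizer exposes Powell's method directly. A sketch of tuning interpolation weights on held-out data might look like this; the held-out data, the component models, and the softmax parameterization are assumptions of the sketch, not the talk's setup.

```python
import math
import numpy as np
from scipy.optimize import minimize

def heldout_perplexity(raw_weights, heldout, component_probs):
    """Perplexity of an interpolated model on held-out (history, word) pairs.

    raw_weights are unconstrained; a softmax keeps the actual interpolation
    weights positive and summing to one, so Powell can search freely.
    component_probs(word, history) returns one probability per component model.
    """
    w = np.exp(raw_weights - np.max(raw_weights))
    w /= w.sum()
    total_logprob = 0.0
    for history, word in heldout:
        total_logprob += math.log(float(np.dot(w, component_probs(word, history))))
    return math.exp(-total_logprob / len(heldout))

# Keep the held-out set small, as the slide warns: every evaluation is a full pass.
# result = minimize(heldout_perplexity, x0=np.zeros(3),
#                   args=(heldout, component_probs), method="Powell")
# weights = np.exp(result.x - result.x.max()); weights /= weights.sum()
```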

    85. 85 Practical tricks conclusion Cheating experiments save lots of memory Powell search is great

    86. 86 Conclusion: A really good language model Combining everything is useful Caches are really useful in theory, less useful in practice Kneser-Ney smoothing, 5-grams, very helpful Sentence mixture models have lots of potential

    87. 87 Conclusion: Everything you know is wrong But everything you know is wrong The mechanics of speech recognition are important – they cannot be neglected when doing language modeling research Many techniques interact badly with recognizer, either with the search and pruning used, or with the phonetic trees

    88. 88 Conclusion: Everything you know is right Do pure research and don’t worry about it Restructure the recognition to use lattices or n-best Modify your technique, like we did with clustering.

    89. 89 Conclusion: Practical Tricks Cheating techniques can make language modeling research much easier. Powell’s algorithm or similar is great for language modeling research, or research in general

    90. 90 Conclusion conclusion Language model research is hard Everything interacts Practical considerations are important Important for making practical impact Important for building a research system
