1. 6/1/2012 1 Practical Language Modeling: A really good language model. Everything you know is wrong. Everything you know is right. Joshua Goodman
Microsoft Research
2. 2 A bad language model
3. 3 A bad language model
4. 4 A bad language model
5. 5 A bad language model
6. 6 Introduction Language models are great fun
An excellent way to try machine learning
Used for all sorts of things
Speech Recognition
Converting Chinese phonetics to characters
Machine translation
Spelling Correction
Etc.
7. 7 Introduction continued We'll concentrate on speech recognition
It's what I've done
It's the most common use
It's the use that has received the most research
Most of what I say here applies to other applications
8. 8 Overall Overview: This talk is like a movie Movie plot:
Boy meets girl
Boy loses girl
Boy gets girl
Credits/The making of the movie
Talk plot
A really good language model
Why it's useless (everything you learned is wrong)
Why it isn't useless (everything you learned is hard)
Practical tips (the making of a really good language model)
9. 9 Overview A really good language model
My recent research
how I combined all the techniques you've studied into one really good language model
Caching, clustering, smoothing, skipping, sentence mixture models, 5-grams
10. 10 Everything you learned is wrong Speech recognizer and language model interact
Speech recognizer mechanics make it very hard to implement these things in products
11. 11 Everything you learned is right With some tricks, you can actually implement a lot of this stuff in a speech recognizer
Need to make compromises, change things
12. 12 Language model tricks How to make a language model work in practice
Especially useful tricks for research: "cheating," sort of
13. 13 The best language model ever I set out to build the best language model ever
I just put together everything everyone else already did
Proves you don't have to be smart to do publishable research
14. 14 Language model techniques Clustering
Higher order n-grams
Smoothing
Caching
Skipping
Sentence Mixture Models
Combination of all of them
15. 15 Clustering: 9 techniques Let x, y, z be words and X, Y, Z be their clusters (interpolation weights omitted for brevity)
PIBM(z|xy) = P(Z|XY) × P(z|Z) + P(z|xy)
PfullIBM(z|xy) = P(Z|XY) × P(z|XYZ) + P(z|xy)
Ppredict(z|xy) = P(Z|xy) × P(z|xyZ)
Pindex(z|xy) = P(z|xXyY)
Pindexpredict(z|xy) = P(Z|xXyY) × P(z|xXyYZ)
Pfullibmcombine(z|xy) = (P(Z|xy) + P(Z|XY)) ×
(P(z|xyZ) + P(z|XYZ))
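These formulas are easier to see in code. Below is a minimal sketch, not from the talk, of the predictive variant Ppredict(z|xy) = P(Z|xy) × P(z|xyZ) estimated from raw counts; the sentence iterator and the cluster_of map are assumptions, and smoothing is omitted.

```python
# Minimal sketch (not from the talk): predictive clustering,
# Ppredict(z|xy) = P(Z|xy) * P(z|xyZ), estimated from raw counts.
# `cluster_of` maps each word to its cluster id; smoothing is omitted.
from collections import defaultdict

def train_predict_cluster(sentences, cluster_of):
    ctx = defaultdict(int)               # count(x, y)
    ctx_cluster = defaultdict(int)       # count(x, y, Z)
    ctx_cluster_word = defaultdict(int)  # count(x, y, Z, z)
    for s in sentences:
        for x, y, z in zip(s, s[1:], s[2:]):
            Z = cluster_of[z]
            ctx[(x, y)] += 1
            ctx_cluster[(x, y, Z)] += 1
            ctx_cluster_word[(x, y, Z, z)] += 1

    def p_predict(z, x, y):
        Z = cluster_of[z]
        p_cluster = ctx_cluster[(x, y, Z)] / max(ctx[(x, y)], 1)
        p_word = ctx_cluster_word[(x, y, Z, z)] / max(ctx_cluster[(x, y, Z)], 1)
        return p_cluster * p_word

    return p_predict
```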
16. 16
17. 17 Higher order n-grams
18. 18 Caching Unigram cache just uses P(z)
Bigram cache uses P(z|y)
Conditional bigram cache uses P(z|y) only if y is in cache
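As a rough illustration (not the recognizer's actual code), a unigram cache interpolated with a base n-gram model might look like this; the interpolation weight and the base_prob interface are assumptions.

```python
# Minimal sketch: interpolate a unigram cache with a base model.
# `base_prob(z, history)` is assumed to be any smoothed n-gram model.
from collections import Counter

class UnigramCache:
    def __init__(self, lam=0.1):
        self.counts = Counter()
        self.total = 0
        self.lam = lam  # cache interpolation weight (would be tuned)

    def observe(self, z):
        # update the cache with words the user has (apparently) said
        self.counts[z] += 1
        self.total += 1

    def prob(self, z, history, base_prob):
        cache_p = self.counts[z] / self.total if self.total else 0.0
        return self.lam * cache_p + (1 - self.lam) * base_prob(z, history)
```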
19. 19 Cache results
20. 20 Skipping Trigram-like skipping
P(z|uvwxy) ≈ P(z|xy)
P(z|uvwxy) ≈ P(z|w_y)
P(shower|celebrate Mary's baby)
P(z|uvwxy) ≈ P(z|wx_)
Interpolate all together (see the sketch below):
P(z|uvwxy) ≈ λ P(z|xy) + μ P(z|w_y) + (1-λ-μ) P(z|wx_)
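A minimal sketch of that interpolation, assuming the three component estimators already exist; the weights here are placeholders that would normally be tuned on held-out data.

```python
# Minimal sketch: interpolated trigram-style skipping.
# p_xy, p_w_y, p_wx_ are assumed smoothed estimators for
# P(z|xy), P(z|w _ y) and P(z|w x _) respectively.
def skip_prob(z, u, v, w, x, y, p_xy, p_w_y, p_wx_, lambdas=(0.6, 0.2, 0.2)):
    l1, l2, l3 = lambdas
    return (l1 * p_xy(z, x, y)        # ordinary trigram context
            + l2 * p_w_y(z, w, y)     # skips x, the word two back
            + l3 * p_wx_(z, w, x))    # skips y, the word right before z
```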
21. 21 Trigram skipping results
22. 22 5-gram skipping results
23. 23 Sentence Mixture Models Lots of different sentence types:
Numbers (The Dow rose one hundred seventy three points)
Quotations (Officials said quote we deny all wrong doing quote)
Mergers (AOL and Time Warner, in an attempt to control the media and the internet, will merge)
Model each sentence type separately
24. 24 Sentence Mixture Models Roll a die to pick sentence type s_k, with probability σ_k
Probability of sentence, given s_k: ∏_i P(w_i | w_{i-2} w_{i-1}, s_k)
Probability of sentence across types: ∑_k σ_k ∏_i P(w_i | w_{i-2} w_{i-1}, s_k)
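A minimal sketch of the mixture computation, assuming one smoothed trigram model per sentence type and a vector of type priors sigma (all names here are illustrative):

```python
# Minimal sketch: sentence probability under a sentence mixture model.
# type_models[k](z, x, y) returns P(z | x y, s_k) for sentence type k;
# sigma[k] is the prior probability of type k.
import math

def mixture_sentence_logprob(words, type_models, sigma):
    total = 0.0
    for k, model in enumerate(type_models):
        logp_k = 0.0
        padded = ["<s>", "<s>"] + words
        for x, y, z in zip(padded, padded[1:], padded[2:]):
            logp_k += math.log(model(z, x, y))
        # fine for a sketch; long sentences would need log-sum-exp
        total += sigma[k] * math.exp(logp_k)
    return math.log(total)
```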
25. 25 Sentence Model Smoothing Each topic model is smoothed with overall model.
Sentence mixture model is smoothed with overall model (sentence type 0).
26. 26 Sentence Clustering Same algorithm as word clustering
Assign each sentence to a type, sk
Minimize perplexity of P(z|sk ) instead of P(z|Y)
27. 27 Topic Examples - 0(Mergers and acquisitions) JOHN BLAIR &ERSAND COMPANY IS CLOSE TO AN AGREEMENT TO SELL ITS T. V. STATION ADVERTISING REPRESENTATION OPERATION AND PROGRAM PRODUCTION UNIT TO AN INVESTOR GROUP LED BY JAMES H. ROSENFIELD ,COMMA A FORMER C. B. S. INCORPORATED EXECUTIVE ,COMMA INDUSTRY SOURCES SAID .PERIOD
INDUSTRY SOURCES PUT THE VALUE OF THE PROPOSED ACQUISITION AT MORE THAN ONE HUNDRED MILLION DOLLARS .PERIOD
JOHN BLAIR WAS ACQUIRED LAST YEAR BY RELIANCE CAPITAL GROUP INCORPORATED ,COMMA WHICH HAS BEEN DIVESTING ITSELF OF JOHN BLAIR'S MAJOR ASSETS .PERIOD
JOHN BLAIR REPRESENTS ABOUT ONE HUNDRED THIRTY LOCAL TELEVISION STATIONS IN THE PLACEMENT OF NATIONAL AND OTHER ADVERTISING .PERIOD
MR. ROSENFIELD STEPPED DOWN AS A SENIOR EXECUTIVE VICE PRESIDENT OF C. B. S. BROADCASTING IN DECEMBER NINETEEN EIGHTY FIVE UNDER A C. B. S. EARLY RETIREMENT PROGRAM .PERIOD
28. 28 Topic Examples - 1(production, promotions, commas) MR. DION ,COMMA EXPLAINING THE RECENT INCREASE IN THE STOCK PRICE ,COMMA SAID ,COMMA "DOUBLE-QUOTE OBVIOUSLY ,COMMA IT WOULD BE VERY ATTRACTIVE TO OUR COMPANY TO WORK WITH THESE PEOPLE .PERIOD
BOTH MR. BRONFMAN AND MR. SIMON WILL REPORT TO DAVID G. SACKS ,COMMA PRESIDENT AND CHIEF OPERATING OFFICER OF SEAGRAM .PERIOD
JOHN A. KROL WAS NAMED GROUP VICE PRESIDENT ,COMMA AGRICULTURE PRODUCTS DEPARTMENT ,COMMA OF THIS DIVERSIFIED CHEMICALS COMPANY ,COMMA SUCCEEDING DALE E. WOLF ,COMMA WHO WILL RETIRE MAY FIRST .PERIOD
MR. KROL WAS FORMERLY VICE PRESIDENT IN THE AGRICULTURE PRODUCTS DEPARTMENT .PERIOD
RAPESEED ,COMMA ALSO KNOWN AS CANOLA ,COMMA IS CANADA'S MAIN OILSEED CROP .PERIOD
YALE E. KEY IS A WELL -HYPHEN SERVICE CONCERN .PERIOD
29. 29 Topic Examples - 2(Numbers) SOUTH KOREA POSTED A SURPLUS ON ITS CURRENT ACCOUNT OF FOUR HUNDRED NINETEEN MILLION DOLLARS IN FEBRUARY ,COMMA IN CONTRAST TO A DEFICIT OF ONE HUNDRED TWELVE MILLION DOLLARS A YEAR EARLIER ,COMMA THE GOVERNMENT SAID .PERIOD
THE CURRENT ACCOUNT COMPRISES TRADE IN GOODS AND SERVICES AND SOME UNILATERAL TRANSFERS .PERIOD
COMMERCIAL -HYPHEN VEHICLE SALES IN ITALY ROSE ELEVEN .POINT FOUR %PERCENT IN FEBRUARY FROM A YEAR EARLIER ,COMMA TO EIGHT THOUSAND ,COMMA EIGHT HUNDRED FORTY EIGHT UNITS ,COMMA ACCORDING TO PROVISIONAL FIGURES FROM THE ITALIAN ASSOCIATION OF AUTO MAKERS .PERIOD
INDUSTRIAL PRODUCTION IN ITALY DECLINED THREE .POINT FOUR %PERCENT IN JANUARY FROM A YEAR EARLIER ,COMMA THE GOVERNMENT SAID .PERIOD
CANADIAN MANUFACTURERS' NEW ORDERS FELL TO TWENTY .POINT EIGHT OH BILLION DOLLARS (LEFT-PAREN CANADIAN )RIGHT-PAREN IN JANUARY ,COMMA DOWN FOUR %PERCENT FROM DECEMBER'S TWENTY ONE .POINT SIX SEVEN BILLION DOLLARS ON A SEASONALLY ADJUSTED BASIS ,COMMA STATISTICS CANADA ,COMMA A FEDERAL AGENCY ,COMMA SAID .PERIOD
THE DECREASE FOLLOWED A FOUR .POINT FIVE %PERCENT INCREASE IN DECEMBER .PERIOD
30. 30 Topic Examples 3(quotations) NEITHER MR. ROSENFIELD NOR OFFICIALS OF JOHN BLAIR COULD BE REACHED FOR COMMENT .PERIOD
THE AGENCY SAID THERE IS "DOUBLE-QUOTE SOME INDICATION OF AN UPTURN "DOUBLE-QUOTE IN THE RECENT IRREGULAR PATTERN OF SHIPMENTS ,COMMA FOLLOWING THE GENERALLY DOWNWARD TREND RECORDED DURING THE FIRST HALF OF NINETEEN EIGHTY SIX .PERIOD
THE COMPANY SAID IT ISN'T AWARE OF ANY TAKEOVER INTEREST .PERIOD
THE SALE INCLUDES THE RIGHTS TO GERMAINE MONTEIL IN NORTH AND SOUTH AMERICA AND IN THE FAR EAST ,COMMA AS WELL AS THE WORLDWIDE RIGHTS TO THE DIANE VON FURSTENBERG COSMETICS AND FRAGRANCE LINES AND U. S. DISTRIBUTION RIGHTS TO LANCASTER BEAUTY PRODUCTS .PERIOD
BUT THE COMPANY WOULDN'T ELABORATE .PERIOD
HEARST CORPORATION WOULDN'T COMMENT ,COMMA AND MR. GOLDSMITH COULDN'T BE REACHED .PERIOD
A MERRILL LYNCH SPOKESMAN CALLED THE REVISED QUOTRON AGREEMENT "DOUBLE-QUOTE A PRUDENT MANAGEMENT MOVE --DASH IT GIVES US A LITTLE FLEXIBILITY .PERIOD
31. 31 Sentence results
32. 32 Sentence results again
33. 33 Putting it all together
34. 34 Combining techniques Example: clustering
Pfullibmcombine(z|xy) = (P(Z|xy) + P(Z|XY)) ×
(P(z|xyZ) + P(z|XYZ))
P5-gram(z|vwxy)
Pfullibmcombine-5gram(z|vwxy) =
(P(Z|vwxy) + P(Z|VWXY)) ×
(P(z|vwxyZ) + P(z|VWXYZ))
Combination is combination of concepts, not just interpolation
35. 35 Combination experiments Combine a model with skipping, caching, clustering, Kneser-Ney smoothing, 5-grams, and sentence mixtures
Try removing these individually
36. 36 Everything minus something
37. 37 Everything together
38. 38 Conclusion Caching is by far the most useful technique
Kneser-Ney smoothing is key
Kneser-Ney smoothing always works best
Sentence mixture models have a lot of potential, especially as data sizes increase
39. 39 Overview A really good language model
Why it's useless (everything you learned is wrong)
Why it isn't useless (everything you learned is hard)
Practical tips (the making of a really good language model)
40. 40 Why everything you learned is wrong Speech recognizer and language model are tightly integrated
Hard to put all these techniques into the speech recognizer in practice
We need to examine speech recognizer mechanics to see why
41. 41 How a speech recognizer works: THE Equation
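The equation itself appears only as an image in this transcript; in its standard noisy-channel form, which is presumably what the slide shows, it is:

W* = argmax_W P(W | acoustics)
   = argmax_W P(acoustics | W) × P(W) / P(acoustics)
   = argmax_W P(acoustics | W) × P(W)

where P(acoustics | W) is the acoustic model and P(W) is the language model.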
42. 42 How a Speech Recognizer Works In practice, it's not so simple
Acoustic scoring and language model scoring are tightly integrated for thresholding
Otherwise, we would need to consider ALL possible word sequences, not just likely ones
43. 43 Speech recognizer mechanics Keep many hypotheses alive
Find acoustic and language model scores
P(acoustics | truth) = .3, P(truth | tell the) = .1
P(acoustics | soup) = .2, P(soup | smell the) = .01
44. 44 5-grams 5-grams have lower perplexity than trigrams
45. 45 Speech recognizer slowdowns Speech recognizer uses tricks (dynamic programming) to merge hypotheses
(Figures: hypothesis merging with a trigram vs. with a five-gram)
46. 46 Speech recognizer vs. n-gram Recognizer can threshold out bad hypotheses
Trigram works so much better than bigram that thresholding improves and there is no slow-down
4-gram, 5-gram start to become expensive
47. 47 Speech recognizer with language model In theory, the acoustic and language model scores combine exactly as in THE Equation
In practice, the language model is a better predictor -- acoustic probabilities aren't real probabilities
In practice, we also penalize insertions (see the score below)
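The in-practice scoring formula is also an image here; recognizers typically maximize a scaled score of roughly this form (exact weights and names vary by system):

W* = argmax_W [ log P(acoustics | W) + λ · log P(W) - μ · |W| ]

where λ is the language model weight and μ is the word insertion penalty that discourages extra insertions.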
48. 48 Nasty details matter Many nasty details about speech recognizers affect language modeling
Most speech recognizers use a phonetic tree
Tree representation works badly with some language modeling techniques, especially clustering
49. 49 Phonetic Tree Representation
50. 50 Phonetic Tree Representation
51. 51 Phonetic Tree Representation
52. 52 Consider clustering methods IBM clustering:
Let x,y,z be words, X,Y,Z be clusters
Pibm(z|xy) =
λ P(z|xy) + (1 - λ) P(Z|XY) × P(z|Z)
Pibm(truth|tell the) =
λ P(truth|tell the) +
(1 - λ) P(STATEMENT|COMMUNICATE ART.) × P(truth|STATEMENT)
53. 53 Clustered Phonetic Tree Representation ???
54. 54 Sentence Mixture Modelin speech recognizer Keep hypotheses alive for each sentence type.
With 5 sentence types, 5 times as much work
55. 55 Count Cutoffs: Why smoothing doesn't matter Most dictation systems are trained on billions of words of training data, which would use about a gigabyte
I don't have a gigabyte
Solution: count cutoffs
Discard n-grams with fewer than k counts (see the sketch below)
Keep: Count(New York City) > 1000
Discard: Count(New York paper) = 2
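Applying a cutoff is a one-liner; this illustrative sketch keeps only trigrams seen more than k times, with everything else handled by backoff at run time.

```python
# Minimal sketch: count cutoff for trigrams (k is illustrative).
def apply_cutoff(trigram_counts, k=2):
    # keep only n-grams seen more than k times; the rest fall back
    # to the lower-order model when the recognizer asks for them
    return {ngram: c for ngram, c in trigram_counts.items() if c > k}
```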
56. 56 Count Cutoffs Smoothing only matters on small counts
For large C(XY), almost all smoothing algorithms assign nearly
P(Y|X) ≈ C(XY)/C(X)
So, with large count cutoffs, it doesn't make much difference which smoothing technique you use
57. 57 Caching doesn't matter either Caching gets huge wins in theory
In practice, errors tend to get locked in
User says: Recognize speech
System hears: Wreck a nice beach
User says: Speech recognition
System hears: Beach wreck ignition
58. 58 Solution in theory: Make users correct mistakes User says: Recognize speech
System hears: Wreck a nice beach
User says: Change beach to speech
User says: Speech recognition
System hears: Speech wreck ignition
59. 59 Problems: User doesn't notice the mistake
User doesn't feel like correcting the mistake now
User says "change beach to speech"
System hears "change beach to speak"
User says "Who do I recognize"
User says "Change who to whom"
System is confused
60. 60 Don't forget size Space is also usually ignored
5-grams, most clustering, sentence mixture models, and skipping models all use more space
Example: clustering
Pibm(z|xy) =
λ P(z|xy) + (1 - λ) P(Z|XY) × P(z|Z)
Need to store two models instead of one
61. 61 Why everything you know is wrong Speech recognition mechanics interact badly with many techniques including 5-grams, clustering, and sentence mixture models
Cache errors get locked in
Smoothing doesn't matter with high count cutoffs
In practice, the details are important
62. 62 Overall Overview A really good language model
Why it's useless (everything you learned is wrong)
Why it isn't useless (everything you learned is hard)
Practical tips (the making of a really good language model)
63. 63 Why everything you learned is right Why everything you learned is right, just not always, and it's harder than you thought it would be
Prof. Ostendorf isn't cruel
Language modeling research isn't useless
I spend my own time working on these problems; I'm just bitter and cynical
64. 64 Why smoothing is useful For large vocabulary dictation, we have billions of words for training
Most of it is newspaper text, or encyclopedias
Not too many people are reporters or encyclopedia authors
We would kill for 1 million words of real data, and we would use all of it, or nearly so: very low count cutoffs. Good smoothing would help.
65. 65 Why smoothing is useful For anything except dictation, situation is even worse.
Each new application needs its own language model training data
Travel, weather, stocks, news, telephone-based email access: each language model is different
Requires painful, expensive transcription
Every piece of data is useful: can't afford high count cutoffs or bad smoothing
66. 66 Speed solution: Multiple pass decoding Speed is a major problem for 5-grams, sentence mixture models, and some forms of clustering
We can use a multiple pass approach
First pass uses a normal trigram
Recognizer outputs best 100 hypotheses
These are then rescored
67. 67 N-best re-recognition User says Swear to tell the truth
Recognizer outputs
Swerve to smell the soup (A=.0005, L=.001)
Swear to tell the soup (A=.0003, L=.0001)
Swear to tell the truth (A=.0002, L=.0001)
Swerve to smell the truth (A=.0004, L=.00002)
Language model rescores each hypothesis
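A minimal sketch of that rescoring step, assuming each hypothesis carries the recognizer's acoustic log-score and `new_lm_logprob` is whatever expensive model we want to apply (weights are placeholders):

```python
# Minimal sketch: pick the best hypothesis from an N-best list
# by combining the acoustic score with a new language model score.
def rescore_nbest(nbest, new_lm_logprob, lm_weight=10.0, word_penalty=0.5):
    def score(hyp):
        words = hyp["words"]
        return (hyp["acoustic_logprob"]
                + lm_weight * new_lm_logprob(words)
                - word_penalty * len(words))
    return max(nbest, key=score)
```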
68. 68 N-best re-recognition problems What if the right hypothesis is not present?
Exponentially many hypotheses:
Swear to tell the truth about speech recognition
Swear/swerve; to/2/two/too; tell/smell/bell; truth/soup/tooth;
speech recognition/beach wreck ignition
69. 69 Lattice rescoring Recognizer outputs a lattice:
(Figure: a word lattice with alternative paths such as swear/swerve, to/too/2, tell/smell/spell, the, truth/soup/tooth)
First step is to expand the lattice so that it contains only trigram-level dependencies
70. 70 Lattice with trigrams
(Figure: the same lattice expanded so that each path carries its two-word history, e.g. "swear to", "swerve to")
71. 71 Lattice/n-best problems All rescoring has to be done at the end of recognition, which leads to latency: the time between when the user stops speaking and when they get a response. Only fast techniques can be used for most apps.
Recognizer output can change:
Swerve to smell the soup about beach
→ Swear to tell the truth about speech
Right answer might not be anywhere in lattice
72. 72 Lattice/n-best advantages Great for doing research!
Recognizer and language model can be completely separate
Very complex models can be used
Used by some products
73. 73 Clusters in practice IBM clustering, done the straightforward (hard) way, interacts badly with the phonetic tree
Alternative: use a trigram that backs off to a bigram that backs off to P(Z|Y)
By picking the right form of clustering, we can integrate it with the speech recognizer (see the sketch below)
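One way to read that suggestion in code: a backoff chain whose lowest level is the clustered bigram P(Z|Y) × P(z|Z) instead of a plain unigram. This is an illustrative sketch under that assumption, with no proper discounting and with made-up backoff weights.

```python
# Minimal sketch (no real discounting): backoff ending in a clustered bigram.
# tri and bi hold already-smoothed probabilities; alpha_* are backoff weights.
def backoff_prob(z, x, y, tri, bi, p_cluster_given_cluster, p_word_given_cluster,
                 cluster_of, alpha_tri=0.4, alpha_bi=0.4):
    if (x, y, z) in tri:
        return tri[(x, y, z)]
    if (y, z) in bi:
        return alpha_tri * bi[(y, z)]
    Y, Z = cluster_of[y], cluster_of[z]
    return (alpha_tri * alpha_bi
            * p_cluster_given_cluster.get((Y, Z), 1e-9)  # P(Z | Y)
            * p_word_given_cluster.get((Z, z), 1e-9))    # P(z | Z)
```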
74. 74 Phonetic Tree with backoff
75. 75 Clustered phonetic tree with backoff
76. 76 Conclusion: Everything you learned is useful, just hard Everything you learned is useful
But it's much more work to use it in practice than you thought
Need to pay careful attention to the speech recognizer or other application to integrate it
77. 77 Overview A really good language model
Why it's useless (everything you learned is wrong)
Why it isn't useless (everything you learned is hard)
Practical tips (the making of a really good language model)
78. 78 Practical tricks for language modeling Cheating experiments
Great for research
Parameter optimization
Use a general-purpose numerical optimization method
Useful for language modeling, other techniques
79. 79 Cheating experiments: great for research Read through all of the test data
Determine which counts you need
Read through the training data
Keep only what you need
Saves huge amounts of memory
80. 80 How to cheat Example: absolute discounting
Test data has P(tell|swear to)
Find all examples of swear to X
Compute C(swear to tell) and C(swear to), as well as how many distinct X appear in "swear to X"
Don't need to store anything that doesn't match one of these
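A minimal sketch of the trick (names are illustrative): first collect the contexts the test set can query, then keep only training counts that match them.

```python
# Minimal sketch: keep only the counts the test data can actually ask for.
from collections import Counter

def needed_contexts(test_sentences):
    needed = set()
    for s in test_sentences:
        padded = ["<s>", "<s>"] + s
        for x, y, _ in zip(padded, padded[1:], padded[2:]):
            needed.add((x, y))   # every trigram context the test set will query
    return needed

def filtered_counts(train_sentences, needed):
    counts = Counter()
    for s in train_sentences:
        padded = ["<s>", "<s>"] + s
        for x, y, z in zip(padded, padded[1:], padded[2:]):
            if (x, y) in needed:   # discard anything the test set never matches
                counts[(x, y, z)] += 1
    return counts
```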
81. 81 Don't really cheat! When you are reading the test data, it's easy to really cheat by accident.
Make sure you can run some version without cheating so you can check that the results are the same
82. 82 How to cheat sampling counts Some smoothing techniques, like Katz smoothing, or some versions of Kneser-Ney, need the counts of counts: how many n-grams occur once, how many occur twice, etc.
Sample counts (not data):
keep C(xyz) only if hash(xyz) mod s = 0
Saves lots of memory, but is more work
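A minimal sketch of sampling counts rather than data; the sampling rate s and the use of Python's built-in hash are illustrative (any stable hash works).

```python
# Minimal sketch: estimate counts-of-counts from a 1/s sample of n-gram types.
from collections import Counter

def sampled_counts_of_counts(ngram_counts, s=100):
    coc = Counter()
    for ngram, c in ngram_counts.items():
        if hash(ngram) % s == 0:   # keep roughly 1/s of the n-gram types
            coc[c] += 1
    # each sampled type stands in for about s types, so scale back up
    return {c: n * s for c, n in coc.items()}
```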
83. 83 Parameter Optimization Powell Search is really useful
A general search technique that needs only function evaluations (no analytic gradients)
Can optimize smoothing parameters, interpolation weights, etc.
Any continuous variable of which perplexity is a continuous function
Don't let your held-out set be too big, or optimization becomes slow
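A minimal sketch of what this looks like with an off-the-shelf implementation; scipy's Powell method is used here for concreteness, and `heldout_perplexity` is assumed to be any function mapping a weight vector to perplexity on a small held-out set.

```python
# Minimal sketch: tune interpolation weights by minimizing held-out perplexity.
from scipy.optimize import minimize

def tune_weights(heldout_perplexity, initial_weights):
    result = minimize(heldout_perplexity, initial_weights,
                      method="Powell")   # derivative-free; can still get stuck in local minima
    return result.x, result.fun
```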
84. 84 Parameter Optimization Anyone doing any research should have Powell's algorithm (or similar) easily available
You'll always need to optimize some parameters
Much easier than implementing special-purpose routines, like EM
Local minima and other problems
All techniques reach local minima
85. 85 Practical tricks conclusion Cheating experiments save lots of memory
Powell search is great
86. 86 Conclusion: A really good language model Combining everything is useful
Caches are really useful in theory, less useful in practice
Kneser-Ney smoothing, 5-grams, very helpful
Sentence mixture models have lots of potential
87. 87 Conclusion: Everything you know is wrong But everything you know is wrong
The mechanics of speech recognition are important; they cannot be neglected when doing language modeling research
Many techniques interact badly with the recognizer, either with the search and pruning used, or with the phonetic trees
88. 88 Conclusion: Everything you know is right Do pure research and don't worry about it
Restructure the recognition to use lattices or n-best
Modify your technique, like we did with clustering.
89. 89 Conclusion: Practical Tricks Cheating techniques can make language modeling research much easier.
Powell's algorithm (or similar) is great for language modeling research, or research in general
90. 90 Conclusion conclusion Language model research is hard
Everything interacts
Practical considerations are important
Important for making practical impact
Important for building a research system