1 / 48

Hierarchical Bayesian Language Models based on Pitman-Yor Processes

This paper introduces a language model utilizing the Pitman-Yor Process, a Hierarchical Nonparametric Bayesian model that updates latent parameters through Bayesian inference. The model addresses overfitting issues of N-gram models. Learn the Bayesian Philosophy, Inference, Dirichlet Process, Chinese Restaurant Process, and more!

ccue
Download Presentation

Hierarchical Bayesian Language Models based on Pitman-Yor Processes

An Image/Link below is provided (as is) to download presentation Download Policy: Content on the Website is provided to you AS IS for your information and personal use and may not be sold / licensed / shared on other websites without getting consent from its author. Content is provided to you AS IS for your information and personal use only. Download presentation by click this link. While downloading, if for some reason you are not able to download a presentation, the publisher may have deleted the file from their server. During download, if you can't get a presentation, the file might be deleted by the publisher.

E N D

Presentation Transcript


  1. A Hierarchical Bayesian Language Model basedon Pitman-YorProcesses A Rough Translation by Davis Cho, Jianxiong Wang, John Yuan, and NellyLyu

  2. Overview The main goal of this paper is to introduce a language model via the Pitman-Yor Process. This is a Hierarchical Nonparametric bayesian model , which utilizes Bayesian inference to update latent parameters. The Pitman Yor Process is simply a nonparametric extension of the Dirichlet process. The benefits of this language model is that is avoids the overfitting problems of N-gram models. However it is important to note that the Inference process for such process is quite complex and requires a Markov Chain Monte CarloSequence.

  3. Overview/Motivation • Previously in class, we discussed about certain ways to smoothen the distribution of the data points using different smoothing techniques suchas • Laplace: Assume all have been seenonce • Jelinek-Mercer aka Interpolation: Mixing higher order n-gram estimates with lower • Absolute Discount: Subtract discount fromfrequency • Kneser-Ney: Account for the number of previous contextwords • But Teh proposes a better way to dothings!

  4. Agenda Bayesian Philosophy/ BayesianInference Bayesian Parametric vs Bayesian NonParametric Bayesian HierarchicalModeling DirichletProcess Pitman YorProcess Chinese RestaurantProcess Hierarchical Chinese Restaurant forNLP Inference (GibbsSampling) ExperimentalResults

  5. PresentationTimings • Bayesian Overview- 20 minutes(Nelly) • BayesianExample • Prior and Posterior example (John Cho used thescale • ParametricExample • Non ParametricExample 2. Dirichlet 20 minutes (Davis?) Hierarchical DirichletProcess PitmanYorr HypotheticalBreak Chinese Restaurant Example 25 minutes (John) Why is it called ChineseRestaurant Rich get richer, preferentialattachment. Inference Gibbs Sampling (20 minutes)(Jianxiong)

  6. What is BayesianPhilosophy? • A Bayesian approach to a problem starts with the formulation of a model that we hope is adequate to describe the situation ofinterest. • We then formulated a prior distribution over the unknown parameters of the model, which is meant to capture our beliefs about the situation before seeing thedata. • After observing some data, we apply Bayes' Rule to obtain a posterior distribution for these unknowns, which takes account of both the prior and thedata.

  7. What does “Bayesian inference” evenmean? Bayesian inference = Guessing in the style ofBayes Bayesian inference is a method of statistical inferencein which Bayes' theoremis used to update the probability for a hypothesis as more evidenceor informationbecomesavailable.

  8. Steps of Bayesian Inference Identify the observed data you are workingwith. Construct a probabilistic model to represent the data(likelihood). Specify prior distributions over the parameters of your probabilistic model(prior). Collect data and apply Bayes’ rule to re-allocate credibility across thepossible parameter values(posterior).

  9. Bayesian Coin FlipsExample

  10. Prior

  11. Putting ittogether Posterior ∝Likelihood ×Prior

  12. BayesianParametric • Parametric models assume some fixed finite set of parameters Θ. Given the parameters, future predictions, x, are independent of the observed data,D: Therefore Θ capture everything there is to know about thedata.

  13. Let’s see a parametricexample E.g. Given some measurement points about mosquitoes in Asia,how many species arethere? K =4? K =5? K =6 K=8?

  14. What is Non ParametricMethods? • Non Parametric: modeling with an unbounded number ofparameters. • Why useful for NLP?

  15. Bayesian nonparametricexample MusicGenre

  16. N Gram ModelReview Let's take 3 Gram model asexample How do we find the probabilities for each context Goal is to predict next words givencontext.

  17. Hierarchical Bayes, Dirichlet, and Pitman-YorProcesses

  18. Hierarchical BayesModel Bayesian hierarchical modelling is a statistical modelwritten in multiple levels (hierarchical form) that estimates the parametersof the posterior distributionusing the Bayesian method(aka Bayesian Inference. More on thislater!) Very very simple view for now. More on this later aswell! P(w2) P(w1|w2) P(w2 |w2) P(w |w1,w2) P(w | w2,w2)

  19. DirichletProcess • Dirichlet Process - Essentially a “sample” distribution obtained from an original distribution such that the expected value of the sample distribution is the same as the expected value of the originaldistribution • The original distribution can be continuous or discrete, while the new distribution will be discrete • Let: be the original distribution we want to samplefrom • 𝞪 the “concentration” of the samplestaken • G the resultingdistribution • Often denoted as: G~DP(𝞪, )

  20. DirichletProcess • Let’s see anexample: • Let be a Normal Distribution with mean μ =0, and variance σ^2 =1, denotedas • With 𝞪 =1, G ~DP(𝞪 , N(0, 1)) would look somethinglike: =N(0,1) • Notice that the resulting distributions only slightly resemble the normal distribution.Why?

  21. Dirichlet Process 𝞪Value • 𝞪 is inversely proportional to how “concentrated” the datais • Higher 𝞪 is less concentrated while lower 𝞪is moreconcentrated • Why is this thecase? • Dirichlet Process draws sample from the original distribution according to the following criterion. In this context, H is the originaldistribution. • ○

  22. How 𝞪 Affects the NewDistribution • Equationa) • If alpha is low, we are less likely to draw a new value from H as the number of draws increases • If alpha is high, we are more likely to draw a new value fromH. • Equationb) • If alpha is low, we are more likely to pick a previously seenvalue • If alpha is high, we are less likely to pick a previously seenvalue • Thus, variance is higher when we have a higher alpha level because we pick more new valuesfrom • H.

  23. 𝞪Examples • Draws from the Dirichlet ProcessDP(N(0,1), • 𝞪) • The four rows use different 𝞪 (top to bottom: 1, 10, 100 and1000) • Each row contains three repetitions of the sameexperiment. • As seen from the graphs, draws from a Dirichlet process are discrete distributions and they become less concentrated (more spread out) with increasing𝞪 • The probabilities of each value in the x-axis decrease drastically as 𝞪 increasestoo!

  24. Problems? • As previously discussed in other papers before, we don’t want the probabilities of a word appearing to0. • This is why we usesmoothing! • However, if we use Dirichlet, many values are left unseen, especially when 𝞪 is low • This is why we need some control over the tail-end behavior of thedistribution • that we obtain from the originaldistribution. • We want to see more uniquevalues!

  25. Pitman-YorProcess! • G ~PYP(𝞪,d, ) • d is a discountfactor • t is the number of different values drawn sofar • First Eq. increases probability of doing a newdraw • Second Eq. decreases the probability of assigning a draw to a previousvalue • We now have bettercontrol! Draw new xn fromH Set xn=x

  26. Improvement! X-axis: number of words drawn Y-axis: number of uniquewords Left: d =.5 and 𝞪 =1 (bottom), 10 (middle) and 100 (top) Right: same, with 𝞪 =10 and d =0 (bottom), .5 (middle) and .9(top) X-axis: number of wordsdrawn Y-axis: proportion of words appearing only once Left: d =.5 and 𝞪 =1 (bottom), 10 (middle), 100(top). Right: 𝞪 =10 and d =0 (bottom), .5 (middle) and .9 (top).

  27. Chinese RestaurantProcess What is Chinese Restaurant Process Probabilities of Joining ATable Power LawBehavior Parameters, Pitman Yor vs Dirichlet Hierarchical Chinese Restaurant Process-NLP

  28. Chinese Restaurant Process- Analogy toDirichlet ➔ Chinese restaurant with an infinite number of circulartables ➔ Each Table has InfiniteCapacity ➔ N customers enter the restaurant join an existing table or a newone. ➔ At Time t, there are M<=NTables ➔ Process result isExchangeable

  29. Probability of Joining aTable Prob to Join a NewTable Where: θ,d are Parameters C is the number of customers Ck is # of customers at tablek T is the number of DistinctTables Prob to Join an ExistingTable

  30. Probability of Joining atable To Join a NewTable (Param θ) +(# Distinct Tables)*(Param d) (Param θ) +(Total number ofCustomers ) Notice Something Interesting about theseformulas? To Join a Current Table (For eachtable) (Number of Customers at specific table) - (Param d) (Param θ)+(Total number of Customers)

  31. PowerLaw Only 2 variables areadjustable 1. Number of Customers at Table k- Rich getRicher If you increase this, you increase the chance of future customersjoining 2. DistinctTables If you increase this, you increase the chance that future customer will create a newtable.

  32. Example RestaurantDemo Let’s say 6 Friends go and grab lunch, however in this restaurant each table only serves a different type ofCuisine. The table choices are Stir-fry, Steak, Salad, Soup, Sandwiches. (S1-S5) Let’s Go through two restaurantsexamples: For Simplicity let d=0.5 and θ=10

  33. Parameter d, why Pitman Yor is Better thanDirichlet

  34. Why do we not want the # of individual words to drop to15%?

  35. Hierarchical Chinese Restaurant forNLP Now Let’s go back to our Chinese Restaurant and try to run our algorithm recursively in a NLPcontext. Remember our earlier two variableprobabilities: To Predict a New Word givencontext (Param θ) +(# Distinct Words Predicted)*(Paramd) To Predict an existingword (Number of times we predicted that word) +(Paramd) (Param θ) +(Total number ofWords predicted) (Param θ)+(Total number ofwords predicted )

  36. What does Hierarchical StructureMean? Create a Hierarchy of WordDistributions. Use Chinese Restaurant Process to Continuously DrawWords. Each layer of the hierarchy based on a previous context from layer above. Each Layer Contains restaurants (Group of Nodes) and Tables (Nodes) Each Restaurant is aContext. Each Table is aWord.

  37. SampleHierarchy G0 G1=”Soccer” G2=”Basketball” Love c=2 Practice ,c=10 Play, c=10 G3= “PlaySoccer” Boy G4=”PracticeSoccer” Girl Messi c=10 Girl c=10 Boy c=10 Messi c=10 c=10 c=10 “Girl PlaySoccer” “Boy PlaySoccer” “Messi PlaySoccer”

  38. PseudoCode

  39. Inference What is BayesianInference? Posterior ∝Likelihood x Prior PosteriorDistribution Posterior PredictiveDistribution (Which is a Chinese RestaurantProcess)

  40. PosteriorDistribution The posterior distribution over the latent vectors C ={G_v : all contexts v} and parameters Θ ={θ m , d m : 0 ≤m ≤n−1}: p(C, Θ|D) =p(C, Θ,D)/p(D) ∝p(C | Θ D) p(Θ) p(Θ) is generated from Dirichlet process(Our prior). p(C | Θ D) is a multinomialdistribution. The product of these two terms is also a Dirichlet distribution (It can be generated through a Dirichlet process).

  41. Predictive PosteriorDistribution what is the probability of a test word w after a contextu? However, the above integral is difficult, so we use Monte Carlo approximation to approximate the aboveintegral.

  42. Predictive PosteriorDistribution

  43. GibbsSampling We use Gibbs sampling to obtain the posterior samples {S,Θ} Gibbs Sampling isbased on Monte Carlo Markov Chain(MCMC)

  44. Results The paper uses perplexity for the evaluation, and it shows that this model is better than the state of art method, interpolated Kneser-Ney. It can compete with modified Kneser-Ney aswell.

  45. Conclusion Pitman Yor Process is a Hierarchical Nonparametric bayesian model , which utilizes Bayesian inference to update latentparameters. The Pitman Yor Process is simply a nonparametric extension of the Dirichlet process. The benefits of this language model is that is avoids the overfitting problems of N-gram models, and it provides a coherent explanation for themodel.

  46. Reference Teh, Y. W.; Jordan, M. I.; Beal, M. J.; Blei, D. M. (2006). "Hierarchical Dirichlet Processes"(PDF). Journal of the American Statistical Association. 101(476): pp.1566–1581. CiteSeerX10.1.1.5.9094.doi:10.1198/016214506000000302 Wikipedia contributors. (2019, April 4). Bayesian hierarchical modeling. In Wikipedia, The Free Encyclopedia. Retrieved 15:39, April 22, 2019, fromhttps://en.wikipedia.org/w/index.php?title=Bayesian_hierarchical_modeling&oldid=890951273 Wikipedia contributors. (2019, February 7). Dirichlet process. In Wikipedia, The Free Encyclopedia. Retrieved 15:41, April 22, 2019, from https://en.wikipedia.org/w/index.php?title=Dirichlet_process&oldid=882242821 Wikipedia contributors. (2019, February 16). Hierarchical Dirichlet process. In Wikipedia, The Free Encyclopedia. Retrieved 15:43, April 22, 2019, fromhttps://en.wikipedia.org/w/index.php?title=Hierarchical_Dirichlet_process&oldid=883602247

More Related