
Presented By: Ashu Raj 09/21/2010

A Bayesian Method for Guessing the Extreme Values in a Dataset By: Mingxi Wu and Christopher Jermaine. Presented By: Ashu Raj 09/21/2010 CSE 6339 DATA EXPLORATION AND ANALYSIS IN RELATIONAL DATABASES


Presentation Transcript


  1. A Bayesian Method for Guessing the Extreme Values in a Dataset By: Mingxi Wu and Christopher Jermaine Presented By: Ashu Raj 09/21/2010 CSE 6339 DATA EXPLORATION AND ANALYSIS IN RELATIONAL DATABASES FALL 2010

  2. OUTLINE: • Problem Statement • Application Examples • Bayesian Method • The learning Phase • The Characterization Phase • The Inference Phase • Experiments and Results • Conclusion

  3. Problem Statement: A ubiquitous problem in data management: Given a finite set of real values, can we take a sample from the set, and use the sample to predict the kth largest value in the entire set?

  4. Application Examples: • Min/Max online aggregation • Top-k query processing • Outlier detection • Probability Query Optimization

  5. A Bayesian Method: • Propose a natural estimator • Characterize error distribution of estimator (Bayesian) • Learn a prior model from the past query workload • Update the prior model using a sample • Sample an error distribution from the posterior model • With estimator and its error distribution, we can confidence bound the kth largest

  6. Natural Estimator: • Data set size N, sample size n • The estimator is the (k´)th largest value in the sample, where k´/n = k/N, so k´ = ⌈(n/N) · k⌉ • How accurate is this method?
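The natural estimator can be sketched in a few lines of Python. This is an illustration only: the function names `k_prime` and `natural_estimate`, the Gamma-distributed toy data, and the seed are assumptions made here, not part of the slides.

```python
import random

def k_prime(k, n, N):
    """Rank to read in the sample: k' = ceil(n/N * k), using integer arithmetic."""
    return (n * k + N - 1) // N

def natural_estimate(sample, k, N):
    """Return the (k')th largest sample value as the estimate of the
    kth largest value in the full data set of size N."""
    kp = k_prime(k, len(sample), N)
    return sorted(sample, reverse=True)[kp - 1]

# Toy data (assumption): 10,000 right-skewed values and a 100-element sample.
random.seed(0)
data = [random.gammavariate(2.0, 1.0) for _ in range(10_000)]
sample = random.sample(data, 100)

estimate = natural_estimate(sample, k=1, N=len(data))
truth = sorted(data, reverse=True)[0]
print("estimate:", estimate, "truth:", truth, "ratio:", truth / estimate)
```

For k = 1 the estimate is simply the sample maximum, so it can never exceed the true maximum; the ratio printed at the end is the quantity whose distribution the rest of the talk characterizes.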

  7. Characterize error distribution of estimator • How do we determine the estimator’s error? • Study the relationship between the estimator and the true answer: take the ratio of the kth largest value in the data set to the (k´)th largest value in the sample, and find the distribution of this ratio • We do not have any prior knowledge of the data set D. • It is impossible to predict the ratio by looking at the sample alone.

  8. With domain knowledge and a sample, we can guess the behavior of D • What domain knowledge should be modeled to help solve this problem?

  9. Importance of query shape: • Setup: four data sets with different histogram shapes; each has 10,000 values; we are looking for the largest value. • Experiment: take a 100-element sample and record the obtained ratio kth/(k´)th. Do this 500 times.
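A rough version of this experiment can be reproduced as follows. The slides do not say which four data sets were used, so the two Gamma shapes below (0.5 for a strongly skewed histogram, 20 for a nearly symmetric one) are stand-ins chosen here for illustration, as is the function name `ratio_experiment`.

```python
import random
import statistics

def ratio_experiment(data, k=1, n=100, trials=500, seed=0):
    """Repeatedly sample n values and record (kth largest of data) / ((k')th largest of sample)."""
    rng = random.Random(seed)
    N = len(data)
    kp = (n * k + N - 1) // N          # k' = ceil(n/N * k)
    truth = sorted(data, reverse=True)[k - 1]
    ratios = []
    for _ in range(trials):
        sample = rng.sample(data, n)
        estimate = sorted(sample, reverse=True)[kp - 1]
        ratios.append(truth / estimate)
    return ratios

# Two hypothetical data sets of 10,000 values with different histogram shapes.
gen = random.Random(42)
skewed = [gen.gammavariate(0.5, 1.0) for _ in range(10_000)]
peaked = [gen.gammavariate(20.0, 1.0) for _ in range(10_000)]

ratios_skewed = ratio_experiment(skewed)
ratios_peaked = ratio_experiment(peaked)
median_skewed = statistics.median(ratios_skewed)
median_peaked = statistics.median(ratios_peaked)
print("median ratio, skewed shape:", median_skewed)
print("median ratio, peaked shape:", median_peaked)
```

The two ratio distributions differ noticeably even though the sample size and k are identical, which is the point of the slide: the histogram shape drives the estimator's error.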

  10. Basics of Bayesian Approach • Proposed Bayesian Inference Framework • Learning Phase • Characterization Phase • Inference Phase

  11. The Learning Phase: • The Generative Model • Assume the existence of a set of possible shape patterns • Each shape has a weight, specifying how likely it matches with a new data set’s histogram shape.

  12. First, a biased die is rolled to determine which shape pattern will generate the query result set (in the previous figure, suppose we select shape 3). • Next, an arbitrary scale for the query is randomly generated. • Finally, a parametric distribution f(x│shape,scale) is instantiated and repeatedly sampled to generate the new data set.
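The three generative steps can be sketched as below. The shape values, weights, and the uniform range used for the scale are hypothetical; the slides only say that the scale is arbitrary. Note that Python's `random.gammavariate(alpha, beta)` takes a scale parameter, whereas the slides later parameterize the Gamma distribution by an inverse scale.

```python
import random

def generate_dataset(shapes, weights, size, rng):
    """Generate one query result set from the shape-mixture generative model."""
    # Step 1: roll a biased die to pick a shape pattern.
    alpha = rng.choices(shapes, weights=weights, k=1)[0]
    # Step 2: draw an arbitrary scale (the range here is an assumption).
    scale = rng.uniform(0.1, 100.0)
    # Step 3: sample repeatedly from the instantiated distribution.
    data = [rng.gammavariate(alpha, scale) for _ in range(size)]
    return alpha, scale, data

rng = random.Random(0)
shapes = [0.5, 2.0, 10.0]       # hypothetical shape patterns
weights = [0.2, 0.5, 0.3]       # hypothetical prior weights, summing to 1
alpha, scale, data = generate_dataset(shapes, weights, size=1000, rng=rng)
print("chosen shape:", alpha, "scale:", round(scale, 2), "values:", len(data))
```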

  13. Next, formalize and learn the model from the domain data (workload) • Probability Density Function (PDF) • The Gamma distribution can produce data with arbitrary right-leaning skew.

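  14. Deriving the PDF: • The Gamma distribution PDF is: $f(x \mid \alpha, \beta) = \frac{\beta^{\alpha}}{\Gamma(\alpha)} x^{\alpha-1} e^{-\beta x}$ for x > 0, where α > 0 is known as the shape parameter and β > 0 as the inverse scale parameter. • Since scale does not matter, we treat β as an unknown random variable.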

  15. Deriving the likelihood model: • The resulting likelihood of a given data set D has the form L(D│α) • The model assumes a set of c weighted shapes, so the complete likelihood of observing D is: $L(D \mid \Theta) = \sum_{j=1}^{c} w_j \, L(D \mid \alpha_j)$, where the $w_j$ are non-negative weights with $\sum_{j=1}^{c} w_j = 1$. The complete set of model parameters is Θ = {Θ1,Θ2,…,Θc}, where Θj = {wj, αj}

  16. Learning the parameters: • Θ is unknown and must be learned from the historical workload. • Given a set of independent domain data sets $\mathcal{D} = \{D_1, \ldots, D_r\}$, the likelihood of observing them is: $L(\mathcal{D} \mid \Theta) = \prod_{i=1}^{r} L(D_i \mid \Theta)$ • We use the EM algorithm to learn the most likely Θ, so that: $\Theta^{*} = \arg\max_{\Theta} L(\mathcal{D} \mid \Theta)$ • At this point, we have learned a prior shape model

  17. Now we have to apply the EM algorithm to this prior model • Next, we take a sample from the data set

  18. Use the sample to update the prior weight of each shape pattern

  19. EM Algorithm:

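  20. Let S be our sample. Applying Bayes’ rule, the posterior weight of shape pattern j is: $w'_j = \frac{w_j \, L(S \mid \alpha_j)}{\sum_{i=1}^{c} w_i \, L(S \mid \alpha_i)}$ • The resulting posterior shape model keeps the same shapes $\alpha_j$ and replaces each prior weight $w_j$ with $w'_j$.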
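The Bayes update of the shape weights can be sketched as follows. The slides do not give the closed form of L(S│α), so this sketch swaps in a simpler stand-in: the Gamma log-likelihood with the inverse scale plugged in at its maximum-likelihood value α / mean(S). That substitution, the function names, and the example shapes are assumptions made here, not the authors' model. Prior weights are assumed to be strictly positive.

```python
import math
import random

def gamma_profile_loglik(sample, alpha):
    """Gamma log-likelihood of the sample for a fixed shape alpha, with the
    inverse scale set to its MLE alpha / mean (a stand-in for L(S | alpha))."""
    n = len(sample)
    total = sum(sample)
    rate = alpha * n / total
    sum_log = sum(math.log(x) for x in sample)
    return (n * alpha * math.log(rate) - n * math.lgamma(alpha)
            + (alpha - 1.0) * sum_log - rate * total)

def posterior_weights(sample, shapes, prior_weights):
    """Bayes' rule: w'_j is proportional to w_j * L(S | alpha_j)."""
    logs = [math.log(w) + gamma_profile_loglik(sample, a)
            for a, w in zip(shapes, prior_weights)]
    top = max(logs)                      # subtract the max for numerical stability
    unnormalized = [math.exp(v - top) for v in logs]
    norm = sum(unnormalized)
    return [u / norm for u in unnormalized]

rng = random.Random(1)
sample = [rng.gammavariate(2.0, 3.0) for _ in range(200)]   # true shape is 2.0
shapes = [0.5, 2.0, 10.0]                                   # hypothetical shape patterns
prior = [1 / 3, 1 / 3, 1 / 3]
posterior = posterior_weights(sample, shapes, prior)
print(posterior)
```

Working in log space avoids underflow, since the likelihood of a few hundred values is far too small to represent directly.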

  21. Characterization Phase: • Derive the error distribution associated with each shape. • Each shape characterizes an error distribution of kth/(k´)th • To find the error distribution for a shape α, pick a scale β, then: • 1. A query result is produced by drawing a sample of size N from the Gamma distribution f(x│α, β); the kth largest value in this sample is f(k). • 2. To estimate f(k), a subsample of size n is drawn from the sample obtained in step 1; the (k´)th largest value in the subsample is the estimator f(k)´.

  22. Monte- Carlo Sampling
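A direct Monte Carlo version of the two characterization steps might look like the sketch below. The function name, trial count, and the sizes N and n are illustrative choices made here. As before, `random.gammavariate` takes a scale rather than an inverse scale.

```python
import random

def error_ratios_for_shape(alpha, N, n, k, trials, seed=0, scale=1.0):
    """Monte Carlo sample of the ratio f(k) / f(k)' for one shape alpha."""
    rng = random.Random(seed)
    kp = (n * k + N - 1) // N                      # k' = ceil(n/N * k)
    ratios = []
    for _ in range(trials):
        # Step 1: generate a query result of size N; f(k) is its kth largest value.
        query = [rng.gammavariate(alpha, scale) for _ in range(N)]
        f_k = sorted(query, reverse=True)[k - 1]
        # Step 2: subsample n values; the estimator is the (k')th largest of them.
        subsample = rng.sample(query, n)
        f_k_est = sorted(subsample, reverse=True)[kp - 1]
        ratios.append(f_k / f_k_est)
    return ratios

ratios = error_ratios_for_shape(alpha=2.0, N=1000, n=50, k=1, trials=200)
print("min ratio:", min(ratios), "max ratio:", max(ratios))
```

Because the recorded quantity is a ratio, rescaling every value by the same factor leaves it unchanged, which is why any convenient scale can be picked for this step.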

  23. TKD (Top-k dependent) method: • Efficiently produces f(k)´ given f(k). • First determines whether or not the subsample includes f(k) by means of a Bernoulli trial. • Depending on the result, the TKD method determines, in a randomized fashion, the composition of the k´ largest values in the subsample with the help of the hypergeometric method, and returns the (k´)th largest. • The input parameters are the same as for the Monte Carlo method, with the addition of the sampled f(k). • This process assumes that we have an efficient method to sample f(k).

  24. Each shape characterizes an error distribution of kth/(k´)th • To get the posterior error distribution, attach each shape’s posterior weight to its error distribution.

  25. Inference Phase: The final mixture error distribution:

  26. Given the distribution of kth/(k´)th, we can confidence-bound the answer: • Choose a pair of lower and upper bounds (LB, UB) such that p% of the probability is covered. • Bound kth by [(k´)th * LB, (k´)th * UB] with p% probability.
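One way to turn the mixture error distribution into a bound is to draw ratios from it and read off empirical quantiles, as sketched below. The equal-tailed choice of (LB, UB), the function names, and the per-shape ratio values are placeholders chosen here for illustration; the slides do not specify how the pair is chosen. The estimate is assumed to be positive.

```python
import math
import random

def sample_mixture(per_shape_ratios, posterior_weights, size, rng):
    """Draw ratios from the mixture: pick a shape by posterior weight,
    then pick one of that shape's Monte Carlo ratios."""
    draws = []
    for _ in range(size):
        ratios = rng.choices(per_shape_ratios, weights=posterior_weights, k=1)[0]
        draws.append(rng.choice(ratios))
    return draws

def confidence_bound(ratios, estimate, p=0.95):
    """Equal-tailed bound [estimate * LB, estimate * UB] covering at least
    a fraction p of the sampled ratios."""
    xs = sorted(ratios)
    tail = (1.0 - p) / 2.0
    lo_idx = math.floor(tail * (len(xs) - 1))
    hi_idx = math.ceil((1.0 - tail) * (len(xs) - 1))
    return estimate * xs[lo_idx], estimate * xs[hi_idx]

rng = random.Random(0)
per_shape = [[1.0, 1.1, 1.3, 1.6], [1.0, 2.0, 3.5, 6.0]]   # placeholder ratios per shape
weights = [0.8, 0.2]                                        # placeholder posterior weights
mixture = sample_mixture(per_shape, weights, 1000, rng)
low, high = confidence_bound(mixture, estimate=42.0, p=0.95)
print("95% bound on the kth largest value:", (low, high))
```

Rounding the quantile indices outward keeps the empirical coverage at or above p, at the cost of a slightly wider interval.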

  27. Summary: • Learn a prior shape model from historical queries • Devised a closed-form model: a variant of the Gamma mixture model • Employed an EM algorithm to learn the model from historical data • Update the prior shape model with a sample • Applied Bayes’ rule to update each shape pattern’s weight • Produce an error distribution from the posterior model • Posterior weight attached to each shape’s error distribution • With our estimator and its error distribution, we can bound the answer.

  28. Experiments and Results:

  29. More Results: • Distance-Based Outlier Detection • Improved the performance of a state-of-the-art algorithm by an average factor of 4 over seven large data sets.

  30. Conclusion: • Defined the problem of estimating the kth largest value in a real data set. • Proposed an estimator. • Characterized the ratio error distribution with a Bayesian framework. • Applied the proposed method to research problems successfully.

  31. References: Mingxi Wu, Chris Jermaine. A Bayesian Method for Guessing the Extreme Values in a Data Set, VLDB 2007. http://www.cise.ufl.edu/~mwu/research/extremeTalk.pdf

  32. THANK YOU…
