A Bayesian Method for Guessing the Extreme Values in a Dataset By: Mingxi Wu and Christopher Jermaine Presented By: Ashu Raj 09/21/2010 CSE 6339 DATA EXPLORATION AND ANALYSIS IN RELATIONAL DATABASES FALL 2010
OUTLINE: • Problem Statement • Application Examples • Bayesian Method • The Learning Phase • The Characterization Phase • The Inference Phase • Experiments and Results • Conclusion
Problem Statement: A ubiquitous problem in data management: given a finite set of real values, can we take a sample from the set and use it to predict the kth largest value in the entire set?
Application Examples: • Min/max online aggregation • Top-k query processing • Outlier detection • Probabilistic query optimization
A Bayesian Method: • Propose a natural estimator • Characterize the estimator's error distribution (Bayesian) • Learn a prior model from the past query workload • Update the prior model using a sample • Sample an error distribution from the posterior model • With the estimator and its error distribution, we can put confidence bounds on the kth largest value
Natural Estimator: • Data set size N, sample size n • The estimator is the (k′)th largest value in the sample • Matching ranks proportionally, k′/n = k/N, so k′ = ⌈(n/N) · k⌉ • How accurate is this estimator?
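As a quick illustration, here is a minimal Python sketch of this natural estimator (the function name and signature are ours, not from the paper):

```python
import math

def natural_estimate(sample, N, k):
    """Guess the kth largest value of a size-N data set from a sample:
    return the (k')th largest sample value, with k' = ceil(n/N * k)."""
    n = len(sample)
    k_prime = math.ceil(n / N * k)           # scale the rank k down to the sample
    ordered = sorted(sample, reverse=True)   # descending order
    return ordered[k_prime - 1]              # (k')th largest, 1-indexed
```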
Characterize error distribution of estimator • How to determine the estimator's error? • Study the relationship between estimator and answer: take the ratio of the kth largest (the answer) to the (k′)th largest (the estimator) and find the ratio's distribution • We have no prior knowledge of the data set D • It is impossible to predict the ratio by looking at the sample alone
With domain knowledge and a sample, we can guess the behavior of D • What domain knowledge should be modeled to help solve this problem?
Importance of query shape • Setup: four data sets with different histogram shapes; each has 10,000 values; we are looking for the largest value • Experiment: take a 100-element sample and record the obtained ratio kth/(k′)th; do this 500 times
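A hedged simulation of this experiment, with synthetic stand-ins for the four histogram shapes (the paper's actual data sets are not reproduced here):

```python
import numpy as np

rng = np.random.default_rng(0)

# Four synthetic data sets with different histogram shapes.
datasets = {
    "right-skewed": rng.gamma(2.0, 50.0, 10_000),
    "uniform":      rng.uniform(0.0, 200.0, 10_000),
    "bell-shaped":  rng.normal(100.0, 20.0, 10_000),
    "left-skewed":  200.0 - rng.gamma(2.0, 20.0, 10_000),
}

k, n, trials = 1, 100, 500   # largest value, sample size, repetitions
for name, data in datasets.items():
    true_kth = np.sort(data)[-k]                  # kth largest of the full set
    k_prime = int(np.ceil(n / len(data) * k))
    ratios = []
    for _ in range(trials):
        sub = rng.choice(data, size=n, replace=False)
        ratios.append(true_kth / np.sort(sub)[-k_prime])
    print(f"{name:>12}: mean ratio {np.mean(ratios):.2f}, std {np.std(ratios):.2f}")
```

The spread of the ratio differs markedly across the four shapes, which is exactly why shape knowledge matters.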
Basics of Bayesian Approach • Proposed Bayesian Inference Framework • Learning Phase • Characterization Phase • Inference Phase
The Learning Phase: • The Generative Model • Assume the existence of a set of possible shape patterns • Each shape has a weight specifying how likely it is to match a new data set's histogram shape
First, a biased die is rolled to determine which shape pattern will generate the query result set (in the previous figure, suppose we select shape 3) • Next, an arbitrary scale for the query is randomly generated • Finally, a parametric distribution f(x│shape, scale) is instantiated; this distribution is repeatedly sampled to generate the new data set (see the sketch below)
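A minimal sketch of this generative model, assuming Gamma as the parametric family (as the next slides do); the shape grid, weights, and scale range are illustrative values, not the paper's:

```python
import numpy as np

rng = np.random.default_rng(1)

shape_alphas  = np.array([0.5, 2.0, 5.0, 20.0])   # candidate Gamma shapes
shape_weights = np.array([0.1, 0.4, 0.3, 0.2])    # the biased die

def generate_dataset(N):
    """One run of the generative model: roll the biased die to pick a
    shape pattern, draw an arbitrary scale, then repeatedly sample the
    resulting Gamma distribution."""
    j = rng.choice(len(shape_alphas), p=shape_weights)  # biased die roll
    scale = rng.uniform(1.0, 1000.0)                    # arbitrary scale
    return rng.gamma(shape_alphas[j], scale, size=N)

data = generate_dataset(10_000)
```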
Next, we formalize and learn the model from the domain data (workload) • Probability Density Function (PDF) • The Gamma distribution can produce data with arbitrary right-leaning skew
Deriving the PDF: • The Gamma distribution PDF is f(x│α, β) = (β^α / Γ(α)) · x^(α−1) · e^(−βx), where α > 0 is the shape parameter and β > 0 is the inverse scale (rate) parameter • Since scale does not matter, we treat β as an unknown random variable
Deriving the likelihood model: • The resulting likelihood of a given data set D under one shape is of the form L(D│α) • The model assumes a set of c weighted shapes, so the complete likelihood of observing D is L(D│Φ) = ∑j=1..c wj · L(D│αj), where each wj is a non-negative weight and ∑j=1..c wj = 1 • The complete set of model parameters is Φ = {Φ1, Φ2, …, Φc}, where Φj = {wj, αj}
Learning the parameters: • Φ is unknown and must be learned from the historical workload • Given a set of independent domain data sets D = {D1, …, Dr}, the likelihood of observing them is p(D│Φ) = ∏i=1..r L(Di│Φ) • We use the EM algorithm to learn the most likely parameters, Φ* = argmaxΦ p(D│Φ) • At this point, we have learned a prior shape model
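A simplified EM sketch: the candidate shapes sit on a fixed grid and only the weights are learned, and the scale is profiled out by its MLE rather than integrated out as in the paper (both are our simplifying assumptions):

```python
import numpy as np
from scipy import stats

def log_lik(D, alpha):
    """Log-likelihood of data set D under shape alpha, plugging in the
    MLE of the Gamma scale (the paper integrates the scale out)."""
    scale_hat = np.mean(D) / alpha   # MLE of the scale given alpha
    return stats.gamma.logpdf(D, a=alpha, scale=scale_hat).sum()

def em_weights(datasets, alphas, iters=50):
    """EM over the mixture weights for a fixed grid of candidate shapes
    (the paper's EM also updates the shapes themselves)."""
    w = np.full(len(alphas), 1.0 / len(alphas))
    # Per-(data set, shape) log-likelihoods, computed once.
    L = np.array([[log_lik(D, a) for a in alphas] for D in datasets])
    for _ in range(iters):
        # E-step: responsibility of shape j for data set i.
        logr = np.log(w) + L
        logr -= logr.max(axis=1, keepdims=True)   # numerical stability
        r = np.exp(logr)
        r /= r.sum(axis=1, keepdims=True)
        # M-step: new weights are the average responsibilities.
        w = r.mean(axis=0)
    return w
```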
Applying the EM algorithm (sketched above) yields the prior shape model • Next, we take a sample from the new data set to update it
Let S be our sample. Applying Bayes' rule, the posterior weight of shape pattern j is w′j = wj · L(S│αj) / ∑i=1..c wi · L(S│αi) • The resulting posterior shape model replaces each prior weight wj with w′j
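Continuing the sketch above (reusing log_lik), the posterior reweighting is a direct application of Bayes' rule in log space:

```python
import numpy as np

def posterior_weights(S, alphas, prior_w):
    """Reweight each shape pattern by how well it explains the sample S."""
    logp = np.log(prior_w) + np.array([log_lik(S, a) for a in alphas])
    logp -= logp.max()          # stabilize before exponentiating
    w = np.exp(logp)
    return w / w.sum()          # normalize so the weights sum to 1
```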
Characterization Phase: • Derive the error distribution associated with each shape; each shape characterizes a distribution of the error ratio kth/(k′)th • To find the error distribution for a shape α, pick a scale β, then: • 1. Produce a query result set by drawing a sample of size N from the Gamma distribution f(x│α, β); the kth largest value in this sample is f(k) • 2. To estimate f(k), draw a subsample of size n from the sample obtained in step 1; the (k′)th largest value in the subsample is the estimator f(k)′
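A brute-force Monte Carlo version of these two steps (the paper's TKD method, described next, avoids materializing the full size-N sample):

```python
import numpy as np

def error_distribution(alpha, N, n, k, trials=1000, seed=2):
    """Characterize the ratio kth/(k')th for one shape pattern.
    The scale cancels out of the ratio, so it is fixed to 1 here."""
    rng = np.random.default_rng(seed)
    k_prime = int(np.ceil(n / N * k))
    ratios = np.empty(trials)
    for t in range(trials):
        data = rng.gamma(alpha, 1.0, size=N)       # step 1: the query result set
        f_k = np.sort(data)[-k]                    # kth largest, f(k)
        sub = rng.choice(data, size=n, replace=False)
        ratios[t] = f_k / np.sort(sub)[-k_prime]   # step 2: f(k) / f(k)'
    return ratios
```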
TKD (Top-k Dependent) method: • Efficiently produces f(k)′ given f(k) • First determines, by means of a Bernoulli trial, whether or not the subsample includes f(k) • Depending on the result, the TKD method determines the composition of the k′ largest values in the subsample using the hypergeometric distribution, and returns the (k′)th largest • The input parameters are the same as for the Monte Carlo method, with the addition of the sampled f(k) • The process assumes we have an efficient method for sampling f(k)
Each shape characterizes an error distribution of kth/(k′)th • To get the posterior error distribution, attach each shape's posterior weight to its error distribution
Inference Phase: The final mixture error distribution is the sum of the per-shape error distributions, each weighted by its posterior shape weight.
Given the distribution of kth/(k′)th, we can put confidence bounds on the answer: • Choose a pair of bounds (LB, UB) such that p% of the probability is covered • Bound the kth largest by [(k′)th · LB, (k′)th · UB] with p% probability
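One simple way to realize this bounding, reusing error_distribution from the characterization sketch; equal-tailed quantiles are just one valid choice of (LB, UB) pair, and drawing the mixture by posterior-weighted resampling is our simplification:

```python
import numpy as np

def confidence_bound(estimate, per_shape_ratios, post_w, p=0.95, seed=3):
    """Bound the kth largest value: draw ratios from the posterior
    mixture error distribution, take equal-tailed quantiles, and scale
    the estimator (the (k')th largest sample value) by them."""
    rng = np.random.default_rng(seed)
    # Pick a shape by posterior weight, then a ratio from its
    # precomputed error distribution (e.g. from error_distribution()).
    draws = [rng.choice(per_shape_ratios[j])
             for j in rng.choice(len(post_w), size=5000, p=post_w)]
    lb, ub = np.quantile(draws, [(1 - p) / 2, 1 - (1 - p) / 2])
    return estimate * lb, estimate * ub
```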
Summary: • Learned a prior shape model from historical queries • Devised a closed-form model: a variant of the Gamma mixture model • Employed an EM algorithm to learn the model from historical data • Updated the prior shape model with a sample • Applied Bayes' rule to update each shape pattern's weight • Produced an error distribution from the posterior model • Attached the posterior weight to each shape's error distribution • With our estimator and its error distribution, we can bound the answer
More Results: • Distance-based outlier detection • Improved the performance of a state-of-the-art algorithm by an average factor of 4 across seven large data sets
Conclusion: • Defined the problem of estimating the kth largest value in a real data set • Proposed an estimator • Characterized the ratio error distribution with a Bayesian framework • Successfully applied the proposed method to research problems
References: • Mingxi Wu and Chris Jermaine. A Bayesian Method for Guessing the Extreme Values in a Data Set. VLDB 2007. • Talk slides: http://www.cise.ufl.edu/~mwu/research/extremeTalk.pdf