CSE 6392 – Data Exploration and Analysis in Relational Databases January 31, 2006
Example Problem • Suppose you had the following tables: • Employee • Employee-Sample
Possible Queries • Some possible queries to get the average salary of all females in the company: (1) Select avg(salary) from Employee where gender = “F” (2) Select avg(salary) from Employee-Sample where gender = “F” (3) Select count(*) as C, sum(salary) as S, S/C from Employee-Sample where gender = “F” • Is there a difference between 2 and 3 in terms of results? No: avg(salary) is sum(salary) divided by count(*) over the same rows.
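The three queries can be tried out with a minimal sketch like the one below. It uses Python's built-in sqlite3 module with made-up data; the column layout, the population of 1,000 rows and the sample of 100 rows are all assumptions, since the slides do not show the table contents. S/C is computed in Python because reusing a column alias inside the same SELECT list is not supported by every SQL dialect.

```python
import random
import sqlite3

random.seed(0)  # fixed seed so the sketch is repeatable

conn = sqlite3.connect(":memory:")
# Hypothetical schema: the slides do not show the tables' columns.
conn.execute("CREATE TABLE Employee (id INTEGER, gender TEXT, salary REAL)")
conn.execute('CREATE TABLE "Employee-Sample" (id INTEGER, gender TEXT, salary REAL)')

# Made-up population of 1,000 employees.
rows = [(i, random.choice("FM"), random.randint(30, 120) * 1000.0)
        for i in range(1000)]
conn.executemany("INSERT INTO Employee VALUES (?, ?, ?)", rows)

# Uniform random sample of 100 rows, drawn without replacement.
conn.executemany('INSERT INTO "Employee-Sample" VALUES (?, ?, ?)',
                 random.sample(rows, 100))

# Query 1: exact answer on the full table.
q1 = conn.execute(
    "SELECT avg(salary) FROM Employee WHERE gender = 'F'").fetchone()[0]

# Query 2: avg() on the sample.
q2 = conn.execute(
    """SELECT avg(salary) FROM "Employee-Sample" WHERE gender = 'F'"""
).fetchone()[0]

# Query 3: count and sum on the sample, then S/C.
c, s = conn.execute(
    """SELECT count(*), sum(salary) FROM "Employee-Sample" WHERE gender = 'F'"""
).fetchone()
q3 = s / c

print("query 1 (exact):   ", q1)
print("query 2 (estimate):", q2)
print("query 3 (estimate):", q3)
```

Queries 2 and 3 print the same value, while query 1 generally differs from both because it reads the whole table.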
Estimator • What is an estimator? • Ex. count in a sample * (population size / sample size) • On the previous slide, queries 2 and 3 are estimators for query 1. • What is an unbiased estimator? • Basically, an estimator that is not tilted towards the lower or higher side of the true value • Formally: • $\hat{x}$ is the estimator for some quantity $x$ • $\hat{x}$ is an unbiased estimator if $E[\hat{x}] = x$.
Unbiased Estimators • Example • select count(*) as FC from Employee where gender = “F” • select count(*) * (N/n) as EFC from Employee-Sample where gender = “F” • Here N is the number of rows in Employee and n is the number of rows in Employee-Sample • EFC is an unbiased estimator • (N/n) is called the ‘ratio scale’
Unbiased Estimators (1) • Example • select sum(salary) as TFS from Employee where gender = “F” • select sum(salary)*(N/n) as ETFS from Employee-Sample where gender = “F” • ETFS is an unbiased estimator • Note: This is important to statisticians, but secondary for our purposes; we are more concerned about the error
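The unbiasedness of EFC and ETFS can be checked exactly on a toy table. The sketch below assumes the sample is drawn uniformly at random without replacement, so every possible sample of size n is equally likely; the six-row population and the sample size of 3 are hypothetical and not taken from the slides.

```python
from fractions import Fraction
from itertools import combinations

# Hypothetical toy population of (gender, salary) pairs.
employees = [("F", 50), ("M", 60), ("F", 70), ("M", 40), ("F", 90), ("M", 80)]
N = len(employees)   # population size
n = 3                # sample size

# True answers computed on the full table.
FC = sum(1 for g, _ in employees if g == "F")
TFS = sum(s for g, s in employees if g == "F")

# Under uniform sampling every size-n subset is equally likely, so the
# expected value of an estimator is its plain average over all subsets.
samples = list(combinations(employees, n))
scale = Fraction(N, n)
EFC_values = [scale * sum(1 for g, _ in smp if g == "F") for smp in samples]
ETFS_values = [scale * sum(s for g, s in smp if g == "F") for smp in samples]

expected_EFC = sum(EFC_values) / len(samples)
expected_ETFS = sum(ETFS_values) / len(samples)

print("E[EFC]  =", expected_EFC, " true FC  =", FC)
print("E[ETFS] =", expected_ETFS, " true TFS =", TFS)
```

Individual samples over- or under-shoot the true values, but the averages over all samples match them exactly, which is what unbiased means.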
Unbiased Estimators (2) • Example • Select avg(salary) as AFS from Employee where gender = “F” • Select count(*) as C, sum(salary) as S, EAFS=S/C from Employee-Sample where gender = “F” • Is EAFS unbiased? Not necessarily. A ratio of 2 unbiased estimators is not automatically unbiased (ratio estimation).
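As a toy illustration of why a ratio needs separate care, suppose (hypothetically, these numbers are not from the slides) that some sampling scheme yields the pair (S, C) = (10, 1) or (60, 3) with equal probability. The expectation of the ratio then differs from the ratio of the expectations:

```latex
\frac{E[S]}{E[C]} = \frac{(10 + 60)/2}{(1 + 3)/2} = \frac{35}{2} = 17.5,
\qquad
E\!\left[\frac{S}{C}\right] = \frac{1}{2}\left(\frac{10}{1} + \frac{60}{3}\right) = 15 .
```

So knowing that S and C are each unbiased says nothing, by itself, about whether S/C is unbiased.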
Probability • Example: roll a die. How many times will you get 1, 2, 3, 4, 5 or 6?
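The die example can be simulated with a short sketch. The number of rolls (6,000) and the seed are arbitrary choices made for illustration; each face is expected to come up roughly one sixth of the time.

```python
import random
from collections import Counter

random.seed(1)   # arbitrary seed, only for repeatability
rolls = 6000     # hypothetical number of rolls

# Count how often each face comes up.
counts = Counter(random.randint(1, 6) for _ in range(rolls))

for face in range(1, 7):
    print(face, counts[face], "(expected about", rolls // 6, ")")
```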
Probability Density • What is the probability that a random number generator (uniform between 0 and 1) will generate exactly .43? • Answer: 0% (1/infinity) • What about a number between .43 and .53? • Answer: 10% (1/10) • The probability of an interval is the area under the density curve over that interval; the total area (integral) = 1. • Any single number has a 0% probability, but an interval has a chance.
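A quick simulation makes the point concrete. This is a sketch using Python's random.random(), which draws approximately uniformly from [0, 1); the seed and the number of draws are arbitrary assumptions.

```python
import random

random.seed(2)   # arbitrary seed, only for repeatability
draws = [random.random() for _ in range(100_000)]

# How often do we hit the single value .43, versus the interval [.43, .53)?
exact_hits = sum(1 for x in draws if x == 0.43)
interval_hits = sum(1 for x in draws if 0.43 <= x < 0.53)

print("fraction equal to .43:     ", exact_hits / len(draws))
print("fraction in [.43, .53):    ", interval_hits / len(draws))
```

The first fraction is effectively zero, while the second lands close to 0.10, the width of the interval.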
Probability Density Function • A proper distribution if its integral = 1
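In symbols, writing f for the density of a random variable X (notation assumed here, not shown on the slide), the conditions discussed so far read:

```latex
\int_{-\infty}^{\infty} f(x)\,dx = 1,
\qquad
P(a \le X \le b) = \int_{a}^{b} f(x)\,dx,
\qquad
P(X = a) = \int_{a}^{a} f(x)\,dx = 0 .
```

For the uniform generator on [0, 1], f(x) = 1 on that range, so P(0.43 ≤ X ≤ 0.53) = 0.53 − 0.43 = 0.10.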
Probability Example • How many female employees (out of 50K employees)?
Probability Sample • If we sampled another company where the actual number of females is 5K, the variance would decrease:
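The effect of the true female count on the spread of the estimate can be sketched numerically. The function below assumes the estimate is EFC = (count in sample) * (N/n) with a uniform sample drawn without replacement, so the count in the sample is hypergeometric. The sample size of 500 and the 25K comparison point are hypothetical; only the 50K employees and the 5K females come from the slides.

```python
def efc_variance(N, n, females):
    """Variance of EFC = count_in_sample * (N / n) when n rows are sampled
    uniformly without replacement from N rows (hypergeometric count)."""
    p = females / N
    count_var = n * p * (1 - p) * (N - n) / (N - 1)
    return (N / n) ** 2 * count_var


N = 50_000   # employees, from the slide
n = 500      # hypothetical sample size

# 25K is an assumed comparison point; 5K is the value named on the slide.
for females in (25_000, 5_000):
    std_dev = efc_variance(N, n, females) ** 0.5
    print(f"true females = {females:>6}: std. dev. of EFC is about {std_dev:.0f}")
```

The variance is proportional to p(1 − p), which is largest at p = 0.5 and shrinks as the proportion of females moves towards 0 or 1.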
Relative Error • In Approximate Query Processing, people use absolute error statistically, but relative error practically. • $\text{relative error}^2 = \dfrac{(\text{ETFC} - \text{TFC})^2}{\text{TFC}^2}$
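A small helper shows why relative error is the more practical measure. The function name and the example numbers below are hypothetical, chosen only to show that the same absolute error can mean very different relative errors.

```python
def relative_error(estimate, truth):
    """Return |estimate - truth| / |truth|; the true value must be nonzero."""
    if truth == 0:
        raise ValueError("relative error is undefined when the true value is 0")
    return abs(estimate - truth) / abs(truth)


# Both estimates are off by 200 in absolute terms (made-up numbers).
small_case = relative_error(5_200, 5_000)
large_case = relative_error(25_200, 25_000)

print(f"truth  5,000: relative error = {small_case:.3f}")
print(f"truth 25,000: relative error = {large_case:.3f}")
```

Squaring the returned value gives the squared relative error in the form written on the slide.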
Central Limit Theorem • The main point of this theorem is that it does not matter how the data was originally distributed: for a large enough sample, the distribution of the sample mean (or sum) will be approximately normal. • Normal distribution:
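The theorem can be observed with a short simulation. The sketch below draws from an exponential distribution, which is strongly skewed, and checks how often the sample mean falls within one and two standard errors of the true mean. The choice of distribution, the sample size of 50, the 4,000 repetitions and the seed are all assumptions made for illustration.

```python
import random
import statistics

random.seed(3)        # arbitrary seed, only for repeatability
sample_size = 50      # rows per sample (hypothetical)
num_samples = 4000    # number of repeated samples (hypothetical)

# Exponential(1) is skewed; its mean and standard deviation are both 1.
means = [
    statistics.fmean(random.expovariate(1.0) for _ in range(sample_size))
    for _ in range(num_samples)
]

# Standard error of the mean = population std. dev. / sqrt(sample size).
std_error = 1.0 / sample_size ** 0.5
within_one = sum(1 for m in means if abs(m - 1.0) <= std_error) / num_samples
within_two = sum(1 for m in means if abs(m - 1.0) <= 2 * std_error) / num_samples

print("within 1 std. error:", within_one, "(normal: about 0.68)")
print("within 2 std. errors:", within_two, "(normal: about 0.95)")
```

Even though individual draws are far from normal, the sample means follow the familiar 68% and 95% pattern closely.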