Understanding Random Data Variation in Data Analysis

Chapter 2. Variation

How to:
  summarize/display random data
  appreciate variation due to randomness

Data summaries.
  single observation $y$ (number, curve, image, ...)
  sample $y_1, \ldots, y_n$
  statistic $s(y_1, \ldots, y_n)$

Features: location, scale (spread)

Sample moments
  $\bar{y} = (y_1 + \cdots + y_n)/n$, average
  $s^2 = \sum_j (y_j - \bar{y})^2/(n-1)$, sample variance

Order statistics $y_{(1)} \le y_{(2)} \le \cdots \le y_{(n)}$
  minimum, maximum, median, range
  quartiles, quantiles
  $p \times 100\%$ trimmed average
  IQR, MAD $= \mathrm{median}\{|y_i - \mathrm{median}(y_i)|\}$
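The summaries listed above can be computed in a few lines. The following is a minimal sketch using only the Python standard library; the data values and the helper names `trimmed_mean` and `mad` are made up for illustration, and the quartiles use one of several common conventions.

```python
import statistics

# Hypothetical sample, made up for illustration (not the birth data).
y = [2.0, 3.5, 4.0, 5.5, 7.0, 8.5, 9.0, 12.0, 19.0]

def trimmed_mean(values, p):
    """Average after dropping the int(p*n) smallest and int(p*n) largest values.

    Assumes 0 <= p < 0.5 so that at least one value is kept.
    """
    ordered = sorted(values)
    k = int(p * len(ordered))
    kept = ordered[k:len(ordered) - k]
    return sum(kept) / len(kept)

def mad(values):
    """Median absolute deviation from the median."""
    med = statistics.median(values)
    return statistics.median([abs(v - med) for v in values])

average = statistics.mean(y)
variance = statistics.variance(y)          # divisor n - 1
median = statistics.median(y)
q1, q2, q3 = statistics.quantiles(y, n=4)  # one of several quartile conventions
iqr = q3 - q1
y_range = max(y) - min(y)
trimmed = trimmed_mean(y, 0.2)
mad_value = mad(y)

print(average, variance, median, iqr, y_range, trimmed, mad_value)
```

Different packages use slightly different quartile definitions, so the IQR may not match other software exactly.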

Bad data
  Outlier: observation unusual compared to the others
  Resistance
  Trimmed average

Example (Midwife birth data). Hours in labor by day
  $n = 95$, $\bar{y} = 7.57$ hr, $s^2 = 12.97$ hr$^2$
  min, med, max = 1.5, 7.5, 19 hr
  quartiles 4.95, 9.75 hr
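Resistance can be seen by replacing one observation with a gross outlier and comparing how far each summary moves. This is a small sketch with made-up numbers, not the birth data.

```python
import statistics

# Hypothetical values, made up for illustration.
clean = [4.0, 5.5, 6.0, 7.5, 8.0, 9.5, 10.0]
# Replace the largest value by a gross outlier.
contaminated = clean[:-1] + [100.0]

mean_shift = statistics.mean(contaminated) - statistics.mean(clean)
median_shift = statistics.median(contaminated) - statistics.median(clean)

print("shift in average:", mean_shift)
print("shift in median: ", median_shift)
```

The average moves by (100 - 10)/7 while the median does not move at all, which is what "resistant" means here.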

Graphs. Indispensable in data analysis

Histogram
  disjoint bins $[L + (k-1)\delta,\ L + k\delta)$
  plot count, $n_k$, or proportion, $n_k/n$

EDF $\#\{y_j \le y\}/n$
  estimates CDF, $\mathrm{Prob}\{Y \le y\}$

Scatter plot $(u_j, v_j)$

Parallel boxplots: location, scale, shape, outliers, comparative
  median, quartiles, 1.5 IQR
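The histogram counts and the EDF are simple enough to compute by hand. The sketch below assumes a left endpoint `left` and a common bin width `width` (names chosen here for illustration) and uses made-up data.

```python
import math

def edf(sample, y):
    """Empirical distribution function: proportion of observations <= y."""
    return sum(1 for v in sample if v <= y) / len(sample)

def histogram_counts(sample, left, width):
    """Counts n_k for bins [left + (k-1)*width, left + k*width), k = 1, 2, ..."""
    counts = {}
    for v in sample:
        k = math.floor((v - left) / width) + 1
        counts[k] = counts.get(k, 0) + 1
    return counts

# Hypothetical sample, made up for illustration.
sample = [1.5, 2.0, 4.5, 7.5, 7.5, 9.0, 19.0]

counts = histogram_counts(sample, left=0.0, width=5.0)
proportions = {k: c / len(sample) for k, c in counts.items()}
edf_at_median = edf(sample, 7.5)

print(counts, proportions, edf_at_median)
```

Empty bins are simply absent from the dictionary; a plotting routine would fill them in with zero.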

Random sample $Y_1, \ldots, Y_n$ independent, CDF $F$

Mean $E(Y) = \int y\, dF(y)$ ($= \int y f(y)\, dy$ if density $f$ exists)

$p$ quantile $y_p = F^{-1}(p)$

Laplace (continuous)
  $f(y) = \exp\{-|y - \eta|/\tau\}/(2\tau)$, $-\infty < y < \infty$

Poisson (discrete)
  $\mathrm{Prob}(Y = y) = f(y) = \theta^y \exp\{-\theta\}/y!$, $y = 0, 1, 2, \ldots$

Count of daily arrivals: Poisson
Hours of labor: gamma

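Gamma
  $f(y) = \lambda^{\kappa} y^{\kappa - 1} \exp\{-\lambda y\}/\Gamma(\kappa)$, $y > 0$ (shape $\kappa > 0$, rate $\lambda > 0$, in one common parametrization)

Many examples of useful distributions appear in these beginning chapters, some discrete, some continuous.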

SF Chron 01/26/09

Sampling variation.

"the data $y_1, \ldots, y_n$ will be regarded as the observed values of random variables" - probabilities defined

"ask how we would expect $s(y_1, \ldots, y_n)$ to behave on average, ..., understand the properties of $S = S(Y_1, \ldots, Y_n)$"

$Y_1, \ldots, Y_n$ sample from distribution with mean $\mu$, variance $\sigma^2$

Sample moment $\bar{Y}$
  $E(\bar{Y}) = nE(Y_j)/n = \mu$, unbiased
  uses $E(X + Y) = E(X) + E(Y)$

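$\mathrm{var}(\bar{Y}) = \sigma^2/n$
  uses $\mathrm{var}(X + Y) = \mathrm{var}(X) + \mathrm{var}(Y)$, if uncorrelated
  and $\mathrm{var}(aX) = a^2\, \mathrm{var}(X)$

$\sum (Y_j - \mu)^2 = \sum (Y_j - \bar{Y} + \bar{Y} - \mu)^2 = \sum (Y_j - \bar{Y})^2 + n(\bar{Y} - \mu)^2$
$n\sigma^2 = E\{\sum (Y_j - \bar{Y})^2\} + \sigma^2$
$E(S^2) = \sigma^2$, unbiased

Birth data. $n = 95$, $\bar{y} = 7.57$ hr, $s^2/n = 0.137$ hr$^2$ (so $s/\sqrt{n} \approx 0.37$ hr)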
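The two moment results above can be checked by simulation. This is a rough sketch, assuming normal data with arbitrarily chosen values of `mu`, `sigma`, `n` and the number of replications; it also redoes the arithmetic for the birth-data summaries ($n = 95$, $s^2 = 12.97$ hr$^2$).

```python
import math
import random
import statistics

random.seed(1)
# Illustrative settings, chosen arbitrarily for this sketch.
mu, sigma, n, reps = 0.0, 2.0, 5, 20000

means, variances = [], []
for _ in range(reps):
    sample = [random.gauss(mu, sigma) for _ in range(n)]
    means.append(statistics.fmean(sample))
    variances.append(statistics.variance(sample))  # divisor n - 1

var_of_mean = statistics.pvariance(means)  # compare with sigma**2 / n = 0.8
mean_of_s2 = statistics.fmean(variances)   # compare with sigma**2 = 4.0

# Arithmetic with the birth-data summaries quoted in the slides.
s2_over_n = 12.97 / 95
std_error = math.sqrt(s2_over_n)

print(var_of_mean, mean_of_s2, s2_over_n, std_error)
```

Dividing by $n$ instead of $n - 1$ in the sample variance would give an average near $(n-1)\sigma^2/n$ rather than $\sigma^2$.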

Probability plot. Checking probability model

plot $y_{(j)}$ versus $F^{-1}(j/(n+1))$

For normal take $F = \Phi$
  $\Phi^{-1}$ from table or statistical package

Normal prob plot "works" if $\mu$, $\sigma$ unknown
  For $N(\mu, \sigma^2)$, $E(Y_{(j)}) = \mu + \sigma E(Z_{(j)})$
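The plotting positions for a normal probability plot can be obtained from the standard library. This is a minimal sketch: the function name `normal_plot_points` and the data are made up for illustration, and the points would be passed to whatever plotting tool is at hand.

```python
from statistics import NormalDist

def normal_plot_points(sample):
    """Pairs (Phi^{-1}(j/(n+1)), y_(j)) for j = 1, ..., n."""
    n = len(sample)
    ordered = sorted(sample)
    std_normal = NormalDist()  # mean 0, standard deviation 1
    return [(std_normal.inv_cdf(j / (n + 1)), y)
            for j, y in enumerate(ordered, start=1)]

# Hypothetical sample, made up for illustration.
points = normal_plot_points([3.0, 1.0, 2.0])
for quantile, value in points:
    print(round(quantile, 4), value)
```

If the normal model is reasonable, the points fall near a straight line whose intercept and slope estimate the mean and standard deviation, so neither needs to be known in advance.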

Tools for approximation

Weak law of large numbers.
  $\bar{Y} \to \mu$ in probability as $n \to \infty$
  $\bar{Y}$ is a consistent estimate of $\mu$

Definition. $\{S_n\} \to S$ in probability if for any $\varepsilon > 0$
  $\Pr(|S_n - S| > \varepsilon) \to 0$ as $n \to \infty$

If $S = s_0$, constant, and $h(s)$ is continuous at $s_0$, then
  $h(S_n) \to h(s_0)$ in probability
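Convergence in probability can be illustrated by estimating $\Pr(|\bar{Y} - \mu| > \varepsilon)$ for increasing $n$. The sketch below assumes exponential data with mean 1 and an arbitrary tolerance `eps`; both choices are for illustration only.

```python
import random

random.seed(3)
# Illustrative settings: exponential data with mean 1, tolerance 0.2.
mu, eps, reps = 1.0, 0.2, 500

def exceed_fraction(n):
    """Fraction of replications in which |ybar - mu| exceeds eps."""
    hits = 0
    for _ in range(reps):
        ybar = sum(random.expovariate(1.0) for _ in range(n)) / n
        if abs(ybar - mu) > eps:
            hits += 1
    return hits / reps

fractions = {n: exceed_fraction(n) for n in (10, 100, 1000)}
print(fractions)
```

The estimated probabilities shrink toward zero as $n$ grows, which is the content of the definition.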

Central limit theorem.
  $\sqrt{n}(\bar{Y} - \mu)/\sigma \to Z = N(0,1)$ in distribution as $n \to \infty$

Definition. $\{Z_n\}$ converges in distribution to $Z$ if
  $\Pr(Z_n \le z) \to \Pr(Z \le z)$ as $n \to \infty$
  at every $z$ for which $\Pr(Z \le z)$ is continuous

The CLT provides an approximation for "large" $n$
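A simulation gives a feel for how good the normal approximation is at a moderate $n$. This sketch assumes uniform(0, 1) data, for which $\mu = 1/2$ and $\sigma^2 = 1/12$, and compares the empirical value of $\Pr(Z_n \le 1)$ with $\Phi(1)$; the sample size and number of replications are arbitrary.

```python
import math
import random
from statistics import NormalDist

random.seed(2)
n, reps = 30, 5000                      # illustrative choices
mu, sigma = 0.5, math.sqrt(1 / 12)      # mean and sd of uniform(0, 1)

def standardized_mean():
    """One draw of Z_n = sqrt(n) * (ybar - mu) / sigma."""
    ybar = sum(random.random() for _ in range(n)) / n
    return math.sqrt(n) * (ybar - mu) / sigma

z_values = [standardized_mean() for _ in range(reps)]
empirical = sum(1 for z in z_values if z <= 1.0) / reps
target = NormalDist().cdf(1.0)

print(empirical, target)
```

How large $n$ must be depends on the underlying distribution; heavily skewed or long-tailed data need more observations than the uniform case used here.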

Average as an estimate of $\mu$.

If $X$ is $N(\mu, \sigma^2)$ then $(X - \mu)/\sigma$ is $N(0,1)$

Writing $Z_n = \sqrt{n}(\bar{Y} - \mu)/\sigma$,
  $\bar{Y} = \mu + n^{-1/2} \sigma Z_n$

Indicates how efficiency of $\bar{Y}$ depends on $n$ and $\sigma$

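Covariance and correlation.

$\mathrm{cov}(X, Y) = \sigma_{xy} = E[\{X - E(X)\}\{Y - E(Y)\}]$

sample covariance
  $C_{xy} = \sum_{j=1}^{n} (X_j - \bar{X})(Y_j - \bar{Y})/(n-1)$
  $C_{xy} \to \sigma_{xy}$ in probability

correlation
  $\rho = \mathrm{cov}(X, Y)/\sqrt{\mathrm{var}(X)\,\mathrm{var}(Y)}$, $-1 \le \rho \le 1$
  $R = C_{xy}/\sqrt{C_{xx} C_{yy}}$
  $R \to \rho$ in probability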
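The sample covariance and correlation follow directly from their definitions. The sketch below uses made-up paired data; the function names are chosen here for illustration.

```python
import math

def sample_cov(xs, ys):
    """Sample covariance with divisor n - 1."""
    n = len(xs)
    xbar = sum(xs) / n
    ybar = sum(ys) / n
    return sum((x - xbar) * (y - ybar) for x, y in zip(xs, ys)) / (n - 1)

def sample_corr(xs, ys):
    """Sample correlation R = C_xy / sqrt(C_xx * C_yy)."""
    return sample_cov(xs, ys) / math.sqrt(sample_cov(xs, xs) * sample_cov(ys, ys))

# Hypothetical paired data, made up for illustration.
xs = [1.0, 2.0, 3.0, 4.0, 5.0]
ys = [2.0, 1.0, 4.0, 3.0, 5.0]

c_xy = sample_cov(xs, ys)
r = sample_corr(xs, ys)
print(c_xy, r)
```

The divisor $n - 1$ cancels in $R$, so the correlation is the same whichever divisor is used for the covariances.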

R = -.340

Some more distributions.

Cauchy
  $f(y) = 1/[\pi\{1 + (y - \theta)^2\}]$, $-\infty < y < \infty$
  distribution of $\bar{Y}$ same as that of $Y_1$
  no moments, long tails

Uniform
  $F(u) = 0$, $u \le 0$
  $F(u) = u$, $0 < u \le 1$
  $F(u) = 1$, $1 < u$
  $E(U) = 1/2$, center of gravity

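Exponential
  $f(y) = 0$, $y < 0$
  $f(y) = \lambda \exp\{-\lambda y\}$, $y \ge 0$

Pareto
  $F(y) = 0$, $y < a$
  $F(y) = 1 - (y/a)^{-\gamma}$, $y \ge a$
  $a, \gamma > 0$

Poisson process
  Times of events $y_{(1)}, y_{(2)}, y_{(3)}, \ldots$
  $y_{(1)},\ y_{(2)} - y_{(1)},\ y_{(3)} - y_{(2)}, \ldots$ i.i.d. exponential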
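A Poisson process can be simulated by adding up independent exponential gaps. The sketch below assumes a rate of 2 events per unit time and a horizon of 10 time units (both arbitrary), and relies on the standard fact that the number of events in an interval of length $T$ is then Poisson with mean equal to rate times $T$, so the mean and variance of the counts should agree.

```python
import random

random.seed(4)
# Illustrative settings, chosen arbitrarily for this sketch.
rate, horizon, reps = 2.0, 10.0, 2000

def event_times(rate, horizon):
    """Event times in (0, horizon], built from i.i.d. exponential gaps."""
    times, t = [], 0.0
    while True:
        t += random.expovariate(rate)  # gap between successive events
        if t > horizon:
            return times
        times.append(t)

counts = [len(event_times(rate, horizon)) for _ in range(reps)]
mean_count = sum(counts) / reps
var_count = sum((c - mean_count) ** 2 for c in counts) / (reps - 1)

print(mean_count, var_count)
```

Both summaries should be close to 20 under these settings.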

Chi-squared distribution
  $Z_1, Z_2, \ldots, Z_\nu$ IN(0,1)
  $W = \sum_{j=1}^{\nu} Z_j^2$
  $E(W) = \nu$, $\mathrm{var}(W) = 2\nu$

Multinomial (page 47)
  $p$ classes with probs $\pi_1, \ldots, \pi_p$ adding to 1
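The chi-squared moments can be checked by summing squared standard normal draws. This is a rough sketch with an arbitrary choice of degrees of freedom `nu` and number of replications.

```python
import random
import statistics

random.seed(5)
nu, reps = 4, 20000  # illustrative choices

# Each W is a sum of nu squared independent N(0, 1) variables.
w = [sum(random.gauss(0.0, 1.0) ** 2 for _ in range(nu)) for _ in range(reps)]

mean_w = statistics.fmean(w)    # compare with nu = 4
var_w = statistics.variance(w)  # compare with 2 * nu = 8

print(mean_w, var_w)
```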

Linear combination
  $L = a + \sum b_j Y_j$
  $E(L) = a + \sum b_j \mu_j$
  If independent, $\mathrm{var}(L) = \sum b_j^2 \sigma_j^2$
  If $\{Y_j\}$ are IN$(\mu_j, \sigma_j^2)$, then $L$ is $N(a + \sum b_j \mu_j,\ \sum b_j^2 \sigma_j^2)$

Moment-generating function
  $M_Y(t) = E(\exp\{tY\})$, $t$ real
  $X$, $Y$ independent: $M_{X+Y}(t) = M_X(t) M_Y(t)$
  For $N(\mu, \sigma^2)$: $M(t) = \exp\{t\mu + t^2\sigma^2/2\}$
  The normal is determined by its moments
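As a worked illustration, the moment-generating function gives one route to the normal linear-combination result stated earlier. The sketch below assumes independent $Y_j \sim N(\mu_j, \sigma_j^2)$ and uses the elementary fact $M_{bY}(t) = M_Y(bt)$, which is not on the slide.

```latex
\begin{align*}
M_L(t) &= E\Big[\exp\Big\{t\Big(a + \sum_j b_j Y_j\Big)\Big\}\Big]
        = e^{ta} \prod_j M_{Y_j}(b_j t) \\
       &= e^{ta} \prod_j \exp\Big\{b_j t \mu_j + \tfrac{1}{2} b_j^2 t^2 \sigma_j^2\Big\} \\
       &= \exp\Big\{t\Big(a + \sum_j b_j \mu_j\Big)
          + \tfrac{1}{2} t^2 \sum_j b_j^2 \sigma_j^2\Big\}.
\end{align*}
```

This is the moment-generating function of $N(a + \sum b_j \mu_j,\ \sum b_j^2 \sigma_j^2)$, and since the normal is determined by its moments, $L$ has that distribution.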
