
Understanding Random Data Variation in Data Analysis

This chapter explores how to summarize and display random data, appreciating variation due to randomness. Learn about data summaries, sample statistics, data features, and graphical tools. Various distributions are discussed, alongside the weak law of large numbers, central limit theorem, and covariance concepts.


Presentation Transcript


  1. Chapter 2. Variation. How to: summarize and display random data; appreciate variation due to randomness. Data summaries: a single observation y (a number, curve, image, ...); a sample y1, ..., yn; a statistic s(y1, ..., yn).

  2. Features: location, scale (spread). Sample moments: the average ȳ = (y1 + ... + yn)/n; the sample variance s² = Σ (yj − ȳ)²/(n − 1). Order statistics y(1) ≤ y(2) ≤ ... ≤ y(n): minimum, maximum, median, range; quartiles, quantiles; the p × 100% trimmed average; IQR; MAD = median{|yi − median(yi)|}.
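As an illustrative sketch (not from the slides), the summaries above can be computed with numpy; the data values and variable names are our own:

```python
import numpy as np

y = np.array([1.5, 3.0, 4.0, 7.5, 9.0, 12.0, 19.0])
n = len(y)

ybar = y.sum() / n                      # average (y1 + ... + yn)/n
s2 = ((y - ybar) ** 2).sum() / (n - 1)  # sample variance, n-1 divisor

y_sorted = np.sort(y)                   # order statistics y(1) <= ... <= y(n)
med = np.median(y)
iqr = np.percentile(y, 75) - np.percentile(y, 25)
mad = np.median(np.abs(y - med))        # MAD = median{|yi - median(yi)|}
```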

  3. Bad data. Outlier: an observation unusual compared with the others. Resistance: the trimmed average. Example (midwife birth data), hours in labor by day: n = 95, ȳ = 7.57 hr, s² = 12.97 hr², (min, med, max) = (1.5, 7.5, 19) hr, quartiles 4.95 and 9.75 hr.
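A sketch of resistance with our own toy numbers (not the midwife data): one wild outlier moves the average a lot but barely moves the trimmed average.

```python
import numpy as np
from scipy import stats

clean = np.array([4.0, 5.0, 6.0, 7.0, 8.0, 9.0, 10.0])
dirty = np.append(clean, 100.0)          # add one outlier

mean_shift = abs(dirty.mean() - clean.mean())
# trim_mean(..., 0.2) drops the lowest and highest 20% before averaging
trim_shift = abs(stats.trim_mean(dirty, 0.2) - stats.trim_mean(clean, 0.2))
```

Here mean_shift is over 11 hours while trim_shift stays below 1: the trimmed average resists the bad value.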

  4. Graphs: indispensable in data analysis. Histogram: disjoint bins [L + (k − 1)Δ, L + kΔ); plot the count nk or the proportion nk/n. EDF: #{yj ≤ y}/n, which estimates the CDF, Prob{Y ≤ y}. Scatter plot: (uj, vj). Parallel boxplots compare location, scale, shape, and outliers: median, quartiles, whiskers at 1.5 IQR.
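A minimal sketch of the EDF formula above, Fhat(y) = #{yj ≤ y}/n, evaluated pointwise; the function name and sample are our own:

```python
import numpy as np

def edf(sample, y):
    """Empirical distribution function: #{yj <= y}/n."""
    sample = np.asarray(sample)
    return np.count_nonzero(sample <= y) / sample.size

y = [2.0, 1.0, 3.0, 2.0]
```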

  5. Random sample: Y1, ..., Yn independent with CDF F. Mean E(Y) = ∫ y dF(y) (= ∫ y f(y) dy if there is a density f). p quantile: yp = F⁻¹(p). Laplace (continuous): f(y) = exp{−|y − η|/τ}/(2τ), −∞ < y < ∞. Poisson (discrete): Prob(Y = y) = f(y) = λ^y exp{−λ}/y!, y = 0, 1, 2, .... Counts of daily arrivals: Poisson; hours of labor: gamma.
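A sketch evaluating the two densities above both by hand and via scipy.stats; the parameter values (eta, tau, lam) are arbitrary choices of ours:

```python
import math
from scipy import stats

eta, tau = 2.0, 1.5          # Laplace location and scale
lam = 3.0                    # Poisson mean

# Laplace density f(y) = exp(-|y - eta|/tau) / (2*tau)
y = 4.0
f_laplace = math.exp(-abs(y - eta) / tau) / (2 * tau)
f_laplace_scipy = stats.laplace.pdf(y, loc=eta, scale=tau)

# Poisson pmf f(y) = lam**y * exp(-lam) / y!
k = 2
f_poisson = lam**k * math.exp(-lam) / math.factorial(k)
f_poisson_scipy = stats.poisson.pmf(k, lam)
```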

  6. Gamma: f(y) = λ^α y^(α−1) exp{−λy}/Γ(α) for y > 0, with shape α > 0 and rate λ > 0. Many examples of useful distributions will be provided in these beginning chapters, some discrete, some continuous.

  7. [Slide shows a newspaper clipping: San Francisco Chronicle, 01/26/09.]

  8. Sampling variation. "The data y1, ..., yn will be regarded as the observed values of random variables" — so probabilities are defined. We "ask how we would expect s(y1, ..., yn) to behave on average, ..., understand the properties of S = S(Y1, ..., Yn)". Let Y1, ..., Yn be a sample from a distribution with mean μ and variance σ². Sample moment Ȳ: E(Ȳ) = nE(Yj)/n = μ, so Ȳ is unbiased. (This uses E(X + Y) = E(X) + E(Y).)

  9. var(Ȳ) = σ²/n, using var(X + Y) = var(X) + var(Y) when X, Y are uncorrelated, and var(aX) = a² var(X). Also Σ (Yj − μ)² = Σ (Yj − Ȳ + Ȳ − μ)² = Σ (Yj − Ȳ)² + n(Ȳ − μ)², so taking expectations, nσ² = E(Σ (Yj − Ȳ)²) + σ², and hence E(S²) = σ²: S² is unbiased. Birth data: n = 95, ȳ = 7.57 hr, s²/n = 0.137 hr².
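The unbiasedness claims of slides 8-9 can be checked by simulation; this is an illustrative sketch with arbitrary parameters, not the chapter's own code:

```python
import numpy as np

rng = np.random.default_rng(0)
mu, sigma, n, reps = 5.0, 2.0, 20, 20000

samples = rng.normal(mu, sigma, size=(reps, n))
ybars = samples.mean(axis=1)
s2s = samples.var(axis=1, ddof=1)       # n-1 divisor, as in E(S^2) = sigma^2

mean_of_ybar = ybars.mean()             # ~ mu = 5
var_of_ybar = ybars.var()               # ~ sigma^2 / n = 0.2
mean_of_s2 = s2s.mean()                 # ~ sigma^2 = 4
```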

  10. Probability plot: checking a probability model. Plot y(j) versus F⁻¹(j/(n + 1)). For the normal take F = Φ, from a table or a statistical package. The normal probability plot "works" even if μ, σ are unknown: for N(μ, σ²), E(Y(j)) = μ + σ E(Z(j)), so the points fall near a line with intercept μ and slope σ.
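A sketch of the recipe above: compute the plot positions Φ⁻¹(j/(n + 1)) with scipy's norm.ppf and check near-linearity against the order statistics via their correlation (variable names are ours; no plotting, just the coordinates):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
n = 200
y = np.sort(rng.normal(10.0, 3.0, size=n))   # order statistics y(j)
j = np.arange(1, n + 1)
q = stats.norm.ppf(j / (n + 1))              # Phi^{-1}(j/(n+1))

r = np.corrcoef(q, y)[0, 1]                  # near 1 for normal data
```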

  11. Tools for approximation. Weak law of large numbers: Ȳ → μ in probability as n → ∞, so Ȳ is a consistent estimate of μ. Definition: {Sn} → S in probability if for any ε > 0, Pr(|Sn − S| > ε) → 0 as n → ∞. If S = s0, a constant, and h(s) is continuous at s0, then h(Sn) → h(s0) in probability.
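A simulation sketch of the WLLN: the empirical Pr(|Ȳn − μ| > ε) shrinks as n grows. Distribution, ε, and sample sizes are arbitrary choices of ours:

```python
import numpy as np

rng = np.random.default_rng(2)
mu, eps, reps = 1.0, 0.1, 5000

def exceed_prob(n):
    """Fraction of replicates with |Ybar_n - mu| > eps."""
    ybars = rng.exponential(mu, size=(reps, n)).mean(axis=1)
    return np.mean(np.abs(ybars - mu) > eps)

p_small, p_large = exceed_prob(10), exceed_prob(1000)
```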

  12. Central limit theorem: √n(Ȳ − μ)/σ → Z = N(0, 1) in distribution as n → ∞. Definition: {Zn} converges in distribution to Z if Pr(Zn ≤ z) → Pr(Z ≤ z) as n → ∞ at every z at which Pr(Z ≤ z) is continuous. The CLT provides an approximation for "large" n.
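A CLT sketch (our own setup, not the chapter's): standardized averages of a skewed exponential sample behave approximately like N(0, 1); we compare one tail probability with Φ.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(3)
n, reps = 200, 20000
mu = sigma = 1.0                       # exponential(1): mean 1, sd 1

ybars = rng.exponential(mu, size=(reps, n)).mean(axis=1)
z = np.sqrt(n) * (ybars - mu) / sigma  # sqrt(n)(Ybar - mu)/sigma

tail = np.mean(z > 1.0)                # compare with 1 - Phi(1)
```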

  13. Average as an estimate of μ. If X is N(μ, σ²) then (X − μ)/σ is N(0, 1). Writing Zn = √n(Ȳ − μ)/σ gives Ȳ = μ + n^(−1/2) σ Zn, which indicates how the efficiency of Ȳ depends on n and σ.

  14. Covariance and correlation. cov(X, Y) = σxy = E[{X − E(X)}{Y − E(Y)}]. Sample covariance: Cxy = Σ (Xj − X̄)(Yj − Ȳ)/(n − 1), summing over j = 1, ..., n; Cxy → σxy in probability. Correlation: ρ = cov(X, Y)/√[var(X) var(Y)], with −1 ≤ ρ ≤ 1. Sample version: R = Cxy/√[Cxx Cyy]; R → ρ in probability.
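A sketch computing Cxy and R by hand from the formulas above and checking them against numpy's built-ins; the data are arbitrary:

```python
import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.0, 1.0, 4.0, 3.0, 6.0])
n = len(x)

# Cxy = sum (Xj - Xbar)(Yj - Ybar) / (n - 1)
cxy = ((x - x.mean()) * (y - y.mean())).sum() / (n - 1)
# R = Cxy / sqrt(Cxx * Cyy)
r = cxy / np.sqrt(x.var(ddof=1) * y.var(ddof=1))
```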

  15. [Slide shows a scatter plot with sample correlation R = −0.340.]

  16. Some more distributions. Cauchy: f(y) = 1/[π{1 + (y − θ)²}], −∞ < y < ∞; the distribution of Ȳ is the same as that of Y1; no moments, long tails. Uniform: F(u) = 0 for u ≤ 0, = u for 0 < u ≤ 1, = 1 for 1 < u; E(U) = 1/2, the center of gravity.
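A sketch of the Cauchy pathology above: since Ȳ has the same distribution as a single draw, averaging does not tighten the spread. We compare interquartile ranges of Ȳ at two sample sizes (all numbers are our own choices):

```python
import numpy as np

rng = np.random.default_rng(4)
reps = 20000

def iqr_of_average(n):
    """IQR of the average of n standard-Cauchy draws."""
    ybars = rng.standard_cauchy(size=(reps, n)).mean(axis=1)
    return np.percentile(ybars, 75) - np.percentile(ybars, 25)

# standard Cauchy quartiles are -1 and 1, so the IQR is 2 at ANY n
iqr_small, iqr_large = iqr_of_average(1), iqr_of_average(100)
```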

  17. Exponential: f(y) = 0 for y < 0, = λ exp{−λy} for y ≥ 0. Pareto: F(y) = 0 for y < a, = 1 − (y/a)^(−α) for y ≥ a, with a, α > 0. Poisson process: times of events y(1), y(2), y(3), ...; the gaps y(1), y(2) − y(1), y(3) − y(2), ... are i.i.d. exponential.
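A sketch of the last fact above, run in reverse: summing i.i.d. exponential gaps gives event times, and the count of events in [0, T) is then Poisson with mean λT. The rate, horizon, and replicate count are arbitrary:

```python
import numpy as np

rng = np.random.default_rng(5)
lam, T, reps = 2.0, 10.0, 5000

counts = np.empty(reps)
for i in range(reps):
    gaps = rng.exponential(1 / lam, size=200)  # i.i.d. exponential gaps
    times = np.cumsum(gaps)                    # event times y(1), y(2), ...
    counts[i] = np.count_nonzero(times < T)    # events before time T

mean_count = counts.mean()                     # ~ lam * T = 20
var_count = counts.var()                       # ~ lam * T too (Poisson)
```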

  18. Chi-squared distribution: Z1, Z2, ..., Zν independent N(0, 1), W = Σ Zj², summing over j = 1, ..., ν; then E(W) = ν and var(W) = 2ν. Multinomial (page 47): p classes with probabilities π1, ..., πp adding to 1.
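A simulation sketch of E(W) = ν and var(W) = 2ν: sums of ν squared standard normals, with ν and the replicate count chosen arbitrarily by us.

```python
import numpy as np

rng = np.random.default_rng(6)
nu, reps = 5, 40000

w = (rng.standard_normal(size=(reps, nu)) ** 2).sum(axis=1)
mean_w = w.mean()                      # ~ nu = 5
var_w = w.var()                        # ~ 2 * nu = 10
```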

  19. Linear combination: L = a + Σ bj Yj; E(L) = a + Σ bj μj. If the Yj are independent, var(L) = Σ bj² σj². If the Yj are independent N(μj, σj²), then L is N(a + Σ bj μj, Σ bj² σj²).
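A sketch of the moment formulas above, with the coefficients and normal parameters picked arbitrarily by us; the simulated mean and variance of L match a + Σ bj μj and Σ bj² σj²:

```python
import numpy as np

rng = np.random.default_rng(7)
a = 1.0
b = np.array([2.0, -1.0, 0.5])
mus = np.array([0.0, 3.0, -2.0])
sigmas = np.array([1.0, 2.0, 0.5])

reps = 40000
Y = rng.normal(mus, sigmas, size=(reps, 3))  # independent Yj
L = a + Y @ b                                # L = a + sum(bj * Yj)

mean_L = L.mean()   # ~ a + sum(b * mus) = -3
var_L = L.var()     # ~ sum(b**2 * sigmas**2) = 8.0625
```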

  20. Moment-generating function: MY(t) = E(exp{tY}), t real. If X, Y are independent, MX+Y(t) = MX(t) MY(t). For N(μ, σ²), M(t) = exp{μt + σ²t²/2}. The normal is determined by its moments.
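A sketch checking the normal MGF formula above against a Monte Carlo estimate of E(exp{tY}); μ, σ, and t are arbitrary values of ours:

```python
import numpy as np

rng = np.random.default_rng(8)
mu, sigma, t = 0.5, 1.0, 0.3

y = rng.normal(mu, sigma, size=200000)
mgf_mc = np.exp(t * y).mean()                     # E(exp(tY)) by simulation
mgf_exact = np.exp(t * mu + t**2 * sigma**2 / 2)  # exp(mu*t + sigma^2 t^2/2)
```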
