
Lecture 4: Fitting distributions: goodness of fit


jatin

Presentation Transcript


  1. Lecture 4: Fitting distributions: goodness of fit • Goodness of fit • Testing goodness of fit • Testing normality • An important note on testing normality! Bio 4118 Applied Biostatistics

  2. Goodness of fit • Measures the extent to which an empirical distribution “fits” the distribution expected under the null hypothesis. [Figure: observed and expected frequencies versus fork length]

  3. Goodness of fit: the underlying principle • If the match between observed and expected is poorer than would be expected on the basis of measurement precision, then we should reject the null hypothesis. [Figure: two histograms of observed and expected frequency versus fork length, one labelled “Accept H0” and one labelled “Reject H0”]

  4. Testing goodness of fit: the Chi-square statistic (X²) • Used for frequency data, i.e. the number of observations/results in each of n categories compared to the number expected under the null hypothesis. [Figure: observed and expected frequencies by category/class]
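
The transcript does not preserve a formula for the statistic, so the sketch below assumes the standard Pearson form, X² = Σ (O − E)² / E summed over categories. The counts are hypothetical and the helper name `pearson_chi_square` is chosen only for illustration.

```python
def pearson_chi_square(observed, expected):
    """Return X^2 = sum((O - E)^2 / E) over all categories."""
    if len(observed) != len(expected):
        raise ValueError("observed and expected must have the same length")
    return sum((o - e) ** 2 / e for o, e in zip(observed, expected))


# Hypothetical counts in 4 categories (not from the lecture).
# H0 here: all categories are equally likely.
observed = [18, 30, 24, 28]
n = sum(observed)
expected = [n / len(observed)] * len(observed)

x2 = pearson_chi_square(observed, expected)
df = len(observed) - 1
print(f"X^2 = {x2:.2f} with {df} degrees of freedom")
```

The statistic grows as observed counts move away from expected counts, which is why large values count as evidence against the null hypothesis.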

  5. How to translate X² into p? • Compare to the χ² distribution with n - 1 degrees of freedom. • If p is less than the desired α level, reject the null hypothesis. [Figure: χ² probability density (df = 5) with α = 0.05 marked; example χ² = 8.5, p = 0.13, accept H0]
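
As a rough illustration of this step, the sketch below converts the statistic shown in the slide's figure (χ² = 8.5, df = 5) into a p value using SciPy's χ² distribution. Using SciPy is an assumption of this example; the lecture does not name any software.

```python
from scipy.stats import chi2

x2, df, alpha = 8.5, 5, 0.05

# Upper-tail probability: P(chi-square with df degrees of freedom >= x2).
p = chi2.sf(x2, df)

# Equivalent decision rule: compare x2 with the critical value at alpha.
critical = chi2.ppf(1 - alpha, df)

print(f"p = {p:.3f}, critical value = {critical:.2f}")
print("reject H0" if p < alpha else "do not reject H0")
```

Comparing p with α and comparing the statistic with the critical value always lead to the same decision.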

  6. Testing goodness of fit: the log likelihood-ratio Chi-square statistic (G) • Similar to X², and usually gives similar results. • In some cases, G is more conservative (i.e. will give higher p values). [Figure: observed and expected frequencies by category/class]
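
The slide gives no formula for G, so this sketch assumes the usual definition, G = 2 Σ O ln(O / E), and compares it with Pearson's X² on hypothetical counts. Both helper names are illustrative.

```python
import math


def g_statistic(observed, expected):
    """Return G = 2 * sum(O * ln(O / E)); empty categories contribute zero."""
    return 2.0 * sum(
        o * math.log(o / e) for o, e in zip(observed, expected) if o > 0
    )


def pearson_chi_square(observed, expected):
    """Return X^2 = sum((O - E)^2 / E) for comparison."""
    return sum((o - e) ** 2 / e for o, e in zip(observed, expected))


# Hypothetical counts (not from the lecture); H0: equal frequencies.
observed = [18, 30, 24, 28]
expected = [sum(observed) / len(observed)] * len(observed)

g = g_statistic(observed, expected)
x2 = pearson_chi_square(observed, expected)
print(f"G = {g:.3f}, X^2 = {x2:.3f}")
```

With moderate counts like these, the two statistics land close together, so they would normally lead to the same conclusion.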

  7. χ² versus the distribution of X² or G • For both X² and G, p values are calculated assuming a χ² distribution... • ...but as n decreases, both deviate more and more from χ². [Figure: probability densities (df = 5) of χ², of X²/G for small n, and of X²/G for very small n]

  8. Assumptions (X² and G) • n is larger than 30. • Expected frequencies are all larger than 5. • Test is quite robust except when there are only 2 categories (df = 1). • For 2 categories, both X² and G overestimate χ², leading to rejection of the null hypothesis with probability greater than α, i.e. the test is liberal.

  9. What if n is too small, there are only 2 categories, etc.? • Collect more data, thereby increasing n. • If there are more than 2 categories, combine categories. • Use a correction factor. • Use another test. [Figure: panels labelled “More data” and “Classes combined”]

  10. Corrections for 2 categories • For 2 categories, both X² and G overestimate χ², leading to rejection of the null hypothesis with probability greater than α, i.e. the test is liberal. • Continuity correction: move each observed frequency 0.5 toward its expected value, i.e. reduce each |O - E| by 0.5. • Williams’ correction: divide the test statistic (G or X²) by a correction factor [formula not preserved in the transcript].
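
A minimal sketch of both corrections on hypothetical two-category counts. The continuity correction is written as shrinking each |O − E| by 0.5. The Williams factor is not preserved in the transcript; the form used here, q = 1 + (a² − 1) / (6 n (a − 1)) for a categories and n observations, is the one commonly given in biostatistics texts and should be treated as an assumption.

```python
import math


def g_statistic(observed, expected):
    """Uncorrected G = 2 * sum(O * ln(O / E))."""
    return 2.0 * sum(
        o * math.log(o / e) for o, e in zip(observed, expected) if o > 0
    )


def continuity_corrected_chi_square(observed, expected):
    """Pearson X^2 after shrinking each |O - E| by 0.5 (never below zero)."""
    return sum(
        max(abs(o - e) - 0.5, 0.0) ** 2 / e for o, e in zip(observed, expected)
    )


def williams_corrected_g(observed, expected):
    """G divided by q = 1 + (a^2 - 1) / (6 * n * (a - 1)); assumed form of q."""
    a, n = len(observed), sum(observed)
    q = 1.0 + (a * a - 1) / (6.0 * n * (a - 1))
    return g_statistic(observed, expected) / q


# Hypothetical data: 16 vs 4 observations, H0 expects 10 : 10.
observed, expected = [16, 4], [10, 10]
x2_corrected = continuity_corrected_chi_square(observed, expected)
g_raw = g_statistic(observed, expected)
g_corrected = williams_corrected_g(observed, expected)
print(f"corrected X^2 = {x2_corrected:.2f}")
print(f"G = {g_raw:.3f}, Williams-corrected G = {g_corrected:.3f}")
```

Both corrections make the statistic smaller, which counteracts the liberal behaviour described for df = 1.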

  11. The binomial test • Used when there are 2 categories. • No assumptions. • Calculate exact probability of obtaining N - k individuals in category 1 and k individuals in category 2, with k = 0, 1, 2,... N. [Figure: binomial distribution, p = 0.5, N = 10; probability versus number of observations]

  12. An example: sex ratio of beavers • H0: sex ratio is 1:1, so p = 0.5 = q. • P(0 males or 0 females) = 0.00195 • P(1 of one sex and 9 of the other) = 0.0195 • P(9 or more individuals of the same sex) = 0.0215, or 2.15%. • This is below α = 0.05; therefore, reject H0.
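
The probabilities on this slide can be reproduced with an exact binomial calculation. The sketch assumes N = 10 individuals, which is the sample size shown in the binomial figure on the previous slide and is consistent with the quoted values.

```python
from math import comb


def binom_pmf(k, n, p):
    """Exact binomial probability of k successes in n trials."""
    return comb(n, k) * p**k * (1 - p) ** (n - k)


n, p = 10, 0.5  # N = 10 is assumed; H0 gives p = 0.5

# All ten individuals of one sex (0 males or 0 females).
p_all_one_sex = binom_pmf(0, n, p) + binom_pmf(10, n, p)

# Nine of one sex and one of the other.
p_nine_one = binom_pmf(1, n, p) + binom_pmf(9, n, p)

# Two-sided probability of a split at least as extreme as 9 : 1.
p_two_sided = p_all_one_sex + p_nine_one

print(f"{p_all_one_sex:.5f} {p_nine_one:.5f} {p_two_sided:.5f}")
```

Because the probability is summed over both tails, this is a two-sided test of the 1:1 sex ratio.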

  13. Multinomial test • Simple extension of the binomial test for more than 2 categories. • Must specify the category probabilities under the null hypothesis, e.g. p, q and r for 3 categories, with p + q + r = 1.0 (so only 2 of them are free). • No assumptions... • ...but so tedious that in practice X² is used.

  14. Multinomial test: segregation ratios • Hypothesis: both parents Aa, therefore segregation ratio is 1 AA: 2 Aa: 1 aa. • So under H0, p = 0.25, q = 0.50, r = 0.25. • For N = 60, p < 0.001. • Therefore, reject H0.
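
One common way to carry out an exact multinomial test is to sum the probabilities of every possible outcome that is no more likely than the observed one. The sketch below does this for the 1:2:1 hypothesis. The observed counts are hypothetical, because the slide does not report the N = 60 data, and a small N is used to keep the enumeration short.

```python
from math import factorial


def multinomial_pmf(counts, probs):
    """Exact multinomial probability of the given category counts."""
    coef = factorial(sum(counts))
    for c in counts:
        coef //= factorial(c)
    prob = float(coef)
    for c, p in zip(counts, probs):
        prob *= p**c
    return prob


def compositions(total, parts):
    """Yield every tuple of `parts` non-negative integers summing to `total`."""
    if parts == 1:
        yield (total,)
        return
    for first in range(total + 1):
        for rest in compositions(total - first, parts - 1):
            yield (first,) + rest


def exact_multinomial_p(observed, probs):
    """Sum the probabilities of all outcomes no more likely than `observed`."""
    p_obs = multinomial_pmf(observed, probs)
    limit = p_obs * (1 + 1e-9)  # tolerance for floating-point ties
    outcomes = compositions(sum(observed), len(observed))
    return sum(pr for pr in (multinomial_pmf(c, probs) for c in outcomes) if pr <= limit)


h0 = (0.25, 0.50, 0.25)   # 1 AA : 2 Aa : 1 aa
observed = (8, 3, 1)      # hypothetical counts of AA, Aa, aa with N = 12
p_value = exact_multinomial_p(observed, h0)
print(f"exact multinomial p = {p_value:.5f}")
```

The number of outcomes to enumerate grows quickly with N and with the number of categories, which is the tedium the previous slide refers to.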

  15. Goodness of fit: testing normality • Since normality is an assumption of many parametric statistical tests, testing for normality is often required. • Tests for normality include X² or G, Kolmogorov-Smirnov, Shapiro-Wilk & Lilliefors. [Figure: observed frequencies and frequencies expected under the hypothesis of a normal distribution, by category/class]

  16. Cumulative distributions • Areas under the normal probability density function and the cumulative normal distribution function. [Figure: normal probability density function and cumulative normal distribution function plotted from μ - 3σ to μ + 3σ, with areas 2.28%, 50.00% and 68.27% marked]

  17. X² or G test for normality • Put data in classes (histogram) and compute expected frequencies based on the normal distribution discretized into the same classes. • Calculate X². • Requires large samples (kmin = 10) and is not powerful because of loss of information. [Figure: observed frequencies and frequencies expected under the hypothesis of a normal distribution, by category/class]
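
The steps above can be sketched as follows, assuming NumPy and SciPy. The data are simulated rather than taken from the lecture, and two choices are this example's own: placing class boundaries at quantiles of the fitted normal, and removing two extra degrees of freedom for the estimated mean and standard deviation.

```python
import numpy as np
from scipy.stats import chisquare, norm

rng = np.random.default_rng(0)
x = rng.normal(loc=40.0, scale=8.0, size=200)  # simulated data

k = 10                               # number of classes
mu, sd = x.mean(), x.std(ddof=1)     # parameters estimated from the sample

# Inner class boundaries at quantiles of the fitted normal, so every class
# has the same expected frequency n / k (here 20, well above 5).
inner_edges = norm.ppf(np.arange(1, k) / k, loc=mu, scale=sd)
observed = np.bincount(np.searchsorted(inner_edges, x), minlength=k)
expected = np.full(k, x.size / k)

# ddof=2 accounts for the two estimated parameters: df = k - 1 - 2.
stat, p = chisquare(observed, expected, ddof=2)
print(f"X^2 = {stat:.2f}, df = {k - 3}, p = {p:.3f}")
```

Binning discards the exact position of each observation within its class, which is the loss of information mentioned on the slide.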

  18. “Non-statistical” assessments of normality • Do normal probability plot of normal equivalent deviates (NEDs) versus X. • If line appears more or less straight, then data are approximately normally distributed. [Figure: NEDs versus X for normal and non-normal data]
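
A rough sketch of the idea using `scipy.stats.probplot`, which pairs the ordered data with theoretical normal quantiles (the NEDs) and reports the correlation r of the fitted line. Both samples are simulated, and using r as a summary of straightness is a convenience of this example, not a formal test.

```python
import numpy as np
from scipy.stats import probplot

rng = np.random.default_rng(1)
samples = {
    "normal": rng.normal(50.0, 10.0, size=100),   # simulated normal data
    "skewed": rng.exponential(10.0, size=100),    # simulated non-normal data
}

straightness = {}
for name, sample in samples.items():
    # probplot returns the plotting coordinates and a least-squares line fit.
    (neds, ordered_x), (slope, intercept, r) = probplot(sample, dist="norm")
    straightness[name] = r
    print(f"{name}: r = {r:.3f}")
```

Passing a Matplotlib axes object through the `plot` argument of `probplot` would draw the plot itself; the visual check for a straight line is what the slide describes.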

  19. Kolmogorov-Smirnov goodness of fit • Compares observed cumulative distribution to expected cumulative distribution under the null hypothesis. • p is based on Dmax, the maximum absolute difference between observed and expected cumulative relative frequencies. [Figure: observed and expected cumulative frequency versus X, with Dmax marked]

  20. An example: wing length in flies • 10 flies with wing lengths: 4, 4.5, 4.9, 5.0, 5.1, 5.3, 5.5, 5.6, 5.7, 5.8, 5.9, 6.0 • cumulative relative frequencies: .1, .2, .3, .4, .5, .6, .7, .8, .9, 1.0 [Figure: cumulative frequency versus wing length (4.0 to 6.0), with Dmax marked]
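
To make Dmax concrete, the sketch below computes it by hand for the wing lengths exactly as listed on the slide and checks the result against `scipy.stats.kstest`. The slide does not state the null distribution, so the normal with mean 5.0 and standard deviation 0.5 used here is purely hypothetical.

```python
import numpy as np
from scipy.stats import kstest, norm

# Wing lengths as listed in the transcript.
wing = np.array([4, 4.5, 4.9, 5.0, 5.1, 5.3, 5.5, 5.6, 5.7, 5.8, 5.9, 6.0])

# Hypothetical extrinsic null hypothesis (parameters fixed in advance).
mu0, sd0 = 5.0, 0.5

x = np.sort(wing)
n = x.size
expected_cdf = norm.cdf(x, loc=mu0, scale=sd0)

# The empirical CDF jumps at each observation, so check both sides of the step.
d_plus = np.max(np.arange(1, n + 1) / n - expected_cdf)
d_minus = np.max(expected_cdf - np.arange(0, n) / n)
d_max = max(d_plus, d_minus)

stat, p = kstest(wing, "norm", args=(mu0, sd0))
print(f"Dmax by hand = {d_max:.4f}, kstest statistic = {stat:.4f}, p = {p:.3f}")
```

The hand calculation and the library statistic agree, which shows that the test depends on the data only through the single largest gap between the two cumulative curves.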

  21. Lilliefors test • KS test is conservative for tests in which the expected distribution is based on sample statistics. • Lilliefors corrects for this to produce a more reliable test. • Should be used when the null hypothesis is intrinsic (parameters estimated from the sample) rather than extrinsic (parameters specified in advance).
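
The conservativeness of the plain KS test can be seen in a small simulation: draw truly normal samples, estimate the mean and standard deviation from each sample, and count how often the uncorrected test rejects at α = 0.05. The sample size, number of replicates, and seed are arbitrary choices for this sketch.

```python
import numpy as np
from scipy.stats import kstest

rng = np.random.default_rng(2)
n, reps, alpha = 30, 500, 0.05

rejections = 0
for _ in range(reps):
    x = rng.normal(size=n)  # H0 is true: the data really are normal
    # Intrinsic hypothesis: parameters are estimated from the same sample.
    _, p = kstest(x, "norm", args=(x.mean(), x.std(ddof=1)))
    rejections += p < alpha

rate = rejections / reps
print(f"rejection rate under H0 = {rate:.3f} (nominal {alpha})")
```

A rejection rate far below the nominal 5% means the uncorrected test rarely rejects even when it should, so it has little power. For an actual Lilliefors test, the statsmodels package offers `statsmodels.stats.diagnostic.lilliefors`; that pointer is an addition of this example and is not mentioned in the lecture.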

  22. An important note on testing normality! • When N is small, most tests have low power. • Hence, very large deviations are required in order to reject the null. • When N is large, power is high. • Hence, very small deviations from normality will be sufficient to reject the null. • So, exercise common sense!
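
This point can be illustrated with a simulation. The sketch applies the Shapiro-Wilk test to samples from a mildly heavy-tailed distribution (Student's t with 5 degrees of freedom) at a small and a large sample size. The distribution, sample sizes, and replicate count are assumptions chosen for illustration.

```python
import numpy as np
from scipy.stats import shapiro

rng = np.random.default_rng(3)
alpha, reps = 0.05, 200


def rejection_rate(n):
    """Fraction of simulated samples of size n for which normality is rejected."""
    hits = 0
    for _ in range(reps):
        x = rng.standard_t(df=5, size=n)  # same non-normal shape at every n
        _, p = shapiro(x)
        hits += p < alpha
    return hits / reps


rate_small = rejection_rate(15)
rate_large = rejection_rate(2000)
print(f"N = 15: rejected {rate_small:.0%}; N = 2000: rejected {rate_large:.0%}")
```

The departure from normality is identical in both cases; only the power to detect it changes with N, so the test result alone does not say whether the departure matters for the analysis.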
