
Data Handling/Statistics


arleen


Presentation Transcript


  1. Data Handling/Statistics • There is no substitute for books: you need professional help! • My personal favorites, from which this lecture is drawn: • The Cartoon Guide to Statistics, L. Gonick & W. Smith • Data Reduction and Error Analysis for the Physical Sciences, P. R. Bevington • Workshop Statistics, A. J. Rossman & B. L. Chance • Numerical Recipes, W. H. Press, B. P. Flannery, S. A. Teukolsky and W. T. Vetterling • Origin 6.1 Users Manual, MicroCal Corporation

  2. Outline • Our motto • What those books look like • Stuff you need to be able to look up • Samples & Populations • Mean, Standard Deviation, Standard Error • Probability • Random Variables • Propagation of Errors • Stuff you must be able to do on a daily basis • Plot • Fit • Interpret

  3. An opposing, non-CMC IGERT viewpoint: The “progress” of civilization relies on being able to do more and more things while thinking less and less about them. Our Motto: That which can be taught can be learned.

  4. What those books look like The Cartoon Guide to Statistics

  5. The Cartoon Guide to Statistics In this example, the author provides step-by-step analysis of the statistics of a poll. Similar logic and style tell you how to tell two populations apart, whether your measly five replicate runs truly represent the situation, etc. The Cartoon Guide gives an enjoyable account of statistics in scientific and everyday life.

  6. An Introduction to Error Analysis A very readable text, but with enough math to be rigorous. The cover says it all – the book’s emphasis is how statistics and error analysis are important in the everyday. Author John Taylor is known as “Mr. Wizard” at Univ. of Colorado, for his popular science lectures aimed at youngsters.

  7. Bevington Bevington is really good at introducing basic concepts, along with simple code that really, really works. Our lab uses a lot of Bevington code, often translated from Fortran to Visual Basic.

  8. “Workshop Statistics” This book has a website full of data that it tells you how to analyze. The test cases are often pretty interesting, too. Many little shadow boxes provide info.

  9. “Numerical Recipes” A more modern and thicker version of Bevington. Code comes in Fortran, C, Basic (others?). Includes advanced topics like digital filtering, but harder to read on the simpler things. With this plus Bevington and a lot of time, you can fit, smooth, filter practically anything.

  10. Stuff you need to be able to look up Samples vs. Populations • Samples: the world as we understand it, based on science. • Populations: the world as God understands it, based on omniscience. Statistics is not art but artifice: a bridge to help us understand phenomena, based on limited observations.

  11. Our problem Sitting behind the target, can we say with some specific level of confidence whether a circle drawn around this single arrow (a measurement) hits the bullseye (the population mean)? Measuring a molecular weight by one Zimm plot, can we say with any certainty that we have obtained the same answer God would have gotten?

  12. Sample View: direct, experimental, tangible The single most important thing about this is the reduction in the standard deviation of the mean (the standard error) according to the inverse root of n, i.e. 1/√n. Three times better takes 9 times longer (or costs 9 times more, or takes 9 times more disk space). If you remembered nothing else from this lecture, it would be a success!
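
The inverse-root-n rule can be checked with a few lines of Python. This is only a sketch: the function names `standard_error` and `runs_needed` are made up for illustration, and the scatter of individual measurements is assumed to stay the same as more runs are added.

```python
import math
import statistics


def standard_error(values):
    """Standard error of the mean: sample standard deviation / sqrt(n)."""
    return statistics.stdev(values) / math.sqrt(len(values))


def runs_needed(current_runs, improvement):
    """Runs needed to shrink the standard error by the factor `improvement`.

    Assumes the scatter of single measurements does not change, so the
    standard error scales as 1/sqrt(n) and n must grow as improvement**2.
    """
    return current_runs * improvement ** 2


# Going from 5 runs to a 3x smaller standard error costs 9x the runs.
print(runs_needed(5, 3))
```

With five runs in hand, a threefold improvement asks for 45 runs, which is the "three times better takes 9 times longer" rule in numbers.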

  13. Population View: conceptual, layered with arcana! The purple equation in the table is an expression of the central limit theorem. If we measure many averages, we do not always get the same average:

  14. Huh? It means…if you want to estimate σ, which only God really knows, you should measure many averages, each involving n data points, figure their standard deviation, and multiply by √n. This is hard work! A lot of times, σ is approximated by s. If you wanted to estimate the population average μ, the best you can do is to measure many averages and average those. A lot of times μ is approximated by x̄. IT’S HARD TO KNOW WHAT GOD DOES. I think the σ in the purple equation should be an s, but the equation only works in the limit of large n anyhow, so there is no difference.
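
Here is a small simulation of that recipe, offered as a sketch rather than a prescription. We play God by choosing the population values ourselves (`true_mu` and `true_sigma` are invented numbers), draw many averages of n points each from a normal distribution, and then recover σ and μ the hard way.

```python
import math
import random
import statistics


def estimate_from_averages(true_mu, true_sigma, n, n_averages, seed=0):
    """Estimate mu and sigma from many averages of n points each."""
    rng = random.Random(seed)  # fixed seed so the sketch is repeatable
    averages = []
    for _ in range(n_averages):
        sample = [rng.gauss(true_mu, true_sigma) for _ in range(n)]
        averages.append(statistics.fmean(sample))
    # Standard deviation of the averages, multiplied by sqrt(n), estimates sigma.
    sigma_est = statistics.stdev(averages) * math.sqrt(n)
    # Averaging the averages estimates mu.
    mu_est = statistics.fmean(averages)
    return mu_est, sigma_est


mu_est, sigma_est = estimate_from_averages(
    true_mu=100.0, true_sigma=5.0, n=10, n_averages=2000
)
print(f"mu is about {mu_est:.2f}, sigma is about {sigma_est:.2f}")
```

Two thousand averages of ten points each is 20,000 measurements, which is why s and x̄ from a single modest sample are usually what we settle for.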

  15. You got to compromise, fool! The t-distribution was invented by a statistician named Gosset, who was forced by his employer (the Guinness brewery!) to publish under a pseudonym. He chose “Student,” and his t-distribution is known as Student’s t. The Student’s t distribution helps us assign confidence in our imperfect experiments on small samples. Input: desired confidence level, estimate of population mean (or estimated probability), estimated error of the mean (or probability). Output: ± something
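
A minimal sketch of the input/output just described, for the 95% confidence level only. The five replicate values are made up, the function name is hypothetical, and the critical values are the standard two-sided 95% Student's t entries found in any table; a statistics package would normally look them up for you.

```python
import math
import statistics

# Two-sided 95% critical values of Student's t, keyed by degrees of freedom.
T_95 = {1: 12.706, 2: 4.303, 3: 3.182, 4: 2.776, 5: 2.571,
        6: 2.447, 7: 2.365, 8: 2.306, 9: 2.262, 10: 2.228}


def confidence_interval_95(values):
    """Return (mean, half_width) so that the interval is mean ± half_width."""
    n = len(values)
    mean = statistics.fmean(values)
    sem = statistics.stdev(values) / math.sqrt(n)  # estimated error of the mean
    half_width = T_95[n - 1] * sem                 # n - 1 degrees of freedom
    return mean, half_width


runs = [10.1, 9.8, 10.3, 10.0, 9.9]  # made-up replicate measurements
mean, half_width = confidence_interval_95(runs)
print(f"{mean:.2f} ± {half_width:.2f}")
```

Note how large the multiplier is for small samples: with five runs the standard error gets stretched by 2.776, not by the 1.96 a normal distribution would suggest.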

  16. Probability …is another arcane concept in the “population” category: something we would like to know but cannot. As a concept, it’s wonderful. The true mean of a distribution of mass is given by summing (discrete system) or integrating (continuous system) the probability of each mass times that mass. The standard deviation follows a similarly simple rule. In what follows, F means a normalized frequency (think mole fraction!) and P is a probability density. P(x)dx represents the fraction of things (think molecules) with property x (think mass) between x − dx/2 and x + dx/2.
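
For the discrete case, the rule can be sketched in a few lines. The masses and counts below are invented for illustration; F is built by normalizing the counts so the frequencies sum to one.

```python
import math


def mean_and_std(values, counts):
    """Mean and standard deviation of a discrete distribution."""
    total = sum(counts)
    freqs = [c / total for c in counts]  # normalized frequencies F, sum to 1
    mean = sum(f * x for f, x in zip(freqs, values))
    variance = sum(f * (x - mean) ** 2 for f, x in zip(freqs, values))
    return mean, math.sqrt(variance)


masses = [100.0, 200.0, 300.0]  # hypothetical masses
counts = [1, 2, 1]              # hypothetical number of molecules of each mass
mean, std = mean_and_std(masses, counts)
print(f"mean = {mean:.1f}, standard deviation = {std:.1f}")
```

The continuous case replaces the sums with integrals over P(x)dx; the logic is the same.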

  17. Here’s a normal probability density distribution from “Workshop…” where you use actual data to discover. • ±σ: 68% of results • ±2σ: 95% of results

  18. What it means Although you don’t usually know the distribution (either μ or σ), about 68% of your measurements will fall within ±1σ of μ…if the distribution is a “normal,” bell-shaped curve. t-tests allow you to kinda play this backwards: given a finite sample size, with some average, x̄, and standard deviation, s (inferior to μ and σ, respectively), how far away do we think the true μ is?

  19. Details No way I could do it better than “Cartoon…” or “Workshop…” Remember…this is the part of the lecture entitled “things you must be able to look up.”

  20. Propagation of errors Suppose you give 30 people a ruler and ask them to measure the length and width of a room. Owing to general incompetence, otherwise known as human nature, you will get not one answer but many. Your averages will be L and W, with standard deviations s_L and s_W. Now, you want to buy carpet, so you need the area A = L·W. What is the uncertainty in A due to the measurement errors in L and W? Answer! There is no telling…but you have several options to estimate it.

  21. A = L·W example Here are your measured data: You can consider “most” and “least” cases:

  22. Another way We can use a formula for how σ propagates. Suppose some function y (think area) depends on two measured quantities t and s (think length & width). Then the variance in y follows this rule (for independent errors in t and s): $\sigma_y^2 = \left(\frac{\partial y}{\partial t}\right)^2 \sigma_t^2 + \left(\frac{\partial y}{\partial s}\right)^2 \sigma_s^2$. Aren’t you glad you took partial differential equations? What??!! You didn’t? Well, sign up. PDE is the bare minimum math for scientists.

  23. Translation in our case, where A = L·W: $\sigma_A^2 = W^2 \sigma_L^2 + L^2 \sigma_W^2$. Problem: we don’t know W, L, σ_L or σ_W! These are population numbers we could only get if we had the entire planet measure this particular room. We therefore assume that our measurement set is large enough (n = 30) that we can use our measured averages for W and L and our measured standard deviations for σ_L and σ_W.
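
Both options for the carpet problem can be sketched side by side: the “most” and “least” cases, and the propagation formula for A = L·W, which gives a variance of W² times the variance of L plus L² times the variance of W. The room dimensions and standard deviations below are hypothetical stand-ins for the measured values, and the formula assumes the errors in L and W are independent.

```python
import math


def area_uncertainty(length, width, s_length, s_width):
    """Propagated standard deviation of A = L*W for independent errors."""
    return math.sqrt((width * s_length) ** 2 + (length * s_width) ** 2)


def least_and_most(length, width, s_length, s_width):
    """Area if both measurements are one standard deviation low, then high."""
    least = (length - s_length) * (width - s_width)
    most = (length + s_length) * (width + s_width)
    return least, most


# Hypothetical averages and standard deviations (any consistent length unit).
L, W, s_L, s_W = 20.0, 15.0, 0.5, 0.4

A = L * W
s_A = area_uncertainty(L, W, s_L, s_W)
least, most = least_and_most(L, W, s_L, s_W)
print(f"A = {A:.0f} ± {s_A:.1f}  (least {least:.1f}, most {most:.1f})")
```

The most/least range comes out wider than the propagated ± because it assumes both measurements err in the same direction at once, which independent errors rarely do.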

  24. Error propagation caveats The equation, $\sigma_y^2 = \left(\frac{\partial y}{\partial t}\right)^2 \sigma_t^2 + \left(\frac{\partial y}{\partial s}\right)^2 \sigma_s^2$, assumes normal behavior. Large systematic errors (for example, 3 euroguys who report their values in metric units) are not taken into consideration properly. In many cases, there will be good a priori knowledge about the uncertainty in one or more parameters: in photon counting, if N is the number of photons detected, then σ_N = √N. Systematic error is not included in this estimate, so photon folk are well advised to just repeat experiments to determine real standard deviations that do take systematic errors into account.

  25. Stuff you must know how to do on a daily basis: Plot!!! r = 0.99987, r² = 0.9997: 99.97% of the trend can be explained by the fitted relation. Intercept = 0.003 ± 45 (i.e., zero!)

  26. How to find this file! The same data: r = 0.444, r² = 0.20. Only 20% of the variation in the data can be explained by the line! While Γ depended on q², Dapp does not!

  27. What does the famous “r²” really tell us? Suppose you invented a new polymer that you hoped was more stable over time than its predecessor…So you check.
      time:           2      4      8      12     16     24     36     48
      melting point:  110.2  110.9  108.8  109.1  109.0  108.5  110.0  109.2

  28. Question (same time and melting point data as the previous slide): What describes the data better: a simple average (meaning things aren’t really changing over time: it is stable) OR a trend (meaning melting point might be dropping over time)?

  29. How well does the mean describe the data? The differences between each point and the mean are called “residuals.” The sum of the squares of all the residuals, St, characterizes how well the data fit the mean. (St = 4.6788)

  30. How much better is a fit (i.e., a regression in this case)? The regression also has residuals. The sum of their squares, Sr, is smaller than St. (Sr = 4.3079)

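  31. The r² value simply compares the fit to the mean, by comparing the sums of the squares: $r^2 = (S_t - S_r)/S_t$, with $S_t$ the sum of squared residuals about the mean and $S_r$ the sum of squared residuals about the regression line. In our case, r² = (4.6788 − 4.3079)/4.6788 = 0.079: the fit was NOT a dramatic improvement, explaining only 7.9% of the variability of the data!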
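
The melting point numbers can be run through this comparison directly. The sketch below uses the time and melting point data from the table above with an ordinary straight-line least-squares fit; the function name and the labels `s_t` and `s_r` for the two sums of squares are just names chosen for this example.

```python
import statistics

time = [2, 4, 8, 12, 16, 24, 36, 48]
melting_point = [110.2, 110.9, 108.8, 109.1, 109.0, 108.5, 110.0, 109.2]


def sums_of_squares(x, y):
    """Return (s_t, s_r, r2) for a straight-line least-squares fit of y on x."""
    x_bar = statistics.fmean(x)
    y_bar = statistics.fmean(y)
    s_xx = sum((xi - x_bar) ** 2 for xi in x)
    s_xy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    slope = s_xy / s_xx
    intercept = y_bar - slope * x_bar
    # Residuals about the mean, then residuals about the fitted line.
    s_t = sum((yi - y_bar) ** 2 for yi in y)
    s_r = sum((yi - (intercept + slope * xi)) ** 2 for xi, yi in zip(x, y))
    return s_t, s_r, (s_t - s_r) / s_t


s_t, s_r, r2 = sums_of_squares(time, melting_point)
print(f"St = {s_t:.3f}, Sr = {s_r:.3f}, r2 = {r2:.3f}")
```

Both sums of squares land on the values quoted in the slides, and r² comes out near 0.079, so the line buys almost nothing over the plain average.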

  32. Plot showing 95% confidence limits. Excel doesn’t excel at this!

  33. Interpreting data: Life on the bleeding edge of cutting technology. Or is that bleating edge? The noise level in individual runs is much less than the run-to-run variation. That’s why many runs are a good idea. More would be good here, but we are still overcoming the shock that we can do this at all!

  34. Excel does not automatically provide ± estimates! Correlation Caveat! Correlation ≠ Cause. No, Correlation = Association. 58% of the variation in life expectancy is associated with TVs. Would we save lives by sending TVs to Uganda?

  35. Linearize it! Linearity is improved by plotting life expectancy vs. people per TV rather than TVs per person. Observant scientists are adept at seeing curvature. Train your eye by looking for defects in wallpaper, door trim, lumber bought at Home Depot, etc. And try to straighten out your data, rather than let the computer fit a nonlinear form, which it is quite happy to do!

  36. Plots are pictures of science, worth thousands of words in boring tables. These 4 plots all have the same slopes, intercepts and r values!
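
The transcript does not say which four data sets were plotted. One well-known set with exactly this property is Anscombe's quartet, so the sketch below uses it as an assumption, not as a reconstruction of the slide. The helper `fit_line` is a hypothetical name for a plain least-squares fit.

```python
import math


def fit_line(x, y):
    """Return (slope, intercept, r) for a least-squares line through (x, y)."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    s_xx = sum((xi - x_bar) ** 2 for xi in x)
    s_yy = sum((yi - y_bar) ** 2 for yi in y)
    s_xy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    slope = s_xy / s_xx
    return slope, y_bar - slope * x_bar, s_xy / math.sqrt(s_xx * s_yy)


# Anscombe's quartet (assumed here; sets I-III share the same x values).
X123 = [10, 8, 13, 9, 11, 14, 6, 4, 12, 7, 5]
X4 = [8, 8, 8, 8, 8, 8, 8, 19, 8, 8, 8]
QUARTET = {
    "I": (X123, [8.04, 6.95, 7.58, 8.81, 8.33, 9.96, 7.24, 4.26, 10.84, 4.82, 5.68]),
    "II": (X123, [9.14, 8.14, 8.74, 8.77, 9.26, 8.10, 6.13, 3.10, 9.13, 7.26, 4.74]),
    "III": (X123, [7.46, 6.77, 12.74, 7.11, 7.81, 8.84, 6.08, 5.39, 8.15, 6.42, 5.73]),
    "IV": (X4, [6.58, 5.76, 7.71, 8.84, 8.47, 7.04, 5.25, 12.50, 5.56, 7.91, 6.89]),
}

results = {name: fit_line(x, y) for name, (x, y) in QUARTET.items()}
for name, (slope, intercept, r) in results.items():
    print(f"{name}: slope = {slope:.2f}, intercept = {intercept:.2f}, r = {r:.2f}")
```

The numbers agree to two decimal places across all four sets, yet one is a curve, one is a line with a single outlier, and one is a vertical stack plus a lone point. Only the plot tells you which is which.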

  37. From whence do those lines come? Least squares fitting. “Linear fits”: the fitted coefficients appear linearly in the expression, e.g., y = a + bx + cx² + dx³. An analytical “best fit” exists! “Nonlinear fits”: at least some of the fitted coefficients appear in transcendental arguments, e.g., y = a + b·exp(−cx) + d·cos(ex). Best fit found by trial & error. Beware false solutions! Try several initial guesses!

  38. CURVE FITTING: Fit the trend or fit the points? Earth’s mean annual temp has natural fluctuations year to year. To capture a long term trend, we don’t want to fit the points, so use a low-order polynomial regression.

  39. BUT, The bumps and jiggles in the U.S. population data are ‘real.’ We don’t want to lose them in a simple trend.

  40. REGRESSION: We lost the baby boom! SINGLE POLYNOMIAL: Does funny things (see 1905). SPLINE: YES: Lots of individual polynomials give us a smooth fit (especially good for interpolation).

  41. Not all data points are created equal. Since that one point has so much error (or noise), should we really worry about minimizing its square? No. We should minimize “chi-squared”: $\chi^2 = \frac{1}{\nu} \sum_i \frac{(y_i - y_{\mathrm{fit},i})^2}{\sigma_i^2}$, a goodness of fit parameter that should be unity for a “fit within error.” ν is the number of degrees of freedom: ν = n − (number of parameters fitted).
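
A small sketch of the idea, with invented numbers: five data points, a two-parameter straight-line fit that is assumed to have been done already, and one point that is far off the line but also carries a large error bar. The function name is hypothetical.

```python
def reduced_chi_squared(y, y_fit, sigma, n_params):
    """Chi-squared per degree of freedom, with nu = n - n_params."""
    nu = len(y) - n_params
    chi2 = sum(((yi - fi) / si) ** 2 for yi, fi, si in zip(y, y_fit, sigma))
    return chi2 / nu


# Hypothetical data, fitted values and per-point standard deviations.
y = [1.1, 1.9, 3.2, 3.9, 6.0]
y_fit = [1.0, 2.0, 3.0, 4.0, 5.0]
sigma = [0.1, 0.1, 0.2, 0.1, 1.0]

plain_sum = sum((yi - fi) ** 2 for yi, fi in zip(y, y_fit))
chi2_nu = reduced_chi_squared(y, y_fit, sigma, n_params=2)
print(f"unweighted sum of squares = {plain_sum:.2f}")
print(f"reduced chi-squared = {chi2_nu:.2f}")
```

In the unweighted sum the last point supplies nearly all of the total. Once each residual is divided by its own σ, every point contributes equally, because each one misses the line by exactly one standard deviation.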

  42. Why is a fit based on chi-squared so special? Based on chi: these two curves fit equally well! Based on |chi| (absolute value): these three curves fit equally well! Based on max(chi): outliers exert too strong an influence!

  43. χ² caveats • Chi-square lower than unity is meaningless…if you trust your σ² estimates in the first place. • Fitting too many parameters will lower χ² but this may be just doing a better and better job of fitting the noise! • A fit should go smoothly THROUGH the noise, not follow it! • There is such a thing as enforcing a “parsimonious” fit by minimizing a quantity a bit more complicated than χ². This is done when you have a priori information that the fitted line must be “smooth”.

  44. Achtung! Warning! This lecture is an example of a very dangerous phenomenon: “what you need to know.” Before you were born, I took a statistics course somewhere in undergraduate school. Most of this stuff I learned from experience….um… experiments. A proper math course, or a course from LSU’s Department of Experimental Statistics would firm up your knowledge greatly. AND BUY THOSE BOOKS! YOU WILL NEED THEM!

  45. Cool Excel/Origin Demo
