Data Handling/Statistics
• There is no substitute for books: you need professional help!
• My personal favorites, from which this lecture is drawn:
• The Cartoon Guide to Statistics, L. Gonick & W. Smith
• Data Reduction and Error Analysis for the Physical Sciences, P. R. Bevington
• Workshop Statistics, A. J. Rossman & B. L. Chance
• Numerical Recipes, W. H. Press, B. P. Flannery, S. A. Teukolsky, and W. T. Vetterling
• Origin 6.1 User's Manual, MicroCal Corporation
Outline
• Our motto
• What those books look like
• Stuff you need to be able to look up
  • Samples & Populations
  • Mean, Standard Deviation, Standard Error
  • Probability
  • Random Variables
  • Propagation of Errors
• Stuff you must be able to do on a daily basis
  • Plot
  • Fit
  • Interpret
An opposing, non-CMC IGERT viewpoint
The “progress” of civilization relies on being able to do more and more things while thinking less and less about them.
Our Motto
That which can be taught can be learned.
What those books look like
The Cartoon Guide to Statistics
The Cartoon Guide to Statistics
In this example, the author provides a step-by-step analysis of the statistics of a poll. Similar logic and style tell you how to tell two populations apart, whether your measly five replicate runs truly represent the situation, etc. The Cartoon Guide gives an enjoyable account of statistics in scientific and everyday life.
Bevington Bevington is really good at introducing basic concepts, along with simple code that really, really works. Our lab uses a lot of Bevington code, often translated from Fortran to Visual Basic.
“Workshop Statistics” This book has a website full of data that it tells you how to analyze. The test cases are often pretty interesting, too. Many little shadow boxes provide info.
“Numerical Recipes”
A more modern and thicker version of Bevington. Code comes in Fortran, C, Basic (others?). It includes advanced topics like digital filtering, but it is harder to read on the simpler things. With this plus Bevington and a lot of time, you can fit, smooth, or filter practically anything.
Stuff you need to be able to look up
Samples vs. Populations
• The sample: the world as we understand it, based on science.
• The population: the world as God understands it, based on omniscience.
Statistics is not art but artifice: a bridge to help us understand phenomena, based on limited observations.
Our problem Sitting behind the target, can we say with some specific level of confidence whether a circle drawn around this single arrow (a measurement) hits the bullseye (the population mean)? Measuring a molecular weight by one Zimm plot, can we say with any certainty that we have obtained the same answer God would have gotten?
Sample View: direct, experimental, tangible
The single most important thing here is the reduction in standard deviation (or standard error) of the mean according to inverse root n. Three times better takes 9 times longer (or costs 9 times more, or takes 9 times more disk space). If you remember nothing else from this lecture but this, it will have been a success!
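To see the inverse-root-n rule in action, here is a minimal Python sketch (simulated normal data with an assumed σ = 1; the seed and sample sizes are arbitrary, not from the lecture):

```python
# A minimal sketch of the 1/sqrt(n) rule: the scatter of sample means
# shrinks as inverse root n.
import numpy as np

rng = np.random.default_rng(0)
sigma = 1.0  # population standard deviation of the simulated data (assumed)

for n in (4, 16, 64, 256):
    # 10,000 simulated experiments, each averaging n measurements
    means = rng.normal(0.0, sigma, size=(10_000, n)).mean(axis=1)
    print(f"n = {n:4d}: std of the mean = {means.std():.4f} "
          f"(theory sigma/sqrt(n) = {sigma / np.sqrt(n):.4f})")
```

Quadrupling n halves the scatter of the mean, which is exactly the “three times better takes 9 times longer” arithmetic.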
Population View: conceptual, layered with arcana!
The purple equation in the table, σ_x̄ = σ/√n, is an expression of the central limit theorem. If we measure many averages, we do not always get the same average.
Huh? It means… if you want to estimate σ, which only God really knows, you should measure many averages, each involving n data points, figure their standard deviation, and multiply by n^(1/2). This is hard work! A lot of times, σ is approximated by s. If you wanted to estimate the population average μ, the best you can do is to measure many averages and average those. A lot of times μ is approximated by x̄. IT’S HARD TO KNOW WHAT GOD DOES. I think the σ in the purple equation should be an s, but the equation only works in the limit of large n anyhow, so there is no difference.
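Spelled out, the sample/population bookkeeping runs like this (a reconstruction consistent with the surrounding text, not the original table graphic):

$$\bar{x} = \frac{1}{n}\sum_{i=1}^{n} x_i \;\approx\; \mu, \qquad s = \sqrt{\frac{1}{n-1}\sum_{i=1}^{n}\left(x_i - \bar{x}\right)^2} \;\approx\; \sigma, \qquad \sigma_{\bar{x}} = \frac{\sigma}{\sqrt{n}} \approx \frac{s}{\sqrt{n}}$$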
You got to compromise, fool!
The t-distribution was invented by a statistician named Gosset, who was forced by his employer (the Guinness brewery!) to publish under a pseudonym. He chose “Student”, and his t-distribution is known as Student’s t. The Student’s t distribution helps us assign confidence in our imperfect experiments on small samples. Input: desired confidence level, estimate of the population mean (or estimated probability), and estimated error of the mean (or probability). Output: ± something.
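Here is a minimal sketch of that input/output in Python (the five data values are hypothetical; scipy.stats.t supplies the critical value):

```python
# A minimal sketch of a Student's-t confidence interval for the mean
# of a small sample.
import numpy as np
from scipy import stats

data = np.array([9.8, 10.2, 10.1, 9.7, 10.3])  # five replicate runs (assumed)
n = data.size
mean = data.mean()
sem = data.std(ddof=1) / np.sqrt(n)             # estimated error of the mean

conf = 0.95                                     # desired confidence level
t_crit = stats.t.ppf(0.5 + conf / 2, df=n - 1)  # two-sided critical t value
print(f"mean = {mean:.2f} +/- {t_crit * sem:.2f} at {conf:.0%} confidence")
```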
Probability
…is another arcane concept in the “population” category: something we would like to know but cannot. As a concept, it’s wonderful. The true mean of a distribution of mass is the sum, over all masses, of each mass times its probability; the standard deviation follows a similarly simple rule. In what follows, F means a normalized frequency (think mole fraction!) for a discrete system, and P is a probability density for a continuous system. P(x)dx represents the fraction of things (think molecules) with property x (think mass) between x − dx/2 and x + dx/2.
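Written out for the discrete and continuous columns of the slide (standard formulas, reconstructed rather than copied from the graphic):

$$\mu = \sum_i x_i F_i \;\;\text{(discrete)}, \qquad \mu = \int x\,P(x)\,dx \;\;\text{(continuous)}$$

$$\sigma^2 = \sum_i \left(x_i-\mu\right)^2 F_i \;\;\text{(discrete)}, \qquad \sigma^2 = \int \left(x-\mu\right)^2 P(x)\,dx \;\;\text{(continuous)}$$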
Here’s a normal probability density distribution from “Workshop…”, where you use actual data to discover:
• ±1σ contains about 68% of results
• ±2σ contains about 95% of results
What it means
Although you don’t usually know the distribution (either μ or σ), about 68% of your measurements will fall within 1σ of μ… if the distribution is a “normal”, bell-shaped curve. t-tests allow you to kinda play this backwards: given a finite sample of some size, with some average x̄ and standard deviation s (inferior to μ and σ, respectively), how far away do we think the true μ is?
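A quick simulated check of the 68%/95% claim (a Python sketch with assumed μ = 5 and σ = 2, nothing from the lecture’s data):

```python
# A minimal sketch verifying the 68%/95% rule for a normal distribution.
import numpy as np

rng = np.random.default_rng(1)
x = rng.normal(loc=5.0, scale=2.0, size=100_000)  # assumed mu = 5, sigma = 2

within_1s = np.mean(np.abs(x - 5.0) <= 2.0)       # fraction within 1 sigma
within_2s = np.mean(np.abs(x - 5.0) <= 4.0)       # fraction within 2 sigma
print(f"within 1 sigma: {within_1s:.1%}")         # about 68.3%
print(f"within 2 sigma: {within_2s:.1%}")         # about 95.4%
```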
Details
No way I could do it better than “Cartoon…” or “Workshop…”. Remember… this is the part of the lecture entitled “stuff you need to be able to look up.”
Propagation of errors
Suppose you give 30 people a ruler and ask them to measure the length and width of a room. Owing to general incompetence, otherwise known as human nature, you will get not one answer but many. Your averages will be L̄ and W̄, with standard deviations s_L and s_W. Now, you want to buy carpet, so you need the area A = L·W. What is the uncertainty in A due to the measurement errors in L and W? Answer! There is no telling… but you have several options to estimate it.
A = L·W example
From your table of measured L and W values, you can consider “most” and “least” cases:
A_most = (L̄ + s_L)·(W̄ + s_W),  A_least = (L̄ − s_L)·(W̄ − s_W).
These bracket the answer, but they exaggerate the likely error: L and W are unlikely to both be off by their full standard deviations in the same direction.
Another way
We can use a formula for how s propagates. Suppose some function y (think area) depends on two measured quantities t and s (think length & width). Then the variance in y follows this rule (assuming independent errors):
$$\sigma_y^2 = \left(\frac{\partial y}{\partial t}\right)^2 \sigma_t^2 + \left(\frac{\partial y}{\partial s}\right)^2 \sigma_s^2$$
Aren’t you glad you took partial differential equations? What??!! You didn’t? Well, sign up. PDE is the bare minimum math for scientists.
Translation
In our case, where A = L·W, the partial derivatives are ∂A/∂L = W and ∂A/∂W = L, so
$$\sigma_A^2 = W^2 \sigma_L^2 + L^2 \sigma_W^2$$
Problem: we don’t know W, L, σ_L, or σ_W! These are population numbers we could only get if we had the entire planet measure this particular room. We therefore assume that our measurement set is large enough (n = 30) that we can use our measured averages for W and L and our standard deviations s_L and s_W for σ_L and σ_W.
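Numerically, with hypothetical numbers standing in for the room data (the 5.12 m and 3.94 m values and their uncertainties are made up for illustration):

```python
# A minimal sketch of propagating measurement error through A = L * W,
# compared against the "most"/"least" bracketing cases.
import numpy as np

L_bar, s_L = 5.12, 0.08   # assumed mean and std. dev. of length (m)
W_bar, s_W = 3.94, 0.06   # assumed mean and std. dev. of width (m)

A = L_bar * W_bar
# Propagation-of-variance rule: s_A^2 = W^2 s_L^2 + L^2 s_W^2
s_A = np.sqrt((W_bar * s_L) ** 2 + (L_bar * s_W) ** 2)

# "Most" and "least" bracketing cases for comparison
A_most = (L_bar + s_L) * (W_bar + s_W)
A_least = (L_bar - s_L) * (W_bar - s_W)

print(f"A = {A:.2f} +/- {s_A:.2f} m^2 (propagation)")
print(f"bracket: {A_least:.2f} to {A_most:.2f} m^2 (most/least)")
```

Notice the bracket is wider than the propagated ±σ_A, for the reason given above.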
Error propagation caveats
The propagation equation above assumes normal behavior. Large systematic errors (for example, 3 euroguys who report their values in metric units) are not taken into consideration properly. In many cases, there will be good a priori knowledge about the uncertainty in one or more parameters: in photon counting, if N is the number of photons detected, then σ_N = N^(1/2). Systematic error is not included in this estimate, so photon folk are well advised to just repeat experiments to determine real standard deviations that do take systematic errors into account.
Stuff you must know how to do on a daily basis
Plot!!!
r = 0.99987, r² = 0.9997: 99.97% of the trend can be explained by the fitted relation. Intercept = 0.003 ± 45 (i.e., zero!)
The same data
r = 0.444, r² = 0.20: only 20% of the variation can be explained by the line! While Γ depended on q², D_app does not!
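If you want the r and r² the slides quote, scipy.stats.linregress reports them directly; here is a sketch on made-up data:

```python
# A minimal sketch of a straight-line fit reporting r and r^2.
import numpy as np
from scipy import stats

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0])
y = np.array([2.1, 3.9, 6.2, 7.8, 10.1, 12.2])  # roughly linear, assumed data

fit = stats.linregress(x, y)
print(f"slope = {fit.slope:.3f} +/- {fit.stderr:.3f}")
print(f"intercept = {fit.intercept:.3f} +/- {fit.intercept_stderr:.3f}")
print(f"r = {fit.rvalue:.4f}, r^2 = {fit.rvalue**2:.4f}")
```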
Plot showing 95% confidence limits. Excel doesn’t excel at this!
Interpreting data: life on the bleeding edge of cutting technology. Or is that the bleating edge? The noise level in individual runs is much less than the run-to-run variation; that’s why many runs are a good idea. More would be good here, but we are still overcoming the shock that we can do this at all!
Excel does not automatically provide estimates! Correlation Caveat! Correlation ≠ Cause. No, Correlation = Association. 58% of life expectancy is associated with TVs. Would we save lives by sending TVs to Uganda?
Linearize it!
Linearity is improved by plotting Life vs. people per TV rather than TVs per person. Observant scientists are adept at seeing curvature. Train your eye by looking for defects in wallpaper, door trim, lumber bought at Home Depot, etc. And try to straighten out your data, rather than let the computer fit a nonlinear form, which it is quite happy to do!
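Here is a generic Python sketch of the straighten-it-out strategy, assuming exponential data y = a·e^(bx) (simulated, with assumed a = 3 and b = 0.8; the transform is the point, not the numbers):

```python
# A minimal sketch of linearizing y = a * exp(b x): fit a straight line
# to ln(y) vs x instead of fitting the curve itself.
import numpy as np

rng = np.random.default_rng(2)
x = np.linspace(0, 5, 20)
y = 3.0 * np.exp(0.8 * x) * rng.normal(1.0, 0.02, x.size)  # assumed a=3, b=0.8

# Straight-line fit to the transformed data: ln y = ln a + b x
b_fit, ln_a_fit = np.polyfit(x, np.log(y), 1)
print(f"a = {np.exp(ln_a_fit):.3f} (true 3.0), b = {b_fit:.3f} (true 0.8)")
```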
Plots are pictures of science, worth thousands of words in boring tables. These 4 plots all have the same slopes, intercepts, and r values!
From whence do those lines come? Least-squares fitting.
• “Linear fits”: the fitted coefficients appear linearly in the expression, e.g., y = a + bx + cx² + dx³. An analytical “best fit” exists!
• “Nonlinear fits”: at least some of the fitted coefficients appear in transcendental arguments, e.g., y = a + b·e^(−cx) + d·cos(ex). Best fit found by trial & error. Beware false solutions! Try several initial guesses!
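A minimal Python sketch of both cases (simulated data; the model and initial guesses are assumptions for illustration): numpy.polyfit solves the linear case analytically, while scipy.optimize.curve_fit iterates from an initial guess, so we try several and keep the best:

```python
# A minimal sketch contrasting an exact linear least-squares fit with an
# iterative nonlinear fit that can land in false solutions.
import numpy as np
from scipy.optimize import curve_fit

rng = np.random.default_rng(3)
x = np.linspace(0, 10, 50)
y = 2.0 + 1.5 * np.exp(-0.4 * x) + rng.normal(0, 0.02, x.size)  # assumed model

# Linear fit: coefficients enter linearly, solved analytically.
coeffs = np.polyfit(x, y, 3)   # cubic y = a + b x + c x^2 + d x^3

# Nonlinear fit: y = a + b exp(-c x); try several starting guesses and
# keep the one with the smallest sum of squared residuals.
def model(x, a, b, c):
    return a + b * np.exp(-c * x)

best = None
for guess in ([1, 1, 0.1], [2, 2, 1.0], [0, 5, 5.0]):
    try:
        popt, _ = curve_fit(model, x, y, p0=guess)
    except RuntimeError:       # this guess failed to converge
        continue
    ssr = np.sum((y - model(x, *popt)) ** 2)
    if best is None or ssr < best[0]:
        best = (ssr, popt)

print("nonlinear best fit (a, b, c):", np.round(best[1], 3))
```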
All data points are not created equal
Since that one point has so much error (or noise), should we really worry about minimizing its square? No. We should minimize “chi-squared”:
$$\chi_\nu^2 = \frac{1}{\nu}\sum_{i=1}^{n}\left(\frac{y_i - f(x_i)}{\sigma_i}\right)^2$$
a goodness-of-fit parameter that should be unity for a “fit within error”. Here n is the number of data points and ν = n − (# of parameters fitted) is the number of degrees of freedom.
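A sketch of a chi-squared (weighted) fit in Python, with simulated data whose last few points are ten times noisier (all numbers assumed):

```python
# A minimal sketch of a weighted fit: curve_fit's sigma argument makes it
# minimize chi-squared rather than a plain sum of squares.
import numpy as np
from scipy.optimize import curve_fit

rng = np.random.default_rng(4)
x = np.linspace(0, 10, 20)
sig = np.where(x > 8, 1.0, 0.1)                 # last few points are noisy
y = 1.0 + 0.5 * x + rng.normal(0, sig)          # assumed true line a=1, b=0.5

def line(x, a, b):
    return a + b * x

popt, pcov = curve_fit(line, x, y, sigma=sig, absolute_sigma=True)
resid = (y - line(x, *popt)) / sig
chi2_nu = np.sum(resid**2) / (x.size - 2)       # nu = n - 2 fitted parameters
print("a, b =", np.round(popt, 3), " reduced chi-squared =", round(chi2_nu, 2))
```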
χ² caveats
• Chi-squared lower than unity is meaningless… if you trust your σ² estimates in the first place.
• Fitting too many parameters will lower χ², but this may just be doing a better and better job of fitting the noise!
• A fit should go smoothly THROUGH the noise, not follow it!
• There is such a thing as enforcing a “parsimonious” fit by minimizing a quantity a bit more complicated than χ² (one common form is sketched below). This is done when you have a priori information that the fitted line must be “smooth”.
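One common form of that “more complicated quantity” (a generic curvature-penalized objective; the lecturer’s exact choice is not specified) is

$$\Phi = \chi^2 + \lambda \sum_i \left(f_{i+1} - 2 f_i + f_{i-1}\right)^2$$

where the f_i are the fitted values and λ trades goodness of fit against smoothness: larger λ enforces a more parsimonious, smoother curve.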
Achtung! Warning! This lecture is an example of a very dangerous phenomenon: “what you need to know.” Before you were born, I took a statistics course somewhere in undergraduate school. Most of this stuff I learned from experience… um… experiments. A proper math course, or a course from LSU’s Department of Experimental Statistics, would firm up your knowledge greatly. AND BUY THOSE BOOKS! YOU WILL NEED THEM!