Data analysis & basic statistics Xiao Wu xiao.wu@yale.edu
Purpose of this workshop • Statistics as a useful tool to analyze results • Basic terminology and most commonly used tests • Exposure to more advanced statistical tools
Why do we need statistics? • Summary • Classification • Interpretation • Pattern searching • Abnormality identification • Prediction • Interpolation • Extrapolation
Summary • Mean, median, mode • Variance, standard deviation • Max, min values and range • Quartiles http://www.mymarketresearchmethods.com/descriptive-inferential-statistics-difference/
Example Firm A • Mean: $5,800 Firm B • Mean: $5,000
Example Firm A • Mean: $5,800 • Median: $4,000 • SD: $7,270 • 3rd Quartile: $4,000 • 1st Quartile: $500 Firm B • Mean: $5,000 • Median: $5,000 • SD: $203 • 3rd Quartile: $5,175 • 1st Quartile: $4,825
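The two-firm comparison shows why a mean on its own can mislead. A minimal sketch of the summary statistics listed above, using Python's standard `statistics` module. The salary lists are made-up illustrative numbers, not the data behind the slide, and the `summarize` helper is a hypothetical name.

```python
import statistics

# Hypothetical salaries (illustrative only, NOT the data behind the slide).
firm_a = [500, 500, 1000, 4000, 4000, 4000, 25000]   # one very high earner
firm_b = [4800, 4900, 5000, 5000, 5000, 5100, 5200]  # tightly clustered

def summarize(values):
    """Return the descriptive statistics named on the Summary slide."""
    q1, _median, q3 = statistics.quantiles(values, n=4)  # quartile cut points
    return {
        "mean": statistics.mean(values),
        "median": statistics.median(values),
        "mode": statistics.mode(values),
        "sd": statistics.stdev(values),  # sample standard deviation
        "min": min(values),
        "max": max(values),
        "range": max(values) - min(values),
        "q1": q1,
        "q3": q3,
    }

summary_a = summarize(firm_a)
summary_b = summarize(firm_b)
for name, summary in (("Firm A", summary_a), ("Firm B", summary_b)):
    print(name, {key: round(value, 1) for key, value in summary.items()})
```

In this toy data the single large salary pulls Firm A's mean well above its median and inflates its standard deviation, which is the same pattern the slide's numbers show.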
Classification Identification of variables • Independent vs. dependent • Numeric vs. categorical
Pattern searching • Distribution of data • Some commonly used distributions • Uniform • Binomial • Poisson • … • Central limit theorem http://www.mathwave.com/img/art/graphs_pdf2.gif
Uniform • Every outcome has an equal chance • Example: • Flipping a coin • Rolling a die • What if you need to flip multiple times?
Binomial • Two outcomes, probability p and 1-p • Multiple trials: n • Example: • Flipping a coin 100 times • Germination of multiple seeds https://onlinecourses.science.psu.edu/stat414/sites/onlinecourses.science.psu.edu.stat414/files/lesson09/graph_n15_p02.gif
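For the coin example, the binomial probabilities can be computed directly. This is a small sketch using only the standard library; the function name `binomial_pmf` is an illustrative choice, and a fair coin (p = 0.5) is assumed.

```python
from math import comb

def binomial_pmf(k, n, p):
    """Probability of exactly k successes in n independent trials."""
    return comb(n, k) * p**k * (1 - p) ** (n - k)

n, p = 100, 0.5  # 100 flips of a fair coin (assumed fair)

prob_exactly_50 = binomial_pmf(50, n, p)
prob_40_to_60 = sum(binomial_pmf(k, n, p) for k in range(40, 61))
total = sum(binomial_pmf(k, n, p) for k in range(n + 1))  # should be 1

print(f"P(exactly 50 heads) = {prob_exactly_50:.4f}")
print(f"P(40 to 60 heads)   = {prob_40_to_60:.4f}")
print(f"Sum over all k      = {total:.4f}")
```

Getting exactly 50 heads is fairly unlikely, while landing somewhere between 40 and 60 is very likely.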
Poisson • Counts of rare, independent events • Each with probability, or average rate, p • Example: radioactive decay http://kaffee.50webs.com/Science/images/alpha_decay.gif
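One way to see the link between the two distributions: when there are many trials and each event is rare, binomial probabilities are close to Poisson probabilities with rate n × p. The sketch below assumes 1,000 atoms that each decay with probability 0.002 in some time window; both numbers are invented for illustration. The average rate is called `rate` in the code (textbooks often write it as λ).

```python
from math import comb, exp, factorial

def poisson_pmf(k, rate):
    """Probability of observing exactly k events when the average is `rate`."""
    return exp(-rate) * rate**k / factorial(k)

def binomial_pmf(k, n, p):
    """Probability of exactly k successes in n independent trials."""
    return comb(n, k) * p**k * (1 - p) ** (n - k)

# Assumed toy numbers: 1000 atoms, each decays with probability 0.002.
n, p = 1000, 0.002
rate = n * p  # expected number of decays, here about 2

print("k  binomial  poisson")
for k in range(6):
    print(f"{k}  {binomial_pmf(k, n, p):.4f}    {poisson_pmf(k, rate):.4f}")

# Largest disagreement between the two distributions over k = 0..19.
max_gap = max(abs(binomial_pmf(k, n, p) - poisson_pmf(k, rate)) for k in range(20))
print(f"largest difference: {max_gap:.5f}")
```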
Normal distribution • Central limit theorem • The mean of a large sample is approximately normally distributed, whatever the underlying distribution (given finite variance) • Large sample size → normal distribution Parameters: • mean • standard deviation https://www.mathsisfun.com/data/images/normal-distrubution-large.gif
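The central limit theorem can be checked with a quick simulation: draw many samples from a decidedly non-normal distribution (uniform on 0 to 1) and look at how the sample means behave. The sample size, number of trials, and seed below are arbitrary choices for this sketch.

```python
import random
import statistics

random.seed(0)  # fixed seed so the run is repeatable

def sample_mean(n):
    """Mean of n independent draws from a uniform(0, 1) distribution."""
    return statistics.fmean(random.random() for _ in range(n))

n = 30          # size of each sample (assumed)
trials = 5000   # number of repeated samples (assumed)
means = [sample_mean(n) for _ in range(trials)]

observed_center = statistics.fmean(means)
observed_spread = statistics.stdev(means)

# Theory: uniform(0, 1) has mean 0.5 and SD sqrt(1/12);
# the sample mean should have SD sigma / sqrt(n).
predicted_spread = (1 / 12) ** 0.5 / n ** 0.5

# For a normal distribution about 68% of values fall within one SD of the mean.
within_one_sd = sum(abs(m - 0.5) <= predicted_spread for m in means) / trials

print(f"mean of sample means: {observed_center:.4f}")
print(f"SD of sample means:   {observed_spread:.4f} (predicted {predicted_spread:.4f})")
print(f"share within one SD:  {within_one_sd:.3f}")
```

Note that it is the distribution of the sample mean that becomes bell-shaped; the individual draws stay uniform.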
Pattern searching Hypothesis testing • Difference between two populations • Z-test or t-test? • What does the p-value mean? • Family-wise error – Bonferroni correction • More than two possibilities • Chi square test • Fisher’s exact test • More than two variables • ANOVA
Example 1 SAT score is related to gender • Null hypothesis • Alternative hypothesis (3 possibilities) • One- or two-tailed? • Z-test or t-test? • p=0.07, conclusion?
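To make the one-tailed versus two-tailed question concrete, here is a sketch of a two-sample z-test using the standard library's `NormalDist`. The group means, standard deviations, and sample sizes are hypothetical, picked only so that the two-tailed p-value comes out near the 0.07 on the slide.

```python
from math import sqrt
from statistics import NormalDist

def two_sample_z(mean1, sd1, n1, mean2, sd2, n2):
    """z statistic for the difference between two independent sample means."""
    standard_error = sqrt(sd1**2 / n1 + sd2**2 / n2)
    return (mean1 - mean2) / standard_error

# Hypothetical summary statistics for two groups (not real SAT data).
z = two_sample_z(mean1=1060, sd1=200, n1=400, mean2=1035, sd2=190, n2=400)

standard_normal = NormalDist()  # mean 0, SD 1
# Two-tailed: alternative is "the means differ" (either direction).
p_two_tailed = 2 * (1 - standard_normal.cdf(abs(z)))
# One-tailed: alternative is "group 1 scores higher" (chosen before seeing data).
p_one_tailed = 1 - standard_normal.cdf(z)

alpha = 0.05
print(f"z = {z:.3f}")
print(f"two-tailed p = {p_two_tailed:.3f}, reject null: {p_two_tailed < alpha}")
print(f"one-tailed p = {p_one_tailed:.3f}, reject null: {p_one_tailed < alpha}")
```

With these numbers the same data fail to reach significance at 0.05 under a two-tailed test but pass under a one-tailed test, which is why the direction of the alternative hypothesis has to be fixed in advance. The z-test is reasonable here because the samples are large; with small samples and unknown population variance a t-test would be the usual choice.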
Example 2 Predictors of stroke • Age • Hypertension • Gender • …
Example 3 Genome-wide association studies • Scanning markers across the DNA of many people to find genetic variations associated with certain diseases
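Scanning many markers means running many tests, which is where the family-wise error and the Bonferroni correction come in. A minimal sketch, assuming independent tests and that every null hypothesis is true; the test counts are round numbers chosen for illustration.

```python
def family_wise_error(alpha, n_tests):
    """Chance of at least one false positive across n independent tests
    when every null hypothesis is actually true."""
    return 1 - (1 - alpha) ** n_tests

def bonferroni_threshold(alpha, n_tests):
    """Per-test significance level that keeps the family-wise error <= alpha."""
    return alpha / n_tests

alpha = 0.05
for n_tests in (1, 10, 100, 1_000_000):  # assumed numbers of tests
    uncorrected = family_wise_error(alpha, n_tests)
    per_test = bonferroni_threshold(alpha, n_tests)
    corrected = family_wise_error(per_test, n_tests)
    print(f"{n_tests:>9} tests: uncorrected FWER = {uncorrected:.3f}, "
          f"Bonferroni per-test alpha = {per_test:.1e}, corrected FWER = {corrected:.3f}")
```

With a million tests the per-test threshold drops to 5e-8, so only very strong associations survive the correction.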
Pattern searching Hypothesis testing • One variable • Z-test or t-test? • What does the p-value mean? • Family-wise error – Bonferroni correction • Compare two categorical variables • Chi square test • Fisher’s exact test • More than two variables • ANOVA
Chi square Punnett Square • A cross between two pea plants yields 880 plants: 639 green and 241 yellow • Hypothesis: The green allele is dominant and both parents are heterozygous. http://www2.lv.psu.edu/jxm57/irp/chisquar.html
Chi square Expected under the hypothesis: • 75% green (660 of 880) • 25% yellow (220 of 880)
Chi square Degrees of freedom: number of categories – 1 = 2 – 1 = 1
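The pea-plant numbers can be worked through in a few lines. This sketch computes the chi-square statistic from the observed counts and the 3:1 expected ratio, then a p-value. The p-value shortcut via `erfc` is valid only for one degree of freedom, which is the case here; for other cases a statistics package would be needed.

```python
from math import erfc, sqrt

observed = {"green": 639, "yellow": 241}           # counts from the slide
expected_ratio = {"green": 0.75, "yellow": 0.25}   # 3:1 under the hypothesis

total = sum(observed.values())
expected = {color: total * ratio for color, ratio in expected_ratio.items()}

# Chi-square statistic: sum of (observed - expected)^2 / expected.
chi_square = sum(
    (observed[color] - expected[color]) ** 2 / expected[color] for color in observed
)
degrees_of_freedom = len(observed) - 1

# For exactly 1 degree of freedom, P(chi2 > x) = erfc(sqrt(x / 2)).
# This shortcut does NOT generalize to other degrees of freedom.
p_value = erfc(sqrt(chi_square / 2))

print(f"expected counts: {expected}")
print(f"chi-square = {chi_square:.3f}, df = {degrees_of_freedom}, p = {p_value:.3f}")
print("reject hypothesis at 0.05:", p_value < 0.05)
```

The statistic falls below the usual 0.05 critical value of 3.841 for one degree of freedom, so the data are consistent with the 3:1 hypothesis.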
Prediction • Regression • Linear regression • Multiple linear regression • Accuracy vs. simplicity • Validation • leave-k-out http://2.bp.blogspot.com/-W7Ptp8uB02U/T8UAGm4Uw5I/AAAAAAAAC08/DcHCtLWXv-U/s1600/actnactn+1.png
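A sketch of simple linear regression with leave-k-out validation, written in plain Python so each step is visible. The data are invented (roughly y = 2x + 1 with small noise), and holding out consecutive blocks of k points is a simplification assumed here; in practice the held-out points are usually chosen at random.

```python
def fit_line(xs, ys):
    """Ordinary least squares fit of y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x

def leave_k_out_mse(xs, ys, k):
    """Hold out consecutive blocks of k points, fit on the rest,
    and average the squared prediction error on the held-out points."""
    squared_errors = []
    for start in range(0, len(xs), k):
        held_out = range(start, min(start + k, len(xs)))
        train_x = [x for i, x in enumerate(xs) if i not in held_out]
        train_y = [y for i, y in enumerate(ys) if i not in held_out]
        slope, intercept = fit_line(train_x, train_y)
        squared_errors.extend(
            (ys[i] - (slope * xs[i] + intercept)) ** 2 for i in held_out
        )
    return sum(squared_errors) / len(squared_errors)

# Hypothetical data, roughly y = 2x + 1 plus small noise.
xs = [0, 1, 2, 3, 4, 5, 6, 7, 8, 9]
ys = [1.1, 2.9, 5.2, 6.8, 9.1, 11.0, 12.9, 15.2, 16.8, 19.1]

slope, intercept = fit_line(xs, ys)
mse_leave_1 = leave_k_out_mse(xs, ys, k=1)
mse_leave_2 = leave_k_out_mse(xs, ys, k=2)

print(f"fit: y = {slope:.3f} x + {intercept:.3f}")
print(f"leave-1-out MSE = {mse_leave_1:.4f}")
print(f"leave-2-out MSE = {mse_leave_2:.4f}")
```

The validation error is measured only on points the model never saw during fitting, which is what makes it a fairer estimate of predictive accuracy than the fit error on the full data.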
Example • Use brain structural measurements to predict a subject’s performance on a picture vocabulary test • 144 total structural measurements • 521 subjects • First step: eliminate unnecessary variables • All zeros? • Highly correlated pairs • Variables that do not correlate well with performance score
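The three elimination steps could be sketched as follows. Everything specific here is an assumption made for illustration: the feature names, the toy values, the cutoffs of 0.95 and 0.1, and the rule that, within a highly correlated pair, the variable more correlated with the score is the one kept. This is not the procedure or data actually used in the study.

```python
from math import sqrt

def pearson(xs, ys):
    """Pearson correlation; returns 0.0 for a constant column (a convention
    assumed here, since the correlation is undefined in that case)."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    if sxx == 0 or syy == 0:
        return 0.0
    return sxy / sqrt(sxx * syy)

def screen_features(features, target, pair_cutoff=0.95, target_cutoff=0.1):
    """Return the names of the features that survive all three steps."""
    # Step 1: drop columns that are all zeros.
    names = [name for name, col in features.items() if any(v != 0 for v in col)]
    # Step 2: among highly correlated pairs keep the one that tracks the
    # target better (greedy pass from strongest to weakest).
    names.sort(key=lambda name: abs(pearson(features[name], target)), reverse=True)
    selected = []
    for name in names:
        if all(abs(pearson(features[name], features[kept])) < pair_cutoff
               for kept in selected):
            selected.append(name)
    # Step 3: drop columns that barely correlate with the target.
    return [name for name in selected
            if abs(pearson(features[name], target)) >= target_cutoff]

# Toy data for six hypothetical subjects.
score = [1, 2, 3, 4, 5, 6]
features = {
    "volume_a": [2, 4, 6, 8, 10, 12],
    "volume_a_copy": [2.1, 4.1, 6.0, 8.2, 10.1, 12.4],  # near-duplicate
    "thickness": [1, 3, 2, 5, 4, 6],
    "all_zero": [0, 0, 0, 0, 0, 0],
    "noise": [3, 1, 2, 2, 1, 3],                         # unrelated to score
}

kept = screen_features(features, score)
print("kept:", kept)
```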
Example • Run regression • Validation: leave 1 out and leave 10 out • Principal component analysis • …
Prediction More complicated models: • Bayesian approach • Use prior knowledge to update prediction • Diffusion weights • Use local structure to predict neighboring values
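As one small illustration of "use prior knowledge to update prediction", the sketch below applies a Beta-Binomial update to a success probability such as a seed germination rate. The prior and the observed counts are invented, and this is only one simple instance of a Bayesian approach, not necessarily the model the slide has in mind.

```python
def update_beta(prior_a, prior_b, successes, failures):
    """Posterior Beta parameters after observing binomial data."""
    return prior_a + successes, prior_b + failures

def beta_mean(a, b):
    """Mean of a Beta(a, b) distribution."""
    return a / (a + b)

# Assumed prior: weak belief that the success rate is around 0.5.
prior = (2, 2)

# Assumed new data: 8 of 10 seeds germinated.
posterior = update_beta(*prior, successes=8, failures=2)

print(f"prior mean:     {beta_mean(*prior):.3f}")
print(f"raw frequency:  {8 / 10:.3f}")
print(f"posterior mean: {beta_mean(*posterior):.3f}")
```

The posterior estimate sits between the prior belief and the raw observed frequency, and it moves closer to the data as more observations accumulate.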
Statistical tools • Excel • MATLAB • R • Minitab • …
My own research • Cost-effectiveness analysis • Mathematical modeling in medicine • Simulate iterations rather than actual patients