Course in Statistics and Data Analysis
Course B, September 2009
Stephan Frickenhaus
Outline theses
• My experience is: many young researchers lack knowledge of analysis tools, so producing/sampling data is not the problem, but analysing becomes a problem right before publication.
• Once appropriate tools are known (and Excel is not appropriate for analysis), knowledge of methods/concepts may still be missing.
• This course tries to tackle both …
Schedule
• Day 1: 8.9., 10:00 - 16:00, Room E4005: the probability distribution, the p-value concept, statistical tests in R
• Day 2: 9.9., 10:00 - 16:00, Room E4005: multivariate analysis, correlation tests, ANOVA, ordination with factors and environmental data, cluster analysis (maybe as start of Day 3)
• Day 3: 10.9., 10:00 - 16:00, Glaskasten F: user-driven and interactive, bring your project data and we work on it
Contents / Setup
• Tool-based (program "R") course
• Install "R" from www.r-project.org
• Exploring data analysis
  • Graphically
  • Numerically
• Exploring what significance really is
• Statistical tests no longer as black boxes
DAY 1 – Lecture part I
• With each type of data we have different methods of analysis. Give examples!
Type of data: examples
• Numerical (metric) data: linear, e.g., length in cm; circular, e.g., angle in degrees
• Nominal (class) data: sex, colour, species
• Ordinal (ranked) data: age group, school class, phase in cell division
First steps from data …
• Metric (linear: length in cm; circular: angle in degrees): plot in a co-ordinate system (scatter plot), histogram, boxplot
• Nominal (sex, colour, species): count in a table, barplot, pie chart
• Ordinal (age group, school class, phase in cell division): count in a table, with an axis, barplot
… to methods
• Metric: plot in a co-ordinate system (scatter plot), histogram, boxplot. Then check for groups, trends, correlations.
• Nominal: count in a table, barplot, pie chart. Then check for differences, ratios.
• Ordinal: count in a table, with an axis, barplot. Then check for differences, ratios, relation to order.
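The first steps for the three data types can be tried out numerically in R. This is only a minimal sketch: the vectors `length_cm`, `species` and `age_group` are invented example data, not data from the course.

```r
# Hypothetical example data, invented for illustration only
length_cm <- c(12.1, 13.4, 11.8, 15.2, 14.0, 12.9)        # metric
species   <- factor(c("A", "B", "A", "C", "B", "A"))       # nominal
age_group <- factor(c("young", "old", "young", "adult", "adult", "young"),
                    levels = c("young", "adult", "old"),
                    ordered = TRUE)                         # ordinal

# Metric data: numerical summary (min, quartiles, mean, max)
metric_summary <- summary(length_cm)
print(metric_summary)

# Nominal data: count in a table
species_counts <- table(species)
print(species_counts)

# Ordinal data: count in a table whose columns follow the defined order
age_counts <- table(age_group)
print(age_counts)
```

For the graphical view, `hist(length_cm)`, `boxplot(length_cm)` and `barplot(species_counts)` draw the plots named on the slide.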
… to combinations of data
• Metric + metric: X-Y plots
• Nominal + metric: class = colour in scatter plot; check for groups/clusters
• Ordinal + metric + metric: X-Y plot with colours = class
… towards models: multivariate data
• Organize data in tables
• Keep data of the same measurement in ONE row
• Distinguish groups in an extra column by nominal data
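Such a table could look as follows in R. This is a sketch with made-up numbers; the column names `length_cm`, `weight_g` and `group` are hypothetical.

```r
# Hypothetical measurements: one row per measured individual,
# the group membership is kept in an extra nominal column (a factor)
measurements <- data.frame(
  length_cm = c(12.1, 13.4, 11.8, 15.2, 14.0, 12.9),
  weight_g  = c(30.5, 34.1, 29.9, 41.0, 37.2, 33.0),
  group     = factor(c("control", "control", "control",
                       "treated", "treated", "treated"))
)
str(measurements)

# With this layout, per-group summaries take a single call
group_means <- aggregate(cbind(length_cm, weight_g) ~ group,
                         data = measurements, FUN = mean)
print(group_means)
```

Keeping the group as a column (instead of one table per group) is what lets functions like `aggregate()` split the data by class.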
Before discussing what we can do with such a table, let's take the first steps in the tool R!
Start practice with R
• www.r-project.org
• http://ftp5.gwdg.de/pub/misc/cran/
Lecture part II
• What if a summary of the data is not enough? E.g., we want to say whether an observed mean value is probably greater than 0.5.
• It is not enough to conclude "We clearly find mean(x) < mean(y)", because this may be an outcome of small sample sizes; in reality the means may be equal and there may be no effect at all.
• We must define some terms to learn how to be more quantitative about such statements, like "with 1% error we can exclude that x and y are from the same population".
Some terms …
• Population:
  • All individuals of the kind measured
  • If we measure them all, we know exactly the mean value etc., the true mean
  • Sometimes we do not have it accessible
  • Sometimes we think it has infinitely many individuals
• Sample:
  • A subset of individuals from a population
  • It has, e.g., a sample mean that is in general not equal to the true mean (the mean of the population)
• Sample size: number of individuals picked
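The difference between true mean and sample mean can be seen in a small simulation. This is a sketch only: the population of 10000 normally distributed values is an assumption made for illustration.

```r
set.seed(1)

# Hypothetical finite population of 10000 individuals (assumed normal)
population <- rnorm(10000, mean = 5, sd = 2)
true_mean  <- mean(population)        # known only because we "measured" all

# A sample: a subset of 10 individuals picked from the population
x <- sample(population, size = 10)
sample_mean <- mean(x)

cat("true mean:  ", true_mean, "\n")
cat("sample mean:", sample_mean, "\n")
```

Running the last three statements again (without `set.seed`) gives a different sample mean each time, while the true mean stays fixed.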
… more terms, for real-valued variables X
• Probability density function p(x): the area under p(x) over the interval [a,b] is the probability to pick a sample xi from X in [a,b]
• Cumulative distribution function cdf(a): the probability to pick an x below a
Probability density function p(x)
[Figure: p(x) plotted over x, with the interval from a to b marked]
• The full range of X makes 100%
• p(x) ≥ 0
• Need not be symmetric!
Cumulative distribution function cdf(x)
[Figure: cdf(x) plotted over x, rising from 0 at min(X) to 1 at max(X)]
• cdf starts from 0 at the minimal possible value of X and reaches 1 at the maximal possible value of X. There p drops to 0.
• cdf is monotonically increasing, because it integrates a p ≥ 0.
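The relation between p(x) and cdf(x) can be checked numerically. The sketch below assumes a standard normal X (so min(X) and max(X) are -Inf and Inf); the interval limits `a` and `b` are arbitrary example values.

```r
# Assumption for illustration: X is standard normal,
# so p(x) is dnorm(x) and cdf(x) is pnorm(x)
a <- -1
b <- 2

# Probability to pick x in [a, b]: area under p(x) between a and b
area_under_p <- integrate(dnorm, lower = a, upper = b)$value

# The same probability from the cumulative distribution function
via_cdf <- pnorm(b) - pnorm(a)

# The full range of X makes 100%
total_area <- integrate(dnorm, lower = -Inf, upper = Inf)$value

cat("area under p(x) on [a, b]:", area_under_p, "\n")
cat("cdf(b) - cdf(a):          ", via_cdf, "\n")
cat("area over the full range: ", total_area, "\n")
```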
Mean E and standard deviation S
[Figure: p(x) plotted over x, with E(X) and S(X) marked]
• E(X) need not be at the maximum of p(x).
• S(X) measures the width of p(x), i.e., the scattering of x around E(X).
Long-tail distributions
[Figure: p(x) plotted over x, with a long tail towards large x]
• Some rare samples will have very large values x!
• When we have few samples, we may pick none of these rare values!
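How often a small sample misses the tail can be estimated by simulation. The sketch assumes a log-normal population with `sdlog = 2` and calls the top 1% of values "rare"; both choices are illustrative, not from the course.

```r
set.seed(42)

# Assumed long-tailed population: log-normal (illustrative choice)
threshold   <- qlnorm(0.99, meanlog = 0, sdlog = 2)  # 1% of values exceed this
sample_size <- 5
n_rep       <- 10000

# For each repeated small sample: does it contain NO rare large value?
misses <- replicate(n_rep,
                    all(rlnorm(sample_size, meanlog = 0, sdlog = 2) < threshold))
frac_missing <- mean(misses)

# Exact value for comparison: each pick stays below the threshold with 99%
expected <- 0.99^sample_size

cat("fraction of samples without a rare value:", frac_missing, "\n")
cat("expected 0.99^5:                         ", expected, "\n")
```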
What is a statistical test?
• Example: we have a sample x of size 6.
• How probable is it that the mean of the sample x is between 2 and 2.5, although E(X) = 0?
• To answer this:
  1) We repeat taking samples of size 6 many times and count how often this happens. (May be too expensive.)
  2) We make an assumption about the probability density of X and then integrate a statistical distribution of mean(x) to measure Pr(2 < mean(x) < 2.5). (Later: can I check what the pdf of X is?)
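Both routes can be sketched in R. The slide does not give the spread of X, so the code assumes a normal X with sd(X) = 3 purely for illustration; with that assumption the distribution of mean(x) is known exactly and can be compared with the counting approach.

```r
set.seed(123)

n     <- 6        # sample size from the example
sd_x  <- 3        # ASSUMPTION: sd(X) is not given on the slide
n_rep <- 100000   # number of repeated samples

# Route 1: repeat sampling many times and count how often
samples <- matrix(rnorm(n * n_rep, mean = 0, sd = sd_x), ncol = n)
means   <- rowMeans(samples)            # one mean per repeated sample
pr_counted <- mean(means > 2 & means < 2.5)

# Route 2: assume X is normal, then mean(x) is normal with sd = sd_x/sqrt(n);
# integrating its density from 2 to 2.5 is a difference of two cdf values
se <- sd_x / sqrt(n)
pr_integrated <- pnorm(2.5, mean = 0, sd = se) - pnorm(2, mean = 0, sd = se)

cat("Pr by counting:   ", pr_counted, "\n")
cat("Pr by integration:", pr_integrated, "\n")
```

Route 1 needs 100000 samples here; route 2 needs one line, but only works once a pdf for X has been assumed.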
… influence of sample size on the mean
• Repeat a sampling from X with sd(X) = 1.0 at different sizes N
• Take the sample means
• How do the repeated means vary (standard deviation)?
• Result …
• For high N, sd(mean) goes like 1/sqrt(N) (central limit theorem)
• How for low N? It is given by the t-statistic t = mean(x)/(sd(x)/sqrt(N)), which depends on the sample size N.
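The experiment described above can be run directly. This sketch assumes a normal X with mean 0 and sd(X) = 1.0; the sample sizes and the number of repetitions are arbitrary choices.

```r
set.seed(7)

sizes <- c(2, 5, 10, 100)   # sample sizes N to compare
n_rep <- 20000              # repeated samples per size

# For each N: draw n_rep samples, take their means, measure how the means vary
sd_of_means <- sapply(sizes, function(n) {
  samples <- matrix(rnorm(n * n_rep, mean = 0, sd = 1), ncol = n)
  sd(rowMeans(samples))
})

result <- data.frame(N = sizes,
                     simulated = sd_of_means,
                     theory    = 1 / sqrt(sizes))
print(result)
```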
A first test: test the influence of sample size
• How do I know how many samples I need to make a correct statement about the mean, like E(X) ≥ 0.89?
• "Correct" is to be quantified as the "type-I error": how probable is it that I see the same or a more extreme value by chance alone, i.e., although the population mean is 0?
Concept of the null hypothesis
• How sure can I be in excluding that the population mean is zero when I find a sample mean of m = 0.89?
• So we evaluate how probable such an outcome is when X follows a certain pdf(X), e.g., the normal distribution, with E(X) = 0.
• To evaluate this Pr, we need a test statistic t for it and a distribution pdf(t) to integrate for Pr.
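How this probability shrinks with sample size can be sketched under a simplifying assumption: X is normal with E(X) = 0 and a known sd(X) = 1, so that mean(x) is normal with sd 1/sqrt(N). The list of sample sizes is arbitrary.

```r
m     <- 0.89                  # observed sample mean from the slide
sizes <- c(1, 2, 4, 9, 16)     # candidate sample sizes N

# ASSUMPTION: X ~ normal, E(X) = 0 (null hypothesis), known sd(X) = 1.
# Then mean(x) is normal with sd = 1/sqrt(N), and the chance to see
# a mean of m or more by chance alone is the upper tail area:
p_chance <- pnorm(m, mean = 0, sd = 1 / sqrt(sizes), lower.tail = FALSE)

print(data.frame(N = sizes, p_chance = p_chance))
```

When sd(X) has to be estimated from the sample itself, the normal distribution is replaced by the t-distribution, which is the topic of the next slides.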
T-statistics
• T has a complicated mathematical form; its graph is similar to a bell-shaped curve.
• For small sample size N it has longer tails (green).
[Figure: density of T; blue area = Pr(T < 3), the remaining area = Pr(T ≥ 3)]
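The longer tails for small N can be made visible with R's built-in t-distribution function `pt`. The degrees of freedom compared here are an arbitrary selection.

```r
# Tail area Pr(T >= 3) for different degrees of freedom (sample size - 1)
df_values <- c(1, 5, 30)
tail_t <- pt(3, df = df_values, lower.tail = FALSE)

# The same tail area for the normal (bell-shaped) curve
tail_normal <- pnorm(3, lower.tail = FALSE)

print(data.frame(df = df_values, tail_t = tail_t))
cat("normal distribution:", tail_normal, "\n")
```

The smaller the sample, the more probability sits beyond 3; for large df the values approach the normal tail.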
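T is known in R
• pt(3, df=1) gives Pr(t < 3) for n = 2: the first argument is the upper boundary, df is the sample size minus 1.
• Test for the sample x=c(1,2). Why the upper boundary 3? t = mean(x)/sd(x)*sqrt(2) = 1.5/0.707*1.41 = 3.0
• So, ~90 out of 100 repeated samples will give a t-value below 3, the value that our sample with mean 1.5 reaches.
• 1-pt(3,df=1) = 0.1024164 is the chance to get a t-value of 3 or more, i.e., a result at least as extreme as our mean(x) = 1.5 (remember, N=2), under the assumption that x is drawn from a population with mean 0!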
Now the test itself:
• We have a sample of size 2.
• The null hypothesis: our sample is from a population with mean 0.
• The test that checks this is in R …
[Figure annotation: "Ignore this 0"]
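The R call is not preserved in the slide text. One way to run such a test is `t.test`; using the sample x=c(1,2) from the previous slide and the one-sided option `alternative = "greater"` are assumptions made here so that the result can be compared with the hand calculation.

```r
x <- c(1, 2)     # sample of size 2 from the previous slide
n <- length(x)

# By hand: t-statistic and upper tail area of the t-distribution
t_by_hand <- mean(x) / (sd(x) / sqrt(n))
p_by_hand <- pt(t_by_hand, df = n - 1, lower.tail = FALSE)

# With the built-in test; null hypothesis: population mean mu = 0
# ASSUMPTION: one-sided alternative, to match 1 - pt(3, df = 1)
result <- t.test(x, mu = 0, alternative = "greater")
print(result)

cat("t by hand:", t_by_hand, " p by hand:", p_by_hand, "\n")
```

With the default two-sided alternative, `t.test(x, mu = 0)` reports twice this p-value, because values far below 0 then count as extreme as well.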