Course in Statistics and Data analysis

Course in Statistics and Data analysis
Course B, DAY2, September 2009
Stephan Frickenhaus
www.awi.de/en/go/bioinformatics

DAY2
• How to import data from Excel
• Multivariate data in plots, linear models
• ANOVA
• Ideas of clustering and modelling

Import of Data from Excel
• Copy a rectangular part (or all) of your table in Excel.
• Paste it into a text file in the Windows Editor.
• Check the column names.
• Save as "tab1.txt".
• In R, change to the directory where "tab1.txt" is (File->Change Dir.).
• Load the file in R into a variable V: V=read.table(file="tab1.txt",header=T)
• You may use the column named "day" as row names: V=read.table(file="tab1.txt",header=T,row.names="day")
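The import step can be tried without Excel. The following is a minimal sketch that writes a small made-up table to a temporary file and reads it back. The columns "size" and "class" and all values are invented for illustration; only the column "day" and the read.table arguments come from the slide.

```r
# Hypothetical stand-in for "tab1.txt": tab-separated text as pasted from Excel.
tab_file <- tempfile(fileext = ".txt")
writeLines(c("day\tsize\tclass",
             "1\t10.5\tA",
             "2\t11.2\tB",
             "3\t9.8\tA"), tab_file)

# Plain import: the first line supplies the column names.
V <- read.table(file = tab_file, header = TRUE)

# Same import, but the column "day" becomes the row names.
V2 <- read.table(file = tab_file, header = TRUE, row.names = "day")

print(V)
print(rownames(V2))
```

Writing TRUE instead of T is slightly safer, because T is an ordinary variable that can be overwritten.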

Problems
• In case of problems with decimal "," or ".": tell R which character is the decimal point in read.table: V=read.table(…, dec=".") or, for decimal commas, V=read.table(…, dec=",")
• If you get a text file with commas, not tabs, separating columns: V=read.table(…, sep=",")
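A small sketch of both cases, using two made-up files. The semicolon separator in the first file is an assumption chosen here because a file cannot use the comma for both purposes at once.

```r
# Hypothetical file with decimal commas and semicolon-separated columns.
f1 <- tempfile(fileext = ".txt")
writeLines(c("day;size", "1;10,5", "2;11,2"), f1)
A <- read.table(f1, header = TRUE, sep = ";", dec = ",")

# Hypothetical file with decimal points and comma-separated columns.
f2 <- tempfile(fileext = ".txt")
writeLines(c("day,size", "1,10.5", "2,11.2"), f2)
B <- read.table(f2, header = TRUE, sep = ",")

# Both imports should give numeric columns with the same values.
str(A)
str(B)
```

If a column that should be numeric shows up as text in str(), a wrong dec or sep setting is a likely cause.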

Saving result tables
• Results of an R analysis, e.g. from filtering, are sometimes exported to text files, which can be imported into Excel or R later.
• Do this without quotes around each entry: write.table(V, file="res.txt", quote=F)
• Save only 2 desired columns ("size" and "class"): write.table(V[,c("size","class")], file="res2.txt")
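The export can be checked with a round trip. This sketch uses a made-up table. The arguments sep="\t" and row.names=FALSE are additions not shown on the slide, chosen here so that the file pastes cleanly into Excel.

```r
# Made-up result table; "size" and "class" are the column names used on the slide.
V <- data.frame(day = 1:3,
                size = c(10.5, 11.2, 9.8),
                class = c("A", "B", "A"))

res_file <- tempfile(fileext = ".txt")

# Export only two columns, tab-separated, without quotes or row names.
write.table(V[, c("size", "class")], file = res_file,
            quote = FALSE, sep = "\t", row.names = FALSE)

# Read the file back to check that the table survived the round trip.
W <- read.table(res_file, header = TRUE, sep = "\t")
print(W)
```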

Multivariate Data
• Suppose we have measured the diameter and height of diatoms.
• Work with "diatoms.txt".
• What is the relation between the two? Is one dependent on the other? What is the strategy of the organism?

Correlation test
• Is there a significant correlation? cor.test(D,H)
• This checks whether the observed correlation is significantly different from zero.
• We find a negative correlation near -1 (strong).
• A small p-value indicates a significant correlation.
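A minimal sketch of the test on simulated data. The numbers are invented and are not the values from "diatoms.txt"; they only mimic a height that shrinks as the diameter grows.

```r
set.seed(1)  # made-up data, not the values from "diatoms.txt"
D <- runif(30, min = 10, max = 30)        # diameters
H <- 4000 / D^2 + rnorm(30, sd = 0.5)     # heights shrink as D grows

ct <- cor.test(D, H)   # Pearson correlation test by default
ct$estimate            # sample correlation coefficient
ct$p.value             # p-value for the null hypothesis "correlation is zero"
```

The p-value only says that the correlation differs from zero; how strong the relation is can be read from the estimate.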

• We can conclude that these diatoms show a special trend: height increases as diameter decreases.
• What does this mean? Can we say that this has a compensating function?
• It could be that the cell maintains its volume (centric shape).
• Volume V = pi*R^2*H = 1/4*pi*D^2*H
• So, for constant V, we expect a linear relation between H and 1/D^2.
• We need a regression; in R it is done with lm(Y~X). Try ?lm to see how. See "diatoms.R".
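The constant-volume idea can be sketched as a regression of H on 1/D^2. The data below are simulated with an assumed volume of 3000 (arbitrary units); this is not the content of "diatoms.R". Solving the volume formula for H gives H = (4*V/pi) * 1/D^2, so the slope of the fit estimates 4*V/pi.

```r
set.seed(2)  # simulated cells with constant volume; not the course data
vol <- 3000                                   # assumed cell volume
D <- runif(40, min = 10, max = 30)
H <- 4 * vol / (pi * D^2) + rnorm(40, sd = 0.3)

X <- 1 / D^2
fit <- lm(H ~ X)          # H = intercept + slope * (1 / D^2)
coef(fit)

# With constant volume the slope estimates 4 * V / pi.
slope <- unname(coef(fit)["X"])
slope * pi / 4            # should be close to the assumed volume
```

An intercept near zero and a clearly significant slope would support the constant-volume hypothesis for such data.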

Linear models
• Used to fit a model to data.
• Suppose we have a sample of measured (y,x1,x2,x3).
• The simplest model showing the influence of all 3 x has the form y=a*x1+b*x2+c*x3+d
• The coefficients a,b,c,d are obtained from lm(y~x1+x2+x3)
• Each coefficient's value may be non-significant, in which case it could as well be set to zero.
• summary(lm()) shows these significances.

Check "lm.R"
• The data y was created with coefficients 1, 1, 0.5 and a random term runif/3.
• We see estimates of these coefficients from the fit under "Estimate".
• Now we could write the fitted model as y.fit(x1,x2,x3)=0.26826+1.00595*x1+1.01167*x2+0.47311*x3
• If you want no intercept, use y~x1+x2+x3-1
• Use this to draw a ± error bar around y.fit
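A sketch of the kind of experiment described here. It is a guess at what "lm.R" does, so the sample size, the seed and the uniform predictors are assumptions, and the estimates will differ from those on the slide. Using the residual standard error as the half-width of the error bar is also just one possible choice.

```r
set.seed(3)  # a guess at the kind of data "lm.R" builds; the real script may differ
n  <- 100
x1 <- runif(n)
x2 <- runif(n)
x3 <- runif(n)
y  <- 1 * x1 + 1 * x2 + 0.5 * x3 + runif(n) / 3   # random term with mean 1/6

fit <- lm(y ~ x1 + x2 + x3)
summary(fit)              # estimates, standard errors and Pr(>|t|) per coefficient

fit0 <- lm(y ~ x1 + x2 + x3 - 1)   # same model without an intercept
coef(fit0)

# Fitted values with a band of one residual standard error (one possible choice).
y.fit <- fitted(fit)
s     <- summary(fit)$sigma
head(cbind(lower = y.fit - s, fit = y.fit, upper = y.fit + s))
```

Because runif(n)/3 has mean 1/6 rather than zero, the fitted intercept absorbs that offset.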

Conclusions
• Variables x with significant coefficients, i.e. Pr(>|t|)<alpha, are said to have an effect on y.
• Sometimes there are relations between the explanatory variables, say x1 and x2 are correlated, like x2=2*x1.
• Then y=c1*x1+c2*x2 can be reduced to y=(c1+2*c2)*x1
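The effect of perfectly related predictors can be seen in a small simulation. The coefficients c1 = 1 and c2 = 3 and the noise level are invented for illustration.

```r
set.seed(4)  # made-up data to illustrate perfectly correlated predictors
x1 <- runif(50)
x2 <- 2 * x1                                  # x2 carries no information beyond x1
y  <- 1 * x1 + 3 * x2 + rnorm(50, sd = 0.1)   # c1 = 1, c2 = 3

# lm() cannot separate the two effects and reports NA for x2.
full <- lm(y ~ x1 + x2)
coef(full)

# The reduced model estimates the combined slope c1 + 2 * c2 = 7.
fit <- lm(y ~ x1)
coef(fit)
```

With predictors that are strongly but not perfectly correlated, lm() returns both coefficients, but their standard errors become large.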

ANOVA
• With two different treatments we use the t-test to compare means.
• The influence of a factor/treatment with more than 2 variants is commonly analysed by ANOVA, i.e. more than two means are compared at the same time.
• The null hypothesis is that all sample means come from the same population [the treatment has no effect].

ANOVA
• In R it works like linear models, but with factors that influence the means.
• See dataset ANOVA.txt
• Try aov(y~f.c)
• A weak p-value: the effect may be unclear because of the other factors.
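A minimal one-way example on simulated data. The group means, the group size and the level names 0, 1, 2 are assumptions; the real "ANOVA.txt" contains further factors that are not modelled here.

```r
set.seed(5)  # simulated stand-in for "ANOVA.txt"; the real file has more factors
f.c <- factor(rep(c(0, 1, 2), each = 10))          # one factor with 3 levels
y   <- rnorm(30, mean = c(5, 6, 5.5)[as.integer(f.c)], sd = 1)

fit <- aov(y ~ f.c)
summary(fit)       # F-test of the null "all level means are equal"

# The p-value can be pulled out of the summary table.
p <- summary(fit)[[1]][["Pr(>F)"]][1]
p
```

f.c must be a factor; a numeric column would be fitted as a slope, not as separate level means.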

But which means do differ?
• f.c has 3 levels.
• We are not allowed to simply look at the means of each level.
• We must make all pairwise comparisons for significance.
• This is known as a "post-hoc" test.
• One such test is TukeyHSD.
• It gives a table of pairwise tests of means.
• Since the data is used more than once, we are more likely to discover some effect by chance.
• HSD corrects the p-values for multiple tests.
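A sketch of the post-hoc step on simulated data (group means and sizes are invented). TukeyHSD takes the fitted aov object and returns one row per pair of levels.

```r
set.seed(6)  # simulated data; level names 0, 1, 2 are an assumption
f.c <- factor(rep(c(0, 1, 2), each = 10))
y   <- rnorm(30, mean = c(5, 6.5, 5.5)[as.integer(f.c)], sd = 1)

fit <- aov(y ~ f.c)
hsd <- TukeyHSD(fit)      # all pairwise comparisons of level means
hsd$f.c                   # columns: diff, lwr, upr, p adj
```

With 3 levels there are 3 pairwise comparisons; the column "p adj" holds the p-values corrected for these multiple tests.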

Post-hoc
• Almost significant effect, comparing group 1 with 0.
• Adjusted p for 3 tests.

A graphical view
• plot(y~f.c) draws one box plot per level of f.c.

Compare with a t-test
• So the adjusted p-value of 0.06 from HSD is greater than the p-value of the t-test.
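The comparison can be reproduced on simulated data. This sketch uses pairwise.t.test with the pooled standard deviation and no adjustment as the unadjusted reference, which is an assumption; the slide may have used a plain t.test on two groups. The data are invented, so the p-values will not match the 0.06 quoted above.

```r
set.seed(7)  # simulated data; not the values shown on the slide
f.c <- factor(rep(c(0, 1, 2), each = 10))
y   <- rnorm(30, mean = c(5, 6, 5.5)[as.integer(f.c)], sd = 1)

# Unadjusted pairwise t-tests with pooled SD ...
pt <- pairwise.t.test(y, f.c, p.adjust.method = "none")$p.value
p_raw <- c("1-0" = pt["1", "0"], "2-0" = pt["2", "0"], "2-1" = pt["2", "1"])

# ... and Tukey-adjusted p-values for the same three comparisons.
p_hsd <- TukeyHSD(aov(y ~ f.c))$f.c[, "p adj"]

p_raw
p_hsd
```

For each pair the Tukey-adjusted value is at least as large as the unadjusted one, which is the price of making three comparisons on the same data.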

Ideas of clustering and modeling
• Clustering is a way to detect/display groups in data that might point to a factor which affects the sample.
• Different ways:
  • Mapping: plot multivariate data in a special way to see groups.
  • Discriminant analysis: use a known factor (e.g., strain) to find a mapping that best separates the known groups.
  • Use the discriminant to classify new data!

PCA
• Download data PCA.txt
• See PCA.R to make a PCA for that multivariate data.
• PC1 is the axis of the rotated data with maximal variance.
• PC2 has smaller variance.
• We could separate / discriminate with this line.
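A minimal PCA sketch on simulated two-group data. Using prcomp is an assumption, since the function called in "PCA.R" is not shown here, and the data are invented rather than taken from "PCA.txt".

```r
set.seed(8)  # simulated two-group data standing in for "PCA.txt"
g <- rep(c("a", "b"), each = 25)
x <- rnorm(50) + ifelse(g == "b", 4, 0)                # groups differ along x ...
X <- cbind(x = x, y = 0.8 * x + rnorm(50, sd = 0.5))   # ... and y follows x

pca <- prcomp(X)          # centers the data, then rotates it
pca$sdev^2                # variances of PC1, PC2 (decreasing)
pca$rotation              # directions of the new axes

scores <- pca$x           # the rotated data
tapply(scores[, "PC1"], g, mean)   # group means along PC1
```

The rotation only redistributes the variance: the variances of PC1 and PC2 add up to the total variance of the original columns.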

Linear Discriminant
• The original 3-class 3D data in a 2D LDA.
• New data (squares) classified (predicted) according to the LDA.
• Check LDA.R and LDA.txt to see similar results.
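A sketch of fitting a discriminant and classifying new points. It assumes the lda function from the MASS package; whether "LDA.R" uses it is not shown here. The class centers, the column names x1, x2, x3 and the new points are invented.

```r
library(MASS)  # lda() lives in MASS, which ships with standard R installations

set.seed(9)  # simulated 3-class, 3-D data standing in for "LDA.txt"
cls <- factor(rep(c("A", "B", "C"), each = 30))
centers <- rbind(c(0, 0, 0), c(4, 0, 2), c(0, 4, -2))   # rows: class A, B, C
X <- centers[as.integer(cls), ] + matrix(rnorm(90 * 3), ncol = 3)
train <- data.frame(cls = cls, x1 = X[, 1], x2 = X[, 2], x3 = X[, 3])

fit <- lda(cls ~ x1 + x2 + x3, data = train)
proj <- predict(fit)$x        # the 3-D data mapped to 2 discriminants (LD1, LD2)

# Classify new, unlabeled points (made up here) with the fitted discriminant.
new <- data.frame(x1 = c(0, 4, 0), x2 = c(0, 0, 4), x3 = c(0, 2, -2))
pred <- predict(fit, newdata = new)$class
pred
```

With 3 classes there are at most 2 discriminant axes, which is why 3-D data can be shown in a 2-D LDA plot.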
