Course in Statistics and Data Analysis
Course B, DAY2, September 2009
Stephan Frickenhaus
www.awi.de/en/go/bioinformatics
DAY2
• How to import data from Excel
• Multivariate data in plots, linear models
• ANOVA
• Ideas of Clustering and Modelling
Import of Data from Excel
• Copy a rectangular part (or all) of your table in Excel.
• Paste it into a text file in the Windows Editor.
• Check the column names.
• Save as "tab1.txt".
• In R, change to the directory where "tab1.txt" is (File->Change Dir.).
• Load the table into a variable V: V=read.table(file="tab1.txt",header=TRUE)
• You may use the column named "day" as row names: V=read.table(file="tab1.txt",header=TRUE,row.names="day")
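The import steps above can be tried without Excel. The following is a minimal sketch, assuming a small made-up table; the temporary file and the columns "size" and "class" are illustrative stand-ins for "tab1.txt", not the course data.

```r
# Hypothetical stand-in for the text file saved from Excel (tab-separated).
path <- tempfile(fileext = ".txt")
writeLines(c("day\tsize\tclass",
             "d1\t1.5\ta",
             "d2\t2.0\tb",
             "d3\t2.5\ta"), path)

# Read the table; header = TRUE takes the column names from the first line.
V <- read.table(file = path, header = TRUE)

# Same file, but the "day" column becomes the row names.
V2 <- read.table(file = path, header = TRUE, row.names = "day")

str(V)         # three columns: day, size, class
rownames(V2)   # the entries of the former "day" column
```

With row.names="day" the table has one column less, and rows can be addressed by name, e.g. V2["d2", "size"].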
Problems
• In case of problems with decimal "," or ".": tell R which character is the decimal point in read.table: V=read.table(…, dec=".") (use dec="," if the file contains decimal commas)
• If you get a text file with commas, not tabs, separating the columns: V=read.table(…, sep=",")
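Both problems can be reproduced on small made-up files. This is a sketch under the assumption of two hypothetical files, one with decimal commas and one with comma-separated columns; the file contents are invented for illustration.

```r
# Hypothetical tab-separated file that uses "," as the decimal mark.
path1 <- tempfile(fileext = ".txt")
writeLines(c("day\tsize", "d1\t1,5", "d2\t2,25"), path1)
V1 <- read.table(path1, header = TRUE, dec = ",")

# Hypothetical file with commas (not tabs) between the columns.
path2 <- tempfile(fileext = ".txt")
writeLines(c("day,size", "d1,1.5", "d2,2.25"), path2)
V2 <- read.table(path2, header = TRUE, sep = ",")

V1$size   # numeric, not text
V2$size
```

If the decimal mark is not declared, R typically reads such a column as text instead of numbers, which is a quick way to spot the problem with str(V).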
Saving result tables
• R analysis results, e.g., from filtering, are sometimes exported to text files, which can be imported into Excel or R later.
• Do this without quotes around each entry: write.table(V, file="res.txt", quote=FALSE)
• Save only 2 desired columns ("size" and "class"): write.table(V[, c("size","class")], file="res2.txt")
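The export can be sketched on a small made-up table. The data frame below and the temporary file names are assumptions for illustration; only the column names "size" and "class" come from the slide.

```r
# Hypothetical result table.
V <- data.frame(day   = c("d1", "d2", "d3"),
                size  = c(1.5, 2.0, 2.5),
                class = c("a", "b", "a"))

# Whole table, without quotes around the entries.
res <- tempfile(fileext = ".txt")
write.table(V, file = res, quote = FALSE)

# Only the two desired columns; selecting them from the data frame
# keeps them as columns in the output file.
res2 <- tempfile(fileext = ".txt")
write.table(V[, c("size", "class")], file = res2,
            quote = FALSE, row.names = FALSE)

# Read the second file back to check the round trip.
back <- read.table(res2, header = TRUE)
back
```

Here row.names=FALSE is an optional choice that drops the running row numbers, which makes the file easier to open in Excel.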
Multivariate Data
• Suppose we have measured the Diameter and Height of diatoms.
• Work with "diatoms.txt".
• What is the relation between these? Is one dependent on the other? What is the strategy of the organism?
Correlation test
• Is there a significant correlation? cor.test(D,H)
• This checks whether the observed correlation is significantly different from zero.
• We find a negative correlation, near -1 (strong).
• A small p-value (below the chosen alpha) indicates a significant correlation.
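Without "diatoms.txt" at hand, the test can be tried on simulated values. This is a sketch only: the numbers below are invented, assuming cells of roughly constant volume, and are not the course data.

```r
set.seed(1)

# Simulated diameters and heights (hypothetical units).
D   <- runif(40, min = 10, max = 30)
vol <- 5000                                   # assumed constant cell volume
H   <- vol / (pi / 4 * D^2) * exp(rnorm(40, sd = 0.05))

# Pearson correlation test between diameter and height.
ct <- cor.test(D, H)
ct$estimate   # correlation coefficient, negative here
ct$p.value    # small value: correlation differs significantly from zero
```

Printing ct shows the full report, including a confidence interval for the correlation.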
Interpretation
• We can conclude that these diatoms show a special trend: height increases as diameter decreases.
• What does this mean? Can we say that this has a compensating function?
• It could be that the cell maintains its volume (centric shape).
• Volume V = R^2*pi*H = 1/4*D^2*pi*H
• For constant V this gives H = (4*V/pi) * 1/D^2, so we expect a linear relation between H and 1/D^2.
• We need a regression. It is found in R: lm(Y~X). Try ?lm to see how. See "diatoms.R".
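The expected linear relation between H and 1/D^2 can be checked with a regression on simulated values. This is a sketch, not the content of "diatoms.R"; the volume and the noise level are assumptions.

```r
set.seed(1)

# Simulated diatoms with roughly constant volume (hypothetical numbers).
D   <- runif(40, min = 10, max = 30)
vol <- 5000
H   <- vol / (pi / 4 * D^2) * exp(rnorm(40, sd = 0.05))

# Regress H on the transformed variable x = 1/D^2.
x   <- 1 / D^2
fit <- lm(H ~ x)

coef(fit)        # slope should be close to 4 * vol / pi
4 * vol / pi     # value expected from the volume formula
```

A slope near 4*vol/pi together with an intercept near zero is what the constant-volume idea predicts.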
Linear models
• Used to fit a model to data.
• Suppose we have a sample of measured (y,x1,x2,x3).
• The simplest model showing the influence of all 3 x has the form y=a*x1+b*x2+c*x3+d
• The coefficients a,b,c,d are obtained from lm(y~x1+x2+x3)
• Each coefficient's value may be non-significant, so it could just as well be set to zero.
• summary(lm()) shows these significances.
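How the significances appear in the summary can be sketched with simulated data. The true coefficients below are assumptions chosen for illustration; x3 is deliberately given no influence on y.

```r
set.seed(2)
n  <- 50
x1 <- rnorm(n)
x2 <- rnorm(n)
x3 <- rnorm(n)

# Assumed true model: x3 has coefficient 0, i.e. no effect on y.
y <- 2 * x1 - 1.5 * x2 + 0 * x3 + 1 + rnorm(n)

fit <- lm(y ~ x1 + x2 + x3)

# Coefficient table: estimate, standard error, t value, p-value.
tab <- summary(fit)$coefficients
tab
tab[, "Pr(>|t|)"] < 0.05   # which coefficients are significant at alpha = 0.05
```

The rows of tab are "(Intercept)", "x1", "x2" and "x3"; the intercept row corresponds to d in the model above.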
Check "lm.R"
• The data y was created with coefficients 1, 1, 0.5 and a random term runif/3.
• We see estimates of these coefficients from the fit under "Estimate".
• Now we could write the fitted model as y.fit(x1,x2,x3)=0.26826+1.00595*x1+1.01167*x2+0.47311*x3
• If you want no intercept, use y~x1+x2+x3-1
• The summary output can be used to draw a ± error bar around y.fit.
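A setup like the one described can be recreated as follows. This is a sketch, not the actual "lm.R": the sample size, the distribution of the x values and the seed are assumptions, so the estimates will differ from the numbers on the slide. Using the residual standard error as the ± bar is one possible reading of the last bullet.

```r
set.seed(3)
n  <- 100
x1 <- runif(n)   # assumed distribution of the predictors
x2 <- runif(n)
x3 <- runif(n)

# Coefficients 1, 1, 0.5 and a random term runif/3, as described.
y <- 1 * x1 + 1 * x2 + 0.5 * x3 + runif(n) / 3

fit <- lm(y ~ x1 + x2 + x3)
coef(fit)        # estimates, listed under "Estimate" in summary(fit)

# Same model without intercept.
fit0 <- lm(y ~ x1 + x2 + x3 - 1)
coef(fit0)

# Fitted values with a simple +/- band of one residual standard error.
y.fit <- fitted(fit)
s     <- summary(fit)$sigma
band  <- cbind(lower = y.fit - s, fit = y.fit, upper = y.fit + s)
head(band)
```

The random term runif/3 has mean 1/6, so a small positive intercept is expected in the model with intercept.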
Conclusions
• Variables x with significant coefficients, i.e., Pr(>|t|) < alpha, are said to have an effect on y.
• Sometimes there are relations between the explaining variables, say x1 and x2 are correlated, like x2=2*x1.
• Then y=c1*x1+c2*x2 can be reduced to y=(c1+2*c2)*x1
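The case x2=2*x1 can be tried directly. In this sketch with simulated data (coefficients c1=1 and c2=0.5 are assumptions), lm cannot separate the two variables and reports no coefficient for the second one.

```r
set.seed(4)
x1 <- rnorm(30)
x2 <- 2 * x1                      # x2 is an exact multiple of x1

# Assumed true model with c1 = 1 and c2 = 0.5.
y <- 1 * x1 + 0.5 * x2 + rnorm(30, sd = 0.1)

# Both variables in the model: the coefficient of x2 comes out as NA.
fit <- lm(y ~ x1 + x2)
coef(fit)

# Reduced model: the slope estimates c1 + 2 * c2 = 2.
fit1 <- lm(y ~ x1)
coef(fit1)
```

An NA coefficient in the output of lm is therefore a hint to look for exact relations among the explaining variables.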
ANOVA
• With two different treatments we use the t-test to compare means.
• The influence of a factor/treatment with more than 2 variants is commonly analysed by ANOVA, i.e., more than two means are compared at the same time.
• The null hypothesis is that all sample means come from the same population [the treatment has no effect].
ANOVA
• In R it works like linear models, but with factors that influence the means.
• See dataset ANOVA.txt
• Try aov(y~f.c)
• A weak p-value: the effect may be unclear because of the other factors.
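The call can be tried on simulated data when ANOVA.txt is not available. This is a sketch: the factor f.c with levels 0, 1, 2 and the group means are assumptions, chosen so that the effect is clear.

```r
set.seed(5)

# Hypothetical factor with 3 levels, 10 observations each.
f.c <- factor(rep(c("0", "1", "2"), each = 10))

# Assumed group means 0, 1 and 2 with standard deviation 0.5.
y <- c(0, 1, 2)[as.integer(f.c)] + rnorm(30, sd = 0.5)

fit <- aov(y ~ f.c)
summary(fit)                            # ANOVA table with F value and Pr(>F)

p <- summary(fit)[[1]][["Pr(>F)"]][1]   # p-value for the factor f.c
p
```

Note that f.c must be a factor; a numeric column with values 0, 1, 2 would be treated as a regression variable instead.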
But which means do differ?
• f.c has 3 levels.
• It is not enough to just look at the means of each level.
• We must make all pairwise comparisons for significance.
• This is known as a "post-hoc" test.
• One is TukeyHSD.
• It gives a table of pairwise tests of means.
• Since the data is used more than once, we are more likely to find some effect by chance.
• HSD corrects the p-values for multiple tests.
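The post-hoc table can be produced from an aov fit. The sketch below reuses simulated data; the factor levels and group means are assumptions, not the values in ANOVA.txt.

```r
set.seed(5)
f.c <- factor(rep(c("0", "1", "2"), each = 10))
y   <- c(0, 1, 2)[as.integer(f.c)] + rnorm(30, sd = 0.5)

fit <- aov(y ~ f.c)

# All pairwise comparisons of level means with adjusted p-values.
tk <- TukeyHSD(fit)
tk$f.c   # rows "1-0", "2-0", "2-1"; columns diff, lwr, upr, p adj
```

Each row gives the difference of two level means, a confidence interval (lwr, upr) and the adjusted p-value for that pair.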
Post-hoc
• Almost significant effect when comparing group 1 with 0
• The p-values are adjusted for 3 tests
A graphical view
• plot(y~f.c) (if f.c is a factor, this draws one boxplot per level)
Compare with a t-test
• So the adjusted p-value of 0.06 from HSD is greater than the p-value of the t-test.
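The comparison can be sketched on simulated data. Two unadjusted tests are shown here as an assumption about what is being compared: pairwise.t.test with a pooled standard deviation, which uses the same variance estimate as the ANOVA, and a plain two-sample t.test on groups 0 and 1 only. The data are invented.

```r
set.seed(6)
f.c <- factor(rep(c("0", "1", "2"), each = 10))
y   <- c(0, 0.5, 1)[as.integer(f.c)] + rnorm(30)

fit <- aov(y ~ f.c)

# Adjusted p-value for group 1 versus 0 from the post-hoc test.
p.tukey <- TukeyHSD(fit)$f.c["1-0", "p adj"]

# Unadjusted pairwise t-tests with standard deviation pooled over all groups.
pw    <- pairwise.t.test(y, f.c, p.adjust.method = "none")
p.raw <- pw$p.value["1", "0"]

# Plain two-sample t-test using only the data of groups 0 and 1.
p.t <- t.test(y[f.c == "1"], y[f.c == "0"])$p.value

c(tukey = p.tukey, pooled.t = p.raw, two.sample.t = p.t)
```

The adjusted value is larger than the unadjusted pooled one because it accounts for all three comparisons being made at once.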
Ideas of clustering and modeling
• Clustering is a way to detect/display groups in data that might point to a factor which affects the sample.
• Different ways:
• Mapping: plot multivariate data in a special way to see groups
• Discriminant analysis: use a known factor (e.g., strain) to find a mapping that best separates the known groups
• Use the discriminant to classify new data!
PCA
• Download data PCA.txt
• See PCA.R to make a PCA for that multivariate data
• PC1 is the axis of the rotated data with maximal variance
• PC2 has smaller variance
• We could separate / discriminate the groups with a line drawn in the PC plot
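A PCA can be tried on simulated data with prcomp. This is a sketch and not the content of PCA.R: the three variables and the two hidden groups are assumptions made for illustration.

```r
set.seed(7)
n <- 50
g <- rep(c(1, 2), each = n / 2)          # hypothetical hidden grouping

# Three measured variables; the groups differ mainly in x1 and x2.
x1 <- rnorm(n) + 4 * (g == 2)
x2 <- 0.5 * x1 + rnorm(n, sd = 0.5)
x3 <- rnorm(n, sd = 0.3)
X  <- cbind(x1, x2, x3)

pca <- prcomp(X)
pca$sdev^2          # variances of PC1, PC2, PC3 in decreasing order

scores <- pca$x     # the rotated data, one column per component
head(scores)
```

Plotting the first two columns, e.g. plot(scores[, 1:2], col = g), shows the two groups separated along PC1.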
Linear Discriminant
• The original 3-class 3D data shown in a 2D LDA
• New data (squares) classified (predicted) according to the LDA
• Check LDA.R and LDA.txt to see similar results
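A comparable analysis can be sketched with lda from the MASS package, which is normally installed together with R. Whether LDA.R uses this function is an assumption; the 3-class 3D data and the new points below are simulated for illustration.

```r
library(MASS)   # provides lda()

set.seed(8)
n       <- 20
centers <- rbind(c(0, 0, 0), c(5, 5, 0), c(0, 5, 5))   # assumed class centers
cls     <- factor(rep(c("a", "b", "c"), each = n))

# 3D data: each point is its class center plus random noise.
X     <- centers[as.integer(cls), ] + matrix(rnorm(3 * n * 3), ncol = 3)
train <- data.frame(cls = cls, x1 = X[, 1], x2 = X[, 2], x3 = X[, 3])

fit <- lda(cls ~ x1 + x2 + x3, data = train)

# 2D view of the training data: columns LD1 and LD2.
proj <- predict(fit)$x
head(proj)

# Classify new, unlabeled points (here placed at the class centers).
new  <- data.frame(x1 = c(0, 5, 0), x2 = c(0, 5, 5), x3 = c(0, 0, 5))
pred <- predict(fit, newdata = new)$class
pred
```

With 3 classes there are at most 2 discriminant axes, which is why 3D data can be shown in a 2D plot.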