1 / 37

Linear Modelling I

Linear Modelling I. Richard Mott Wellcome Trust Centre for Human Genetics. Synopsis. Linear Regression Correlation Analysis of Variance Principle of Least Squares. Correlation. Correlation and linear regression. Is there a relationship? How do we summarise it? Can we predict new obs?

Download Presentation

Linear Modelling I

An Image/Link below is provided (as is) to download presentation Download Policy: Content on the Website is provided to you AS IS for your information and personal use and may not be sold / licensed / shared on other websites without getting consent from its author. Content is provided to you AS IS for your information and personal use only. Download presentation by click this link. While downloading, if for some reason you are not able to download a presentation, the publisher may have deleted the file from their server. During download, if you can't get a presentation, the file might be deleted by the publisher.

E N D

Presentation Transcript


  1. Linear Modelling I Richard Mott Wellcome Trust Centre for Human Genetics

  2. Synopsis • Linear Regression • Correlation • Analysis of Variance • Principle of Least Squares

  3. Correlation

  4. Correlation and linear regression • Is there a relationship? • How do we summarise it? • Can we predict new obs? • What about outliers?

  5. Correlation Coefficient r • r=1 perfect positive linear • r=-1 perfect negative linear • -1 < r < 1 • r=0 no relationship • r=0.6

  6. Examples of Correlation(taken from Wikipedia)

  7. Calculation of r • Data

  8. Correlation in R > cor(bioch$Biochem.Tot.Cholesterol,bioch$Biochem.HDL,use="complete") [1] 0.2577617 > cor.test(bioch$Biochem.Tot.Cholesterol,bioch$Biochem.HDL,use="complete") Pearson's product-moment correlation data: bioch$Biochem.Tot.Cholesterol and bioch$Biochem.HDL t = 11.1473, df = 1746, p-value < 2.2e-16 alternative hypothesis: true correlation is not equal to 0 95 percent confidence interval: 0.2134566 0.3010088 sample estimates: cor 0.2577617 > pt(11.1473,df=1746,lower.tail=FALSE) # T distribution on 1746 degrees of freedom [1] 3.154319e-28

  9. Linear Regression Fit a straight line to data • a intercept • b slope • ei error • Normally distributed • E(ei) = 0 • Var(ei) = s2

  10. Example: simulated data R code > # simulate 30 data points > x <- rnorm(30) > e <- rnorm(30) > x <- 1:30 > e <- rnorm(30,0,5) > y <- 1 + 3*x + e > # fit the linear model > f <- lm(y ~ x) > # plot the data and the predicted line > plot(x,y) > abline(reg=f) > print(f) Call: lm(formula = y ~ x) Coefficients: (Intercept) x -0.08634 3.04747

  11. Least Squares • Estimate a, b by least squares • Minimise sum of squared residuals between y and the prediction a+bx • Minimise

  12. Why least squares? • LS gives simple formulae for the estimates for a, b • If the errors are Normally distributed then the LS estimates are “optimal” In large samples the estimates converge to the true values No other estimates have smaller expected errors LS = maximum likelihood • Even if errors are not Normal, LS estimates are often useful

  13. Analysis of Variance (ANOVA) LS estimates have an important property: they partition the sum of squares (SS) into fitted and error components • total SS = fitting SS + residual SS • only the LS estimates do this

  14. ANOVA in R > anova(f) Analysis of Variance Table Response: y Df Sum Sq Mean Sq F value Pr(>F) x 1 20872.7 20872.7 965 < 2.2e-16 *** Residuals 28 605.6 21.6 > pf(965,1,28,lower.tail=FALSE) [1] 3.042279e-23

  15. Hypothesis testing • no relationship between y and x • Assume errors ei are independent and normally distributed N(0,s2) • If H0 is true then the expected values of the sums of squares in the ANOVA are • degrees freedom • Expectation • F ratio = (fitting MS)/(residual MS) ~ 1 under H0 • F >> 1 implies we reject H0 • F is distributed as F(1,n-2)

  16. Degrees of Freedom • Suppose are iid N(0,1) • Then ie n independent variables • What about ? • These values are constrained to sum to 0: • Therefore the sum is distributed as if it comprised one fewer observation, hence it has n-1 df (for example, its expectation is n-1) • In particular, if p parameters are estimated from a data set, then the residuals have p constraints on them, so they behave like n-p independent variables

  17. The F distribution • If e1….en are independent and identically distributed (iid) random variables with distribution N(0,s2), then: • e12/s2 … en2/s2 are each iid chi-squared random variables with chi-squared distribution on 1 degree of freedom c12 • The sum Sn = Si ei2/s2 is distributed as chi-squared cn2 • If Tm is a similar sum distributed as chi-squared cm2, but independent of Sn, then (Sn/n)/(Tm/m) is distributed as an F random variable F(n,m) • Special cases: • F(1,m) is the same as the square of a T-distribution on m df • for large m, F(n,m) tends to cn2

  18. ANOVA – HDL example HDL = 0.2308 + 0.4456*Cholesterol > ff <- lm(bioch$Biochem.HDL ~ bioch$Biochem.Tot.Cholesterol) > ff Call: lm(formula = bioch$Biochem.HDL ~ bioch$Biochem.Tot.Cholesterol) Coefficients: (Intercept) bioch$Biochem.Tot.Cholesterol 0.2308 0.4456 > anova(ff) Analysis of Variance Table Response: bioch$Biochem.HDL Df Sum Sq Mean Sq F value Pr(>F) bioch$Biochem.Tot.Cholesterol 1 149.660 149.660 1044 Residuals 1849 265.057 0.143 > pf(1044,1,28,lower.tail=FALSE) [1] 1.040709e-23

  19. correlation and ANOVA • r2= FSS/TSS = fraction of variance explained by the model • r2 = F/(F+n-2) • correlation and ANOVA are equivalent • Test of r=0 is equivalent to test of b=0 • T statistic in R cor.test is the square root of the ANOVA F statistic • r does not tell anything about magnitudes of estimates of a, b • r is dimensionless

  20. Effect of sample size on significance Total Cholesterol vs HDL data Example R session to sample subsets of data and compute correlations seqq <- seq(10,300,5) corr <- matrix(0,nrow=length(seqq),ncol=2) colnames(corr) <- c( "sample size", "P-value") n <- 1 for(i in seqq) { res <- rep(0,100) for(j in 1:100) { s <- sample(idx,i) data <- bioch[s,] co <- cor.test(data$Biochem.Tot.Cholesterol, data$Biochem.HDL,na="pair") res[j] <- co$p.value } m <- exp(mean(log(res))) cat(i, m, "\n") corr[n,] <- c(i, m) n <- n+1 }

  21. Calculating the right sample size n • The R library “pwr” contains functions to compute the sample size for many problems, including correlation pwr.r.test() and ANOVA pwr.anova.test()

  22. Problems with non-linearityAll plots have r=0.8 (taken from Wikipedia)

  23. Multiple Correlation • The R cor function can be used to compute pairwise correlations between many variables at once, producing a correlation matrix. • This is useful for example, when comparing expression of genes across subjects. • Gene coexpression networks are often based on the correlation matrix. • in R mat <- cor(df, na=“pair”) • computes the correlation between every pair of columns in df, removing missing values in a pairwise manner • Output is a square matrix correlation coefficients

  24. One-Way ANOVA • Model y as a function of a categorical variable taking p values • eg subjects are classified into p families • want to estimate effect due to each family and test if these are different • want to estimate the fraction of variance explained by differences between families – ( an estimate of heritability)

  25. One-Way ANOVA LS estimators average over group i

  26. One-Way ANOVA • Variance is partitioned in to fitting and residual SS residual SS with groups n-p degrees of freedom total SS n-1 fitting SS between groups p-1

  27. One-Way ANOVA Under Ho: no differences between groups F ~ F(p-1,n-p)

  28. One-Way ANOVA in R fam <- lm( bioch$Biochem.HDL ~ bioch$Family ) > anova(fam) Analysis of Variance Table Response: bioch$Biochem.HDL Df Sum Sq Mean Sq F value Pr(>F) bioch$Family 173 6.3870 0.0369 3.4375 < 2.2e-16 *** Residuals 1727 18.5478 0.0107 --- Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1 >

  29. Factors in R • Grouping variables in R are called factors • When a data frame is read with read.table() • a column is treated as numeric if all non-missing entries are numbers • a column is boolean if all non-missing entries are T or F (or TRUE or FALSE) • a column is treated as a factor otherwise • the levels of the factor are the set of distinct values • A column can be forced to be treated as a factor using the function as.factor(), or as a numeric vector using as.numeric() • BEWARE: If a numeric column contains non-numeric values (eg “N” being used instead of “NA” for a missing value, then the column is interpreted as a factor

  30. Linear Modelling in R • The R function lm() fits linear models • It has two principal arguments (and some optional ones) • f <- lm( formula, data ) • formula is an R formula • data is the name of the data frame containing the data (can be omitted, if the variables in the formula include the data frame)

  31. formulae in R • Biochem.HDL ~ Biochem$Tot.Cholesterol • linear regression of HDL on Cholesterol • 1 df • Biochem.HDL ~ Family • one-way analysis of variance of HDL on Family • 173 df • The degrees of freedom are the number of independent parameters to be estimated

  32. ANOVA in R • f <- lm(Biochem.HDL ~ Tot.Cholesterol, data=biochem) • [OR f <- lm(biochem$Biochem.HDL ~ biochem$Tot.Cholesterol)] • a <- anova(f) • f <- lm(Biochem.HDL ~ Family, data=biochem) • a <- anova(f)

  33. Non-parametric Methods • So far we have assumed the errors in the data are Normally distributed • P-values and estimates can be inaccurate if this is not the case • Non-parametric methods are a (partial) way round this problem • Make fewer assumptions about the distribution of the data • independent • identically distributed

  34. Non-Parametric CorrelationSpearman Rank Correlation Coefficient • Replace observations by their ranks • eg x= ( 5, 1, 4, 7 ) -> rank(x) = (3,1,2,4) • Compute sum of squared differences between ranks • in R: • cor( x, y, method=“spearman”) • cor.test(x,y,method=“spearman”)

  35. Spearman Correlation > cor.test(xx,y, method=“pearson”) Pearson's product-moment correlation data: xx and y t = 0.0221, df = 28, p-value = 0.9825 alternative hypothesis: true correlation is not equal to 0 95 percent confidence interval: -0.3566213 0.3639062 sample estimates: cor 0.004185729 > cor.test(xx,y,method="spearman") Spearman's rank correlation rho data: xx and y S = 2473.775, p-value = 0.01267 alternative hypothesis: true rho is not equal to 0 sample estimates: rho 0.4496607

  36. Non-ParametricOne-Way ANOVA • Kruskall-Wallis Test • Useful if data are highly non-Normal • Replace data by ranks • Compute average rank within each group • Compare averages • kruskal.test( formula, data )

  37. Permutation Testsas non-parametric tests • Example: One-way ANOVA: • permute group identity between subjects • count fraction of permutations in which the ANOVA p-value is smaller than the true p-value a = anova(lm( bioch$Biochem.HDL ~ bioch$Family)) p = a[1,5] pv = rep(0,1000) for( i in 1:1000) { perm = sample(bioch$Family,replace=FALSE) a = anova(lm( bioch$Biochem.HDL ~ perm )) pv[i] = a[1,5] } pval = mean(pv <p)

More Related