Statistics for bioinformatics: Filtering microarray data
Aims of filtering • Suppose • We have a set of 10000 genes whose expression is measured on our microarray chips. • We are looking at an experiment where gene expression is measured in 11 cancer patients and 7 normal individuals. • We want to know which genes have altered expression in cancerous cells (maybe they can be used as drug targets). • Genes whose expression is similar between cancer and normal individuals are not interesting and we want to filter them out.
What will be discussed • General background on statistics • Distributions • P-values, significance • Hypothesis testing • T-test • Analysis of variance • Nonparametric statistics • Application of statistics to filtering microarray data
Distributions • Distributions help to assign probabilities to subsets of the possible set of outcomes of an experiment. • The distribution function $F: \mathbb{R} \to [0,1]$ of a random variable X is given by $F(x) = P(X \le x)$. • Random variables can be discrete or continuous. X is discrete if it takes values in a countable subset of $\mathbb{R}$ (eg. number of heads in two coin tosses is 0, 1 or 2) and continuous if its distribution can be written as the integral of an integrable function f: $F(x) = \int_{-\infty}^{x} f(u)\,du$. • f is the probability density function (pdf) of X (f = F′).
Normal distribution • Also known as Gaussian • Symmetrical about the mean, “bell-shaped” • Completely specified by mean and variance – denoted: X is $N(\mu, \sigma^2)$ • Can transform to standard form, $Z = (X - \mu)/\sigma$ – Z is N(0,1) • Pdf is: $f(x) = \frac{1}{\sigma\sqrt{2\pi}} \exp\left(-\frac{(x-\mu)^2}{2\sigma^2}\right)$
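As a quick numerical sketch of the standardization above (the parameters mu = 10, sigma = 2 and the value x = 13 are invented for illustration), SciPy's normal distribution can verify that $P(X \le x)$ equals $P(Z \le z)$:

```python
# Check the standardization Z = (X - mu) / sigma numerically with SciPy.
from scipy.stats import norm

mu, sigma = 10.0, 2.0
x = 13.0

z = (x - mu) / sigma                      # transform to standard form
p1 = norm.cdf(x, loc=mu, scale=sigma)     # P(X <= x) for X ~ N(mu, sigma^2)
p2 = norm.cdf(z)                          # P(Z <= z) for Z ~ N(0, 1)
print(p1, p2)                             # identical: 0.9331... for both
```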
Central limit theorem • A lot of the statistical tests that we will discuss apply specifically for normal distributions… • …however, the central limit theorem says: • If $X_1, \ldots, X_n$ are (independent) items from a random sample drawn from any distribution with mean $\mu$ and positive variance $\sigma^2$, then $Y_n = \sqrt{n}(\bar{X}_n - \mu)/\sigma$ has a limiting distribution (as $n \to \infty$) which is normal with mean 0 and variance 1, where $\bar{X}_n = \frac{1}{n}\sum_{i=1}^{n} X_i$.
Central limit theorem • For a sample drawn from a normal distribution, $Y_n$ is exactly normally distributed with mean 0 and variance 1. • For other distributions, $Y_n$ is approximately normally distributed with mean 0 and variance 1, for large enough n. • This approximate normal distribution can be used to compute approximate probabilities concerning the sample mean, $\bar{X}_n$. • In practice, convergence is very rapid, eg. means of samples of 10 observations from the uniform distribution on [0,1] are very close to normal.
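A minimal simulation of the last point (sample size, repetition count and seed are arbitrary choices, not from the lecture): it standardizes means of 10 uniform observations and compares a tail probability against the standard normal.

```python
# Means of samples of 10 observations from uniform[0,1] are already
# very close to normal, as the CLT slide claims.
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(0)
n, reps = 10, 100_000
means = rng.uniform(0, 1, size=(reps, n)).mean(axis=1)

# Standardize using the uniform distribution's mean (1/2) and variance (1/12).
y = np.sqrt(n) * (means - 0.5) / np.sqrt(1 / 12)

# Compare the empirical tail with the standard normal tail.
print((y > 1.96).mean())        # simulated P(Y_n > 1.96), roughly 0.025
print(1 - norm.cdf(1.96))       # exact N(0,1) tail: 0.0249...
```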
Chi-squared ($\chi^2$) distribution and F distribution • If $X_1, \ldots, X_r$ are independent and N(0,1) then $\sum_{i=1}^{r} X_i^2$ has a chi-squared distribution with r degrees of freedom. • If you add independent chi-squared random variables with $r_i$ degrees of freedom, i = 1, …, k, you get a chi-squared random variable with $\sum_{i=1}^{k} r_i$ degrees of freedom. • Let U and V be independent variates distributed as chi-squared with m and n degrees of freedom. The ratio $F = \frac{U/m}{V/n}$ has an F distribution with parameters m and n. • NB the F distribution is completely determined by m & n. • Useful for statistical tests – see later.
Statistics • What is a statistic? A function of one or more random variables that does not depend on any unknown parameter • Eg. the sample mean $\bar{X} = \frac{1}{n}\sum_{i=1}^{n} X_i$ • $Z = (X - \mu)/\sigma$ is not a statistic unless $\mu$ & $\sigma$ are known • If interested in a random variable X, we may only have partial knowledge of its distribution. We can sample & use statistics to infer more info, eg. estimate unknown parameters. • Primary purpose of the theory of statistics: provide mathematical models for experiments involving randomness; make inferences from noisy data.
Hypothesis testing • A statistical hypothesis is an assertion about the distribution of one or more random variables (r.v.s). • In hypothesis testing, we have a null hypothesis H0 (eg. suppose we have a r.v. which we know is $N(\mu, 1)$ & our null is that $\mu = 0$) which we want to test against an alternative hypothesis H1 (eg. $\mu = 1$). • The test is a rule for deciding based on an experimental sample – usually we ask if a particular statistic is in some acceptance region or in the rejection (also called critical) region; if in the acceptance region keep the null, else reject. • The test has a power function which maps a potential underlying distribution for the r.v. to the probability of rejecting the null hypothesis given that distribution.
Significance and P-values • The significance level of a hypothesis test is the maximum value (actually supremum) of the power function of the test if H0 is true – ie. the worst-case probability of rejecting the null if it is true. Typical values are 0.01 or 0.05 (often expressed as 1% or 5%). • NB some texts refer to 95% significance, which by my definition would be 5%. • P-value = the probability that a statistic would assume a value greater than or equal to the observed value strictly by chance. • Eg. suppose we sample 1 value from our normal distribution with variance 1 and use this as our statistic. If the sample value is 0.9, this has P-value 0.184, since $P(X \le 0.9) = 0.816$ for the null hypothesis N(0,1). If we were testing at 5% significance, we would keep the null, since our P-value is > 0.05.
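The P-value in the example can be checked in one line with SciPy (a sketch, not part of the original lecture):

```python
# For X ~ N(0,1) and observed value 0.9, the one-sided P-value is P(X >= 0.9).
from scipy.stats import norm

p = norm.sf(0.9)     # survival function, 1 - cdf
print(p)             # 0.1841: keep the null at the 5% significance level
```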
Student's t-test • Suppose you have a sample $X_1, \ldots, X_n$ of independent random variables each with distribution $N(\mu, \sigma^2)$, then • The sample mean, $\bar{X} = \frac{1}{n}\sum_i X_i$, has distribution $N(\mu, \sigma^2/n)$ • For the sample variance, $S^2 = \frac{1}{n}\sum_i (X_i - \bar{X})^2$, the quantity $nS^2/\sigma^2$ has a $\chi^2_{n-1}$ distribution • $\bar{X}$ and $S^2$ are stochastically independent • Suppose you don't know the actual mean and variance. If you want to test (at some significance level) whether the actual mean takes a certain value then you can't look up P-values directly from the sample mean because you don't know $\sigma^2/n$.
Student’s t-test • Consider instead t-ratio (t-statistic) is given by where is N(0,1) and is [S is the sample standard deviation] • So by dividing (by an estimate of the standard deviation of ), we have eliminated the unknown . • This statistic has a “t distribution with n-1 degrees of freedom”.
Student’s t-test • A one-sample t-test compares the mean of a single column of numbers against a hypothetical mean you define: • H0: =0 • H1: 0 • Assume H0 is true and calculate the t-statistic: • A P-value is calculated from the t-statistic, using the pdf. This value is a measure of the significance of the deviation of the sample (column of numbers) from the mean. Normal way of assessing significance is to use a look-up table [cf example in next section].
Two-sample t-test • A two-sample t-test compares the means of two columns of numbers (independent samples) against one another on the assumption that they are normally distributed with the same (although unknown) variance $\sigma^2$. • Suppose we have a sample $X_1, \ldots, X_n$ and another $Y_1, \ldots, Y_m$ drawn from $N(\mu_1, \sigma^2)$ and $N(\mu_2, \sigma^2)$ respectively, then the difference in sample means is distributed as $N(\mu_1 - \mu_2, \sigma^2(1/n + 1/m))$ and the t-ratio is given by $T = \frac{(\bar{X} - \bar{Y}) - (\mu_1 - \mu_2)}{\sqrt{\frac{nS_X^2 + mS_Y^2}{n+m-2}\left(\frac{1}{n} + \frac{1}{m}\right)}}$
Two-sample t-test • We lay out our null and alternative hypotheses: • H0: $\mu_1 = \mu_2$ • H1: $\mu_1 \ne \mu_2$ • Assume H0 is true and calculate the T statistic: $T = \frac{\bar{X} - \bar{Y}}{\sqrt{\frac{nS_X^2 + mS_Y^2}{n+m-2}\left(\frac{1}{n} + \frac{1}{m}\right)}}$ • The T statistic follows a t-distribution with n+m−2 degrees of freedom.
Two-sample t-test • From the T statistic we can calculate a P-value, using the p.d.f. of a t-distribution with n+m−2 degrees of freedom. If the P-value is smaller than the desired significance level (T greater than a critical value), then reject the null hypothesis (there is a significant difference in means between the two samples). • Usually we just see if the T statistic exceeds a critical value, corresponding to some significance level, by looking up in a table. (Often significance is 5% – sometimes written 95%.) • [Example in next section of lecture].
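A corresponding two-sample sketch in SciPy, with invented cancer/normal expression values for a single gene; equal_var=True gives the pooled-variance test with n+m−2 degrees of freedom described above.

```python
# Two-sample t-test assuming equal (but unknown) variances.
import numpy as np
from scipy.stats import ttest_ind

cancer = np.array([7.2, 8.1, 6.9, 7.8, 8.4, 7.5])   # invented data
normal = np.array([5.9, 6.4, 6.1, 5.7])             # invented data
t_stat, p_value = ttest_ind(cancer, normal, equal_var=True)
print(t_stat, p_value)   # reject H0 at 5% significance if p_value < 0.05
```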
Two-sample t-test • Are the sample means different? The significance of the difference in means depends on the variances. [Figure from http://trochim.human.cornell.edu/kb/stat_t.htm]
Analysis of variance (ANOVA) • Another test to work out if the means of a set of samples are the same is called analysis of variance (ANOVA). • Eg. used for working out whether the expression of gene A in a microarray experiment is significantly different in cells from patients with cancer type A, with cancer type B and in normal individuals. • For two groups (eg. cancer and normal), ANOVA turns out to be equivalent to a t-test, but we can use ANOVA for more than two samples.
One-way ANOVA • The assumptions of analysis of variance are that the samples of interest are normally distributed, independent & have the same variances; however, research shows that the results of hypothesis tests using ANOVA are pretty robust to the assumptions being violated. If this happens, ANOVA tends to be conservative, ie. it will not reject the null hypothesis of equal means when it actually should – thus it will tend to underestimate significant effects of eg. drug response. • Suppose we have m samples, with the jth sample given by $X_{1j}, \ldots, X_{n_j j}$, from distributions $N(\mu_j, \sigma^2)$, where $\sigma^2$ is the same for each but unknown. • The null hypothesis is H0: $\mu_1 = \mu_2 = \ldots = \mu_m = \mu$, $\mu$ unspecified. • H1: at least one mean is different.
One-way ANOVA • We will test the hypothesis using two different estimates of the variance. • One estimate (called the Mean Square Error or "MSE" for short) is based on the variances within the samples. The MSE is an estimate of $\sigma^2$ whether or not the null hypothesis is true. • The 2nd estimate (Mean Square Between or "MSB" for short) is based on the variance of the sample means. The MSB is only an estimate of $\sigma^2$ if the null hypothesis is true. • If the null hypothesis is true, then MSE and MSB should be about the same since they are both estimates of the same quantity ($\sigma^2$); however, if H0 is false then MSB can be expected to be > MSE since MSB is estimating a quantity larger than $\sigma^2$.
Variance between groups • Let $\bar{X}_j$ represent the sample mean of the jth group (sample) and $\bar{X}$ the “grand mean” of all elements from all the groups. The variance between groups measures the deviations of the group means around the grand mean. • Sum of squares between groups (SSB): $SSB = \sum_{j=1}^{m} n_j (\bar{X}_j - \bar{X})^2$ [where $\bar{X} = \frac{1}{N}\sum_j \sum_i X_{ij}$ and $N = \sum_j n_j$.] • The variance between groups, also known as Mean square between (MSB), is given by the sum of squares divided by the degrees of freedom between (dfB): $MSB = SSB/df_B$, where $df_B = m - 1$.
Variance within groups • Here we want to know the total variance due to deviations within groups. • Sum of squares within groups (SSW): $SSW = \sum_{j=1}^{m} \sum_{i=1}^{n_j} (X_{ij} - \bar{X}_j)^2$ • To get the variance within, also known as mean squared error (MSE), we must divide by the degrees of freedom within, $df_W = N - m$. Roughly speaking this is because we have used up m degrees of freedom in estimating the group means (by their sample values) and so only have N−m independent ones left to estimate this variance: $MSE = SSW/df_W$
F-statistics • The F-statistic is the ratio of the variance between groups to the variance within groups: $F = MSB/MSE$ • If the F-statistic is sufficiently large then we will reject the null hypothesis that the means are equal. • The F-statistic is distributed according to an F distribution with degrees of freedom for the numerator = dfB and degrees of freedom for the denominator = dfW, ie. $F_{m-1, N-m}$. We can look up in an F table, or calculate using the probability density function, the P-value corresponding to a given value of the statistic on the distribution with parameters as given. We reject the null if this P-value is less than our significance level.
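A sketch of the full one-way ANOVA computation on three invented groups, checked against SciPy's built-in f_oneway:

```python
# One-way ANOVA by hand: SSB, SSW, MSB, MSE, F and the F-distribution P-value.
import numpy as np
from scipy.stats import f, f_oneway

groups = [np.array([6.1, 5.8, 6.5, 6.0]),   # invented data
          np.array([7.2, 6.9, 7.5]),
          np.array([5.2, 5.6, 5.0, 5.4])]

N = sum(len(g) for g in groups)
m = len(groups)
grand = np.concatenate(groups).mean()

ssb = sum(len(g) * (g.mean() - grand) ** 2 for g in groups)   # between groups
ssw = sum(((g - g.mean()) ** 2).sum() for g in groups)        # within groups
msb, mse = ssb / (m - 1), ssw / (N - m)
F = msb / mse
p = f.sf(F, m - 1, N - m)        # P-value from the F(m-1, N-m) distribution

print(F, p)
print(f_oneway(*groups))         # same statistic and P-value from SciPy
```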
Two-way analysis of variance • What analysis of variance actually does is to split the squared deviation from the grand mean into 2 parts: $\sum_j \sum_i (X_{ij} - \bar{X})^2 = \sum_j n_j (\bar{X}_j - \bar{X})^2 + \sum_j \sum_i (X_{ij} - \bar{X}_j)^2$ • In order to estimate the mean from a sample we actually find a value which minimizes the sum of squared residuals. Eg. to find group means we use values which minimize the second term above, and to find the grand mean we minimize the LHS term. • The values of these sums of squared residuals when the means take their maximum likelihood values (the variance terms above) give a measure of the likelihood of the means taking those values. So, as we have seen, the variances can be used to see how likely certain hypotheses about the mean are.
Two-way analysis of variance • The relative sizes of the LHS term and the 2nd term tell us how good a fit the single-parameter model (all means equal) is compared to the multiple-means model. • We use some degrees of freedom (independent sample data) to estimate the means and other d.o.f.s to see how good our hypotheses about the means are (via estimation of the variances). • Suppose now that we have 2 different factors affecting our microarray samples: eg. yeast cells in different concentrations of glucose at different temperatures. • Our model for the expression of gene A might involve both factors influencing the mean…
Two-way analysis of variance • We suppose that the sample at temperature j with glucose concentration k is $N(\mu_{jk}, \sigma^2)$ with $\mu_{jk} = \mu + b_j + c_k$ (where $\sum_j b_j = \sum_k c_k = 0$). • According to our model, the mean expression level can vary both with temperature and with glucose concentration. • If we want to test whether temperature affects gene expression level at 5% significance, then we take H0: $b_j = 0$ for all j, and proceed in a similar manner (although with different components of variance in the F-statistic) to before. • Clearly this can be extended to more than 2 factors – see Kerr et al (handout for homework).
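For the two-factor yeast example, a hedged sketch using statsmodels' formula interface (all data values are invented; the additive model mirrors $\mu + b_j + c_k$):

```python
# Two-way ANOVA: does temperature and/or glucose affect expression?
import pandas as pd
import statsmodels.api as sm
import statsmodels.formula.api as smf

df = pd.DataFrame({
    "expression":  [4.1, 4.3, 5.0, 5.2, 4.0, 4.4, 5.1, 5.3],   # invented
    "temperature": [25, 25, 25, 25, 30, 30, 30, 30],
    "glucose":     [0.1, 0.1, 1.0, 1.0, 0.1, 0.1, 1.0, 1.0],
})

# Additive model with both factors treated as categorical.
model = smf.ols("expression ~ C(temperature) + C(glucose)", data=df).fit()
print(sm.stats.anova_lm(model))   # F-test per factor: reject H0: b_j = 0
                                  # for all j if the temperature P-value < 0.05
```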
Nonparametric statistics • So far we have looked at statistical tests which are valid for normally-distributed data. • If we know that our data is (approximately) Gaussian (eg. in large sample-size limits by the Central Limit Theorem) these are useful and easy to use. • If our data deviates a lot from normal then we need other techniques. • Nonparametric techniques make no assumptions about the underlying distributions. • We will briefly discuss such an example: a rank randomization test equivalent to the Mann-Whitney U-test.
Randomization test • Best described by example:

Group 1   Group 2
   11         2
   14         9
    7         0
    8         5
Mean 10   Mean 4

• We want to know if the two groups have significantly different means • Work out the difference in means • Work out how many ways there are of dividing the total sample into two groups of four • Count how many of these lead to a bigger difference than the original two groups
Randomization tests • The difference in means is 6 • There are 70 = 8!/(4!4!) ways of dividing the data • There are only two other combinations that give a difference in means which is as large or larger: group 1 = {14, 11, 9, 7} and group 1 = {14, 11, 9, 8} • The probability of getting a difference in means in favour of group 1 (one-tailed test) as high as the original is therefore 3/70 = 0.0429. There are also 3 combinations that give differences in favour of group 2 of at least 6. So the 2-tailed P-value is 6/70 = 0.0857.
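The counts above can be reproduced exhaustively in a few lines of Python (a sketch of the same enumeration, not code from the lecture):

```python
# Enumerate all C(8,4) = 70 ways of splitting the eight values into two
# groups of four, and count splits at least as extreme as the observed one.
from itertools import combinations

data = [11, 14, 7, 8, 2, 9, 0, 5]          # group 1 values then group 2 values
observed = 10 - 4                           # observed difference in means, 6
total = sum(data)

count_one_tail = 0
count_two_tail = 0
for g1 in combinations(data, 4):
    diff = sum(g1) / 4 - (total - sum(g1)) / 4
    count_one_tail += diff >= observed      # in favour of group 1
    count_two_tail += abs(diff) >= observed # either direction

print(count_one_tail / 70)   # 3/70 = 0.0429 (one-tailed)
print(count_two_tail / 70)   # 6/70 = 0.0857 (two-tailed)
```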
Mann-Whitney U test • The problem with randomization tests is that as the number of samples and groups increases, the number of possible ways of dividing becomes extremely large – thus randomization tests are hard to compute. • A simplification involves replacing the data by ranks (ie. the smallest value is replaced by 1, the next by 2, …). A randomization test is then performed on the ranks:

Data:                  Ranks:
Group 1   Group 2      Group 1   Group 2
   11         2            7         2
   14         9            8         6
    7         0            4         1
    8         5            5         3
Rank randomization test • Calculate the difference in the summed ranks of the two groups: 12 here (24 − 12). • The problem is then to work out how many of the 70 ways of rearranging the numbers 1, …, 8 into two groups give a difference in group sums which is ≥ 12 (one-tailed; modulus ≥ 12 for two-tailed). • This problem doesn't depend on the exact data, so standard values can be tabulated. For a given data set just use a lookup table. • The rank randomization test for the differences between two groups is called the Wilcoxon Rank Sum test. It is the same as the Mann-Whitney U-test, although this uses a different test statistic. • Clearly information is lost in converting from real data to ranks, so the test is not as powerful as randomization tests, but it is easier to compute.
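For comparison, SciPy implements the equivalent Mann-Whitney U-test directly (a sketch on the same data; note the rank-based P-value differs from the raw-data randomization value above, because ranks discard the spacing between observations):

```python
# Mann-Whitney U-test on the example data; for small samples without ties
# SciPy computes the exact permutation distribution of the ranks.
from scipy.stats import mannwhitneyu

group1 = [11, 14, 7, 8]
group2 = [2, 9, 0, 5]
u_stat, p_value = mannwhitneyu(group1, group2, alternative="two-sided")
print(u_stat, p_value)   # exact two-sided P-value 8/70 = 0.114 here
```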
Statistics summary • We have discussed several ways of assessing significant differences between the means of two or more samples. • For normally distributed samples with equal variances, we have two methods: • T test (for comparing two samples) • Analysis of variance (for comparing two or more groups) • The central limit theorem shows that the mean of a very large sample follows an approximately normal distribution; however, for small & non-normally distributed samples non-parametric methods may be necessary.
Statistics summary • These techniques are useful in analysing microarray data because we want to infer from noisy data which genes vary significantly in their expression over a variety of conditions - NB since the conditions correspond to the groups, we will generally need several repeats of the microarray experiments under the “same” conditions in order to apply these techniques. [References for more info on statistics, esp. statistical tests: Introduction to mathematical statistics by Hogg & Craig (Maxwell Macmillan); http://davidmlane.com/hyperstat/index.html ]
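Finally, a minimal sketch (on random placeholder data, not real arrays) of the filtering application from the start of the lecture: a per-gene two-sample t-test across 11 cancer and 7 normal chips, keeping genes whose P-value falls below the significance level. Note that with 10000 genes roughly 500 would pass at the 5% level by chance alone, so in practice some correction for multiple testing is needed.

```python
# Filter a 10000-gene expression matrix: per-gene two-sample t-test,
# 11 cancer columns vs 7 normal columns. Values here are random placeholders.
import numpy as np
from scipy.stats import ttest_ind

rng = np.random.default_rng(1)
n_genes = 10_000
cancer = rng.normal(size=(n_genes, 11))    # 11 cancer patients
normal = rng.normal(size=(n_genes, 7))     # 7 normal individuals

# Vectorized: one pooled-variance t-test per gene (per row).
t_stats, p_values = ttest_ind(cancer, normal, axis=1, equal_var=True)

alpha = 0.05
kept = np.flatnonzero(p_values < alpha)    # candidate differentially expressed genes
print(len(kept), "genes pass the filter")  # ~500 expected by chance alone here
```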