Introduction to Bioinformatics 5. Statistical Analysis of Gene Expression Matrices I • Course 341, Department of Computing, Imperial College, London • Moustafa Ghanem
Lecture Overview • Motivation • Identifying differentially expressed genes • Calculating effect: fold ratio • Calculating significance: p-values • Statistical Analysis • Paired and unpaired experiments • Need for significance testing • Hypothesis testing • t-tests and p-values • t-tests • Paired and unpaired t-tests • Formulae for t-test • Single-tail vs. two tails t-tests • Looking up p-values
Motivation: Large-scale Differential Gene Expression Analysis • Consider a microarray experiment that measures gene expression in two groups of rat tissue (>5000 genes in each experiment). • The rat tissues come from two groups: • WT: Wild-Type rat tissue • KO: Knock-Out treatment rat tissue • Gene expression for each group is measured under similar conditions • Question: Which genes are affected by the treatment? How significant is the effect? How big is the effect?
Calculating Expression Ratios • In Differential Gene Expression Analysis, we are interested in identifying genes with different expression across two states, e.g.: • Tumour cell lines vs. Normal cell lines • Treated tissue vs. diseased tissue • Different tissues, same organism • Same tissue, different organisms • Same tissue, same organism • Time course experiments • We can quantify the difference (effect) by taking a ratio • i.e. for gene k, the ratio of its expression in state a to its expression in state b • This provides a relative value of change (e.g. expression has doubled) • If the expression level has not changed, the ratio is 1
Fold change (Fold ratio) • A gene is up-regulated in state 2 compared to state 1 if it has a higher value in state 2 • A gene is down-regulated in state 2 compared to state 1 if it has a lower value in state 2 • Ratios are troublesome since up-regulated and down-regulated genes are treated differently • Genes up-regulated by a factor of 2 have a ratio of 2 • Genes down-regulated by the same factor (2) have a ratio of 0.5 • As a result • down-regulated genes are compressed between 1 and 0 • up-regulated genes expand between 1 and infinity • Taking the logarithm to base 2 of the ratio rectifies the problem; the result is typically known as the fold change
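The symmetry that the base-2 logarithm restores can be written out explicitly. The notation $E_k^{(a)}$ for the expression of gene k in state a is an assumption introduced here for illustration; it is not the slide's own notation.

```latex
r_k = \frac{E_k^{(a)}}{E_k^{(b)}}, \qquad
\log_2 r_k = \log_2 E_k^{(a)} - \log_2 E_k^{(b)}

\log_2 2 = 1 \quad (\text{doubled}), \qquad
\log_2 \tfrac{1}{2} = -1 \quad (\text{halved}), \qquad
\log_2 1 = 0 \quad (\text{no change})
```

A doubling and a halving now sit at equal distances (+1 and -1) either side of zero, instead of at 2 and 0.5 either side of 1.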
Examples of fold change • You can calculate fold change between pairs of expression values: • e.g. between State 1 and State 2 for gene A • Or between the mean values of all measurements for a gene in the WT/KO experiments • mean(WT1..WT4) vs mean(KO1..KO4) • [Table of example genes: A, B and D are down-regulated, C is up-regulated, E has no change]
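As a minimal sketch of the pairwise calculation, the plain ratio and its base-2 logarithm can be computed as follows. The function names and the expression values are hypothetical, chosen only to show the doubled, halved and unchanged cases.

```python
import math


def fold_ratio(state_2: float, state_1: float) -> float:
    """Plain ratio of expression in state 2 to state 1 (1.0 means no change)."""
    return state_2 / state_1


def log2_fold_change(state_2: float, state_1: float) -> float:
    """Base-2 logarithm of the ratio (0.0 means no change)."""
    return math.log2(state_2 / state_1)


# Hypothetical expression values for three genes (state 1, state 2).
doubled = log2_fold_change(200.0, 100.0)    # up-regulated by a factor of 2
halved = log2_fold_change(50.0, 100.0)      # down-regulated by a factor of 2
unchanged = log2_fold_change(100.0, 100.0)  # no change

print(fold_ratio(200.0, 100.0), fold_ratio(50.0, 100.0))  # ratios 2.0 and 0.5
print(doubled, halved, unchanged)                         # 1.0, -1.0 and 0.0
```

The ratios 2 and 0.5 become +1 and -1 after the transform, so up- and down-regulation by the same factor have the same magnitude.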
Statistics: Back to our problem • [Figure: expression matrix; 5000 rows represent genes, columns represent samples: 4 Wild Type samples (blue) and 4 KO samples (red)]
Statistics: Significance of Fold Change • For our problem we can calculate an average fold ratio for each gene (each row) • This will give us an average effect value for each gene • e.g. 2, 1.7, 10, 100, etc. • Question: which of these values are significant? • We can use a threshold, but what threshold value should we set? • Use statistical techniques based on the number of members in each group, the type of measurements, etc. -> significance testing.
Statistics: 5000 separate statistical problems • How do we think about this problem? • Effectively: • 5000 separate experiments, where each experiment measures the expression of one gene in two groups of 4 individuals • For each experiment (gene), we want to establish if there is a statistical difference between the reported values in each group • We then want to identify those genes (across the 5000 genes) that have a significant change • Each row in our table resembles a traditional statistical analysis problem
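Treating each row as its own small experiment can be sketched as a loop over the matrix. This is an illustration under stated assumptions: the gene names and values are invented, the first four columns are taken to be WT and the last four KO, and the effect is reported as KO relative to WT.

```python
import math
from statistics import mean

# Hypothetical expression matrix: one row per gene, 4 WT columns then 4 KO columns.
expression = {
    "gene_A": [100.0, 110.0, 90.0, 100.0, 200.0, 210.0, 190.0, 200.0],
    "gene_B": [80.0, 80.0, 80.0, 80.0, 80.0, 80.0, 80.0, 80.0],
}
N_WT = 4  # assumed column layout: WT samples come first


def row_log2_fold_change(row, n_wt=N_WT):
    """log2 of mean(KO) / mean(WT) for a single gene (one row of the matrix)."""
    wt, ko = row[:n_wt], row[n_wt:]
    return math.log2(mean(ko) / mean(wt))


# One effect value per gene; significance still has to be tested row by row.
effects = {gene: row_log2_fold_change(row) for gene, row in expression.items()}
print(effects)
```

Each entry of `effects` is only the size of the change; deciding whether it is significant is the separate per-row test that the following slides build up to.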
Statistics: Unpaired statistical experiments • [Figure: Group 1 members and Group 2 members, each measured under one condition] • Overall setting: 2 groups of 4 individuals each • Group 1: Imperial students • Group 2: UCL students • Experiment 1: • We measure the height of all students • We want to establish if members of one group are consistently (or on average) taller than members of the other, and if the measured difference is significant • Experiment 2: • We measure the weight of all students • We want to establish if members of one group are consistently (or on average) heavier than the other, and if the measured difference is significant • Experiment 3: • ………
Statistics: Unpaired statistical experiments • [Figure: Group 1 members and Group 2 members, each measured under one condition] • In unpaired experiments, you typically have two groups of people that are not related to one another, and you measure some property for each member of each group • e.g. to test whether a new drug is effective or not, you divide similar patients into two groups: • One group takes the drug • The other group takes a placebo • You measure (quantify) the effect in both groups some time later • You want to establish whether there is a significant difference between the two groups at that later point
Statistics: Unpaired statistical experiments • The WT/KO example is an unpaired experiment if the rats in the experiments are different!
Statistics: Unpaired statistical experiments • How do we address the problem? • Compare the two sets of results (alternatively, calculate the mean for each group and compare the means) • Graphically: • Scatter plots • Box plots, etc. • Statistically: • Use an unpaired t-test • [Figures: are these two series significantly different?]
Statistics: Paired statistical experiments • [Figure: one set of group members, each measured under Condition 1 and Condition 2] • Overall setting: 1 group of 4 individuals • Group 1: Imperial students • We make measurements for each student in two situations • Experiment 1: • We measure the height of all students before the Bioinformatics course and after the Bioinformatics course • We want to establish if the Bioinformatics course consistently (or on average) affects students' heights • Experiment 2: • We measure the weight of all students before the Bioinformatics course and after the Bioinformatics course • We want to establish if the Bioinformatics course consistently (or on average) affects students' weights • Experiment 3: • ………
Statistics: Paired statistical experiments • [Figure: one set of group members, each measured under Condition 1 and Condition 2] • In paired experiments, you typically have one group of people and measure some property for each member before and after a particular event (so measurements come in pairs of before and after) • e.g. you want to test the effectiveness of a new cream for tanning • You measure the tan in each individual before the cream is applied • You measure the tan in each individual after the cream is applied • You want to establish whether there is a significant difference between measurements before and after applying the cream for the group as a whole
Statistics: Paired statistical experiments • The WT/KO example is a paired experiment if the rats in the experiments are the same!
Statistics: Paired statistical experiments • How do we address the problem? • Calculate the difference for each pair • Compare the differences to zero • Alternatively, compare the average difference to zero • Graphically: • Scatter plot of the differences • Box plots, etc. • Statistically: • Use a paired t-test • [Figure: are the differences close to zero?]
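The first steps for paired data (one difference per pair, then the average difference compared against zero) can be sketched as below. The before/after values are hypothetical measurements for four individuals.

```python
from statistics import mean

# Hypothetical paired measurements: the same 4 individuals, before and after.
before = [10.0, 12.0, 9.0, 11.0]
after = [11.0, 14.0, 9.5, 12.5]

# One difference per pair; pairing is by position in the two lists.
differences = [a - b for b, a in zip(before, after)]
mean_difference = mean(differences)

print(differences)      # per-individual change
print(mean_difference)  # average change, to be compared against zero
```

Whether an average change of this size is distinguishable from zero depends on how much the individual differences vary, which is what the paired t-test quantifies.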
Statistics: Significance testing • In both cases (paired and unpaired) you want to establish whether the difference is significant • Significance testing is a statistical term and refers to estimating (numerically) the probability of a measurement occurring by chance. • To do this, you need to review some basic statistics • Normal distributions: mean, standard deviations, etc. • Hypothesis testing • t-distributions • t-tests and p-values
Mean and standard deviation • Mean and standard deviation tell you the basic features of a distribution • mean = average value of all members of the group: u = (x1 + x2 + x3 + … + xN)/N • standard deviation = a measure of how much the values of individual members vary in relation to the mean • The normal distribution is symmetrical about the mean • 68% of the normal distribution lies within 1 s.d. of the mean
Note on s.d. calculation • Throughout the following slides and in the tutorials, I use the simple form of the standard deviation formula • Some people use the unbiased form instead (for good reasons) • Please use the simple form if you want the answers to add up at the end
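The two conventions can be compared side by side with Python's standard library. This sketch assumes that the "simple form" divides the summed squared deviations by N and that the "unbiased form" divides by N - 1; the slide's own formulae are not reproduced in the text, so that reading is an assumption. The data values are hypothetical.

```python
from statistics import pstdev, stdev

# Hypothetical measurements.
values = [2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0]

simple_sd = pstdev(values)   # divides by N (assumed to be the "simple form")
unbiased_sd = stdev(values)  # divides by N - 1 (the "unbiased form")

print(simple_sd, unbiased_sd)
```

The N - 1 version is always slightly larger, and the gap matters most for small groups such as the 4-sample groups used here, which is why mixing the two conventions changes the final answers.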
The Normal Distribution • Many continuous variables follow a normal distribution, and it plays a special role in the statistical tests we are interested in • The x-axis represents the values of a particular variable • The y-axis represents the proportion of members of the population that have each value of the variable • The area under the curve represents probability, i.e. the area under the curve between two values on the x-axis represents the probability of an individual having a value in that range • [Figure: normal curve with 68% of the distribution within 1 s.d. either side of the mean]
Normal Distribution and Confidence Intervals • Any normal distribution can be transformed to a standard normal distribution (mean 0, s.d. 1) using a simple transform • [Figure: standard normal curve with central area 1-a = 0.95 between -1.96 and 1.96, and a/2 = 0.025 in each tail] • Each tail area (0.025) is a p-value: the probability that a value drawn from this distribution lies beyond the corresponding cut-off
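The standardising transform and the 0.95 / 0.025 areas quoted for the cut-offs -1.96 and 1.96 can be checked with the standard library's `NormalDist`. The helper `z_score` and its example inputs are hypothetical names and values used only for this sketch.

```python
from statistics import NormalDist


def z_score(x: float, mu: float, sigma: float) -> float:
    """Map a value from a normal(mu, sigma) distribution onto the standard normal."""
    return (x - mu) / sigma


standard = NormalDist(mu=0.0, sigma=1.0)

central_area = standard.cdf(1.96) - standard.cdf(-1.96)  # about 0.95
upper_tail_area = 1.0 - standard.cdf(1.96)               # about 0.025
critical_z = standard.inv_cdf(0.975)                     # about 1.96

print(z_score(110.0, 100.0, 5.0))  # 2 standard deviations above the mean
print(central_area, upper_tail_area, critical_z)
```

Going from a chosen tail area to a cut-off (`inv_cdf`) is the same lookup that a critical-value table performs.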
Hypothesis Testing (Unpaired): Are two data sets different? • In unpaired experiments, we compare the difference between the means. • We use a z-test (normal distribution) if the standard deviations of the two populations from which the data sets came are known (and are the same) • We pose a null hypothesis that the means are equal • We try to refute the hypothesis by using the curves to calculate the probability (p) of seeing a difference at least as large as the observed one if the null hypothesis were true (both means equal) • If the probability is low (low p), reject the null hypothesis and accept the alternative hypothesis (the means are different) • If the probability is high (high p), do not reject the null hypothesis (the data are consistent with equal means) • If the standard deviation is known use the z-test, else use the t-test • [Figure: Population 1 and Population 2 distributions under Ho and under Ha]
Comparing Two Samples: Graphical interpretation • To compare two groups you can plot their distributions and compare the means graphically. • The graphical comparison lets you see the distributions of the two groups. • If the p-value is low, chances are there will be little overlap between the two distributions. If the p-value is not low, there will be a fair amount of overlap between the two groups. • We can set a critical value on the x-axis based on the p-value threshold
Hypothesis Testing (Paired): Are two data sets different? • In paired experiments, we compare the mean difference. • We use a z-test (normal distribution) if the standard deviations of the two populations from which the data sets came are known • We pose a null hypothesis that the mean difference is zero • We try to refute the hypothesis by using the curves to calculate the probability (p) of seeing a mean difference at least as large as the observed one if the null hypothesis were true (mean difference is 0) • If the probability is low (low p), reject the null hypothesis and accept the alternative hypothesis (mean difference <> 0) • If the probability is high (high p), do not reject the null hypothesis (the data are consistent with a mean difference of 0) • If the standard deviation is known use the z-test, else use the t-test • [Figure: Population 1 and Population 2 distributions under Ho and under Ha]
The t-test • Typically known as Student's t-test • In most cases we use what is known as a t-test rather than the z-test when comparing samples. • In particular when we have • small data sets (less than 30 each) and • we don't know the s.d. and have to calculate it from the small samples • The same concepts as before apply, but we base the test on what is known as the t-distribution, which resembles the normal distribution but has heavier tails when samples are small • We have to calculate what is known as a t-value!
The t-distribution • In fact we have many t-distributions; each one is defined by its number of degrees of freedom (d.f.), also written as v • We will see how we calculate the degrees of freedom in a short while • [Figure: normal distribution compared with a t-distribution]
t-test terminology • t-test: Used to compare the mean of a sample to a known number (often 0). • Assumptions: Subjects are randomly drawn from a population and the distribution of the mean being tested is normal. • Test: The hypotheses for a single sample t-test are: • Ho: u = u0 • Ha: u <> u0 (where u0 denotes the hypothesized value to which you are comparing a population mean) • p-value: the probability, if the null hypothesis is true, of a result at least as extreme as the one observed; it is the risk of error taken when rejecting the hypothesis of no difference
t-test terminology: Single-tail vs. two-tail • What am I testing for: • Right Tail (group1 > group2): H0: μ1 ≤ μ2, H1: μ1 > μ2 OR H0: μ1 - μ2 ≤ 0, H1: μ1 - μ2 > 0 • Left Tail (group1 < group2): H0: μ1 ≥ μ2, H1: μ1 < μ2 OR H0: μ1 - μ2 ≥ 0, H1: μ1 - μ2 < 0 • Two Tail (both groups are different but I don't care how): H0: μ1 = μ2, H1: μ1 ≠ μ2 OR H0: μ1 - μ2 = 0, H1: μ1 - μ2 ≠ 0
t-test terminology: Unpaired vs. paired t-test • Same as before! It depends on your experiment • Unpaired t-test: The hypotheses for the comparison of two independent groups are: • Ho: u1 = u2 (means of the two groups are equal) • Ha: u1 <> u2 (means of the two groups are not equal) • Paired t-test: The hypotheses for paired measurements in the same individuals are: • Ho: D = 0 (the difference between the two observations is 0) • Ha: D <> 0 (the difference is not 0)
Calculating t-test (t statistic) • Remember these formulae! • First calculate the t statistic value and then calculate the p-value • For the paired t-test, t is calculated using a formula in d and n [formula image not preserved], where d is calculated from the paired measurements [formula image not preserved] and n is the number of pairs being tested • For an unpaired (independent group) t-test, t is calculated using a formula [formula image not preserved] in which σ(x) is the standard deviation of x and n(x) is the number of elements in x
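As a hedged illustration of the two calculations, the sketch below uses common textbook forms of the t statistic, which may differ in detail from the slide's own formulae (these are not reproduced in the text). In particular it uses the N - 1 sample variance, whereas the lecture asks for the simple s.d. form, so numbers from the tutorials may not match exactly. The function names and data are hypothetical.

```python
import math
from statistics import mean, variance


def paired_t(before, after):
    """Textbook paired t: mean difference divided by its standard error."""
    diffs = [a - b for b, a in zip(before, after)]
    n = len(diffs)  # number of pairs
    return mean(diffs) / math.sqrt(variance(diffs) / n)


def unpaired_t(x, y):
    """Textbook unpaired t: difference of means over the combined standard error.

    With equal group sizes this equals the pooled-variance statistic.
    """
    standard_error = math.sqrt(variance(x) / len(x) + variance(y) / len(y))
    return (mean(x) - mean(y)) / standard_error


# Hypothetical data: 4 pairs, and two independent groups of 4.
t_paired = paired_t([10.0, 12.0, 9.0, 11.0], [11.0, 14.0, 9.5, 12.5])
t_unpaired = unpaired_t([5.0, 6.0, 7.0, 8.0], [1.0, 2.0, 3.0, 4.0])
print(t_paired, t_unpaired)
```

In both cases t grows when the difference is large relative to the spread of the measurements, which is the quantity the p-value lookup then converts into a probability.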
Calculating p-value for t-test • When carrying out a test, a P-value can be calculated based on the t-value and the 'degrees of freedom'. • There are three methods for calculating P, each expressed in terms of p(t,v), which is looked up from the t-distribution table: • One-tailed > • One-tailed < • Two-tailed • The number of degrees of freedom (v) is calculated as: • Unpaired: n(x) + n(y) - 2 • Paired: n - 1 (where n is the number of pairs)
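Where no table is to hand, the lookup can be approximated numerically. The sketch below integrates the t-distribution density with Simpson's rule; the slide's own expressions for the three cases are not preserved, so the standard definitions are assumed here (right tail P(T > t), left tail P(T < t), two-tailed 2·P(T > |t|)). All function names are hypothetical, and `math.gamma` limits this version to modest degrees of freedom.

```python
import math


def t_pdf(x: float, v: int) -> float:
    """Density of Student's t-distribution with v degrees of freedom."""
    coeff = math.gamma((v + 1) / 2) / (math.sqrt(v * math.pi) * math.gamma(v / 2))
    return coeff * (1 + x * x / v) ** (-(v + 1) / 2)


def upper_tail(t: float, v: int, steps: int = 2000) -> float:
    """P(T > t), using Simpson's rule on [0, |t|] (steps must be even)."""
    a = abs(t)
    h = a / steps
    total = t_pdf(0.0, v) + t_pdf(a, v)
    for i in range(1, steps):
        total += (4 if i % 2 else 2) * t_pdf(i * h, v)
    tail = 0.5 - total * h / 3  # area beyond |t| on one side
    return tail if t >= 0 else 1.0 - tail


def p_value(t: float, v: int, tail: str = "two") -> float:
    """p-value for a right-tailed, left-tailed or two-tailed test."""
    if tail == "right":
        return upper_tail(t, v)
    if tail == "left":
        return 1.0 - upper_tail(t, v)
    return 2.0 * upper_tail(abs(t), v)


def df_unpaired(n_x: int, n_y: int) -> int:
    return n_x + n_y - 2


def df_paired(n_pairs: int) -> int:
    return n_pairs - 1


print(p_value(1.812, 10, tail="right"))  # close to 0.05
print(p_value(2.0154, 44))               # two-tailed, close to 0.05
```

The two printed cases reuse the critical values 1.812 and 2.0154 that appear on the later table slides; the degrees of freedom 10 and 44 are the values at which those table entries correspond to the tail areas shown.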
p-values • Results of the t-test: If the p-value associated with the t-test is small (the threshold is usually set at p < 0.05), there is evidence to reject the null hypothesis in favour of the alternative. • In other words, there is evidence that the mean is significantly different from the hypothesized value. • If the p-value associated with the t-test is not small (p > 0.05), there is not enough evidence to reject the null hypothesis, and you cannot conclude that the mean is different from the hypothesized value.
Calculating t and p values • You will usually use a piece of software to calculate t and P (Excel provides that!). • In problems, you can assume access to a function p(t,v) which calculates p for a given t value and v (number of degrees of freedom) • or alternatively have a table indexed by critical t values and v
t-value and p-value • Given a t-value, and degrees of freedom, you can look-up a p-value • Alternatively, if you know what p-value you need (e.g. 0.05) and degrees of freedom you can set the threshold for critical t
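t-test Interpretation • [Figure: t-distribution with rejection regions (Reject H0) of area .025 in each tail, beyond t = -2.0154 and t = 2.0154] • Note: as t increases, p decreases • The calculated t value must exceed the critical t from the table at the chosen P level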
Finding a critical t • The table provides the t values (tc) for which P(tx > tc) = A • [Figure: t-distribution with tail areas A = .05 beyond -tc = -1.812 and tc = 1.812; table columns t.100, t.05, t.025, t.01, t.005]
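The reverse lookup (from a tail area A to a critical value tc) can be sketched by bisecting on the upper-tail probability. This is an illustrative numerical approach, not the table itself; the function names are hypothetical, and the degrees of freedom used in the example (10 and 44) are assumptions chosen because they reproduce the 1.812 and 2.0154 values shown on the slides.

```python
import math


def t_pdf(x: float, v: int) -> float:
    """Density of Student's t-distribution with v degrees of freedom."""
    coeff = math.gamma((v + 1) / 2) / (math.sqrt(v * math.pi) * math.gamma(v / 2))
    return coeff * (1 + x * x / v) ** (-(v + 1) / 2)


def upper_tail(t: float, v: int, steps: int = 2000) -> float:
    """P(T > t) for t >= 0, using Simpson's rule on [0, t] (steps must be even)."""
    h = t / steps
    total = t_pdf(0.0, v) + t_pdf(t, v)
    for i in range(1, steps):
        total += (4 if i % 2 else 2) * t_pdf(i * h, v)
    return 0.5 - total * h / 3


def critical_t(area: float, v: int) -> float:
    """Find tc such that P(T > tc) = area, by bisection on [0, 50]."""
    lo, hi = 0.0, 50.0
    for _ in range(50):
        mid = (lo + hi) / 2
        if upper_tail(mid, v) > area:
            lo = mid  # tail still too large: move the cut-off outwards
        else:
            hi = mid
    return (lo + hi) / 2


print(round(critical_t(0.05, 10), 3))   # one-tail area .05, 10 d.f.
print(round(critical_t(0.025, 44), 4))  # area .025 per tail, 44 d.f.
```

A calculated t beyond the returned cut-off falls in the rejection region for that tail area, which is the comparison the table is used for.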
Summary • Differential analysis • Uses fold ratio (fold change) for measuring effect • Need some measure of significance of such effect. • Statistical analysis • Paired vs. unpaired experiments • t-tests • Calculating t for paired/un-paired experiments • Deciding single tail vs. two-tail • Calculating degrees of freedom • Look-up p value