Introduction to Bioinformatics 5. Statistical Analysis of Gene Expression Matrices I

Introduction to Bioinformatics 5. Statistical Analysis of Gene Expression Matrices I • Course 341, Department of Computing, Imperial College, London • Moustafa Ghanem

Lecture Overview • Motivation • Identifying differentially expressed genes • Calculating effect: fold ratio • Calculating significance: p-values • Statistical Analysis • Paired and unpaired experiments • Need for significance testing • Hypothesis testing • t-tests and p-values • t-tests • Paired and unpaired t-tests • Formulae for t-test • Single-tail vs. two-tail t-tests • Looking up p-values

Motivation: Large-scale Differential Gene Expression Analysis • Consider a microarray experiment that measures gene expression in two groups of rat tissue (>5000 genes in each experiment). • The rat tissues come from two groups: • WT: Wild-Type rat tissue • KO: Knock-Out treatment rat tissue • Gene expression for each group is measured under similar conditions • Question: Which genes are affected by the treatment? How significant is the effect? How big is the effect?

Calculating Expression Ratios • In Differential Gene Expression Analysis, we are interested in identifying genes with different expression across two states, e.g.: • Tumour cell lines vs. normal cell lines • Treated tissue vs. diseased tissue • Different tissues, same organism • Same tissue, different organisms • Same tissue, same organism • Time course experiments • We can quantify the difference (effect) by taking a ratio • i.e. for gene k, the ratio is its expression in state a divided by its expression in state b • This provides a relative value of change (e.g. expression has doubled) • If the expression level has not changed, the ratio is 1

Fold change (Fold ratio) • A gene is up-regulated in state 2 compared to state 1 if it has a higher value in state 2 • A gene is down-regulated in state 2 compared to state 1 if it has a lower value in state 2 • Ratios are troublesome since up-regulated and down-regulated genes are treated differently: • Genes up-regulated by a factor of 2 have a ratio of 2 • Genes down-regulated by the same factor (2) have a ratio of 0.5 • As a result: • down-regulated genes are compressed between 1 and 0 • up-regulated genes expand between 1 and infinity • Taking the logarithm to base 2 of the ratio rectifies the problem (a doubling becomes +1, a halving becomes −1, no change becomes 0); this log ratio is typically known as the fold change
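To make the symmetry argument concrete, here is a minimal Python sketch of the ratio and its base-2 logarithm. The function names and the expression values are invented for illustration and do not come from the lecture.

```python
import math

def expression_ratio(state_a: float, state_b: float) -> float:
    """Ratio of a gene's expression in state a to its expression in state b."""
    return state_a / state_b

def log2_fold_change(state_a: float, state_b: float) -> float:
    """Base-2 logarithm of the expression ratio."""
    return math.log2(expression_ratio(state_a, state_b))

# Made-up expression values, chosen only to show the symmetry.
doubled = log2_fold_change(200.0, 100.0)    # ratio 2   -> +1
halved = log2_fold_change(50.0, 100.0)      # ratio 0.5 -> -1
unchanged = log2_fold_change(100.0, 100.0)  # ratio 1   ->  0
```

On the log scale, a two-fold increase and a two-fold decrease sit at equal distances from zero, which is why the transform is preferred over the raw ratio.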

Examples of fold change • [Example table: genes A, B and D are down-regulated, C is up-regulated, E has no change] • You can calculate fold change between pairs of expression values: • e.g. between State 1 and State 2 for gene A • or between the mean values of all measurements for a gene in the WT/KO experiments: • mean(WT1..WT4) vs mean(KO1..KO4)
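The mean-based variant, mean(WT1..WT4) vs mean(KO1..KO4), could be sketched as follows. The gene names and expression values are hypothetical, and taking KO relative to WT as the direction of the ratio is an assumption, since the slide does not say which group is the numerator.

```python
import math
from statistics import mean

# Hypothetical data: gene -> (WT1..WT4 values, KO1..KO4 values).
expression = {
    "geneA": ([100.0, 110.0, 90.0, 100.0], [50.0, 55.0, 45.0, 50.0]),
    "geneC": ([20.0, 22.0, 18.0, 20.0], [80.0, 82.0, 78.0, 80.0]),
    "geneE": ([60.0, 62.0, 58.0, 60.0], [60.0, 61.0, 59.0, 60.0]),
}

def mean_log2_fold_change(wt, ko):
    """log2 of mean(KO) / mean(WT); the KO-over-WT direction is an assumption."""
    return math.log2(mean(ko) / mean(wt))

fold_changes = {
    gene: mean_log2_fold_change(wt, ko) for gene, (wt, ko) in expression.items()
}
# Negative values mean lower expression in KO, positive values mean higher.
```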

Statistics: Back to our problem • [Figure: gene expression matrix; 5000 rows represent genes, columns represent samples: 4 Wild-Type samples (blue) and 4 KO samples (red)]

Statistics: Significance of Fold Change • For our problem we can calculate an average fold ratio for each gene (each row) • This will give us an average effect value for each gene, e.g. 2, 1.7, 10, 100, etc. • Question: which of these values are significant? • We could use a threshold, but what threshold value should we set? • Instead, use statistical techniques that take into account the number of members in each group, the type of measurements, etc.: significance testing.

Statistics: 5000 separate statistical problems • How do we think about this problem? • Effectively: • 5000 separate experiments, where each experiment measures the expression of one gene in two groups of 4 individuals • For each experiment (gene), we want to establish if there is a statistical difference between the reported values in each group • We then want to identify those genes (across the 5000 genes) that have a significant change • Each row in our table resembles a traditional statistical analysis problem

Statistics: Unpaired statistical experiments • [Figure: Group 1 members and Group 2 members, each group measured under its own condition] • Overall setting: 2 groups of 4 individuals each • Group 1: Imperial students • Group 2: UCL students • Experiment 1: • We measure the height of all students • We want to establish if members of one group are consistently (or on average) taller than members of the other, and if the measured difference is significant • Experiment 2: • We measure the weight of all students • We want to establish if members of one group are consistently (or on average) heavier than the other, and if the measured difference is significant • Experiment 3: • ………

Statistics: Unpaired statistical experiments • [Figure: Group 1 members and Group 2 members, each group measured under its own condition] • In unpaired experiments, you typically have two groups of people that are not related to one another, and you measure some property for each member of each group • e.g. to test whether a new drug is effective or not, you divide similar patients into two groups: • One group takes the drug • The other group takes a placebo • You measure (quantify) the effect in both groups some time later • You want to establish whether there is a significant difference between the two groups at that later point

Statistics: Unpaired statistical experiments • The WT/KO example is an unpaired experiment if the rats in the experiments are different!

Statistics: Unpaired statistical experiments • How do we address the problem? • Compare the two sets of results (alternatively, calculate the mean for each group and compare the means) • Graphically: • Scatter plots • Box plots, etc. • Statistically: • Use the unpaired t-test • [Figure: plots of the two series; are these two series significantly different?]

Statistics: Paired statistical experiments • [Figure: one set of group members, each measured under Condition 1 and Condition 2] • Overall setting: 1 group of 4 individuals • Group 1: Imperial students • We make measurements for each student in two situations • Experiment 1: • We measure the height of all students before the Bioinformatics course and after the Bioinformatics course • We want to establish if the Bioinformatics course consistently (or on average) affects students’ heights • Experiment 2: • We measure the weight of all students before the Bioinformatics course and after the Bioinformatics course • We want to establish if the Bioinformatics course consistently (or on average) affects students’ weights • Experiment 3: • ………

Statistics: Paired statistical experiments • [Figure: one set of group members, each measured under Condition 1 and Condition 2] • In paired experiments, you typically have one group of people and measure some property for each member before and after a particular event (so measurements come in pairs of before and after) • e.g. you want to test the effectiveness of a new cream for tanning • You measure the tan of each individual before the cream is applied • You measure the tan of each individual after the cream is applied • You want to establish whether there is a significant difference between the measurements before and after applying the cream for the group as a whole

Statistics: Paired statistical experiments • The WT/KO example is a paired experiment if the rats in the experiments are the same!

Statistics: Paired statistical experiments • How do we address the problem? • Calculate the difference for each pair • Compare the differences to zero (alternatively, compare the average difference to zero) • Graphically: • Scatter plot of the differences • Box plots, etc. • Statistically: • Use the paired t-test • [Figure: plot of the differences; are the differences close to zero?]
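The first two steps for paired data (one difference per pair, then the average difference compared with zero) might look like this in Python. The before/after measurements are invented for illustration.

```python
from statistics import mean

# Hypothetical paired measurements: position i in both lists is the same individual.
before = [10.0, 12.0, 9.0, 11.0]
after = [12.0, 13.0, 11.0, 12.0]

# One difference per pair (after minus before).
differences = [a - b for b, a in zip(before, after)]

# The quantity a paired analysis compares against zero.
mean_difference = mean(differences)
```

Whether a mean difference of this size is distinguishable from zero is what the significance test then has to decide.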

Statistics: Significance testing • In both cases (paired and unpaired) you want to establish whether the difference is significant • Significance testing is a statistical term and refers to estimating (numerically) the probability of a measurement occurring by chance • To do this, you need to review some basic statistics: • Normal distributions: mean, standard deviation, etc. • Hypothesis testing • t-distributions • t-tests and p-values

Mean and standard deviation • Mean and standard deviation tell you the basic features of a distribution • mean = average value of all members of the group: $\mu = (x_1 + x_2 + x_3 + \dots + x_N)/N$ • standard deviation = a measure of how much the values of individual members vary in relation to the mean • The normal distribution is symmetrical about the mean • 68% of the normal distribution lies within 1 s.d. of the mean • [Figure: normal curve over x, with 68% of the distribution within 1 s.d. either side of the mean]

Note on s.d. calculation • Throughout the following slides and in the tutorials, I use the following (simple) formula for calculating the standard deviation: $\sigma = \sqrt{\frac{1}{N}\sum_{i=1}^{N}(x_i - \mu)^2}$ • Some people use the unbiased form below (for good reasons): $s = \sqrt{\frac{1}{N-1}\sum_{i=1}^{N}(x_i - \mu)^2}$ • Please use the simple form if you want the answers to add up at the end
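As an illustration of how the two conventions differ, the sketch below computes both with Python's standard library. It assumes the "simple form" divides by N (`statistics.pstdev`) and the "unbiased form" divides by N − 1 (`statistics.stdev`). The sample values are made up.

```python
from statistics import mean, pstdev, stdev

# Made-up sample, chosen so the arithmetic is easy to follow.
values = [2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0]

mu = mean(values)             # 40 / 8 = 5.0
simple_sd = pstdev(values)    # sqrt(32 / 8): sum of squared deviations over N
unbiased_sd = stdev(values)   # sqrt(32 / 7): sum of squared deviations over N - 1
```

The unbiased form is always slightly larger. The gap matters most for small groups, such as the 4 samples per group in the WT/KO example.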

The Normal Distribution • Many continuous variables follow a normal distribution, and it plays a special role in the statistical tests we are interested in: • The x-axis represents the values of a particular variable • The y-axis represents the proportion of members of the population that have each value of the variable • The area under the curve represents probability, i.e. the area under the curve between two values on the x-axis represents the probability of an individual having a value in that range • [Figure: normal curve over x, with 68% of the distribution within 1 s.d. either side of the mean]
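The "area under the curve is probability" idea can be checked numerically with `statistics.NormalDist` (Python 3.8+). The helper name, and the mean of 170 and s.d. of 10, are hypothetical values for illustration only.

```python
from statistics import NormalDist

def probability_between(mean: float, sd: float, low: float, high: float) -> float:
    """Area under a normal curve between low and high, via the cumulative distribution."""
    dist = NormalDist(mu=mean, sigma=sd)
    return dist.cdf(high) - dist.cdf(low)

# Hypothetical variable with mean 170 and s.d. 10: the area within 1 s.d. of the mean.
within_one_sd = probability_between(170.0, 10.0, 160.0, 180.0)  # roughly 0.68
```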

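Normal Distribution and Confidence Intervals • Any normal distribution can be transformed to a standard normal distribution (mean 0, s.d. = 1) using a simple transform: $z = (x - \mu)/\sigma$ • [Figure: standard normal curve with central area 1 − α = 0.95 between −1.96 and 1.96, and a tail of area α/2 = 0.025 on each side] • 0.025 = p-value: the probability that a value belonging to this distribution falls in that tail, i.e. is at least as extreme as 1.96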
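A short sketch of the standardising transform and of where the ±1.96 limits come from, using only the standard library. The names `z_score`, `alpha` and `z_critical` are illustrative choices, not from the slides.

```python
from statistics import NormalDist

def z_score(x: float, mean: float, sd: float) -> float:
    """Map a value from a normal(mean, sd) variable onto the standard normal scale."""
    return (x - mean) / sd

standard = NormalDist()  # mean 0, s.d. 1

alpha = 0.05
# Value that leaves alpha / 2 in the upper tail (about 1.96 for alpha = 0.05).
z_critical = standard.inv_cdf(1 - alpha / 2)

# Area in the upper tail beyond 1.96 (about 0.025).
upper_tail_area = 1 - standard.cdf(1.96)
```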

Hypothesis Testing (Unpaired): Are two data sets different? • In unpaired experiments, we compare the difference between the means • We use the z-test (normal distribution) if the standard deviations of the two populations from which the data sets came are known (and are the same) • We pose a null hypothesis that the means are equal • We try to refute the hypothesis by using the curves to calculate the probability (p) of observing a difference at least as large as the one measured if the null hypothesis were true (both means equal) • If the probability is low (low p), reject the null hypothesis and accept the alternative hypothesis (the means are different) • If the probability is high (high p), do not reject the null hypothesis (the data are consistent with equal means) • [Figure: under H0, Population 1 and Population 2 coincide; under Ha, Population 1 and Population 2 are separated] • If the standard deviation is known use the z-test, else use the t-test
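For the known-standard-deviation case, a two-sample z-test might be sketched as below. It assumes a common, known population s.d. (`sigma`) and a two-tailed alternative. The function name and the data are hypothetical.

```python
from math import sqrt
from statistics import NormalDist, mean

def two_sample_z_test(x, y, sigma):
    """Two-tailed z-test for equal means, assuming a known common population s.d."""
    standard_error = sigma * sqrt(1 / len(x) + 1 / len(y))
    z = (mean(x) - mean(y)) / standard_error
    # Probability of a |z| at least this large if the null hypothesis holds.
    p = 2 * (1 - NormalDist().cdf(abs(z)))
    return z, p

# Made-up groups: identical values give z = 0 and p = 1 (no evidence against H0).
z_same, p_same = two_sample_z_test([1.0, 2.0, 3.0, 4.0], [1.0, 2.0, 3.0, 4.0], sigma=1.0)

# Clearly separated groups give a large |z| and a very small p.
z_diff, p_diff = two_sample_z_test([10.0, 11.0, 12.0, 13.0], [1.0, 2.0, 3.0, 4.0], sigma=1.0)
```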

Comparing Two Samples: Graphical interpretation • To compare two groups you can compare the mean of one group with the mean of the other graphically. • The graphical comparison lets you see the distributions of the two groups. • If the p-value is low, chances are there will be little overlap between the two distributions. If the p-value is not low, there will be a fair amount of overlap between the two groups. • We can set a critical value on the x-axis based on the p-value threshold

Hypothesis Testing (Paired): Are two data sets different? • In paired experiments, we compare the mean difference with zero • We use the z-test (normal distribution) if the standard deviations of the two populations from which the data sets came are known • We pose a null hypothesis that the mean difference is zero • We try to refute the hypothesis by using the curves to calculate the probability (p) of observing a mean difference at least as large as the one measured if the null hypothesis were true (mean difference is 0) • If the probability is low (low p), reject the null hypothesis and accept the alternative hypothesis (mean difference ≠ 0) • If the probability is high (high p), do not reject the null hypothesis (the data are consistent with a mean difference of 0) • [Figure: under H0, Population 1 and Population 2 coincide; under Ha, Population 1 and Population 2 are separated] • If the standard deviation is known use the z-test, else use the t-test

The t-test • Typically known as Student's t-test • In most cases we use what is known as a t-test rather than the z-test when comparing samples, in particular when: • we have small data sets (fewer than 30 values each), and • we don’t know the s.d. and have to calculate it from the small samples • The same concepts as before apply, but we base the test on what is known as the t-distribution, which resembles the normal distribution but has heavier tails for small samples and approaches the normal distribution as the sample size grows • We have to calculate what is known as a t-value!

The t-distribution • In fact we have many t-distributions; each one is calculated in reference to the number of degrees of freedom (d.f.), also written as v • We will see how we calculate the degrees of freedom in a short while • [Figure: normal distribution curve compared with a t-distribution curve]

t-test terminology • t-test: used to compare the mean of a sample to a known number (often 0). • Assumptions: subjects are randomly drawn from a population and the distribution of the mean being tested is normal. • Test: the hypotheses for a single-sample t-test are: • H0: μ = μ0 • Ha: μ ≠ μ0 (where μ0 denotes the hypothesized value to which you are comparing the population mean μ) • p-value: the probability of error in rejecting the hypothesis of no difference, i.e. of seeing a difference at least this large by chance when there is none.

t-test terminology: Single-tail vs. two-tail • What am I testing for: • Right Tail: (group1 > group2) • Left Tail: (group1 < group2) • Two Tail: Both groups are different but I don’t care how. • Right Tail: $H_0: \mu_1 \le \mu_2$, $H_1: \mu_1 > \mu_2$ OR $H_0: \mu_1 - \mu_2 \le 0$, $H_1: \mu_1 - \mu_2 > 0$ • Left Tail: $H_0: \mu_1 \ge \mu_2$, $H_1: \mu_1 < \mu_2$ OR $H_0: \mu_1 - \mu_2 \ge 0$, $H_1: \mu_1 - \mu_2 < 0$ • Two Tail: $H_0: \mu_1 = \mu_2$, $H_1: \mu_1 \ne \mu_2$ OR $H_0: \mu_1 - \mu_2 = 0$, $H_1: \mu_1 - \mu_2 \ne 0$

t-test terminology: Unpaired vs. paired t-test • Same as before!! Depends on your experiment • Unpaired t-test: the hypotheses for the comparison of two independent groups are: • H0: μ1 = μ2 (means of the two groups are equal) • Ha: μ1 ≠ μ2 (means of the two groups are not equal) • Paired t-test: the hypotheses for paired measurements in the same individuals are: • H0: D = 0 (the difference between the two observations is 0) • Ha: D ≠ 0 (the difference is not 0)

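Calculating t-test (t statistic) • Remember these formulae!! • First calculate the t statistic value and then calculate the p value • For the paired t-test, t is calculated using the following formula: $t = \dfrac{\bar{d}}{\sigma(d)/\sqrt{n}}$, where $\bar{d}$ is calculated by $\bar{d} = \frac{1}{n}\sum_{i=1}^{n}(x_i - y_i)$, $\sigma(d)$ is the standard deviation of the differences, and n is the number of pairs being tested. • For an unpaired (independent group) t-test, the following formula is used: $t = \dfrac{\bar{x} - \bar{y}}{\sqrt{\dfrac{\sigma(x)^2}{n(x)} + \dfrac{\sigma(y)^2}{n(y)}}}$, where $\sigma(x)$ is the standard deviation of x and $n(x)$ is the number of elements in x.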
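A minimal sketch of both t statistics in Python follows. It uses the textbook forms: mean difference over s.d./√n for the paired case, and difference of means over the square root of the summed variance/size terms for the unpaired case. It uses the unbiased s.d. (`statistics.stdev`), which is an assumption: swapping in `statistics.pstdev` would follow the simple s.d. form and give slightly different numbers. Function names and data are hypothetical.

```python
from math import sqrt
from statistics import mean, stdev

def paired_t(x, y):
    """t statistic for paired samples: mean difference over its standard error."""
    d = [xi - yi for xi, yi in zip(x, y)]
    n = len(d)  # number of pairs
    return mean(d) / (stdev(d) / sqrt(n))

def unpaired_t(x, y):
    """t statistic for two independent groups, from each group's mean, s.d. and size."""
    standard_error = sqrt(stdev(x) ** 2 / len(x) + stdev(y) ** 2 / len(y))
    return (mean(x) - mean(y)) / standard_error

# Made-up measurements for illustration only.
t_paired = paired_t([12.0, 13.0, 11.0, 12.0], [10.0, 12.0, 9.0, 11.0])
t_unpaired = unpaired_t([1.0, 2.0, 3.0, 4.0], [3.0, 4.0, 5.0, 6.0])
```

The sign of t only records which group (or which condition) had the larger mean. Its magnitude is what gets compared against the t-distribution.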

Calculating p-value for t-test • When carrying out a test, a P-value can be calculated based on the t-value and the ‘degrees of freedom’. • There are three methods for calculating P, one for each kind of test: one-tailed (>), one-tailed (<) and two-tailed • In each case P is obtained from p(t,v), which is looked up from the t-distribution table • The number of degrees of freedom (v) is calculated as: • Unpaired: n(x) + n(y) − 2 • Paired: n − 1 (where n is the number of pairs)
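In place of a table lookup, p(t, v) can be approximated numerically. The sketch below assumes p(t, v) means the upper-tail probability P(T > t) of the t-distribution with v degrees of freedom, and it maps the three cases onto that function in the usual textbook way. Both the mapping and all function names are assumptions, not taken from the slides. Only the standard library is used, and the density is integrated with Simpson's rule.

```python
from math import exp, lgamma, pi, sqrt

def t_pdf(t: float, v: float) -> float:
    """Density of Student's t-distribution with v degrees of freedom."""
    coeff = exp(lgamma((v + 1) / 2) - lgamma(v / 2)) / sqrt(v * pi)
    return coeff * (1 + t * t / v) ** (-(v + 1) / 2)

def upper_tail(t: float, v: float, steps: int = 2000) -> float:
    """P(T > t), using symmetry plus Simpson's rule on [0, |t|] (steps must be even)."""
    a = abs(t)
    h = a / steps
    total = t_pdf(0.0, v) + t_pdf(a, v)
    for i in range(1, steps):
        total += (4 if i % 2 else 2) * t_pdf(i * h, v)
    area_zero_to_a = total * h / 3
    tail = 0.5 - area_zero_to_a          # P(T > |t|)
    return tail if t >= 0 else 1 - tail

def p_value(t: float, v: float, tail: str = "two") -> float:
    """P for a one-tailed (greater / less) or two-tailed test."""
    if tail == "greater":
        return upper_tail(t, v)
    if tail == "less":
        return upper_tail(-t, v)
    if tail == "two":
        return 2 * upper_tail(abs(t), v)
    raise ValueError("tail must be 'greater', 'less' or 'two'")

def degrees_of_freedom_unpaired(n_x: int, n_y: int) -> int:
    return n_x + n_y - 2

def degrees_of_freedom_paired(n_pairs: int) -> int:
    return n_pairs - 1
```

For example, `p_value(1.812, 10, "greater")` comes out close to 0.05, in line with the tabulated critical value of 1.812 for 10 degrees of freedom.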

p-values • Results of the t-test: if the p-value associated with the t-test is small (the threshold is usually set at p < 0.05), there is evidence to reject the null hypothesis in favour of the alternative. • In other words, there is evidence that the mean is significantly different from the hypothesized value. • If the p-value associated with the t-test is not small (p > 0.05), there is not enough evidence to reject the null hypothesis, and you cannot conclude that the mean differs from the hypothesized value. Note that this is not the same as evidence that the two are equal.

Calculating t and p values • You will usually use a piece of software to calculate t and P (Excel provides that!). • In problems: • You can assume access to a function p(t,v) which calculates p for a given t value and v (number of degrees of freedom) • or alternatively have a table indexed by critical t values and v

t-value and p-value • Given a t-value and the degrees of freedom, you can look up a p-value • Alternatively, if you know what p-value you need (e.g. 0.05) and the degrees of freedom, you can set the threshold for the critical t

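t-test Interpretation • [Figure: t-distribution with a "Reject H0" region of area .025 in each tail, beyond t = −2.0154 and t = 2.0154] • Note: as t increases, p decreases • To reject H0, the calculated t value must be greater than the critical t from the table at the chosen P level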

Finding a critical t • The table provides the t values (tc) for which P(tx > tc) = A • [Figure: t-distribution with a tail of area A = .05 beyond −tc = −1.812 and beyond tc = 1.812; table columns: t.100, t.05, t.025, t.01, t.005]
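The table lookup can also be run in reverse numerically: given a tail area A and the degrees of freedom v, search for the tc with P(T > tc) = A. The sketch below does this by bisection on a Simpson's-rule estimate of the upper tail. It is an illustrative stand-in for the printed table, with hypothetical function names, and it assumes 0 < A < 0.5.

```python
from math import exp, lgamma, pi, sqrt

def t_pdf(t: float, v: float) -> float:
    """Density of Student's t-distribution with v degrees of freedom."""
    coeff = exp(lgamma((v + 1) / 2) - lgamma(v / 2)) / sqrt(v * pi)
    return coeff * (1 + t * t / v) ** (-(v + 1) / 2)

def upper_tail(t: float, v: float, steps: int = 2000) -> float:
    """P(T > t) for t >= 0, via Simpson's rule on [0, t] (steps must be even)."""
    h = t / steps
    total = t_pdf(0.0, v) + t_pdf(t, v)
    for i in range(1, steps):
        total += (4 if i % 2 else 2) * t_pdf(i * h, v)
    return 0.5 - total * h / 3

def critical_t(area: float, v: float, lo: float = 0.0, hi: float = 100.0) -> float:
    """Find tc with P(T > tc) = area by bisection (assumes 0 < area < 0.5)."""
    for _ in range(60):
        mid = (lo + hi) / 2
        if upper_tail(mid, v) > area:
            lo = mid   # tail still too large: tc lies further out
        else:
            hi = mid
    return (lo + hi) / 2

tc_one_tail = critical_t(0.05, 10)    # close to the 1.812 shown on the slide
tc_two_tail = critical_t(0.025, 44)   # close to 2.0154 (two-tailed test at 0.05)
```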

Summary • Differential analysis • Uses fold ratio (fold change) for measuring effect • Need some measure of significance of such effect. • Statistical analysis • Paired vs. unpaired experiments • t-tests • Calculating t for paired/un-paired experiments • Deciding single tail vs. two-tail • Calculating degrees of freedom • Look-up p value
