Data Analysis for Gene Chip Data Part I: One-gene-at-a-time methods Min-Te Chao 2002/10/28


Outline • Simple description of gene chip data • Earlier works • Multiple t-test and SAM • Lee’s ANOVA • Wong’s factor models • Efron’s empirical Bayes

Remarks • Most of the work is statistical analysis, not really of the machine learning type • Training samples are very small, not to mention test samples • Medical research needs scientific rigor wherever possible

Arthritis and Rheumatism • Guidelines for the submission and reviews of reports involving microarray technology v.46, no. 4, 859-861

Reproducibility • Should document the accuracy and precision of data, including run-to-run variability of each gene • No arbitrary setting of threshold (e.g., 2-fold) • Careful evaluation of false discovery rate

Statistical Analysis • Statistical analysis is absolutely necessary to support claims of an increase or decrease of gene expression • Such rigor requires multiple experiments and analysis with standard statistical instruments.

Sample Heterogeneity • … Strongly recommends that investigators focus studies on homogeneous cell populations until other methodological and data analysis problems can be resolved.

Independent Confirmation • It is important that the findings be confirmed using an independent method, preferably with separate samples rather than retesting of the original mRNA.

Microarray • Other terms: DNA arrays, DNA chips, biochips, gene chips

The underlying principle is the same for all microarrays, no matter how they are made • Gene function is the key element researchers want to extract from the sequence • DNA array is one of the most important tools (Nature, v.416, April 2002 885-891)

2 types of microarray • cDNA • Oligonucleotides • DIY type

Microarrays allow researchers to determine which genes are being expressed (gene expression) in a given cell type at a particular time and under particular conditions

Basic data form • On each array, there are p “spots” (p>1000, sometimes 20000). Each spot has k probes (k=20 or so). There are usually 2k measurements (expressions) per spot, and the k differences, or the differences of logs, are used. • Sometimes they only give you a summary statistic, e.g., median, mean, … per spot

Each spot corresponds to a “gene” • For each study, we can arrange the chips so that the i-th spot represents the i-th gene (genes close in index may not be close physically at all) • This means that when we read the i-th spot of all chips in one study, we know we get different measurements of the same i-th gene

Data of one chip can be arranged in a matrix form, Y; X_1, X_2, …, X_p, just as in a regression setup. • But in practice, n (the number of chips used) is small compared with p. • Y is the response: cell type, experimental condition, survival time, …

For a spot with 20 probes, see Efron et al. (2001, JASA, p.1153).

Earlier works • Cluster analysis • Fold methods • Multiple t with Bonferroni correction

Multiple t with Bonferroni correction • It is too conservative • Family-wise error rate (FWER): among G tests, the probability of at least one false rejection. Without correction, for G independent tests each at level alpha, FWER = 1-(1-alpha)^G, which goes to 1 at an exponential rate in G

Sidak’s single-step adjusted p-value: p’ = 1-(1-p)^G • Bonferroni’s single-step adjusted p-value: p’ = min{Gp, 1} • Both are very conservative
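The two single-step adjustments above can be sketched in a few lines. This is a minimal illustration, not taken from the slides: the function names and the example p-values are made up, and G is simply the number of p-values supplied.

```python
def sidak_adjust(pvalues):
    """Sidak single-step adjusted p-values: p' = 1 - (1 - p)^G."""
    g = len(pvalues)
    return [1.0 - (1.0 - p) ** g for p in pvalues]


def bonferroni_adjust(pvalues):
    """Bonferroni single-step adjusted p-values: p' = min(G * p, 1)."""
    g = len(pvalues)
    return [min(g * p, 1.0) for p in pvalues]


# Illustrative raw p-values for G = 4 genes (hypothetical numbers).
raw = [0.0001, 0.004, 0.03, 0.5]
sidak = sidak_adjust(raw)
bonf = bonferroni_adjust(raw)

for p, s, b in zip(raw, sidak, bonf):
    print(f"raw={p:<8} sidak={s:.6f} bonferroni={b:.6f}")
```

The Sidak value never exceeds the Bonferroni value, but with G in the thousands both push all but the very smallest p-values toward 1, which is the conservativeness the slide refers to.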

FDR – false discovery rate • Roughly: among all rejected cases, what proportion are rejected wrongly? (Benjamini and Hochberg 1995 JRSSB, 289-300) “Sequential p-method”

Sequential p-method • Using the observed data, it determines the rejection region so that FDR <= alpha • Order all p-values from smallest to largest, and obtain a k so that the k hypotheses with the smallest p-values are rejected.
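A minimal sketch of the Benjamini-Hochberg step-up rule, which is one concrete way to choose k: take the largest k such that the k-th smallest p-value is at most k*alpha/G. The function name and the example p-values are illustrative assumptions, and the FDR guarantee relies on the conditions in the original paper (e.g., independent tests).

```python
def benjamini_hochberg(pvalues, alpha=0.05):
    """Return the indices of hypotheses rejected by the BH step-up rule."""
    g = len(pvalues)
    # Indices of the hypotheses, ordered from smallest to largest p-value.
    order = sorted(range(g), key=lambda i: pvalues[i])
    k = 0
    for rank, idx in enumerate(order, start=1):
        # Step-up: remember the LARGEST rank whose p-value passes its threshold.
        if pvalues[idx] <= rank * alpha / g:
            k = rank
    return sorted(order[:k])


# Illustrative p-values for G = 10 genes (hypothetical numbers).
pvals = [0.001, 0.008, 0.039, 0.041, 0.042, 0.06, 0.074, 0.205, 0.212, 0.216]
rejected = benjamini_hochberg(pvals, alpha=0.05)
print("rejected gene indices:", rejected)
```

Because the rule looks for the largest passing rank, a hypothesis can be rejected even when its own threshold is not met, as long as a larger p-value further down the ordered list passes.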

Since we have a different definition for error to control, it will increase the “power” • For modifications, see Storey (2002, JRSSB, 479-498) • These are criteria specifically designed to handle risk assessment when G is large

Role of permutation • For tests (multiple or not), it is important to use a null distribution • It is generated by a well-designed permutation (of the columns of the data matrix) –column refers to observations, not genes.

One simple example • Let us say we look at the first gene, with n_1 arrays for treatment and n_2 arrays for control • We use a t-statistic, t_1, say. What is the p-value corresponding to this observed t_1?

Permute the n=n_1+n_2 columns of the data matrix. Look at the first row (which corresponds to the first gene) • Treat the first n_1 numbers as a fake “treatment” and the last n_2 numbers as a fake “control”, and compute a t-value; say we get s_1

Permute again, do the same thing, and we get s_2, … • Do it B times and get s_1, s_2, …, s_B • Treat these s’s as a (bootstrap) sample from the null distribution of the t_1 statistic • The p-value of the earlier t_1 is found from the ecdf of the s_j, j=1,2,…,B
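The permutation steps for a single gene can be sketched as follows. Several choices here are assumptions rather than part of the slides: a pooled-variance t-statistic, a two-sided p-value, the common "+1" correction so the p-value is never exactly zero, and the made-up expression values.

```python
import random
from statistics import mean, variance


def t_statistic(treatment, control):
    """Two-sample t-statistic with pooled variance."""
    n1, n2 = len(treatment), len(control)
    pooled = ((n1 - 1) * variance(treatment)
              + (n2 - 1) * variance(control)) / (n1 + n2 - 2)
    return (mean(treatment) - mean(control)) / (pooled * (1 / n1 + 1 / n2)) ** 0.5


def permutation_pvalue(treatment, control, n_perm=2000, seed=0):
    """Return (t_1, two-sided permutation p-value) for one gene."""
    rng = random.Random(seed)
    n1 = len(treatment)
    t_obs = t_statistic(treatment, control)
    values = list(treatment) + list(control)
    count = 0
    for _ in range(n_perm):
        rng.shuffle(values)  # permute the n = n_1 + n_2 columns
        # First n_1 values act as fake "treatment", the rest as fake "control".
        s = t_statistic(values[:n1], values[n1:])
        if abs(s) >= abs(t_obs):
            count += 1
    return t_obs, (count + 1) / (n_perm + 1)


# Hypothetical log-expression values of one gene on 5 + 5 arrays.
treatment = [8.1, 7.9, 8.4, 8.6, 8.2]
control = [6.9, 7.2, 7.0, 7.4, 6.8]
t_1, p_value = permutation_pvalue(treatment, control)
print(f"t_1 = {t_1:.2f}, permutation p-value = {p_value:.4f}")
```

With very few arrays the number of distinct splits is small (252 for 5 + 5), which limits how small a permutation p-value can get for a single gene.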

Permutation plays a major role --- finding a reference measure of variation in various situations • For a well designed experiment with microarray, DOE techniques will play an important role in determining how to do proper permutations.

SAM – significance analysis of microarrays • A standard method of microarray analysis, taught many times in Stanford short courses on data mining • Modified multiple t-tests • Uses permutations of certain data columns to evaluate the variation of the data for each gene

The original paper is hard to read: (Tusher, Tibshirani and Chu, PNAS 2001, v.98, no.9, 5116-5121) But the SAM manual is a lot easier to read for statisticians: (free software for academic use)

D(i) = (\bar{X}_treatment(i) – \bar{X}_control(i)) / (s(i) + s_0), i=1,2,…,G • Ordered values: D(1) < D(2) < … < D(G) • s_0, used in SAM, is a carefully determined constant > 0.

D(i)* are computed from a certain group of permutations of the columns; the D(i)* are also ordered • Plot D vs. D*; points that deviate from the 45-degree line by more than a threshold Delta are signals of significant expression change. • Vary the value of Delta to get different FDRs.
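The D versus D* comparison can be sketched as below. This is a simplified illustration, not the SAM software: s_0 is passed in as a fixed constant instead of being tuned from the data, s(i) is assumed to be the pooled standard error of the mean difference, each gene is flagged directly by its own deviation from the 45-degree line, and no FDR estimate is computed. The data are simulated, with gene 0 deliberately shifted in the treatment arrays.

```python
import random
from statistics import mean, variance


def d_statistics(data, n1, s0):
    """d(i) for every gene (row); the first n1 columns are treatment arrays."""
    out = []
    for row in data:
        trt, ctl = row[:n1], row[n1:]
        n2 = len(ctl)
        pooled = ((n1 - 1) * variance(trt) + (n2 - 1) * variance(ctl)) / (n1 + n2 - 2)
        s = (pooled * (1 / n1 + 1 / n2)) ** 0.5
        out.append((mean(trt) - mean(ctl)) / (s + s0))
    return out


def sam_sketch(data, n1, s0, delta, n_perm=100, seed=0):
    """Flag genes whose ordered d deviates from the permutation average by > delta."""
    rng = random.Random(seed)
    n_cols = len(data[0])
    d = d_statistics(data, n1, s0)
    order = sorted(range(len(d)), key=lambda i: d[i])  # gene indices by increasing d
    expected = [0.0] * len(d)                          # averaged ordered D* values
    for _ in range(n_perm):
        cols = list(range(n_cols))
        rng.shuffle(cols)                              # permute columns (arrays)
        permuted = [[row[c] for c in cols] for row in data]
        for rank, value in enumerate(sorted(d_statistics(permuted, n1, s0))):
            expected[rank] += value / n_perm
    return [gene for rank, gene in enumerate(order)
            if abs(d[gene] - expected[rank]) > delta]


# Simulated data: 50 genes on 4 treatment + 4 control arrays (hypothetical).
sim = random.Random(1)
data = [[sim.gauss(0, 1) for _ in range(8)] for _ in range(50)]
for j in range(4):
    data[0][j] += 10.0  # gene 0 is strongly up-regulated under treatment

called = sam_sketch(data, n1=4, s0=0.5, delta=1.0)
print("genes called significant:", called)
```

Raising delta shrinks the list of called genes and lowering it enlarges the list, which is the knob the slide describes for trading off against the FDR.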

Other model-based methods • Wong’s model: PM - MM = \theta \phi + \epsilon • Outlier detection • Model validation • Li and Wong (2001, PNAS v.98, no.1, 31-36)
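A minimal sketch of fitting the multiplicative model by alternating least squares, assuming the indexing y[i][j] = PM - MM for array i and probe j, with \theta_i an array-level expression index and \phi_j a probe sensitivity scaled so that the sum of \phi_j^2 equals the number of probes. The function name and the toy numbers are illustrative, and the outlier handling of the published method is not reproduced here.

```python
def fit_rank_one(y, n_iter=20):
    """Least-squares fit of y[i][j] ~ theta[i] * phi[j] by alternating updates."""
    n_arrays, n_probes = len(y), len(y[0])

    def update_theta(phi):
        ss = sum(p * p for p in phi)
        return [sum(y[i][j] * phi[j] for j in range(n_probes)) / ss
                for i in range(n_arrays)]

    phi = [1.0] * n_probes  # start with equal probe sensitivities
    for _ in range(n_iter):
        theta = update_theta(phi)
        ss = sum(t * t for t in theta)
        phi = [sum(y[i][j] * theta[i] for i in range(n_arrays)) / ss
               for j in range(n_probes)]
        # Identifiability constraint: sum of phi_j^2 equals the number of probes.
        scale = (sum(p * p for p in phi) / n_probes) ** 0.5
        phi = [p / scale for p in phi]
    theta = update_theta(phi)  # final update so theta matches the rescaled phi
    return theta, phi


# Toy noise-free data: 3 arrays x 4 probes (hypothetical numbers).
theta_true = [100.0, 200.0, 400.0]
phi_true = [0.5, 1.0, 1.5, 1.0]
y = [[t * p for p in phi_true] for t in theta_true]

theta, phi = fit_rank_one(y)
residuals = [[y[i][j] - theta[i] * phi[j] for j in range(4)] for i in range(3)]
print("theta:", [round(t, 2) for t in theta])
print("phi:  ", [round(p, 3) for p in phi])
```

On real data the residuals are not zero, and unusually large ones point to candidate outlier probes or arrays, which connects to the outlier detection and model validation bullets.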

Lee’s work • ANOVA based • May do unbalanced data – e.g., 7 microarray chips (Lee et al. 2000, PNAS, v.97, 9834-9839)

Empirical Bayes • (Efron et al. (2001) JASA, v.96, 1151-1160) • Use a mixture model f(z)=p_0 f_0(z)+p_1 f_1(z), with f_0, f_1 estimated from the data. p_1 = prior probability that a gene’s expression is affected (by a treatment), and p_0 = 1-p_1

A key idea is to use permuted (columns) data to estimate f_0 • Use a tricky logistic regression method • The end result is p_1(Z), the a posteriori probability that a gene at expression level Z is affected
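The posterior probability follows from Bayes' rule applied to the mixture model. The derivation below is a worked restatement under the assumption p_0 + p_1 = 1; the final bound is the standard consequence of f_1 being non-negative.

```latex
\begin{align*}
f(z) &= p_0 f_0(z) + p_1 f_1(z), \qquad p_0 + p_1 = 1, \\
p_1(z) &= \Pr(\text{affected} \mid Z = z)
        = \frac{p_1 f_1(z)}{f(z)}
        = 1 - p_0 \, \frac{f_0(z)}{f(z)}, \\
f_1(z) \ge 0 \;&\Longrightarrow\; p_0 \le \min_z \frac{f(z)}{f_0(z)}
  \quad\text{and hence}\quad p_1 \ge 1 - \min_z \frac{f(z)}{f_0(z)} .
\end{align*}
```

Only the ratio f_0(z)/f(z) is needed to evaluate p_1(z) once p_0 is fixed, which is why estimating f_0 from permuted data is enough to make the method work.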

Part I conclusion • Earlier methods are relatively easy to understand, but getting familiar with the bio-language takes time • More powerful data-analytic methods will continue to be developed • It is important to first understand the basic problems of biologists before we jump in with fancy statistical methods

We may do the wrong problem … • But if the problem is relevant, even simple methods can get good recognition • All methods so far are “first moment only”, i.e., not very different from multiple t-tests; in other words, they are all one-gene-at-a-time methods.

We did not address issues of data cleaning, outlier detection, normalization, etc. Microarray data are highly noisy, and these problems are by no means trivial. • As the cost per chip goes down, the number of chips per problem may grow. But well-designed experiments, e.g., fractional factorial designs, still have room to play in this game

Statistical methods, as compared with machine-learning-based methods, will play a more important role for this type of data since, with a model, parametric or not, one can attach a measure of confidence to the claimed result. This is crucial for scientific development.

Quote: • The statistical literature for microarrays, still in its infancy and with much of it unpublished, has tended to focus on frequentist data-analytical devices, such as cluster analysis, bootstrapping and linear models. (Efron, B. 2001)
