Report on IHMC-CMU-Pitt Research Full Report NRA A2-37143 “Automated Discovery Procedures for Gene Expression and Regulation from Microarray and Serial Analysis of Gene Expression Data”
Report on IHMC-CMU-Pitt Research Full Report NRA A2-37143 “Automated Discovery Procedures for Gene Expression and Regulation from Microarray and Serial Analysis of Gene Expression Data” NCC 2-1295 “Multi-Domain Network Learning Algorithms of Latent Variable Interpretation and Discovering Genetic Regulation” April 2001 – April 2002 http://www.phil.cmu.edu/projects/genegroup
Research Team • William Buckles (Ph.D, Professor, Tulane) • Tianjiao Chu (Ph.D Student, Logic, Methodology and Computation, CMU) • Greg Cooper (M.D., Ph.D, Associate Professor, School of Medicine, Pitt) • David Danks (Ph.D, Research Scientist, IHMC) • Clark Glymour (Ph.D, P.I., Senior Research Scientist and John Pace Scholar, IHMC; Alumni University Professor, CMU) • Dan Handley (M.S. Student, Logic, Methodology and Computation, CMU) • Subramani Mani (Ph.D Student, Biomedical Informatics, Pitt) • Rob O’Doherty (Ph.D, Assistant Professor, School of Medicine, Pitt) • Dave Peters (Ph.D, Human Genetics, Pitt) • Joseph Ramsey (Ph.D, Research Programmer, CMU) • Jaime Robins (M.D., School of Public Health, Harvard) • Raul Saavedra (Ph.D, Student, Computer Science, Tulane) • Richard Scheines (Ph.D, Associate Professor, CMU) • Nicoleta Servan (Ph.D Student, Statistics, CMU) • Ricardo Silva (Ph.D Student, Computer Science, CMU) • Peter Spirtes (Ph.D, Research Scientist, IHMC; Professor, CMU) • Larry Wasserman (Ph.D, Professor, CMU) • Frank Wimberly (Ph.D, Research Programmer, IHMC) • Changwon Yoo (Ph.D Student, Biomedical Informatics, Pitt)
Two Related Goals • Investigating the prospects for more rapid and accurate determination of genetic regulatory networks using recently developed technologies (microarrays and SAGE) • Investigating the prospects for determining the underlying components of measured phenomena, and the influences such components have on one another
Background on Genetics • Proteins do most of the work in the cell • Cell reproduction, metabolism, and responses to the environment are all controlled by proteins • Each gene is a machine for constructing (approximately) a single protein • The rate at which a gene constructs proteins is influenced by concentrations of regulator proteins
Gene Regulatory Networks • Some genes manufacture proteins which control the rate at which other genes manufacture proteins (either promoting or suppressing) • Hence some genes indirectly (via the proteins they create) regulate other genes, which in turn regulate the operation of the cell • The system by which genes regulate each other is called the genetic regulatory network, and can be represented by a directed graph (which is a special case of a Bayes network)
Measuring Gene Expression Levels • A gene’s “expression level” is an approximate measure of the concentration of mRNA transcripts and a more indirect measure of the rate of synthesis of the corresponding proteins. • Recently developed technologies (microarrays and Serial Analysis of Gene Expression, or SAGE) allow thousands of gene expression levels to be measured simultaneously • The kinds of measurement errors that these technologies introduce are not well understood • The best way to use these tools to discover gene regulatory networks is not known
Relevance to NASA • Gene expression in microgravity has been shown to differ significantly from expression in Earth gravity • Understanding gene regulation in plants, animals and humans is likely to be important for long-term extraterrestrial habitation • Determining regulatory structure is at present laborious, slow and costly • Need for systematic study of the reliability and accuracy of the scores of proposals for applying statistical/machine learning procedures to speed up the process
Background on Latent Structure Analysis • Measurements are often of effects of other scientifically interesting variables that are not directly measured. • Number and identity of underlying causal or compositional variables may not be entirely known. • Measured effects can influence other measured effects (e.g., through between-channel signal leakage in multi-channel
Background on Latent Structure Analysis • With no prior cluster information and with the possibility of measured-measured and latent-latent influences, none of the standard data analysis procedures (e.g., factor analysis, principal components, independent components) gives reliable (i.e., asymptotically correct) information about all of: the number of latent variables; the clustering of measured variables; and the causal or compositional relations among latent variables
Relevance to NASA • NASA collects vast quantities of observational data on the Earth, the solar system and the cosmos, much of it spectral • Need for automated, fast, reliable procedures extracting relevant causal information from diverse datasets — procedures that integrate expert knowledge • Inadequacy of current methods (model specific, clustering algorithms) for this task • Principled procedures using Bayes network methods offer promising alternatives • They have succeeded in other spectral applications • (J. Ramsey, et al., “Automated Identification of Carbonate Composition from Reflectance Spectra,” Data Mining and Knowledge Discovery, in press.)
Structure of the Projects • Statistical Foundations: multiple testing problem; measurement error models • Search Algorithms: different kinds of inputs; different assumptions about background knowledge • Experiments: microarray; SAGE • Testing: application to known genetic regulatory networks; application to simulated data
First Year Results: Algorithms • Many algorithms for inferring causal networks that have been applied to inferring gene regulatory networks assume the input is associations between measured features of individuals • But microarrays and SAGE measure average gene expression levels over many cells rather than for a single cell • What is the feasibility of inferring regulatory networks from associations between averages? • Feasibility for linear and local-linear regulatory functions • Impossibility for the mathematical form of the regulatory function of the sea urchin Endo 16 gene, one of the best established. • T. Chu, C. Glymour, R. Scheines and P. Spirtes, “A Statistical Problem for Inference to Regulatory Structure from Associations of Gene Expression Measurements with Microarrays,” Bioinformatics, submitted.
First Year Results: Statistics • Current methods for determining from SAGE measurements which genes are changing in response to experimental manipulations are incorrect • Correct method requires estimating additional experimental parameters, and leads to the conclusion that many fewer genes are changing than had been previously thought • T. Chu, “Computation of Variance in SAGE Measurements of Gene Expression” Technical Report, Logic, Methodology and Computation, 2002. • Future plan – apply the new method to SAGE measurements of the response of genes to shear stress (data already gathered)
First Year Results: Statistics • Standard techniques for testing whether a gene expression level has changed due to an experimental manipulation were not designed to be applied to test thousands of genes simultaneously • Recent developments (False Discovery Rate tests) do allow simultaneous testing of thousands of genes • Further improvements of the False Discovery Rate procedure have been made • C. Genovese, and L. Wasserman, “Bayesian and Frequentist Multiple Testing”, CMU Department of Statistics Technical Report 764, April, 2002.
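The False Discovery Rate idea mentioned above can be illustrated with the classical Benjamini-Hochberg step-up procedure. This is a minimal sketch of that textbook procedure only, not the improved procedure of the Genovese and Wasserman report; the function name `benjamini_hochberg` and the default level `q=0.05` are assumptions made for illustration.

```python
def benjamini_hochberg(p_values, q=0.05):
    """Return the sorted indices of hypotheses rejected by the
    Benjamini-Hochberg step-up procedure at FDR level q.

    Illustrative sketch: one p-value per gene, in any order.
    """
    m = len(p_values)
    # Indices of the p-values, ordered from smallest to largest.
    order = sorted(range(m), key=lambda i: p_values[i])
    # Find the largest rank k with p_(k) <= k * q / m.
    k_max = 0
    for rank, idx in enumerate(order, start=1):
        if p_values[idx] <= rank * q / m:
            k_max = rank
    # Reject every hypothesis ranked at or below k_max.
    return sorted(order[:k_max])
```

Because the procedure is "step-up", a p-value that fails its own threshold can still be rejected when a larger p-value passes a later threshold, which is why the loop keeps the largest passing rank rather than stopping at the first failure.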
First Year Results: Algorithms • Implementation and testing (on simulated data) of a correct (under explicit assumptions) algorithm for causal clustering and for determining latent structure • R. Silva, CMU Master’s Thesis, Center for Automated Learning and Discovery • Extension to time series of learning algorithms for dynamical Bayes Nets • D. Danks, “Constraint-Based Learning Algorithm for Dynamical Bayes Nets,” Conference on Uncertainty in Artificial Intelligence, submitted. • Development and proof of correctness for an improved algorithm for inferring Bayes networks across distinct data sets with overlapping variable sets • D. Danks, “Efficient Learning of Bayes Nets from Databases with Overlapping Variables,” IHMC Technical Report, 2002.
First Year Results: Algorithms • Development and testing of algorithms for maximizing information obtained from “knockout” experiments • R. Silva, C. Glymour, D. Danks, “Inferring Genetic Regulatory Structure from First and Second Moments,” Technical Report, Logic, Methodology and Computation, 2002. • Development, implementation and testing of a genetic algorithm for linear Bayes networks (structural equation models) • S. Harwood and R. Scheines, “Learning Linear Causal Structure Equation Models with Genetic Algorithms” (2001) Tech Report CMU-PHIL-128, submitted to Conference on Knowledge Discovery and Data Mining. • S. Harwood and R. Scheines, “Genetic Algorithm Search over Causal Models” (2001) Tech Report CMU-PHIL-131, submitted to Conference on Uncertainty in Artificial Intelligence. • Development of an algorithm for regulatory structure from mixed observational and knockout data
First Year Results: Testing • Very few genetic regulatory networks are known, and even fewer details about the functional relationships among the genes are known • How can the accuracy of a causal discovery algorithm be tested? • Generate simulated data from made up gene regulatory networks, so that the generating mechanism is known
First Year Results: Testing • Implementation of a flexible program for generating simulated microarray data that allows the user to conveniently specify many different: functional relationships between cells; measurement errors; numbers of cells averaged over; and gene regulatory network structures (including varying time lags) • J. Ramsey and R. Scheines, (2001) “Simulating Genetic Regulatory Networks,” Technical Report CMU-PHIL-124. • Implementation of half a dozen algorithms proposed in the literature for inferring regulatory structure from expression associations in microarray measurements (more to be implemented)
First Year Results: Experiments • Fat cells from mice are treated with troglitazone, which increases the efficiency of the biological actions of insulin in diabetes and obesity • Which genes are activated? • Microarray chips used to make 47 measurements of gene expression level at 35 time points for 5355 genes
First Year Results: Experiments • Normalize data to remove chip-to-chip effects • Perform statistical tests to determine which genes are changing, adjusting for multiple tests • [Figure: comparison of the 20 genes that change most with the 20 that change least]
Current Work: Experiments • Remove outlying genes • Improve the test performed for whether a gene is changing over time • Introduce clustering methods for data • Use slower but more accurate measurement techniques (Northern blots) to test the hypotheses about which genes change according to the microarray analysis, and to learn about errors in measurement when using microarrays
Gene Research Plans: May 2002 – May 2003 • Study statistical properties of multiple decisions and of conditional independence among averaged variables • Develop new algorithms for optimal information extraction and implement algorithms proposed in the literature • Implement Simulator • Laboratory SAGE and microarray study of expression under varying surface flows and drug treatments • (Where we are) • Test algorithms on real and simulated data • Analyze data • Make Predictions • (Where we will be) • Knockout Experiments • Overall Evaluation
Latent Structure Research Plans, 2002-2003 • Improve efficiency • Test on large simulated data sets • Prove asymptotic correctness • Investigate non-linear generalizations
Supplementary Material – Outline • Discovering the Structure of Genetic Regulatory Networks • Testing Algorithms – Simulator • Analysis of Gene Expression Levels Averaged Over Many Cells • Analysis of SAGE Data • Latent Structure: Causal Clustering • Experiments: Experiment 1 – Microarray analysis; Experiment 2 – SAGE analysis
Simplified Gene Regulatory Network • [Diagram: Environment and genes G1–G6, each gene with its transcript (mRNA1–mRNA6) and its protein (protein1–protein6)]
Still More Simplified • [Diagram: Environment and genes G1–G6 only]
Two Strategies for Discovering Gene Regulatory Networks • (Difference) Enhance or suppress specific genes and measure the changes in expression levels of other genes. Infer effects of the manipulated gene from differences in expression levels of other genes versus unmanipulated controls • (Association) Use wild-type cells or cells with specific enhanced or suppressed levels of other genes. Infer effects from associations of expression levels of all genes
Measurement Techniques • Microarray techniques allow measurements of relative mRNA concentrations from multiple tissue sources • mRNA concentrations for thousands of genes can be measured simultaneously • Measurements can be taken in time sequence, every few minutes • Serial Analysis of Gene Expression (SAGE) allows estimation of concentrations of mRNA transcripts for essentially the entire genome—does not require prior knowledge of all genes
Difference Method • Several examples of partial identification of part of the regulatory network for several species • Limitations: • Laborious and expensive • Each experiment can only tell us which genes are regulated by a manipulated gene, nothing about the pathway of regulation • E.g., if gene A is suppressed and genes B and C change in consequence, the experiment does not distinguish among: A → B → C; A → C → B; C ← A → B
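The ambiguity in the example can be checked mechanically. The sketch below assumes the three candidate pathways are read as A → B → C, A → C → B, and C ← A → B (an assumed reading of the slide's diagram), and shows that the set of genes downstream of A is the same in all three, so a single suppression of A cannot tell them apart. The helper name `descendants` and the dictionary encoding of the graphs are illustrative choices.

```python
def descendants(graph, node):
    """Return every gene reachable from `node` along directed edges.

    `graph` maps a regulator to the list of genes it directly regulates.
    """
    seen = set()
    stack = [node]
    while stack:
        current = stack.pop()
        for child in graph.get(current, ()):
            if child not in seen:
                seen.add(child)
                stack.append(child)
    return seen


# Three candidate regulatory pathways (assumed reading of the diagram).
candidates = {
    "A -> B -> C": {"A": ["B"], "B": ["C"]},
    "A -> C -> B": {"A": ["C"], "C": ["B"]},
    "C <- A -> B": {"A": ["B", "C"]},
}

# Genes whose expression can change when A is suppressed, per candidate.
affected = {name: descendants(g, "A") for name, g in candidates.items()}
```

In every candidate, `affected` is the set containing B and C. Distinguishing the pathways would need a further manipulation, for example suppressing B and observing whether C still responds.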
Difference Method - Fundamental Problems • How to make optimal multiple statistical decisions about expression differences • How to efficiently extract all information from an experiment • How to dynamically schedule experiments for maximal information
Association Method • An example or two of recovery of regulatory structure previously established by Difference methods. No novel discoveries so far. • Requires larger number of experimental repetitions • Depends on statistical methods for implicitly or explicitly estimating conditional probability relations among cellular expression levels
Simulator • User specifies • Functional relationships between cells • Measurement errors • Averaging over different numbers of cells • Gene regulatory network structures (including varying time lags) • Type of experiment • This provides a known structure to test algorithms on, under a variety of assumptions about how genes are related
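As a rough illustration of the kind of choices listed above, the following sketch simulates a dish of cells with a two-gene chain G1 → G2 and returns dish-level averages. It is not the project's simulator: the linear regulatory function, the Gaussian noise, the one-step time lag, and all names and defaults (`simulate_dish`, `coef`, `noise_sd`, `meas_sd`) are assumptions made for illustration.

```python
import random


def simulate_dish(n_cells, n_steps, coef=0.8, noise_sd=1.0, meas_sd=0.1, seed=0):
    """Simulate a toy network G1 -> G2 in each cell and return a list of
    (measured mean G1, measured mean G2) pairs, one per time step.

    Assumptions (hypothetical): G2 at time t is a linear function of the
    same cell's G1 at time t-1 plus Gaussian noise; the measurement is the
    dish average plus Gaussian measurement error.
    """
    rng = random.Random(seed)
    # Initial G1 level in every cell.
    g1 = [rng.gauss(0.0, noise_sd) for _ in range(n_cells)]
    measurements = []
    for _ in range(n_steps):
        # G2 responds to the previous step's G1 (time lag of one step).
        g2 = [coef * x + rng.gauss(0.0, noise_sd) for x in g1]
        # G1 has no regulator in this toy network: fresh noise each step.
        g1 = [rng.gauss(0.0, noise_sd) for _ in range(n_cells)]
        # Measurement: average over all cells, plus measurement error.
        mean_g1 = sum(g1) / n_cells + rng.gauss(0.0, meas_sd)
        mean_g2 = sum(g2) / n_cells + rng.gauss(0.0, meas_sd)
        measurements.append((mean_g1, mean_g2))
    return measurements
```

Calling `simulate_dish(n_cells=10, n_steps=450)` and then repeating with 100 and 1,000 cells gives aggregated series of the kind one would compare across dish sizes; the generating structure is known, so any discovery algorithm run on the output can be scored against it.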
Simulating MicroArray Data • Tetrad 4 (www.phil.cmu.edu/projects/tetrad) • [Screenshot panels: network structure; functional form; parameters]
Data Output • Cell by cell: raw data • Aggregated measurements
Simulating MicroArray Data • Simulated correlation between genes 1 and 3, averaging over different numbers of cells (10, 100, and 1,000 cells/dish), over 450 time steps
Averaging and Association • Goal is to discover the structure of a regulatory network from associations among expression levels of each pair of genes, and their associations conditional on values of other genes • But we measure only concentrations—averages—formed from the mRNA of many cells • For many systems, conditional associations are altered by averaging
The Endo 16 Regulatory Function • Regulation of the Endo16 gene of the sea urchin (from C. Yuh, H. Bolouri, E. Davidson, “Genomic Cis-Regulatory Logic: Experimental and Computational Analysis of a Sea Urchin Gene,” Science, 1998, March 20; 279: 1896-1902)
The Endo 16 Regulatory Function, Slightly More Algebraically • If (CG1 * P)(B(t) + G(t)) > 0, then Q(t) = 2 (1 - (F + E + CD) Z)(1 + CG2 * CG3 * CG4)(CG1 * P)(B(t) + G(t)) • Else Q(t) = 2 (1 - (F + E + CD) Z)(1 + CG2 * CG3 * CG4) Otx(t) • and “+” is Boolean sum
Conditional Independence Is Not Invariant in a Simplified Form of Endo 16 Regulation • X takes values in a discrete set, say {0, 1, 2, 3, 4} • Y = g(X), g nonlinear, say Y = X^2 • Z = a Y*W, a real, W Boolean (values in {0, 1}), with a Bernoulli distribution • [Diagram: X → Y → Z ← W]
Conditional Independence Is Not Invariant in a Simplified Form of Endo 16 Regulation • X is independent of Z conditional on Y, but… • ΣX is not independent of ΣZ conditional on ΣY, where each sum is over n = 4 or more independent and identically distributed units • For large n this result generalizes to all cases in which the range of X is finite (but not binary), g is polynomial, and W is as above
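The claim can be checked by exact enumeration for the simplified model with X in {0, 1, 2, 3, 4}, Y = X^2, and Z = a Y W. The sketch below makes several assumptions not stated on the slides: X is uniform, W is Bernoulli(1/2), and a = 1. The function names `joint_of_sums` and `x_indep_z_given_y` are hypothetical. It compares a single cell (n = 1) with sums over n = 4 cells, using exact rational arithmetic so that no numerical tolerance is needed.

```python
from collections import defaultdict
from fractions import Fraction
from itertools import product


def joint_of_sums(n, x_values=(0, 1, 2, 3, 4), p_w=Fraction(1, 2), a=1):
    """Exact joint distribution of (sum X, sum Y, sum Z) over n i.i.d. cells,
    with Y = X**2 and Z = a * Y * W.

    Assumed for illustration: X uniform on x_values, W ~ Bernoulli(p_w).
    """
    p_x = Fraction(1, len(x_values))
    joint = defaultdict(Fraction)
    for xs in product(x_values, repeat=n):
        for ws in product((0, 1), repeat=n):
            prob = p_x ** n
            for w in ws:
                prob *= p_w if w else 1 - p_w
            ys = [x * x for x in xs]
            sz = sum(a * y * w for y, w in zip(ys, ws))
            joint[(sum(xs), sum(ys), sz)] += prob
    return dict(joint)


def x_indep_z_given_y(joint):
    """True iff the first coordinate is independent of the third given the
    second, i.e. P(x, y, z) * P(y) == P(x, y) * P(y, z) for all x, y, z."""
    p_y = defaultdict(Fraction)
    p_xy = defaultdict(Fraction)
    p_yz = defaultdict(Fraction)
    for (sx, sy, sz), p in joint.items():
        p_y[sy] += p
        p_xy[(sx, sy)] += p
        p_yz[(sy, sz)] += p
    for (sx, sy), pxy in p_xy.items():
        for (sy2, sz), pyz in p_yz.items():
            if sy2 == sy and joint.get((sx, sy, sz), 0) * p_y[sy] != pxy * pyz:
                return False
    return True
```

With n = 1 the check returns `True`, and with n = 4 it returns `False`. One violating case: given ΣY = 4, the cells are either (1, 1, 1, 1) with ΣX = 4 or some arrangement of (2, 0, 0, 0) with ΣX = 2, and ΣZ = 1 is possible only in the first case.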
General Pessimistic Conclusion (not a Theorem) • Conditional probability relations that hold among regulator and regulated gene transcript concentrations at the cellular level will not be preserved in probability relations as measured in microarrays taken from multiple cell sources • They will be preserved for linear systems and “locally linear” systems (see Chu, et al.), but no regulatory systems are as yet known to have such a structure
Difference Strategy and SAGE • Estimating whether expression levels of genes change in different environments, or with other genes removed, requires a comparison of expression levels across samples • A decision must be made as to whether observed differences are or are not due to chance
SAGE and Variance • Decisions as to whether differences in expression levels are or are not due to chance depend on the estimate of the variance of the underlying probability distribution • Standardly, a multinomial model is used, which gives a very large variance, meaning that decisions about the constancy of a gene’s expression across environments cannot be reliably made
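For reference, the textbook variance under a multinomial model can be written out as follows. This is only the standard multinomial formula, not the corrected computation described in the technical report; the symbols N (total number of tags sequenced), p_i (true relative abundance of the tag for gene i), and X_i (observed count for gene i) are notation assumed here.

```latex
\[
  \mathbb{E}[X_i] = N p_i, \qquad
  \operatorname{Var}(X_i) = N p_i (1 - p_i), \qquad
  \operatorname{Var}\!\left(\hat{p}_i\right)
    = \operatorname{Var}\!\left(\frac{X_i}{N}\right)
    = \frac{p_i (1 - p_i)}{N}.
\]
```

A test of whether gene i differs between two samples compares the difference of the two estimated proportions with the standard error implied by these variances, so the conclusion depends directly on which variance model is adopted.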