Combining heterogeneous data to reverse engineer regulatory networks Allan Tucker School of Information Systems Computing and Mathematics, Brunel University, London. UB8 3PH
Intelligent Data Analysis • IDA attempts to deal with data explosion to discover patterns and knowledge from data • Typical analysis tasks: • Clustering • Classification • Feature Selection • Prediction and Forecasting • Structure identification
Bayesian Networks • An IDA method to model a domain using probabilities • Easily interpreted by non-statisticians • Can be used to combine existing knowledge with data • Essentially use independence assumptions to model the joint distribution of a domain
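As a minimal sketch of how independence assumptions factorise a joint distribution, consider a hypothetical three-variable network A -> B, A -> C. The variables and all probabilities below are made up for illustration; they do not come from the talk.

```python
from itertools import product

# Hypothetical binary network A -> B, A -> C (all numbers are invented).
p_a = {True: 0.3, False: 0.7}
p_b_given_a = {True: {True: 0.9, False: 0.1}, False: {True: 0.2, False: 0.8}}
p_c_given_a = {True: {True: 0.6, False: 0.4}, False: {True: 0.1, False: 0.9}}


def joint(a, b, c):
    """P(a, b, c) = P(a) * P(b | a) * P(c | a), using B independent of C given A."""
    return p_a[a] * p_b_given_a[a][b] * p_c_given_a[a][c]


# The factorised joint still sums to one over all eight configurations.
total = sum(joint(a, b, c) for a, b, c in product([True, False], repeat=3))
```

The factorised model needs 1 + 2 + 2 = 5 free parameters instead of the 7 required by a full joint table over three binary variables.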
Informative Priors • To build BNs we can also use prior structures and probabilities • These are then updated with data • Priors are usually uniform (equal probability) • Informative priors are used to incorporate existing knowledge into BNs
Microarray Data • Major source of data for gene expression activity • Technology takes measurements over 1000s of genes simultaneously • Gene Regulatory Networks (GRNs) model how genes interact • Eliciting reliable GRNs from data is key to understanding biological mechanisms
But... • Reliability issues that surround microarray gene expression data • Mechanisms in different systems & species • Can we build GRN models that have enhanced performance, based on a richer and/or broader collection of data than a single microarray dataset?
The talk • Incorporating literature priors • Consensus networks • Models of Increasing Complexity • Interspecies analysis
Literature-based priors • Information about biomedical concepts such as genes summarized using concept profiling (Jelier et al., 2007; Schuemie et al., 2007a) • Combine information from several databases, including Entrez Gene, Uniprot, and the Saccharomyces Genome Database • Concept profile is a vector of concepts with weights • Weight represents uncertainty between occurrence of one concept and another • Reference: Steele, E., Tucker, A., 't Hoen, P.A.C. and Schuemie, M.J. (2009), Literature-Based Priors for Gene Regulatory Networks, Bioinformatics 25(14): 1768-1774
Literature-based priors • Perform Pearson correlation on concept profiles of genes to create a literature matrix • Translate correlations into probabilities using confidence scores: each score represents the probability that a particular correlation was not drawn from the distribution of random gene-pair correlations • This is not equal to the probability that an edge exists; see Segal et al. (2002) and Efron (2007) • Incorporate as a prior into the BIC score, where S is the structure, D the data, w the prior weight, k the number of parameters and n the sample size: • BIC = w log P(S) + log P(D|S) - 0.5 k log(n)
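The steps on this slide can be sketched roughly as follows. This is an illustration under stated assumptions, not the published implementation: the empirical confidence score, the independent-edge form of log P(S), and all function names are hypothetical choices made here for clarity.

```python
import math

import numpy as np


def literature_matrix(profiles):
    """Pearson correlation between rows (one concept profile per gene)."""
    return np.corrcoef(profiles)


def confidence(corr, null_corrs):
    """Assumed empirical score: share of random gene-pair correlations below corr."""
    return float(np.mean(np.asarray(null_corrs) < corr))


def log_structure_prior(edges, prior, n_genes, eps=1e-6):
    """log P(S), assuming (hypothetically) that edges are independent a priori."""
    logp = 0.0
    for i in range(n_genes):
        for j in range(n_genes):
            if i == j:
                continue
            p = min(max(prior[i][j], eps), 1.0 - eps)  # keep the log finite
            logp += math.log(p) if (i, j) in edges else math.log(1.0 - p)
    return logp


def weighted_bic(log_data_term, log_prior, k, n, w):
    """Prior-weighted score: w * log prior + data term - 0.5 * k * log(n)."""
    return w * log_prior + log_data_term - 0.5 * k * math.log(n)
```

Setting `w = 0` recovers the ordinary BIC, so the weight controls how strongly the literature can pull the search towards or away from an edge.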
The Experiments • Test our approach on synthetic networks generated using differential equations, and on yeast and E. coli studies with known regulatory structures • Report on ROC analysis: • True Positives: links that are correctly identified • False Positives: links that are incorrectly identified • False Negatives: links that are missed • True Negatives: absent links that are correctly left out • Also assess predictive power using cross-validation (CV)
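A minimal sketch of the four counts for a learnt network, assuming directed edges are represented as (source, target) pairs and that every ordered pair of distinct nodes is a candidate link. The function names are illustrative only.

```python
def edge_confusion(true_edges, predicted_edges, nodes):
    """Count TP, FP, FN, TN over all ordered pairs of distinct nodes."""
    possible = {(a, b) for a in nodes for b in nodes if a != b}
    true_set = set(true_edges) & possible
    pred_set = set(predicted_edges) & possible
    tp = len(true_set & pred_set)             # links correctly identified
    fp = len(pred_set - true_set)             # links incorrectly identified
    fn = len(true_set - pred_set)             # links that are missed
    tn = len(possible - true_set - pred_set)  # absent links correctly left out
    return tp, fp, fn, tn


def roc_point(tp, fp, fn, tn):
    """One (false positive rate, true positive rate) point on the ROC curve."""
    tpr = tp / (tp + fn) if (tp + fn) else 0.0
    fpr = fp / (fp + tn) if (fp + tn) else 0.0
    return fpr, tpr
```

Sweeping a threshold on edge confidence and plotting the resulting points gives the ROC curve.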
Yeast and E. coli Network Analysis • Issues with circularity when validating
Literature Priors Conclusions • A literature prior weight of between 0.4 and 0.6 appears to be the best choice for identifying relevant regulatory edges on human data for mechanisms involving Muscular Dystrophy • Higher prior weights lead to the inclusion of too many edges (literature associations that are not of a regulatory nature) • This is a lower weight than the optimum prior weights found for yeast and E. coli • Perhaps because there is less literature on the human organism, whereas yeast and E. coli are both well studied
Consensus Bayesian Networks • Different platforms involve different biases: • e.g. oligonucleotide arrays estimate the absolute value of expression, whereas cDNA arrays measure relative differences between genes • Previous research established that comparing datasets using standard normalisation is not straightforward • An attempt to combine multiple microarray data sources through post-learning aggregation • Reference: Steele, E. and Tucker, A. (2008), Consensus and Meta-analysis regulatory networks for combining multiple microarray gene expression datasets, Journal of Biomedical Informatics 41(6): 914-926
Consensus Bayesian Networks • Bootstrapping on each dataset to generate robust networks with confidence • Threshold the confidence and generate a PDAG (due to equivalence classes) • Consensus looks for edges with enough support in the input networks • Edge direction is based upon voting of inputs – or left undirected if there is no consensus or if cycles cannot be resolved
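The voting step might be sketched as below. This is a simplified illustration, not the published algorithm: it assumes each input network is a set of directed (source, target) edges, uses a simple majority for direction, and omits cycle resolution entirely. The names `consensus` and `min_support` are hypothetical.

```python
from collections import Counter


def consensus(networks, min_support):
    """Keep node pairs found in at least min_support inputs; vote on direction."""
    pair_support = Counter()     # number of networks linking a pair, either way
    direction_votes = Counter()  # number of networks containing a directed edge
    for net in networks:
        pairs_in_net = set()
        for a, b in net:
            direction_votes[(a, b)] += 1
            pairs_in_net.add(frozenset((a, b)))
        pair_support.update(pairs_in_net)

    directed, undirected = set(), set()
    for pair, support in pair_support.items():
        if support < min_support:
            continue
        a, b = sorted(pair)
        forward, backward = direction_votes[(a, b)], direction_votes[(b, a)]
        if forward > backward:
            directed.add((a, b))
        elif backward > forward:
            directed.add((b, a))
        else:
            undirected.add((a, b))  # no consensus on direction
    return directed, undirected
```

Raising `min_support` trades recall for precision: fewer edges survive, but each has backing from more input networks.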
Weighting networks • Reference: Steele, E. and Tucker, A. (2009), Selecting and Weighting Data for Building Consensus Gene Regulatory Networks, Advances in Intelligent Data Analysis VIII: 8th International Symposium on Intelligent Data Analysis (IDA 2009), Lecture Notes in Computer Science 5772: 190-201
Models of Increasing Complexity • Specification of three muscle differentiation datasets • Reference: Anvar, S.Y., 't Hoen, P.A.C. and Tucker, A. (2010), The Identification of Informative Genes from Multiple Datasets with Increasing Complexity, BMC Bioinformatics 11: 32
MIC • Select one dataset for training • The others become test sets • Score the mean and variance of the sum of squared errors (SSE) using CV and independent test sets • Use these to rank genes
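A rough sketch of the ranking step, assuming each gene already has a list of SSE values collected from the CV folds and the independent test sets. How the mean and variance are combined is not stated on the slide; ordering by mean and breaking ties by variance is an assumption made here, and the gene names are placeholders.

```python
from statistics import mean, pvariance


def rank_genes(sse_by_gene):
    """Order genes by mean SSE, then by SSE variance (both lower is better)."""
    stats = {gene: (mean(v), pvariance(v)) for gene, v in sse_by_gene.items()}
    return sorted(stats, key=lambda gene: stats[gene])


# Placeholder SSE values, one per CV fold or independent test set.
example = {
    "gene_a": [1.0, 1.0, 1.0],
    "gene_b": [0.0, 2.0, 1.0],
    "gene_c": [0.5, 0.5, 0.5],
}
ranking = rank_genes(example)
```

Genes with low mean error and low variance across datasets are the ones that predict consistently, which is the property the ranking is meant to surface.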
MIC - Datasets • All concerned with the differentiation of cells into the muscle (myogenic) lineage • In-vitro system mimics the formation of new muscle fibres in-vivo • Cao uses embryonic fibroblasts; the others use a tumor cell line that has the potential for differentiation into different lineages (mainly muscle and bone) • Cao uses MyoD and MyoG to force cell differentiation (the others use serum starvation) • Sartorelli includes different treatments that affect timing and efficiency
MIC • Select genes using one dataset (black) at a time and compare the average CV error rate of the BN classifier learnt on that dataset with its error rate when validated on the other two datasets independently (grey) • Cao does well on CV but overfits • Tomczak does well on both
MIC • Select 100 informative genes (KS test) and 50 uninformative genes • Train BN classifier on Tomczak and test on Sartorelli • Rank genes according to average error rate • Score average improvement or deterioration of Myogenesis-Related, Top 100 and 50 randomly selected genes in Sartorelli • Compare our method with rankings generated by the concordance model
MIC Conclusions • Highly predictive and consistent genes from a pool of differentially expressed genes, across independent datasets, are more likely to be fundamentally involved in the biological process under study • Results imply that gene regulatory networks identified in simpler systems can be used to model more complex biological systems
MIC Conclusions • e.g. muscle differentiation: the myogenesis-related network is difficult to derive from in vivo experiments due to the presence of multiple cell types and higher biological variation • But it may become evident after initial training of the network on the cleaner in vitro experiments
Summary • Explored a number of novel techniques for building more reliable GRNs • Incorporating exogenous knowledge in the form of BN priors constructed from biological abstracts • Consensus algorithms for post-learning aggregation of data / networks • Models of increasing complexity for identifying genes that are more confidently associated with a biological process • Future work: extending MIC to inter-organism mechanisms
Thanks • Dr Emma Steele, previously Brunel • Mr Yahya Anvar & Dr Peter-Bram 't Hoen, Leiden University Medical School, Netherlands