Two Classifiers for Bioinformatics Course: CSI7162 Presented to Dr. Stan Matwin Presented by Jun Ouyang
Road Map • Introduction • Discovery of regulatory connections in Microarray data (Gene level) • Classifying protein fingerprints (Protein Level) • Conclusion • Research Challenges
Introduction to Bioinformatics • An emerging interdisciplinary research area • Interface between the biological and computational sciences • Computational management of all kinds of biological information • Research scope of bioinformatics (hierarchical biological information): DNA → mRNA → protein → protein interactions → informational pathways → informational networks → cells → tissues or networks of cells → an organism → populations → ecologies www.ee.nthu.edu.tw/bschen/files/Bioinformatics.ppt
Introduction to Bioinformatics • DNA level: • DNA sequence alignment; gene prediction; gene evolution; … • RNA level: • study of gene expression; transcription mechanisms; post-transcriptional modification; … • Protein level: • protein 2D and 3D structure prediction; protein active-site prediction; protein-protein interactions; protein-DNA interactions; … • System level (pathways, networks): • Genome (gene-to-gene interactions) • Ex: use gene chips to study gene regulatory networks • Proteome (protein-protein interactions) • Ex: use protein chips to study protein interaction networks www.ee.nthu.edu.tw/bschen/files/Bioinformatics.ppt
Discovery of Regulatory Connections in Microarray Data (Gene level) • What is a microarray? • Obtaining microarray data • Definition of regulatory relations between sets of genes (class labels) • Reversible Jump Markov Chain Monte Carlo (RJMCMC) Algorithm to learn dynamic Bayesian network • Use dynamic Bayesian network classifiers to predict regulatory relations • Conclusion (1)
What is a Microarray? • A kind of gene chip used to discover gene function or gene expression patterns • Allows many such patterns to be studied in parallel • Example: • In each location, a known probe (a known strand of cDNA) is placed with labeled cDNA from a certain sample, e.g. cDNA from cancerous and from healthy cells • The colour of a spot indicates the relative abundance of the labeled cDNA, i.e. how strongly the corresponding gene has been activated
Obtaining Microarray Data • What are the steps? [1] Choose a cell population (or a sample for diagnosis) [2] Extract and purify mRNA [3] Fluorescently label the cDNA [4] Combine the different strands of cDNA on the microarray [5] Scan the array over time [6] Interpret the resulting time series to study gene regulation over time
Regulatory Relations between Sets of Genes • The goal of this work is to identify which genes regulate each other • We are interested in two types of gene regulation: • Co-regulation: two genes are perfectly co-regulated when their relative-abundance functions w.r.t. time have the same first-order derivatives • Control-regulation: two genes are inversely co-regulated, or control-regulated, when their relative-abundance functions w.r.t. time have first-order derivatives which are inverses of one another • A small sketch of this distinction follows below
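To make the derivative-based definitions above concrete, here is a minimal Python sketch (an illustration, not the paper's algorithm) that compares finite differences of two hypothetical expression profiles; "inverse" derivatives are read here as negated derivatives.

import numpy as np

def regulation_type(expr_a, expr_b):
    # Compare first-order differences (a stand-in for derivatives) of two
    # relative-abundance time series sampled at the same time points.
    da = np.diff(np.asarray(expr_a, dtype=float))
    db = np.diff(np.asarray(expr_b, dtype=float))
    if np.allclose(da, db):
        return "co-regulated"          # identical derivatives
    if np.allclose(da, -db):
        return "control-regulated"     # derivatives are negatives of one another
    return "neither (by this crude test)"

a = [0.0, 1.0, 2.0, 1.5]   # hypothetical expression profile
b = [3.0, 4.0, 5.0, 4.5]   # rises and falls exactly with a
c = [2.0, 1.0, 0.0, 0.5]   # mirrors a
print(regulation_type(a, b))   # co-regulated
print(regulation_type(a, c))   # control-regulated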
Microarray Data Representation • From the microarray we obtain a time series describing the gene interactions • At different moments in time the microarray shows different colours, depending on which genes are active
Discretization and Classifier Construction • We must discretize the time signal in order to facilitate learning with a Markov model • For each point in time, the sample value is labeled as a change, a local minimum, or a local maximum • These features are used to learn a dynamic Bayesian classifier using a variant of the Reversible Jump Markov Chain Monte Carlo (RJMCMC) technique • The classifier can then identify gene interactions as co-regulated or control-regulated • Details provided later…
Discretization of Continuous Measurements • The time series is represented as change, local minima, and local maxima • The data are then re-encoded using 2 binary variables for the 3 possible values • A minimal sketch of this step appears below
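A minimal sketch of this discretization, assuming "change" means neither a local minimum nor a local maximum and using one arbitrary 2-bit encoding of the three values (the paper's exact encoding may differ):

def discretize(series):
    # Label each interior time point as a local minimum, local maximum,
    # or a plain change, then re-encode the 3 symbolic values with 2
    # binary variables (illustrative encoding only).
    labels = []
    for t in range(1, len(series) - 1):
        prev, cur, nxt = series[t - 1], series[t], series[t + 1]
        if cur < prev and cur < nxt:
            labels.append("min")
        elif cur > prev and cur > nxt:
            labels.append("max")
        else:
            labels.append("change")
    code = {"change": (0, 0), "min": (0, 1), "max": (1, 0)}
    return labels, [code[l] for l in labels]

labels, bits = discretize([0.1, 0.5, 0.3, 0.2, 0.6, 0.4])
print(labels)   # ['max', 'change', 'min', 'max']
print(bits)     # [(1, 0), (0, 0), (0, 1), (1, 0)]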
Monte Carlo Principle • If we take a sample every 1/100th of a second and measure for 10 minutes, we get 60,000 samples per gene • We need a method for reducing the number of samples without destroying the pertinent details • Given a very large set X and a distribution p(x) over it, we draw an i.i.d. set of N samples • We can then approximate the distribution, and expectations under it, using these samples (see the sketch below) An Introduction to Markov Chain Monte Carlo, Teg Grenager
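A tiny illustration of the Monte Carlo principle itself, using a standard normal for p(x) and the expectation of x squared as the quantity being approximated (purely illustrative; not the sampler used in the paper):

import numpy as np

rng = np.random.default_rng(0)

# p(x) is a standard normal here purely for illustration; f(x) = x**2,
# whose true expectation under p is 1.0.
N = 10_000
samples = rng.standard_normal(N)        # draw N i.i.d. samples from p(x)
estimate = np.mean(samples ** 2)        # sample mean approximates E[f(X)]
print(f"Monte Carlo estimate of E[X^2]: {estimate:.3f} (true value 1.0)")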
Dynamic Bayesian Network Classifiers • A Bayesian network: a statistical model for capturing the direct dependencies between discrete stochastic variables • Train dynamic Bayesian networks to discover relations between sets of genes • The Bayesian networks are trained using a variant of the RJMCMC sampling algorithm • [Figure: example expression profiles over time for a co-regulated and a control-regulated gene pair]
Learning with Probabilistic Network Classifiers • Objective function of the network classifier, with: • D = (C, X): the learning database, separated into predictors X and class C • G: a DAG structure • X: a set of predictor variables • C: a classification variable • L(C): a score function • Further quantities in the objective: a modified step function, the span of indeterminacy, and the bounding probability • Goal: sample models from the target distribution P(L(C) | D) • A simplified sampling sketch follows below
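RJMCMC adds trans-dimensional moves that let the sampler jump between models of different size; the sketch below shows only the simpler fixed-dimension Metropolis-Hastings idea of drawing samples from an unnormalized target, with a toy one-dimensional target standing in for P(L(C) | D):

import numpy as np

rng = np.random.default_rng(1)

def unnormalized_target(x):
    # Toy stand-in for the unnormalized target distribution; in the paper
    # the target is P(L(C)|D) over network structures, which is far richer.
    return np.exp(-0.5 * (x - 2.0) ** 2) + 0.5 * np.exp(-0.5 * (x + 2.0) ** 2)

x = 0.0
samples = []
for _ in range(20_000):
    proposal = x + rng.normal(scale=1.0)                  # symmetric random-walk proposal
    accept = min(1.0, unnormalized_target(proposal) / unnormalized_target(x))
    if rng.random() < accept:                             # Metropolis accept/reject step
        x = proposal
    samples.append(x)

print("estimated mean of the target:", round(float(np.mean(samples[5000:])), 3))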
Experimental Results • Testing results on the yeast cell-cycle expression dataset • [Figure: predicted co-regulated and control-regulated gene pairs]
Conclusion (1) • A new approach to discovering putative regulatory relations between genes • The processing techniques capture dynamic relations between sets of genes • The results obtained from the microarray data are promising
Classifying Protein Fingerprints • Motivation • Task and Data Representation • Data Preprocessing and Classification Methods • Experimentation and Results • Conclusion (2)
Motivation • The need for automated protein fingerprint labeling • Protein fingerprint: a group of amino acid motifs used to characterize protein families • These fingerprints may be useful in grouping proteins together • Improve on PRECIS (an annotation tool) • PRECIS performs poorly (40% error rate) • It classifies fingerprints using simple heuristics
Task and Data Representation • Goal: replace PRECIS's handcrafted heuristics with classification models learned from data • Features are drawn from three distinct levels: • The fingerprint itself • Its component motifs (a motif is a common sequence of amino acids) • The constituent protein sequences
Task and Data Representation • Fingerprint features: • Number of motifs (nmt) • Number of proteins (npr) • True-positive rate • Partial-positive rate • Motif features: • Motif length (average, stdev, etc.) • Motif coverage (average, stdev, etc.) • Motif entropy (average, stdev, etc.) • Intermotif distance (average, stdev, etc.)
Task and Data Representation • Protein sequence features: • SWISS-PROT ID: fraction of proteins with an ID • LHS of the ID: fraction of proteins whose LHS length >= 3|4 chars; fraction of proteins with common first 1|2|3|4 chars in the LHS; entropy of the LHS averaged over the first 1|2|3|4 chars • RHS of the ID: fraction of proteins with a common RHS (species); entropy of the RHS taken as a unit • CC-belongs: sequence belongs to a family • CC-contains: sequence contains a domain • A small sketch of two of these features follows below
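As a rough illustration of two of these features, the sketch below computes the positional entropy of the LHS and the fraction of proteins sharing the most common RHS for a few hypothetical SWISS-PROT-style IDs (the exact feature definitions in the paper may differ):

import math
from collections import Counter

def column_entropy(strings, k):
    # Shannon entropy of each of the first k character positions, averaged;
    # a rough sketch of the "entropy of LHS averaged over first k chars" feature.
    total = 0.0
    for pos in range(k):
        counts = Counter(s[pos] for s in strings if len(s) > pos)
        n = sum(counts.values())
        total += -sum((c / n) * math.log2(c / n) for c in counts.values())
    return total / k

# Hypothetical SWISS-PROT-style IDs, split at '_' into LHS and RHS (species).
ids = ["OPSD_HUMAN", "OPSD_MOUSE", "OPSB_HUMAN", "RHO_BOVIN"]
lhs = [i.split("_")[0] for i in ids]
rhs = [i.split("_")[1] for i in ids]

print("entropy of LHS over first 3 chars:", round(column_entropy(lhs, 3), 3))
print("fraction sharing the most common RHS:", max(Counter(rhs).values()) / len(ids))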
Data Preprocessing • Dealt with missing values using a KNN-based imputation technique • Considered several feature selection algorithms: • Ranking based on information gain: I(X, Y) = H(X) - H(X|Y) = H(Y) - H(Y|X) • Ranking based on mutual information • A toy ranking example follows below
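A toy example of ranking discrete features by the information gain formula quoted above (an illustration, not the paper's implementation):

import math
from collections import Counter

def entropy(values):
    counts = Counter(values)
    n = len(values)
    return -sum((c / n) * math.log2(c / n) for c in counts.values())

def information_gain(feature, labels):
    # I(X; Y) = H(Y) - H(Y | X) for a discrete feature X and class labels Y.
    h_y = entropy(labels)
    n = len(labels)
    h_y_given_x = 0.0
    for value in set(feature):
        subset = [labels[i] for i, v in enumerate(feature) if v == value]
        h_y_given_x += (len(subset) / n) * entropy(subset)
    return h_y - h_y_given_x

y  = ["family", "family", "domain", "domain"]   # toy class labels
x1 = ["a", "a", "b", "b"]                       # perfectly informative feature
x2 = ["a", "b", "a", "b"]                       # uninformative feature
feats = {"x1": x1, "x2": x2}
ranking = sorted(feats, key=lambda name: information_gain(feats[name], y), reverse=True)
print(ranking)   # ['x1', 'x2']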
Classification Methods • This work compares the performance of several machine learning algorithms when combined with a feature selection method • ML algorithms considered: • Logic-based learning algorithms: decision trees and rules (J48, C5.0, etc.) • Density-estimation-based learners: naive Bayes (NBayes), instance-based learning (IBL), linear discriminants (Lindisc), multilayer perceptrons (MLPs), SVM with an RBF kernel (SVM-RBF) • A rough sketch of such a comparison follows below
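A rough scikit-learn analogue of such a comparison, using stand-ins for the algorithms listed above and a synthetic 45-feature dataset in place of the fingerprint data (illustrative only; the paper used other implementations such as J48 and C5.0):

from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier
from sklearn.naive_bayes import GaussianNB
from sklearn.neighbors import KNeighborsClassifier
from sklearn.neural_network import MLPClassifier
from sklearn.svm import SVC

# Synthetic 45-feature data standing in for the fingerprint features.
X, y = make_classification(n_samples=300, n_features=45, n_informative=10,
                           n_classes=3, random_state=0)

models = {
    "decision tree (stand-in for J48/C5.0)": DecisionTreeClassifier(random_state=0),
    "naive Bayes": GaussianNB(),
    "IBL (k-NN)": KNeighborsClassifier(),
    "MLP": MLPClassifier(max_iter=2000, random_state=0),
    "SVM-RBF": SVC(kernel="rbf"),
}
for name, model in models.items():
    scores = cross_val_score(model, X, y, cv=10)       # 10-fold cross-validation
    print(f"{name}: CV error rate = {1 - scores.mean():.3f}")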
Experimentation and Results • [Table: error rates on the full 45-feature set; CV = cross-validation, HO = holdout; best performance highlighted]
Experimentation and Results • [Table: cross-validation and holdout error rates after feature selection]
Conclusion (2) • SVM does not seem to benefit from the feature selection process (feature selection removed only 9 features!) • A classifier learned with SVM-RBF achieves a 26% improvement in accuracy over PRECIS
Research Challenges • First paper • Validate the new approach on real data sets (only simulated data was used) • Second paper • Correcting data imbalance to increase accuracy • Incorporate available data from other databases
References • M. Egmont-Petersen, W. de Jonge, A. Siebes. "Discovery of regulatory connections in microarray data," In Proceedings of the 8th European Conference on Principles and Practice of Knowledge Discovery in Databases (PKDD), 2004, pp. 149-160. • M. Hilario, A. Mitchell, J.-H. Kim, P. Bradley, T. K. Attwood. "Classifying protein fingerprints," In PKDD 2004, pp. 197-208. • M. Egmont-Petersen. "Discovering possible co-relations and control-regulations between gene pairs in time series microarray data using salient dynamic features," Presented at the working group Bioinformatics, Symposium 2004. • M. Egmont-Petersen. "Feature selection by Markov Chain Monte Carlo sampling - a Bayesian approach," In Structural, Syntactic, and Statistical Pattern Recognition, Proceedings of the Joint IAPR Workshops SSPR 2004 and SPR 2004, Lecture Notes in Computer Science 3138, Eds. A. Fred et al., pp. 1034-1042, 2004. • P. T. Spellman, G. Sherlock, M. Q. Zhang, V. R. Iyer, K. Anders, M. B. Eisen, P. O. Brown, D. Botstein, B. Futcher. "Comprehensive identification of cell cycle-regulated genes of the yeast Saccharomyces cerevisiae by microarray hybridization," Molecular Biology of the Cell, Vol. 9, No. 12, pp. 3273-3297, 1998. • P. J. Green. "Reversible jump Markov chain Monte Carlo computation and Bayesian model determination," Biometrika, Vol. 82, No. 4, pp. 711-732, 1995. • T. Grenager. "An Introduction to Markov Chain Monte Carlo," July 1, 2004. • www.ee.nthu.edu.tw/bschen/files/Bioinformatics.ppt
Q & A Thank you!