This overview explores the use of bioinformatics and statistical learning methods to understand correlations between genotype and phenotype, with applications in protein function, drug therapy, and metabolic pathways.
Bioinformatics Research Overview Li Liao Develop new algorithms and (statistical) learning methods > Capable of incorporating domain knowledge > Effective, Expressive, Interpretable
Motivations • Understanding correlations between genotype and phenotype • Predicting genotype <=> phenotype • Phenotypes: • Protein function • Drug/therapy response • Drug-drug interactions for expression • Drug mechanism • Interacting pathways of metabolism
Projects • Homology detection, protein family classification (funded by a DuPont S&E award) • Support Vector Machines • Hidden Markov models • Graph theoretic methods • Probabilistic modeling for BioSequence (funded by NIH) • HMMs, and beyond • Motif finding • Secondary structure • Comparative Genomics • Identify genome features for diagnostic and therapeutic purposes (funded by an Army grant) • Evolution of metabolic pathways • Tree and graph comparisons
Detect remote homologues Attributes to be looked at: • Sequence similarity, aggregate statistics (e.g., protein families), patterns/motifs, and further attributes (e.g., presence in the phylogenetic tree). How can domain-specific knowledge be incorporated into the model so that a classifier becomes more accurate? Results: • Quasi-consensus based comparison of profile HMMs for protein sequences (submitted to Bioinformatics) • Using extended phylogenetic profiles and support vector machines for protein family classification (SNPD 04) • Combining Pairwise Sequence Similarity and Support Vector Machines for Detecting Remote Protein Evolutionary and Structural Relationships (JCB 2003)
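The idea of combining pairwise sequence similarity with an SVM can be sketched as follows: each protein is turned into a fixed-length vector of similarity scores against a set of reference sequences, and those vectors are what a classifier would then be trained on. This is only a minimal illustration under stated assumptions. The shared 3-mer count used here is a toy stand-in for a real alignment score, and the function names and sequences are hypothetical, not taken from the cited work.

```python
def kmer_set(seq, k=3):
    """Return the set of overlapping k-mers in a sequence."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def similarity(a, b, k=3):
    """Toy similarity: number of distinct k-mers shared by two sequences.

    Assumption: a real pipeline would use an alignment-based score instead.
    """
    return len(kmer_set(a, k) & kmer_set(b, k))


def pairwise_features(seq, reference_seqs, k=3):
    """Represent a sequence by its similarity to every reference sequence."""
    return [similarity(seq, ref, k) for ref in reference_seqs]


# Hypothetical reference set and query sequence.
references = ["VGAHAGEY", "VKATIAEH"]
features = pairwise_features("VGAHAGEF", references)
print(features)
```

The resulting vectors have one component per reference sequence, so any classifier that accepts numeric feature vectors, an SVM included, could be trained on them.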
Data: phylogenetic profiles • How to account for correlations among profile components? • Profile extension (Narra & Liao, SNPD 04) [Figure: example binary profiles x, y, and z compared by Hamming distance and by tree-based distance]
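As a rough illustration of the Hamming distance mentioned on this slide, the sketch below compares binary phylogenetic profiles position by position. The profiles are hypothetical and are not the ones in the original figure. Because every position is counted independently, this measure ignores correlations among profile components, which is the limitation the slide raises.

```python
def hamming_distance(p, q):
    """Count the positions at which two equal-length binary profiles differ."""
    if len(p) != len(q):
        raise ValueError("profiles must have the same length")
    return sum(a != b for a, b in zip(p, q))


# Hypothetical profiles: 1 = a homolog is present in that genome, 0 = absent.
x = [1, 0, 1, 1, 0, 0]
y = [1, 1, 1, 0, 0, 0]
z = [0, 0, 1, 1, 0, 1]

print(hamming_distance(x, y))
print(hamming_distance(x, z))
print(hamming_distance(y, z))
```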
Quasi consensus based comparison of HMMs [Figure: two multiple sequence alignments give models M1 and M2 with quasi-consensus sequences Consensus 1 and Consensus 2; each consensus is scored against the other model, S(c2|M1) and S(c1|M2), yielding alignments Aln12 and Aln21 of Seed 1 and Seed 2] • From MSA to profile HMMs using existing packages (SAM-T99 or HMMER) • Generation of quasi consensus sequence from the model • Alignment of consensus sequence of a model with the other model • Extraction of two alignments in each direction
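To give a feel for what a consensus sequence is, the sketch below takes a simplified route: it reads a majority-vote consensus directly off the columns of a small alignment and drops columns where the gap symbol wins. This is an assumption-laden simplification and not the quasi-consensus procedure of the slides, which derives the sequence from a trained profile HMM. The function name and the toy alignment are hypothetical.

```python
from collections import Counter


def column_consensus(msa, gap="-"):
    """Return the most common symbol per column, skipping gap-dominated columns."""
    if len({len(row) for row in msa}) != 1:
        raise ValueError("all aligned rows must have the same length")
    consensus = []
    for column in zip(*msa):
        symbol, _count = Counter(column).most_common(1)[0]
        if symbol != gap:
            consensus.append(symbol)
    return "".join(consensus)


# Hypothetical three-row alignment; no column has a tie for the top symbol.
msa = [
    "VGA-HAGEY",
    "VKA-TAAEH",
    "AGA-HDGEY",
]
print(column_consensus(msa))
```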
Sequence Models (HMMs and beyond) Motivations: What is responsible for the function? • Patterns/motifs • Secondary structure To capture long range correlations of bio sequences • Transporter proteins • RNA secondary structure Methods: generative versus discriminative • Linear dependent processes • Stochastic grammars • Model equivalence
TMMOD: An improved hidden Markov model for predicting transmembrane topology (to appear in IEEE ICTAI04)
Genomics study of enterobacterial BT agents (funded by the US Army via the Center for Biological Defense, USF) Goals: • Identification of genes and sequence tags as targets for novel diagnosis and therapy • BT agents: Yersinia pestis, Salmonella, Escherichia coli O157:H7 Methods: • Various bioinformatics tools and databases
Comparative Genomics Motivation: • Evolution of metabolic pathways • Gene functions • De novo (alternative pathways) • Genetic engineering • Drug discovery Methods: • Put data into a context: knowledge/data representation • Trees, graphs, etc. • Learning models/methods
Profiling: attribute-value pairs
      P1   P2   …   Pn
O1     1    0   …    1
O2     0    1   …    0
…
Om     1    0   …    1
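One possible way to build such a profile table is sketched below: each object is described by the set of attributes it possesses, and each row of the table records presence (1) or absence (0) against a fixed attribute order. The labels O1, O2, P1, P2, P3 and the function name are illustrative assumptions only.

```python
def build_profiles(presence, attributes):
    """Map each object to a 0/1 vector over a fixed, ordered attribute list."""
    return {
        obj: [1 if attr in present else 0 for attr in attributes]
        for obj, present in presence.items()
    }


# Hypothetical input: which attributes each object has.
attributes = ["P1", "P2", "P3"]
presence = {
    "O1": {"P1", "P3"},
    "O2": {"P2"},
}

profiles = build_profiles(presence, attributes)
for obj, row in profiles.items():
    print(obj, row)
```

Keeping the attribute order fixed is what makes rows comparable position by position, for example with a Hamming distance.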
What we found: • An informative way to compare genomes • The majority of pathways (or rather their enzyme components) evolve in congruence with species
What we do next: • Database and search engine • Off-line self-consistent iteration • Pathways in a network • Graph comparisons • Identify key components of networks • Small world topology • Cross-level interactions with regulatory networks