Bioinformatics Research: Developing Effective Algorithms for Genotype-Phenotype Correlation

Bioinformatics Research Overview
Li Liao
Develop new algorithms and (statistical) learning methods
> Capable of incorporating domain knowledge
> Effective, Expressive, Interpretable

Motivations
• Understanding correlations between genotype and phenotype
• Predicting genotype <=> phenotype
• Phenotypes:
  • Protein function
  • Drug/therapy response
  • Drug-drug interactions for expression
  • Drug mechanism
  • Interacting pathways of metabolism

Projects
• Homology detection, protein family classification (funded by a DuPont S&E award)
  • Support Vector Machines
  • Hidden Markov models
  • Graph theoretic methods
• Probabilistic modeling for BioSequence (funded by NIH)
  • HMMs, and beyond
  • Motifs finding
  • Secondary structure
• Comparative Genomics
  • Identify genome features for diagnostic and therapeutic purposes (funded by an Army grant)
  • Evolution of metabolic pathways
  • Tree and graph comparisons

Detect remote homologues
Attributes to be looked at:
• Sequence similarity, aggregate statistics (e.g., protein families), patterns/motifs, and further attributes (e.g., presence in the phylogenetic tree)
How can domain-specific knowledge be incorporated into the model so that a classifier becomes more accurate?
Results:
• Quasi-consensus based comparison of profile HMM for protein sequences (submitted to Bioinformatics)
• Using extended phylogenetic profiles and support vector machines for protein family classification (SNPD 04)
• Combining Pairwise Sequence Similarity and Support Vector Machines for Detecting Remote Protein Evolutionary and Structural Relationships (JCB 2003)

Support Vector Machines

Data: phylogenetic profiles
• How to account for correlations among profile components?
• Profile extension (Narra & Liao, SNPD 04)
[Figure: example binary profiles x, y, and z compared by tree-based distance and by Hamming distance]
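As a point of reference for the Hamming distance named on this slide, here is a minimal sketch in Python. The three profiles are hypothetical toy values (1 = a homolog is present in that genome), and the function name `hamming_distance` is an illustrative choice, not something taken from the cited work. The tree-based distance and the profile extension of Narra & Liao are not implemented here.

```python
def hamming_distance(p, q):
    """Count the positions at which two equal-length binary profiles differ."""
    if len(p) != len(q):
        raise ValueError("profiles must have the same length")
    return sum(a != b for a, b in zip(p, q))


# Hypothetical presence/absence profiles across five genomes (toy data).
profiles = {
    "x": [0, 1, 1, 1, 1],
    "y": [1, 1, 1, 1, 1],
    "z": [1, 1, 1, 1, 0],
}

# Distance for every unordered pair of profiles.
pairwise = {
    (a, b): hamming_distance(profiles[a], profiles[b])
    for a in profiles
    for b in profiles
    if a < b
}

for pair, distance in sorted(pairwise.items()):
    print(pair, distance)
```

Hamming distance treats every profile component as independent of the others, which is the limitation the question about correlations among components points at.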

Quasi consensus based comparison of HMMs
[Figure: two multiple sequence alignments and their profile HMMs M1 and M2; Consensus 1 and Consensus 2 generated from the models; each consensus scored against the other model, S(c2|M1) and S(c1|M2); the resulting alignments Aln12 and Aln21 of Seed 1 and Seed 2]
• From MSA to profile HMMs using existing packages (SAM-T99 or HMMER)
• Generation of quasi consensus sequence from the model
• Alignment of consensus sequence of a model with the other model
• Extraction of two alignments in each direction
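To make the consensus-generation step more concrete, the sketch below derives a consensus sequence directly from the columns of a multiple sequence alignment. This is a deliberate simplification and an assumption on my part: on the slide the quasi consensus is generated from the profile HMM built by SAM-T99 or HMMER, not by column voting. The function name `column_consensus`, the `max_gap_fraction` threshold, and the toy alignment are all hypothetical.

```python
from collections import Counter


def column_consensus(msa, gap="-", max_gap_fraction=0.5):
    """Majority-vote consensus over alignment columns.

    Columns whose gap fraction exceeds max_gap_fraction are skipped,
    loosely mimicking insert columns that a profile HMM would not
    model as match states.
    """
    if not msa:
        return ""
    width = len(msa[0])
    if any(len(row) != width for row in msa):
        raise ValueError("all aligned sequences must have the same length")

    consensus = []
    for column in zip(*msa):
        residues = [c for c in column if c != gap]
        gap_fraction = 1 - len(residues) / len(column)
        if not residues or gap_fraction > max_gap_fraction:
            continue  # mostly gaps: treat as an insert column
        consensus.append(Counter(residues).most_common(1)[0][0])
    return "".join(consensus)


# Hypothetical toy alignment (not the alignment shown on the slide).
toy_msa = ["VGA-H", "VGA-N", "V-ADH", "F---H"]
consensus = column_consensus(toy_msa)
print(consensus)
```

In the workflow listed above, a consensus obtained from one model would then be aligned against the other model; that step depends on the HMM package and is not sketched here.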

Sequence Models (HMMs and beyond)
Motivations: What is responsible for the function?
• Patterns/motifs
• Secondary structure
To capture long-range correlations of bio sequences
• Transporter proteins
• RNA secondary structure
Methods: generative versus discriminative
• Linear dependent processes
• Stochastic grammars
• Model equivalence

TMMOD: An improved hidden Markov model for predicting transmembrane topology (to appear in IEEE ICTAI04)
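For readers unfamiliar with how an HMM assigns a topology label to each residue, here is a minimal Viterbi decoding sketch. It is not TMMOD's architecture: the two states ("M" for membrane, "L" for loop), the two-letter alphabet ("h" for hydrophobic, "p" for polar), and all probabilities are hypothetical values chosen only for illustration.

```python
import math


def viterbi(observations, states, start_p, trans_p, emit_p):
    """Return the most probable state path, computed in log space."""
    if not observations:
        return []
    # best[t][s]: log-probability of the best path ending in state s at position t
    best = [{s: math.log(start_p[s]) + math.log(emit_p[s][observations[0]])
             for s in states}]
    back = [{}]
    for t in range(1, len(observations)):
        best.append({})
        back.append({})
        for s in states:
            prev, score = max(
                ((r, best[t - 1][r] + math.log(trans_p[r][s])) for r in states),
                key=lambda pair: pair[1],
            )
            best[t][s] = score + math.log(emit_p[s][observations[t]])
            back[t][s] = prev
    # Trace back from the best final state.
    path = [max(states, key=lambda s: best[-1][s])]
    for t in range(len(observations) - 1, 0, -1):
        path.append(back[t][path[-1]])
    return path[::-1]


# Hypothetical toy model: membrane segments favor hydrophobic residues.
states = ["M", "L"]
start_p = {"M": 0.5, "L": 0.5}
trans_p = {"M": {"M": 0.9, "L": 0.1}, "L": {"M": 0.1, "L": 0.9}}
emit_p = {"M": {"h": 0.8, "p": 0.2}, "L": {"h": 0.2, "p": 0.8}}

topology = "".join(viterbi("pppphhhhhhpppp", states, start_p, trans_p, emit_p))
print(topology)
```

A real topology predictor would use the 20-letter amino acid alphabet and many more states (for example, separate inside and outside loops), with parameters estimated from annotated proteins.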

Genomics study of enterobacterial BT agents (funded by the US Army via Center for Biological Defense, USF)
Goals:
• Identification of genes and sequence tags as targets for novel diagnosis and therapy
• BT agents: Yersinia pestis, Salmonella, Escherichia coli O157:H7
Methods:
• Various bioinformatics tools and databases

Comparative Genomics
Motivation:
• Evolution of metabolic pathways
• Gene functions
• De novo (alternative pathways)
• Genetic engineering
• Drug discovery
Methods:
• Put data into a context: knowledge/data representation
• Trees, graphs, etc.
• Learning models/methods

      P1   P2   ...  Pn
O1    1    0    ...  1
O2    0    1    ...  0
...
Om    1    0    ...  1

Profiling: attribute-value pairs
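One way to read the profiling matrix is as a set of attribute-value pairs per organism: each row is an organism, each column an attribute such as a pathway, and each cell records presence (1) or absence (0). The sketch below builds such rows from per-organism sets. The organism and attribute names, the data, and the helper `build_profiles` are hypothetical.

```python
def build_profiles(presence, attributes):
    """Turn per-organism attribute sets into binary profile rows.

    presence: dict mapping an organism name to the set of attributes it has.
    attributes: ordered list of all attributes (the matrix columns).
    """
    return {
        organism: [1 if attribute in present else 0 for attribute in attributes]
        for organism, present in presence.items()
    }


# Hypothetical toy data: three organisms and three attributes.
attributes = ["P1", "P2", "P3"]
presence = {
    "O1": {"P1", "P3"},
    "O2": {"P2"},
    "O3": {"P1", "P3"},
}

profiles = build_profiles(presence, attributes)
for organism, row in profiles.items():
    print(organism, row)
```

Keeping the column order in an explicit `attributes` list makes rows from different organisms directly comparable position by position.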

What we found:
• Informative way to compare genomes
• The majority of pathways (or rather their enzyme components) evolve in congruence with species

What we do next:
• Database and search engine
• Off-line self-consistent iteration
• Pathways in a network
• Graph comparisons
• Identify key components of networks
• Small world topology
• Cross-level interactions with regulatory networks
