Bioinformatics Research Overview: Developing Algorithms for Biological Problem Solving

Bioinformatics Research Overview
Li Liao
Develop new algorithms and (statistical) learning methods that help solve biological problems
> Capable of incorporating domain knowledge
> Effective, Expressive, Interpretable
Li Liao, SIG NewGrad, 09/29/2008

Motivations
• Understanding correlations between genotype and phenotype
• Predicting genotype <=> phenotype
• Some phenotype examples:
  • Protein function
  • Drug/therapy response
  • Drug-drug interactions for expression
  • Drug mechanism
  • Interacting pathways of metabolism

Bioinformatics in a … cell

Credit: Kellis & Indyk

Projects
• Genome sequencing and assembly (funded by NSF)
• Homology detection, protein family classification (funded by a DuPont S&E award)
  • Support Vector Machines
  • Hidden Markov models
  • Graph theoretic methods
• Probabilistic modeling for biosequences (funded by NIH)
  • HMMs, and beyond
  • Motif finding
  • Secondary structure
• Systems Bioinformatics
  • Prediction of Protein-Protein Interactions
  • Inference of Gene Regulatory Networks
  • Prediction of other regulatory elements
  • Pattern analysis for RNAi (funded by UDRF)
• Comparative Genomics
  • Identify genome features for diagnostic and therapeutic purposes (funded by an Army grant)

People
Current members:
• Dr. Wen-Zhong Wang (Postdoc Fellow)
• Roger Craig (PhD student)
• Alvaro Gonzalez (PhD student)
• Kevin McCormick (PhD student)
• Colin Kern (PhD student)
Past members:
• Robel Kahsay (Ph.D., currently at DuPont Central Research & Development)
• Kishore Narra (M.S., currently at VistaPrint, Inc.)
• Arpita Gandhi (M.S., currently at Colgate-Palmolive Company)
• Gaurav Jain (M.S., currently at Institute of Genomics, Univ. of Maryland)
• Shivakundan Singh Tej (M.S.)
• Tapan Patel (B.S., currently in MD/PhD program at U Penn)
• Laura Shankman (B.S., currently in PhD program at U Virginia)


Hybrid Hierarchical Assembly
• Three types of reads: Sanger (~1000 bp), 454 (~100 bp), and SBS (~30 bp).
• Each read type is assembled separately with the best-suited assemblers:
  • Phrap, TIGR, etc. for Sanger reads
  • Euler assembler and Newbler for 454 reads
  • Euler short, Shorty for SBS reads
• Hybrid and hierarchical:
  • Use longer reads as scaffolding to resolve repeat regions that are difficult for shorter reads
  • Use contigs from shorter reads (pyrosequencing) as pseudoreads, assembled together with Sanger reads, to bridge gaps (nonclonable regions and hard stops).
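
The pseudoread idea can be illustrated with a minimal sketch. It assumes that a contig built from short reads is cut into overlapping, Sanger-sized windows before being handed to a Sanger assembler; the slide does not say how the pseudoreads were actually produced, and the function name, `read_len`, and `overlap` are hypothetical.

```python
def contig_to_pseudoreads(contig, read_len=800, overlap=200):
    """Cut a contig into overlapping windows ("pseudoreads").

    Hypothetical helper: window length and overlap are illustrative
    defaults, not values taken from the slides.
    """
    if not 0 <= overlap < read_len:
        raise ValueError("overlap must satisfy 0 <= overlap < read_len")
    if len(contig) <= read_len:
        return [contig]  # short contigs are used whole
    step = read_len - overlap
    reads = []
    start = 0
    while start + read_len < len(contig):
        reads.append(contig[start:start + read_len])
        start += step
    # Anchor the last window at the contig end so no bases are dropped.
    reads.append(contig[-read_len:])
    return reads


# Toy usage with a tiny contig and tiny windows.
pseudoreads = contig_to_pseudoreads("ACGTACGTAC", read_len=4, overlap=2)
```

Overlapping windows give the downstream assembler enough shared sequence to rejoin neighbouring pseudoreads; whole contigs could be passed instead if the assembler accepts long inputs.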

Major Findings
• Hybrid hierarchical assembly proved to be an effective way to assemble short reads
• An incremental approach to selecting ABI reads is more effective than random selection at generating high-coverage contigs
• Staged assembly using Phrap is an effective alternative to the proprietary Newbler assembler.
Publications: Gonzalez & Liao, BMC Bioinformatics 2008, 9:102.

Blue lines are contigs generated from hybrid assembly

Detect remote homologues
Attributes:
• Sequence similarity, aggregate statistics (e.g., protein families), patterns/motifs, and further attributes (presence across the phylogenetic tree).
How can domain-specific knowledge be incorporated into the model so that a classifier becomes more accurate?
Results:
• Quasi-consensus based comparison of profile HMM for protein sequences (Kahsay et al., Bioinformatics 2005)
• Using extended phylogenetic profiles and support vector machines for protein family classification (Narra & Liao, SNPD04; Craig & Liao, ICMLA’05; Craig & Liao, SAC’06; Craig & Liao, Int’l J. Bioinfo & DM 2007)
• Combining Pairwise Sequence Similarity and Support Vector Machines for Detecting Remote Protein Evolutionary and Structural Relationships (JCB 2003)

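Non-linear mapping to a feature space
[Figure: the mapping Φ(·) sends input points x_i and x_j to Φ(x_i) and Φ(x_j) in the feature space]
$$L(\alpha) = \sum_i \alpha_i - \frac{1}{2} \sum_{i,j} \alpha_i \alpha_j \, y_i y_j \, \Phi(x_i) \cdot \Phi(x_j)$$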
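
The objective above only needs the inner products Φ(x_i)·Φ(x_j), so a kernel function can stand in for the explicit mapping. The following is a minimal sketch that evaluates the dual objective for given multipliers; the function names and the choice of a degree-2 polynomial kernel are illustrative assumptions, not something taken from the slides.

```python
import math


def dot(u, v):
    """Plain inner product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def poly2_kernel(u, v):
    """Degree-2 polynomial kernel: K(u, v) = (u . v)^2."""
    return dot(u, v) ** 2


def phi_poly2(x):
    """Explicit feature map for poly2_kernel on 2-D inputs."""
    x1, x2 = x
    return (x1 * x1, math.sqrt(2) * x1 * x2, x2 * x2)


def dual_objective(alpha, y, X, kernel):
    """L(alpha) = sum_i alpha_i - 1/2 sum_ij alpha_i alpha_j y_i y_j K(x_i, x_j)."""
    n = len(X)
    quad = sum(
        alpha[i] * alpha[j] * y[i] * y[j] * kernel(X[i], X[j])
        for i in range(n)
        for j in range(n)
    )
    return sum(alpha) - 0.5 * quad


# The kernel value equals the inner product in the mapped space.
u, v = (1.0, 2.0), (3.0, 1.0)
same = abs(poly2_kernel(u, v) - dot(phi_poly2(u), phi_poly2(v))) < 1e-9

# Dual objective for two toy points with a linear kernel.
value = dual_objective([1.0, 1.0], [1, -1], [(1.0, 0.0), (0.0, 1.0)], dot)
```

This only evaluates the objective; training an SVM additionally requires maximizing it over α subject to the usual constraints, which is left to a solver.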

Data: phylogenetic profiles
• How to account for correlations among profile components?
  • Profile extension (Narra & Liao, SNPD 04)
  • Transductive learning (Craig & Liao, ICMLA’05, SAC’06, IJBDM, 2007)
[Figure: example binary profiles x, y, and z compared under Hamming distance and tree-based distance; the two comparisons shown have the same Hamming distance (3) but tree-based distances of 0.1 and 0.5]
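
As a point of reference for the question above, here is a minimal sketch of the Hamming distance between two binary phylogenetic profiles. The profiles used are made-up examples, not the ones from the slide. Because the distance counts mismatches position by position, it is unchanged when the genome columns are reordered, which is one way to see that it ignores how the genomes are related on the tree.

```python
def hamming_distance(p, q):
    """Number of positions at which two equal-length profiles differ."""
    if len(p) != len(q):
        raise ValueError("profiles must have the same length")
    return sum(a != b for a, b in zip(p, q))


# Hypothetical profiles: 1 = homolog present in that genome, 0 = absent.
x = [0, 1, 1, 1, 1, 0]
y = [1, 1, 1, 1, 0, 1]

d_xy = hamming_distance(x, y)

# Reordering the genomes (columns) the same way in both profiles
# leaves the distance unchanged.
order = [5, 3, 1, 0, 2, 4]
d_shuffled = hamming_distance([x[i] for i in order], [y[i] for i in order])
```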

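[Figure: profile extension by post-order traversal of the phylogenetic tree. Values shown on the tree: 0.55, 0.34, 0.75, 0.67, 1, 0.33, 0.5. Sequence produced by the traversal: 1 0.33 0.67 0.34 0.5 0.75 0.55. Binary profile: 1 1 0 1 0 0 0 1 1]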
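
A post-order traversal of this kind can be sketched as follows. The sketch assumes that each internal node is scored as the average of its children's scores and that the internal scores are appended to the binary profile in the order they are visited; the tree encoding (nested lists with integer leaf indices) and the function name are hypothetical.

```python
def extend_profile(tree, profile):
    """Append internal-node scores, in post-order, to a binary profile.

    tree:    nested lists; a leaf is an int index into `profile`
    profile: list of 0/1 values, one per genome (leaf)
    Assumption: an internal node's score is the mean of its children's scores.
    """
    internal_scores = []

    def visit(node):
        if isinstance(node, int):           # leaf: look up its 0/1 value
            return float(profile[node])
        child_scores = [visit(child) for child in node]
        score = sum(child_scores) / len(child_scores)
        internal_scores.append(score)       # recorded after all children
        return score

    visit(tree)
    return list(profile) + internal_scores


# Toy tree over five genomes: ((0, 1), (2, (3, 4)))
toy_tree = [[0, 1], [2, [3, 4]]]
toy_profile = [1, 1, 0, 1, 0]
extended = extend_profile(toy_tree, toy_profile)
# extended == [1, 1, 0, 1, 0, 1.0, 0.5, 0.25, 0.625]
```

The added components summarize presence within each clade, so two profiles that differ only inside one closely related clade end up closer than two that differ across distant clades.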

Sequence Models (HMMs and beyond)
Motivations: What is responsible for the function?
• Patterns/motifs
• Secondary structure
To capture long-range correlations of biosequences
• Transporter proteins
• RNA secondary structure
Methods: generative versus discriminative
• Linear dependent processes
• Stochastic grammars
• Model equivalence

TMMOD: An improved hidden Markov model for predicting transmembrane topology (Kahsay, Gao & Liao, Bioinformatics 2005)
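
To illustrate how an HMM assigns a topology label to each residue, here is a minimal Viterbi decoding sketch on a toy two-state model. It is not the TMMOD architecture: the states (`O` for non-membrane, `M` for membrane), the two-letter alphabet (`H` hydrophobic, `P` polar), and all probabilities are made-up assumptions for illustration.

```python
import math


def viterbi(obs, states, start_p, trans_p, emit_p):
    """Most probable state path for a non-empty observation sequence.

    Works in log space; assumes every probability used is non-zero.
    """
    scores = [{s: math.log(start_p[s]) + math.log(emit_p[s][obs[0]])
               for s in states}]
    backpointers = []
    for symbol in obs[1:]:
        current, pointers = {}, {}
        for s in states:
            # Best predecessor state for reaching s at this position.
            prev = max(states,
                       key=lambda r: scores[-1][r] + math.log(trans_p[r][s]))
            current[s] = (scores[-1][prev] + math.log(trans_p[prev][s])
                          + math.log(emit_p[s][symbol]))
            pointers[s] = prev
        scores.append(current)
        backpointers.append(pointers)
    # Trace back from the best final state.
    path = [max(states, key=lambda s: scores[-1][s])]
    for pointers in reversed(backpointers):
        path.append(pointers[path[-1]])
    return path[::-1]


# Hypothetical toy parameters.
states = ["O", "M"]
start_p = {"O": 0.9, "M": 0.1}
trans_p = {"O": {"O": 0.9, "M": 0.1}, "M": {"O": 0.1, "M": 0.9}}
emit_p = {"O": {"H": 0.1, "P": 0.9}, "M": {"H": 0.9, "P": 0.1}}

labels = "".join(viterbi("PPPHHHHHPPP", states, start_p, trans_p, emit_p))
# labels == "OOOMMMMMOOO"
```

Real transmembrane predictors use many more states (for example separate loop, helix-cap, and helix-core submodels) and a 20-letter amino-acid alphabet, but the decoding step has this same shape.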

Inferring Regulatory Networks from Time Course Expression Data (Gandhi, Cogburn & Liao, 2008)
[Figure: stages shown are expression profile, clustering (K-means), binary heat map, and Boolean network algorithm]
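
The last two stages named on this slide can be illustrated with a minimal sketch: binarize each time course, then look for single-regulator Boolean rules that are consistent with every observed transition. This is a simplification for illustration, not the algorithm of the cited work; thresholding at the series mean, restricting rules to one regulator, the gene names, and the data are all assumptions. The clustering step is omitted.

```python
def binarize(series):
    """Map a time course to 0/1 by thresholding at its mean (assumption)."""
    mean = sum(series) / len(series)
    return [1 if value > mean else 0 for value in series]


def single_input_rules(binary):
    """Find rules target[t+1] = f(regulator[t]) that hold at every time step.

    binary: dict mapping gene name -> list of 0/1 states over time
    Returns dict: target -> list of (regulator, "activates" | "represses").
    """
    rules = {}
    for target, tgt in binary.items():
        found = []
        for regulator, src in binary.items():
            steps = range(len(src) - 1)
            if all(tgt[t + 1] == src[t] for t in steps):
                found.append((regulator, "activates"))
            elif all(tgt[t + 1] == 1 - src[t] for t in steps):
                found.append((regulator, "represses"))
        rules[target] = found
    return rules


# Hypothetical expression time courses for two genes.
expression = {
    "A": [0.1, 0.9, 0.8, 0.2, 0.1],
    "B": [0.2, 0.1, 0.9, 0.8, 0.2],
}
binary = {gene: binarize(series) for gene, series in expression.items()}
rules = single_input_rules(binary)
# rules == {"A": [("B", "represses")], "B": [("A", "activates")]}
```

With short time courses many rules can be consistent by chance, so in practice the candidate set would need to be constrained, for example by working on cluster representatives rather than individual genes.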
