The Bioinformatics Toolkit at the MPI for Developmental Biology

Workshop Systems Biology, Berlin, March 3, 2006
Johannes Söding
Department for Protein Evolution (Andrei Lupas)
Max-Planck-Institute for Developmental Biology

Our toolkit assists the department’s research in protein evolution and makes methods developed in our group accessible to a larger public
• Sequence similarity searches
• Multiple sequence alignment
• Sequence analysis (repeats, periodicities, subtyping)
• Secondary structure and transmembrane prediction
• Tertiary structure prediction and structure analysis
• Phylogeny and classification
• Utilities (reformatting, sequence retrieval, filtering)

Overview page for Sequence Search toolkit

PSI-BLAST has enhanced functionality over NCBI
• Select subsets out of >300 genomes
• Upload personal databases
• Change databases between search rounds
• Show colored multiple alignment (JalView)
• Submit results to other tools

Quick2D integrates results of various secondary structure prediction programs
Contributed by Christian Mayer, MPI-DevBio

REPPER detects periodic regions in proteins
Gruber M, Söding J, and Lupas AN. (2005) NAR 33, W239-243.

Several tools rely on a sensitive new method for remote homology detection
• HHrep: de novo repeat detection
• HHpred: structure and function prediction by detecting remote homologs in databases such as the PDB, SCOP, Pfam, Smart, InterPro, CDD at NCBI, …
• HHsenser: sequence search method that employs exhaustive intermediate profile search
Underlying method: pairwise comparison of profile hidden Markov models (HMMs)
What is a sequence profile? What is a profile HMM?

Sequence profiles are a condensed representation of multiple alignments
HBA_human  ... W G K V G A - - H A G E ...   (master sequence)
HBB_human  ... W G K V - - - - N V D E ...
MYG_phyca  ... W G K V E A - - D V A G ...
LGB2_luplu ... W K D F N A - - N I P K ...
GLB1_glydi ... W E E I A G A D N G A G ...
Each column of the profile p_j(a) contains the amino acid frequencies in the multiple sequence alignment (one profile column per residue of the master sequence):
A ... 0    0    0    0    0.25 0.75 0    0.2  0.4  0    ...
C ... 0    0    0    0    0    0    0    0    0    0    ...
D ... 0    0    0.2  0    0    0    0.2  0    0.2  0    ...
E ... 0    0.2  0.2  0    0.25 0    0    0    0    0.4  ...
F ... 0    0    0    0.2  0    0    0    0    0    0    ...
G ... 0    0.6  0    0    0.25 0.25 0    0.2  0.2  0.4  ...
H ... 0    0    0    0    0    0    0.2  0    0    0    ...
I ... 0    0    0    0.2  0    0    0    0.2  0    0    ...
K ... 0    0.2  0.6  0    0    0    0    0    0    0.2  ...
L ... 0    0    0    0    0    0    0    0    0    0    ...
M ... 0    0    0    0    0    0    0    0    0    0    ...
N ... 0    0    0    0    0.25 0    0.6  0    0    0    ...
P ... 0    0    0    0    0    0    0    0    0.2  0    ...
Q ... 0    0    0    0    0    0    0    0    0    0    ...
R ... 0    0    0    0    0    0    0    0    0    0    ...
S ... 0    0    0    0    0    0    0    0    0    0    ...
T ... 0    0    0    0    0    0    0    0    0    0    ...
V ... 0    0    0    0.6  0    0    0    0.4  0    0    ...
W ... 1.0  0    0    0    0    0    0    0    0    0    ...
Y ... 0    0    0    0    0    0    0    0    0    0    ...
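
The profile on this slide can be recomputed from the alignment block. The following is a minimal Python sketch, not the toolkit's code: it uses raw relative frequencies without sequence weighting or pseudocounts, and the function name `sequence_profile` is made up for illustration.

```python
from collections import Counter

# Alignment block from the slide; the first row (HBA_human) is the master sequence.
ALIGNMENT = [
    "WGKVGA--HAGE",  # HBA_human
    "WGKV----NVDE",  # HBB_human
    "WGKVEA--DVAG",  # MYG_phyca
    "WKDFNA--NIPK",  # LGB2_luplu
    "WEEIAGADNGAG",  # GLB1_glydi
]


def sequence_profile(alignment):
    """Return one {amino acid: frequency} dict per residue of the master sequence."""
    master = alignment[0]
    profile = []
    for j, master_char in enumerate(master):
        if master_char == "-":
            continue  # columns with a gap in the master get no profile column
        residues = [seq[j] for seq in alignment if seq[j] != "-"]
        counts = Counter(residues)
        # Gaps are ignored, so frequencies are relative to the residues present.
        profile.append({aa: n / len(residues) for aa, n in counts.items()})
    return profile


profile = sequence_profile(ALIGNMENT)
print(profile[1])  # second profile column: frequencies of G, K and E
```

Amino acids that do not occur in a column are simply absent from its dictionary, which corresponds to the zeros in the table.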

HMMs include position-specific gap penalties
State per column (Match or Delete: M/D; Insertions: I; Deletions: "-"):
               M/D M/D M/D  I   I  M/D M/D M/D M/D M/D
HBA_human  ...  V   G   A   .   .   H   A   G   E   Y  ...
HBB_human  ...  V   -   -   .   .   N   V   D   E   V  ...
MYG_phyca  ...  V   E   A   .   .   D   V   A   G   H  ...
LGB2_luplu ...  F   N   A   .   .   N   I   P   K   H  ...
GLB1_glydi ...  I   A   G   a   d   N   G   A   G   V  ...
Amino acid frequencies per match column:
A   ... 0    0.25 0.75 0    0.2  0.4  0    0    ...
C   ... 0    0    0    0    0    0    0    0    ...
D   ... 0    0    0    0.2  0    0.2  0    0    ...
E   ... 0    0.25 0    0    0    0    0.4  0    ...
F   ... 0.2  0    0    0    0    0    0    0    ...
G   ... 0    0.25 0.25 0    0.2  0.2  0.4  0    ...
H   ... 0    0    0    0.2  0    0    0    0.4  ...
I   ... 0.2  0    0    0    0.2  0    0    0    ...
K   ... 0    0    0    0    0    0    0.2  0    ...
L   ... 0    0    0    0    0    0    0    0    ...
M   ... 0    0    0    0    0    0    0    0    ...
N   ... 0    0.25 0    0.6  0    0    0    0    ...
P   ... 0    0    0    0    0    0.2  0    0    ...
…
W   ... 0    0    0    0    0    0    0    0    ...
Y   ... 0    0    0    0    0    0    0    0.2  ...
Probabilities for
M→I ... 0    0    0.25 0    0    0    0    0    ...   Insert Open
I→I ... 0    0    0.5  0    0    0    0    0    ...   Insert Extend
M→D ... 0.2  0    0    0    0    0    0    0    ...   Delete Open
D→D ... 0    1.0  0    0    0    0    0    0    ...   Delete Extend
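
The four transition rows on this slide follow from counting state transitions in the alignment. The sketch below is an illustration under simplifying assumptions (raw counts, no pseudocounts or sequence weighting, match columns defined by the master sequence); `transition_probs` is a hypothetical helper, not a toolkit function.

```python
from collections import Counter

# Alignment from the slide: "." marks insert columns, lowercase letters are inserted residues.
ALIGNMENT = [
    "VGA..HAGEY",  # HBA_human (master)
    "V--..NVDEV",  # HBB_human
    "VEA..DVAGH",  # MYG_phyca
    "FNA..NIPKH",  # LGB2_luplu
    "IAGadNGAGV",  # GLB1_glydi
]


def transition_probs(alignment):
    """Per match column, return {"MI": p, "II": p, "MD": p, ...} from raw counts."""
    master = alignment[0]
    is_match_col = [c not in ".-" for c in master]
    counts = [Counter() for _ in range(sum(is_match_col))]
    for seq in alignment:
        j = -1        # index of the most recent match column
        state = None  # current state: "M", "D" or "I"
        for char, is_match in zip(seq, is_match_col):
            if is_match:
                new_state = "D" if char == "-" else "M"
            elif char not in ".-":
                new_state = "I"
            else:
                continue  # no residue in this insert column
            if j >= 0:
                # Transitions are attributed to the last match column passed.
                counts[j][state + new_state] += 1
            if is_match:
                j += 1
            state = new_state
    probs = []
    for col in counts:
        totals = Counter()
        for transition, n in col.items():
            totals[transition[0]] += n  # normalise per source state
        probs.append({t: n / totals[t[0]] for t, n in col.items()})
    return probs


probs = transition_probs(ALIGNMENT)
print(probs[0]["MD"], probs[1]["DD"], probs[2]["MI"], probs[2]["II"])
```

For the slide's alignment this gives delete open 0.2 in column 1, delete extend 1.0 in column 2, and insert open 0.25 and insert extend 0.5 in column 3, matching the table.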

Profile HMMs can be represented as states connected by transitions
[Figure: HMM p drawn as a chain of match (M), insert (I) and delete (D) states connected by transitions]
               M/D M/D M/D  I   I  M/D M/D M/D M/D M/D
HBA_human  ...  V   G   A   .   .   H   A   G   E   Y  ...
HBB_human  ...  V   -   -   .   .   N   V   D   E   V  ...
MYG_phyca  ...  V   E   A   .   .   D   V   A   G   H  ...
LGB2_luplu ...  F   N   A   .   .   N   I   P   K   H  ...
GLB1_glydi ...  I   A   G   a   d   N   G   -   G   V  ...
Matrix: p_i(a) (amino acid frequencies) and p_i(XY) (transition probabilities M→I, I→I, M→D, D→D), with the same values as on the previous slide

Find path through two HMMs that maximizes co-emission probability
[Figure: HMM q and HMM p, each drawn as a chain of M, I and D states; the path pairs states of q with states of p (M–M, I–M, M–D, M–I) and co-emits the sequence x1 x2 x3 x4 x5 x6]
Include null model: maximize the “log-sum-of-odds score”
Söding, J. (2005) Bioinformatics 21, 951-960.
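
For a pair of aligned match columns, the log-sum-of-odds score of the cited paper is log Σ_a q_i(a) p_j(a) / f(a), where f(a) is the background (null model) frequency of amino acid a. The Python sketch below evaluates only this column term. It assumes a uniform background of 1/20 and log base 2, both chosen here for illustration; the full method also adds transition terms and finds the best path by dynamic programming, which is not shown.

```python
import math

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"
# Assumption for this sketch: uniform background frequencies. Real null models
# use empirical amino acid frequencies.
BACKGROUND = {aa: 1 / 20 for aa in AMINO_ACIDS}


def column_score(q_col, p_col, background=BACKGROUND):
    """Log-sum-of-odds score of two profile columns given as {amino acid: frequency}."""
    odds = sum(
        q_col.get(aa, 0.0) * p_col.get(aa, 0.0) / f
        for aa, f in background.items()
    )
    # Columns that share no amino acid cannot co-emit anything.
    return math.log2(odds) if odds > 0 else float("-inf")


conserved_w = {"W": 1.0}
mixed = {"G": 0.6, "K": 0.2, "E": 0.2}
print(column_score(conserved_w, conserved_w))  # two fully conserved W columns
print(column_score(mixed, BACKGROUND))         # one column looks like background
```

Two columns that both resemble the background score about zero, so only shared, non-random conservation contributes to the total.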

HHrep detects repeats by HMM-HMM comparison of the sequence with itself
[Figure: self-comparison dot plot; both axes divided into repeat 1, repeat 2, repeat 3, repeat 4]
The dotplot with suboptimal alignments reveals internal symmetries

Outer membrane β-barrels might have evolved by duplication of a single β-hairpin
[Figure: OmpA]
… but is there an internal symmetry in the sequences?

HHrep indeed finds a fourfold sequence symmetry in OMPs
[Figure: HHrep dot plot of OmpA against itself, axis ticks at 50, 100, 150; blue: significant alignments]

TIM barrels possess approximate structural symmetry … … but up to now it has not been possible to detect this repeat pattern on the sequence level

HHrep detects structural repeats in TIMs
[Figures: 1fq0a]

Did TIM barrels evolve by duplication of a quarter barrel peptide?
[Figure panels: HisF; KDPG aldolase; fourfold symmetry; eightfold symmetry; same, but lower score threshold; after consistency transformation; profile-profile dot plot]

HMM-HMM comparison improves upon profile-profile comparison
All-against-all benchmark on SCOP (20% seq. id.)
[Figure: benchmark curves, marked at a 10% rate of false positives, for HMM-HMM+predSS, HMM-HMM+SS, HMM-HMM+corr, HMM-HMM, three profile-profile methods, HMM-seq, profile-seq, seq-seq]

The HHpred input page
1. Paste ScbA sequence
2. Select database
3. Submit job
All input parameters are linked to explanations on help pages
ScbA from Streptomyces is involved in regulating the onset of antibiotics production, but its function is unknown

Search results: alignment view
• Create 3D model
• Graphical representation of best database hits along query sequence
• Statistical significance
• View template structure
• View template alignment
• Summary hit list for best database matches
• View alignments as histograms
• Alignments with database sequences (templates): predicted secondary structure (query), query sequence (ScbA), match quality, template sequence (from database), actual secondary structure (template), predicted secondary structure (template)
• Interesting region of high similarity
Six best hits belong to a superfamily of enzymes from the fatty acid synthesis pathway!

Histogram view
[Figure: alignment histograms for FabZ, FabA, FabZ]
Highly conserved arginine: catalytic?
Highly conserved residues E and Q are catalytic residues in FabZ / FabA!

Homology between histones and C-terminal subdomain in AAA+ ATPases
[Figure: RuvB (AAA+), TAFII62, TAFII42; kink]
Work in progress, V. Alva Kullanja and M. Ammelburg et al.

The prediction of transmembrane β-barrel proteins is a challenging problem
• TM β-barrel proteins occur in outer membranes of bacteria, mitochondria and plastids
• TM β-barrel proteins are normally amphiphilic → more difficult to identify than α-helical TMPs
• Only a handful of known structures exist
• No structure of OmpW has yet been released → use OmpW as test case
[Figure: OmpA, MspA, porin]

Most dedicated TM β-barrel predictors fail to predict Erwinia carotovora OmpW correctly
Server                             Result
TBBpred (Chandigarh, India)        “Protein is likely to be globular”
TMBETA-NET (AIST, Tokyo)           Confidence? Nine strands predicted with unrealistic positions
PROFtmb (Columbia University)      Low confidence (Z-score 5.8 ≈ 35% accuracy); six strands predicted with realistic positions
Pred-TMBB (University of Athens)   Score below threshold; nine strands predicted, 4 probably misplaced, 5 correct

HHpred model of Erwinia carotovora OmpW (default parameters, no refinements)
• Correct topology predicted, with 8 strands at realistic positions
• High confidence for OMP prediction (Probability = 100%)
• Only needs refinement for precise placement of loop inserts

HHsenser is a novel method to search for remote homologs in sequence databases
• Recursive search strategy employing PSI-BLAST to build new alignments that may be homologous to query
• HMM-HMM comparison for validation of homology between newly built alignment and alignment of validated sequences
• Very sensitive!
[Figure: sequences scattered around the query, with the regions E < 10^-3 and E < 10 marked; shaded: accepted sequences]
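
The recursive strategy in the bullets above can be pictured as a breadth-first expansion in which every accepted sequence seeds a further search. The Python sketch below shows only this control flow and is not HHsenser's implementation: `search` stands in for a PSI-BLAST run, `validate` for the HMM-HMM comparison against the validated set, and both, like `max_rounds`, are hypothetical placeholders.

```python
from collections import deque


def transitive_search(query, search, validate, max_rounds=3):
    """Collect sequences reachable from the query through validated intermediate hits.

    search(seed)            -> iterable of candidate hits (placeholder for PSI-BLAST)
    validate(hit, accepted) -> bool (placeholder for HMM-HMM validation)
    """
    accepted = {query}
    frontier = deque([(query, 0)])
    while frontier:
        seed, depth = frontier.popleft()
        if depth >= max_rounds:
            continue  # do not expand seeds beyond the round limit
        for hit in search(seed):
            if hit in accepted:
                continue
            if validate(hit, accepted):
                accepted.add(hit)
                frontier.append((hit, depth + 1))  # accepted hits seed later searches
    return accepted


# Toy example: a hand-made neighbourhood graph instead of a sequence database.
neighbours = {"query": ["a"], "a": ["b", "x"], "b": ["c"], "x": ["y"]}
found = transitive_search(
    "query",
    search=lambda seed: neighbours.get(seed, []),
    validate=lambda hit, accepted: hit != "x",  # pretend "x" fails validation
)
print(sorted(found))
```

Because the rejected hit "x" is never used as a seed, nothing that is reachable only through it enters the accepted set; this is the role validation plays in keeping a transitive search from drifting.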

HHsenser defines a diverse superfamily of transcription factors around AbrB/SpoVT
[Figure: structures of MazE (1mvf), AbrB (new, 1yfb), AbrB (1ekt) and MraZ (1n0g), with N- and C-termini labeled]
Sequences obtained with HHsenser, clustered with CLANS: YjiW, Vir, Archaeal PhoU, PemI / MazE (1mvf), VagC, cyano TF, SpoVT, PrlF, AbrB (1yfb), MraZ-N (1n0g), MraZ-C (1n0g)
M. Coles et al. (2005) Structure 13, 919-928.

Retroactive from Drosophila was identified in a screen for chitin-associated defects
[Figure: wild type, rtv mutant]
• The retroactive fly larvae are bloated and show a characteristic disarrangement of chitin fibres in the cuticle
• Except for the orthologous genes from D. pseudoobscura and Anopheles, no homologs are found in the databases
• Understanding chitin-related developmental and metabolic pathways is important for pest control

Based on remote homology with CD59 and snake toxins, HHpred could generate a 3D model for Rtv
• Rtv is membrane-bound and adopts a three-finger neurotoxin fold
• The long fingers carry two exposed aromatic residues each
• These exposed residues are likely to bind chitin at the surface of epidermal cells
B. Moussian, J. Söding, H. Schwarz, and C. Nüsslein-Volhard, Dev Dyn 2005

HHsenser finds homology between P5 protein of phage phi-6 and lytic transglycosylases (default parameters)

HHpred confidently predicts Gas1 (target 5 from AFP-SIG) to be a GDNF receptor (default parameters, database: CDD)
In collaboration with Mart Saarma, Helsinki

Outlook
• Toolkit as open-source package
• Continuous integration of the best available tools
• Several new tools planned or in development
• Cluster known folds by sequence similarity (Galaxy of folds)
• Functional subtyping
• PDB remote homology alert
• β-barrel membrane protein prediction
• Repeat detection (database-assisted)
• Expert system

The Toolkit Team
Johannes Söding, Andrei Lupas, Andreas Biegert, Michael Remmert, Christian Mayer
Many thanks to
• Tancred Frickey, Markus Gruber, Alex Diemand, and Pavel Szczesny for contributing tools
• Alexander Diemand for systems admin and support
• Members of our group for critical feedback
http://toolkit.tuebingen.mpg.de

Structure is more conserved in evolution than function
Conservation of structure
Sequence identity   Main-chain RMSD in conserved core   Fraction of aas in conserved core
60%                 0.85 Å                              90%
50%                 1.0 Å                               80%
40%                 1.2 Å                               70%
30%                 1.5 Å                               60%
20%                 1.8 Å                               50%
Structure prediction based on homology to a template with known structure can yield useful 3D models even at sequence identities below 20% (twilight zone)

Sequence identity is a good indicator of functional similarity …
Conservation of enzyme function (EC code) in proteins
Sequence identity   Conservation of substrate specificity (all four EC digits)   Conservation of reaction mechanism (first three EC digits)
50% - 60%           75%                                                          85%
40% - 50%           60%                                                          70%
30% - 40%           35%                                                          50%
20% - 30%           15%                                                          25%
… but function evolves quickly: below 50% direct functional inference gets problematic
→ Analysis of conserved functional residues, comparative sequence analysis, structure prediction, …

Global versus local alignment
[Figure: global alignment, query and db match aligned over their full lengths; local alignment, query and db match aligned over a sub-region]
BLAST and PSI-BLAST use a local alignment method
HHpred can construct both local and global alignments
• Probabilities / E-values more reliable for local alignment
• Global alignment mode useful for making 3D models and for determination of structural domain boundaries
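
The difference between the two modes shows up even with plain sequence-sequence dynamic programming. The Python sketch below is a textbook Needleman-Wunsch / Smith-Waterman score calculation with linear gap costs, not the HMM-HMM algorithm used by HHpred; the scoring values (match 2, mismatch -1, gap -2) are arbitrary choices for illustration.

```python
def alignment_score(a, b, local, match=2, mismatch=-1, gap=-2):
    """Best alignment score of a and b: local (Smith-Waterman) or global (Needleman-Wunsch)."""
    # prev holds the previous row of the dynamic programming matrix.
    prev = [0 if local else j * gap for j in range(len(b) + 1)]
    best = 0
    for i in range(1, len(a) + 1):
        cur = [0 if local else i * gap]
        for j in range(1, len(b) + 1):
            pair = match if a[i - 1] == b[j - 1] else mismatch
            value = max(prev[j - 1] + pair, prev[j] + gap, cur[j - 1] + gap)
            if local:
                value = max(value, 0)  # a local alignment may restart anywhere
            cur.append(value)
        best = max(best, max(cur))
        prev = cur
    # Local: best cell anywhere; global: bottom-right cell (both sequences end to end).
    return best if local else prev[-1]


query = "WGKV"
db_match = "AAAWGKVAAA"
print(alignment_score(query, db_match, local=True))   # only the shared region is scored
print(alignment_score(query, db_match, local=False))  # unaligned flanks are charged as gaps
```

With these values the local score is 8 (four matches) and the global score is -4 (four matches minus six gap positions), which illustrates why a short conserved region inside a longer database sequence is easier to detect in local mode.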
