CDD – a conserved domain database Aron Marchler-Bauer NCBI, National Library of Medicine, NIH

CDD – a conserved domain database Aron Marchler-Bauer NCBI, National Library of Medicine, NIH DIMACS Workshop on Protein Domains: Identification, Classification and Evolution February 27-28, 2003

CDD: a collection of domain multiple alignments linkedto protein 3D structure • imported alignment models mirrored ‘as-is’, sources are Pfam, Smart, COGs (close to 10,000) • curated alignment models (about 300) • part of NCBI’s Entrez query/retrieval system • RPS-Blast to search PSSMs derived from alignment models

Entrez with CDD … Term Frequency Statistics MEDLINE Abstracts BLAST Sequence Similarity Protein Sequences Domain Architecture Similarity 3D Structures CD-protein Links VAST Structure Similarity Conserved Domains Protein Sequences Related Conserved Domains

Conserved Domains as part of Entrez to … .. annotate three-dimensional structures

Conserved Domains as part of Entrez to … .. annotate protein sequences

Conserved Domains as part of Entrez to … .. neighbor proteins by domain architecture Currently (CDD v1.60): ~5 Mio protein-CD links

RPS-Blast (Reverse Position-Specific Blast) (Psi)-Blast RPS-Blast Search query (and its PSSM) against Search query protein sequence sequence database against a database of PSSMs Lookup table holds possible word Lookup table holds possible wordmatches to query, database sequences matches to database PSSMs, queryare scanned for single or multiple sequences are scanned for single orword matches, which are then multiple word matches, which areextended to identify statistically then extended to identify statisticallysignificant alignments. significant alignments. How does it compare?

The effect of the search heuristics can be measured directly against IMPALA, a similar program using the rigorous Smith-Waterman algorithm. Test set: Smart v3.3, 569 Domain Families / Alignments / PSSMs 23736 protein sequences used in alignments 14100 protein sequences from the initial Drosophila genome set.

The effect of the search heuristics and the differences in alignment model encoding can be measured against HMMer Test set: Smart v3.3, 569 Domain Families / Alignments / PSSMs 23736 protein sequences used in alignments 14100 protein sequences from the initial Drosophila genome set.

RPS-Blast vs. IMPALA, Speed vs. Sensitivity

Self-recognition: Fraction of sequence fragments used to build up the alignment model, which yield significant scores when compared with the search model. • Information content: sum Sp•log(p/q) across aligned columns • The average alignment information content for 568 models used in the test is 240 bits. • for 26 families – about 5% - self-recognition works better with IMPALA (detectable heuristics effects). • the average alignment information content for these 26 models is 100 bits. • for 542 models – about 95% - we did not detect heuristics effects in a self-recognition test.

In 65 families (~11%) more than 5% difference in self recognition between HMMer and RPS-Blast • Their mean information content is 65 bits • In 503 families (89%) less than 5% difference in self recognition.

Conclusions: • differences, maybe not too surprising • affecting a fairly small subset of the models at the lower end of the ‘informativeness’ spectrum • can optimize PSSM calculation, but might see diminishing returns • it may be more effective to deal with scope of models • Need to do something about the model collection • … curation of alignment models

Recording conserved features in CDD …

Conserved Features in CDs: • catalytic, binding, interaction- and regulatory sites • explain observed patterns of sequence conservation • annotate if applicable to all aligned members • annotate if evidence is available (3D structure, citation)

Collection has become redundant: Search results for 2SRC (Tyrosine Kinase) Right now: about 9400 CD-CD links in Entrez

Collection has become redundant: Search results for 1G291 (Malk) Many ATP-ase domains are sequence-similar to each other, and possibly related by descent from a common ancestor How to explain thisredundancy?

Curation: • literature check • examination of the conserved domain extent • examination of the multiple alignment, identification of a core substructure, establishment of a block-based alignment in agreement with 3D-structure data • Feature annotation and recording of evidence • Investigation of ‘related’ domains and their apparent relationships, resolving and recording the family hierarchy • Update of CD alignment models with new sequences and 3D-structure data

Curation needs to deal with: • noise from sequence data (gene models, annotation) • noise from alignments / alignment methods

Block alignment model and family hierarchies: Parent alignment • Children: • Membership consistency • Alignment consistency

Rizzi and Schindelin, Curr. Opin. Struct. Biol. 2002, 12:709-720

.. sequences used in the alignment hit a variety of models in CD-Search:

… examine domain architectures as recorded in CDART:

… validate sequences, validate alignment block structure, and examine sequence tree:

Pfam PF0994 MoCF_biosynth MoeA_N MoeA_C MoaB COGs CinA MoeA cd00758 CDD cd00758_a (MoaB) cd00758_c (CinA) cd00758_b (MoeA)

Concept borrowed from COGs – pattern of phylogenetic distribution as evidence for functional divergence after gene duplication events

Principles for establishing CD-Hierarchies: • Economy – too many families slow down search system • Search performance – flat alignment models must be split • Domain age – we’re primarily interested in sets of ancient conserved domains • Domain architectures • Subgroup-specific features 3.5 bio 2.6 bio 1.7 bio Plants Animals Archaea Alpha-proteobact. Gram+

Future directions: ability to describe complex hierarchies, which will allow modeling of fusion events ABC DEFG ABC_1 ABC_2 DEFG_1 DEFG_2 ABC_2 DEFG_2

Credits: John Anderson Natalie Fedorova John Jackson Aviva Jacobs Cynthia Liebert Gabriele Marchler Raja Mazumder B. Sridhar Rao Carol DeWeese-Scott James Song Sona Vasudevan Roxanne Yamashita Jodie Yin PFAM SMART COGs BLAST team Entrez team Taxonomy team NCBI Help-Desk Steve Bryant Lewis Geer Siqian He David Hurwitz Christopher Lanczycki Charlie Liu Tom Madej Anna Panchenko Ben Shoemaker Vahan Simonyan Paul Thiessen Yanli Wang

CDD – a conserved domain database Aron Marchler-Bauer NCBI, National Library of Medicine, NIH