Pattern databases Gopalan Vivek
Pattern databases - topics • Definition • Applications • Classifications • Common Databases • Conclusions
Pattern databases • Definition • Applications • Classifications • Common Databases • Conclusions
Pattern databases – definition • Secondary databases derived from conserved patterns obtained from multiple sequence alignments of sequences in primary databases such as GenBank, EMBL, DDBJ, SWISS-PROT/TrEMBL, PIR, etc.
Primary databases (SWISS-PROT: protein; GenBank: DNA), millions of sequences → Pattern extraction by multiple sequence alignment → Pattern databases, thousands of patterns
Pattern databases • Definition • Applications • Classifications • Common Databases • Conclusions
Pattern Databases - Applications • Function prediction for protein/nucleotide sequences, even when sequence similarity is low (<25%). • Classification of protein sequences into families. • Faster searches: a pattern database takes less time to search than a primary database, since patterns are a compact representation of the features of many sequences.
Pattern databases • Definition • Applications • Classifications • Common Databases • Conclusions
Multiple Sequence Alignment (MSA) • Family-based databases: consider the full MSA • Motif-based databases: consider local regions (motifs) in the MSA
Pattern Databases – Protein • Motif based: PROSITE, PRINTS, BLOCKS • Family based: ProDom, PIR-ALN, ProtoMap, DOMO, ProClass, Pfam, SMART, TIGRFAMs, SBASE, SYSTERS
InterPro - Integrated resources of protein families and sites • PROSITE • PRINTS • BLOCKS • Pfam • ProDom
Pattern databases • Definition • Applications • Classifications • Common Databases • PROSITE, PRINTS, BLOCKS & SMART (motif based) • MetaFam, InterPro (Integrated databases) • Conclusions
Databases – General Tips • Source • Input formats & parameters • Output formats • Quality of the data • Other details – updates, coverage, speed, download, reference, methods etc.
Focus • To search pattern databases for the “Alkaline phosphatase” enzyme using their text or keyword search options. • To analyze the quality of the results from each of these databases: sensitivity and specificity. • Sequence & pattern searches: in the afternoon’s practical.
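Sensitivity and specificity can be computed from the counts of true and false hits that a search returns. The sketch below uses the standard definitions; the counts are hypothetical and are not results from any of the databases discussed here.

```python
def sensitivity(tp: int, fn: int) -> float:
    """Fraction of true family members that the search reports."""
    return tp / (tp + fn)


def specificity(tn: int, fp: int) -> float:
    """Fraction of non-members that the search correctly leaves out."""
    return tn / (tn + fp)


# Hypothetical counts for one keyword search (not real database results).
tp, fn, fp, tn = 45, 5, 2, 948

print(f"sensitivity = {sensitivity(tp, fn):.2f}")
print(f"specificity = {specificity(tn, fp):.3f}")
```

A signature that misses few family members has high sensitivity; one that rarely hits unrelated sequences has high specificity.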
PROSITE http://www.expasy.org/prosite/ • Consists of biologically significant protein sites, patterns and profiles that help to reliably identify which known protein family (if any) a new sequence belongs to. • Based on SWISS-PROT/TrEMBL
PROSITE home page (http://www.expasy.org/prosite/) • ID and text search • Text search • Sequence scanner
Result: PROSITE Documentation page • PROSITE ID • Details about the pattern/profile • PROSITE Pattern: [IV]-x-D-S-[GAS]-[GASC]-[GAST]-[GA]-T (S is the active site residue)
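To see how such a pattern is read, it can be translated into a regular expression and matched against a sequence. This is a minimal sketch, not the PROSITE scanner itself: the helper `prosite_to_regex` is a hypothetical name, it covers only a subset of the pattern syntax, and the test sequence is made up.

```python
import re


def prosite_to_regex(pattern: str) -> str:
    """Translate a simple PROSITE pattern into a Python regular expression.

    Handles x (any residue), single residues, [..] (allowed residues),
    {..} (forbidden residues) and repeat counts such as x(2) or x(2,4).
    Anchors (< and >) are not handled in this sketch.
    """
    element_re = r"(x|[A-Z]|\[[A-Z]+\]|\{[A-Z]+\})(?:\((\d+)(?:,(\d+))?\))?"
    parts = []
    for element in pattern.strip().rstrip(".").split("-"):
        m = re.fullmatch(element_re, element)
        if m is None:
            raise ValueError(f"unsupported pattern element: {element!r}")
        core, low, high = m.groups()
        if core == "x":
            core = "."                      # any residue
        elif core.startswith("{"):
            core = "[^" + core[1:-1] + "]"  # any residue except these
        if low is not None:
            core += "{" + low + ("," + high if high else "") + "}"
        parts.append(core)
    return "".join(parts)


# Pattern from the slide (alkaline phosphatase active site).
ALP_PATTERN = "[IV]-x-D-S-[GAS]-[GASC]-[GAST]-[GA]-T"
regex = re.compile(prosite_to_regex(ALP_PATTERN))

# Made-up test sequence containing one match; not a real protein.
toy = "MKLVTDSAGSATWWW"
match = regex.search(toy)
print(regex.pattern)
print(match.group(0), match.start())
```

Each bracketed group lists the residues allowed at that position, and x accepts any residue.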
Detailed View - page 1 PROSITE Pattern Numerical Results
Detailed View - page 2 True Positives False Positives View entry in raw text format (no links)
ID: Identification • AC: Accession number • DT: Date • DE: Short description • PA: Pattern • MA: Matrix/profile • RU: Rule • NR: Numerical results • CC: Comments • DR: Cross-references to SWISS-PROT • 3D: Cross-references to PDB • DO: Pointer to the documentation file • //: Termination line
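The two-character line codes make the raw text format easy to read programmatically. The following is a minimal sketch that groups the lines of one entry by code; the function name `parse_entry` and the example entry are made up for illustration and do not reproduce a real PROSITE record.

```python
from collections import defaultdict

# Line codes as listed on the slide.
LINE_CODES = {
    "ID": "Identification",
    "AC": "Accession number",
    "DT": "Date",
    "DE": "Short description",
    "PA": "Pattern",
    "MA": "Matrix/profile",
    "RU": "Rule",
    "NR": "Numerical results",
    "CC": "Comments",
    "DR": "Cross-references to SWISS-PROT",
    "3D": "Cross-references to PDB",
    "DO": "Pointer to the documentation file",
}


def parse_entry(text: str) -> dict[str, list[str]]:
    """Group the lines of a single entry by line code, stopping at '//'."""
    fields = defaultdict(list)
    for line in text.splitlines():
        if line.startswith("//"):  # termination line
            break
        code, _, value = line.partition(" ")
        if code in LINE_CODES:
            fields[code].append(value.strip())
    return dict(fields)


# Made-up entry for illustration; identifiers and values are not real data.
EXAMPLE = """\
ID   EXAMPLE_SITE; PATTERN.
AC   PS99999;
DE   Hypothetical example signature.
PA   C-Y-x(2)-[DG]-G-x-[ST].
CC   First made-up comment line.
CC   Second made-up comment line.
//
"""

entry = parse_entry(EXAMPLE)
for code, values in entry.items():
    print(f"{LINE_CODES[code]}: {values}")
```

Codes that may repeat within an entry, such as CC, are collected into a list.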
Highly degenerate protein structural and functional domains • immunoglobulin domains, SH2 and SH3 domains. • Consensus sequences of repetitive DNA elements • SINEs, LINEs • Basic gene expression signals • promoter elements, RNA processing signals, translational initiation sites. • DNA-binding protein motifs. • Protein and nucleic acid compositional domains • glutamine-rich activation domains, CpG islands.
PROSITE - features • Completeness • High specificity • Documentation • Periodic reviewing • Parallel update with SWISS-PROT (primary database)
PROSITE pattern construction • Multiple sequence alignment (motif: cydeggis, cyedggis, cyeeggit, cyhgdggs, cyrgdgnt) → Find 4-5 functionally conserved residues → CORE PATTERN: C-Y-x2-[DG]-G-x-[ST] → Scan SWISS-PROT → More FALSE POSITIVES? YES: increase the sequence length of the pattern and scan again; NO: add to PROSITE DB
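The step from aligned motif to core pattern can be illustrated with a simple column-by-column rule. This is only a sketch under an assumed heuristic (keep a column when it contains at most two distinct residues, otherwise write x); PROSITE patterns are curated by hand around functionally important residues, and the function name `core_pattern` and the `max_residues` threshold are hypothetical.

```python
# Aligned motif segments from the slide.
MOTIF = ["cydeggis", "cyedggis", "cyeeggit", "cyhgdggs", "cyrgdgnt"]


def core_pattern(aligned: list[str], max_residues: int = 2) -> str:
    """Build a PROSITE-style pattern from equal-length, ungapped segments.

    A column with one residue is written as that residue, a column with up
    to max_residues distinct residues as a bracketed set, and any other
    column as x. Runs of x are collapsed to x2, x3, ... as on the slide.
    """
    if len({len(s) for s in aligned}) != 1:
        raise ValueError("segments must all have the same length")

    elements = []
    for column in zip(*aligned):
        residues = sorted({r.upper() for r in column})
        if len(residues) == 1:
            elements.append(residues[0])
        elif len(residues) <= max_residues:
            elements.append("[" + "".join(residues) + "]")
        else:
            elements.append("x")

    # Collapse consecutive wildcard columns into one counted element.
    collapsed = []
    run = 0
    for element in elements + [None]:  # None flushes a trailing run of x
        if element == "x":
            run += 1
            continue
        if run:
            collapsed.append("x" if run == 1 else f"x{run}")
            run = 0
        if element is not None:
            collapsed.append(element)
    return "-".join(collapsed)


print(core_pattern(MOTIF))  # C-Y-x2-[DG]-G-x-[ST]
```

With the five segments above, this rule happens to reproduce the core pattern shown on the slide. Lengthening the pattern to cut false positives would correspond to including more alignment columns on either side.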
PRINTS http://bioinf.man.ac.uk/dbbrowser/PRINTS/ • Protein fingerprint database • Fingerprint: a set of motifs that represent the most conserved regions of a multiple sequence alignment. • Better diagnostic reliability than single-motif methods • Source: SWISS-PROT/TrEMBL
PRINTS fingerprint construction • Multiple sequence alignment (example motif: cydeggis, cyedggis, cyeeggit, cyhgdggs) → Identification of ALL the conserved regions (the motifs that make up the fingerprint) → Creation of frequency matrices → Iterative scanning of the protein databases (SWISS-PROT/TrEMBL) with the frequency matrices until convergence → PRINTS DB
PRINTS home page (http://bioinf.man.ac.uk/dbbrowser/PRINTS/) • Search by database ID, number of motifs and text • Motif scanner (for searching a sequence or pattern against the PRINTS database)
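A frequency matrix records, for each column of a motif, how often each residue occurs there. The sketch below builds one from the example motif and uses it to find the best-scoring window in a sequence. Scoring a window by summing the raw residue counts is a simplifying assumption, the function names are hypothetical, and the iterative rescanning step (adding new hits to the motif and rebuilding the matrix until no further sequences are found) is not shown.

```python
from collections import Counter

# Example motif segments from the slide.
MOTIF = ["cydeggis", "cyedggis", "cyeeggit", "cyhgdggs"]


def frequency_matrix(aligned: list[str]) -> list[Counter]:
    """Return one Counter per alignment column, mapping residue -> count."""
    return [Counter(r.upper() for r in column) for column in zip(*aligned)]


def score(matrix: list[Counter], segment: str) -> int:
    """Sum, over columns, how often the segment's residue occurs there."""
    return sum(col[res] for col, res in zip(matrix, segment.upper()))


def best_window(matrix: list[Counter], sequence: str) -> tuple[int, int]:
    """Slide the matrix along the sequence; return (best score, start)."""
    width = len(matrix)
    if len(sequence) < width:
        raise ValueError("sequence is shorter than the motif")
    return max(
        (score(matrix, sequence[i:i + width]), i)
        for i in range(len(sequence) - width + 1)
    )


matrix = frequency_matrix(MOTIF)

# Made-up sequence with a motif-like region starting at index 3.
toy = "aaacyedggitaaa"
print(best_window(matrix, toy))
```

A fingerprint would hold one such matrix per motif, and a sequence is diagnosed by how many of the motifs it matches and in what order.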
Page 1 for ‘alkaline phosphatase’ entry in PRINTS Documentation,Links & references
Page 2 Fingerprint details Sequence Summary
Page 3 Motif no. 1 Motif no. 2 “Raw” motif SWISSPROT -IDs Start and Interval between motifs in the fingerprint
BLOCKS http://blocks.fhcrc.org/blocks/ • Blocks are multiply aligned ungapped segments corresponding to the most highly conserved regions of proteins. • The BLOCKS database is a collection of blocks representing known protein families that can be used to compare a protein or DNA sequence with documented families of proteins.
Blocks Making • Blocks are produced by the automated PROTOMAT system (Henikoff and Henikoff, 1991), which applies a robust motif-finder to a set of related protein sequences.
Sequence, no. of blocks and text Searches Blocks Maker http://blocks.fhcrc.org/blocks/
Page 1 Summary Search methods using blocks
Page 2 BLOCK - 1 Weak Blocks - Strength < 1100 Strong Blocks - Strength >= 1100 SWISSPROT ID Represent start position of the block
SMART http://smart.embl-heidelberg.de/ • Contains >500 domain families found in signaling, extracellular and chromatin-associated proteins. • Each domain is extensively annotated with phyletic distributions, functional class, tertiary structures and functionally important residues.
ID & sequence Search Domain & GO search ID and text Search Alkaline Phosphatase
Results – Alkaline phosphatase “Signatures” • PROSITE • Represented as a single motif. • PRINTS • Represented as 5 motif regions. • BLOCKS • Represented as 6 block regions. • SMART • Represented as a single profile.
Composite Pattern Databases • MetaFam • InterPro • CDD (Conserved Domain Database) • iProClass
MetaFam & PANAL • MetaFam - http://metafam.ahc.umn.edu/ • PANAL – Protein ANALysis tool page of MetaFam http://mgd.ahc.umn.edu/panal/ • Protein family classification built with Blocks+, DOMO, Pfam, PIR-ALN, PRINTS, PROSITE, ProDom, SBASE, SYSTERS.
InterPro • http://www.ebi.ac.uk/interpro • Built from PROSITE, PRINTS, Pfam, ProDom, SMART, TIGRFAM, SWISS-PROT and TrEMBL • Text- and sequence-based searches.