BLOSUM Information Resources

BLOSUM Information Resources. Algorithms in Computational Biology, Spring 2006. Created by Itai Sharon.

Presentation Transcript


  1. BLOSUM Information Resources • Algorithms in Computational Biology • Spring 2006 • Created by Itai Sharon

  2. BLOSUM (BLOcks SUbstitution Matrix) • Publication • Henikoff and Henikoff, 1992 • Motivation • PAM matrices do not capture the difference between short-term and long-term mutations • Method • For several degrees of sequence divergence, derive mutations from sets of related proteins • BLOSUM-k is based on related proteins with k% identity or less

  3. BLOSUM – Method • Use Blocks – collections of ungapped multiple alignments of similar segments • Cluster sequences together whenever they share more than k% identical residues • Count the number of substitutions across different clusters (in the same family) • Estimate frequencies from the counts
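
The clustering step above can be sketched in a few lines. This is a minimal illustration, not the authors' code: it assumes single-linkage clustering (a segment joins a cluster if it is more than k% identical to any member), and the function names and the toy block are hypothetical.

```python
from itertools import combinations


def percent_identity(s, t):
    """Percentage of aligned positions at which two ungapped segments agree."""
    if len(s) != len(t):
        raise ValueError("block segments must have equal length (no gaps)")
    matches = sum(1 for x, y in zip(s, t) if x == y)
    return 100.0 * matches / len(s)


def cluster_block(segments, k):
    """Group segment indices; two segments are linked when identity > k%.

    Assumption: single linkage, implemented with a small union-find.
    """
    parent = list(range(len(segments)))

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]  # path halving
            i = parent[i]
        return i

    for i, j in combinations(range(len(segments)), 2):
        if percent_identity(segments[i], segments[j]) > k:
            parent[find(i)] = find(j)

    clusters = {}
    for i in range(len(segments)):
        clusters.setdefault(find(i), []).append(i)
    return sorted(clusters.values())


# Toy block (hypothetical data, not taken from the BLOCKS database).
block = ["ABPEDG", "ABPEDQ", "SBAEKQ"]
print(cluster_block(block, 80))  # [[0, 1], [2]]
```

The first two segments agree at 5 of 6 positions (about 83%), so they merge at k = 80; the third segment stays in its own cluster.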

  4. BLOCKS • Each BLOCK represents a conserved region in a group of proteins

                   1   5      n
       sequence 1  ABPEDG… …FGW
       sequence 2  ABSEDQ… …QGW
       sequence 3  SBPEDQ… …FGD
           :         :       :
       sequence m  ABAEDS… …QGD

  5. Obtaining Accepted Mutations from BLOCKS • For each column, compute the frequency $f_{ab}$ of each pair (a, b) of amino acids • E.g.: if m = 10 and column i contains 9 A’s and 1 S, then $f_{AA}$ = 8+7+…+1 = 36 and $f_{AS}$ = 9 • Total number of pairs per column: m(m-1)/2 • The probability of observing a pair (a, b) is given by $q_{ab} = \frac{f_{ab}}{\sum_{c \le d} f_{cd}}$, with the counts summed over all columns
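
The pair counts in the example can be checked with a short sketch. The function name is hypothetical, and the normalization (dividing each count by the total number of pairs) is the usual reading of "probability to observe a pair", assumed here for a single column.

```python
from collections import Counter
from itertools import combinations


def column_pair_counts(column):
    """Count unordered residue pairs (a, b) in one block column."""
    counts = Counter()
    for x, y in combinations(column, 2):
        counts[tuple(sorted((x, y)))] += 1
    return counts


column = "A" * 9 + "S"      # the slide's example: m = 10, nine A's and one S
f = column_pair_counts(column)
total = sum(f.values())     # m(m-1)/2 pairs in the column
q = {pair: n / total for pair, n in f.items()}

print(f[("A", "A")], f[("A", "S")], total)  # 36 9 45
```

Here q comes out as 36/45 = 0.8 for the pair (A, A) and 9/45 = 0.2 for the pair (A, S).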

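  6. The Null Hypothesis • The background distribution of amino acid a is given by: $p_a = q_{aa} + \frac{1}{2}\sum_{b \neq a} q_{ab}$ • The null hypothesis: $e_{aa} = p_a^2$ and $e_{ab} = 2 p_a p_b$ for $a \neq b$ • E.g.: in the above example, $p_A$ = 0.8 + 0.2/2 = 0.9 and $p_S$ = 0.1, so • $e_{AS}$ = 2 · 0.9 · 0.1 = 0.18 • $e_{AA}$ = 0.9 · 0.9 = 0.81 • $e_{SS}$ = 0.1 · 0.1 = 0.01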
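
A small sketch of the null-hypothesis calculation, using the standard BLOSUM definitions as an assumption: the background frequency of a is q_aa plus half of every q_ab with b different from a, and the expected pair frequency is p_a squared for identical residues and 2 · p_a · p_b otherwise. Function names are hypothetical; the input values are the observed pair frequencies from the 9A-1S column.

```python
def background(q):
    """Background frequency p_a from unordered pair frequencies q[(a, b)]."""
    p = {}
    for (a, b), value in q.items():
        if a == b:
            p[a] = p.get(a, 0.0) + value
        else:
            # A mixed pair contributes half of its frequency to each residue.
            p[a] = p.get(a, 0.0) + value / 2
            p[b] = p.get(b, 0.0) + value / 2
    return p


def expected(p):
    """Expected pair frequencies under independence (the null hypothesis)."""
    residues = sorted(p)
    e = {}
    for i, a in enumerate(residues):
        for b in residues[i:]:
            e[(a, b)] = p[a] * p[b] if a == b else 2 * p[a] * p[b]
    return e


q = {("A", "A"): 0.8, ("A", "S"): 0.2}   # observed in the 9A-1S column
p = background(q)
e = expected(p)
print({pair: round(value, 2) for pair, value in e.items()})
```

The rounded values match the slide: 0.81 for (A, A), 0.18 for (A, S) and 0.01 for (S, S).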

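  7. The LOD Ratio • The LOD (log-odds) ratio is given by: $s_{ab} = \log_2 \frac{q_{ab}}{e_{ab}}$ • Properties: • $s_{ab} > 0 \Leftrightarrow q_{ab} > e_{ab}$: the pair is observed more often than expected • $s_{ab} = 0 \Leftrightarrow q_{ab} = e_{ab}$: the pair is observed as often as expected • $s_{ab} < 0 \Leftrightarrow q_{ab} < e_{ab}$: the pair is observed less often than expected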
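
The sign properties can be illustrated numerically. This sketch assumes a base-2 logarithm of the ratio of observed to expected frequency; the function name is hypothetical, and a pair that was never observed is mapped to negative infinity rather than raising an error.

```python
import math


def log_odds(q_ab, e_ab):
    """Log-odds score log2(q_ab / e_ab); -inf if the pair was never observed."""
    if q_ab == 0:
        return float("-inf")
    return math.log2(q_ab / e_ab)


# Observed and expected frequencies from the 9A-1S toy column.
s_AA = log_odds(0.8, 0.81)   # observed slightly less than expected
s_AS = log_odds(0.2, 0.18)   # observed slightly more than expected
s_SS = log_odds(0.0, 0.01)   # never observed

print(s_AA < 0, s_AS > 0, s_SS)  # True True -inf
```

In this single-column toy example the A-S substitution scores positive only because the column is so small; real matrices pool counts over many blocks.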

  8. Constructing the Different BLOSUM-k Matrices • The idea: create substitution matrices based on different degrees of identity • How: cluster all sequences that are more than k% identical and treat each cluster as a single sequence • Example: suppose k = 80 and 8 of the 9 sequences with A in the 9A-1S column are more than 80% identical to one another; the column then behaves like three sequences (A, A, S), so • $f_{AA}$ = 1, $f_{AS}$ = 2, $f_{SS}$ = 0
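
One way to make "treat them as a single sequence" concrete is to give every sequence in a cluster of size n the weight 1/n and to count only pairs that span two different clusters. The sketch below takes that weighting as an assumption; the function name and the cluster assignment are hypothetical, chosen to mirror the example above.

```python
from collections import defaultdict
from itertools import combinations


def clustered_pair_counts(column, clusters):
    """Pair counts for one column when each cluster counts as one sequence.

    column   -- residues in the column, one per sequence
    clusters -- lists of sequence indices that partition the column
    """
    weight, owner = {}, {}
    for c, members in enumerate(clusters):
        for i in members:
            weight[i] = 1.0 / len(members)   # a cluster has total weight 1
            owner[i] = c

    counts = defaultdict(float)
    for i, j in combinations(range(len(column)), 2):
        if owner[i] != owner[j]:             # skip pairs inside a cluster
            pair = tuple(sorted((column[i], column[j])))
            counts[pair] += weight[i] * weight[j]
    return counts


column = "AAAAAAAAAS"                         # nine A's and one S
clusters = [[0, 1, 2, 3, 4, 5, 6, 7], [8], [9]]  # 8 A's cluster at k = 80
f = clustered_pair_counts(column, clusters)
print(f[("A", "A")], f[("A", "S")], f[("S", "S")])  # 1.0 2.0 0.0
```

With the eight clustered A's collapsed into one effective sequence, the counts drop from 36 and 9 to 1 and 2, which is why lower k values emphasize more divergent substitutions.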

  9. Information Resources • NCBI • GenBank • PDB and SCOP • GO • There are many, many more…

  10. NCBI • Contains several databases and tools for molecular biology research • E.g.: BLAST, PubMed, GenBank and more • URL: http://www.ncbi.nih.gov

  11. GenBank • GenBank is an annotated collection of all publicly available DNA sequences • Data is partitioned into ‘divisions’ that roughly correspond to taxonomic groups (e.g., bacteria, viruses, primates) • Statistics: • DNA sequences for more than 165K organisms (2005) • ~55M DNA sequences • 60G bases • URL: http://www.ncbi.nlm.nih.gov/GenBank/

  12. Protein Data Bank (PDB) and SCOP • PDB is a database of known protein structures • Currently contains ~36K known structures • SCOP is a classification of proteins from PDB • Family – clear evolutionary relationship • Superfamily – probable common evolutionary origin • Fold – major structural similarity • URLs: • PDB – http://www.rcsb.org • SCOP – http://scop.berkeley.edu

  13. Gene Ontology (GO) • The GO project “… is a collaborative effort to address the need for consistent descriptions of gene products in different databases” • Kept in the form of a directed graph originating from one root • Nodes are the different GO terms (more than 17K now exist) • A node may have more than one parent • Three main branches: biological process, molecular function and cellular component • URL: http://www.geneontology.org
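
Because a GO node may have more than one parent, the structure is a graph rather than a tree, and collecting all ancestors of a term requires a traversal that tracks visited nodes. The sketch below uses a plain dictionary of parent links; the term names are hypothetical placeholders, not real GO identifiers.

```python
def ancestors(term, parents):
    """All terms reachable by following parent links upward from `term`."""
    seen = set()
    stack = [term]
    while stack:
        current = stack.pop()
        for parent in parents.get(current, ()):
            if parent not in seen:       # a shared ancestor is visited once
                seen.add(parent)
                stack.append(parent)
    return seen


# Hypothetical terms for illustration only.
parents = {
    "root": [],
    "process": ["root"],
    "function": ["root"],
    "term_x": ["process", "function"],   # a node with two parents
    "term_y": ["term_x"],
}

print(sorted(ancestors("term_y", parents)))
# ['function', 'process', 'root', 'term_x']
```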
