BLOSUM Information Resources

Algorithms in Computational Biology, Spring 2006. Created by Itai Sharon.

BLOSUM (BLOcks SUbstitution Matrix)
• Publication
  • Henikoff and Henikoff, 1992
• Motivation
  • PAM matrices do not capture the difference between mutations accumulated over short and long evolutionary times
• Method
  • For several degrees of sequence divergence, derive mutations from sets of related proteins
  • BLOSUM-k is based on related proteins with k% identity or less

BLOSUM – Method
• Use Blocks: collections of ungapped multiple alignments of similar segments
• Cluster sequences together whenever they share more than k% identical residues
• Count the number of substitutions across different clusters (in the same family)
• Estimate frequencies from the counts
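The clustering step can be sketched in Python as follows. This is a minimal illustration, not the authors' implementation: it assumes single-linkage clustering (a segment joins a cluster if it is more than k% identical to any member), and the names `percent_identity` and `cluster_sequences` are hypothetical. The four short segments are toy data.

```python
def percent_identity(s, t):
    """Percentage of aligned positions (no gaps) at which s and t agree."""
    if len(s) != len(t):
        raise ValueError("block segments must have equal length")
    matches = sum(x == y for x, y in zip(s, t))
    return 100.0 * matches / len(s)


def cluster_sequences(segments, k):
    """Return one cluster label per segment (single linkage, an assumption)."""
    parent = list(range(len(segments)))

    def find(i):
        # Follow parent links to the cluster representative.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    for i in range(len(segments)):
        for j in range(i + 1, len(segments)):
            # Merge the two clusters when the pair is more than k% identical.
            if percent_identity(segments[i], segments[j]) > k:
                parent[find(i)] = find(j)

    return [find(i) for i in range(len(segments))]


segments = ["ABPEDG", "ABSEDQ", "SBPEDQ", "ABAEDS"]  # toy data
labels_60 = cluster_sequences(segments, 60)
labels_70 = cluster_sequences(segments, 70)
print(labels_60, labels_70)
```

Most pairs of these toy segments agree at 4 of 6 positions (about 67%), so a threshold of 60 merges everything into one cluster, while a threshold of 70 leaves every segment in its own cluster.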

BLOCKS
• Each BLOCK represents a conserved region in a group of proteins

            1   5      n
sequence 1  ABPEDG… …FGW
sequence 2  ABSEDQ… …QGW
sequence 3  SBPEDQ… …FGD
    :       : : : : : :
sequence m  ABAEDS… …QGD

Obtaining Accepted Mutations from BLOCKS
• For each column we compute the frequency $f_{ab}$ of each pair (a, b) of amino acids
• E.g.: if m = 10 and column i contains 9 A's and 1 S, then $f_{AA} = 8 + 7 + \dots + 1 = 36$ and $f_{AS} = 9$
• Total number of pairs per column: m(m-1)/2
• The probability to observe a pair (a, b) is given by
  $q_{ab} = f_{ab} / \sum_{c \le d} f_{cd}$
  • In the example, $q_{AA} = 36/45 = 0.8$ and $q_{AS} = 9/45 = 0.2$
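The pair counting for a single column can be checked with a short Python sketch. The function names `column_pair_counts` and `pair_frequencies` are hypothetical, and the column is the 9A-1S example from the slide; a real BLOSUM computation would sum the counts over all columns of all blocks before normalizing.

```python
from collections import Counter
from itertools import combinations


def column_pair_counts(column):
    """Count unordered residue pairs (a, b) in one alignment column."""
    counts = Counter()
    for x, y in combinations(column, 2):
        # Sort so that (A, S) and (S, A) are counted as the same pair.
        counts[tuple(sorted((x, y)))] += 1
    return counts


def pair_frequencies(counts):
    """Normalize pair counts f_ab into observed probabilities q_ab."""
    total = sum(counts.values())
    return {pair: n / total for pair, n in counts.items()}


column = "A" * 9 + "S"          # the 9A-1S column, m = 10
f = column_pair_counts(column)  # f_AA = 36, f_AS = 9
q = pair_frequencies(f)         # q_AA = 36/45, q_AS = 9/45
print(f[("A", "A")], f[("A", "S")], sum(f.values()))
```

The total of 45 pairs agrees with m(m-1)/2 for m = 10.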

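The Null Hypothesis
• The background distribution of amino acid a is given by:
  $p_a = q_{aa} + \sum_{b \ne a} q_{ab} / 2$
• The null hypothesis: the two residues of a pair are drawn independently from the background distribution, so the expected pair frequency is
  $e_{ab} = p_a^2$ if $a = b$, and $e_{ab} = 2 p_a p_b$ if $a \ne b$
• E.g.: in the above example, $p_A = 0.9$ and $p_S = 0.1$, so
  • $e_{AS} = 2 \cdot 0.9 \cdot 0.1 = 0.18$
  • $e_{AA} = 0.9 \cdot 0.9 = 0.81$
  • $e_{SS} = 0.1 \cdot 0.1 = 0.01$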
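A minimal Python sketch of the null model, assuming the observed pair probabilities of the 9A-1S example (q_AA = 0.8, q_AS = 0.2) as input. The function names `background` and `expected_pair_frequencies` are hypothetical.

```python
def background(q):
    """Background probability p_a = q_aa + sum over b != a of q_ab / 2."""
    p = {}
    for (a, b), freq in q.items():
        if a == b:
            p[a] = p.get(a, 0.0) + freq
        else:
            # A mixed pair contributes half of its probability to each residue.
            p[a] = p.get(a, 0.0) + freq / 2
            p[b] = p.get(b, 0.0) + freq / 2
    return p


def expected_pair_frequencies(p):
    """Expected frequency e_ab of each unordered pair under independence."""
    residues = sorted(p)
    e = {}
    for i, a in enumerate(residues):
        for b in residues[i:]:
            e[(a, b)] = p[a] * p[b] if a == b else 2 * p[a] * p[b]
    return e


q = {("A", "A"): 0.8, ("A", "S"): 0.2}  # observed values from the example
p = background(q)                       # roughly A: 0.9, S: 0.1
e = expected_pair_frequencies(p)        # roughly AA: 0.81, AS: 0.18, SS: 0.01
print(p, e)
```

The factor 2 for a ≠ b accounts for the two orders in which an unordered pair can be drawn, which is why the expected frequencies sum to 1.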

The LOD Ratio
• The log-odds (LOD) ratio is given by:
  $s_{ab} = \log_2 (q_{ab} / e_{ab})$
• Properties:
  • $s_{ab} > 0$ ⇔ $q_{ab} > e_{ab}$: the pair is observed more often than expected
  • $s_{ab} = 0$ ⇔ $q_{ab} = e_{ab}$: the pair is observed as often as expected
  • $s_{ab} < 0$ ⇔ $q_{ab} < e_{ab}$: the pair is observed less often than expected
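The sign properties can be illustrated with a small Python sketch. It assumes a base-2 logarithm and uses the observed and expected values of the 9A-1S example as inputs; the function name `log_odds` is hypothetical, and assigning minus infinity to pairs that are never observed is a choice made here for illustration.

```python
import math


def log_odds(q, e):
    """Score s_ab = log2(q_ab / e_ab) for every pair with an expected value."""
    scores = {}
    for pair, expected in e.items():
        observed = q.get(pair, 0.0)
        # A pair that is never observed has no finite log-odds score.
        scores[pair] = math.log2(observed / expected) if observed > 0 else float("-inf")
    return scores


q = {("A", "A"): 0.8, ("A", "S"): 0.2}                      # observed
e = {("A", "A"): 0.81, ("A", "S"): 0.18, ("S", "S"): 0.01}  # expected
s = log_odds(q, e)
print(s)
```

In this toy column the A-S pair scores slightly above zero (0.2 observed against 0.18 expected) and the A-A pair slightly below zero (0.8 against 0.81). Published BLOSUM matrices additionally scale and round these scores to integers.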

Constructing the Different BLOSUM-k Matrices
• The idea: create substitution matrices that are based on different degrees of identity
• How: cluster all sequences that are more than k% identical and treat each cluster as a single sequence
• Example: suppose k = 80 and 8 of the 9 sequences with A in the 9A-1S column are more than 80% identical to one another. The column then effectively contains A, A and S, so
  • $f_{AA} = 1$, $f_{AS} = 2$, $f_{SS} = 0$
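One way to realize "treat a cluster as a single sequence" is to give each sequence a weight of 1 / (size of its cluster) and to count only pairs that span two different clusters. The Python sketch below follows that assumption; the function name `clustered_pair_counts` and the cluster labels are hypothetical, and the input is the 9A-1S column with 8 of the A sequences in one cluster.

```python
from collections import Counter, defaultdict
from itertools import combinations


def clustered_pair_counts(column, cluster_ids):
    """Weighted pair counts f_ab, with each cluster counted as one sequence."""
    sizes = Counter(cluster_ids)
    counts = defaultdict(float)
    for i, j in combinations(range(len(column)), 2):
        if cluster_ids[i] == cluster_ids[j]:
            continue  # pairs inside one cluster are not counted
        # Each sequence contributes 1 / (size of its cluster).
        weight = 1.0 / (sizes[cluster_ids[i]] * sizes[cluster_ids[j]])
        counts[tuple(sorted((column[i], column[j])))] += weight
    return dict(counts)


column = "A" * 9 + "S"
cluster_ids = [0] * 8 + [1, 2]  # 8 A's clustered; one A and the S stay alone
f = clustered_pair_counts(column, cluster_ids)
print(f)
```

With these weights the sketch reproduces the counts of the example: one A-A pair, two A-S pairs and no S-S pair.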

Information Resources
• NCBI
• GenBank
• PDB and SCOP
• GO
• There are many, many more…

NCBI
• Contains several databases and tools for molecular biology research
  • E.g.: BLAST, PubMed, GenBank and more
• URL: http://www.ncbi.nih.gov

GenBank
• GenBank is an annotated collection of all publicly available DNA sequences
• Data is partitioned into 'divisions' that roughly correspond to taxonomic groups (e.g. bacteria, viruses, primates, etc.)
• Statistics:
  • DNA sequences for more than 165K organisms (2005)
  • ~55M DNA sequences
  • 60G bases
• URL: http://www.ncbi.nlm.nih.gov/GenBank/

Protein Data Bank (PDB) and SCOP
• PDB is a database of known protein structures
  • Currently contains ~36K known structures
• SCOP is a classification of proteins from PDB
  • Family – clear evolutionary relationship
  • Superfamily – probable common evolutionary origin
  • Fold – major structural similarity
• URLs:
  • PDB – http://www.rcsb.org
  • SCOP – http://scop.berkeley.edu

Gene Ontology (GO)
• The GO project “… is a collaborative effort to address the need for consistent descriptions of gene products in different databases”
• Kept in the form of a directed graph originating from one root
  • Nodes are the different GO terms (more than 17K now exist)
  • A node may have more than one parent
• Three main branches: biological process, molecular function and cellular component
• URL: http://www.geneontology.org
