BLOSUM and Information Resources • Algorithms in Computational Biology, Spring 2006 • Created by Itai Sharon
BLOSUM (BLOcks SUbstitution Matrices) • Publication • Henikoff and Henikoff, 1992 • Motivation • PAM matrices do not capture the difference between short-term and long-term mutations • Method • For several degrees of sequence divergence, derive mutations from a set of related proteins • BLOSUM-k is based on related proteins with k% identity or less
BLOSUM – Method • Use Blocks – collections of multiple alignments of similar segments without gaps • Cluster together sequences that share more than k% identical residues • Count the number of substitutions across different clusters (in the same family) • Estimate frequencies using the counts
BLOCKS • Each BLOCK represents a conserved region in a group of proteins
position      1   5        n
sequence 1    ABPEDG… …FGW
sequence 2    ABSEDQ… …QGW
sequence 3    SBPEDQ… …FGD
    :           :       :
sequence m    ABAEDS… …QGD
Obtaining Accepted Mutations from BLOCKS • For each column we compute the frequency of each pair (a, b) of amino acids • E.g.: if m=10 and column i contains 9 A’s and 1 S, then fAA = 8+7+…+1 = 36 and fAS = 9 • Total number of pairs per column: m(m-1)/2 • The probability to observe a pair (a, b) is given by $q_{ab} = f_{ab} / \sum_{c \le d} f_{cd}$, where the sum runs over all unordered pairs (in the example, qAA = 36/45 = 0.8 and qAS = 9/45 = 0.2)
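The pair counting for a single column can be sketched in a few lines of Python. This is only an illustration of the 9A-1S example, not the authors' program; the function names `pair_counts` and `pair_frequencies` are made up here, and the sketch assumes the pair probability is the pair count divided by the total number of pairs.

```python
from collections import Counter
from itertools import combinations


def pair_counts(column):
    """Count unordered amino-acid pairs (a, b) in one block column."""
    counts = Counter()
    for x, y in combinations(column, 2):
        # Sort so that (A, S) and (S, A) are counted as the same pair.
        counts[tuple(sorted((x, y)))] += 1
    return counts


def pair_frequencies(counts):
    """Turn pair counts into observed pair probabilities q_ab (assumed: count / total)."""
    total = sum(counts.values())
    return {pair: n / total for pair, n in counts.items()}


# Column with m = 10 sequences: 9 A's and 1 S.
column = "AAAAAAAAAS"
f = pair_counts(column)
q = pair_frequencies(f)
print(f[("A", "A")], f[("A", "S")], sum(f.values()))
```

For a full block the counts of all columns would be summed before normalising; the single column is used here only to mirror the example.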
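The Null Hypothesis • The background distribution of amino acid a is given by: $p_a = q_{aa} + \sum_{b \ne a} q_{ab}/2$ • The null hypothesis: the two residues of a pair are drawn independently from the background distribution, so $e_{aa} = p_a^2$ and $e_{ab} = 2 p_a p_b$ for $a \ne b$ • E.g.: in the above example, pA = 0.9 and pS = 0.1, so • eAS = 2 · 0.9 · 0.1 = 0.18 • eAA = 0.9 · 0.9 = 0.81 • eSS = 0.1 · 0.1 = 0.01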
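The expected frequencies of the example can be reproduced with a small sketch. It assumes the usual definitions (background p_a = q_aa plus half of every q_ab with b ≠ a; expected pair frequency p_a² for identical residues and 2·p_a·p_b otherwise). The helper names `background` and `expected` and the input values for q (0.8 and 0.2, i.e. 36/45 and 9/45 from the 9A-1S column) are illustrative assumptions.

```python
def background(q):
    """Background probability p_a from observed pair probabilities q_ab."""
    p = {}
    for (a, b), freq in q.items():
        if a == b:
            p[a] = p.get(a, 0.0) + freq
        else:
            # A mixed pair contributes half of its probability to each residue.
            p[a] = p.get(a, 0.0) + freq / 2
            p[b] = p.get(b, 0.0) + freq / 2
    return p


def expected(p):
    """Expected pair probabilities e_ab under independence."""
    e = {}
    letters = sorted(p)
    for i, a in enumerate(letters):
        for b in letters[i:]:
            e[(a, b)] = p[a] * p[b] if a == b else 2 * p[a] * p[b]
    return e


# Observed pair probabilities for the 9A-1S column (36/45 and 9/45).
q = {("A", "A"): 0.8, ("A", "S"): 0.2}
p = background(q)
e = expected(p)
print(p, e)
```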
The LOD Ratio • The LOD (log-odds) ratio is given by: $s_{ab} = \log_2 (q_{ab} / e_{ab})$ • Properties: • sab > 0 ⇔ qab > eab: observed frequencies are more than expected • sab = 0 ⇔ qab = eab: observed frequencies are as expected • sab < 0 ⇔ qab < eab: observed frequencies are less than expected
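A minimal sketch of the score computation, assuming a base-2 logarithm of the observed-to-expected ratio. The function name `lod_score`, the treatment of unobserved pairs as negative infinity, and the numeric inputs (taken from the 9A-1S column example) are assumptions made for illustration.

```python
import math


def lod_score(q_ab, e_ab):
    """Log-odds score log2(q_ab / e_ab); an unobserved pair (q_ab == 0) gets -inf."""
    if q_ab == 0:
        return float("-inf")
    return math.log2(q_ab / e_ab)


# Values from the 9A-1S column: q_AS = 0.2, e_AS = 0.18, q_AA = 0.8, e_AA = 0.81.
s_AS = lod_score(0.2, 0.18)   # observed more often than expected
s_AA = lod_score(0.8, 0.81)   # observed slightly less often than expected
print(s_AS, s_AA)
```

Published BLOSUM tables store scaled, rounded integers rather than raw log ratios; that scaling step is left out of this sketch.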
Constructing the Different BLOSUM-k Matrices • The idea: create substitution matrices that are based on different degrees of identity • How: cluster all sequences that are more than k% identical and treat each cluster as a single sequence • Example: suppose k=80 and 8 of the 9 sequences with A in the 9A-1S column are more than 80% identical to each other; the column then counts as three sequences (A, A, S) • fAA=1, fAS=2, fSS=0
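The effect of clustering on the pair counts can be sketched as follows. The cluster assignment is given by hand here (computing it from pairwise identity is not shown), and the function name `clustered_pair_counts` is hypothetical. The sketch assumes that each cluster carries a total weight of 1 shared equally among its members, which is one way to "treat a cluster as a single sequence" when its members might differ in a column.

```python
from collections import defaultdict
from itertools import combinations


def clustered_pair_counts(column, cluster_ids):
    """Count pairs between clusters only; each cluster has total weight 1."""
    clusters = defaultdict(list)
    for residue, cid in zip(column, cluster_ids):
        clusters[cid].append(residue)

    counts = defaultdict(float)
    for c1, c2 in combinations(list(clusters.values()), 2):
        weight = 1.0 / (len(c1) * len(c2))  # each member pair gets an equal share
        for a in c1:
            for b in c2:
                counts[tuple(sorted((a, b)))] += weight
    return dict(counts)


# 9A-1S column; the first 8 A-sequences form one cluster (assumed, k = 80).
column = "AAAAAAAAAS"
cluster_ids = [0] * 8 + [1, 2]
f = clustered_pair_counts(column, cluster_ids)
print(f)
```

Pairs inside a cluster are never counted, which is why the 36 A-A pairs of the unclustered column shrink to a single one.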
Information Resources • NCBI • GenBank • PDB and SCOP • GO • There are many many more…
NCBI • Contains several databases and tools for molecular biology research • E.g: BLAST, PubMed, GenBank and more • URL: http://www.ncbi.nih.gov
GenBank • GenBank is an annotated collection of all publicly available DNA sequences • Data is partitioned into ‘divisions’ that roughly correspond to taxonomic groups (e.g. bacteria, viruses, primates etc.) • Statistics: • DNA sequences for more than 165K organisms (2005) • ~55M DNA sequences • 60G bases • URL: http://www.ncbi.nlm.nih.gov/GenBank/
Protein Data Bank (PDB) and SCOP • PDB is a database of known protein structures • Currently contains ~36K known structures • SCOP is a classification of proteins from PDB • Family – clear evolutionary relationship • Superfamily – probable common evolutionary origin • Fold – major structural similarity • URLs: • PDB – http://www.rcsb.org • SCOP – http://scop.berkeley.edu
Gene Ontology (GO) • The GO project “… is a collaborative effort to address the need for consistent descriptions of gene products in different databases” • Kept in the form of a directed graph originating from one root • Nodes are the different GO terms (more than 17K now exist) • A node may have more than one parent • Three main branches: biological process, molecular function and cellular component • URL: http://www.geneontology.org