Index-based approach to similarity search in protein and nucleotide databases • David Hoksza, Tomáš Skopal • Charles University in Prague, Department of Software Engineering, Czech Republic
Presentation Outline • Biological background • Protein and nucleotide databases • Current methods • dynamic programming • heuristic approach • Index-based approach • Experiments DATESO 2007
Terminology • DNA (deoxyribonucleic acid) • sequence of nucleotides (A, C, G, T) • double helix • RNA (ribonucleic acid) • single-stranded sequence of nucleotides (A, C, G, U) • messenger RNA (mRNA) • transfer RNA (tRNA) • ribosomal RNA (rRNA) • … • proteins • molecules • translated from mRNA in ribosomes • sequence of amino acids (20 AAs) • each amino acid coded by a codon (triplet of nucleotides) • genetic code • central dogma: DNA → (transcription) → RNA → (translation) → protein
Protein Similarity • Interaction of proteins determines biological functions • Function of a protein is derived from its three-dimensional structure • similar proteins (many common amino acids in “appropriate” places) have similar structure • → similar proteins have similar functions • similar proteins have a common ancestor • Determining a protein sequence • → finding similar proteins • → getting a clue to the function
Protein and Nucleotide Databases • Protein databases • finding similar proteins • even among different species • Nucleotide databases • finding similarities in non-coding (not transcribed) parts • finding whether a sequence was already described • checking whether a given segment was sequenced correctly • Prominent databases • GenBank • EMBL (European Molecular Biology Laboratory Data) • DDBJ (DNA Data Bank of Japan) • UniProt • Swissprot (moderated) + trEMBL (translated EMBL, not moderated) + PIR (Protein Information Resource)
Database Growth (figure not preserved)
Similarity Search • Similarity = alignment of 2 sequences • “correspondence” between 2 sequences • Standard methods for finding alignments • dot matrix method • dynamic programming • heuristic approach • Example alignment (“-” marks a gap): top row N P H G I - I - M G L - A E, bottom row - - H G - A - L - G L L - E
Similarity Measures • Need to define a measure • Distances for measuring alignments of strings • Hamming distance • sequences of equal length • number of non-identical positions • Levenshtein (edit) distance • minimal number of editing operations (insert/update/delete) needed to convert one sequence into the other • Weighted edit distance • takes into account the probability of updating one letter to another • distance matrix • biologically correct • PAM, BLOSUM, …
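The three distances listed above can be sketched in a few lines of Python. This is only an illustration: the function names and the `toy_cost` substitution costs are assumptions made for the example, not values taken from PAM or BLOSUM.

```python
def hamming(a, b):
    """Number of positions at which two equal-length sequences differ."""
    if len(a) != len(b):
        raise ValueError("Hamming distance needs sequences of equal length")
    return sum(x != y for x, y in zip(a, b))


def unit_cost(x, y):
    """Plain Levenshtein substitution cost: 0 for a match, 1 for an update."""
    return 0 if x == y else 1


def edit_distance(a, b, sub_cost=unit_cost, indel_cost=1):
    """(Weighted) edit distance, computed row by row.

    With the default costs this is the Levenshtein distance; passing a
    different sub_cost gives a weighted edit distance.
    """
    prev = [j * indel_cost for j in range(len(b) + 1)]
    for i, x in enumerate(a, start=1):
        curr = [i * indel_cost]
        for j, y in enumerate(b, start=1):
            curr.append(min(prev[j - 1] + sub_cost(x, y),  # update or match
                            prev[j] + indel_cost,          # delete x
                            curr[j - 1] + indel_cost))     # insert y
        prev = curr
    return prev[-1]


def toy_cost(x, y):
    """Hypothetical weights: I and L are treated as 'cheap' to exchange."""
    if x == y:
        return 0
    return 0.5 if {x, y} == {"I", "L"} else 1
```

For example, `edit_distance("kitten", "sitting")` uses unit costs, while `edit_distance("I", "L", sub_cost=toy_cost)` shows how a distance matrix changes the result.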
Dynamic Programming – Global Alignment • Global alignment • aligning whole sequences • weighted edit distance • Needleman-Wunsch • optimal alignment between 2 sequences a and b • distance matrix δ • gap cost σ • s_{i,j} – value of the optimal alignment of the prefixes of a and b of lengths i and j • s_{0,j} = j*σ, s_{i,0} = i*σ • s_{i,j} = max( s_{i-1,j-1} + δ(a_i, b_j) [align a_i and b_j], s_{i-1,j} + σ [adding gap to b], s_{i,j-1} + σ [adding gap to a] ) • s_{|a|,|b|} … value of the optimal alignment • time complexity O(|a||b|) • Example (BLOSUM62, gap cost -1): N P H G I I M G L A E aligned with - - H G - - L G L - - • column scores -1 -1 +8 +6 -1 -1 +2 +6 +4 -1 -1, total 20
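A minimal Python sketch of the Needleman-Wunsch table fill described on this slide. The parameter names `delta` and `sigma` follow the slide's δ and σ. The `toy_delta` scoring (+1 match, -1 mismatch) is an assumption that stands in for a real substitution matrix such as BLOSUM62, so the scores differ from the slide's example.

```python
def needleman_wunsch(a, b, delta, sigma):
    """Return the value of the optimal global alignment of a and b.

    delta(x, y) scores aligning letters x and y; sigma is the (negative)
    score of one gap position. Runs in O(|a||b|) time and space.
    """
    rows, cols = len(a) + 1, len(b) + 1
    s = [[0] * cols for _ in range(rows)]
    for i in range(1, rows):
        s[i][0] = i * sigma          # prefix of a aligned only with gaps
    for j in range(1, cols):
        s[0][j] = j * sigma          # prefix of b aligned only with gaps
    for i in range(1, rows):
        for j in range(1, cols):
            s[i][j] = max(
                s[i - 1][j - 1] + delta(a[i - 1], b[j - 1]),  # align a_i and b_j
                s[i - 1][j] + sigma,                          # a_i against a gap in b
                s[i][j - 1] + sigma,                          # b_j against a gap in a
            )
    return s[-1][-1]


def toy_delta(x, y):
    """Assumed stand-in scoring: +1 for a match, -1 for a mismatch."""
    return 1 if x == y else -1
```

With these toy scores, `needleman_wunsch("GATTACA", "GCATGCG", toy_delta, -1)` evaluates one alignment with four matches, three mismatches or gaps on each side balanced out, giving a total of 0.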
Dynamic Programming – Local Alignment • Local alignment • best global alignment over all pairs of subsequences of a and b • Smith-Waterman • modification of Needleman-Wunsch • allowing a “free ride” from the start by incorporating a zero value • s_{0,j} = 0, s_{i,0} = 0 • max(s_{i,j}) … value of the optimal alignment • gap extending with cost of σ • Example (BLOSUM62, gap cost -11): within N P H G I I M G L A E, the segment H G I aligned with H G L • scores +8 +6 +2, total 16
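The two changes relative to global alignment (zero initialisation with a zero floor, and taking the maximum over all cells) can be sketched as follows. This is an illustrative Python version with a linear gap cost; `toy_delta` is an assumed stand-in for a substitution matrix and is not BLOSUM62.

```python
def smith_waterman(a, b, delta, sigma):
    """Return the value of the best local alignment of a and b.

    delta(x, y) scores aligning x with y; sigma is the (negative) score
    of one gap position. Only two rows are kept in memory.
    """
    best = 0
    prev = [0] * (len(b) + 1)            # s_{0,j} = 0
    for x in a:
        curr = [0]                       # s_{i,0} = 0
        for j, y in enumerate(b, start=1):
            cell = max(
                0,                            # "free ride": restart the alignment here
                prev[j - 1] + delta(x, y),    # align x and y
                prev[j] + sigma,              # x against a gap
                curr[j - 1] + sigma,          # y against a gap
            )
            curr.append(cell)
            best = max(best, cell)       # the optimum may end in any cell
        prev = curr
    return best


def toy_delta(x, y):
    """Assumed stand-in scoring: +2 for a match, -1 for a mismatch."""
    return 2 if x == y else -1
```

Because every cell is floored at zero, unrelated flanking letters do not lower the score: `smith_waterman("XXACGTXX", "ACGT", toy_delta, -2)` scores only the shared `ACGT` segment.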
Smith-Waterman, BLOSUM62, open gap -11, extend gap -11 (dynamic programming score matrix for the example; cell layout not recoverable from the extracted text)
Heuristic approach • O(|a||b|) is expensive • → heuristic approach • BLAST (Basic Local Alignment Search Tool) • Remove low-complexity regions • Generate all n-grams from the query sequence • Compute the similarity between every possible sequence of length n and each n-gram from the previous step • Filter out sequences with similarity lower than a cut-off score • Exact match of the remaining (high-scoring) sequences (organized in a search tree) against the DB • Connect matched high-scoring sequences within a given distance by gapped alignment and extend them → high-scoring pairs (HSPs) • HSPs with a score under a given threshold are excluded • Remaining sequences are aligned with the original query sequence by the Smith-Waterman algorithm
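The n-gram and neighbourhood steps of this outline can be sketched roughly as below. This is not BLAST's actual implementation: the function names, the toy scoring function, and the cut-off value are assumptions for illustration, and a real run would score words with a substitution matrix over the 20-letter amino acid alphabet.

```python
from itertools import product


def ngrams(seq, n=3):
    """All overlapping n-grams of the query sequence, in order."""
    return [seq[i:i + n] for i in range(len(seq) - n + 1)]


def neighbourhood(query, alphabet, score, n=3, cutoff=3):
    """Words of length n scoring at least `cutoff` against some query n-gram.

    Every possible word over `alphabet` is compared with every distinct
    n-gram of the query; low-scoring words are filtered out.
    """
    words = set()
    for gram in set(ngrams(query, n)):
        for candidate in product(alphabet, repeat=n):
            total = sum(score(x, y) for x, y in zip(gram, candidate))
            if total >= cutoff:
                words.add("".join(candidate))
    return words


def toy_score(x, y):
    """Assumed stand-in scoring: +2 for a match, -1 for a mismatch."""
    return 2 if x == y else -1
```

With the toy scores and a cut-off of 3, a 3-gram keeps itself (score 6) and every word that differs from it in exactly one position (score 3). The surviving words are what the later steps would look up in the database.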
Statistical Relevance • What is the probability that an alignment happened by chance? • Using statistics (distribution function) of ungapped local alignment • applying to gapped alignment (empirically tested) • E-value … expected number of alignments with score at least S between sequences of lengths m and n • E = K·m·n·e^(-λS) • K, λ … constants that depend on the distance matrix • Taking the size of the database N and the length of the query into account
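As a worked illustration, the E-value can be computed with the standard Karlin-Altschul expression E = K·m·n·e^(-λS). The function below is a sketch. Any concrete `K` and `lam` passed to it are placeholders chosen by the caller; the slides give no values for them.

```python
import math


def e_value(score, m, n, K, lam):
    """Expected number of chance alignments with score >= `score`.

    m and n are the lengths of the two sequences (for a database search,
    n is the total database length); K and lam depend on the scoring
    matrix and gap costs.
    """
    return K * m * n * math.exp(-lam * score)
```

The E-value shrinks exponentially as the score grows and scales linearly with query and database length, so the same score is less significant in a larger database.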
Metric Access Methods (MAM) • Given a metric, MAMs are used to organize objects • only promising groups of objects have to be searched while querying • MAMs use the metric function as a “black box” • (→ local alignment can be used) • Examples • M-tree, PM-tree, LAESA, vp-tree, GNAT, D-Index, … • Metric (∀ O_i, O_j, O_k ∈ U) • reflexivity: d(O_i, O_j) = 0 ⇔ O_i = O_j • positivity: d(O_i, O_j) > 0 ⇔ O_i ≠ O_j • symmetry: d(O_i, O_j) = d(O_j, O_i) • triangular inequality: d(O_i, O_j) + d(O_j, O_k) ≥ d(O_i, O_k)
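To show why the triangular inequality lets a MAM skip objects, here is a small pivot-based range query in Python, loosely in the spirit of LAESA. It is a simplified sketch and not the structure used in the experiments; all names are illustrative. It assumes `pivots` is non-empty and `d` is a true metric.

```python
def build_pivot_table(objects, pivots, d):
    """Precompute the distance from every object to every pivot."""
    return [[d(o, p) for p in pivots] for o in objects]


def range_query(query, radius, objects, pivots, table, d):
    """Return (hits, number of distance computations) for a range query.

    For any pivot p the triangular inequality gives
    |d(q, p) - d(o, p)| <= d(q, o), so an object whose lower bound
    already exceeds the radius is discarded without computing d(q, o).
    """
    q_to_p = [d(query, p) for p in pivots]
    computed = len(pivots)
    hits = []
    for obj, row in zip(objects, table):
        lower_bound = max(abs(qp - op) for qp, op in zip(q_to_p, row))
        if lower_bound > radius:
            continue                      # pruned by the pivots
        computed += 1
        if d(query, obj) <= radius:
            hits.append(obj)
    return hits, computed
```

With numbers on a line and one pivot at 0, a query at 42 with radius 5 touches only the object 40: one distance to the pivot plus one to the surviving candidate, instead of a distance to every object.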
Creating a metric • Which distance function to use? • Smith-Waterman • doesn’t take sequence length into account • no statistical relevance • E-value with SW • takes statistical relevance, query length and database length into account • standard in biological databases • problems • reflexivity • same sequences → E-value = 0 • symmetry • triangular inequality
TriGen algorithm • Turning a semi-metric (a metric without the triangular inequality) into a metric • applying triangle-generating (TG) modifiers (functions) to the original distance function • every concave similarity-preserving (SP) modifier is a TG modifier • increasing intrinsic dimensionality → decreasing index efficiency • TG-error tolerance • ratio of non-triangular triplets to all examined triplets • tolerance = 0 → exact search • tolerance > 0 → approximate search • tradeoff between correctness and efficiency
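The idea of a TG modifier and of the TG error can be illustrated with a toy semi-metric. In the sketch below, the squared difference of numbers violates the triangular inequality, and a concave modifier repairs it. The fractional-power form x^(1/(1+w)) is assumed here as one convenient concave function. The slides do not say which modifiers were used, and this is not the TriGen algorithm itself, only the quantities it works with.

```python
from itertools import combinations


def tg_error(objects, d, eps=1e-12):
    """Fraction of object triplets whose distances violate the triangular inequality."""
    bad = total = 0
    for x, y, z in combinations(objects, 3):
        a, b, c = sorted((d(x, y), d(y, z), d(x, z)))
        total += 1
        if a + b < c - eps:              # the two short sides do not cover the long one
            bad += 1
    return bad / total if total else 0.0


def fp_modifier(w):
    """Concave, increasing modifier f(x) = x ** (1 / (1 + w)) for w >= 0."""
    return lambda x: x ** (1.0 / (1.0 + w))


def modified(d, f):
    """Distance obtained by applying the modifier f to the values of d."""
    return lambda x, y: f(d(x, y))


def squared(x, y):
    """Toy semi-metric: reflexive, positive, symmetric, but not triangular."""
    return (x - y) ** 2
```

Since the modifier is increasing, the ordering of distances (and therefore the similarity ranking) is preserved. A more concave modifier removes more violations but pushes distances closer together, which is the efficiency cost noted on the slide.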
Experimental Results • Swissprot • subset of 3000 sequences (1,041,000 amino acids) • average sequence length 335 • maximal sequence length limited to 1000 • only 3% of sequences are longer → special treatment • Testing of • distance computations • computational costs • number of letter comparisons • TG error, real error
BLAST computational costs estimation • finding neighbouring sequences • 54 sequences (empirically) • → 81784 distance computations (comparisons) • → 245352 computational operations (3-grams) • average number of neighbouring words for query sequences – 54 • → search tree of height 6 • → 6 * 1,041,000 * 3 = 18,738,000 computational operations for every search
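The arithmetic behind these estimates can be checked directly. The sketch below only multiplies the factors quoted on the slide. The function and variable names are illustrative, and the interpretation of each factor follows the slide's wording.

```python
def blast_cost_estimate(distance_computations=81_784,
                        ngram_length=3,
                        tree_height=6,
                        database_letters=1_041_000):
    """Return (neighbourhood operations, search operations) as estimated on the slide."""
    # Each distance computation compares two 3-grams letter by letter.
    neighbourhood_ops = distance_computations * ngram_length
    # Each database position is matched against a search tree of the given height.
    search_ops = tree_height * database_letters * ngram_length
    return neighbourhood_ops, search_ops


neighbourhood_ops, search_ops = blast_cost_estimate()
print(neighbourhood_ops)  # 245352
print(search_ops)         # 18738000
```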
Experiments – E-value (two slides of plots; figures not preserved)
Experiments – Error tolerance (two plots, each labelled “TriGen error tolerance”; figures not preserved)
Conclusion • We have analyzed • standard current methods used for searching protein and nucleotide databases • We have implemented • indexing of protein sequences by MAMs • can be used for nucleotide sequences as well • Experimental results • have shown that using MAMs without search space modification does not result in a significant advantage over a sequential scan