
Lesson 3 Database Similarity Search


jod

Presentation Transcript


  1. Lesson 3 Database Similarity Search

  2. Sequence similarity search is a key to discovering new functions. Basic assumption: similar sequences → similar function. WHY? • Similar sequences have the required properties to undertake the function • They come from the same origin

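  3. Discover the function of a new sequence. [Diagram: a new sequence of unknown function (?) is searched against a sequence database; a similar (≈) database sequence suggests a similar function.]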

  4. Searching databases for similar sequences. Because the databases are so numerous and so large, using an exact algorithm to compare a sequence (the query) to every sequence in the databases is not feasible. Solution: use a heuristic (approximate) algorithm.

  5. Heuristic strategy • Use efficient search strategies • Preprocess the database into a new data structure to enable fast access

  6. BLAST: Basic Local Alignment Search Tool • General idea: a good alignment contains subsequences of high identity (local alignment): ACGCCCGGGAGCGC CTGGGCGTATAGCCC • First, identify (as efficiently as possible) short, almost exact matches. • Next, extend them to longer regions of similarity. • Finally, optimize the alignment using an exact algorithm. Altschul et al. 1990

  7. Similar to pairwise sequence alignment, BLAST can be used for DNA/RNA (nucleotide) sequences or for protein (amino acid) sequences • BLASTN (nucleotide) • BLASTP (protein)

  8. DNA/RNA vs protein alphabet • DNA (4): A T G C; A T = A G … • RNA (4): A U G C; A T = A G … • Protein (20): ACDEFGHIKLMNPQRSTVWY; A G >> A W … WHY is it different?

  9. The 20 Amino Acids

  10. The 20 Amino Acids A G W

  11. Scoring system for amino acid mismatches

  12. BLAST (Protein Sequence Example) 1. Identify (as efficiently as possible) short, almost exact matches between the query sequence and the database. Query sequence: …FSGTWYA… Words of length 3: FSG, SGT, GTW, TWY, WYA
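The word list on this slide can be reproduced with a sliding window. The following is a minimal sketch, not BLAST's actual implementation; the function name `words` and its parameter `k` are hypothetical names chosen for illustration.

```python
def words(sequence: str, k: int = 3) -> list[str]:
    """Return every overlapping word of length k, from left to right."""
    # A sequence of length n contains n - k + 1 overlapping words.
    return [sequence[i:i + k] for i in range(len(sequence) - k + 1)]


query_fragment = "FSGTWYA"  # the fragment shown on the slide
print(words(query_fragment))
```

For the seven-letter fragment on the slide this yields the five words FSG, SGT, GTW, TWY, WYA.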

  13. BLAST: preprocessing of the database • Seq 1 FSGTWYA: FSG, SGT, GTW, TWY, WYA • Seq 2 FDRTSYV: FDR, DRT, RTS, TSY, SYV • Seq 3 SWRTYVA: SWR, WRT, RTY, TYV, YVA • … BAG OF WORDS (BOW): FSG SGT GTW TWY WYA YSG TGT ATW SWY WFA FTG… SVT… GSW… TWF… WYS… (words point to the sequences containing them: Seq 1, Seq 102, Seq 3546)
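One way to picture the preprocessing step is as an inverted index from each word to the places where it occurs. The sketch below assumes a plain Python dictionary as the "new data structure" and indexes exact words only; the extra similar words shown on the slide (YSG, TGT, ATW, …) are not generated here. The names `build_word_index` and `database` are hypothetical.

```python
from collections import defaultdict


def build_word_index(database: dict[str, str], k: int = 3) -> dict[str, list[tuple[str, int]]]:
    """Map every word of length k to the (sequence name, offset) pairs where it occurs."""
    index = defaultdict(list)
    for name, seq in database.items():
        for pos in range(len(seq) - k + 1):
            index[seq[pos:pos + k]].append((name, pos))
    return dict(index)


# The three example sequences from the slide.
database = {"Seq 1": "FSGTWYA", "Seq 2": "FDRTSYV", "Seq 3": "SWRTYVA"}
index = build_word_index(database)

# Looking up a query word is now a single dictionary access
# instead of a scan over every database sequence.
print(index.get("GTW", []))
```

Building the index costs one pass over the database, after which each query word is looked up in roughly constant time.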

  14. BLAST Query sequence …FSGTWYA… Words of length 3: FSG, SGT, GTW, TWY, WYA… DATABASE FSG SGT GTW TWY WYA YSG TGT ATW SWY WFA FTG SVT GSW TWF WYS…. SEQ N INVIEIAFDGTWTCATTNAMHEWASNINETEEN

  15. BLAST 2. Extend word pairs as much as possible (no gaps) until the local alignment score meets or exceeds a threshold or cutoff score (t) → High-scoring Segment Pairs (HSPs) Q: FIRSTLINIHFSGTWYAAMESIRPATRICKREAD D: INVIEIAFDGTWTCATTNAMHEWASNINETEEN 3. Finally, optimize the alignment using an exact algorithm. Q = query sequence, D = sequence in database
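The ungapped extension step can be sketched on the Q and D sequences from this slide, starting from the shared word GTW. This is an illustration under stated assumptions, not BLAST's actual code: the scores (match +2, mismatch −1) are toy values standing in for a substitution matrix, and the stopping rule (give up once the running score falls `x_drop` below the best seen) is an assumed drop-off rule. All function and parameter names are hypothetical.

```python
def extend_one_way(query, subject, qi, si, step, match, mismatch, x_drop):
    """Walk from (qi, si) in direction `step`; return (best score gain, residues kept)."""
    gain = best_gain = 0
    length = kept = 0
    while 0 <= qi < len(query) and 0 <= si < len(subject):
        gain += match if query[qi] == subject[si] else mismatch
        length += 1
        if gain > best_gain:
            best_gain, kept = gain, length   # remember the best extension so far
        elif best_gain - gain >= x_drop:
            break                            # score dropped too far: stop extending
        qi += step
        si += step
    return best_gain, kept


def extend_seed(query, subject, q_pos, s_pos, k=3, match=2, mismatch=-1, x_drop=3):
    """Extend a k-letter seed in both directions without gaps.

    Returns (score, aligned query segment, aligned subject segment).
    """
    seed_score = sum(
        match if query[q_pos + i] == subject[s_pos + i] else mismatch for i in range(k)
    )
    right_gain, right_len = extend_one_way(
        query, subject, q_pos + k, s_pos + k, +1, match, mismatch, x_drop
    )
    left_gain, left_len = extend_one_way(
        query, subject, q_pos - 1, s_pos - 1, -1, match, mismatch, x_drop
    )
    q_start, s_start = q_pos - left_len, s_pos - left_len
    span = left_len + k + right_len
    score = seed_score + left_gain + right_gain
    return score, query[q_start:q_start + span], subject[s_start:s_start + span]


Q = "FIRSTLINIHFSGTWYAAMESIRPATRICKREAD"
D = "INVIEIAFDGTWTCATTNAMHEWASNINETEEN"
score, q_seg, d_seg = extend_seed(Q, D, Q.find("GTW"), D.find("GTW"))
print(score, q_seg, d_seg)
```

A segment pair found this way would be reported as an HSP only if its score reaches the cutoff; the cutoff value itself is not given on the slide.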

  16. Treating Gaps in BLAST >Human DNA CATGCGACTGACcgacgtcgatcgatacgactagctagcATCGATCATA >Human mRNA CATGCGACTGACATCGATCATA BLAST by definition is a local alignment tool

  17. Sometimes we want to include gaps in alignments! • Standard solution: affine gap model w_x = g + r(x − 1), where w_x: total gap penalty; g: gap open penalty; r: gap extend penalty; x: gap length • Once-off cost for opening a gap • Lower cost for extending the gap • Changes required to algorithm
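The affine gap formula is easy to check numerically. In the sketch below, the penalty values g = 11 and r = 1 are illustrative assumptions, not values taken from the slides, and the function name is hypothetical.

```python
def affine_gap_penalty(x: int, g: int = 11, r: int = 1) -> int:
    """Total penalty w_x = g + r * (x - 1) for a gap of length x (0 if there is no gap)."""
    if x <= 0:
        return 0
    return g + r * (x - 1)


# One long gap is cheaper than several short gaps of the same total length,
# because the opening cost g is paid only once.
one_long_gap = affine_gap_penalty(27)
three_short_gaps = 3 * affine_gap_penalty(9)
print(one_long_gap, three_short_gaps)
```

This is why the affine model suits cases like the DNA versus mRNA example, where a single long stretch is missing from one sequence.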

  18. Running BLAST to predict a function of a new protein >Arrestin protein (C. elegans) MFIANNCMPQFRWEDMPTTQINIVLAEPRCMAGEFFNAKVLLDSSDPDTVVHSFCAEIKG IGRTGWVNIHTDKIFETEKTYIDTQVQLCDSGTCLPVGKHQFPVQIRIPLNCPSSYESQF GSIRYQMKVELRASTDQASCSEVFPLVILTRSFFDDVPLNAMSPIDFKDEVDFTCCTLPF GCVSLNMSLTRTAFRIGESIEAVVTINNRTRKGLKEVALQLIMKTQFEARSRYEHVNEKK LAEQLIEMVPLGAVKSRCRMEFEKCLLRIPDAAPPTQNYNRGAGESSIIAIHYVLKLTAL PGIECEIPLIVTSCGYMDPHKQAAFQHHLNRSKAKVSKTEQQQRKTRNIVEENPYFR

  19. Running BLAST to predict a function of a new protein

  20. Running BLAST to predict a function of a new protein

  21. How to interpret a BLAST score: • The score is a measure of the similarity of the query to the sequence shown. How do we know if the score is significant? -Statistical significance -Biological significance

  22. How to interpret a BLAST search: For each BLAST score we can calculate an expectation value (E-value). The E-value is the number of alignments with scores greater than or equal to score S that are expected to occur by chance in a database search. (page 105)

  23. BLAST E-value: • Increases linearly with length of query sequence • Decreases exponentially with score of alignment • Increases linearly with length of database • m = length of query; n = length of database; s = score • K, λ: statistical parameters dependent upon scoring system and background residue frequencies
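The three properties on this slide match the standard Karlin-Altschul form E = K · m · n · e^(−λs); the formula itself does not survive in the transcript, so treat it here as an assumption that fits the listed behaviour. In the sketch below, the default values of `K` and `lam` are placeholders for illustration only, since the real values depend on the scoring system.

```python
import math


def e_value(s: float, m: int, n: int, K: float = 0.041, lam: float = 0.267) -> float:
    """Expected number of chance alignments scoring >= s: E = K * m * n * exp(-lam * s)."""
    return K * m * n * math.exp(-lam * s)


m, n = 350, 10_000_000  # hypothetical query and database lengths

# Doubling the query or database length doubles E;
# raising the score shrinks E exponentially.
for s in (40, 60, 80):
    print(s, e_value(s, m, n), e_value(s, 2 * m, n))
```

Because E grows with database size, the same alignment score can be significant in a small database and unremarkable in a large one.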

  24. What is a good E-value? (Rule of thumb) • E-values of less than 0.00001 show that sequences are almost always related. • Greater E-values can represent functional relationships as well. • Sometimes a real (biological) match has an E-value > 1 • Sometimes a similar E-value occurs for a short exact match and a long, less exact match

  25. How to interpret a BLAST search: • The score is a measure of the similarity of the query to the sequence shown. How do we know if the score is significant? -Statistical significance -Biological significance

  26. (How) can we decide if two sequences really have the same function? Homolog = come from a common origin => have the same function

  27. Homologous proteins = come from a common origin => have the same function Last Universal Common Ancestor

  28. Homology rule of thumb: • Proteins are homologous if 25%-35% identical • DNA sequences are homologous if 70% identical Can we always go by the rules?
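Percent identity, the quantity this rule of thumb relies on, can be computed from two aligned sequences. The sketch below assumes both strings come from the same alignment (equal length, with `-` marking gaps) and divides by the number of alignment columns; other tools may use a different denominator. The function name is hypothetical.

```python
def percent_identity(aligned_a: str, aligned_b: str) -> float:
    """Percentage of alignment columns where both sequences carry the same residue."""
    if len(aligned_a) != len(aligned_b):
        raise ValueError("aligned sequences must have the same length")
    if not aligned_a:
        return 0.0
    # A column counts as identical only if the residues match and neither is a gap.
    identical = sum(a == b and a != "-" for a, b in zip(aligned_a, aligned_b))
    return 100.0 * identical / len(aligned_a)


# Two seven-residue fragments taken from the Q and D example sequences of the BLAST slides.
print(round(percent_identity("FSGTWYA", "FDGTWTC"), 1))
```

Short fragments can show high identity purely by chance, which is one reason identity thresholds are only a rule of thumb.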

  29. Alignment between the worm and human arrestin VERY SIGNIFICANT , NOT HIGH IDENTITY

  30. Assessing whether proteins are functionally homologous • High levels of RBP4 (retinol binding protein 4) and PAEP (pregnancy associated protein) were found to be correlated with pre-eclampsia • High levels of RBP4 were found to be correlated with childhood obesity • RBP4 = carrier of vitamin A in the blood • PAEP = pregnancy associated protein

  31. Are they functionally homologous??? PAEP RBP4

  32. Assessing whether proteins are functionally homologous: RBP4 (retinol binding) and PAEP (pregnancy protein) E-value = 0.49; identity = 24% Are they functionally homologous???

  33. The lipocalin protein family (each dot is a protein) PAEP RBP4 retinol-binding protein odorant-binding protein apolipoprotein D

  34. Are they functionally homologous??? PAEP RBP4 They belong to the same protein family = have a common ancestor. Their functions have probably diverged.
