Lesson 3 Database Similarity Search
Sequence similarity search is a key to discovering new functions. Basic assumption: similar sequences => similar function. WHY? • They have the properties required to carry out the function • They come from the same origin
Discovering the function of a new sequence: new sequence (function ?) ≈ similar sequence found in the sequence database => similar function
Searching databases for similar sequences: because the databases are so numerous and so large, comparing a sequence (query) to all sequences in the databases with an exact algorithm is not feasible. Solution: use a heuristic (approximate) algorithm
Heuristic strategy • Use efficient search strategies • Preprocess the database into a new data structure that enables fast access
BLAST: Basic Local Alignment Search Tool • General idea: a good alignment contains subsequences of high identity (local alignment): ACGCCCGGGAGCGC CTGGGCGTATAGCCC • First, identify (as efficiently as possible) short, almost exact matches. • Next, extend them to longer regions of similarity. • Finally, optimize the alignment using an exact algorithm. Altschul et al. 1990
As with pairwise sequence alignment, BLAST can be used for DNA/RNA (nucleotide) sequences or for protein (amino acid) sequences • BLASTN (Nucleotide) • BLASTP (Protein)
DNA/RNA vs protein alphabet
DNA (4): A T G C; substitution scores: A→T = A→G ….
RNA (4): A U G C; substitution scores: A→U = A→G ….
Protein (20): ACDEFGHIKLMNPQRSTVWY; substitution scores: A→G >> A→W ….
WHY is it different?
The 20 Amino Acids A G W
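The contrast between the two alphabets can be sketched in a few lines. This is only an illustration: the nucleotide match/mismatch values are assumed, and the three protein scores are an excerpt of the widely used BLOSUM62 matrix rather than a full scoring system.

```python
def nucleotide_score(a, b, match=1, mismatch=-1):
    """Score two nucleotides: every mismatch costs the same (assumed values)."""
    return match if a == b else mismatch


# Excerpt of BLOSUM62 for the three residues on the slide (A, G, W).
# Keys are unordered pairs, so (A, G) and (G, A) share one entry.
BLOSUM62_EXCERPT = {
    frozenset("A"): 4,    # A aligned with A
    frozenset("AG"): 0,   # small residue replaced by another small residue
    frozenset("AW"): -3,  # small residue replaced by a large aromatic one
}


def protein_score(a, b):
    """Score two amino acids using the excerpt above."""
    return BLOSUM62_EXCERPT[frozenset((a, b))]


print(nucleotide_score("A", "T"), nucleotide_score("A", "G"))
print(protein_score("A", "G"), protein_score("A", "W"))
```

For nucleotides both mismatches score the same, while for proteins the A/G substitution scores higher than A/W.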
BLAST(Protein Sequence Example) 1. Identify (most efficiently) short almost exact matches between the query sequence and the database. Query sequence…FSGTWYA… Words of length 3: FSG, SGT, GTW, TWY, WYA
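Splitting the query into overlapping words can be sketched as follows. The function name `words` is hypothetical and the word length of 3 is taken from the slide.

```python
def words(sequence, k=3):
    """Return every overlapping word of length k, in order of position."""
    return [sequence[i:i + k] for i in range(len(sequence) - k + 1)]


print(words("FSGTWYA"))
# ['FSG', 'SGT', 'GTW', 'TWY', 'WYA']
```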
BLAST: preprocessing of the database
Seq 1 FSGTWYA: FSG, SGT, GTW, TWY, WYA
Seq 2 FDRTSYV: FDR, DRT, RTS, TSY, SYV
Seq 3 SWRTYVA: SWR, WRT, RTY, TYV, YVA
…….
BAG OF WORDS (BOW); each word points to the sequences that contain it (Seq 1, Seq 102, Seq 3546, ….)
FSG SGT GTW TWY WYA
YSG TGT ATW SWY WFA
FTG SVT GSW TWF WYS ….
BLAST: matching the query against the database
Query sequence: …FSGTWYA…
Words of length 3: FSG, SGT, GTW, TWY, WYA…
DATABASE (bag of words)
FSG SGT GTW TWY WYA
YSG TGT ATW SWY WFA
FTG SVT GSW TWF WYS ….
SEQ N: INVIEIAFDGTWTCATTNAMHEWASNINETEEN
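One possible data structure for the preprocessed database is a dictionary that maps every word to the sequences and positions where it occurs. This is a minimal sketch using the three example sequences; the name `build_word_index` is hypothetical, and it indexes exact words only.

```python
from collections import defaultdict


def build_word_index(database, k=3):
    """Map each word of length k to a list of (sequence name, position)."""
    index = defaultdict(list)
    for name, seq in database.items():
        for pos in range(len(seq) - k + 1):
            index[seq[pos:pos + k]].append((name, pos))
    return index


database = {"Seq 1": "FSGTWYA", "Seq 2": "FDRTSYV", "Seq 3": "SWRTYVA"}
index = build_word_index(database)
print(index["GTW"])  # [('Seq 1', 2)]
```

The index is built once, so each later query word costs a single dictionary lookup instead of a scan over the whole database.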
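Finding the seed matches between a query and one database sequence can be sketched as below. The function name `find_seeds` is hypothetical, and the sketch looks for exact word matches only, without the similar words that the bag of words also lists.

```python
from collections import defaultdict


def find_seeds(query, subject, k=3):
    """Return (query_pos, subject_pos) pairs where a word of length k matches."""
    positions = defaultdict(list)
    for j in range(len(subject) - k + 1):
        positions[subject[j:j + k]].append(j)
    seeds = []
    for i in range(len(query) - k + 1):
        for j in positions.get(query[i:i + k], []):
            seeds.append((i, j))
    return seeds


query = "FIRSTLINIHFSGTWYAAMESIRPATRICKREAD"
subject = "INVIEIAFDGTWTCATTNAMHEWASNINETEEN"
print(find_seeds(query, subject))  # [(12, 9)], the shared word GTW
```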
BLAST 2. Extend word pairs as far as possible (no gaps) until the local alignment score meets or exceeds a threshold or cutoff score (t): High-scoring Segment Pairs (HSPs)
Q: FIRSTLINIHFSGTWYAAMESIRPATRICKREAD
D: INVIEIAFDGTWTCATTNAMHEWASNINETEEN
3. Finally, optimize the alignment using an exact algorithm.
Q = query sequence, D = sequence in database
Treating gaps in BLAST
>Human DNA
CATGCGACTGACcgacgtcgatcgatacgactagctagcATCGATCATA
>Human mRNA
CATGCGACTGACATCGATCATA
BLAST is by definition a local alignment tool
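The ungapped extension step can be sketched as follows. Several things here are assumptions made for illustration: the simple match/mismatch scores (real protein searches use a substitution matrix), the stopping rule (give up once the running score falls `x_drop` below the best score seen), and the names `extend_seed` and `_extend`.

```python
def _extend(query, subject, q, s, step, match, mismatch, x_drop):
    """Walk from (q, s) in direction step; return (best gain, best length)."""
    running = best = length = best_length = 0
    while 0 <= q < len(query) and 0 <= s < len(subject):
        running += match if query[q] == subject[s] else mismatch
        length += 1
        if running > best:
            best, best_length = running, length
        elif best - running >= x_drop:
            break  # score dropped too far below the best: stop extending
        q += step
        s += step
    return best, best_length


def extend_seed(query, subject, q_pos, s_pos, k=3,
                match=2, mismatch=-1, x_drop=2):
    """Extend an exact word hit in both directions without gaps."""
    right, right_len = _extend(query, subject, q_pos + k, s_pos + k, 1,
                               match, mismatch, x_drop)
    left, left_len = _extend(query, subject, q_pos - 1, s_pos - 1, -1,
                             match, mismatch, x_drop)
    q_start, s_start = q_pos - left_len, s_pos - left_len
    length = left_len + k + right_len
    score = k * match + left + right  # the seed itself is k exact matches
    return (score,
            query[q_start:q_start + length],
            subject[s_start:s_start + length])


query = "FIRSTLINIHFSGTWYAAMESIRPATRICKREAD"
subject = "INVIEIAFDGTWTCATTNAMHEWASNINETEEN"
print(extend_seed(query, subject, 12, 9))
```

With these assumed scores the GTW seed grows leftwards into the segment pair INIHFSGTW / IEIAFDGTW with a score of 9. The segment would then be kept as an HSP only if that score reaches the cutoff.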
Sometimes we want to include gaps in alignments! • Standard solution: affine gap model $w_x = g + r(x - 1)$, where $w_x$: total gap penalty; $g$: gap open penalty; $r$: gap extend penalty; $x$: gap length • One-off cost for opening a gap • Lower cost for extending the gap • Changes required to the algorithm
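The affine gap penalty is easy to evaluate directly. In this sketch the default values g = 10 and r = 1 are assumptions chosen only for illustration.

```python
def affine_gap_penalty(x, gap_open=10, gap_extend=1):
    """Total penalty w_x = g + r * (x - 1) for a gap of length x >= 1."""
    if x < 1:
        raise ValueError("gap length must be at least 1")
    return gap_open + gap_extend * (x - 1)


for length in (1, 2, 5, 27):
    print(length, affine_gap_penalty(length))
```

With these values a gap of length 27 costs 36, whereas charging the opening cost for every position would cost 270. One long gap, such as an intron missing from an mRNA, is therefore penalised far less than many separate short gaps.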
Running BLAST to predict the function of a new protein
>Arrestin protein (C. elegans)
MFIANNCMPQFRWEDMPTTQINIVLAEPRCMAGEFFNAKVLLDSSDPDTVVHSFCAEIKG
IGRTGWVNIHTDKIFETEKTYIDTQVQLCDSGTCLPVGKHQFPVQIRIPLNCPSSYESQF
GSIRYQMKVELRASTDQASCSEVFPLVILTRSFFDDVPLNAMSPIDFKDEVDFTCCTLPF
GCVSLNMSLTRTAFRIGESIEAVVTINNRTRKGLKEVALQLIMKTQFEARSRYEHVNEKK
LAEQLIEMVPLGAVKSRCRMEFEKCLLRIPDAAPPTQNYNRGAGESSIIAIHYVLKLTAL
PGIECEIPLIVTSCGYMDPHKQAAFQHHLNRSKAKVSKTEQQQRKTRNIVEENPYFR
How to interpret a BLAST score: • The score is a measure of the similarity of the query to the sequence shown. How do we know if the score is significant? -Statistical significance -Biological significance
How to interpret a BLAST search: For each BLAST score we can calculate an expectation value (E-value). The E-value is the number of alignments with scores greater than or equal to score S that are expected to occur by chance in a database search. page 105
BLAST E-value: $E = K m n e^{-\lambda S}$ • Increases linearly with the length of the query sequence • Decreases exponentially with the score of the alignment • Increases linearly with the length of the database • m = length of query; n = length of database; S = score • K, λ: statistical parameters dependent upon the scoring system and background residue frequencies
What is a good E-value? (Rule of thumb) • E-values of less than 0.00001 show that sequences are almost always related. • Greater E-values can represent functional relationships as well. • Sometimes a real (biological) match has an E-value > 1 • Sometimes a similar E-value occurs for a short exact match and for a long, less exact match
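The three dependencies listed above correspond to the relation E = K·m·n·e^(-λS), which can be sketched as below. The numbers passed for K, λ, m, n and S are placeholders for illustration only; real values of K and λ depend on the scoring system, as the slide notes.

```python
import math


def e_value(score, m, n, K, lam):
    """Expected number of chance alignments scoring at least `score`."""
    return K * m * n * math.exp(-lam * score)


# Placeholder parameters, not taken from any particular search.
K, lam = 0.041, 0.267
base = e_value(score=100, m=350, n=1_000_000, K=K, lam=lam)
print(base)
print(e_value(score=100, m=700, n=1_000_000, K=K, lam=lam) / base)  # query twice as long
print(e_value(score=110, m=350, n=1_000_000, K=K, lam=lam) / base)  # score 10 higher
```

Doubling the query length doubles E, while adding 10 to the score multiplies E by e^(-10λ).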
How to interpret a BLAST search: • The score is a measure of the similarity of the query to the sequence shown. How do we know if the score is significant? -Statistical significance -Biological significance
(How) can we decide if two sequences really have the same function? Homologs = sequences that come from a common origin => have the same function
Homologous proteins = come from a common origin => have the same function Last Universal Common Ancestor
Homology rule of thumb: • Proteins are homologous if they are 25%-35% identical • DNA sequences are homologous if they are 70% identical. Can we always go by the rules?
Alignment between the worm and human arrestins: VERY SIGNIFICANT, NOT HIGH IDENTITY
Assessing whether proteins are functionally homologous • High levels of the proteins RBP4 (retinol binding protein 4) and PAEP (pregnancy associated protein) were found to be correlated with pre-eclampsia • High levels of RBP4 were found to be correlated with childhood obesity • RBP4 = carrier of vitamin A in the blood • PAEP = pregnancy associated protein
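Percent identity for two already-aligned sequences can be computed as sketched here. The function name is hypothetical, and the choice of denominator (the full alignment length, gap columns included) is an assumption; other conventions exist and give different percentages.

```python
def percent_identity(aligned_a, aligned_b):
    """Percentage of alignment columns holding the same residue in both rows."""
    if len(aligned_a) != len(aligned_b) or not aligned_a:
        raise ValueError("aligned sequences must be non-empty and equally long")
    identical = sum(a == b and a != "-" for a, b in zip(aligned_a, aligned_b))
    return 100 * identical / len(aligned_a)


print(percent_identity("FSGTW", "FDGTW"))  # 80.0
```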
Are they functionally homologous??? PAEP RBP4
Assessing whether proteins are functionally homologous: RBP4 (retinol binding) and PAEP (pregnancy protein): E-value = 0.49; identity = 24%. Are they functionally homologous???
The lipocalin protein family (each dot is a protein): PAEP, RBP4, retinol-binding protein, odorant-binding protein, apolipoprotein D
Are they functionally homologous??? PAEP, RBP4: They belong to the same protein family = they have a common ancestor. Their functions have probably diverged.