410 likes | 583 Views
B ioinform atics Alignment of biological sequences - databases and software. LU, 204, Juris V i ksna. Praksē lietotās virkņu salīdzināšanas metodes. Smith-Waterman algoritms (1981) Lokālā salīdzināšana Lineāras gap penalties Lieto substitūciju matricas.
E N D
Bioinformatics Alignment of biological sequences - databases and software LU, 204,JurisViksna
Praksē lietotās virkņu salīdzināšanas metodes Smith-Waterman algoritms (1981) Lokālā salīdzināšana Lineāras gap penalties Lieto substitūciju matricas Vai laiks (n2) ir praktiski pieņemams? Proteīna garums - ap 100 aminoskābju... Divu proteīnu salīdzināšana: 10000 op. (0.01 sek, pie 1 MHz) Proteīnu datubāze - 1000000 ieraksti... Meklēšana datubāzē: 108 op, 100 sek Divu datubāzu salīdzināšana - 1016 op, 25 gadi :(
Heiristiskas metodes - FASTA • Hash table of short words in the query sequence • Go through DB and look for matches in the query hash (linear in size of DB) • K-tuple determines word size (k-tup 1 is single aa) • Lipman & Pearson 1985 VLICT = _ VLICTAVLMVLICTAAAVLICTMSDFFD [Adapted from M.Gerstein]
The FASTA Algorithm 4 steps: • use lookup table to find all identities at least ktup long, find regions of identities (Fig.1A) • rescan 10 regions (diagonals) with highest density of identities using PAM250 (Fig.1B) • join regions if possible without decreasing score below threshold (Fig.1C) • rescore ala Smith-Waterman 32 residues around initial region (Note: doesn’t save alignment) (Fig.1D)
FASTA Parameters • ktup = 2 for proteins, 6 for DNA • init1 Score after rescanning with PAM250 (or other) • initn Score after joining regions • opt Score after Smith-Waterman
FASTA algoritms [Fig.1 of Pearson and Lipman 1988]
FASTA algoritms [Adapted from D Brutlag]
Heiristiskas metodes - Blast • Altschul, S., Gish, W., Miller, W., Myers, E. W. & Lipman, D. J. (1990). Basic local alignment search tool. J. Mol. Biol.215, 403-410 • Indexes query and DB • Starts with all overlapping words from query • Calculates “neighborhood” of each word using PAM matrix and probability threshold matrix and probability threshold • Looks up all words and neighbors from query in database index • Extends High Scoring Pairs (HSPs) left and right to maximal length • Finds Maximal Segment Pairs (MSPs) between query and database • Blast 1 does not permit gaps in alignments [Adapted from M.Gerstein]
BLAST algoritms • Keyword search of all words of length w from the in the query of length n in database of length m with score above threshold • w = 11 for nucleotide queries, 3 for proteins • Do local alignment extension for each found keyword • Extend result until longest match above threshold is achieved • Running time O(nm) [Adapted from S.Daudenarde]
BLAST algoritms keyword Query: KRHRKVLRDNIQGITKPAIRRLARRGGVKRISGLIYEETRGVLKIFLENVIRD GVK 18 GAK 16 GIK 16 GGK 14 GLK 13 GNK 12 GRK 11 GEK 11 GDK 11 Neighborhood words neighborhood score threshold (T = 13) extension Query: 22 VLRDNIQGITKPAIRRLARRGGVKRISGLIYEETRGVLK 60 +++DN +G + IR L G+K I+ L+ E+ RG++K Sbjct: 226 IIKDNGRGFSGKQIRNLNYGIGLKVIADLV-EKHRGIIK 263 High-scoring Pair (HSP) [Adapted from S.Daudenarde]
Original BLAST • Dictionary • All words of length w • Alignment • Ungapped extensions until score falls below some threshold • Output • All local alignments with score > statistical threshold [Adapted from S.Daudenarde]
Original BLAST: Example • w = 4 • Exact keyword match of GGTC • Extend diagonals with mismatches until score is under 50% • Output result: • GTAAGGTCC • GTTAGGTCC A C G A A G T A A G G T C C A G T C T G A T C C T G G A T T G C G A From lectures by Serafim Batzoglou (Stanford)
Gapped BLAST: Example A C G A A G T A A G G T C C A G T • Original BLAST exact keyword search, THEN: • Extend with gaps in a zone around ends of exact match until score < threshold then merge nearby alignments • Output result: GTAAGGTCC-AGT GTTAGGTCCTAGT C T G A T C C T G G A T T G C G A From lectures by Serafim Batzoglou (Stanford)
BLAST - Programmas blastp compares an amino acid query sequence against a protein sequence database blastn compares a nucleotide query sequence against a nucleotide sequence database blastx compares a nucleotide query sequence translated in all reading frames against a protein sequence database tblastn compares a protein query sequence against a nucleotide sequence database dynamically translated in all reading frames tblastx compares the six-frame translations of a nucleotide query sequence against the six-frame translations of a nucleotide sequence database. Please note that tblastx is extremely slow and cpu-intensive
Proteīnu virkņu formāti - FASTA >sp|P00156|CYB_HUMAN Cytochrome b - Homo sapiens (Human). MTPMRKINPLMKLINHSFIDLPTPSNISAWWNFGSLLGACLILQITTGLFLAMHYSPDAS TAFSSIAHITRDVNYGWIIRYLHANGASMFFICLFLHIGRGLYYGSFLYSETWNIGIILL LATMATAFMGYVLPWGQMSFWGATVITNLLSAIPYIGTDLVQWIWGGYSVDSPTLTRFFT FHFILPFIIAALATLHLLFLHETGSNNPLGITSHSDKITFHPYYTIKDALGLLLFLLSLM TLTLFSPDLLGDPDNYTLANPLNTPPHIKPEWYFLFAYTILRSVPNKLGGVLALLLSILI LAMIPILHMSKQQSMMFRPLSQSLYWLLAADLLILTWIGGQPVSYPFTIIGQVASVLYFT TILILMPTISLIENKMLKWA
Proteīnu virkņu formāti - SwissProt ID CYB_HUMAN STANDARD; PRT; 380 AA. AC P00156; Q34786; Q8HBR6; Q8HNQ0; Q8HNQ1; Q8HNQ9; Q8HNR4; Q8HNR7; AC Q8W7V8; Q8WCV9; Q8WCY2; Q8WCY7; Q8WCY8; Q9B1A6; Q9B1B6; Q9B1B8; AC Q9B2X7; Q9B2X9; Q9B2Y3; Q9B2Z0; Q9B2Z4; Q9T6H6; Q9T9Y0; Q9TEH4; DT 21-JUL-1986, integrated into UniProtKB/Swiss-Prot. DT 21-JUL-1986, sequence version 1. DT 19-SEP-2006, entry version 75. DE Cytochrome b. GN Name=MT-CYB; Synonyms=COB, CYTB, MTCYB; OS Homo sapiens (Human). OG Mitochondrion. OC Eukaryota; Metazoa; Chordata; Craniata; Vertebrata; Euteleostomi; OC Mammalia; Eutheria; Euarchontoglires; Primates; Haplorrhini; OC Catarrhini; Hominidae; Homo. OX NCBI_TaxID=9606; RN [1] RP NUCLEOTIDE SEQUENCE. RC TISSUE=Placenta; RX MEDLINE=81173052; PubMed=7219534 [NCBI, ExPASy, EBI, Israel]; RA Anderson S., Bankier A.T., Barrell B.G., de Bruijn M.H.L., RA Coulson A.R., Drouin J., Eperon I.C., Nierlich D.P., Roe B.A., RA Sanger F., Schreier P.H., Smith A.J.H., Staden R., Young I.G.; RT "Sequence and organization of the human mitochondrial genome."; RL Nature 290:457-465(1981).
Proteīnu virkņu formāti - SwissProt FT VARIANT 360 360 T -> A. FT /FTId=VAR_013667. FT VARIANT 368 368 T -> I. FT /FTId=VAR_013668. SQ SEQUENCE 380 AA; 42730 MW; 90097B07FF2C1FD8 CRC64; MTPMRKINPL MKLINHSFID LPTPSNISAW WNFGSLLGAC LILQITTGLF LAMHYSPDAS TAFSSIAHIT RDVNYGWIIR YLHANGASMF FICLFLHIGR GLYYGSFLYS ETWNIGIILL LATMATAFMG YVLPWGQMSF WGATVITNLL SAIPYIGTDL VQWIWGGYSV DSPTLTRFFT FHFILPFIIA ALATLHLLFL HETGSNNPLG ITSHSDKITF HPYYTIKDAL GLLLFLLSLM TLTLFSPDLL GDPDNYTLAN PLNTPPHIKP EWYFLFAYTI LRSVPNKLGG VLALLLSILI LAMIPILHMS KQQSMMFRPL SQSLYWLLAA DLLILTWIGG QPVSYPFTII GQVASVLYFT TILILMPTIS LIENKMLKWA
Proteīnu virkņu datubāzes - UniProt http://www.expasy.uniprot.org/
Proteīnu virkņu datubāzes - PIR http://pir.georgetown.edu/
Proteīnu virkņu datubāzes - SwissProt http://www.expasy.org/sprot/
Proteīnu struktūru datubāzes - PDB http://www.pdb.org
Salīdzināšanas programmas - BLAST http://www.ncbi.nlm.nih.gov/BLAST/ http://www.ebi.ac.uk/blastall/
Salīdzināšanas programmas - FASTA http://fasta.bioch.virginia.edu/ http://www.ebi.ac.uk/fasta33/
Salīdzināšanas programmas - Smith-Waterman http://www.ebi.ac.uk/MPsrch/
Salīdzinājumu novērtēšana S = Total Score S(i,j) = similarity matrix score for aligning i and j Sum is carried out over all aligned i and j n = number of gaps (assuming no gap ext. penalty) G = gap penalty Simplest score (for identity matrix) is S = # matches What does a Score of 10 mean? What is the Right Cutoff? [Adapted from M.Gerstein]
Assessing sequence similarity • Need to know how strong an alignment can be expected from chance alone • “Chance” is the comparison of • Real but non-homologous sequences • Real sequences that are shuffled to preserve compositional properties • Sequences that are generated randomly based upon a DNA or protein sequence model (favored) [Adapted from S.Daudenarde]
Model Random Sequence • Necessary to evaluate the score of a match • Take into account background • Adjust for G+C content • Poly-A tails • “Junk” sequences • Codon bias [Adapted from S.Daudenarde]
E-Values E-Value Expectation value cutoff. An E-Value of 1 means that one would expect to find at most 1 alignment with a score as high as reported in a search of the given database. Settingthe E-value higher will report more sequencematches Resp., jo mazāka, jo labāk... Rēķina apmēram šādi: E = mn2-S’, kur m,n - virkņu garumi, S' - rēķina, piem., šādi: S’ = lS–ln(K)ln(2)
P-Values P-Value Varbūtība, ka šāds (vai labāks) rezultāts tiks iegūts "nejaušam" proteīnam Arī, jo mazāka, jo labāk...
P-Values • According to Poison’s distribution, the probability of finding b HSPs with a scoreS is given by: • (e-EEb)/b! • For b = 0, that chance is: • e-E • Thus the probability of finding at least one such HSP is: • P = 1 – e-E [Adapted from S.Daudenarde]
P-Values • P(s > S) = .01 • P-value of .01 occurs at score threshold S (392 below) where score s from random comparison is greater than this threshold 1% of the time • Likewise for P=.001 and so on. [Adapted from M.Gerstein]
Choose between local or global search algorithms Use most sensitive search algorithm available Original BLAST for no gaps Smith-Waterman for most sensitivity FASTA with k-tuple 1 is a good compromise Gapped BLAST for well delimited regions PSI-BLAST for families Initially BLOSUM62 and default gap penalties If no significant results, use BLOSUM30 and lower gap penalties FASTA cutoff of .01 Blast cutoff of .0001 Examine results between exp. 0.05 and 10 for biological significance Ensure expected score is negative Beware of hits on long sequences or hits with unusual aa composition Reevaluate results of borderline significance using limited query region Segment long queries ³ 300 amino acids Segment around known motifs General Protein Search Principles (some text adapted from D Brutlag)
ROC diagrammas Uzskatīsim, ka proteīni ir homologi, ja “līdzība” parsniedz kaut kadu slieksni t true positives (tp) - s(a,b) t un a, b ir homologi false positives (fp) - s(a,b) t un a, b nav homologi true negatives (tn) - s(a,b) < t un a, b nav homologi false negatives (fn) - s(a,b) < t un a, b ir homologi Sensitivity = tp/n = tp/(tp+fn) Specificity = tn/n = tn(tn+fp)
ROC diagrammas 100% 100% Coverage (roughly, fraction of sequences that one confidently “says something” about) Thresh=10 Thresh=20 [sensitivity=tp/n=tp/(tp+fn)] Thresh=30 Different score thresholds Error rate (fraction of the “statements” that are false positives) Two “methods” (red is more effective) [Specificity = tn/n =tn/(tn+fp)] error rate = 1-specificity = fp/n [Adapted from M.Gerstein]