880 likes | 1.19k Views
Lecture 5: Searching Sequence Databases Eric C. Rouchka, D.Sc. eric.rouchka@uofl.edu http://kbrin.a-bldg.louisville.edu/~rouchka/CECS694/. Multiple Alignment Formats. Formats for storing multiple alignments are specified FASTA, GCG MSF, ALN, etc. FASTA Format.
E N D
Lecture 5: Searching Sequence Databases Eric C. Rouchka, D.Sc. eric.rouchka@uofl.edu http://kbrin.a-bldg.louisville.edu/~rouchka/CECS694/ CECS 694-02 Introduction to Bioinformatics University of Louisville Spring 2003 Dr. Eric Rouchka
Multiple Alignment Formats • Formats for storing multiple alignments are specified • FASTA, GCG MSF, ALN, etc CECS 694-02 Introduction to Bioinformatics University of Louisville Spring 2003 Dr. Eric Rouchka
FASTA Format • Each sequence begins with a description line ‘>’ • Sequence data follows, with gap character ‘-’ CECS 694-02 Introduction to Bioinformatics University of Louisville Spring 2003 Dr. Eric Rouchka
Example Fasta sequence >JC2395 NVSDVNLNK---YIWRTAEKMK---ICDAKKFARQHKIPESKIDEIEHNSPQDAAE---- -------------------------QKIQLLQCWYQSHGKT—GACQALIQGLRKANRCDI AEEIQAM >KPEL_DROME MAIRLLPLPVRAQLCAHLDAL-----DVWQQLATAVKLYPDQVEQISSQKQRGRS----- -------------------------ASNEFLNIWGGQYN----HTVQTLFALFKKLKLHN AMRLIKDY >FASA_MOUSE NASNLSLSK---YIPRIAEDMT---IQEAKKFARENNIKEGKIDEIMHDSIQDTAE---- -------------------------QKVQLLLCWYQSHGKS--DAYQDLIKGLKKAECRR TLDKFQDM CECS 694-02 Introduction to Bioinformatics University of Louisville Spring 2003 Dr. Eric Rouchka
Stockholm Format • Features in a multiple alignment are annotated using a ‘magic’ label • GF: Generic per-File annotation • GC: Generic per-Column annotation • GS: Generic per-Sequence annotation • GR: Generic per sequence and per column markup • Used by PFAM, HMMER, Belvu CECS 694-02 Introduction to Bioinformatics University of Louisville Spring 2003 Dr. Eric Rouchka
Example Stockholm Sequence • http://www.cgr.ki.se/cgr/groups/sonnhammer/Stockholm.html CECS 694-02 Introduction to Bioinformatics University of Louisville Spring 2003 Dr. Eric Rouchka
GCG Multiple Sequence Format (MSF) • The beginning of the file is a header describing the sequences • Header may be formatted to contain specific information • Header ends with a “//” CECS 694-02 Introduction to Bioinformatics University of Louisville Spring 2003 Dr. Eric Rouchka
!!AA_MULTIPLE_ALIGNMENT 1.0 msf MSF: 131 Type: P 22/01/02 CompCheck: 3003 .. Name: IXI_234 Len: 131 Check: 6808 Weight: 1.00 Name: IXI_235 Len: 131 Check: 4032 Weight: 1.00 Name: IXI_236 Len: 131 Check: 2744 Weight: 1.00 Name: IXI_237 Len: 131 Check: 9419 Weight: 1.00 // 1 50 IXI_234 TSPASIRPPAGPSSRPAMVSSRRTRPSPPGPRRPTGRPCCSAAPRRPQAT IXI_235 TSPASIRPPAGPSSR.........RPSPPGPRRPTGRPCCSAAPRRPQAT IXI_236 TSPASIRPPAGPSSRPAMVSSR..RPSPPPPRRPPGRPCCSAAPPRPQAT IXI_237 TSPASLRPPAGPSSRPAMVSSRR.RPSPPGPRRPT....CSAAPRRPQAT 51 100 IXI_234 GGWKTCSGTCTTSTSTRHRGRSGWSARTTTAACLRASRKSMRAACSRSAG IXI_235 GGWKTCSGTCTTSTSTRHRGRSGW..........RASRKSMRAACSRSAG IXI_236 GGWKTCSGTCTTSTSTRHRGRSGWSARTTTAACLRASRKSMRAACSR..G IXI_237 GGYKTCSGTCTTSTSTRHRGRSGYSARTTTAACLRASRKSMRAACSR..G 101 131 IXI_234 SRPNRFAPTLMSSCITSTTGPPAWAGDRSHE IXI_235 SRPNRFAPTLMSSCITSTTGPPAWAGDRSHE IXI_236 SRPPRFAPPLMSSCITSTTGPPPPAGDRSHE IXI_237 SRPNRFAPTLMSSCLTSTTGPPAYAGDRSHE CECS 694-02 Introduction to Bioinformatics University of Louisville Spring 2003 Dr. Eric Rouchka
ClustalW ALN Format • First non-blank line contains the word “CLUSTAL” • Each sequence starts with sequence name • Lines containing conservation symbols (* or :) are ignored CECS 694-02 Introduction to Bioinformatics University of Louisville Spring 2003 Dr. Eric Rouchka
ClustalW ALN Format CLUSTAL W (1.82) multiple sequence alignment JC2395 -NVSDVNLNKYIWRTAEKMKICDAKKFARQHKIPESKIDEIEHNSPQDAAEQKIQLLQCW 59 FASA_MOUSE -NASNLSLSKYIPRIAEDMTIQEAKKFARENNIKEGKIDEIMHDSIQDTAEQKVQLLLCW 59 KPEL_DROME MAIRLLPLPVRAQLCAHLDALDVWQQLATAVKLYPDQVEQISSQKQRGRSASN-EFLNIW 59 : * *. : :::* :: .::::* :. :. : .: ::* * JC2395 YQSHGKTGACQALIQGLRKANRCDIAEEIQAM 91 FASA_MOUSE YQSHGKSDAYQDLIKGLKKAECRRTLDKFQDM 91 KPEL_DROME GGQYNHT--VQTLFALFKKLKLHNAMRLIKDY 89 .:.:: * *: ::* : :: CECS 694-02 Introduction to Bioinformatics University of Louisville Spring 2003 Dr. Eric Rouchka
ClustalW ALN Format CLUSTAL W(1.4) multiple sequence alignment IXI_234 TSPASIRPPA GPSSRPAMVS SRRTRPSPPG PRRPTGRPCC SAAPRRPQAT IXI_235 TSPASIRPPA GPSSR----- ----RPSPPG PRRPTGRPCC SAAPRRPQAT IXI_236 TSPASIRPPA GPSSRPAMVS SR--RPSPPP PRRPPGRPCC SAAPPRPQAT IXI_237 TSPASLRPPA GPSSRPAMVS SRR-RPSPPG PRRPT----C SAAPRRPQAT IXI_234 GGWKTCSGTC TTSTSTRHRG RSGWSARTTT AACLRASRKS MRAACSRSAG IXI_235 GGWKTCSGTC TTSTSTRHRG RSGW------ ----RASRKS MRAACSRSAG IXI_236 GGWKTCSGTC TTSTSTRHRG RSGWSARTTT AACLRASRKS MRAACSR--G IXI_237 GGYKTCSGTC TTSTSTRHRG RSGYSARTTT AACLRASRKS MRAACSR--G IXI_234 SRPNRFAPTL MSSCITSTTG PPAWAGDRSH E IXI_235 SRPNRFAPTL MSSCITSTTG PPAWAGDRSH E IXI_236 SRPPRFAPPL MSSCITSTTG PPPPAGDRSH E IXI_237 SRPNRFAPTL MSSCLTSTTG PPAYAGDRSH E CECS 694-02 Introduction to Bioinformatics University of Louisville Spring 2003 Dr. Eric Rouchka
Phylip • The first line is two numbers • First indicates number of sequences • Second indicates length of alignment CECS 694-02 Introduction to Bioinformatics University of Louisville Spring 2003 Dr. Eric Rouchka
Example Phylip File Phylip 3 92 JC2395 -NVSDVNLNK YIWRTAEKMK ICDAKKFARQ HKIPESKIDE IEHNSPQDAA FASA_MOUSE -NASNLSLSK YIPRIAEDMT IQEAKKFARE NNIKEGKIDE IMHDSIQDTA KPEL_DROME MAIRLLPLPV RAQLCAHLDA LDVWQQLATA VKLYPDQVEQ ISSQKQRGRS EQKIQLLQCW YQSHGKTGAC QALIQGLRKA NRCDIAEEIQ AM EQKVQLLLCW YQSHGKSDAY QDLIKGLKKA ECRRTLDKFQ DM ASN-EFLNIW GGQYNHT--V QTLFALFKKL KLHNAMRLIK DY CECS 694-02 Introduction to Bioinformatics University of Louisville Spring 2003 Dr. Eric Rouchka
PIR Format >P1;JC2395 -NVSDVNLNKYIWRTAEKMKICDAKKFARQHKIPESKIDEIEHNSPQDAAEQKIQLLQCW YQHGKTGACQALIQGLRKANRCDIAEEIQAM * >P1;FASA_MOUSE NASNLSLSKYIPRIAEDMTIQEAKKFARENNIKEGKIDEIMHDSIQDTAEQKVQLLLCW YQHGKSDAYQDLIKGLKKAECRRTLDKFQDM * >P1;KPEL_DROME MAIRLLPLPVRAQLCAHLDALDVWQQLATAVKLYPDQVEQISSQKQGRSASN-EFLNIW GGQYNH--VQTLFALFKKLKLHNAMRLIKDY * CECS 694-02 Introduction to Bioinformatics University of Louisville Spring 2003 Dr. Eric Rouchka
GDE %JC2395 nvsdvnlnkyiwrtaekmkicdakkfarqhkipeskideiehnspqdaaeqkiqllqcwy qshgktgacqaliqglrkanrcdiaeeiqam %FASA_MOUSE Nasnlslskyipriaedmtiqeakkfarennikegkideimhdsiqdtaeqkvqlllcwy qshgksdayqdlikglkkaecrrtldkfqdm %KPEL_DROME --mairllplpvraqlcahldaldvwqqlatavklypdqveqissqkqrgrsasneflni wggqynhtvqtlfalfkklklhnamrlikdy CECS 694-02 Introduction to Bioinformatics University of Louisville Spring 2003 Dr. Eric Rouchka
NEXUS #NEXUS BEGIN DATA; dimensions ntax=3 nchar=91; format missing=? symbols="ABCDEFGHIKLMNPQRSTUVWXYZ“ interleave datatype=PROTEIN gap= -; matrix JC2395 NVSDVNLNKYIWRTAEKMKICDAKKFARQHKIPESKIDEIEHNSPQDAAE FASA_MOUSE NASNLSLSKYIPRIAEDMTIQEAKKFARENNIKEGKIDEIMHDSIQDTAE KPEL_DROME --MAIRLLPLPVRAQLCAHLDALDVWQQLATAVKLYPDQVEQISSQKQRG JC2395 QKIQLLQCWYQSHGKTGACQALIQGLRKANRCDIAEEIQAM FASA_MOUSE QKVQLLLCWYQSHGKSDAYQDLIKGLKKAECRRTLDKFQDM KPEL_DROME RSASNEFLNIWGGQYNHTVQTLFALFKKLKLHNAMRLIKDY ; end; CECS 694-02 Introduction to Bioinformatics University of Louisville Spring 2003 Dr. Eric Rouchka
General Feature Format (GFF) • Developed for easy parsing of features • Used to: • Read annotations into ACE format • Print out images of sequences and annotations • http://www.sanger.ac.uk/Software/formats/GFF/ CECS 694-02 Introduction to Bioinformatics University of Louisville Spring 2003 Dr. Eric Rouchka
Sequence Formats • Numerous other formats • Descriptions: • BLOCKS Server • Http://www.blocks.fhcrc.org/blocks/help/blocks_format.html • EMBL accepted formats • http://www.hgmp.mrc.ac.uk/Software/EMBOSS/Themes/SequenceFormats.html CECS 694-02 Introduction to Bioinformatics University of Louisville Spring 2003 Dr. Eric Rouchka
Sequence Conversion Programs • SEQIO • http://bioweb.pasteur.fr/docs/seqio/seqio.html • READSEQ • http://bimas.dcrt.nih.gov/molbio/readseq/ CECS 694-02 Introduction to Bioinformatics University of Louisville Spring 2003 Dr. Eric Rouchka
Searching Sequence Databases • Compare a query sequence against a target database • Return significant results • Possible Homolgous sequences • Yields insight into structure and function CECS 694-02 Introduction to Bioinformatics University of Louisville Spring 2003 Dr. Eric Rouchka
DNA vs. Protein Searches • Easier to determine similarity in protein sequences • 4 base of DNA means more random sequences • Consider alignment of length 4 • DNA: 1/44 = 1/256 chance at random • AA: 1/204 = 1/160,000 chance at random CECS 694-02 Introduction to Bioinformatics University of Louisville Spring 2003 Dr. Eric Rouchka
DNA vs. Protein Searches • Redundancy in Genetic code • Multiple codons code for same amino acid • A.A. sequence could be identical • DNA sequence could be different CECS 694-02 Introduction to Bioinformatics University of Louisville Spring 2003 Dr. Eric Rouchka
DNA vs. Protein Searches • Consider the two sequences: AUGGAATTAGTTATTAGTGCTTTAATTGTTGAATAA AUGGAGCTGGTGATCTCAGCGCTGATCGTCGAGTGA • Ungapped DNA alignment: AUGGAATTAGTTATTAGTGCTTTAATTGTTGAATAA ||||| | || || || | || || || | | AUGGAGCTGGTGATCTCAGCGCTGATCGTCGAGTGA • 21 identical resides (out of 36) 58% identity CECS 694-02 Introduction to Bioinformatics University of Louisville Spring 2003 Dr. Eric Rouchka
DNA vs. Protein Searches • Translate each to protein first: ELVISISALIVE ELVISISALIVE • 100% identical at amino acid level CECS 694-02 Introduction to Bioinformatics University of Louisville Spring 2003 Dr. Eric Rouchka
DNA vs. Protein Searches • If nucleotide region contains a gene, beneficial to translate first • Target and query translated into all six reading frames • 3 in forward, 3 in reverse CECS 694-02 Introduction to Bioinformatics University of Louisville Spring 2003 Dr. Eric Rouchka
DNA vs. Protein Searches • Number of comparisons needed grows • 4 comparisons: 2 in each direction • 36 comparisons: 6 in each direction • More sensitive, but slower CECS 694-02 Introduction to Bioinformatics University of Louisville Spring 2003 Dr. Eric Rouchka
Scoring Matrices • Defaults for major database searches • PAM250 (original) • BLOSUM62 CECS 694-02 Introduction to Bioinformatics University of Louisville Spring 2003 Dr. Eric Rouchka
FASTA • First rapid database search utility • 50 times faster than Dynamic Programming • Based on a heuristic – not guaranteed to locate optimal solution CECS 694-02 Introduction to Bioinformatics University of Louisville Spring 2003 Dr. Eric Rouchka
FASTA Algorithm • Hashing approach: • Construct a table showing each word of length k (k-tuple) for query and target • 1 or 2 for proteins • 4 or 6 for DNA • Relative positions calculated by subtracting positions • Matches in same phase are strung together CECS 694-02 Introduction to Bioinformatics University of Louisville Spring 2003 Dr. Eric Rouchka
FASTA Algorithm • Identify 10 regions with highest density of hits • Trim regions to include only residues contributing to high scores • Associate init1 score to each region Each region is partial alignment without gaps CECS 694-02 Introduction to Bioinformatics University of Louisville Spring 2003 Dr. Eric Rouchka
FASTA Algorithm • Join initial regions to form approximate alignments with gaps • Assign score • Sum of init1 scores for initial regions • Subtract gap penalty CECS 694-02 Introduction to Bioinformatics University of Louisville Spring 2003 Dr. Eric Rouchka
FASTA Algorithm • Construct Needleman-Wunsch optimal alignment of the query and database • Consider only a band 32 residues wide • Centered on best initial region CECS 694-02 Introduction to Bioinformatics University of Louisville Spring 2003 Dr. Eric Rouchka
FASTA Algorithm Step 1: Locate k-tuples Step 1: Locate k-tuples Step 2: locate 10 highest density regions (init1) Step 3: Join initial regions with gaps (initn) Step 1: Locate k-tuples Step 2: locate 10 highest density regions (init1) Step 1: Locate k-tuples Step 2: locate 10 highest density regions (init1) Step 3: Join initial regions with gaps (initn) Step 4: Align query and database around best region CECS 694-02 Introduction to Bioinformatics University of Louisville Spring 2003 Dr. Eric Rouchka
FASTA Scores • FASTA calculates the u and parameters for the extreme value distribution • Results reported as normalized z-scores and E-values CECS 694-02 Introduction to Bioinformatics University of Louisville Spring 2003 Dr. Eric Rouchka
Steps to calculate z-score • Average score for databases sequences of same length range determined • Average score plotted against log of average sequence length CECS 694-02 Introduction to Bioinformatics University of Louisville Spring 2003 Dr. Eric Rouchka
Steps to calculate z-score • Points fitted to a straight line • A z score (number of standard deviations from fitted line) calculated for each score • High scoring and low scoring alignments removed • Steps repeated one or more times to refine CECS 694-02 Introduction to Bioinformatics University of Louisville Spring 2003 Dr. Eric Rouchka
Steps to calculate z-score • Z score normalized • Z’ = 50 + 10 * z • Alignment score with std dev of 5 has normalized z score of 100 • Significance can be refined by shuffling sequences in database CECS 694-02 Introduction to Bioinformatics University of Louisville Spring 2003 Dr. Eric Rouchka
Probability of z-score • Pearson, 2000 (ISMB): CECS 694-02 Introduction to Bioinformatics University of Louisville Spring 2003 Dr. Eric Rouchka
Expected Value • In a database of D sequences: CECS 694-02 Introduction to Bioinformatics University of Louisville Spring 2003 Dr. Eric Rouchka
Example Fasta Output • FASTA reports a histogram • Indicates distribution of normalized scores • Expected to fall in normal distribution • Outliers are significant matches CECS 694-02 Introduction to Bioinformatics University of Louisville Spring 2003 Dr. Eric Rouchka
Histogram Output • normalized z’ score • Number of optimized scores found • Number of expected scores • “=“: approximate curve for observed • ‘*”: approximate curve for expected • Z’ score > 120 considered high-scoring CECS 694-02 Introduction to Bioinformatics University of Louisville Spring 2003 Dr. Eric Rouchka
opt E() • < 20 188 0:== • 22 0 0: one = represents 109 library sequences • 24 0 0: • 26 2 1:* • 28 7 15:* • 30 28 91:* • 32 200 353:== * • 34 841 958:========* • 36 2217 1968:==================*== • 38 3746 3253:=============================*===== • 40 5360 4538:=========================================*======== • 42 6055 5547:==================================================*===== • 44 6496 6119:========================================================*=== • 46 5820 6232:====================================================== * • 48 5469 5966:=================================================== * • 50 4820 5444:============================================= * • 52 4202 4787:======================================= * • 54 3815 4089:=================================== * • 56 3271 3415:===============================* • 58 2755 2804:=========================* • 60 2268 2271:====================* • 62 1813 1821:================* • 64 1500 1448:=============* • 66 1233 1145:==========*= • 68 951 900:========* • 70 746 706:======* • 72 699 551:=====*= • 74 460 430:===*= • 76 337 335:===* • 78 287 260:==* • 80 244 202:=*= • 82 185 154:=* • 84 115 122:=* • 86 114 95:*= • 88 75 73:* inset = represents 1 library sequences • 90 70 57:* • 92 48 44:* :=======================================* • 94 26 34:* :========================== * • 96 33 26:* :=========================*======= • 98 14 20:* :============== * • 100 10 16:* :========== * • 102 7 12:* :======= * • 104 6 9:* :====== * • 106 5 7:* :===== * • 108 2 6:* :== * • 110 2 4:* :== * • 112 1 3:* := * • 114 0 3:* : * • 116 0 2:* : * • 118 0 2:* : * • >120 27 1:* :*========================== CECS 694-02 Introduction to Bioinformatics University of Louisville Spring 2003 Dr. Eric Rouchka
Fasta Best scoring hits • At most, one hit per sequence • Description, z’ score, init1 score, initn score, opt score, E value CECS 694-02 Introduction to Bioinformatics University of Louisville Spring 2003 Dr. Eric Rouchka
Example FASTA output The best scores are: initn init1 opt z-sc E(66345) MERR_PSEAE mercuric resistance operon regu ( 144) 928 928 928 1129.8 0 MERR_SHIFL mercuric resistance operon regu ( 144) 871 871 871 1061.3 0 MERR_SERMA mercuric resistance operon regu ( 144) 810 810 810 988.1 0 MERR_STAAU mercuric resistance operon regu ( 135) 292 172 298 373.6 3.5e-14 MERR_BACSR (strain rc607). mercuric resist ( 132) 241 198 289 363.0 1.4e-13 YHDM_ECOLI hypothetical transcriptional re ( 141) 175 175 276 347.0 1.1e-12 CECS 694-02 Introduction to Bioinformatics University of Louisville Spring 2003 Dr. Eric Rouchka
FASTA Alignment Output • Smith-Waterman type alignment • ‘:’ denotes conserved residue • ‘.’ denotes conservative substitution (ie a substitution with a positive score) CECS 694-02 Introduction to Bioinformatics University of Louisville Spring 2003 Dr. Eric Rouchka
FASTA Alignment Output >>MERR_STAAU mercuric resistance operon regulatory protei (135 aa) initn: 292 init1: 172 opt: 298 Z-score: 373.6 expect() 3.5e-14 Smith-Waterman score: 298; 36.923% identity in 130 aa overlap 10 20 30 40 50 60 MerR MENNLENLTIGVFAKAAGVNVETIRFYQRKGLLLEPDKPYGSIRRYGEADVTRVRFVKSA . :. .::: :: ::.:.:.::::. : . .. : :.: . ::::.: MERR_S MGMKISELAKACDVNKETVRYYERKGLIAGPPRNESGYRIYSEETADRVRFIKRM 10 20 30 40 50 70 80 90 100 110 MerR QRLGFSLDEIAELLRL--EDGTHCEEASSLAEHKLKDVREKMADLARMEAVLSELVCACH ..: ::: :: :. . .:: .:.. ... .: :....:. : :.. .: :: : MERR_S KELDFSLKEIHLLFGVVDQDGERCKDMYAFTVQKTKEIERKVQGLLRIQRLLEELKEKCP 60 70 80 90 100 110 120 130 140 MerR ARRGNVSCPLIASLQGGASLAGSAMP ... .::.: .:.:: MERR_S DEKAMYTCPIIETLMGGPDK 120 130 CECS 694-02 Introduction to Bioinformatics University of Louisville Spring 2003 Dr. Eric Rouchka
FASTA Programs • FASTA – protein to protein OR DNA to DNA • TFASTA –query protein to DNA database • the DNA database is first translated in all six reading frames CECS 694-02 Introduction to Bioinformatics University of Louisville Spring 2003 Dr. Eric Rouchka
FASTA Programs • FASTF – compares a set of ordered peptide fragments, obtained from analysis of a protein by cleavage and sequencing of protein bands resolved by electrophoresis, against a protein database • TFASTF – compares a set of ordered peptide fragments, against a DNA database CECS 694-02 Introduction to Bioinformatics University of Louisville Spring 2003 Dr. Eric Rouchka
FASTA Programs • FASTS – compares a set of ordered peptide fragments, obtained from mass-spectometry analysis of a protein, against a protein database. • TFASTS – compares a set of ordered peptide fragments,, against a DNA database. >mgstm1 MGCEN,MIDYP,MLLAY,MLLGY CECS 694-02 Introduction to Bioinformatics University of Louisville Spring 2003 Dr. Eric Rouchka
FASTA Programs • FASTX, FASTY – compares a query DNA sequence to a protein sequence database • DNA sequence translated in all six reading frames • frameshifts allowed CECS 694-02 Introduction to Bioinformatics University of Louisville Spring 2003 Dr. Eric Rouchka