NCBI Molecular Biology Resources

ruby

Presentation Transcript


  1. NCBI Molecular Biology Resources: A Field Guide, part 2. August 2-3, 2005

  2. Web Access • Text: Entrez • Sequence: BLAST • Structure: VAST

  3. Why do we need similarity searching? Searching with Sequences
     • To identify and annotate sequences with…
       • incomplete (or no) annotations (GenBank)
       • incorrect annotations
     • To assemble genomes
     • To explore evolutionary relationships by…
       • finding homologous molecules
       • developing phylogenetic trees
     • NOTE: Similar sequences may NOT have similar function!

  4. Basic Local Alignment Search Tool
     • Widely used similarity search tool
     • Heuristic approach based on the Smith-Waterman algorithm
     • Finds best local alignments
     • Provides statistical significance
     • All combinations of query and database type (DNA/protein):
       • DNA vs DNA
       • DNA translation vs Protein
       • Protein vs Protein
       • Protein vs DNA translation
       • DNA translation vs DNA translation
     • Available as WWW, standalone, and network clients

  5. Global vs Local Alignment [Figure: Seq 1 and Seq 2 shown under a global alignment and under a local alignment]

  6. Global vs. Local Alignment
     Align program (Lipman and Pearson)
     Human: 15 IAKYNFHGTAEQDLPFCKGDVLTIVAVTKDPNWYKAKNKVGREGIIPANYVQKREGVKAGTKLSLMPWFH 84
     +A + + + DL F K D+L I+ T+ W+ GR G IP+NYV + + +++ PW+
     Worm: 63 VALFQYDARTDDDLSFKKDDILEILNDTQGDWWFARHKATGRTGYIPSNYVAREKSIES------QPWYF 125

     Human: 85 GKITREQAERLLYPP--ETGLFLVRESTNYPGDYTLCVSCDGKVEHYRI-MYHASKLSIDEEVYFENLMQ 151
     GK+ R AE+ L E G FLVR+S + D +L V + V+HYRI + H I F L
     Worm: 126 GKMRRIDAEKCLLHTLNEHGAFLVRDSESRQHDLSLSVRENDSVKHYRIQLDHGGYF-IARRRPFATLHD 194

     Human: 152 LVEHYTSDADGLCTRLIKPKVMEGTVAAQDEFYRSGWALNMKELKLLQTIGKGEFGDVMLGDYRGN-KVA 220
     L+ HY +ADGLC L P Y W ++ + ++L++ IG G+FG+V G + N VA
     Worm: 195 LIAHYQREADGLCVNLGAPCAKSEAPQTTTFTYDDQWEVDRRSVRLIRQIGAGQFGEVWEGRWNVNVPVA 264

     Human: 221 VKCIK-NDATAQAFLAEASVMTQLRHSNLVQLLGVIVEEKGGLYIVTEYMAKGSLVDYLRSRGRSVLGGD 289
     VK +K A FLAEA +M +LRH L+ L V ++ + IVTE M + +L+ +L+ RGR
     Worm: 265 VKKLKAGTADPTDFLAEAQIMKKLRHPKLLSLYAVCTRDE-PILIVTELMQE-NLLTFLQRRGRQCQMPQ 332

     Human: 290 CLLKFSLDVCEAMEYLEGNNFVHRDLAARNVLVSEDNVAKVSDFGLT----KEASSTQDTG-KLPVKWTA 353
     L++ S V M YLE NF+HRDLAARN+L++ K++DFGL KE TG + P+KWTA
     Worm: 333 -LVEISAQVAAGMAYLEEMNFIHRDLAARNILINNSLSVKIADFGLARILMKENEYEARTGARFPIKWTA 401

     Human: 354 PEALREKKFSTKSDVWSFGILLWEIYSFGRVPYPRIPLKDVVPRVEKGYKMDAPDGCPPAVYEVMKNCWH 423
     PEA +F+TKSDVWSFGILL EI +FGR+PYP + +V+ +V+ GY+M P GCP +Y++M+ CW
     Worm: 402 PEAANYNRFTTKSDVWSFGILLTEIVTFGRLPYPGMTNAEVLQQVDAGYRMPCPAGCPVTLYDIMQQCWR 471

     Human: 424 LDAAMRPSFLQLREQLEHI 443
     D RP+F L+ +LE +
     Worm: 472 SDPDKRPTFETLQWKLEDL 492

     human M--------------SAIQ----------------------AAWPSGT------------ECIAKYNFHG
     M S .. AA SG. . .A ... .
     worm MGSCIGKEDPPPGATSPVHTSSTLGRESLPSHPRIPSIGPIAASSSGNTIDKNQNISQSANFVALFQYDA
     1 20 40 60

     440 450
     human REQLEHI--------KTHELHL
     . .:: . : ...
     worm QWKLEDLFNLDSSEYKEASINF
     500
     BLASTp

  7. Nucleotide Words: make a lookup table of words
     Query: GTACTGGACATGGACCCTACAGGAA
     Word Size = 11
     GTACTGGACAT
     TACTGGACATG
     ACTGGACATGG
     CTGGACATGGA
     TGGACATGGAC
     GGACATGGACC
     GACATGGACCC
     ACATGGACCCT
     ...........
     Minimum word size = 7; blastn default = 11; megablast default = 28
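The word list on this slide can be reproduced with a short sketch. The function names `nucleotide_words` and `build_lookup` are illustrative and are not taken from the BLAST source; the dictionary below only mimics the idea of a lookup table from word to query position.

```python
def nucleotide_words(query, word_size=11):
    """Return every overlapping word of length word_size in the query."""
    return [query[i:i + word_size] for i in range(len(query) - word_size + 1)]


def build_lookup(query, word_size=11):
    """Map each word to the list of query offsets where it starts."""
    lookup = {}
    for pos, word in enumerate(nucleotide_words(query, word_size)):
        lookup.setdefault(word, []).append(pos)
    return lookup


QUERY = "GTACTGGACATGGACCCTACAGGAA"  # query from the slide
words = nucleotide_words(QUERY)      # blastn default word size of 11
lookup = build_lookup(QUERY)

for word in words[:8]:
    print(word)
```

A database scan would then look up each database word in `lookup` and try to extend any hit.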

  8. Protein Words: make a lookup table of words
     Query: GTQITVEDLFYNIATRRKALKN
     Word Size = 3
     Protein words: GTQ TQI QIT ITV TVE VED EDL DLF ...
     Neighborhood words (for ITV): LTV, MTV, ISV, LSV, etc.
     Word Size can be 2 or 3 (default = 3)
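A minimal sketch of how neighborhood words might be scored against a query word. It uses only the handful of BLOSUM62 entries needed for the ITV example (values as listed on the BLOSUM62 slide). The `threshold` parameter is an assumption for illustration, since the slide does not state the cutoff that decides which neighbors are kept.

```python
# Only the BLOSUM62 entries needed for the ITV example.
BLOSUM62_SUBSET = {
    ("I", "I"): 4, ("I", "L"): 2, ("I", "M"): 1,
    ("T", "T"): 5, ("S", "T"): 1,
    ("V", "V"): 4,
}


def pair_score(a, b):
    """Look up a substitution score regardless of residue order."""
    if (a, b) in BLOSUM62_SUBSET:
        return BLOSUM62_SUBSET[(a, b)]
    return BLOSUM62_SUBSET[(b, a)]


def word_score(query_word, candidate):
    """Sum the substitution scores position by position."""
    return sum(pair_score(q, c) for q, c in zip(query_word, candidate))


def protein_words(query, word_size=3):
    """Return every overlapping protein word in the query."""
    return [query[i:i + word_size] for i in range(len(query) - word_size + 1)]


def neighbors(query_word, candidates, threshold):
    """Keep candidates scoring at least `threshold` (hypothetical cutoff)."""
    return [c for c in candidates if word_score(query_word, c) >= threshold]


scores = {w: word_score("ITV", w) for w in ["ITV", "LTV", "MTV", "ISV", "LSV"]}
print(scores)
```

With these matrix values the exact word ITV scores highest and LSV lowest, so raising the threshold trims the neighborhood from the weakest neighbor upward.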

  9. Initial Matches and Extensions
     Nucleotide BLAST requires one exact match
       ATCGCCATGCTTAATTGGGCTT
       <--- CATGCTTAATT ----->  (exact word match; one match)
     Protein BLAST requires two neighboring matches within 40 aa
       GTQITVEDLFYNI
       <---- SEI YYN ---->  (neighborhood words; two matches)

  10. An alignment that BLAST can’t find
      1 GAATATATGAAGACCAAGATTGCAGTCCTGCTGGCCTGAACCACGCTATTCTTGCTGTTG
      || | || || || | || || || || | ||| |||||| | | || | ||| |
      1 GAGTGTACGATGAGCCCGAGTGTAGCAGTGAAGATCTGGACCACGGTGTACTCGTTGTCG

      61 GTTACGGAACCGAGAATGGTAAAGACTACTGGATCATTAAGAACTCCTGGGGAGCCAGTT
      | || || || ||| || | |||||| || | |||||| ||||| | |
      61 GCTATGGTGTTAAGGGTGGGAAGAAGTACTGGCTCGTCAAGAACAGCTGGGCTGAATCCT

      121 GGGGTGAACAAGGTTATTTCAGGCTTGCTCGTGGTAAAAAC
      |||| || ||||| || || | | |||| || |||
      121 GGGGAGACCAAGGCTACATCCTTATGTCCCGTGACAACAAC

  11. An Alignment BLAST Can Make
      Solution: compare protein sequences; BLASTX
      BLAST 2 Sequences (blastx) output:
      Score = 290 bits (741), Expect = 7e-77
      Identities = 147/331 (44%), Positives = 206/331 (61%), Gaps = 8/331 (2%)
      Frame = +3

  12. Scoring Systems - Nucleotides
      Identity matrix
            A   G   C   T
        A  +1  -3  -3  -3
        G  -3  +1  -3  -3
        C  -3  -3  +1  -3
        T  -3  -3  -3  +1

      CAGGTAGCAAGCTTGCATGTCA
      || ||||||||||||  |||||
      CACGTAGCAAGCTTG-GTGTCA
      raw score = 19-9 = 10
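The raw score on this slide can be checked with a few lines. This is a sketch under one stated assumption: the slide's arithmetic (19 - 9) only works out if the single gap column is charged 3, the same as a mismatch, so the code treats "-" that way. Real gapped searches use separate gap penalties.

```python
MATCH = 1
MISMATCH = -3  # also applied to the gap column, as the slide's arithmetic implies


def raw_score(top, bottom):
    """Score an alignment column by column with the +1/-3 matrix."""
    if len(top) != len(bottom):
        raise ValueError("aligned sequences must have equal length")
    score = 0
    for a, b in zip(top, bottom):
        score += MATCH if a == b and a != "-" else MISMATCH
    return score


top = "CAGGTAGCAAGCTTGCATGTCA"
bottom = "CACGTAGCAAGCTTG-GTGTCA"
identities = sum(a == b for a, b in zip(top, bottom))
print(identities, raw_score(top, bottom))
```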

  13. Scoring Systems - Proteins
      • Position Independent Matrices
        • PAM Matrices (Percent Accepted Mutation)
          • Derived from observation; small dataset of alignments
          • Implicit model of evolution
          • All calculated from PAM1
          • PAM250 widely used
        • BLOSUM Matrices (BLOck SUbstitution Matrices)
          • Derived from observation; large dataset of highly conserved blocks
          • Each matrix derived separately from blocks with a defined percent identity cutoff
          • BLOSUM62 - default matrix for BLAST
      • Position Specific Score Matrices (PSSMs)
        • PSI- and RPS-BLAST

  14. BLOSUM62
      A   4
      R  -1   5
      N  -2   0   6
      D  -2  -2   1   6
      C   0  -3  -3  -3   9
      Q  -1   1   0   0  -3   5
      E  -1   0   0   2  -4   2   5
      G   0  -2   0  -1  -3  -2  -2   6
      H  -2   0   1  -1  -3   0   0  -2   8
      I  -1  -3  -3  -3  -1  -3  -3  -4  -3   4
      L  -1  -2  -3  -4  -1  -2  -3  -4  -3   2   4
      K  -1   2   0  -1  -3   1   1  -2  -1  -3  -2   5
      M  -1  -1  -2  -3  -1   0  -2  -3  -2   1   2  -1   5
      F  -2  -3  -3  -3  -2  -3  -3  -3  -1   0   0  -3   0   6
      P  -1  -2  -2  -1  -3  -1  -1  -2  -2  -3  -3  -1  -2  -4   7
      S   1  -1   1   0  -1   0   0   0  -1  -2  -2   0  -1  -2  -1   4
      T   0  -1   0  -1  -1  -1  -1  -2  -2  -1  -1  -1  -1  -2  -1   1   5
      W  -3  -3  -4  -4  -2  -2  -3  -2  -2  -3  -2  -3  -1   1  -4  -3  -2  11
      Y  -2  -2  -2  -3  -2  -1  -2  -3   2  -1  -1  -2  -1   3  -3  -2  -2   2   7
      V   0  -3  -3  -3  -1  -2  -2  -3  -3   3   1  -2   1  -1  -2  -2   0  -3  -1   4
      X   0  -1  -1  -1  -2  -1  -1  -1  -1  -1  -1  -1  -1  -1  -2   0   0  -2  -1  -1  -1
          A   R   N   D   C   Q   E   G   H   I   L   K   M   F   P   S   T   W   Y   V   X
      • Common amino acids have low weights
      • Rare amino acids have high weights
      • Negative for less likely substitutions
      • Positive for more likely substitutions

  15. Gapped Alignments
      • Gapping provides more biologically realistic alignments
      • Statistical behavior is not completely understood for gapped alignments
      • Gapped BLAST parameters must be found by simulations for each matrix
      • Affine gap costs = -(a+bk), where k = gap length
        • a = gap open penalty
        • b = gap extend penalty
        • A gap of length 1 receives the score -(a+b)
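The affine gap formula can be written as a one-line function. The penalties used in the example (a = 11, b = 1) are an assumption chosen because they reproduce the -12 gap score shown with BLOSUM62 on the next slide; the slide here does not fix particular values.

```python
def affine_gap_score(k, gap_open, gap_extend):
    """Score of a gap of length k under affine costs: -(a + b*k)."""
    if k < 1:
        raise ValueError("gap length must be at least 1")
    return -(gap_open + gap_extend * k)


# Assumed penalties a = 11, b = 1 (illustrative).
for length in (1, 2, 5):
    print(length, affine_gap_score(length, gap_open=11, gap_extend=1))
```

Because the open penalty is paid once, one long gap is cheaper than several short gaps of the same total length, which is the point of the affine model.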

  16. Scores
      Simply add the scores for each pair of aligned residues

                 V    D    S    –    C    Y
                 V    E    T    L    C    F    Total
      BLOSUM62  +4   +2   +1  -12   +9   +3      7
      PAM30     +7   +2    0  -10  +10   +2     11

      Different matrices produce different scores!
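As a quick check of the totals on this slide, the column scores can be summed directly. The per-column values are copied from the slide; the dictionary layout and the function name `alignment_score` are only illustrative.

```python
# Aligned columns from the slide; "-" marks the gap.
COLUMNS = [("V", "V"), ("D", "E"), ("S", "T"), ("-", "L"), ("C", "C"), ("Y", "F")]

# Per-pair scores and gap score as shown on the slide for each matrix.
SCORING = {
    "BLOSUM62": {"pairs": {("V", "V"): 4, ("D", "E"): 2, ("S", "T"): 1,
                           ("C", "C"): 9, ("Y", "F"): 3}, "gap": -12},
    "PAM30": {"pairs": {("V", "V"): 7, ("D", "E"): 2, ("S", "T"): 0,
                        ("C", "C"): 10, ("Y", "F"): 2}, "gap": -10},
}


def alignment_score(columns, pairs, gap):
    """Add the score of every column; a column containing '-' scores `gap`."""
    total = 0
    for a, b in columns:
        total += gap if "-" in (a, b) else pairs[(a, b)]
    return total


totals = {name: alignment_score(COLUMNS, s["pairs"], s["gap"])
          for name, s in SCORING.items()}
print(totals)
```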

  17. E = Kmne-S E = mn2-S’ K = scale for search space  = scale for scoring system S’ = bitscore = (S - lnK)/ln2 (applies to ungapped alignments) Local Alignment Statistics High scores of local alignments between two random sequences follow the Extreme Value Distribution Expect Value E = number of database hits you expect to find by chance size of database your score Alignments expected number of random hits Score

  18. Advanced BLAST Options: Nucleotide
      Example Entrez Queries
      • nucleotide all[Filter] NOT mammalia[Organism]
      • green plants[Organism]
      • biomol mrna[Properties]
      • gbdiv est[Properties] AND rat[organism]
      Other Advanced
      • -e 10000 (expect value)
      • -v 2000 (descriptions)
      • -b 2000 (alignments)

  19. Advanced BLAST Options: Protein
      Limit by taxon
      • Mus musculus[Organism]
      • Mammalia[Organism]
      • Viridiplantae[Organism]
      Example Entrez Queries
      • proteins all[Filter] NOT mammalia[Organism]
      • green plants[Organism]
      • srcdb refseq[Properties]
      Other Advanced
      • -e 10000 (expect value)
      • -v 2000 (descriptions)
      • -b 2000 (alignments)
      Matrix Selection
      • PAM30 -- most stringent
      • BLOSUM45 -- least stringent

  20. Low Complexity Filtering [Panels: Filtered; Unfiltered]
      sp|P27476|NSR1_YEAST NUCLEAR LOCALIZATION SEQUENCE BINDING PROTEIN (P67)
      Length = 414
      Score = 40.2 bits (92), Expect = 0.013
      Identities = 35/131 (26%), Positives = 56/131 (42%), Gaps = 4/131 (3%)
      Query: 362 STTSLTSSSTSGSSDKVYAHQMVRTDSREQKLDAFLQPLSKPLS---SQPQAIVTEDKTD 418
      S++S SSS+S SS + + ++S + + S S S+ + E K
      Sbjct: 29 SSSSSESSSSSSSSSESESESESESESSSSSSSSDSESSSSSSSDSESEAETKKEESKDS 88

  21. Low Complexity Filter
      >gi|20140146|sp|Q96RF0|SNXI_HUMAN Sorting nexin 18
      Length = 628
      Score = 1048 bits (2710), Expect = 0.0
      Identities = 528/628 (84%), Positives = 528/628 (84%)

      Query: 1   MALRARALYDFRSENPGEISLREHEVLSLCSEQDIEGWLEGVNSRGDRGLFPASYVQVIR 60
                 MALRARALYDFRSENPGEISLREHEVLSLCSEQDIEGWLEGVNSRGDRGLFPASYVQVIR
      Sbjct: 1   MALRARALYDFRSENPGEISLREHEVLSLCSEQDIEGWLEGVNSRGDRGLFPASYVQVIR 60

      Query: 61  XXXXXXXXXXXXXXXXXXXNVPPGGFEXXXXXXXXXXXXXXXXXXXXXXXXXXXXXSTFQ 120
                                    NVPPGGFE                             STFQ
      Sbjct: 61  APEPGPAGDGGPGAPARYANVPPGGFEPLPVAPPASFKPPPDAFQALLQPQQAPPPSTFQ 120
      . . .
      (the X runs mark low complexity sequence)
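To give a feel for what masking does, here is a deliberately simple sketch that replaces low-entropy windows with X. It is not the filter BLAST uses (BLAST relies on dedicated programs, SEG for proteins and DUST for nucleotides); the window size and entropy cutoff below are arbitrary illustrative choices.

```python
import math
from collections import Counter


def shannon_entropy(window):
    """Shannon entropy (bits) of the residue composition of a window."""
    n = len(window)
    return -sum(c / n * math.log2(c / n) for c in Counter(window).values())


def mask_low_complexity(seq, window=12, min_entropy=2.2):
    """Replace every window whose entropy falls below min_entropy with X."""
    masked = list(seq)
    for i in range(len(seq) - window + 1):
        if shannon_entropy(seq[i:i + window]) < min_entropy:
            masked[i:i + window] = "X" * window
    return "".join(masked)


serine_run = "SSSSSESSSSSSSSSE"           # resembles the serine-rich region above
diverse = "MALRARALYDFRSENPGEIS"          # start of the sorting nexin query
print(mask_low_complexity(serine_run))
print(mask_low_complexity(diverse))
```

Masked residues cannot seed or extend hits, which removes spurious matches such as the serine-rich alignment on the previous slide.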

  22. Neighbors: Precomputed BLAST
      • Nucleotide: Entrez Related Sequences produces a list of sequences sorted by BLAST score, but with no alignment details.
      • Protein

  23. Blink – Protein BLAST Alignments • Lists only 200 hits • List is nonredundant

  24. Blink – Best Hits

  25. Megablast: NCBI’s Genome Annotator
      • Long alignments of similar DNA sequences
      • Greedy algorithm
      • Concatenation of query sequences
      • Faster than blastn; less sensitive

  26. MegaBLAST
      > 1133045 gnl|UG|Hs#S1133045 qd43b11.x1 Homo sapiens cDNA, 3' end
      CATGTAAGCCATTTATTGGTTTGTTTTAAAAATATGTATTTTATTTATACATGAAGTTTG
      GTGAGAAGTGCTCGATTAGTTCAGACAACATCTGGCACTTGATGTCTGTCCTTCCCTCCT
      TTTTCCTACTCTCTTCTCCCCTCCTGCTGGTCATTGTGCAGTTCTGGAAATTAAAAAGGT
      GACAGCCAGGCTAAAAGCTAAGGGTTGGGTCTAGCTCACCTCCCACCCCCAACCACACCG
      TCTGCAGCCAGCCCCAGGCACCTGTCTCAAAGCTCCCGGGCTGTCCACACACACAAAAAC
      CACAGTCTCCTTCCGGCCAGCTGGGCTGGCAGCCCGACCTGC
      > 1141828 gnl|UG|Hs#S1141828 qv37f11.x1 Homo sapiens cDNA, 3' end
      GAGAAGACGACAGAAGGGGAGAAGAGAGTAGGAAAAAGGAGGGAAGGACAGACATCAAGT
      GCCAGATGTTGTCTGAACTAATCGAGCACTTCTCACCAAACTTCATGTATAAATAAAATA
      CATATTTTTAAAACAAACCAATAAATGGCTTACATCAAAAAAAAAAAAAAAAAAAAAAAA
      GTCGTATCGATGT
      > 1145899 gnl|UG|Hs#S1145899 qv33c06.x1 Homo sapiens cDNA, 3' end
      GAGAAGACGACAGAAGGGGAGAAGAGAGTAGGAAAAAGGAGGGAAGGACAGACATCAAGT
      GCCAGATGTTGTCTGAACTAATCGAGCACTTCTCACCAAACTTCATGTATAAATAAAATA
      CATATTTTTAAAACAAACCAATAAATGGCTTACATCAAAAAAAAAAAAAAAAAAAAAAAA
      GTCGTATCGATGT
      > 2291670 gnl|UG|Hs#S2291670 7e65f04.x1 Homo sapiens cDNA, 3' end
      TTTCATGTAAGCCATTTATTGGTTTGTTTTAAAAATATGTATTTTATTTATACATGAAGT
      TTGGTGAGAAGTGCTCGATTAGTTCAAACAACATCTGGCACTTGATGTCTGTCCTTCCCT
      CCTTTTTCCTACTCTCTTCTCCCCTCCTGCTGGTCATTGTGCAGTTCTGGAAATTAAAAA
      GGTGACAGCCAGGCTAAAAGCTAAGGGTTGGGTCTAGCTCACCTCCCACCCCCAACCACA
      CCGTCTGCAGCCAGCCCCAGGCACCTGTCTCAAAGCTCCCGGGCTGTCCACACACACAAA
      AACCACAGTCTCCTTCCGGCCAGCTGGGCTGGCAGCCCGACCTGCCTCCCAACCGCATTC
      CTGCCTGTGTAGCAGGCGGTGAGCACCCAGAAGGGGCACATACCTCTCCAAGCCTTGAAA
      GCAAAGCATGGAGATCTACAAAAATAGGATTTCCACTTGGAGAAATGTCGCTGGGACAGT
      Accessions: AI217550, AI251192, AI254381, BE645079
      File: C:\seq\hs.4.fsa

  27. Discontiguous Megablast • Uses discontiguous word matches • Better for cross-species comparisons

  28. Templates for Discontiguous MegaBLAST
      W    t    type         template
      11   16   coding       1101101101101101
      11   16   non-coding   1110010110110111
      12   16   coding       1111101101101101
      12   16   non-coding   1110110110110111
      11   18   coding       101101100101101101
      11   18   non-coding   111010010110010111
      12   18   coding       101101101101101101
      12   18   non-coding   111010110010110111
      11   21   coding       100101100101100101101
      11   21   non-coding   111010010100010010111
      12   21   coding       100101101101100101101
      12   21   non-coding   111010010110010010111
      Ma, B., Tromp, J., Li, M., "PatternHunter: faster and more sensitive homology search", Bioinformatics 2002 Mar;18(3):440-5
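A sketch of how a template is applied: positions marked 1 must match and positions marked 0 are ignored, so a coding template tolerates differences at third codon positions. The function `template_hit` is illustrative and not part of any BLAST API. The two 16-base windows are taken from the human and A. thaliana ADSS sequences shown on slide 29.

```python
# A few templates from the table, keyed by (W, t, type).
TEMPLATES = {
    (11, 16, "coding"): "1101101101101101",
    (11, 16, "non-coding"): "1110010110110111",
    (12, 16, "coding"): "1111101101101101",
    (12, 18, "coding"): "101101101101101101",
}


def template_hit(a, b, template):
    """True if a and b agree at every position where the template has '1'."""
    return all(x == y for x, y, t in zip(a, b, template) if t == "1")


# Codon-aligned 16-base windows from the ADSS example (slide 29).
HUMAN = "cagtggggcgacgaag"
ATH = "caatggggagatgaag"

coding_hit = template_hit(HUMAN, ATH, TEMPLATES[(11, 16, "coding")])
noncoding_hit = template_hit(HUMAN, ATH, TEMPLATES[(11, 16, "non-coding")])
print(HUMAN == ATH, coding_hit, noncoding_hit)
```

The windows differ at several wobble positions, so a contiguous 16-base word fails, while the coding template still registers a hit. Each template has exactly W ones spread over t positions.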

  29. Nucleotide vs. Protein BLAST
      Comparing ADSS from H. sapiens and A. thaliana

             aac cgg gtg acg gtg gtg ctc ggt gcg cag tgg ggc gac gaa ggc
      Human:  N   R   V   T   V   V   L   G   A   Q   W   G   D   E   G
              +   +   V   +       V   L   G       Q   W   G   D   E   G
      A.th.:  S   Q   V   S   G   V   L   G   C   Q   W   G   D   E   G
             agt caa gta tct ggt gta ctc ggt tgc caa tgg gga gat gaa ggt

      • BLASTp finds three matching words
      • BLASTn finds no match, because there are no identical 7 bp words
      • Protein searches are generally more sensitive than nucleotide searches.
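The claim about the missing 7 bp word can be checked directly on the two sequences from this slide. The helper below is a small illustrative function that measures the longest run of identical columns in the alignment as drawn (same offset, no gaps).

```python
def longest_identical_run(a, b):
    """Length of the longest stretch of identical aligned positions."""
    best = run = 0
    for x, y in zip(a, b):
        run = run + 1 if x == y else 0
        best = max(best, run)
    return best


HUMAN_DNA = "aaccgggtgacggtggtgctcggtgcgcagtggggcgacgaaggc"
ATH_DNA = "agtcaagtatctggtgtactcggttgccaatggggagatgaaggt"
HUMAN_PROT = "NRVTVVLGAQWGDEG"
ATH_PROT = "SQVSGVLGCQWGDEG"

dna_run = longest_identical_run(HUMAN_DNA, ATH_DNA)
prot_run = longest_identical_run(HUMAN_PROT, ATH_PROT)
print(dna_run, prot_run)
```

The longest identical DNA stretch is 6 bases, one short of the minimum nucleotide word size of 7, while the protein alignment contains a run of 6 identical residues, twice the protein word size of 3.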

  30. Translated BLAST
      Particularly useful for nucleotide sequences without protein annotations, such as ESTs or genomic DNA
      (N = nucleotide, P = protein)
      Program   Query   Database
      blastx    N       P
      tblastn   P       N
      tblastx   N       N
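Translated searches compare conceptual translations of the nucleotide sequence in all six reading frames. The sketch below shows one way to produce those frames with the standard genetic code; the function names are illustrative, and the example input is the human ADSS fragment from slide 29.

```python
from itertools import product

BASES = "TCAG"
# Standard genetic code, codons ordered TTT, TTC, TTA, TTG, TCT, ...
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO)}


def translate(dna, frame=0):
    """Translate one forward frame; unknown codons become X."""
    dna = dna.upper()
    return "".join(CODON_TABLE.get(dna[i:i + 3], "X")
                   for i in range(frame, len(dna) - 2, 3))


def reverse_complement(dna):
    """Reverse-complement a DNA string."""
    return dna.upper().translate(str.maketrans("ACGT", "TGCA"))[::-1]


def six_frames(dna):
    """Return the three forward and three reverse translations."""
    frames = {}
    rc = reverse_complement(dna)
    for f in range(3):
        frames[f"+{f + 1}"] = translate(dna, f)
        frames[f"-{f + 1}"] = translate(rc, f)
    return frames


frames = six_frames("aaccgggtgacggtggtgctcggtgcgcagtggggcgacgaaggc")
print(frames["+1"])
```

A blastx-style search would score each of these six protein strings against a protein database and report the frame of the best hit, as in the "Frame = +3" line on slide 11.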

  31. Genomic BLAST • These pages provide customized nucleotide and protein databases for each genome • If a Map Viewer is available, the BLAST hits can be viewed on the maps

  32. BLAST the Chicken Genome Program Accession for human TPO mRNA

  33. BLAST Hit on the Genome

  34. BLASTn Hit on the Map Viewer

  35. TBLASTN Results Using NP_000538

  36. Linking Protein Sequence, Structure, and Function
      • PSI-BLAST
      • RPS-BLAST: sequence → function (pfam, smart); sequence → structure + function (cd)
      • BLASTp: sequence → structure
      • VAST: structure → structure
      • Structure

  37. Position Specific Substitution Rates [Figure labels: weakly conserved serine; active site serine]

  38. Position Specific Score Matrix (PSSM)
                A   R   N   D   C   Q   E   G   H   I   L   K   M   F   P   S   T   W   Y   V
      206 D     0  -2   0   2  -4   2   4  -4  -3  -5  -4   0  -2  -6   1   0  -1  -6  -4  -1
      207 G    -2  -1   0  -2  -4  -3  -3   6  -4  -5  -5   0  -2  -3  -2  -2  -1   0  -6  -5
      208 V    -1   1  -3  -3  -5  -1  -2   6  -1  -4  -5   1  -5  -6  -4   0  -2  -6  -4  -2
      209 I    -3   3  -3  -4  -6   0  -1  -4  -1   2  -4   6  -2  -5  -5  -3   0  -1  -4   0
      210 S    -2  -5   0   8  -5  -3  -2  -1  -4  -7  -6  -4  -6  -7  -5   1  -3  -7  -5  -6
      211 S     4  -4  -4  -4  -4  -1  -4  -2  -3  -3  -5  -4  -4  -5  -1   4   3  -6  -5  -3
      212 C    -4  -7  -6  -7  12  -7  -7  -5  -6  -5  -5  -7  -5   0  -7  -4  -4  -5   0  -4
      213 N    -2   0   2  -1  -6   7   0  -2   0  -6  -4   2   0  -2  -5  -1  -3  -3  -4  -3
      214 G    -2  -3  -3  -4  -4  -4  -5   7  -4  -7  -7  -5  -4  -4  -6  -3  -5  -6  -6  -6
      215 D    -5  -5  -2   9  -7  -4  -1  -5  -5  -7  -7  -4  -7  -7  -5  -4  -4  -8  -7  -7
      216 S    -2  -4  -2  -4  -4  -3  -3  -3  -4  -6  -6  -3  -5  -6  -4   7  -2  -6  -5  -5
      217 G    -3  -6  -4  -5  -6  -5  -6   8  -6  -8  -7  -5  -6  -7  -6  -4  -5  -6  -7  -7
      218 G    -3  -6  -4  -5  -6  -5  -6   8  -6  -7  -7  -5  -6  -7  -6  -2  -4  -6  -7  -7
      219 P    -2  -6  -6  -5  -6  -5  -5  -6  -6  -6  -7  -4  -6  -7   9  -4  -4  -7  -7  -6
      220 L    -4  -6  -7  -7  -5  -5  -6  -7   0  -1   6  -6   1   0  -6  -6  -5  -5  -4   0
      221 N    -1  -6   0  -6  -4  -4  -6  -6  -1   3   0  -5   4  -3  -6  -2  -1  -6  -1   6
      222 C     0  -4  -5  -5  10  -2  -5  -5   1  -1  -1  -5   0  -1  -4  -1   0  -5   0   0
      223 Q     0   1   4   2  -5   2   0   0   0  -4  -2   1   0   0   0  -1  -1  -3  -3  -4
      224 A    -1  -1   1   3  -4  -1   1   4  -3  -4  -3  -1  -2  -2  -3   0  -2  -2  -2  -3
      Serine is scored differently in these two positions (211 and 216)
      Active site nucleophile
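A short sketch of scoring with a PSSM instead of a single substitution matrix. Only rows 211 and 214 to 219 from the table are included; the function `pssm_score` is illustrative. Which of the two serine rows is the active site is inferred from the conservation pattern around position 216, not stated row by row on the slide.

```python
AA = "ARNDCQEGHILKMFPSTWYV"  # column order of the PSSM

# Selected rows copied from the slide, keyed by query position.
PSSM = {
    211: [4, -4, -4, -4, -4, -1, -4, -2, -3, -3, -5, -4, -4, -5, -1, 4, 3, -6, -5, -3],
    214: [-2, -3, -3, -4, -4, -4, -5, 7, -4, -7, -7, -5, -4, -4, -6, -3, -5, -6, -6, -6],
    215: [-5, -5, -2, 9, -7, -4, -1, -5, -5, -7, -7, -4, -7, -7, -5, -4, -4, -8, -7, -7],
    216: [-2, -4, -2, -4, -4, -3, -3, -3, -4, -6, -6, -3, -5, -6, -4, 7, -2, -6, -5, -5],
    217: [-3, -6, -4, -5, -6, -5, -6, 8, -6, -8, -7, -5, -6, -7, -6, -4, -5, -6, -7, -7],
    218: [-3, -6, -4, -5, -6, -5, -6, 8, -6, -7, -7, -5, -6, -7, -6, -2, -4, -6, -7, -7],
    219: [-2, -6, -6, -5, -6, -5, -5, -6, -6, -6, -7, -4, -6, -7, 9, -4, -4, -7, -7, -6],
}


def pssm_score(fragment, start):
    """Score a fragment aligned (ungapped) to PSSM positions start, start+1, ..."""
    return sum(PSSM[start + i][AA.index(res)] for i, res in enumerate(fragment))


ser_211 = PSSM[211][AA.index("S")]
ser_216 = PSSM[216][AA.index("S")]
motif = pssm_score("GDSGGP", 214)
print(ser_211, ser_216, motif)
```

At position 211 serine scores the same as alanine and close to threonine, while at position 216 serine is strongly favored and every other residue is penalized.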

  39. PSI-BLAST: Create your own PSSM. Confirming relationships of purine nucleotide metabolism proteins [Diagram labels: query; BLOSUM62; PSSM; Alignment; Alignment]

  40. >gi|113340|sp|P03958|ADA_MOUSE ADENOSINE DEAMINASE (ADENOSINE AMINOH
      MAQTPAFNKPKVELHVHLDGAIKPETILYFGKKRGIALPADTVEELRNIIGMDKPLSLPGFLAKFDYY
      VIAGCREAIKRIAYEFVEMKAKEGVVYVEVRYSPHLLANSKVDPMPWNQTEGDVTPDDVVDLVNQGLQ
      EQAFGIKVRSILCCMRHQPSWSLEVLELCKKYNQKTVVAMDLAGDETIEGSSLFPGHVEAYEGAVKNG
      RTVHAGEVGSPEVVREAVDILKTERVGHGYHTIEDEALYNRLLKENMHFEVCPWSSYLTGAWDPKTTH
      VRFKNDKANYSLNTDDPLIFKSTLDTDYQMTKKDMGFTEEEFKRLNINAAKSSFLPEEEKKELLERLY
      PSI BLAST e value cutoff for PSSM

  41. PSI Results: Initial BLAST Run

  42. First PSSM Search Other purine nucleotide metabolizing enzymes not found by ordinary BLAST

  43. Third PSSM Search: Convergence Just below threshold, another nucleotide metabolism enzyme

  44. Entrez Domains (CDD)
      Single Domains
      • Pfam (pfam01234; Sanger): Pfam-A seeds: HMM based models representing a wide variety of functional domains derived from SWISS-PROT
      • SMART (smart00123; EMBL): HMM based models originally concentrating on eukaryotic signaling domains, now expanding
      • CD (cd01234; NCBI): NCBI curated domains based on sequence and structural alignments
      Protein Families
      • COG (COG0123; NCBI): BLAST based alignments derived from complete proteomes of prokaryotes

  45. Protein Links: Domains

  46. Results of a CD-Search: Click on a colored bar to align your sequence to the CD [Figure labels: CD; SMART; Pfam]

  47. CDD Record – heme peroxidases [Figure labels: red = high conservation; blue = low conservation; aligned query]

  48. Curated CD Record: Curated CDs (cd12345) are based on sequence and structure alignments [Figure labels: structural evidence; annotated features; aligned query]

  49. Blink: Sequence to Structure related structures

  50. Cn3D Related Structures
