A Field Guide part 2. University of Colorado Health Sciences Center. August 30, 2005. Part 2. Entrez: text searching a GenBank record preview/index. BLAST: sequence searching pre-computed searches algorithms what’s new?. VAST: structure searching.
A Field Guide, part 2. University of Colorado Health Sciences Center. August 30, 2005
Part 2 Entrez: text searching • a GenBank record • preview/index BLAST: sequence searching • pre-computed searches • algorithms • what’s new? VAST: structure searching • Example: mapping oligos to a genome
GenBank Records: The Flatfile Format • Header • Feature Table • Sequence
A Typical GenBank Record
LOCUS       NM_019570   4279 bp   mRNA   linear   INV 28-OCT-2004
DEFINITION  Mus musculus REV1-like (S. cerevisiae) (Rev1l), mRNA   (= Title)
ACCESSION   NM_019570
VERSION     NM_019570.3  GI:50811869
KEYWORDS    .
GenBank Record: Feature Table, cont. GenPept identifier
GenBank Record: sequence
Indexing for Nucleotide UID 59958365
Field abbreviations shown: [accn] [orgn] [mdat] [prop]
Field: Indexed Terms
• [primary accession]: NM_001012399
• [title]: Bos taurus hemochromatosis (hfe), mRNA.
• [organism]: Bos taurus
• [sequence length]: 1168
• [modification date]: 2005/02/19
• [properties]: biomol mrna; gbdiv mam; srcdb refseq
[Title] Entrez Nucleotide: HFE 137 records Not HFE
Smarter Query: hfe[title] AND human[orgn] • 42 records • Curated HFE splice variants (11 total)
hfe[title] AND human[orgn] (cont.) Primary data
srcdb Preview/Index: Properties, srcdb Properties
Preview/Index: Properties, srcdb …AND srcdb refseq[Properties]
Preview/Index: Properties, srcdb …AND srcdb ddbj/embl/genbank[Properties]
Database Queries
• Primate division: gbdiv pri[prop]
• EST division: gbdiv est[prop]
#1 hfe (137 records)
#2 hfe[title] AND human[orgn] (42 records)
#3 #2 AND srcdb refseq[prop] (11 records)
#4 #2 AND srcdb ddbj/embl/genbank[prop] (31 records)
#5 #4 AND gbdiv pri[prop] (29 records)
#6 #4 AND gbdiv est[prop] (2 records)
Molecule Queries
• Genomic DNA: biomol genomic[prop]
• cDNA: biomol mrna[prop]
#1 hfe (116 records)
#2 hfe[title] AND human[orgn] (42 records)
#3 #2 AND biomol mrna[prop] (29 records)
#4 #2 AND biomol genomic[prop] (13 records)
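The numbered history lines above build on one another, so each "#n" can be expanded into one full query string. The sketch below does that expansion and then formats the result as a search URL. It assumes the NCBI E-utilities ESearch endpoint and the database name "nuccore", neither of which appears in the slides; the helper names are hypothetical, and nothing is sent over the network.

```python
import re
from urllib.parse import urlencode

# Assumption: the public E-utilities ESearch endpoint (not mentioned in the slides).
ESEARCH = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi"

# Query history as shown on the "Database Queries" slide (numbering is ours).
history = {
    1: "hfe",
    2: "hfe[title] AND human[orgn]",
    3: "#2 AND srcdb refseq[prop]",
    4: "#2 AND srcdb ddbj/embl/genbank[prop]",
    5: "#4 AND gbdiv pri[prop]",
    6: "#4 AND gbdiv est[prop]",
}

def expand(query, history):
    """Replace every '#n' with the parenthesised, fully expanded query n."""
    def substitute(match):
        return "(" + expand(history[int(match.group(1))], history) + ")"
    return re.sub(r"#(\d+)", substitute, query)

def esearch_url(term, db="nuccore"):
    """Build (but do not fetch) an ESearch URL for the given term."""
    return ESEARCH + "?" + urlencode({"db": db, "term": term})

term5 = expand(history[5], history)
print(term5)
print(esearch_url(expand(history[2], history)))
```

Expanding the references keeps the Boolean grouping explicit, which matters once AND, OR and NOT are mixed in one query.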
More Queries… (Fields are database-specific)
Entrez Nucleotide
• Reviewed RefSeqs with transcript variants: srcdb refseq reviewed[prop] AND transcript[title] AND variant[title]
Entrez Gene
• Topoisomerase genes from Archaea: topoisomerase[gene name] AND archaea[organism]
• Genes on human chromosome 2 with OMIM links: 2[chromosome] AND human[organism] AND “gene omim”[filter]
• Membrane proteins linked to cancer: “integral to plasma membrane”[gene ontology] AND cancer[dis]
Other Entrez Databases
• UniGene: rat clusters that have at least one mRNA: rat[organism] NOT 0[mrna count]
• SNP: uniquely mapped microsatellites on human chr2: microsat[SNP Class] AND 1[Map Weight] AND 2[Chromosome] AND human[orgn]
• UniSTS: markers on the Genethon map of human chromosome 12: Genethon[Map Name] AND human[organism] AND 12[chromosome]
• Structure: structures of bacterial kinases with resolutions below 2 Å: bacteria[organism] AND kinase AND 000.00:002.00[resolution]
BLAST Web Searches, 2005 200,000
Precomputed BLAST Services • Nucleotide or protein: Related Sequences • BLAST link: BLink • Transcript clusters: UniGene • Protein homologs: HomoloGene
Related Sequences Most similar Least similar
Best hits 3D structures CDD-Search BLink Output
Global vs Local Alignment (figure: Seq 1 and Seq 2 drawn once as a global alignment and once as a local alignment)
Global vs Local Alignment
Seq1: WHEREISWALTERNOW (16 aa)
Seq2: HEWASHEREBUTNOWISHERE (21 aa)
Global
Seq1: 1 W--HEREISWALTERNOW 16
        W  HERE
Seq2: 1 HEWASHEREBUTNOWISHERE 21
Local
Seq1: 1 W--HERE 5
        W  HERE
Seq2: 3 WASHERE 9
Seq1: 1 W--HERE 5
        W  HERE
Seq2: 15 WISHERE 21
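To make the global/local distinction concrete, here is a minimal sketch that scores the two example sequences both ways with textbook dynamic programming (Needleman-Wunsch for global, Smith-Waterman for local). The scoring values (+1 match, -1 mismatch, -1 per gap position) are illustrative assumptions, not values from the slides, and this is not how BLAST itself computes alignments.

```python
def global_score(a, b, match=1, mismatch=-1, gap=-1):
    """Needleman-Wunsch: best score aligning all of a against all of b."""
    prev = [j * gap for j in range(len(b) + 1)]
    for i in range(1, len(a) + 1):
        curr = [i * gap] + [0] * len(b)
        for j in range(1, len(b) + 1):
            diag = prev[j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            curr[j] = max(diag, prev[j] + gap, curr[j - 1] + gap)
        prev = curr
    return prev[-1]

def local_score(a, b, match=1, mismatch=-1, gap=-1):
    """Smith-Waterman: best score over all pairs of substrings of a and b."""
    prev = [0] * (len(b) + 1)
    best = 0
    for i in range(1, len(a) + 1):
        curr = [0] * (len(b) + 1)
        for j in range(1, len(b) + 1):
            diag = prev[j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            # The floor of 0 lets a local alignment restart anywhere.
            curr[j] = max(0, diag, prev[j] + gap, curr[j - 1] + gap)
            best = max(best, curr[j])
        prev = curr
    return best

seq1 = "WHEREISWALTERNOW"
seq2 = "HEWASHEREBUTNOWISHERE"
print("global:", global_score(seq1, seq2))
print("local: ", local_score(seq1, seq2))
```

With the same scoring scheme the local score can never be lower than the global one, because the global alignment is just one of the candidates a local search considers.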
The Flavors of BLAST • Standard BLAST • nucleotide, protein and translations (blastn, blastp, blastx, tblastn, tblastx) • traditional “contiguous” word hit • Megablast • optimized for large batch searches • can use discontiguous words • PSI-BLAST • constructs PSSMs automatically; uses as query • very sensitive protein search • RPS BLAST • searches a database of PSSMs • tool for conserved domain searches “contiguous” discontiguous
Why Is BLAST So Popular? • Fast - heuristic approach based on Smith-Waterman • Local alignments • Statistical significance - Expect value • Versatile - blastn, blastp, blastx, tblastn, tblastx, rps-blast, psi-blast - www, standalone, and network clients
How BLAST Works • Make lookup table of “words” for query • Scan database for hits • Ungapped extensions of hits (initial HSPs) • Gapped extensions (no traceback) • Gapped extensions (traceback; alignment details)
Nucleotide Words • Query: GTACTGGACATGGACCCTACAGGAA • 11-mer words: GTACTGGACAT, TACTGGACATG, ACTGGACATGG, CTGGACATGGA, TGGACATGGAC, GGACATGGACC, GACATGGACCC, ACATGGACCCT, … • Make a lookup table of words
Protein Words • Query: GTQITVEDLFYNIATRRKALKN • Word size = 3 (default); word size can only be 2 or 3 • Words: GTQ, TQI, QIT, ITV, TVE, VED, EDL, DLF, ... • Neighborhood words: LTV, MTV, ISV, LSV, etc. • Make a lookup table of words [ -f 11 = blastp default ]
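The word-listing step can be sketched in a few lines: slide a window of the word size along the query and record where each word starts. The function names are hypothetical, and the real BLAST lookup table is a more compact structure; this only illustrates the idea using the two queries from the slides.

```python
from collections import defaultdict

def words(seq, w):
    """All overlapping words of length w, in query order."""
    return [seq[i:i + w] for i in range(len(seq) - w + 1)]

def lookup_table(seq, w):
    """Map each word to the list of 0-based query offsets where it starts."""
    table = defaultdict(list)
    for offset, word in enumerate(words(seq, w)):
        table[word].append(offset)
    return table

nt_query = "GTACTGGACATGGACCCTACAGGAA"
aa_query = "GTQITVEDLFYNIATRRKALKN"

print(words(nt_query, 11)[:3])   # first nucleotide 11-mers
print(words(aa_query, 3)[:8])    # first protein 3-mers
print(dict(lookup_table(aa_query, 3))["ITV"])
```

A query of length L yields L - w + 1 words, so the 25-base nucleotide query gives 15 words of size 11.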
Minimum Requirements for a Hit • Nucleotide BLAST requires one exact match (example: word CATGCTTAATT within ATCGCCATGCTTAATTGGGCTT) • Protein BLAST requires two neighboring matches within 40 aa [ -A 40 = blastp default ] (example: neighborhood words SEI and YYN as two matches against GTQITVEDLFYNI)
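The two-match requirement for protein searches can be illustrated as follows. This is a simplified sketch, not the NCBI implementation: it assumes word hits are given as (query offset, subject offset) pairs, treats two hits as "neighboring" when they lie on the same diagonal, do not overlap, and start at most `window` residues apart, and all names are hypothetical.

```python
from collections import defaultdict

def two_hit_pairs(hits, word_size=3, window=40):
    """Return (diagonal, first_query_pos, second_query_pos) for qualifying pairs.

    hits: iterable of (query_pos, subject_pos) word hits.
    """
    by_diagonal = defaultdict(list)
    for q, s in hits:
        by_diagonal[s - q].append(q)   # same diagonal = same offset between sequences

    pairs = []
    for diagonal, positions in by_diagonal.items():
        positions.sort()
        for first, second in zip(positions, positions[1:]):
            distance = second - first
            # Non-overlapping (at least one word apart) and within the window.
            if word_size <= distance <= window:
                pairs.append((diagonal, first, second))
    return pairs

example_hits = [(0, 10), (20, 30), (5, 50), (70, 80)]
print(two_hit_pairs(example_hits))
```

In this toy input only the hits at query positions 0 and 20 qualify: they share a diagonal and are 20 residues apart, while the hit at 70 is too far away and the hit on the other diagonal has no partner.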
BLASTP Summary
Query: IETVYAAYLPKNTHPFLYLSLEISPQNVDVNVHPTKHEVHFLHEESILEV…
Example query words: YLS, HFL
Neighborhood words (neighborhood score threshold T (-f) = 11):
• YLS 15, YLT 12, YVS 12, YIT 10, etc …
• HFL 18, HFV 15, HFS 14, HWL 13, NFL 13, DFL 12, HWV 10, etc …
High-scoring pair (HSP), word hits YLS and HFL:
Query 1 IETVYAAYLPKNTHPFLYLSLEISPQNVDVNVHPTKHEVHFLHEESI 47
+E YA YL K F+L +SP+ +DVNVHP+K V+++ I
Sbjct 287 LEETYAKYLHKGASYFVYLSLNMSPEQLDVNVHPSKRIVHFLYDQEI 333
Final HSP (gapped extension with traceback):
Query 1 IETVYAAYLPKNTHPFLYLSLEISPQNVDVNVHPTKHEVHFLHEESI-LEV… 50
+E YA YL K F+YLSL +SP+ +DVNVHP+K VHFL+++ I + +
Sbjct 287 LEETYAKYLHKGASYFVYLSLNMSPEQLDVNVHPSKRIVHFLYDQEIATSI… 337
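The scores listed for the YLS neighborhood can be reproduced by summing BLOSUM62 entries position by position. The sketch below hard-codes only the six matrix entries it needs, read off the BLOSUM62 table in this deck; the function names are hypothetical, and a real search would enumerate all possible 3-letter words rather than a short candidate list.

```python
# Only the BLOSUM62 entries needed for this example (subset, not the full matrix).
BLOSUM62_SUBSET = {
    ("Y", "Y"): 7, ("L", "L"): 4, ("S", "S"): 4,
    ("L", "V"): 1, ("L", "I"): 2, ("S", "T"): 1,
}

def pair_score(a, b):
    """Symmetric lookup: the matrix stores each residue pair once."""
    if (a, b) in BLOSUM62_SUBSET:
        return BLOSUM62_SUBSET[(a, b)]
    return BLOSUM62_SUBSET[(b, a)]

def word_score(query_word, candidate):
    """Sum of substitution scores over aligned positions."""
    return sum(pair_score(a, b) for a, b in zip(query_word, candidate))

def neighborhood(query_word, candidates, threshold=11):
    """Keep candidates scoring at least the neighborhood threshold T."""
    return [w for w in candidates if word_score(query_word, w) >= threshold]

candidates = ["YLS", "YLT", "YVS", "YIT"]
scores = {w: word_score("YLS", w) for w in candidates}
print(scores)
print(neighborhood("YLS", candidates, threshold=11))
```

With T = 11, YIT (score 10) falls below the threshold and would not be entered in the lookup table, while YLT and YVS (score 12) would.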
Scoring Systems - Nucleotides
Identity matrix [ -r 1 -q -3 ]
     A   G   C   T
A   +1  -3  -3  -3
G   -3  +1  -3  -3
C   -3  -3  +1  -3
T   -3  -3  -3  +1
CAGGTAGCAAGCTTGCATGTCA
|| ||||||||||||  |||||
CACGTAGCAAGCTTG-GTGTCA
raw score = 19 - 9 = 10
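The raw score on this slide can be checked by walking down the alignment columns. The match reward (+1) and mismatch penalty (-3) come from the slide; the gap penalty of -3 is an assumption inferred from the arithmetic (19 matches, two mismatches and one gap position must total 10), not a stated BLAST setting.

```python
def raw_score(top, bottom, match=1, mismatch=-3, gap=-3):
    """Score an existing pairwise alignment column by column."""
    if len(top) != len(bottom):
        raise ValueError("aligned sequences must have equal length")
    score = 0
    for a, b in zip(top, bottom):
        if a == "-" or b == "-":
            score += gap        # assumed flat cost per gap position
        elif a == b:
            score += match
        else:
            score += mismatch
    return score

top = "CAGGTAGCAAGCTTGCATGTCA"
bottom = "CACGTAGCAAGCTTG-GTGTCA"
print(raw_score(top, bottom))   # 19 matches - 2*3 mismatches - 3 gap = 10
```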
Scoring Systems - Proteins • Position Independent Matrices • PAM Matrices (Percent Accepted Mutation) • Derived from observation; small dataset of alignments • Implicit model of evolution • All calculated from PAM1 • PAM250 widely used • BLOSUM Matrices (BLOck SUbstitution Matrices) • Derived from observation; large dataset of highly conserved blocks • Each matrix derived separately from blocks with a defined percent identity cutoff • BLOSUM62 - default matrix for BLAST • Position Specific Score Matrices (PSSMs) • PSI- and RPS-BLAST
BLOSUM62
Negative for less likely substitutions (e.g. D/F); positive for more likely substitutions (e.g. Y/F)
A  4
R -1  5
N -2  0  6
D -2 -2  1  6
C  0 -3 -3 -3  9
Q -1  1  0  0 -3  5
E -1  0  0  2 -4  2  5
G  0 -2  0 -1 -3 -2 -2  6
H -2  0  1 -1 -3  0  0 -2  8
I -1 -3 -3 -3 -1 -3 -3 -4 -3  4
L -1 -2 -3 -4 -1 -2 -3 -4 -3  2  4
K -1  2  0 -1 -3  1  1 -2 -1 -3 -2  5
M -1 -1 -2 -3 -1  0 -2 -3 -2  1  2 -1  5
F -2 -3 -3 -3 -2 -3 -3 -3 -1  0  0 -3  0  6
P -1 -2 -2 -1 -3 -1 -1 -2 -2 -3 -3 -1 -2 -4  7
S  1 -1  1  0 -1  0  0  0 -1 -2 -2  0 -1 -2 -1  4
T  0 -1  0 -1 -1 -1 -1 -2 -2 -1 -1 -1 -1 -2 -1  1  5
W -3 -3 -4 -4 -2 -2 -3 -2 -2 -3 -2 -3 -1  1 -4 -3 -2 11
Y -2 -2 -2 -3 -2 -1 -2 -3  2 -1 -1 -2 -1  3 -3 -2 -2  2  7
V  0 -3 -3 -3 -1 -2 -2 -3 -3  3  1 -2  1 -1 -2 -2  0 -3 -1  4
X  0 -1 -1 -1 -2 -1 -1 -1 -1 -1 -1 -1 -1 -1 -2  0  0 -2 -1 -1 -1
   A  R  N  D  C  Q  E  G  H  I  L  K  M  F  P  S  T  W  Y  V  X
Position-Specific Score Matrix: Serine/Threonine protein kinases, catalytic loop, DAF-1 (figure: PSSM scores shown along the alignment)
Position-Specific Score Matrix (catalytic loop)
        A   R   N   D   C   Q   E   G   H   I   L   K   M   F   P   S   T   W   Y   V
435 K  -1   0   0  -1  -2   3   0   3   0  -2  -2   1  -1  -1  -1  -1  -1  -1  -1  -2
436 E   0   1   0   2  -1   0   2  -1   0  -1  -1   0   0   0  -1   0   0  -1  -1  -1
437 S   0   0  -1   0   1   1   0   1   1   0  -1   0   0   0   2   0  -1  -1   0  -1
438 N  -1   0  -1  -1   1   0  -1   3   3  -1  -1   1  -1   0   0  -1  -1   1   1  -1
439 K  -2   1   1  -1  -2   0  -1  -2  -2  -1  -2   5   1  -2  -2  -1  -1  -2  -2  -1
440 P  -2  -2  -2  -2  -3  -2  -2  -2  -2  -1  -2  -1   0  -3   7  -1  -2  -3  -1  -1
441 A   3  -2   1  -2   0  -1   0   1  -2  -2  -2   0  -1  -2   3   1   0  -3  -3   0
442 M  -3  -4  -4  -4  -3  -4  -4  -5  -4   7   0  -4   1   0  -4  -4  -2  -4  -1   2
443 A   4  -4  -4  -4   0  -4  -4  -3  -4   4  -1  -4  -2  -3  -4  -1  -2  -4  -3   4
444 H  -4  -2  -1  -3  -5  -2  -2  -4  10  -6  -5  -3  -4  -3  -2  -3  -4  -5   0  -5
445 R  -4   8  -3  -4   0  -1  -2  -3  -2  -5  -4   0  -3  -2  -4  -3  -3   0  -4  -5
446 D  -4  -4  -1   8  -6  -2   0  -3  -3  -5  -6  -3  -5  -6  -4  -2  -3  -7  -5  -5
447 I  -4  -5  -6  -6  -3  -4  -5  -6  -5   3   5  -5   1   1  -5  -5  -3  -4  -3   1
448 K   0   0   1  -3  -5  -1  -1  -3  -3  -5  -5   7  -4  -5  -3  -1  -2  -5  -4  -4
449 S   0  -3  -2  -3   0  -2  -2  -3  -3  -4  -4  -2  -4  -5   2   6   2  -5  -4  -4
450 K   0   3   0   1  -5   0   0  -4  -1  -4  -3   4  -3  -2   2   1  -1  -5  -4  -4
451 N  -4  -3   8  -1  -5  -2  -2  -3  -1  -6  -6  -2  -4  -5  -4  -1  -2  -6  -4  -5
452 I  -3  -5  -5  -6   0  -5  -5  -6  -5   6   2  -5   2  -2  -5  -4  -3  -5  -3   3
453 M  -4  -4  -6  -6  -3  -4  -5  -6  -5   0   6  -5   1   0  -5  -4  -3  -4  -3   0
454 V  -3  -3  -5  -6  -3  -4  -5  -6  -5   3   3  -4   2  -2  -5  -4  -3  -5  -3   5
455 K  -2   1   1   4  -5   0  -1  -2   1  -4  -2   4  -3  -2  -3   0  -1  -5  -2  -3
456 N   1   1   3   0  -4  -1   1   0  -3  -4  -4   3  -2  -5  -2   2  -2  -5  -4  -4
457 D  -3  -2   5   5  -1  -1   1  -1   0  -5  -4   0  -2  -5  -1   0  -2  -6  -4  -5
458 L  -3  -1   0  -3   0  -3  -2   3  -4  -2   3   0   1   1  -2  -2  -3   5  -1  -3
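Unlike a position-independent matrix, a PSSM gives each residue a different score at each position. A small sketch using only rows 444 to 446 of the matrix above (the H-R-D stretch of the catalytic loop) shows how a peptide is scored against it; the function name and the example peptides are illustrative assumptions.

```python
COLUMNS = "ARNDCQEGHILKMFPSTWYV"

# Rows 444 (H), 445 (R) and 446 (D), copied from the PSSM on this slide.
PSSM_ROWS = [
    [-4, -2, -1, -3, -5, -2, -2, -4, 10, -6, -5, -3, -4, -3, -2, -3, -4, -5, 0, -5],
    [-4, 8, -3, -4, 0, -1, -2, -3, -2, -5, -4, 0, -3, -2, -4, -3, -3, 0, -4, -5],
    [-4, -4, -1, 8, -6, -2, 0, -3, -3, -5, -6, -3, -5, -6, -4, -2, -3, -7, -5, -5],
]

def pssm_score(peptide, rows=PSSM_ROWS):
    """Sum the row-specific score of each residue; peptide length must match."""
    if len(peptide) != len(rows):
        raise ValueError("peptide length must equal number of PSSM rows")
    return sum(row[COLUMNS.index(res)] for res, row in zip(peptide, rows))

for peptide in ("HRD", "HKD", "HRE"):
    print(peptide, pssm_score(peptide))
```

The conserved residues dominate: replacing R with K at position 445 drops the contribution of that position from 8 to 0, a much larger drop than the step from 5 (R/R) to 2 (R/K) that BLOSUM62 assigns to the same substitution.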
Local Alignment Statistics
High scores of local alignments between two random sequences follow the Extreme Value Distribution.
Expect Value: E = number of database hits you expect to find by chance with score ≥ S (your score)
$E = K m n e^{-\lambda S}$ or $E = m n 2^{-S'}$
• $K$ = scale for search space
• $\lambda$ = scale for scoring system
• $S'$ = bit score = $(\lambda S - \ln K)/\ln 2$
(applies to ungapped alignments)
Figure: number of alignments (expected number of random hits) plotted against score (S)
More info: www.ncbi.nlm.nih.gov/BLAST/tutorial/Altschul-1.html
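The two forms of the Expect value are algebraically the same, which a short numerical check makes visible. In this sketch the values λ = 0.3176 and K = 0.134 are assumed to be the commonly quoted ungapped BLOSUM62 parameters, and the raw score, query length m and database length n are made-up illustrative numbers, none of which come from the slides.

```python
import math

def bit_score(raw, lam, K):
    """S' = (lambda * S - ln K) / ln 2"""
    return (lam * raw - math.log(K)) / math.log(2)

def evalue_from_raw(raw, m, n, lam, K):
    """E = K * m * n * exp(-lambda * S)"""
    return K * m * n * math.exp(-lam * raw)

def evalue_from_bits(bits, m, n):
    """E = m * n * 2^(-S')"""
    return m * n * 2.0 ** (-bits)

# Assumed parameters and illustrative search-space sizes.
lam, K = 0.3176, 0.134
raw, m, n = 100, 250, 1e8

bits = bit_score(raw, lam, K)
print("bit score:", round(bits, 1))
print("E (raw form): ", evalue_from_raw(raw, m, n, lam, K))
print("E (bits form):", evalue_from_bits(bits, m, n))
```

Bit scores fold λ and K into the score itself, so they can be compared across searches that used different scoring systems.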
Gapped Alignments • Gapping provides more biologically realistic alignments • Gapped BLAST parameters are simulated for each scoring matrix • Affine gap costs = -(a+bk) • a = gap open penalty b = gap extend penalty • A gap of length 1 receives the score -(a+b)
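The affine gap formula is easy to tabulate. The sketch below evaluates -(a + bk) for a few gap lengths; the values a = 11 and b = 1 are assumed here as typical blastp gap costs and are not stated on the slide.

```python
def affine_gap_score(k, a=11, b=1):
    """Score of one gap of length k: -(a + b*k), with a = open, b = extend."""
    if k < 1:
        raise ValueError("gap length must be at least 1")
    return -(a + b * k)

for k in (1, 2, 5, 10):
    print(k, affine_gap_score(k))
```

Because the opening cost is paid once per gap, one gap of length 10 (score -21 under these values) is penalised far less than ten separate gaps of length 1 (score -120), which favors fewer, longer gaps.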
An Alignment BLAST Cannot Make
1 GAATATATGAAGACCAAGATTGCAGTCCTGCTGGCCTGAACCACGCTATTCTTGCTGTTG
|| | || || || | || || || || | ||| |||||| | | || | ||| |
1 GAGTGTACGATGAGCCCGAGTGTAGCAGTGAAGATCTGGACCACGGTGTACTCGTTGTCG
61 GTTACGGAACCGAGAATGGTAAAGACTACTGGATCATTAAGAACTCCTGGGGAGCCAGTT
| || || || ||| || | |||||| || | |||||| ||||| | |
61 GCTATGGTGTTAAGGGTGGGAAGAAGTACTGGCTCGTCAAGAACAGCTGGGCTGAATCCT
121 GGGGTGAACAAGGTTATTTCAGGCTTGCTCGTGGTAAAAAC
|||| || ||||| || || | | |||| || |||
121 GGGGAGACCAAGGCTACATCCTTATGTCCCGTGACAACAAC
Reason: no contiguous exact match of 7 bp.
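The stated reason can be checked directly by measuring the longest run of identical columns in the alignment. The sketch below does this for the first 60-column block only (the other two blocks can be checked the same way); the function name is hypothetical.

```python
def longest_exact_run(top, bottom):
    """Length of the longest stretch of consecutive identical, ungapped columns."""
    best = run = 0
    for a, b in zip(top, bottom):
        run = run + 1 if (a == b and a != "-") else 0
        best = max(best, run)
    return best

# First 60 columns of the alignment on this slide.
top_1 = "GAATATATGAAGACCAAGATTGCAGTCCTGCTGGCCTGAACCACGCTATTCTTGCTGTTG"
bottom_1 = "GAGTGTACGATGAGCCCGAGTGTAGCAGTGAAGATCTGGACCACGGTGTACTCGTTGTCG"

run = longest_exact_run(top_1, bottom_1)
print(run, "identical columns in a row; a blastn word of 7 needs", 7)
```

In this block the longest run is the 6-base stretch ACCACG, one short of the minimum blastn word size of 7, so no word hit seeds an extension here.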
An Alignment BLAST Can Make
Solution: compare protein sequences; BLASTX
BLAST 2 Sequences (blastx) output:
Score = 290 bits (741), Expect = 7e-77
Identities = 147/331 (44%), Positives = 206/331 (61%), Gaps = 8/331 (2%)
Frame = +3
Other BLAST Algorithms • Megablast • Discontiguous Megablast • PSI-BLAST • PHI-BLAST
Megablast: NCBI’s Genome Annotator • Long alignments of similar DNA sequences • Greedy algorithm • Concatenation of query sequences • Faster than blastn; less sensitive
Too fast for you? MegaBLAST & Word Size Trade-off: sensitivity vs speed
MegaBLAST & Word Size
Trade-off: sensitivity vs speed
WORD SIZE    default   minimum
blastp          3         2
blastn         11         7
megablast      28         8