NCBI Molecular Biology Resources

NCBI Molecular Biology Resources: A Field Guide, Part 2 (January 12, 2007)

Web Access: Text (Entrez), Sequence (BLAST), Structure (VAST)

Genomic Biology

Eukaryotic Projects

Mammals with Genomes

Links to Genomic Sequences Dog WGS project Dog chromosomes Armadillo WGS project

Canis familiaris Genome Project

Whole Genome Shotgun Projects • Filed in the traditional GenBank divisions; keyword: WGS • 12-character accessions (e.g., AABC00000000) • 424 projects • 299 taxa • 203 bacteria • 91 eukaryotes • 40 fungi • 34 animals

From WGS to RefSeq?

Chromosome 17 Assembly WGS contigs (NW) separated by gaps

A WGS Contig (NW) [Diagram labels: WGS, NW, NC]

Let’s Do a Search! Search Gene for thyroid peroxidase (TPO): tpo[sym] AND canis familiaris[organism] • Use [gene/protein name] if [sym] doesn’t work • Only 1 record! Why not start all your NCBI searches this way?
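The Gene query above can also be written as a request URL. This is a minimal sketch that assumes NCBI's E-utilities `esearch` endpoint; the slides themselves only show the Entrez web form. The helper name `gene_query_url` is hypothetical, and the code only builds the URL without contacting the server.

```python
from urllib.parse import urlencode

# Assumption: the E-utilities esearch endpoint is used instead of the web form.
ESEARCH = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi"


def gene_query_url(symbol, organism):
    """Build an Entrez Gene search URL restricted by gene symbol and organism."""
    # Same field tags as on the slide: [sym] and [organism]
    term = f"{symbol}[sym] AND {organism}[organism]"
    return ESEARCH + "?" + urlencode({"db": "gene", "term": term})


url = gene_query_url("tpo", "canis familiaris")
print(url)
```

Swapping `[sym]` for `[gene/protein name]` in the term gives the fallback search mentioned on the slide.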

Gene Annotation on a Chromosome (NC) thyroid peroxidase (TPO) exons mRNA CDS protein

Two Views of the Assembly NC_006599 Map Viewer NW WGS

Links from Gene Homologene

Beyond Gene If your organism is not in Gene… • WGS sequences in Entrez Nucleotide (wgs[prop]) • Remember the armadillo! • UniGene: gene-based clusters of cDNAs and ESTs • Trace Archive

What is UniGene? A gene-oriented view of sequence entries • MegaBlast based automated sequence clustering • Now informed by genome hits • Nonredundant set of gene oriented clusters • Each cluster a unique gene • Information on tissue types and map locations • Includes known genes and uncharacterized ESTs • Useful for gene discovery and selection of mapping reagents

Organisms in UniGene Top Ten 1. Human 2. Mouse 3. Rat 4. Rice 5. Zebrafish 6. Wheat 7. Pig 8. Frog (X. tropicalis) 9. Corn 10. Chicken

Finding UniGene Clusters by link human TPO by Entrez search

UniGene Cluster for TPO

Trace Archive Find the link under Hot Spots on the Home Page New query builder!

Short-tailed opossum traces

Viewing Simple Genomes All are RefSeq NC records in Entrez Genome • Full chromosomal sequences are provided • Genes are annotated • The annotation can be shown graphically and linked to sequence records

NC_000913 in Entrez Genome Case sensitive! mutL

NC_000913 in Entrez Genome Links to Entrez Gene and Protein

Viewing Complex Genomes: NCBI Map Viewer
• Map Viewer Home Page
  • Shows all supported organisms
  • Provides links to genomic BLAST
• Genome Overview Page
  • Provides links to individual chromosomes
  • Shows hits on a genome graphically
• Chromosome Viewing Page
  • Allows interactive views of annotation details
  • Provides numerous maps unique to each genome

Map Viewer Home Page

Genome Overview Page Species-specific help! Search the maps Genomic BLAST

Chromosome Viewing Page Zooming Controls Map Summary Add or remove maps Master Map with exploded content Genes UniGene Contigs Ideogram

Map Summary Build 35 Build 36 TPO’s contig!

Map Content Map content varies greatly by species! • Sequence Maps • Core assembly • Annotation evidence • Clones & Markers • Polymorphisms • Links & Features • Genetic Maps • Cytogenetic maps • Linkage maps • Radiation hybrid maps Assembly Contig Component Transcript Gene

View the Assembly near TPO

Assembly of Chr. 2 NT_022221 1255072 3507187

Assembly of Chromosome 2 Links to Tools and Data Links to Entrez Gene Links to Entrez Nucleotide

Why do we need similarity searching? Searching with Sequences • To identify and annotate sequences with… • incomplete (or no) annotations (GenBank) • incorrect annotations • To assemble genomes • To explore evolutionary relationships by… • finding homologous molecules • developing phylogenetic trees • NOTE: Similar sequences may NOT have similar function!

Basic Local Alignment Search Tool • Widely used similarity search tool • Heuristic approach based on the Smith-Waterman algorithm • Finds best local alignments • Provides statistical significance • All combinations of DNA and protein query and database: • DNA vs DNA • DNA translation vs Protein • Protein vs Protein • Protein vs DNA translation • DNA translation vs DNA translation • Web, standalone, and network clients

Global vs Local Alignment [Diagram: Seq 1 and Seq 2 shown twice, once as a global alignment and once as a local alignment]

Global vs. Local Alignment

Local alignment (BLASTp)

H:  15 IAKYNFHGTAEQDLPFCKGDVLTIVAVTKDPNWYKAKNKVGREGIIPANYVQKREGVKAGTKLSLMPWFH 84
           +A + +    + DL F K D+L I+  T+   W+      GR G IP+NYV + + +++       PW+
Worm:   63 VALFQYDARTDDDLSFKKDDILEILNDTQGDWWFARHKATGRTGYIPSNYVAREKSIES------QPWYF 125

H:  85 GKITREQAERLLYPP--ETGLFLVRESTNYPGDYTLCVSCDGKVEHYRI-MYHASKLSIDEEVYFENLMQ 151
           GK+ R  AE+ L     E G FLVR+S +   D +L V  +  V+HYRI + H     I     F  L
Worm:  126 GKMRRIDAEKCLLHTLNEHGAFLVRDSESRQHDLSLSVRENDSVKHYRIQLDHGGYF-IARRRPFATLHD 194

H: 152 LVEHYTSDADGLCTRLIKPKVMEGTVAAQDEFYRSGWALNMKELKLLQTIGKGEFGDVMLGDYRGN-KVA 220
           L+ HY  +ADGLC  L  P             Y   W ++ + ++L++ IG G+FG+V  G +  N  VA
Worm:  195 LIAHYQREADGLCVNLGAPCAKSEAPQTTTFTYDDQWEVDRRSVRLIRQIGAGQFGEVWEGRWNVNVPVA 264

H: 221 VKCIK-NDATAQAFLAEASVMTQLRHSNLVQLLGVIVEEKGGLYIVTEYMAKGSLVDYLRSRGRSVLGGD 289
           VK +K   A    FLAEA +M +LRH  L+ L  V   ++  + IVTE M + +L+ +L+ RGR
Worm:  265 VKKLKAGTADPTDFLAEAQIMKKLRHPKLLSLYAVCTRDE-PILIVTELMQE-NLLTFLQRRGRQCQMPQ 332

H: 290 CLLKFSLDVCEAMEYLEGNNFVHRDLAARNVLVSEDNVAKVSDFGLT----KEASSTQDTG-KLPVKWTA 353
            L++ S  V   M YLE  NF+HRDLAARN+L++     K++DFGL     KE      TG + P+KWTA
Worm:  333 -LVEISAQVAAGMAYLEEMNFIHRDLAARNILINNSLSVKIADFGLARILMKENEYEARTGARFPIKWTA 401

H: 354 PEALREKKFSTKSDVWSFGILLWEIYSFGRVPYPRIPLKDVVPRVEKGYKMDAPDGCPPAVYEVMKNCWH 423
           PEA    +F+TKSDVWSFGILL EI +FGR+PYP +   +V+ +V+ GY+M  P GCP  +Y++M+ CW
Worm:  402 PEAANYNRFTTKSDVWSFGILLTEIVTFGRLPYPGMTNAEVLQQVDAGYRMPCPAGCPVTLYDIMQQCWR 471

H: 424 LDAAMRPSFLQLREQLEHI 443
            D   RP+F  L+ +LE +
Worm:  472 SDPDKRPTFETLQWKLEDL 492

Global alignment (Align program, Lipman and Pearson), first and last segments

human  M--------------SAIQ----------------------AAWPSGT------------ECIAKYNFHG
       M              S ..                      AA  SG.            . .A ... .
worm   MGSCIGKEDPPPGATSPVHTSSTLGRESLPSHPRIPSIGPIAASSSGNTIDKNQNISQSANFVALFQYDA
       1                 20                  40                  60

        440               450
human  REQLEHI--------KTHELHL
       . .:: .        :   ...
worm   QWKLEDLFNLDSSEYKEASINF
                   500

Nucleotide Words
Query: GTACTGGACATGGACCCTACAGGAA
Word size = 11:
GTACTGGACAT
 TACTGGACATG
  ACTGGACATGG
   CTGGACATGGA
    TGGACATGGAC
     GGACATGGACC
      GACATGGACCC
       ACATGGACCCT
        ...........
Make a lookup table of words
Minimum word size = 7 • blastn default = 11 • megablast default = 28
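The word list on this slide can be reproduced with a few lines of Python. This is a sketch of the idea only, not BLAST's actual implementation; the function name `words` and the dictionary layout of the lookup table are illustrative assumptions.

```python
def words(seq, w):
    """Return every overlapping word of length w, in query order."""
    return [seq[i:i + w] for i in range(len(seq) - w + 1)]


query = "GTACTGGACATGGACCCTACAGGAA"   # query from the slide
nt_words = words(query, 11)           # blastn default word size

# Lookup table: word -> list of 0-based start positions in the query
lookup = {}
for pos, word in enumerate(nt_words):
    lookup.setdefault(word, []).append(pos)

for word in nt_words[:8]:
    print(word)
```

A 25-base query yields 25 - 11 + 1 = 15 words; the first eight printed are the ones listed on the slide.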

Protein Words
Query: GTQITVEDLFYNIATRRKALKN
Word size = 3: GTQ TQI QIT ITV TVE VED EDL DLF ...
Neighborhood words (for ITV): LTV, MTV, ISV, LSV, etc.
Word size can be 2 or 3 (default = 3)
Make a lookup table of words
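Neighborhood words are the words whose substitution-matrix score against a query word reaches a threshold. The sketch below scores every possible 3-letter word against the query word ITV using the BLOSUM62 values for I, T and V. The threshold of 11 is an assumption chosen for illustration, and the function names are hypothetical; the slide's own examples (LTV, MTV, ISV, LSV) are shown without a threshold.

```python
from itertools import product

AA = "ARNDCQEGHILKMFPSTWYV"

# BLOSUM62 rows for I, T and V only (column order follows AA);
# enough to score any word against the query word "ITV".
ROWS = {
    "I": [-1, -3, -3, -3, -1, -3, -3, -4, -3, 4, 2, -3, 1, 0, -3, -2, -1, -3, -1, 3],
    "T": [0, -1, 0, -1, -1, -1, -1, -2, -2, -1, -1, -1, -1, -2, -1, 1, 5, -2, -2, 0],
    "V": [0, -3, -3, -3, -1, -2, -2, -3, -3, 3, 1, -2, 1, -1, -2, -2, 0, -3, -1, 4],
}


def word_score(query_word, candidate):
    """Sum the BLOSUM62 scores of the aligned letters of two words."""
    return sum(ROWS[q][AA.index(c)] for q, c in zip(query_word, candidate))


def neighborhood(query_word, threshold):
    """All words scoring at least `threshold` against query_word, best first."""
    hits = []
    for letters in product(AA, repeat=len(query_word)):
        candidate = "".join(letters)
        score = word_score(query_word, candidate)
        if score >= threshold:
            hits.append((candidate, score))
    return sorted(hits, key=lambda h: (-h[1], h[0]))


# Assumed threshold of 11, for illustration only
for word, score in neighborhood("ITV", 11):
    print(word, score)
```

With this threshold only five words qualify (ITV, ITI, VTV, LTV, VTI). MTV, ISV and LSV score 10, 9 and 7, so they enter the neighborhood only at lower thresholds.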

Initial Matches and Extensions
Nucleotide BLAST requires one exact word match (one match):
ATCGCCATGCTTAATTGGGCTT
<----CATGCTTAATT----->
Protein BLAST requires two neighboring matches within 40 aa (two matches):
GTQITVEDLFYNI
<---- neighborhood words SEI, YYN ---->
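For the nucleotide case, finding the initial exact word match amounts to scanning the subject against a table of query words. The following is a simplified sketch, not BLAST's implementation: the query string and the function name `find_seed_hits` are hypothetical, and the extension step and the protein two-hit rule are not modeled.

```python
def find_seed_hits(query, subject, w=11):
    """Return (query_pos, subject_pos) pairs where an exact w-letter word is shared."""
    # Index every word of the query by its start position
    table = {}
    for i in range(len(query) - w + 1):
        table.setdefault(query[i:i + w], []).append(i)

    # Slide a window along the subject and look each word up
    hits = []
    for j in range(len(subject) - w + 1):
        for i in table.get(subject[j:j + w], []):
            hits.append((i, j))
    return hits


subject = "ATCGCCATGCTTAATTGGGCTT"   # sequence from the slide
query = "GGCATGCTTAATTCA"            # hypothetical query containing CATGCTTAATT

hits = find_seed_hits(query, subject)
print(hits)
```

Each hit is a candidate starting point that a real search would then try to extend in both directions.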

An alignment that BLAST can’t find

  1 GAATATATGAAGACCAAGATTGCAGTCCTGCTGGCCTGAACCACGCTATTCTTGCTGTTG
    || | || || || |  || || ||   ||  |  ||| |||||| | | || | ||| |
  1 GAGTGTACGATGAGCCCGAGTGTAGCAGTGAAGATCTGGACCACGGTGTACTCGTTGTCG

 61 GTTACGGAACCGAGAATGGTAAAGACTACTGGATCATTAAGAACTCCTGGGGAGCCAGTT
    | || ||     ||  ||| ||  | |||||| || | ||||||  |||||  |     |
 61 GCTATGGTGTTAAGGGTGGGAAGAAGTACTGGCTCGTCAAGAACAGCTGGGCTGAATCCT

121 GGGGTGAACAAGGTTATTTCAGGCTTGCTCGTGGTAAAAAC
    |||| || ||||| ||  ||    |  | ||||  || |||
121 GGGGAGACCAAGGCTACATCCTTATGTCCCGTGACAACAAC

Nucleotide vs. Protein BLAST
Comparing ADSS from H. sapiens and A. thaliana

        aac cgg gtg acg gtg gtg ctc ggt gcg cag tgg ggc gac gaa ggc
Human:   N   R   V   T   V   V   L   G   A   Q   W   G   D   E   G
         +   +   V   +       V   L   G       Q   W   G   D   E   G
A.th.:   S   Q   V   S   G   V   L   G   C   Q   W   G   D   E   G
        agt caa gta tct ggt gta ctc ggt tgc caa tgg gga gat gaa ggt

BLASTp finds three matching words
BLASTn finds no match, because the two sequences share no exact word of 7 bp (the minimum word size)
Protein searches are generally more sensitive than nucleotide searches.
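The claim about 7 bp words can be checked directly on the two DNA sequences from this slide. This is a small illustrative sketch; the function name `kmers` is hypothetical.

```python
def kmers(seq, k):
    """Set of all overlapping words of length k in seq."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


# DNA sequences from the slide (ADSS, human and A. thaliana)
human = "aaccgggtgacggtggtgctcggtgcgcagtggggcgacgaaggc"
athal = "agtcaagtatctggtgtactcggttgccaatggggagatgaaggt"

shared7 = kmers(human, 7) & kmers(athal, 7)
shared6 = kmers(human, 6) & kmers(athal, 6)

print("shared 7-mers:", sorted(shared7))
print("shared 6-mers:", sorted(shared6))
```

The two sequences share 6-letter words such as ctcggt but nothing of length 7, so even the minimum blastn word size produces no seed, while the translated proteins still share exact 3-letter words.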

Translated BLAST
Particularly useful for nucleotide sequences without protein annotations, such as ESTs or genomic DNA

Program   Query                     Database
blastx    Nucleotide (translated)   Protein
tblastn   Protein                   Nucleotide (translated)
tblastx   Nucleotide (translated)   Nucleotide (translated)

An Alignment BLAST Can Make
Solution: compare protein sequences; BLASTX
BLAST 2 Sequences (blastx) output:
Score = 290 bits (741), Expect = 7e-77
Identities = 147/331 (44%), Positives = 206/331 (61%), Gaps = 8/331 (2%)
Frame = +3

Scoring Systems - Nucleotides
Identity matrix

     A   G   C   T
A   +1  -3  -3  -3
G   -3  +1  -3  -3
C   -3  -3  +1  -3
T   -3  -3  -3  +1

CAGGTAGCAAGCTTGCATGTCA
|| ||||||||||||  |||||
CACGTAGCAAGCTTG-GTGTCA

raw score = 19 - 9 = 10
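The raw score on this slide can be recomputed column by column. This sketch uses the +1/-3 identity matrix from the slide. Charging the single gap column -3, the same as a mismatch, is an assumption made here to reproduce the slide's arithmetic (19 - 9); gap costs in an actual search are set separately.

```python
MATCH, MISMATCH = 1, -3
GAP = -3  # assumption: gap column charged like a mismatch, matching the slide's 19 - 9


def raw_score(top, bottom):
    """Sum per-column scores of two aligned, equal-length strings ('-' marks a gap)."""
    score = 0
    for x, y in zip(top, bottom):
        if x == "-" or y == "-":
            score += GAP
        elif x == y:
            score += MATCH
        else:
            score += MISMATCH
    return score


top = "CAGGTAGCAAGCTTGCATGTCA"
bottom = "CACGTAGCAAGCTTG-GTGTCA"

matches = sum(x == y for x, y in zip(top, bottom))
print(matches, raw_score(top, bottom))
```

The alignment has 19 matching columns and 3 non-matching ones (two mismatches and one gap), giving 19 - 9 = 10.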

Scoring Systems - Proteins
• Position Independent Matrices
  • PAM Matrices (Percent Accepted Mutation)
    • Derived from observation; small dataset of alignments
    • Implicit model of evolution
    • All calculated from PAM1
    • PAM250 widely used
  • BLOSUM Matrices (BLOck SUbstitution Matrices)
    • Derived from observation; large dataset of highly conserved blocks
    • Each matrix derived separately from blocks with a defined percent identity cutoff
    • BLOSUM62 - default matrix for BLAST
• Position Specific Score Matrices (PSSMs)
  • PSI- and RPS-BLAST

BLOSUM62

A  4
R -1  5
N -2  0  6
D -2 -2  1  6
C  0 -3 -3 -3  9
Q -1  1  0  0 -3  5
E -1  0  0  2 -4  2  5
G  0 -2  0 -1 -3 -2 -2  6
H -2  0  1 -1 -3  0  0 -2  8
I -1 -3 -3 -3 -1 -3 -3 -4 -3  4
L -1 -2 -3 -4 -1 -2 -3 -4 -3  2  4
K -1  2  0 -1 -3  1  1 -2 -1 -3 -2  5
M -1 -1 -2 -3 -1  0 -2 -3 -2  1  2 -1  5
F -2 -3 -3 -3 -2 -3 -3 -3 -1  0  0 -3  0  6
P -1 -2 -2 -1 -3 -1 -1 -2 -2 -3 -3 -1 -2 -4  7
S  1 -1  1  0 -1  0  0  0 -1 -2 -2  0 -1 -2 -1  4
T  0 -1  0 -1 -1 -1 -1 -2 -2 -1 -1 -1 -1 -2 -1  1  5
W -3 -3 -4 -4 -2 -2 -3 -2 -2 -3 -2 -3 -1  1 -4 -3 -2 11
Y -2 -2 -2 -3 -2 -1 -2 -3  2 -1 -1 -2 -1  3 -3 -2 -2  2  7
V  0 -3 -3 -3 -1 -2 -2 -3 -3  3  1 -2  1 -1 -2 -2  0 -3 -1  4
X  0 -1 -1 -1 -2 -1 -1 -1 -1 -1 -1 -1 -1 -1 -2  0  0 -2 -1 -1 -1
   A  R  N  D  C  Q  E  G  H  I  L  K  M  F  P  S  T  W  Y  V  X

• Common amino acids have low weights
• Rare amino acids have high weights
• Negative for less likely substitutions
• Positive for more likely substitutions

Gapped Alignments • Gapping provides more biologically realistic alignments • Statistical behavior is not completely understood for gapped alignments • Gapped BLAST parameters must be found by simulations for each matrix • Affine gap score for a gap of length k = -(a + bk) • a = gap open penalty, b = gap extend penalty • A gap of length 1 receives the score -(a + b)

Scores
Simply add the scores for each pair of aligned residues

            V   D   S   -   C   Y
            V   E   T   L   C   F   Total
BLOSUM62   +4  +2  +1 -12  +9  +3       7
PAM30      +7  +2   0 -10 +10  +2      11

Different matrices produce different scores!
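The BLOSUM62 row of this example can be reproduced by adding pair scores and applying the affine gap score -(a + bk). The sketch below includes only the five BLOSUM62 entries this alignment needs. The gap parameters a = 11 and b = 1 are an assumption chosen because they give the -12 shown for the single gap; the function names are hypothetical.

```python
# BLOSUM62 entries needed for this alignment only
PAIR = {("V", "V"): 4, ("D", "E"): 2, ("S", "T"): 1, ("C", "C"): 9, ("Y", "F"): 3}


def pair_score(x, y):
    """Look up a substitution score; the matrix is symmetric."""
    if (x, y) in PAIR:
        return PAIR[(x, y)]
    return PAIR[(y, x)]


def gap_score(k, a=11, b=1):
    """Affine score of one gap of length k: -(a + b*k). a and b are assumed values."""
    return -(a + b * k)


def alignment_score(top, bottom, a=11, b=1):
    """Add pair scores, scoring each run of consecutive gap columns once."""
    score, run = 0, 0
    for x, y in zip(top, bottom):
        if x == "-" or y == "-":
            run += 1                 # extend the current gap
            continue
        if run:                      # a gap just ended: score it as a whole
            score += gap_score(run, a, b)
            run = 0
        score += pair_score(x, y)
    if run:                          # gap reaching the end of the alignment
        score += gap_score(run, a, b)
    return score


total = alignment_score("VDS-CY", "VETLCF")
print(total)
```

Because the opening cost a is charged once per gap, one gap of length 3 scores -14 under these parameters, whereas three separate single-residue gaps would score -36.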
