
Genome of the week

Explore the 4.2 Mb genome of Bacillus subtilis, a Gram-positive soil bacterium with industrial significance. Learn about genetic complexity, molecular functions, and gene annotation challenges.

tammyb

Presentation Transcript


  1. Genome of the week: Bacillus subtilis • Gram-positive soil bacterium • Genetically tractable, well-studied • Developmental pathways (sporulation, genetic competence) • Industrial and agricultural importance • 4.2 Mb genome (sequence completed 1997) • Close relative of Bacillus anthracis (anthrax)

  2. B. subtilis genome features • 4,106 protein-coding genes • 10 rRNA operons • Nearly 50% of the genome consists of paralogous genes. • 77 ABC transporter binding proteins • 10 phage-like regions - horizontal transfer. Low-GC regions in the genome. • 18 sigma factors - initiate transcription. • 34 two-component regulatory systems.

  3. Annotating genes • How to assign preliminary functions to genes. • Automated programs. • Similarity searches • BLAST and PSI-BLAST • COGs, Pfam, CDD, other databases • Only 50-75% of genes will have a predicted function. Some have no known homologs in any other genome. • Functional characterization (individual genes) • Gene knockouts • Overexpression

  4. In many cases computer annotation will only be able to predict function - NOT assign function! • The biological function of many genes has not been determined, even in model systems. • As genomic characterization of gene function continues, more and more computer-generated annotations will be correct.

  5. Molecular function - activity of a protein at the molecular level. • Examples would be ATPase, metal binding, converting glucose-6-phosphate to fructose-6-phosphate. • Biological function - cellular role of the protein. • Examples would be translation initiation, DNA replication, glycolysis.

  6. Homologs, orthologs, and paralogs. • Homologous genes are genes that share a common evolutionary ancestor. • Orthologs are genes found in different organisms that arose from a common ancestor and were separated by speciation. • Paralogs are genes found in the same organism that arose from a common ancestor by duplication. The duplication could have occurred in the species or earlier.

  7. Using BLAST to predict gene function. • BLAST predicted protein sequence against the non-redundant database. • Determine best hits • Automated annotation programs will often assign the best hit function to the gene being searched. • Must manually confirm automated annotations. (Final project).

  8. Basic Local Alignment Search Tool • Calculates similarity for biological sequences • Finds best local alignments • Heuristic approach based on Smith-Waterman algorithm • Searches for matching “words” rather than individual residues • Uses statistical theory to determine if a match might have occurred by chance NCBI Field Guide

  9. Nucleotide Words • Query: GTACTGGACATGGACCCTACAGGAA • Word size = 11 • Words: GTACTGGACAT, TACTGGACATG, ACTGGACATGG, CTGGACATGGA, TGGACATGGAC, GGACATGGACC, GACATGGACCC, ACATGGACCCT, ... • Minimum word size = 7; blastn default = 11; megablast default = 28 • Make a lookup table of words
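
The word list and lookup table on this slide can be sketched in a few lines of Python. This is a minimal illustration, not BLAST's actual implementation; the function names `query_words` and `build_lookup` are made up for this example, and offsets are 0-based.

```python
from collections import defaultdict

def query_words(query: str, w: int) -> list[str]:
    """Return every overlapping word of length w, in left-to-right order."""
    return [query[i:i + w] for i in range(len(query) - w + 1)]

def build_lookup(query: str, w: int) -> dict[str, list[int]]:
    """Map each word to the 0-based query offsets at which it starts."""
    table = defaultdict(list)
    for offset, word in enumerate(query_words(query, w)):
        table[word].append(offset)
    return dict(table)

QUERY = "GTACTGGACATGGACCCTACAGGAA"  # query sequence from the slide
WORDS = query_words(QUERY, 11)       # blastn default word size
LOOKUP = build_lookup(QUERY, 11)

if __name__ == "__main__":
    for word in WORDS[:8]:
        print(word, LOOKUP[word])
```

A query of length L yields L - w + 1 overlapping words, so the 25-base query gives 15 words of size 11.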

  10. Protein Words • Query: GTQITVEDLFYNIATRRKALKN • Word size = 3 (word size can be 2 or 3; default = 3) • Protein words: GTQ, TQI, QIT, ITV, TVE, VED, EDL, DLF, ... • Neighborhood words (for ITV): LTV, MTV, ISV, LSV, etc. • Make a lookup table of words
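
A rough sketch of how neighborhood words could be generated: every candidate word is scored against the query word with a substitution matrix, and those at or above a threshold are kept. The scores are the BLOSUM62 values from slide 14, restricted to six residues to keep the example short; the restricted alphabet, the threshold of 11, and all function names are assumptions for illustration only.

```python
from itertools import product

# BLOSUM62 scores from slide 14 for the residues I, L, M, S, T, V only.
# Each pair is stored once; pair_score() looks it up in either order.
PAIRS = {
    ("I", "I"): 4, ("L", "I"): 2, ("L", "L"): 4, ("M", "I"): 1,
    ("M", "L"): 2, ("M", "M"): 5, ("S", "I"): -2, ("S", "L"): -2,
    ("S", "M"): -1, ("S", "S"): 4, ("T", "I"): -1, ("T", "L"): -1,
    ("T", "M"): -1, ("T", "S"): 1, ("T", "T"): 5, ("V", "I"): 3,
    ("V", "L"): 1, ("V", "M"): 1, ("V", "S"): -2, ("V", "T"): 0,
    ("V", "V"): 4,
}
ALPHABET = "ILMSTV"  # assumption: reduced alphabet, not all 20 residues

def pair_score(a: str, b: str) -> int:
    """Symmetric lookup of a substitution score."""
    if (a, b) in PAIRS:
        return PAIRS[(a, b)]
    return PAIRS[(b, a)]

def word_score(w1: str, w2: str) -> int:
    """Sum of pairwise scores for two words of equal length."""
    return sum(pair_score(a, b) for a, b in zip(w1, w2))

def protein_words(query: str, w: int = 3) -> list[str]:
    """Overlapping words of length w in the query."""
    return [query[i:i + w] for i in range(len(query) - w + 1)]

def neighborhood(word: str, threshold: int) -> list[str]:
    """All words over ALPHABET scoring >= threshold against `word`."""
    candidates = ("".join(c) for c in product(ALPHABET, repeat=len(word)))
    return sorted(c for c in candidates if word_score(word, c) >= threshold)

if __name__ == "__main__":
    print(protein_words("GTQITVEDLFYNIATRRKALKN")[:8])
    for candidate in ("ITV", "LTV", "MTV", "ISV", "LSV"):
        print(candidate, word_score("ITV", candidate))
    print(neighborhood("ITV", 11))
```

Lowering the threshold admits more neighbors (more sensitivity, more seeds to extend); raising it does the opposite.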

  11. Minimum Requirements for a Hit • Nucleotide BLAST requires one exact match. Example: the word CATGCTTAATT matches ATCGCCATGCTTAATTGGGCTT exactly (one match). • Protein BLAST requires two neighboring matches within 40 aa. Example: GTQITVEDLFYNI and SEIYYN (neighborhood words, two matches).
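
The nucleotide case (one exact word match) can be sketched as a scan of the subject against a lookup table of query words. The slide shows only the matching word, so the word itself is used here as a stand-in query; `find_word_hits` is a hypothetical helper, and offsets are 0-based.

```python
def find_word_hits(query: str, subject: str, w: int) -> list[tuple[int, int]]:
    """Return (query_offset, subject_offset) for every exact word match."""
    # Build the lookup table of query words.
    lookup: dict[str, list[int]] = {}
    for q_off in range(len(query) - w + 1):
        lookup.setdefault(query[q_off:q_off + w], []).append(q_off)

    # Slide a window of width w along the subject and look each word up.
    hits = []
    for s_off in range(len(subject) - w + 1):
        for q_off in lookup.get(subject[s_off:s_off + w], []):
            hits.append((q_off, s_off))
    return hits

SUBJECT = "ATCGCCATGCTTAATTGGGCTT"  # subject sequence from the slide
WORD = "CATGCTTAATT"                # matching word from the slide

if __name__ == "__main__":
    print(find_word_hits(WORD, SUBJECT, 11))
```

Each hit is only a seed; in BLAST it would then be extended in both directions to produce a scored alignment.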

  12. Scoring Systems - Nucleotides
      Identity matrix:
             A    G    C    T
        A   +1   -3   -3   -3
        G   -3   +1   -3   -3
        C   -3   -3   +1   -3
        T   -3   -3   -3   +1
      Example alignment:
        CAGGTAGCAAGCTTGCATGTCA
        || ||||||||||||  |||||
        CACGTAGCAAGCTTG-GTGTCA
      raw score = 19 - 9 = 10
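
The raw score on this slide can be reproduced with a short function. One assumption is needed: the slide's arithmetic (19 matches, total penalty 9) only works out if the single gap position costs 3, the same as a mismatch, so that value is used as the default here. Real BLAST gap costs are configurable and usually distinguish gap opening from extension.

```python
def raw_score(top: str, bottom: str, match: int = 1,
              mismatch: int = -3, gap: int = -3) -> int:
    """Score a gapped pairwise alignment column by column.

    The gap value of -3 is an assumption chosen to reproduce the slide's
    '19 - 9 = 10'; it is not a documented BLAST default.
    """
    if len(top) != len(bottom):
        raise ValueError("aligned sequences must have equal length")
    score = 0
    for a, b in zip(top, bottom):
        if a == "-" or b == "-":
            score += gap
        elif a == b:
            score += match
        else:
            score += mismatch
    return score

TOP = "CAGGTAGCAAGCTTGCATGTCA"
BOTTOM = "CACGTAGCAAGCTTG-GTGTCA"

if __name__ == "__main__":
    print(raw_score(TOP, BOTTOM))
```

The alignment has 19 identical columns, 2 mismatches, and 1 gap column: 19 - 6 - 3 = 10.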

  13. Scoring Systems - Proteins
      • Position Independent Matrices
        • PAM Matrices (Percent Accepted Mutation)
          • Derived from observation; small dataset of alignments
          • Implicit model of evolution
          • All calculated from PAM1
          • PAM250 widely used
        • BLOSUM Matrices (BLOck SUbstitution Matrices)
          • Derived from observation; large dataset of highly conserved blocks
          • Each matrix derived separately from blocks with a defined percent identity cutoff
          • BLOSUM62 - default matrix for BLAST
      • Position Specific Score Matrices (PSSMs)
        • PSI- and RPS-BLAST

  14. BLOSUM62
      A   4
      R  -1   5
      N  -2   0   6
      D  -2  -2   1   6
      C   0  -3  -3  -3   9
      Q  -1   1   0   0  -3   5
      E  -1   0   0   2  -4   2   5
      G   0  -2   0  -1  -3  -2  -2   6
      H  -2   0   1  -1  -3   0   0  -2   8
      I  -1  -3  -3  -3  -1  -3  -3  -4  -3   4
      L  -1  -2  -3  -4  -1  -2  -3  -4  -3   2   4
      K  -1   2   0  -1  -3   1   1  -2  -1  -3  -2   5
      M  -1  -1  -2  -3  -1   0  -2  -3  -2   1   2  -1   5
      F  -2  -3  -3  -3  -2  -3  -3  -3  -1   0   0  -3   0   6
      P  -1  -2  -2  -1  -3  -1  -1  -2  -2  -3  -3  -1  -2  -4   7
      S   1  -1   1   0  -1   0   0   0  -1  -2  -2   0  -1  -2  -1   4
      T   0  -1   0  -1  -1  -1  -1  -2  -2  -1  -1  -1  -1  -2  -1   1   5
      W  -3  -3  -4  -4  -2  -2  -3  -2  -2  -3  -2  -3  -1   1  -4  -3  -2  11
      Y  -2  -2  -2  -3  -2  -1  -2  -3   2  -1  -1  -2  -1   3  -3  -2  -2   2   7
      V   0  -3  -3  -3  -1  -2  -2  -3  -3   3   1  -2   1  -1  -2  -2   0  -3  -1   4
      X   0  -1  -1  -1  -2  -1  -1  -1  -1  -1  -1  -1  -1  -1  -2   0   0  -2  -1  -1  -1
          A   R   N   D   C   Q   E   G   H   I   L   K   M   F   P   S   T   W   Y   V   X
      • Common amino acids have low weights • Rare amino acids have high weights • Negative for less likely substitutions • Positive for more likely substitutions

  15. Scores: simply add the scores for each pair of aligned residues
                   V    D    S    -    C    Y
                   V    E    T    L    C    F
      BLOSUM62    +4   +2   +1  -12   +9   +3    = 7
      PAM30       +7   +2    0  -10  +10   +2    = 11
      Different matrices produce different scores!
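
The two totals on this slide can be checked by summing the per-column values. The sketch below hard-codes only the five residue pairs and the gap costs shown on the slide (-12 under BLOSUM62, -10 under PAM30); it is not a full substitution matrix, and `alignment_score` is a name chosen for this example.

```python
# Per-pair scores copied from the slide; only the pairs in this alignment.
BLOSUM62_PAIRS = {("V", "V"): 4, ("D", "E"): 2, ("S", "T"): 1,
                  ("C", "C"): 9, ("Y", "F"): 3}
PAM30_PAIRS = {("V", "V"): 7, ("D", "E"): 2, ("S", "T"): 0,
               ("C", "C"): 10, ("Y", "F"): 2}

def alignment_score(top: str, bottom: str,
                    pairs: dict[tuple[str, str], int], gap: int) -> int:
    """Add up substitution scores, charging `gap` for each gap column."""
    total = 0
    for a, b in zip(top, bottom):
        if a == "-" or b == "-":
            total += gap
        elif (a, b) in pairs:
            total += pairs[(a, b)]
        else:
            total += pairs[(b, a)]  # matrices are symmetric
    return total

TOP = "VDS-CY"
BOTTOM = "VETLCF"

if __name__ == "__main__":
    print("BLOSUM62:", alignment_score(TOP, BOTTOM, BLOSUM62_PAIRS, gap=-12))
    print("PAM30:   ", alignment_score(TOP, BOTTOM, PAM30_PAIRS, gap=-10))
```

Because raw scores depend on the matrix and gap costs, they are only comparable between searches that used the same scoring system.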

  16. Local Alignment Statistics • High scores of local alignments between two random sequences follow the Extreme Value Distribution. • Expect value E = number of database hits you expect to find by chance; it depends on the size of the database and on your score. • At low E values, E approximates a P value. • [Figure: distribution of alignment scores; x-axis: Score, y-axis: Alignments; the tail of the curve marks the expected number of random hits]
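
As a hedged illustration of how E depends on database size and score, the sketch below uses the standard Karlin-Altschul relation expressed with bit scores, E = m * n * 2^(-S'), where m is the query length, n the database length, and S' the bit score. It treats the search space as the plain product m * n; BLAST itself applies effective-length corrections, so its reported values will differ. The function names are made up for this example.

```python
import math

def expect_from_bits(bit_score: float, query_len: int, db_len: int) -> float:
    """E = m * n * 2**(-S') with an uncorrected search space (assumption)."""
    return query_len * db_len * 2.0 ** (-bit_score)

def p_from_e(e_value: float) -> float:
    """P(at least one chance hit) = 1 - exp(-E), computed stably."""
    return -math.expm1(-e_value)

if __name__ == "__main__":
    # Hypothetical sizes: a 300-residue query against 1e9 database residues.
    for bits in (30, 40, 50, 60):
        e = expect_from_bits(bits, 300, 10**9)
        print(f"{bits} bits  E = {e:.3g}  P = {p_from_e(e):.3g}")
```

Doubling the database doubles E for the same score, and each extra bit of score halves it. For small E, 1 - exp(-E) is approximately E, which is why E approximates a P value at low values.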

  17. BLAST Databases for Proteins • nr (non-redundant protein sequences): GenBank CDS translations; NP_ RefSeqs; PIR, Swiss-Prot, PRF; PDB (sequences from structures) • swissprot • pat - patents • pdb - sequences with 3D structures • month - sequences updated within 30 days

  18. Assessment of BLAST output • What is the level of identity and similarity of the best hits? • More identity - more likely the proteins have similar functions. • Does the area of similarity occur over the entire protein? Or just part of the protein? (fig. 2.19) • Often you will find hits to only part of your protein, a GTP-binding domain for example. • Have any of the best hits been characterized experimentally? • With so many microbial genomes sequenced, chances are you will have to search extensively to find a hit that has been characterized experimentally.

  19. BLAST Formatting Page

  20. BLAST Output: Graphic Overview (domains labeled in the figure: PX, SH3)

  21. BLAST Output: Descriptions • Example E value: 4 × 10^-68 • Links to Entrez • Default E value cutoff = 10

  22. TaxBLAST: Taxonomy Reports

  23. BLAST Output: Alignments
      >gi|12643956|sp|Q9Y5X1|SNX9_HUMAN Sorting nexin 9 (SH3 and PX domain-containing protein 1) (SDP1 protein)
      Length = 595
      Score = 255 bits (652), Expect = 4e-68
      Identities = 140/322 (43%), Positives = 185/322 (56%), Gaps = 7/322 (2%)

      Query: 221 SSATVSRNLNRFSTFVKSGGEAFVLGEASGFVKDGDKLCVVLGPYGPEWQENPYPFQCTI 280
                 SS+++   LN+F  F K G E ++L  A    K  +K+ +++G YGP W      F C +
      Sbjct: 197 SSSSMKIPLNKFPGFAKPGTEQYLL--AKQLAKPKEKIPIIVGDYGPMWVYPTSTFDCVV 254

      Query: 281 DDPTKQTKFKGMKSYISYKLVPTHTQVPVHRRYKHFDWLYARLAEKF-PVISVPHLPEKQ 339
                  DP K +K  G+KSYI Y+L PT+T   V+ RYKHFDWLY RL  KF   I +P LP+KQ
      Sbjct: 255 ADPRKGSKMYGLKSYIEYQLTPTNTNRSVNHRYKHFDWLYERLLVKFGSAIPIPSLPDKQ 314

      Query: 340 ATGRFEEDFISKRRKGLIWWMNHMASHPVLAQCDVFQHFLTCPSSTDEKAWKQGKRKAEK 399
                  TGRFEE+FI  R + L  WM  M  HPV+++ +VFQ FL   +  DEK WK GKRKAE+
      Sbjct: 315 VTGRFEEEFIKMRMERLQAWMTRMCRHPVISESEVFQQFL---NFRDEKEWKTGKRKAER 371
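
When checking many hits by hand, it can help to pull the summary statistics out of the text report. The sketch below parses the statistics shown on this slide with regular expressions; the patterns are written against this one example and are an assumption, since other BLAST versions and output formats lay these fields out differently.

```python
import re

# Statistics text copied from the slide.
STATS = ("Score = 255 bits (652), Expect = 4e-68 "
         "Identities = 140/322 (43%), Positives = 185/322 (56%), "
         "Gaps = 7/322 (2%)")

def parse_stats(text: str) -> dict:
    """Extract bit score, raw score, E value, and count fractions."""
    score = re.search(r"Score\s*=\s*([\d.]+)\s*bits\s*\((\d+)\)", text)
    expect = re.search(r"Expect\s*=\s*([\d.eE+-]+)", text)
    if score is None or expect is None:
        raise ValueError("statistics line not recognized")
    result = {
        "bits": float(score.group(1)),
        "raw": int(score.group(2)),
        "expect": float(expect.group(1)),
    }
    # Identities / Positives / Gaps are reported as count/alignment_length.
    for m in re.finditer(r"(Identities|Positives|Gaps)\s*=\s*(\d+)/(\d+)", text):
        result[m.group(1).lower()] = (int(m.group(2)), int(m.group(3)))
    return result

if __name__ == "__main__":
    stats = parse_stats(STATS)
    ident, length = stats["identities"]
    print(stats)
    print(f"identity over aligned region: {100 * ident / length:.1f}%")
```

Note that the identity fraction covers only the aligned region (322 columns here), not the full 595-residue subject, which is why coverage has to be assessed separately.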

  24. Blink - Protein BLAST Alignments • Lists only 200 hits • List is nonredundant

  25. Nucleotide vs. Protein BLAST: comparing ADSS from H. sapiens and A. thaliana
               aaccgggtgacggtggtgctcggtgcgcagtggggcgacgaaggc
      Human:    N  R  V  T  V  V  L  G  A  Q  W  G  D  E  G
                +  +  V  +     V  L  G     Q  W  G  D  E  G
      A.th.:    S  Q  V  S  G  V  L  G  C  Q  W  G  D  E  G
               agtcaagtatctggtgtactcggttgccaatggggagatgaaggt
      • BLASTp finds three matching words • BLASTn finds no match, because the sequences share no 7 bp words • Protein searches are generally more sensitive than nucleotide searches.
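
The contrast on this slide can be checked directly: translate both DNA fragments with the standard genetic code, then compare the exact words the two sequences share at the nucleotide level (7-mers) and at the protein level (3-mers). This is an illustrative sketch; it counts exact shared words only and does not model neighborhood words or the two-hit rule, so its protein count need not equal the slide's "three matching words".

```python
BASES = "TCAG"
# Standard genetic code, codons ordered TTT, TTC, TTA, TTG, TCT, ...
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    a + b + c: AMINO[16 * i + 4 * j + k]
    for i, a in enumerate(BASES)
    for j, b in enumerate(BASES)
    for k, c in enumerate(BASES)
}

def translate(dna: str) -> str:
    """Translate frame 1 of a DNA string; trailing partial codons are dropped."""
    dna = dna.upper()
    usable = len(dna) - len(dna) % 3
    return "".join(CODON_TABLE[dna[i:i + 3]] for i in range(0, usable, 3))

def words(seq: str, w: int) -> set[str]:
    """Set of all overlapping words of length w."""
    return {seq[i:i + w] for i in range(len(seq) - w + 1)}

HUMAN_DNA = "aaccgggtgacggtggtgctcggtgcgcagtggggcgacgaaggc"
ATH_DNA = "agtcaagtatctggtgtactcggttgccaatggggagatgaaggt"

SHARED_DNA = words(HUMAN_DNA, 7) & words(ATH_DNA, 7)
SHARED_PROTEIN = words(translate(HUMAN_DNA), 3) & words(translate(ATH_DNA), 3)

if __name__ == "__main__":
    print(translate(HUMAN_DNA))
    print(translate(ATH_DNA))
    print("shared 7 bp words:", sorted(SHARED_DNA))
    print("shared 3 aa words:", sorted(SHARED_PROTEIN))
```

Synonymous codon changes break up every 7 bp word, while the encoded protein segment stays conserved enough to share several 3-residue words.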

  26. Translated BLAST (N = nucleotide, P = protein)
      Particularly useful for nucleotide sequences without protein annotations, such as ESTs or genomic DNA.
      Program    Query    Database
      blastx     N        P
      tblastn    P        N
      tblastx    N        N
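
Translated searches compare sequences at the protein level by translating the nucleotide side in all six reading frames (three per strand). A minimal sketch of that six-frame translation is given here; it uses the standard genetic code only, and the frame labels "+1" to "-3" are a naming choice for this example.

```python
BASES = "TCAG"
# Standard genetic code, codons ordered TTT, TTC, TTA, TTG, TCT, ...
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    a + b + c: AMINO[16 * i + 4 * j + k]
    for i, a in enumerate(BASES)
    for j, b in enumerate(BASES)
    for k, c in enumerate(BASES)
}

def translate(dna: str) -> str:
    """Translate from the first base; a trailing partial codon is dropped."""
    usable = len(dna) - len(dna) % 3
    return "".join(CODON_TABLE[dna[i:i + 3]] for i in range(0, usable, 3))

def reverse_complement(dna: str) -> str:
    """Reverse complement of an unambiguous DNA string (A, C, G, T only)."""
    return dna.upper().translate(str.maketrans("ACGT", "TGCA"))[::-1]

def six_frames(dna: str) -> dict[str, str]:
    """Return the six conceptual translations keyed by frame label."""
    frames = {}
    for strand, seq in (("+", dna.upper()), ("-", reverse_complement(dna))):
        for offset in range(3):
            frames[f"{strand}{offset + 1}"] = translate(seq[offset:])
    return frames

if __name__ == "__main__":
    for label, protein in six_frames("ATGGCCATTGTAATGGGCCGCTGA").items():
        print(label, protein)
```

Stop codons appear as `*`; frames with many stops are unlikely to be the true coding frame.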

  27. Linking Protein Sequence, Structure, and Function • Protein sequences (Protein) • CDD: conserved functional domains in proteins, represented by a PSSM (Domains) • 3D Domains • Tools: PSI-BLAST, RPS-BLAST, CDART

  28. Position Specific Substitution Rates • Weakly conserved serine • Active site serine

  29. Position Specific Score Matrix (PSSM)
                 A   R   N   D   C   Q   E   G   H   I   L   K   M   F   P   S   T   W   Y   V
      206 D      0  -2   0   2  -4   2   4  -4  -3  -5  -4   0  -2  -6   1   0  -1  -6  -4  -1
      207 G     -2  -1   0  -2  -4  -3  -3   6  -4  -5  -5   0  -2  -3  -2  -2  -1   0  -6  -5
      208 V     -1   1  -3  -3  -5  -1  -2   6  -1  -4  -5   1  -5  -6  -4   0  -2  -6  -4  -2
      209 I     -3   3  -3  -4  -6   0  -1  -4  -1   2  -4   6  -2  -5  -5  -3   0  -1  -4   0
      210 S     -2  -5   0   8  -5  -3  -2  -1  -4  -7  -6  -4  -6  -7  -5   1  -3  -7  -5  -6
      211 S      4  -4  -4  -4  -4  -1  -4  -2  -3  -3  -5  -4  -4  -5  -1   4   3  -6  -5  -3
      212 C     -4  -7  -6  -7  12  -7  -7  -5  -6  -5  -5  -7  -5   0  -7  -4  -4  -5   0  -4
      213 N     -2   0   2  -1  -6   7   0  -2   0  -6  -4   2   0  -2  -5  -1  -3  -3  -4  -3
      214 G     -2  -3  -3  -4  -4  -4  -5   7  -4  -7  -7  -5  -4  -4  -6  -3  -5  -6  -6  -6
      215 D     -5  -5  -2   9  -7  -4  -1  -5  -5  -7  -7  -4  -7  -7  -5  -4  -4  -8  -7  -7
      216 S     -2  -4  -2  -4  -4  -3  -3  -3  -4  -6  -6  -3  -5  -6  -4   7  -2  -6  -5  -5
      217 G     -3  -6  -4  -5  -6  -5  -6   8  -6  -8  -7  -5  -6  -7  -6  -4  -5  -6  -7  -7
      218 G     -3  -6  -4  -5  -6  -5  -6   8  -6  -7  -7  -5  -6  -7  -6  -2  -4  -6  -7  -7
      219 P     -2  -6  -6  -5  -6  -5  -5  -6  -6  -6  -7  -4  -6  -7   9  -4  -4  -7  -7  -6
      220 L     -4  -6  -7  -7  -5  -5  -6  -7   0  -1   6  -6   1   0  -6  -6  -5  -5  -4   0
      221 N     -1  -6   0  -6  -4  -4  -6  -6  -1   3   0  -5   4  -3  -6  -2  -1  -6  -1   6
      222 C      0  -4  -5  -5  10  -2  -5  -5   1  -1  -1  -5   0  -1  -4  -1   0  -5   0   0
      223 Q      0   1   4   2  -5   2   0   0   0  -4  -2   1   0   0   0  -1  -1  -3  -3  -4
      224 A     -1  -1   1   3  -4  -1   1   4  -3  -4  -3  -1  -2  -2  -3   0  -2  -2  -2  -3
      • Serine is scored differently in these two positions • Active site nucleophile
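
The difference from a position-independent matrix can be made concrete by looking up the same residue in different PSSM rows. The sketch below copies rows 211 and 214 to 218 from the slide; treating 211 and 216 as the two serine positions the slide highlights is an assumption, and the helper names are made up for this example.

```python
COLUMNS = "ARNDCQEGHILKMFPSTWYV"  # column order of the PSSM on the slide

# Selected rows copied from the slide, keyed by query position.
PSSM = {
    211: [4, -4, -4, -4, -4, -1, -4, -2, -3, -3, -5, -4, -4, -5, -1, 4, 3, -6, -5, -3],
    214: [-2, -3, -3, -4, -4, -4, -5, 7, -4, -7, -7, -5, -4, -4, -6, -3, -5, -6, -6, -6],
    215: [-5, -5, -2, 9, -7, -4, -1, -5, -5, -7, -7, -4, -7, -7, -5, -4, -4, -8, -7, -7],
    216: [-2, -4, -2, -4, -4, -3, -3, -3, -4, -6, -6, -3, -5, -6, -4, 7, -2, -6, -5, -5],
    217: [-3, -6, -4, -5, -6, -5, -6, 8, -6, -8, -7, -5, -6, -7, -6, -4, -5, -6, -7, -7],
    218: [-3, -6, -4, -5, -6, -5, -6, 8, -6, -7, -7, -5, -6, -7, -6, -2, -4, -6, -7, -7],
}

def pssm_score(position: int, residue: str) -> int:
    """Score for placing `residue` at the given query position."""
    return PSSM[position][COLUMNS.index(residue)]

def segment_score(start: int, peptide: str) -> int:
    """Sum of PSSM scores for a peptide aligned without gaps from `start`."""
    return sum(pssm_score(start + i, aa) for i, aa in enumerate(peptide))

if __name__ == "__main__":
    for pos in (211, 216):
        print(pos, {aa: pssm_score(pos, aa) for aa in "SAT"})
    print("GDSGG at 214:", segment_score(214, "GDSGG"))
    print("GDAGG at 214:", segment_score(214, "GDAGG"))
```

At position 211 serine, alanine, and threonine all score well, whereas at position 216 only serine does; BLOSUM62 would give serine the same score at both positions.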

  30. PSI-BLAST: Create your own PSSM • Confirming relationships of purine nucleotide metabolism proteins • [Diagram: the query is first aligned using BLOSUM62; the resulting alignment yields a PSSM, which is used for the next alignment]

  31. PSI-BLAST query (e value cutoff for PSSM)
      >gi|113340|sp|P03958|ADA_MOUSE ADENOSINE DEAMINASE (ADENOSINE AMINOH
      MAQTPAFNKPKVELHVHLDGAIKPETILYFGKKRGIALPADTVEELRNIIGMDKPLSLPGFLAKFDYY
      VIAGCREAIKRIAYEFVEMKAKEGVVYVEVRYSPHLLANSKVDPMPWNQTEGDVTPDDVVDLVNQGLQ
      EQAFGIKVRSILCCMRHQPSWSLEVLELCKKYNQKTVVAMDLAGDETIEGSSLFPGHVEAYEGAVKNG
      RTVHAGEVGSPEVVREAVDILKTERVGHGYHTIEDEALYNRLLKENMHFEVCPWSSYLTGAWDPKTTH
      VRFKNDKANYSLNTDDPLIFKSTLDTDYQMTKKDMGFTEEEFKRLNINAAKSSFLPEEEKKELLERLY

  32. PSI Results: Initial BLAST Run

  33. First PSSM Search • Other purine nucleotide metabolizing enzymes not found by ordinary BLAST

  34. Third PSSM Search: Convergence • Just below threshold, another nucleotide metabolism enzyme

  35. Entrez Domains (CDD): A Database of Position Specific Score Matrices (16,482 records)
      • NCBI Curated Alignments (CDD): 2%
      • SMART: 4% (EMBL; HMM based models, originally concentrating on eukaryotic signaling domains, now expanding)
      • LOAD: 0.3% (NCBI; Library of Ancient Domains)
      • Pfam: 35% (Sanger Center; Pfam-A seeds: HMM based models representing a wide variety of functional domains derived from SWISS-PROT)
      • KOG: 29% (NCBI; Eukaryotic COGs)
      • COG: 30% (NCBI; BLAST based alignments derived from complete proteomes of unicellular organisms)
