GENERAL STUFF
subject: Genome-based Functional Annotation (bacteria)
workload: 14 hrs
- 2 hrs lecture
- 12 hrs assignment (in 4 parts, so on average 3 hrs per part; not ready yet)
hand in: rtf-file, pdf-file or ppt-file before 8 November (later: -1 point per day)
Christof Francke (Post-Doc/Scientist; TI Food and Nutrition)
Genome sequence annotation: From DNA to function
Bioinformatics Seminar, Nijmegen, 16-10-2007
Christof Francke (Jos Boekhorst / Michiel Wels)
Promised you a miracle promises, promises
Answering biological questions
Why does Bacillus anthracis kill humans? (anthrax; "miltvuur" in Dutch)
We have the genomes, so now we know...?
When we have the genome sequenced, what do we know then, and what can we do then?
Inventory:
- predict functionality of encoded proteins
- defects in genes (disease)
- lineage
- ...
The quest for an appropriate translation of sequence to knowledge
DNA sequencing (assembly) -> identifying genes (Part I) -> protein function prediction -> function reconstruction -> modeling biology
Bacterial Genomics in Nijmegen
Biological questions in the interest of the Dutch food industry
How can we improve the cell as a factory?
- produce compounds
- improve taste
How can we prevent spoilage?
- spores, biofilms, fungi
How can we improve health?
- interaction between bacteria and host (probiotics)
The organization of genetic information in bacteria
Most Open Reading Frames are preceded by regulatory elements (cis-acting elements).
[Diagram: a promoter followed by an ORF on the example sequence below; RNA polymerase transcribes the ORF into mRNA; A (+) and R (-) mark activation and repression at the promoter]
AACGTTGACTGACGTGTCACGTCCCGTATATCGATGTCGTAGCTGATGGCGCGAAATCGATCGGTCGATATAGCGGCCGGATATCGCGATAGC
RNA polymerase binding is affected by regulatory proteins (trans-acting elements; Activation, Repression).
The organization of genetic information in bacteria
Operon: Gene 1, Gene 2 and Gene 3 are transcribed into one mRNA; each gene has its own translation start and yields its own protein (Protein 1, Protein 2, Protein 3).
Regulon: multiple operons regulated by the same transcription factor.
Whole genome shotgun sequencing Fraser et al, Nature 2000 406: 799-803.
I) The sequencing and assembly process
Wet lab: raw data production (4 x ABI 3700 sequencer, >1.5 million nucleotides per day)
-> data transfer ->
Bio-informatics: genome assembly, automated genome annotation, in-house database (>5000 BLASTs per day)
Genome assembly: initially there are a lot of gaps
Methods for mapping contigs Figure 3 Sources of linking information between contigs. (A) overlaps, (B) clone mates, (C) alignments to reference genome, (D) alignments to physical maps, (E) conservation of gene synteny.
The first Dutch bacterial genome-sequence (2003) Proc Natl Acad Sci USA 100,1990
New technology: 454 sequencing
Advantage: relatively fast, reliable and no sequence preference
Disadvantage: short reads, difficult assembly
Nowadays most sequencing efforts are hybrid
Identifying genes AGCGGTGTCGATCGGCGCTATAGCGCATGCGTATAGCGTATATCGATGTCGTAGCTGATGGCGCGAAATCGATCGGTCGATATAGCGGCCGGATATCGCGATATGCTATAGC
The identification of Open Reading Frames AGCGGTGTCGATCGGCGCTATAGCGCATGCGTATAGCGTATATCGATGTCGTAGCTGATGGCGCGAAATCGATCGGTCGATATAGCGGCCGGATATCGCGATATGCTATAGC TGTCGATCGGCGCTATAGCGCATGCGTATAGCGTATATCGATGTCGTAGCTGATGGCGCGAAATCGATCGGTCGATATAGCGGCCGGATATCGCATATGCTATAGCACGTTTG Different visualization: look at possible reading frames
Coding sequences are characterized by: a) the lack of stop codons
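The "lack of stop codons" criterion can be sketched as a scan over all six reading frames that reports long stop-free stretches. This is only a minimal illustration, not how the gene finders named later in the talk work; the function names and the `min_codons` cut-off are assumptions made for the example.

```python
# Minimal six-frame scan for stop-free stretches (illustrative sketch only).
STOP_CODONS = {"TAA", "TAG", "TGA"}
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    """Return the reverse complement of a DNA string (A, C, G, T only)."""
    return seq.translate(COMPLEMENT)[::-1]


def stop_free_stretches(seq, min_codons=10):
    """Yield (strand, frame, start, end) for every stop-free stretch of at
    least `min_codons` codons. `min_codons` is an arbitrary, assumed cut-off.
    start/end are 0-based offsets on the strand that was scanned."""
    seq = seq.upper()
    for strand, s in (("+", seq), ("-", reverse_complement(seq))):
        for frame in range(3):
            run_start = frame
            # Walk over the complete codons of this frame.
            for i in range(frame, len(s) - 2, 3):
                if s[i:i + 3] in STOP_CODONS:
                    if (i - run_start) // 3 >= min_codons:
                        yield strand, frame, run_start, i
                    run_start = i + 3
            # Stretch that runs up to the last complete codon of the frame.
            end = frame + ((len(s) - frame) // 3) * 3
            if (end - run_start) // 3 >= min_codons:
                yield strand, frame, run_start, end
```

Raising `min_codons` removes many short, spurious stretches, at the price of missing genuinely small genes.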
Coding sequences are characterized by: b) codon usage
          Leu : Ala : Trp
random     6  :  4  :  1
coding     7  :  7  :  1
In addition: codon bias!
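The "random" ratio of 6 : 4 : 1 equals the number of codons that encode Leu, Ala and Trp in the standard genetic code. A small sketch, assuming the standard code and a hypothetical helper name, counts how often the three amino acids are encoded in a given reading frame so that a stretch can be compared with that expectation.

```python
from collections import Counter

# Codons for the three amino acids on the slide (standard genetic code).
CODON_TO_AA = {
    "TTA": "Leu", "TTG": "Leu", "CTT": "Leu",
    "CTC": "Leu", "CTA": "Leu", "CTG": "Leu",
    "GCT": "Ala", "GCC": "Ala", "GCA": "Ala", "GCG": "Ala",
    "TGG": "Trp",
}

# Number of codons per amino acid: the 6 : 4 : 1 "random" expectation.
CODONS_PER_AA = Counter(CODON_TO_AA.values())


def leu_ala_trp_counts(seq, frame=0):
    """Count Leu, Ala and Trp codons in one reading frame of a DNA string.
    Hypothetical helper for illustration only."""
    seq = seq.upper()
    counts = Counter()
    for i in range(frame, len(seq) - 2, 3):
        amino_acid = CODON_TO_AA.get(seq[i:i + 3])
        if amino_acid is not None:
            counts[amino_acid] += 1
    return counts
```

Counting the individual codons instead of the amino acids gives a view on the codon bias mentioned on the slide.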
Coding sequences are characterized by: c) signals in the promoter region
Translation start: ATG (GTG, CTG)
Ribosome Binding Site: GGGAAGG
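A rough sketch of how these two signals could be combined: look for a start codon that is preceded, at a short distance, by the ribosome binding site. The start codons and the motif are taken from the slide; the exact-match search, the allowed spacer of 4 to 12 nucleotides and the function name are assumptions for illustration only.

```python
START_CODONS = ("ATG", "GTG", "CTG")  # as listed on the slide
RBS_MOTIF = "GGGAAGG"                 # example motif from the slide


def candidate_starts(seq, rbs=RBS_MOTIF, min_gap=4, max_gap=12):
    """Return (start_pos, rbs_pos) pairs (0-based) for start codons that have
    an exact copy of `rbs` upstream, separated by min_gap..max_gap bases.
    The spacer range is an assumed value, not taken from the slides."""
    seq = seq.upper()
    hits = []
    for pos in range(len(seq) - 2):
        if seq[pos:pos + 3] not in START_CODONS:
            continue
        for gap in range(min_gap, max_gap + 1):
            rbs_pos = pos - gap - len(rbs)
            if rbs_pos >= 0 and seq[rbs_pos:rbs_pos + len(rbs)] == rbs:
                hits.append((pos, rbs_pos))
                break
    return hits
```

Real binding sites rarely match a motif exactly, so a practical search would score approximate matches rather than require identity.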
Problems associated with coding sequence recognition
- many small putative CDS (cut-off)
- deviations in start site
- sequencing errors (frameshifts)
Strategies to find coding sequences
In practice, most gene finding programs use HMMs to predict protein-encoding genes.
• Train on a set of known genes:
  • Genes with a good database hit
  • Large genes with no overlap
  • Experimentally identified genes
  • …
Strategies to find Coding sequences Many different tools available: Glimmer2, GeneMark, EasyGene, FrameD, …… “Protein-coding regions in the genome sequence were identified using a combination of software tools including EasyGene [42], Glimmer [43] and FrameD [44].”
What is function?
Inventory:
- What can it do?
- which conversions are catalyzed
- which metabolites are transported
- relates to physiology
- depends on environment
- with which components can it interact
- ...
The attribute function is ambiguous
context independent (molecular function or properties; chemistry/physics)
- catalyze certain reactions
- interact with certain proteins
- bind to a specific DNA sequence
context dependent (role; biology/physiology)
- act in a certain pathway
- be a member of certain protein complex(es)
- act as a transcription factor
Annotation using a controlled vocabulary (ontologies)
In library and information science, a controlled vocabulary is a carefully selected list of words and phrases that are used to tag units of information (a document or work) so that they can be retrieved more easily by a search.
Descriptors of molecular function:
- Enzymatic conversions: EC-number (IUBMB)
- Transport: TC-number (Saier)
Examples of ontologies: Gene Ontology, BioPAX
Genome sequence and how it relates to function
There are several properties of the translated and non-translated genome sequence that are identifiers of the function/role of a protein:
• Evolutionary conservation of sequence
• Operon composition
• Regulatory connections
• Connections in the cellular network
(molecular function) (biological role)
Evolutionary conservation of sequence
Homology as an indicator of functional similarity
[Tree figure with genes A1, B1, C1 and A2, B2, C2a, C2b]
Orthologs: supposed identical molecular function
Paralogs: supposed similar molecular function
In-Paralogs: diverged (similar molecular function) homologs
Evolutionary conservation of sequence Strategy: to transfer annotation from experimentally verified ortholog/equivalent -> identify orthologs/equivalents
Determining evolutionary relations: Retrieving homologs BLAST: will yield similar sequences from database Example: map2 of L. plantarum In a simple case: one good hit per genome
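The "one good hit per genome" idea can be sketched as a filter over tabular BLAST output (12 tab-separated columns, with the subject id in column 2 and the bit score in column 12). The function name and the `genome_of` lookup table that maps a subject id to its genome are hypothetical; how that mapping is obtained depends on the database used.

```python
def best_hit_per_genome(tabular_lines, genome_of):
    """Keep the highest-scoring hit per genome from tabular BLAST output.

    tabular_lines: iterable of tab-separated hit lines (12 standard columns).
    genome_of: dict mapping subject id -> genome name (assumed to be
               supplied by the user; hypothetical for this sketch).
    Returns {genome: (subject_id, bitscore)}.
    """
    best = {}
    for line in tabular_lines:
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comment lines
        fields = line.split("\t")
        subject, bitscore = fields[1], float(fields[11])
        genome = genome_of.get(subject)
        if genome is None:
            continue  # subject not assigned to any genome of interest
        if genome not in best or bitscore > best[genome][1]:
            best[genome] = (subject, bitscore)
    return best
```

When a genome contributes several hits with comparable scores, as with the four maltose phosphorylase homologs of L. plantarum, the best hit alone does not settle orthology and a tree is needed.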
Determining evolutionary relations
Procedure:
#Collect sequences and make multiple sequence alignment
MUSCLE: muscle -in FASTA.txt -out FASTA.aln
Determining evolutionary relations: Alignments and Trees
#Visualize multiple sequence alignment in CLUSTAL-X and check homogeneity (conserved features, few gaps)
#Create bootstrapped NJ-tree (corrected for multiple substitutions)
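"Corrected for multiple substitutions" means that the observed fraction of differing positions is converted into an estimated number of substitutions per site before the tree is built. The sketch below uses Kimura's empirical formula for proteins, d = -ln(1 - p - 0.2 p^2), as one common choice; which correction the tree-building program applies is an assumption here, and the function names are made up for the example.

```python
import math


def p_distance(seq_a, seq_b):
    """Fraction of differing residues between two aligned sequences,
    ignoring every column in which either sequence has a gap ('-')."""
    if len(seq_a) != len(seq_b):
        raise ValueError("aligned sequences must have equal length")
    pairs = [(a, b) for a, b in zip(seq_a, seq_b) if a != "-" and b != "-"]
    if not pairs:
        raise ValueError("no ungapped columns to compare")
    return sum(a != b for a, b in pairs) / len(pairs)


def kimura_protein_distance(p):
    """Kimura's empirical correction for protein distances (assumed choice)."""
    argument = 1.0 - p - 0.2 * p * p
    if argument <= 0:
        raise ValueError("sequences too divergent for this correction")
    return -math.log(argument)
```

The corrected distance is always at least as large as the observed one, and the difference grows with divergence, which is why the correction matters most for distant homologs.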
Determining evolutionary relations: Use tree and gene context to infer orthology/equivalency
Example: Lactobacillus plantarum has 4 maltose phosphorylase homologs
Experimentally characterized homologs in the tree (substrate, reference):
- kojibiose (Chaen et al. J. Appl. Glycosci. 1999)
- trehalose (Inoue et al. Biosci. Biotechnol. Biochem. 2002)
- maltose (Huwel et al. Enzyme Microb. Techn. 1997)
- maltose (Inoue et al. Biosci. Biotechnol. Biochem. 2001)
LOFT: R. vd Heijden et al. BMC Bioinformatics
Evolutionary conservation of sequence: Gene order conservation to identify functional equivalents
[Figure: gene neighborhoods of map2, map3 and map2/3 homologs in Lactobacillus plantarum, Lactobacillus gasseri, Bacillus subtilis, Bacillus licheniformis, Lactobacillus brevis, Pediococcus pentosaceus and Leuconostoc mesenteroides]
Molecular function versus biological role
Map2 and Map3 have an identical molecular function but distinct biological roles
Coffee Break
DNA sequencing (assembly) -> identifying genes (Part I) -> protein function prediction -> function reconstruction -> modeling biology