1.3k likes | 1.31k Views
Dive into microbial genomes, sequence alignment, markov chains, assemblies, annotations, and phylogenies. Understand pathogenomics and genetic variance through SNVs, SNPs, InDels, and Markov models.
E N D
MM7050Microbial PathogenesisBioinformatics Block Gary Van Domselaar Ph.D. Chief, Bioinformatics NML – PHAC Adjunct, Med Micro gary.vandomselaar@canada.ca
MM7050Microbial PathogenesisBioinformatics Block • Background • Genomes • Sequence Alignment • Markov Chains and Markov Models • Assemblies • De novo • Reference-guided • Genome Annotation • Regional Annotation • Functional Annotation • Phylogenies • Understanding Phylogenetic Trees • Building Phylogenetic Trees • Pathogenomics
Genomes • The entirety of an organism's hereditary information. • Encoded either in DNA or, for many types of virus, in RNA. • The genome includes both the genes and the non-coding sequences of the DNA/RNA.
Double Stranded DNA Antisense strand Sense strand Direct strand …TACCCCGGTCCGAATCCGGTAAATCCATTACCTATGCATGATTCCCATAAGACTAGGAT… …ATGGATTTACCGGATTCGGACCGGGGTAATCCGTCTTATGGGAATCATGCATATTGGTA… Opposite strand Sense strand Antisense strand
Sequence Alignments Pairwise Alignment Multiple Alignment
T H I S S E Q U E N C E | | | | | | | | | | T H A T S E Q U E N C E Sequence Alignment Sequence alignment aims to align sequences so that as many characters pairs are aligned The % identical characters is called the percent identity.
T H A T S E Q U E N C E | | | | T H I S I S A S E Q U E N C E Sequence Alignment Simple when sequences are similar, but what about when become more different?
T H A T S E Q U E N C E | | | | T H I S I S A S E Q U E N C E Sequence Alignment T H- - - - A T S E Q U E N C E | | | | | | | | | | | T H I S I S A-S E Q U E N C E To get around this problem gaps are inserted into the sequence
Sequence Alignment • When aligning protein sequences some amino acids (characters) are structurally similar to each other; they can substitute for each other and are well tolerated. Serine Threonine
Sequence Alignment • When aligning protein sequences some amino acids (characters) are structurally similar to each other; they can substitute for each other and are well tolerated. Serine Tryptophan
Sequence Alignment • Scoring matrices can be used to which amino acids are similar to each other
T H I SS E Q U E N C E | | | | | | | | | | T H A T S E Q U E N C E 5 8 0 1 1 4 5 5 6 9 5 4 Sequence Alignment Adding the similarity scores gives the overall alignment score. The percent similar + identical characters is called the % similarity.
Global & Local Alignments • Local alignments are useful when only a subregion of two sequences. • Global alignments are useful when you want to compare two closely related sequences that are roughly the same size.
Multiple Alignments • Alignment of two sequences is called pairwise alignment. • Alignment of more than two sequences is called multiple alignment.
SNVs and SNPs • A SNV (single nucleotide variant) is DNA sequence variation occurring when a single nucleotide differs between two or more genomes. • A SNP (single nucleotide polymorphism) is a SNV that is fixed in a population at an abundance of >1%. • SNP is often used to mean SNV. GTTACTGTCGTTGTAATACTCCAG GTTACTGTCGTTGTAATACTCCAG GTTACTGTCGTTGTAATAATCCAG GTTACTGTCGTTGTAATACTCCAG GTTACTGTCGTTGTAATAATCCAG GTTACTGTCGTTGTAATAATCCAG GTTACTGTCGTTGTAATACTCCAG GTTACTGTCGTTGTAATACTCCAG SNV/SNP
InDels • An InDel (insertion/deletion) is a region in one sequence that is inserted (or deleted) relative to another) • Can be a single base (InDel SNP) or can be larger GTTACTGTCGTTGTAATACTCCAG GTTACTGTCGTTGTAATACTCCAG GTTACTGTCGTTGTAATA-TCCAG GTTAC---CGTTGTAATACTCCAG GTTAC---CGTTGTAATA-TCCAG GTTACTGTCGTTGTAATA-TCCAG GTTACTGTCGTTGTAATACTCCAG GTTAC---CGTTGTAATACTCCAG InDel SNP InDel
Markov Chains • Often in bioinformatics we want see if a biological sequence encodes a certain biological feature such as a gene. Markov chains let us do that. • A Markov chain is a random probabilistic process that evolves from one state to another over a set of states. • Transition from one state to another occurs through a finite number of possible future states, and the probability distribution of transitioning to a future state depends only upon the current state. Transition probability State 2 State 1 11122121222212211112112112221121
Markov Chains: The Drunkard’s Walk A simple example of a Markov chain is the “drunkard’s walk” where at each step, the drunkard may stumble (transition) one step forward (F) or backwards (B) with some probability (0.5). Note that the transition probabilities depend only on the drunkard’s current position (his state), not how he arrived there. 0.5 B F 0.5 0.5 0.5
Markov Chains CGTTTCTCTTTAATCCCGCGCGCGCGCCTCGATATATCTCGCG • Example: CpG islands. A+ A- C+ C- S S E E G+ G- T+ T- See Durbin et al., Biological Sequence Analysis,. Cambridge 1998
Hidden Markov Models • Similar in concept to Markov Chains: • Set of States S = {s1, s2, …, sn} • Transition probability matrix T. • More powerful than Markov Chains. • Emission probability matrix E for each state. • Can’t determine state from observation alone(the state is “hidden”).
The Occasionally Dishonest Casino Player • Imagine that a casino player has two dice: • A fair die with P(1) = P(2) = P(3) = P(4) = P(5) = P(6) = 1/6 • A loaded die, with P(1) = P(2) = P(3) = P(4) = P(5) = 1/10 and P(6) = 1/2 . • He switches dice randomly, without respect to the last time he switched. In other words, on every turn, he is equally likely to switch dice. • On average, he switches dice once every 20 turns • Given a sequence of rolls, when was the player rolling the fair die and when was he rolling the loaded die? 0.95 0.95 0.05 Fair Loaded P(1|F) = 1/6 P(2|F) = 1/6 P(3|F) = 1/6 P(4|F) = 1/6 P(5|F) = 1/6 P(6|F) = 1/6 P(1|F) = 1/10 P(1|F) = 1/10 P(1|F) = 1/10 P(1|F) = 1/10 P(1|F) = 1/10 P(6|F) = 1/2 0.05
Hidden Markov Models • Example: CpG islands. CGTTTCTCTTTAATCCCGCGCGCGCGCCTCGATATATCTCGCG A+ C+ G+ T+ A- C- G- T-
HMM Training CGTTTCTCTTTAATCCCGCGCGCGCGCCTCGATATATCTCGCG --------------++++++++++++++-----------++++ A+ C+ G+ T+ A- C- G- T-
Profile HMMs Deletion GCAAGAGGCGA GCA-GAGCCGA --AAGAGCCGA GGGAGAGCCGA Match S E Insertion
Profile HMMs • Patterns can be used to describe a sequence • [GA]-[CTG]-[AG]-[AT]-G-A-G-C-C-G-A GCAAGAGGCGA GCATGAGCCGA ATAAGAGCCGA GGGAGAGCCGA
Profile HMMs 12345678901 GCAAGAGGCGA GCATGAG---A -TAAGAGCCGA GGGAGAGCCGA • A Profile (also called a position specific scoring matrix) is a more accurate way to model a group of sequences
How whole genome sequencing works DNA Extraction Bacterial Growth Sample collection Fragment Sequencing Raw Sequence Reads Prep DNA Library
NGS Read Types • Single-end, “shotgun” read ==---------- • sequence from one end of a single NGS library fragment • Paired-end read ==--------== • sequence from both ends of the samefragment • space between reads is NGS library "insert size" (< 800bp) • inward-facing sequences • Mate-pair read ==-------~~~~~~~~~~~~~~~~~~-------== • sequence both ends of a pseudo-fragment molecule • longer fragment sizes (3-10 kbp), much of which goes un-sequenced • Generated reads are for a pseudomolecule derived from non-contiguous sequences • Reads generated are generally outward-facing on source DNA molecule • Used to acquire genomic scaffolds / detect structural variation
What is an assembly? Multiple Identical Genome Copies Reads Contigs AACTCGCTCGCCTGCTTATCAACCGATCTACCCATGGATAC… ATAC… …AAAACTCGCTCGCCTGCTTATCAACCGATCTACCCATGGAT …AAAACTCGCTCGC …AAAACTCGCTCGCCTGCTTATCAACCGATCTACCCATGGATAC… CTCGCTCGCCTGCTTATCAACCGATCTACCCATGGATAC… …AAAACTCG CCATGGATAC… …AAAAC …AAAACTCGCTCGCCTGCTTATCAACCGATCTACCCATGGATAC…
De novo assembly ACGTAGCTTAGCTAGATC ||||| AGATCCTCGTCTCGTCGT Overlap Overlap • Calculate the sequence overlap between all available reads
De novo assembly Repeat Region GAGGGTAATCCGATAAGCGG GGTAATCCGATAAGCGGATATTAGAT TAATCCGATAAACGGATATTAGATT ATCCGATAAGCGGATATTAGATTAC ATCCGATAAACGGATATTAGATTACCT GATAAGCGGATATTAGATTACCTA Layout • Arrange the reads according to their pattern of overlap, producing a multiple alignment of the reads • Detect repeat regions, resolve ambiguities.
De novo assembly Repeat Region Consensus • Analyze the layout tilings to determine the most likely, or most highly represented consensus sequence. GAGGGTAATCCGATAAGCGG GGTAATCCGATAAGCGGATATTAGAT TAATCCGATAAACGGATATTAGATT ATCCGATAAGCGGATATTAGATTAC ATCCGATAAACGGATATTAGATTACCT GATAAGCGGATATTAGATTACCTA GAGGGTAATCCGATAAGCGGATATTAGATTACCTA
Reference mapping • Reference mapping assembles reads by aligning to a highly similar reference • Aligned reads are referred to as a ‘pileup’ • Good for clonal organisms. • Main application is in SNP discovery GAGGGTAATCCGATAAGCGGATATTAGATTACCT GGTAATCCGATAAAGGATATTAGAT TAATCCGATAAACGGATATTAGATT ATCCGATAAACGGATATTAGATTAC ATCCGATAAACGGATATTAGATTACCT GATAAACGGATATTAGATTACCTA GAGGGTAATCCGATAAACGGATATTAGATTACCTA
Part II: Genome Annotation Contigs Protein Encoding Genes Regional Annotation Non-protein encoding genes Functional Annotation Automated Annotation rRNA tRNA others Manual Annotation Annotated Genome Intergenic Scan
Prokaryotic Genome Organization • DNA is double stranded, genes can occur on both the forward(or direct) strand and the reverse (or opposite) strand. • Genes are densely packed • Genes can (partially) overlap. • Space between genes is called the intergenic region. • Prokaryotic genes are contiguous. • Genes can be coding (encode proteins) or noncoding (encode RNA elements).
Simple structure of a prokaryotic gene s70-type Promoter Transcriptional terminator TTGACA ---17bp---TATAAT---A UUUUUUU Shine-Dalgarno/ Ribosome Binding Site (RBS) AGGAGG Start Stop Regulatory sequences 1-500 nts 2-8 nts (~ 5 nts) variable coding sequence 3’ UTR 5’ UTR 5’ UTR
Open Reading Frames • The most basic characteristic of a gene is that it must contain an open reading frame (ORF). • ORFs begin with a start codon and end with an end codon. • Thestart codon is typically ATG coding for methionine located at the 5’ end. • In prokayotes there are exceptions, e.g. E. coli also uses GTG and TTG. • Codons occur in frame with the start codon until they reach the stop codon, one of TAA, TAG, TGA. • An ORF can only have a single stop codon right at the 3’ end. • Some ORFs contain coding sequences; some do not. Shine-Dalgarno Start Codon Stop Codon 5’ 3’ 5’ UTR Open Reading Frame 3’ UTR
ORF Detection • ORFs are detected ‘stop to stop’ in each frame. • True coding genes have little overlap, but many genes have some overlap • Coding prediction is the process of distinguishing true coding ORFs from noncoding ORFs, and identifying the correct start codon. +3 +2 +1 -1 -2 -3 http://manatee.sourceforge.net/jcvi/pdf/overview.pdf
Ribosome Binding Sites • A ribosomal binding site (RBS) is a sequence on mRNA that is bound by the ribosome when initiating protein translation. The RBS positions the ribosome at the start codon. • In prokayotes the RBS is called the Shine-Dalgarno sequence • The six-base consensus sequence is AGGAGG • Can be used to identify the start codon in a transcript • RBSFinder can be used to identify RBSes. RBS Start Codon ORF
GeneMark CGTGT GTGAT • GeneMark uses a Markov chain model to represent the statistics of coding and noncoding reading frames. • Uses dicodon (6-nt) statistics and 5th order Markov model (pentamers). • A training set is generated containing all possible pentameters from a large collection of known genes. • Non-coding sequences are assumed to have no bias, p(A/G/T/C) = .25 for all possible pentameters. CGTGA GTGAG ACGTG GTGAC CGTGC TGACA CGTGG GTGCC A portion of a 5th-order Markov chain. Successive states have labels shifted by one symbol.
GeneMark TAGGAGCGGGGCGATGATATCTAGGAGGAATACTATGCCAATCGGCTTTTACGTGCACCACCGCGGGCGGCTCATTGAGTTTAG TAGGA AGGAG GGAGC GAGCG ..... AGGAC GGATA AGGAT GGATA TAGGA AGGAC AGGAG GAGCG start AGGAT GGAGC … Which is highest? p(x|coding) or p(x|noncoding?)
GLIMMER Genome Sequence Data • GeneMark requires a large training databases, because there are 46=4096 possible dicodoncombinations, and each dicodon must occur enough times to get a reliable statistic. Also gene overlaps are not allowed. • GLIMMER tries to avoid this by selecting its own training set from the genome. • It uses 0th to 8th order Markov models • It starts highest to lowest to get the best # of observations (>400) from the training data • If it can’t get enough counts it models the likelihood from the ith-1 order model. This is called an interpolated Markov model. • Coding regions are predicted in the 3’5’ direction. This improves start site prediction. • Can handle overlapping ORFs • Highly accurate, can predict, >95% of genes • Note: GeneMark has been improved by the development of a hidden Markov model. It can predict overlaping ORFs and uses ribosome binding signals to predict the start codon. Find Long ORFs (>500 bases stop-to-stop) Create Training Set (interpolated Markov model) DistinguishCoding Orfs from NoncodingOrfs
tRNA and rRNA • In addition to protein coding sequences encoding mRNA, prokaryotic genomes encode transfer RNA (tRNA) and ribosomal RNA (rRNA). Both are involved in the process of mRNA translation. • tRNAs bind mRNA with its anticodon recognizing a specific mRNA codon; each anticodon specifies an attached amino acid. • rRNA is the RNA component of the ribosome providing peptidyltransferase activity. • rRNA forms two subunits, the large subunit (50S) and small subunit (30S). • The 50S subunit contains two rRNA species, the 5S and the 23S rRNAs. • The 30S subunit contains the 16S rRNA. • Bacterial 16S, 23S, and 5S rRNA genes are typically organized as a co-transcribed operon. • Bacterial genomes contain variable copies of the rRNA operon. • The 3’ end of 16S rRNA binds the Shine-Dalgarno sequence.