Where In A Genome Does DNA Replication Begin?

Bioinformatics: Where In A Genome Does DNA Replication Begin?
Slides are based on: Bioinformatics Algorithms: An Active Learning Approach, 2nd edition, by Phillip Compeau and Pavel Pevzner
Lecture #11. By: Dr. Emad Nabil, Fall 2019, FCI-CU

How to search the whole genome for the most frequent k-mer
Plan of study
• Find oriC (≈500 nucleotides) experimentally in the lab for a particular bacterium, Vibrio cholerae.
• Develop an algorithm to find what is special about this region of ≈500 nucleotides (i.e., search for hidden messages).
• Consider performance and look for a better algorithm.
• How can we search the whole genome for these hidden messages using the algorithms developed so far? (We do not know which region holds the hidden messages, so we must search the entire genome.)


Finding Replication Origin
• Our strategy BEFORE: given a previously known oriC (a 500-nucleotide window),
• find frequent words (clumps) in oriC as candidate DnaA boxes.
• replication origin → frequent words
But what if the position of the replication origin within a genome is unknown?
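The "frequent words" step of the earlier strategy can be sketched as below. This is a minimal illustration, not necessarily the implementation used in the course; the function name frequent_words and the short test string are chosen here for the example.

```python
from collections import Counter


def frequent_words(text: str, k: int) -> list[str]:
    """Return every k-mer that reaches the highest count in text.

    Overlapping occurrences are counted.
    """
    counts = Counter(text[i:i + k] for i in range(len(text) - k + 1))
    if not counts:
        return []
    best = max(counts.values())
    return sorted(kmer for kmer, c in counts.items() if c == best)


# Short illustrative string (not a real oriC).
print(frequent_words("ACGTTGCATGTCGCATGATGCATGAGAGCT", 4))  # ['CATG', 'GCAT']
```

Applied to a known 500-nucleotide oriC with k = 9, the same function would list the candidate DnaA boxes.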

Find Patterns Forming Clumps in a String
Using a sliding window
[Figure: a window of length L sliding along the genome, and so on.]
The window length L = 500 reflects the typical length of oriC in bacterial genomes.

Clumps of Hidden Messages

Outline
• Search for Hidden Messages in Replication Origin
• What is a Hidden Message in Replication Origin?
• Some Hidden Messages are More Surprising than Others
• Clumps of Hidden Messages
• From a Biological Insight toward an Algorithm for Finding Replication Origin
• Asymmetry of Replication
• Why would a computer scientist care about asymmetry of replication?
• Skew Diagrams
• Finding Frequent Words with Mismatches
• Open Problems

What is a Clump?
We say a k-mer forms an (L, t)-clump if it appears at least t times within a short region of length L inside a genome.
• ATGATCAAG forms a (500,3)-clump in the Vibrio cholerae genome.
• CCTACCACC forms a (500,3)-clump in the Thermotoga petrophila genome.
• The window length L = 500 reflects the typical length of oriC in bacterial genomes.
• TGCA forms a (25,3)-clump in the Genome: gatcagcataagggtccCTGCAATGCATGACAAGCCTGCAGTtgttttac
In other words, it occurs 3 times in a window of 25 characters.

1E Find Patterns Forming Clumps in a String
Using a sliding window
[Figure: genome length = 100, window size = 75, a 5-mer found 4 times within 75 characters.]
In the example above, a clump is a k-mer (k = 5) that appears t times (t = 4) in a region of window length L (L = 75) in the genome.
Find all distinct k-mers that appear at least t times within L characters.
http://rosalind.info/problems/ba1e/
Testing dataset: http://bioinformaticsalgorithms.com/data/extradatasets/replication/clump_finding.txt
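One way to sketch clump finding is shown below. It is an assumption of this example, not the slides' method: instead of literally sliding a window, it records the start positions of each k-mer and checks whether any t consecutive occurrences fit inside L characters, which gives the same answer. The name find_clumps is hypothetical.

```python
from collections import defaultdict


def find_clumps(genome: str, k: int, L: int, t: int) -> set[str]:
    """Return all k-mers that occur at least t times inside some window of length L."""
    positions = defaultdict(list)
    for i in range(len(genome) - k + 1):
        positions[genome[i:i + k]].append(i)

    clumps = set()
    for kmer, starts in positions.items():
        # t consecutive occurrences fit in one window if the span from the
        # first start to the end of the t-th occurrence is at most L.
        for j in range(len(starts) - t + 1):
            if starts[j + t - 1] + k - starts[j] <= L:
                clumps.add(kmer)
                break
    return clumps


# The (25,3)-clump example from the definition slide.
genome = "gatcagcataagggtccCTGCAATGCATGACAAGCCTGCAGTtgttttac".upper()
print("TGCA" in find_clumps(genome, k=4, L=25, t=3))  # True
```

Collecting positions first needs one pass over the genome, so it avoids recounting k-mers for every window position.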

What we have done so far: finding k-mers that form an (L, t)-clump, i.e., a region of length L that contains some k-mer t times. This region should be the oriC.
Remember the oriC of Vibrio cholerae:
atcaatgatcaacgtaagcttctaagcATGATCAAGgtgctcacacagtttatccacaacctgagtggatgacatcaagataggtcgttgtatctccttcctctcgtactctcatgaccacggaaagATGATCAAGagaggatgatttcttggccatatcgcaatgaatacttgtgacttgtgcttccaattgacatcttcagcgccatattgcgctggccaaggtgacggagcgggattacgaaagcatgatcatggctgttgttctgtttatcttgttttgactgagacttgttaggatagacggtttttcatcactgactagccaaagccttactctgcctgacatcgaccgtaaattgataatgaatttacatgcttccgcgacgatttacctCTTGATCATcgatccgattgaagatcttcaattgttaattctcttgcctcgactcatagccatgatgagctCTTGATCATgtttccttaaccctctattttttacggaagaATGATCAAGctgctgctCTTGATCATcgtttc
Most frequent 9-mers in this oriC (all appear 3 times): ATGATCAAG, CTTGATCAT

Are we finished now?

The Vibrio cholerae genome size is 1.05 MB. The E. coli genome size is around 4.6 MB.
There are 1904 different 9-mers forming (500,3)-clumps in the E. coli genome. It is absolutely unclear which of them point to the replication origin…


Now what?
• Maybe we need to learn more about the details of the replication process.
• Hopefully these details will provide new algorithmic insights into finding oriC.
Ask a biologist for help.

Important notes about directions
• Replication (creation of DNA): the template strand of DNA is read in the 3’ to 5’ direction (the opposite direction).
• Transcription (production of RNA): the template strand of DNA is read in the 3’ to 5’ direction (the opposite direction).
• Translation (production of proteins): mRNA templates are read in the 5’ to 3’ direction (the normal direction).

DNA Polymerase
DNA polymerases are unidirectional: they can only traverse a parent strand in the opposite (3’ → 5’) direction.
[Figure: depiction of DNA replication with replication fork, strands, and Okazaki fragments. a: template strands, b: leading strand, c: lagging strand, d: replication fork, e: RNA primer, f: Okazaki fragment]

Cut the genome from the terminus

[Figure: the genome cut at the terminus, with its half-strands labeled F (forward) and R (reverse).]

The reverse half-strand lives a double-stranded life most of the time.
The forward half-strand spends a large portion of its life single-stranded, waiting to be replicated.
(F: forward, R: reverse)

FACT: Single-stranded DNA has a much higher mutation rate than double-stranded DNA.
Chemically: cytosine (C) has a tendency to mutate into thymine (T) through a process called deamination.
So it is normal for C to be less frequent on the forward half-strands, which remain single-stranded for a longer time.
(F: forward, R: reverse)

[Figure: the half-strands labeled with their nucleotide bias. Forward (F) half-strands: low C, high G. Reverse (R) half-strands: low G, high C.]

Considering a single strand
[Figure: Gs-Cs plotted along the strand. Forward (F) half-strand: low C, high G. Reverse (R) half-strand: low G, high C.]
• The difference (Gs-Cs) is decreasing on the reverse half-strand, but increasing on the forward half-strand (which remains single-stranded for more time).
• Traverse the genome, keeping a running total of Gs-Cs.
• If this difference starts increasing, we guess that we are on the forward half-strand;
• if this difference starts decreasing, we guess that we are on the reverse half-strand.
• If at some point the Gs-Cs count switches its behavior from decreasing to increasing:
• you are at the beginning of this genome's oriC.


E. coli Skew Diagram
[Figure: skew diagram of the E. coli bacterium, with the actual oriC marked.]

Skew Diagram
• Skew (the running Gs-Cs difference) for DNA: CATGGGCATCGGCCATACGCC
[Figure: skew diagram of this string, with oriC marked.]

Skew Diagram
Skew(k): #Gs - #Cs for the first k nucleotides of Genome.
Skew diagram: plot Skew(k) against k.
Since we don't know the location of oriC in a circular genome, let's linearize it (i.e., select an arbitrary position and pretend that the genome begins here), resulting in a linear string Genome.
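The definition of Skew(k) translates directly into a running sum. The sketch below is illustrative only: the name skew_array is chosen here, and the input is assumed to be upper-case A/C/G/T.

```python
def skew_array(genome: str) -> list[int]:
    """Return [Skew(0), Skew(1), ..., Skew(len(genome))].

    Skew(k) = (#G - #C) over the first k nucleotides; Skew(0) = 0.
    """
    skew = [0]
    for nt in genome:
        if nt == "G":
            delta = 1
        elif nt == "C":
            delta = -1
        else:
            delta = 0  # A and T leave the skew unchanged
        skew.append(skew[-1] + delta)
    return skew


print(skew_array("CATGGGCATCGGCCATACGCC"))
# [0, -1, -1, -1, 0, 1, 2, 1, 1, 1, 0, 1, 2, 1, 0, 0, 0, 0, -1, 0, -1, -2]
```

Plotting these values against k gives the skew diagram for the example string.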

Considering the outer strand

How to download a genome for a certain species

How to download a genome for a certain species. Example: Vibrio cholerae
• The genome of V. cholerae consists of two circular chromosomes of 2,961,146 bp and 1,072,314 bp.
• The vast majority of recognizable genes for essential cell function and pathogenicity are located on the large chromosome.

How to download a genome for a certain species. Example: Vibrio cholerae
https://www.ncbi.nlm.nih.gov/

1F Find a Position(s) in a Genome Minimizing the Skew
Why can the answer contain more than one position? In the sample, these 2 indices have the same minimum skew (Gs-Cs) among all indices.
http://rosalind.info/problems/ba1f/
Testing dataset: http://bioinformaticsalgorithms.com/data/extradatasets/replication/minimum_skew.txt
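A minimal sketch of the minimum-skew search follows. The function name and the test string are assumptions of this example; positions follow the Skew(k) convention, so position k refers to the first k nucleotides, and the input is assumed to be upper-case.

```python
def minimum_skew_positions(genome: str) -> list[int]:
    """Return every position k at which Skew(k) attains its minimum."""
    skew = 0
    best = 0          # Skew(0) = 0 is the first candidate minimum
    positions = [0]
    for k, nt in enumerate(genome, start=1):
        if nt == "G":
            skew += 1
        elif nt == "C":
            skew -= 1
        if skew < best:
            best, positions = skew, [k]   # new strict minimum
        elif skew == best:
            positions.append(k)           # tie with the current minimum
    return positions


sample = "TAAAGACTGCCGAGAGGCCAACACGAGTGCTAGAACGAGGGGCGTAAACGCGGGTCCGAT"
print(minimum_skew_positions(sample))  # [11, 24]
```

Ties are kept on purpose: as the slide notes, several indices can share the same minimum skew, and each is a candidate location for oriC.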

Skew Diagram of E. coli: Where is the Origin of Replication?
[Figure: skew diagram of E. coli with oriC marked.]
You walk along the genome and see that #G - #C has been decreasing and then suddenly starts increasing: WHERE ARE YOU IN THE GENOME?

We Found the Replication Origin in E. coli
The minimum of the skew diagram points to this region in E. coli:
• aatgatgatgacgtcaaaaggatccggataaaacatggtgattgcctcgcataacgcggtatgaaaatggattgaagcccgggccgtggattctactcaactttgtcggcttgagaaagacctgggatcctgggtattaaaaagaagatctatttatttagagatctgttctattgtgatctcttattaggatcgcactgccctgtggataacaaggatccggcttttaagatcaacaacctggaaaggatcattaactgtgaatgatcggtgatcctggaccgtataagctgggatcagaatgaggggttatacacaactcaaaaactgaacaacagttgttctttggataactaccggttgatccaagcttcctgacagagttatccacagtagatcgcacgatctgtatacttatttgagtaaattaacccacgatcccagccattcttctgccggatcttccggaatgtcgtgatcaagaatgttgatcttcagtg
BUT… there are no frequent 9-mers (appearing three or more times) in this region!


Back to E. coli 9-mers!
• The oriC of E. coli is located by its skew.
• None of the hundreds of 9-mers discovered by the frequent k-mers program forms a (500,3)-clump at the discovered oriC!
• Before we give up, let’s examine the oriC of Vibrio cholerae one more time!

Back to Vibrio cholerae oriC!
• atcaatgatcaacgtaagcttctaagcATGATCAAGgtgctcacacagtttatccacaacctgagtggatgacatcaagataggtcgttgtatctccttcctctcgtactctcatgaccacggaaagATGATCAAGagaggatgatttcttggccatatcgcaatgaatacttgtgacttgtgcttccaattgacatcttcagcgccatattgcgctggccaaggtgacggagcgggattacgaaagcatgatcatggctgttgttctgtttatcttgttttgactgagacttgttaggatagacggtttttcatcactgactagccaaagccttactctgcctgacatcgaccgtaaattgataatgaatttacatgcttccgcgacgatttacctCTTGATCATcgatccgattgaagatcttcaattgttaattctcttgcctcgactcatagccatgatgagctCTTGATCATgtttccttaaccctctattttttacggaagaATGATCAAGctgctgctCTTGATCATcgtttc
• It contains 3 occurrences of ATGATCAAG and 3 of CTTGATCAT (its reverse complement, RC).
• In addition, it contains ATGATCAAC and CATGATCAT.
• What are these?
• These two 9-mers are DnaA boxes, but with mismatches: ATGATCAAC differs from ATGATCAAG in one position, and CATGATCAT differs from CTTGATCAT in one position.
• Eight 9-mers in a short region are statistically more surprising than six 9-mers!
• atcaATGATCAACgtaagcttctaagcATGATCAAGgtgctcacacagtttatccacaacctgagtggatgacatcaagataggtcgttgtatctccttcctctcgtactctcatgaccacggaaagATGATCAAGagaggatgatttcttggccatatcgcaatgaatacttgtgacttgtgcttccaattgacatcttcagcgccatattgcgctggccaaggtgacggagcgggattacgaaagCATGATCATggctgttgttctgtttatcttgttttgactgagacttgttaggatagacggtttttcatcactgactagccaaagccttactctgcctgacatcgaccgtaaattgataatgaatttacatgcttccgcgacgatttacctCTTGATCATcgatccgattgaagatcttcaattgttaattctcttgcctcgactcatagccatgatgagctCTTGATCATgtttccttaaccctctattttttacggaagaATGATCAAGctgctgctCTTGATCATcgtttc
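The reverse-complement relationship mentioned on this slide can be checked with a few lines. This is a small sketch; the helper name reverse_complement is chosen for this example, and upper-case input is assumed.

```python
# Translation table mapping each nucleotide to its complement.
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(pattern: str) -> str:
    """Complement every nucleotide, then reverse the string."""
    return pattern.translate(COMPLEMENT)[::-1]


print(reverse_complement("ATGATCAAG"))  # CTTGATCAT
```

Because DNA is double-stranded, an occurrence of CTTGATCAT on one strand is an occurrence of ATGATCAAG on the other, which is why both are counted as the same hidden message.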

Finding k-mers with mismatches
• The discovery of these approximate 9-mers makes sense biologically,
• since DnaA can bind not only to “perfect” DnaA boxes and their reverse complements, but also to slight modifications of them.

Compute the Hamming Distance Between Two Strings http://rosalind.info/problems/ba1g/
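The Hamming distance is the number of positions at which two equal-length strings differ. A minimal sketch follows; the function name is chosen here, and raising an error on unequal lengths is a design choice of this example.

```python
def hamming_distance(p: str, q: str) -> int:
    """Count the positions at which two equal-length strings differ."""
    if len(p) != len(q):
        raise ValueError("strings must have the same length")
    return sum(a != b for a, b in zip(p, q))


print(hamming_distance("GGGCCGTTGGT", "GGACCGTTGAC"))  # 3
print(hamming_distance("ATGATCAAC", "ATGATCAAG"))      # 1
```

The second call uses the two 9-mers from the Vibrio cholerae oriC: the approximate DnaA box is one mismatch away from the exact one.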

1H Find All Approximate Occurrences of a Pattern in a String
Use a sliding window.
Sample Input (so the k-mer is given!):
ATTCTGGA
CGCCCGAATCCAGAACGCATTCCCATATTTCGGGACCACTGGCCTCCACGGTACGGACGTCAATCAAAT
3
Sample Output:
6 7 26 27
http://rosalind.info/problems/ba1h/
Testing dataset: http://bioinformaticsalgorithms.com/data/extradatasets/replication/approximate_match.txt
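The sliding-window search can be sketched as below, using the sample input from this slide. The function name approximate_occurrences is an assumption of this example; positions are 0-based.

```python
def approximate_occurrences(pattern: str, text: str, d: int) -> list[int]:
    """Return all 0-based start positions where pattern matches text
    with at most d mismatches."""
    k = len(pattern)
    hits = []
    for i in range(len(text) - k + 1):
        window = text[i:i + k]
        mismatches = sum(a != b for a, b in zip(pattern, window))
        if mismatches <= d:
            hits.append(i)
    return hits


text = "CGCCCGAATCCAGAACGCATTCCCATATTTCGGGACCACTGGCCTCCACGGTACGGACGTCAATCAAAT"
print(approximate_occurrences("ATTCTGGA", text, 3))  # [6, 7, 26, 27]
```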

Find the Most Frequent Words with Mismatches in a String

1I Find the Most Frequent Words with Mismatches in a String
Input: a string together with integers k (the word length) and d (the maximum number of mismatches).
http://rosalind.info/problems/ba1i/
Testing dataset: http://bioinformaticsalgorithms.com/data/extradatasets/replication/frequent_words_mismatch.txt
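One possible sketch of this problem is shown below. It assumes a neighborhood-based approach: every k-mer in the text votes for all patterns within d mismatches of it, so a pattern's total is the number of windows it approximately matches. The helper names are chosen for this example, and the winning pattern does not have to occur exactly in the text.

```python
from collections import Counter


def neighbors(pattern: str, d: int) -> set[str]:
    """All strings of the same length within Hamming distance d of pattern."""
    if d == 0:
        return {pattern}
    if len(pattern) == 1:
        return set("ACGT")
    result = set()
    for suffix in neighbors(pattern[1:], d):
        mismatches = sum(a != b for a, b in zip(pattern[1:], suffix))
        if mismatches < d:
            # A mismatch is still available: the first symbol may be anything.
            result.update(nt + suffix for nt in "ACGT")
        else:
            # The budget is used up: the first symbol must match.
            result.add(pattern[0] + suffix)
    return result


def frequent_words_with_mismatches(text: str, k: int, d: int) -> list[str]:
    """Return the k-mers with the most approximate occurrences in text."""
    if k < 1:
        raise ValueError("k must be at least 1")
    counts = Counter()
    for i in range(len(text) - k + 1):
        counts.update(neighbors(text[i:i + k], d))
    if not counts:
        return []
    best = max(counts.values())
    return sorted(p for p, c in counts.items() if c == best)


print(frequent_words_with_mismatches("ACGTTGCATGTCGCATGATGCATGAGAGCT", 4, 1))
# ['ATGC', 'ATGT', 'GATG']
```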

1J Find Frequent Words with Mismatches and Reverse Complements
http://rosalind.info/problems/ba1j/
Testing dataset: http://bioinformaticsalgorithms.com/data/extradatasets/replication/frequent_words_mismatch_complements.txt
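A hedged sketch of the reverse-complement variant follows. It assumes the score of a pattern is its approximate count plus the approximate count of its reverse complement; the helper names are chosen here, and the neighborhood is generated by substituting at every choice of up to d positions.

```python
from collections import Counter
from itertools import combinations, product

COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(pattern: str) -> str:
    return pattern.translate(COMPLEMENT)[::-1]


def neighbors(pattern: str, d: int) -> set[str]:
    """All strings within Hamming distance d of pattern."""
    result = {pattern}
    for m in range(1, d + 1):
        for idxs in combinations(range(len(pattern)), m):
            for subs in product("ACGT", repeat=m):
                chars = list(pattern)
                for i, nt in zip(idxs, subs):
                    chars[i] = nt
                result.add("".join(chars))
    return result


def frequent_words_mismatches_rc(text: str, k: int, d: int) -> list[str]:
    """Return the k-mers maximizing Count_d(pattern) + Count_d(reverse complement)."""
    counts = Counter()
    for i in range(len(text) - k + 1):
        for candidate in neighbors(text[i:i + k], d):
            counts[candidate] += 1                      # window matches candidate
            counts[reverse_complement(candidate)] += 1  # window matches the RC's RC
    if not counts:
        return []
    best = max(counts.values())
    return sorted(p for p, c in counts.items() if c == best)


print(frequent_words_mismatches_rc("ACGTTGCATGTCGCATGATGCATGAGAGCT", 4, 1))
# ['ACAT', 'ATGT']
```

The answer always comes in reverse-complement pairs, since a pattern and its reverse complement receive the same score.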

DnaA box of E. coli
• After searching the region around the minimum-skew point (500 nucleotides) for the most frequent words with mismatches and reverse complements, we found: TTATCCACA, GGATCCTGG, GATCCCAGC, GTTATCCAC, AGCTGGGAT, and CTGGGATCA, all appearing 4 times with 1 mismatch and reverse complements.
What are these k-mers? Other hidden messages?

Finally, DnaA Boxes in E. coli!
Frequent 9-mers (with 1 mismatch and reverse complements) in the putative oriC of E. coli.
The experimentally confirmed DnaA box in E. coli (TTATCCACA) is indeed a most frequent 9-mer with 1 mismatch, along with its reverse complement TGTGGATAA:
aatgatgatgacgtcaaaaggatccggataaaacatggtgattgcctcgcataacgcggtatgaaaatggattgaagcccgggccgtggattctactcaactttgtcggcttgagaaagacctgggatcctgggtattaaaaagaagatctatttatttagagatctgttctattgtgatctcttattaggatcgcactgcccTGTGGATAAcaaggatccggcttttaagatcaacaacctggaaaggatcattaactgtgaatgatcggtgatcctggaccgtataagctgggatcagaatgaggggTTATACACAactcaaaaactgaacaacagttgttcTTTGGATAACtaccggttgatccaagcttcctgacagagTTATCCACAgtagatcgcacgatctgtatacttatttgagtaaattaacccacgatcccagccattcttctgccggatcttccggaatgtcgtgatcaagaatgttgatcttcagtg
