Introduction to Bioinformatics Research Project. The Feb 28 entry on the calendar (and the Research Project topic page) links to advice on how to choose a research group and how to begin a protein-centered research project, including how to find useful articles. Here I present an example of a DNA-centered research project, beginning mostly after the useful-article stage.
How to choose your research group. In this alternate universe, I’m in the DNA replication group. I’m particularly interested in the DNA sequences that determine the initiation of DNA replication. I’ve even read an article or two about them (Mueser TC et al (2010) Virol J 7:359), discovering...
Origin of DNA replication [Figure: circular dsDNA genome with the origin marked] ...that DNA in prokaryotes and their phages is primarily circular. To replicate it, the circle has to be opened at some point. That point is called the origin of replication.
Origin of DNA replication [Figure: circular dsDNA genome; bidirectional initiation at the origin] Opening the circle at the origin exposes two single strands. Both are replicated, with replication forks moving in both directions, away from the origin.
Origin of DNA replication [Figure: circular dsDNA genome; bidirectional initiation, elongation, separation] Eventually, two separate daughter circles are formed. ...But enough chatting. The issue is: how is the starting point chosen?
Origin of DNA replication. Zooming in on the origin, we see the two intertwined strands at oriC (i.e., the Origin of the Chromosome).
Origin of DNA replication. What makes the origin special is that it binds proteins essential for initiating replication. The picture shows green DnaA protein binding to the origin, along with a protein called FIS (more on this in a moment).
Origin of DNA replication. DnaA proteins bind not only to DNA but also to each other. With the help of a second DNA-binding protein, IHF (keep waiting), the bound DnaA proteins form a blob that distorts the DNA. The two strands of DNA separate at a nearby AT-rich region (you may recall that AT-rich regions are less stable than GC-rich regions).
Origin of DNA replication. That’s the general idea. For the rest of this project, I’m going to focus on DnaA, but before leaving the other proteins behind... (I hate throwing around undefined acronyms...) FIS: Factor for Inversion Stimulation in phage Mu. FIS was first discovered as a protein important in gene regulation by a phage.
Origin of DNA replication. IHF: Integration Host Factor for lysogeny of phage lambda. Same with IHF. It was first found as a protein used by a phage to integrate its genome into the bacterial genome. It’s amazing how many things were first found in phages.
Origin of DNA replication: How to recognize an origin of replication? But back to the main question at hand. I want to learn how to recognize origins of replication. If I build a tool that can find known bacterial origins, maybe I can use the tool to search for origins in bacteriophages. Do phages have the same sorts of origins? Don’t know.
Origin of DNA replication: How to recognize an origin of replication? But how to tell? One thing that distinguishes origins is their ability to bind DnaA protein. If DnaA binds to a specific sequence, then origins must have multiple copies of it in close proximity. Does DnaA bind to a specific sequence?
Origin of DNA replication DnaA binding site Kaguni (2006) Annu Rev Microbiol 60:351-371. Is DnaA binding to DNA specific? I found an article that says the answer is yes. The E. coli origin of replication, pictured above, has five specific binding sites for DnaA. I need to learn more about that sequence. Orange colored boxes are nice, but at this point, I need to get closer to the truth, closer to the sequence.
Origin of DNA replication DnaA binding site Kaguni (2006) Annu Rev Microbiol 60:351-371. Fuller et al (1984) Cell 38:889-900. Here’s the sequence of the E. coli origin region. R1-R4 represent the sequences protected by DnaA when it binds. Are they all the same sequence?
Origin of DNA replication DnaA binding site Kaguni (2006) Annu Rev Microbiol 60:351-371. Fuller et al (1984) Cell 38:889-900. For example R1 and R2... Are they the same sequence? Why are there two sets of nucleotides in each box?
Origin of DNA replication DnaA binding site Kaguni (2006) Annu Rev Microbiol 60:351-371. Fuller et al (1984) Cell 38:889-900. If you notice that both strands of the DNA are shown, then you can make more sense of the boxes.
Origin of DNA replication DnaA binding site Kaguni (2006) Annu Rev Microbiol 60:351-371. Fuller et al (1984) Cell 38:889-900. R1 TTATCCACA; R2 TTATACACA; R3 TTATCCAAA; R4 TTATCCACA. Putting all the boxes together (choosing one of the two strands arbitrarily), I begin to see a pattern. Kaguni said there was also R5 (M). Where’s that?
Origin of DNA replication Fuller et al (1984) Cell 38:889-900. R1 TTATCCACA; R2 TTATACACA; R3 TTATCCAAA; R4 TTATCCACA. Enough orange boxes! Even enough paper sequences! If I’m going to make an origin-finding tool, I need to test it on a known case. Why not this case? Can I find the E. coli origin by DnaA-binding sequences?
My goal is to make a general origin-finding tool, using the E. coli origin as a test case. I therefore need to find the coordinates of the E. coli origin, so I can tell if my tool is working. Since I'm going to build the tool in BioBIKE, I need the coordinates known to BioBIKE. There's no point finding the origin in GenBank or anywhere else. PhAnToMe is where you’ll find E. coli and phage sequences.
How do I find the E. coli origin in E. coli? My general origin-finding tool will look for DnaA-binding sites. I think that will work to find the E. coli origin, but I don't know that it will work. I need the coordinates of the E. coli origin so I can test my unproven tool with a known case. So, how can I find the E. coli origin with absolute certainty? What do I have in hand to enable me to find it?
What do I have in hand to enable me to find the origin? Of course I have the sequence. That's essentially foolproof, so long as I have available the E. coli genome sequence to search through. Looking for the sequence is much more certain than looking for DnaA boxes or some region annotated as “the origin”
One strategy is to display the sequence of E. coli K12 (which is the standard laboratory strain).
Searching for some portion of the published origin sequence should get me to the right place in the genome. It doesn’t matter much which part of the origin I choose.
How could that be?!? I recheck the sequence... No problem.
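For readers working outside BioBIKE, the kind of exact search I just tried can be sketched in Python. This is only an illustration, not the Sequence Viewer's actual search: the function names and the toy genome are made up, and I'm assuming the genome is available as a plain string of A, C, G and T.

```python
def reverse_complement(seq):
    """Return the reverse complement of a DNA string (A, C, G, T only)."""
    return seq.translate(str.maketrans("ACGT", "TGCA"))[::-1]


def find_exact(genome, fragment):
    """Look for an exact copy of fragment on either strand of genome.

    Returns (1-based start on the forward strand, strand) for the first
    hit, or None if the fragment does not occur exactly.
    """
    genome = genome.upper()
    fragment = fragment.upper()
    pos = genome.find(fragment)
    if pos != -1:
        return pos + 1, "+"
    pos = genome.find(reverse_complement(fragment))
    if pos != -1:
        return pos + 1, "-"
    return None


# Toy genome (made up for illustration, not E. coli sequence)
toy_genome = "GGCATTACGATCCGTTAGC"
print(find_exact(toy_genome, "ACGATCC"))   # fragment present on the forward strand
print(find_exact(toy_genome, "GGATCGT"))   # same site, given as the other strand
print(find_exact(toy_genome, "ACGGATCC"))  # not present exactly
```

An exact search is all-or-nothing: it either finds a perfect copy or reports nothing at all.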
When some strategy fails for no apparent reason and defies your best efforts to understand why, it is generally a good idea to try something completely different, even though the different strategy may not sound any more promising. It is the worm that wiggles that gets off the hook. So I try searching the E. coli genome for the same sequence, using a high threshold (expect value of 10, which would allow even rare random matches to sneak through).
That was informative! The first match goes from the beginning to end (Q-start=1, Q-end=30) of the 30-nucleotide sequence I gave it, but the match was only 96.67%. There must be a mismatch somewhere! The other matches are very partial with poor E-values. I’ll ignore them.
Where is the mismatch? The ALIGNMENT-OF function allows me to compare the 30-nucleotide query sequence with the actual sequence from E. coli. I used the coordinates provided by SEQUENCE-SIMILAR-TO to pick out the relevant portion of the genome.
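To show the idea outside BioBIKE, here is a rough Python stand-in for that comparison. It is not ALIGNMENT-OF and not a real biological alignment algorithm: it uses the standard library's difflib, which is good enough to locate a single difference between two short, nearly identical strings. The sequences are toy examples I made up. Note that a 30-column alignment with one difference gives 29/30 = 96.67% identity, the figure reported above.

```python
from difflib import SequenceMatcher


def compare_sequences(query, target):
    """Report where query and target differ and their percent identity.

    Returns (differences, percent_identity). Each difference is a tuple
    (kind, 1-based position in query, query text, target text), where kind
    is 'delete' (extra text in query), 'insert' (extra text in target),
    or 'replace'.
    """
    matcher = SequenceMatcher(None, query, target, autojunk=False)
    differences = []
    matched = 0
    columns = 0
    for tag, q1, q2, t1, t2 in matcher.get_opcodes():
        span = max(q2 - q1, t2 - t1)  # alignment columns used by this block
        columns += span
        if tag == "equal":
            matched += span
        else:
            differences.append((tag, q1 + 1, query[q1:q2], target[t1:t2]))
    return differences, 100.0 * matched / columns


# Toy sequences (made up): the query carries one extra G
query = "ATTGACCGTAGCATCA"
genome_region = "ATTGACCTAGCATCA"
differences, identity = compare_sequences(query, genome_region)
print(differences, identity)
```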
Ah! The original article from which I got the origin sequence had an error in it, an extra G! This is not so surprising. In 1984 (the year of the article), all sequencing was done by hand with little redundancy. In any event, I think I found the origin – around coordinate 3923300
Note how I got to this region: Clearing the Search field, entering the coordinate in the Go To field, and clicking Go. Don’t be concerned about the blank lines on the top and the mayhem on the right. The E. coli genome happens to have lots of sequence features that people have annotated, and the Sequence Viewer doesn’t handle them very well.
First to confirm: Is this the right sequence? The first 30 nucleotides should match, of course (except for one). What about the rest? I’ll check the first 80... Check!
Does the region have the DnaA-binding motifs (R1 TTATCCACA; R2 TTATACACA; R3 TTATCCAAA; R4 TTATCCACA)? I could search for each individual sequence, but it’s more efficient to search for the pattern that encompasses all of them. ...Why only two? What happened to the other two? (You might want to look several slides back at the sequence.)
I can't depend on my own eyes. I need to automate the process. MATCHES-OF-PATTERN will search for the same DnaA-binding pattern but return all the results at once. There’s no preference which of the two strands a DnaA protein will bind to, so I specify BOTH-STRANDS.
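A both-strands pattern search can be approximated in Python with a regular expression. This is a sketch, not how MATCHES-OF-PATTERN is implemented. I'm assuming the pattern is TTAT[CA]CACA (the one listed in the algorithm summary), the function names are made up, and the test sequence is a toy.

```python
import re

# Lookahead so that overlapping matches would also be reported
DNAA_PATTERN = re.compile(r"(?=(TTAT[CA]CACA))")


def reverse_complement(seq):
    """Return the reverse complement of a DNA string (A, C, G, T only)."""
    return seq.translate(str.maketrans("ACGT", "TGCA"))[::-1]


def matches_both_strands(genome):
    """Return (1-based start on forward strand, strand, site) for every hit."""
    genome = genome.upper()
    n = len(genome)
    hits = []
    for m in DNAA_PATTERN.finditer(genome):
        hits.append((m.start() + 1, "+", m.group(1)))
    # Search the other strand by scanning the reverse complement,
    # then convert each hit back to forward-strand coordinates.
    for m in DNAA_PATTERN.finditer(reverse_complement(genome)):
        site = m.group(1)
        start_on_forward = n - (m.start() + len(site)) + 1
        hits.append((start_on_forward, "-", site))
    return hits


# Toy sequence (made up): one site on each strand
toy = "GG" + "TTATCCACA" + "CCC" + "TGTGTATAA" + "GG"
print(matches_both_strands(toy))
```

The hits come back strand by strand rather than in coordinate order, so they still need sorting before clusters are visible.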
Note that the results are shown formatted in a popup window for immediate gratification and also in the result pane for further use. There are a lot of sequences matching the pattern! How many? And how many would you expect by chance?
How many? That’s the easy one. I just counted the list (using * to indicate the previous result). How many expected by chance? Not much harder. You’ve done this sort of calculation many times in the past and will do so many times in the future. You should reach the conclusion that most of the matches are garbage.
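The chance calculation might be sketched like this. The assumptions are mine: the pattern is TTAT[CA]CACA (eight fixed positions plus one position that allows two nucleotides), all four nucleotides are equally likely at every position, and the E. coli K-12 genome is 4,641,652 nucleotides long (the length in BioBIKE may differ slightly).

```python
PATTERN_LENGTH = 9
# Eight positions must match one nucleotide (1/4 each);
# one position may be C or A (2/4).
P_SITE = (1 / 4) ** 8 * (2 / 4)


def expected_chance_matches(genome_length, both_strands=True):
    """Expected number of pattern matches in random sequence of this length."""
    positions = genome_length - PATTERN_LENGTH + 1
    strands = 2 if both_strands else 1
    return strands * positions * P_SITE


E_COLI_K12_LENGTH = 4_641_652  # assumed genome length in nucleotides
print(round(expected_chance_matches(E_COLI_K12_LENGTH)))
```

If the number of matches found is of the same order as this expectation, a single match says very little about where the origin is.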
If a mere match to a DnaA-binding sequence is not informative, then how can we recognize an origin? What’s distinctive about the origin is that it contains a cluster of DnaA-binding sites. Unfortunately, it is difficult to recognize clusters of sites because the sites’ coordinates are not sorted. That’s the next step. (And then to clean up the screen.)
That’s much better! With the sorted list, I can see the cluster of four DnaA-binding sites at the known origin of E. coli (at coordinate ~3923000). Maybe there are other clusters? I’m not sure I’m up to peering through the entire list. However, I can see how I’d do it, examining each line with respect to its neighbors and keeping only those sites that are close to other sites. I need to automate this process to create the tool that can scan hundreds of genomes looking for origins of replication.
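One way that neighbor-by-neighbor examination might eventually look is sketched below. This is a guess at an approach, not a finished method: the function name, the 100-nucleotide gap limit and the minimum of three sites are all arbitrary choices of mine, and the coordinates are made up.

```python
def find_clusters(coordinates, max_gap=100, min_sites=3):
    """Group site coordinates into clusters of close neighbors.

    Sites are walked in sorted order. A site joins the current cluster if
    it lies within max_gap of the previous site; otherwise a new cluster
    starts. Only clusters with at least min_sites members are kept.
    """
    clusters = []
    current = []
    for coord in sorted(coordinates):
        if current and coord - current[-1] > max_gap:
            if len(current) >= min_sites:
                clusters.append(current)
            current = []
        current.append(coord)
    if len(current) >= min_sites:
        clusters.append(current)
    return clusters


# Made-up coordinates: one tight group of four, one pair, two loners
sites = [500, 90210, 3000, 3060, 3150, 3230, 70000, 70050]
print(find_clusters(sites))
```

The two thresholds are exactly the sort of thing that would need experimenting with on known origins.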
Automation of this sort of thing will come later. Can't do everything at once. For now, I'll package the progress I've made to enable me to experiment easily. I'll take the steps I've developed and put them into a function.
My function consists of no more than what I did step by step. Now it has a name. Also, I generalized it to work with any genome, not just E. coli. Does it work?
Yes! Executing the function (now on my FUNCTION button) with E. coli as the argument gives exactly the same result as I got before. Will it work with other organisms?
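In Python terms, the packaging step might look like the sketch below. It is not the BioBIKE function itself: the name dnaa_sites_of is hypothetical, the pattern TTAT[CA]CACA is assumed, and the genome is assumed to be a plain string. Here the other strand is handled by also searching the forward strand for the reverse complement of the pattern, TGTG[GT]ATAA, so the hits come out already sorted by coordinate.

```python
import re

# Forward-strand sites match TTAT[CA]CACA; sites on the other strand
# appear on the forward strand as the reverse complement, TGTG[GT]ATAA.
_SITE = re.compile(r"(?=(?P<fwd>TTAT[CA]CACA)|(?P<rev>TGTG[GT]ATAA))")


def dnaa_sites_of(genome_sequence):
    """Return a coordinate-sorted list of (1-based start, strand) for one genome."""
    sites = []
    for m in _SITE.finditer(genome_sequence.upper()):
        strand = "+" if m.group("fwd") else "-"
        sites.append((m.start() + 1, strand))
    return sites


# Works on any sequence handed to it (toy example, made up)
toy = "GG" + "TTATCCACA" + "CCC" + "TGTGTATAA" + "GG"
print(dnaa_sites_of(toy))
```

Because the genome is an argument, the same function can be pointed at a different organism without touching its body.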
Maybe! I tried it on Yersinia pestis (causative agent of the plague) and got a very provocative result. What are the odds that five DnaA sites would come up in the first 2000 nucleotides by chance? (Do the calculation.)
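One possible way to set up that calculation is sketched here, under assumptions of my own: the pattern is TTAT[CA]CACA searched on both strands, all four nucleotides are equally likely, and chance matches are rare and independent enough to treat their count as Poisson-distributed. A different model would give a somewhat different number.

```python
from math import exp, factorial

PATTERN_LENGTH = 9
P_SITE = (1 / 4) ** 8 * (2 / 4)  # chance of a match at one position, one strand


def expected_sites(window_length, strands=2):
    """Expected number of chance matches in a window of random sequence."""
    positions = window_length - PATTERN_LENGTH + 1
    return strands * positions * P_SITE


def prob_at_least(k, mean, terms=60):
    """Poisson probability of seeing k or more events, summing the upper tail."""
    return sum(exp(-mean) * mean ** i / factorial(i) for i in range(k, k + terms))


mean = expected_sites(2000)
print(mean, prob_at_least(5, mean))
```

The tail is summed directly, rather than computed as one minus the lower terms, to avoid losing precision when the answer is tiny.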
With this function in hand, I can experiment, checking whether my method is any good. I will undoubtedly find that it could be improved in lots of ways. The ability to do quick experiments and gain rapid feedback enables my ideas to evolve.
Origin of DNA replication: Algorithm (where it stands)
* Search genome sequence for DnaA-binding sites
  - TTAT[CA]CACA
  - (not perfect; allow one mismatch?)
  - Use MATCHES-OF-PATTERN
* Sort sites by coordinate
  - Use SORT
* Look for clusters of sites
  - (How???)
(Eventually) Apply to all phage genomes
Morals of the Story
* Make problem tangible. Abstractions can give you a comforting big picture, but you won't make any progress unless you can connect the abstractions to reality.
* Test ideas by experimentation. Develop your methods using cases where the answer is already known.
* Package your insights into functions. Start with an imperfect function and let it evolve as you gain more experience.
* Test the limits of your method. Try weird cases. Figure out why the method fails (if it fails) and what would make it not work (if it works). Do lots of experiments.
* When things don't work (inevitable), cope. Try something different. Try lots of somethings different.
* When things continue not to work, talk with others. Sometimes pooled confusion can lead to light.
TATTCAAAATGAATTATATCGGTAAATATCTGCAACTTTAAACCTGAATGAGGATTTAGTATTGCTGGGCCAGCCCAAAGTTTAGAATTTTCATCAACTTTGCACAATGATGGAAAACGTGAATTCAAAAGGATTGCTATATATTATTAAGAAAACATTTGGAATTCGAGAACCGGAATATGGCATTCCGCAAATTAGAGAACGGAATAGGTATTCCTAAAAAAACACATTCTCTGCAATTTTTAAGATGAGTATTATACCTGCACTAACTTTGTGGGACGCAATATCAGAGCAACCCTATCATTTAAAACCTCAAAATACTTATCAGACTTGGGGAACATTCTGACCGTTTAGTAGAACGTTTCCGGCATATAAAATGGGGTGAAGTGGTAATGGTGAATTATCAAACAAATCATATGATCAGAATAATCGCCGTTTAAATCCATCCTTTTCAACATCGAAATTTAACAGCCCGTGAAGGAGCTAGAATCCAATCTTTTCCAGGAAGAAAGATTTGATGAAAAATTTCTTTGTCAATATAATCAAATCGGTAATGCTGTACCCCCTCTTCTCGCTAGTGCATGGATCAAATCTTGAACAAAAAGAGAATCATCGTACAAAATACAGAGATACTGAAAGCAGGACTTTCCTTAGAGAAATCAAGATGATTCAATTATTACTCAAAGAGTGGAACTTCTCACTAAATATAAAGATTTTTTAGATCAGCAGCATTATGCAGAAAAATTTGATTCAAGATCCAACCGCTCATAATCCTTACTGAGACGACGGTACTGGTTTAACCAGCCAAATGTTCTTTCTACTACCCACCGTTTGGGCAAAACCTGAAATTCTTGATTAGTACGCCGGATTACCTCAACATGAGCTTGAATCTAGGCGGCAAGTAATCTTTCTCCAGCATTTGCTTCACTTACAACCACTTTTAACAAAAGTCCCAGACTATCAACCAAAGTTTGCCGCTTTCGTCCTTTTACCTTCTTGCCACCATCAAAACCGTACACATCCCCCTTTTTTCAGTCGTTTTTACCGACTGGCTGTCTGCCGCGATCGCCGTGGGTTGAGTTGACTTCCCCATTTTTTGACGAACTTGATCGCGCAAAGTATGATTCATTTCAGTTGAACTAGGAGGAAAATCCCCTGGAAGCATATCCCACTGAATTCGAATTCGAATTCGAATTCGAATTCGACAACCTGTTTTCAGATGGTAGTAGATAGCGTTGCATACTTCTCGCATATCAGTTGTTCGGGGATGCCCACCGCATTTAGCGGGTGGAATCAAAGGAGCTAAAATTGCCCATTCTGAGTCATTAAGGTCTGTAGAATAAGACTTTCGTCTCATTGTTTCCTATGTAAATACACTCTACAAACAGTATCTTATCGCTGCCTTTTTATCTTAGCTCTCCTTTAGATTTACTTTATAAATAGCCTCTTAGAAGAATTTCTTTATTATTTATTTAAAGATTTAGTACAAGATTTCGGGCAGAACGCTCTTATTGGTAAGTCACACACGTTCAAAGATATTTTCTTCGTACCACCAAAATATTCTGAAATGCTCAAGCGACCTTATGCGCGAATTGAGAGAAAAGATCATGATTTCGTAATTGGTGCAACTGTTCAAGCATCGCTTGAAGCAGCACCTCCTCCAGAACAAAACCATGCTTGAGGGATCTTCACGCGCAGCAGAGGATTTAAAAGCGAGAAATCCTAACAGTTTATACCTTGTGGTTATGGAATGGATAAAACTGACCAATGATGTAAATTTACGAAAATATAAAGTTGATCAAATTTATGTACTACGTCAGCAAAAAAATACTGATAGAGAGTTTAGGTATGAGTCAACTTACATAAAAAATTATTCAAAATGAATTATATCGGTAAATATCTGCAACTTTAAACCTGAATGAGGATTTAGTATTGCTGGGCCAGCCCAAAGTTTAGAATTTTCATCAACTTTGCACAATGATGGAAAACGTGAATTCAAA
AGGATTGCTATATATTATTAAGAAAACATTTGGAATTCGAGAACCGGAATATGGCATTCCGCAAATTAGAGAACGGAATAGGTATTCCTAAAAAAACACATTCTCTGCAATTTTTAAGATGAGTATTATACCTGCACTAACTTTGTGGGACGCAATATCAGAGCAACCCTATCATTTAAAACCTCAAAATACTTATCAGACTTGGGGAACATTCTGACCGTTTAGTAGAACGTTTCCGGCATATAAAATGGGGTGAAGTGGTAATGGTGAATTATCAAACAAATCATATGATCAGAATAATCGCCGTTTAAATCCATCCTTTTCAACATCGAAATTTAACAGCCCGTGAAGGAGCTAGAATCCAATCTTTTCCAGGAAGAAAGATTTGATGAAAAATTTCTTTGTCAATATAATCAAATCGGTAATGCTGTACCCCCTCTTCTCGCTAGTGCATGGATCAAATCTTGAACAAAAAGAGAATCATCGTACAAAATACAGAGATACTGAAAGCAGGACTTTCCTTAGAGAAATCAAGATGATTCAATTATTACTCAAAGAGTGGAACTTCTCACTAAATATAAAGATTTTTTAGATCAGCAGCATTATGCAGAAAAATTTGATTCAAGATCCAACCGCTCATAATCCTTACTGAGACGACGGTACTGGTTTAACCAGCCAAATGTTCTTTCTACTACCCACCGTTTGGGCAAAACCTGAAATTCTTGATTAGTACGCCGGATTACCTCAACATGAGCTTGAATCTAGGCGGCAAGTAATCTTTCTCCAGCATTTGCTTCACTTACAACCACTTTTAACAAAAGTCCCAGACTATCAACCAAAGTTTGCCGCTTTCGTCCTTTTACCTTCTTGCCACCATCAAAACCGTACACATCCCCCTTTTTTCAGTCGTTTTTACCGACTGGCTGTCTGCCGCGATCGCCGTGGGTTGAGTTGACTTCCCCATTTTTTGACGAACTTGATCGCGCAAAGTATGATTCATTTCAGTTGAACTAGGAGGAAAATCCCCTGGAAGCATATCCCACTGAATTCGAATTCGAATTCGAATTCGAATTCGACAACCTGTTTTCAGATGGTAGTAGATAGCGTTGCATACTTCTCGCATATCAGTTGTTCGGGGATGCCCACCGCATTTAGCGGGTGGAATCAAAGGAGCTAAAATTGCCCATTCTGAGTCATTAAGGTCTGTAGAATAAGACTTTCGTCTCATTGTTTCCTATGTAAATACACTCTACAAACAGTATCTTATCGCTGCCTTTTTATCTTAGCTCTCCTTTAGATTTACTTTATAAATAGCCTCTTAGAAGAATTTCTTTATTATTTATTTAAAGATTTAGTACAAGATTTCGGGCAGAACGCTCTTATTGGTAAGTCACACACGTTCAAAGATATTTTCTTCGTACCACCAAAATATTCTGAAATGCTCAAGCGACCTTATGCGCGAATTGAGAGAAAAGATCATGATTTCGTAATTGGTGCAACTGTTCAAGCATCGCTTGAAGCAGCACCTCCTCCAGAACAAAACCATGCTTGAGGGATCTTCACGCGCAGCAGAGGATTTAAAAGCGAGAAATCCTAACAGTTTATACCTTGTGGTTATGGAATGGATAAAACTGACCAATGATGTAAATTTACGAAAATATAAAGTTGATCAAATTTATGTACTACGTCAGCAAAAAAATACTGATAGAGAGTTTAGGTATGAGTCAACTTACATAAAAAAT