Basic Bioinformatics

As it is applied to the Bacillus megaterium genome

What we are going to talk about • Why we are doing all this DNA sequencing • What genes look like and where they are found • How we can compare sequences between different species • How genes move between species

DNA Sequencing • Bioinformatics is based on the fact that DNA sequencing is cheap, and becoming easier and cheaper very quickly. • The Human Genome Project cost roughly $3 billion and took 13 years (1990-2003). • Sequencing James Watson’s genome in 2007 cost $2 million and took 2 months. • Today, you could get your genome sequenced for about $100,000 and it would take a month. • The Archon X prize: you win $10 million if you can sequence 100 human genomes in 10 days, at a cost of $10,000 per genome. • It is realistic to envision $100 per genome within 10 years: everyone’s genome could be sequenced if they wanted or needed it.

Why it’s useful • All of the information needed to build an organism is contained in its DNA. If we could understand it, we would know how life works. • Preventing and curing diseases like cancer (which is caused by mutations in DNA) and inherited diseases. • Curing infectious diseases (everything from AIDS and malaria to the common cold). If we understand how a microorganism works, we can figure out how to block it. • Understanding genetic and evolutionary relationships between species • Understanding genetic relationships between humans. Projects exist to understand human genetic diversity. Also, sequencing the Neanderthal genome. • Ancient DNA: currently it is thought that under ideal conditions (continuously kept frozen), there is a limit of about 1 million years for DNA survival. So, Jurassic Park will probably remain fiction.

From DNA to Gene • But: extracting that information is difficult. Converting a string of A’s, C’s, G’s, and T’s into knowledge of how the organism works is hard. • Most of the work is done on the computer, with key confirming experiments done in the “wet lab”. • The sequence below contains a gene critical for life: the gene that initiates replication of the DNA. Can you spot it? • We are now going to spend some time on what genes look like and how we can find them.

TTGGAAAACATTCATGATTTATGGGATAGAGCTTTAGATCAAATTGAAAAAAAATTAAGCAAACCTAGTTTTGAAACCTG
GCTCAAATCGACAAAAGCTCATGCTTTACAAGGAGACACGCTCATTATTACTGCACCTAATGATTTTGCACGGGACTGGT
TAGAATCTAGGTATTCTAATTTAATTGCTGAAACACTTTATGATCTTACGGGGGAAGAGTTAGATGTAAAATTTATTATT
CCTCCTAACCAGGCCGAGGAAGAATTCGATATTCAAACTCCTAAAAAGAAAGTCAATAAAGACGAAGGAGCAGAATTTCC
TCAAAGCATGCTAAATTCGAAGTATACCTTTGATACATTTGTTATCGGATCTGGAAATCGGTTTGCGCATGCAGCTTCTT
TAGCAGTAGCAGAAGCGCCGGCTAAAGCGTATAATCCGCTTTTTATTTACGGGGGAGTAGGATTAGGCAAAACACACTTA
ATGCACGCCATAGGCCACTATGTGTTAGATCATAATCCTGCCGCGAAAGTCGTGTACTTATCATCTGAAAAATTCACAAA
CGAGTTTATTAACTCTATTCGTGACAATAAAGCAGTAGAATTCCGCAACAAATACCGTAATGTAGATGTTTTACTGATTG
ATGATATTCAATTCTTAGCAGGTAAAGAGCAGACACAAGAAGAATTTTTCCATACGTTTAATACGCTTCACGAAGAAAGC
AAGCAGATTGTCATCTCAAGTGATCGACCGCCGAAAGAAATTCCTACACTTGAAGATCGACTTCGCTCTCGCTTTGAATG
GGGCCTTATTACAGACATCACACCACCAGATTTGGAAACACGAATTGCTATTTTGCGTAAAAAAGCCAAAGCGGACGGCT
TAGTTATTCCAAATGAAGTTATGCTTTATATCGCCAATCAGATTGATTCAAATATTAGAGAATTAGAAGGCGCACTTATT

DNA • DNA is just a long string of 4 letters (nucleotides, or bases): adenine, cytosine, guanine, and thymine, which we will refer to simply as A, C, G, and T. • We are skipping lots of details here. • Each DNA molecule has 2 strands, with the bases paired in the center. • A on one strand always pairs with T on the other strand, and G always pairs with C. • The strands run in opposite directions (like the two sides of a road). • Since the two DNA strands are complementary, there is no need to write down both strands: either one can be derived from the other.

Chromosomes and Genes • Each chromosome is a long piece of DNA. • The B. megaterium genome is a circle (like most bacterial genomes) of about 5 million bases. • Human chromosomes are 100-200 million bases long. We have 46 chromosomes (2 sets of 23, one set from each parent). • Genes are just regions on that DNA. It is not obvious where genes are if you look at a DNA sequence. • There is a lot of DNA that is not part of genes: in humans only 2% at most of the DNA is part of any gene. • Bacteria use more of their DNA: 80% of the B. meg chromosome is genes. • B. meg has about 1 gene per 1000 base pairs (bp) of DNA, or about 5000 genes in all. • Humans have about 25,000 genes. • We are far more complicated than bacteria: regulation of the genes is very complicated in humans. • We use the same gene in different ways in different tissues.

Genes and Proteins • Most genes code for proteins: each gene contains the information necessary to make one protein. • Proteins are the most important type of macromolecule. • Structure: collagen in skin, keratin in hair, crystallin in eye. • Enzymes: all metabolic transformations, building up, rearranging, and breaking down of organic compounds, are done by enzymes, which are proteins. • Transport: oxygen in the blood is carried by hemoglobin, everything that goes in or out of a cell (except water and a few gasses) is carried by proteins. • Also: nutrition (egg yolk), hormones, defense, movement

The Genetic Code • Proteins are long chains of amino acids. • There are 20 different amino acids coded in DNA. • There are only 4 DNA bases, so you need 3 DNA bases to code for the 20 amino acids: 2 bases would give only 4 x 4 = 16 combinations, which is not enough. • 4 x 4 x 4 = 64 possible 3 base combinations (codons). • Each codon codes for one amino acid, except for the 3 stop codons. • Most amino acids have more than one possible codon. • Genes start at a start codon and end at a stop codon. • 3 codons are stop codons: all genes end at a stop codon. • Start codons are a bit trickier, since they are used in the middle of genes as well as at the beginning. • In eukaryotes, ATG is almost always the start codon, making methionine (Met) the first amino acid in nearly all proteins (but in many proteins it is immediately removed). • In prokaryotes, ATG, GTG, or TTG can be used as a start codon. B. meg prefers ATG, but about 30% of the genes start with GTG or TTG. • In bioinformatics, we generally ignore the fact that RNA uses the base uracil (U) in place of T.
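The counting on this slide can be checked with a short Python sketch. It builds the standard genetic code table from a compact 64-letter string (with * standing for a stop codon) and counts how many codons each amino acid has. The names BASES, AMINO_ACIDS, and CODON_TABLE are chosen here for illustration only.

```python
from collections import Counter
from itertools import product

# Standard genetic code, written in DNA letters, codons ordered TTT, TTC, TTA, TTG, TCT, ...
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)}

# How many codons does each amino acid (or stop signal) have?
codons_per_amino_acid = Counter(CODON_TABLE.values())
stop_codons = sorted(c for c, aa in CODON_TABLE.items() if aa == "*")

print(len(CODON_TABLE), "codons in total")
print("stop codons:", stop_codons)
print("codons for Leu (L):", codons_per_amino_acid["L"])
print("codons for Met (M):", codons_per_amino_acid["M"])
```

Only Met and Trp have a single codon; every other amino acid has between 2 and 6.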

Gene Expression • How do you get a protein from a gene? • A two-step process (called the Central Dogma of Molecular Biology). • First, the gene has to be copied (transcribed) into an RNA form. • The RNA copy (messenger RNA) is exactly like the gene itself, except RNA replaces T with U. • Most gene regulation (whether the gene is “on” or “off”) happens here. • Second, the RNA is translated into protein by ribosomes, which are complex RNA/protein hybrid machines. • This is done with the help of transfer RNA molecules, which have one end that matches the 3 base codon and the other end that is attached to the proper amino acid. • The ribosome starts at the start codon and moves down the messenger RNA, adding one amino acid at a time to the growing chain. When the ribosome reaches a stop codon, it falls off, releasing the new protein.
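The two steps can be imitated in a few lines of Python. This is only a sketch: it assumes the sequence handed to translate already begins at the start codon, and the function names and the short example gene are made up for illustration.

```python
from itertools import product

# Standard genetic code in DNA letters; "*" marks a stop codon.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)}


def transcribe(gene):
    """Step 1: copy the gene into messenger RNA (T becomes U)."""
    return gene.upper().replace("T", "U")


def translate(mrna):
    """Step 2: read codons one at a time until a stop codon is reached."""
    protein = []
    for i in range(0, len(mrna) - 2, 3):
        codon = mrna[i:i + 3].replace("U", "T")  # look up using DNA letters
        amino_acid = CODON_TABLE[codon]
        if amino_acid == "*":  # stop codon: the ribosome falls off
            break
        protein.append(amino_acid)
    return "".join(protein)


example_gene = "ATGCCATCTTAA"  # hypothetical 4-codon gene: Met, Pro, Ser, stop
mrna = transcribe(example_gene)
print(mrna)             # AUGCCAUCUUAA
print(translate(mrna))  # MPS
```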

Reading Frames • Here we get a bit subtle. • Since codons consist of 3 bases, there are 3 “reading frames” possible on an RNA (or DNA), depending on whether you start reading from the first base, the second base, or the third base. • The different reading frames give entirely different proteins. • Consider ATGCCATC, and refer to the genetic code. (X marks leftover bases that do not make up a complete codon.) • Reading frame 1 divides this into ATG-CCA-TC, which translates to Met-Pro-X. • Reading frame 2 divides this into A-TGC-CAT-C, which translates to X-Cys-His-X. • Reading frame 3 divides this into AT-GCC-ATC, which translates to X-Ala-Ile. • Each gene uses a single reading frame, so once the ribosome gets started, it just has to count off groups of 3 bases to produce the proper protein.
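The ATGCCATC example can be reproduced with a small Python sketch. It translates only the complete codons in each frame and drops the leftover bases (the X positions above). The function name translate_frame is an illustrative choice.

```python
from itertools import product

# Standard genetic code in DNA letters; "*" marks a stop codon.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)}


def translate_frame(dna, frame):
    """Translate the complete codons of reading frame 1, 2, or 3 (one-letter codes)."""
    start = frame - 1
    codons = [dna[i:i + 3] for i in range(start, len(dna) - 2, 3)]
    return "".join(CODON_TABLE[codon] for codon in codons)


dna = "ATGCCATC"
for frame in (1, 2, 3):
    print("frame", frame, "->", translate_frame(dna, frame))
# frame 1 -> MP   (Met-Pro)
# frame 2 -> CH   (Cys-His)
# frame 3 -> AI   (Ala-Ile)
```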

Open Reading Frames • Ribosomes are very obedient to stop codons: when a stop codon is reached, the protein is finished. Thus, all genes end at the first stop codon in their reading frame. • Since 3 out of the 64 codons are stop codons, random DNA has stop codons very frequently (about once every 21 codons on average). • However, genes do something necessary for survival, so natural selection keeps stop codons out of the middle of genes. • That is, if a mutation arises that creates a stop codon in the middle of an essential gene, the organism dies and leaves no descendants. • Open reading frames (ORFs) are regions with no stop codons. All genes reside in long open reading frames. • Note that stop codons in other reading frames have no effect on the gene. • The start codon must occur “upstream” in the same reading frame as the stop codon. It is usually near the beginning of the ORF, but it is not necessarily the first possible start codon. • Determining the exact start codon is not easy or obvious. • But the first start codon in an open reading frame is always a reasonable guess.

This is a map of the stop codons in all 3 reading frames in a stretch of DNA. The long ORF in reading frame 1 is highlighted in black.
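A map like this can be computed by scanning each reading frame for stop codons and recording the stop-free stretches between them. The Python sketch below does this for the forward strand only. The function name, the (frame, start, end) output format, and the short test sequence are illustrative assumptions, and the average spacing of 64/3 codons assumes random DNA with all four bases equally common.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}

# In random DNA with equal base frequencies, a stop appears about every 64/3 codons.
MEAN_CODONS_PER_STOP = 64 / 3


def open_reading_frames(dna, min_codons=1):
    """Return (frame, start, end) for each stop-free stretch on the forward strand.

    Positions are 0-based, end is exclusive, and the stop codon is not included.
    """
    dna = dna.upper()
    orfs = []
    for frame in range(3):
        start = frame
        for i in range(frame, len(dna) - 2, 3):
            if dna[i:i + 3] in STOP_CODONS:
                if (i - start) // 3 >= min_codons:
                    orfs.append((frame + 1, start, i))
                start = i + 3  # the next ORF begins after the stop codon
        # Stretch at the end of the sequence that is not closed by a stop codon.
        end = start + ((len(dna) - start) // 3) * 3
        if (end - start) // 3 >= min_codons:
            orfs.append((frame + 1, start, end))
    return orfs


print(round(MEAN_CODONS_PER_STOP, 1))
print(open_reading_frames("ATGAAATAGCCCTGA", min_codons=3))
```

On a real genome one would use a much larger min_codons and also scan the other strand.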

Gene Placement • Genes can occur on either DNA strand. • If they are on the reverse strand, the DNA sequence needs to be reversed and complemented before it is read. • In bacteria, most of the DNA is part of a gene. Most long open reading frames (say 100 bp or longer) that don’t overlap other long ORFs contain genes. • Most genes do not overlap each other. • Sometimes there are very short overlaps (50 bp or less), especially if the two genes are functionally related. • In bacteria, genes that affect the same biochemical pathway or function are sometimes adjacent to each other on the same DNA strand (not necessarily the same reading frame), allowing them to be co-regulated. • This group of genes is called an “operon”. • Operons are characteristic of bacteria; they are rare in eukaryotes.
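Reversing and complementing a sequence is a one-line operation in Python. The sketch below assumes the sequence contains only the uppercase letters A, C, G, and T; the function name is an illustrative choice.

```python
# A pairs with T and G pairs with C, so swap each base for its partner...
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(dna):
    """Return the sequence of the opposite strand, read in its own direction."""
    # ...and then reverse, because the two strands run in opposite directions.
    return dna.translate(COMPLEMENT)[::-1]


print(reverse_complement("ATGCCATC"))  # GATGGCAT
```

Applying the function twice gives back the original sequence.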

Finding Genes • The first job is to find long ORFs, examining the longest ORFs first and putting together a set with minimal overlaps. • It is also necessary to identify potential start codons, with the furthest upstream start codon as the easiest choice. • Then, how do we know that the ORF contains a real gene? The most definitive way is to match it with a gene known from other species. • Conservation of a sequence between species strongly suggests that the sequence has a function that is being conserved by natural selection. • We compare protein sequences, not DNA, because protein is more conserved in evolution than DNA. • The organism’s survival depends on the protein being functional, which means having the proper amino acid sequence. • Since the genetic code is degenerate, many different DNA sequences will give identical proteins. • The protein’s 3-dimensional structure is even more conserved, because it is more closely related to enzyme activity than the amino acid sequence is. • However, we don’t have good ways of determining 3-D structure from a DNA sequence.

Sequence Comparison • So, we compare our ORF sequence to a database of known protein sequences from many species. • BLAST is the standard sequence alignment tool (BLAST = Basic Local Alignment Search Tool). • BLAST is based on the concept that if you compare the same (that is, homologous) protein from many different species, you can see that some amino acids readily substitute for each other and others almost never do. • These observations are summarized in a substitution matrix, which gives a score for each pair of amino acids that can face each other in the proteins being compared.
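To make the idea of a substitution matrix concrete, here is a minimal Python sketch that scores two already-aligned, gap-free sequences. The handful of scores in TOY_SCORES are chosen for illustration in the style of a matrix such as BLOSUM62 and should be treated as assumptions; a real search uses a complete 20 x 20 matrix and also handles gaps.

```python
# Illustrative scores only: identical or chemically similar pairs score high,
# dissimilar pairs score low. These are assumptions, not a complete real matrix.
TOY_SCORES = {
    ("L", "L"): 4, ("I", "I"): 4, ("D", "D"): 6, ("E", "E"): 5,
    ("L", "I"): 2, ("D", "E"): 2,
    ("L", "D"): -4, ("I", "D"): -3, ("L", "E"): -3, ("I", "E"): -3,
}


def pair_score(a, b):
    """Look up the score for one pair of amino acids, in either order."""
    if (a, b) in TOY_SCORES:
        return TOY_SCORES[(a, b)]
    return TOY_SCORES[(b, a)]


def alignment_score(query, subject):
    """Add up the pair scores over two aligned sequences of equal length."""
    if len(query) != len(subject):
        raise ValueError("aligned sequences must have the same length")
    return sum(pair_score(a, b) for a, b in zip(query, subject))


print(alignment_score("LID", "IIE"))  # similar amino acids: 2 + 4 + 2 = 8
print(alignment_score("LID", "DDL"))  # dissimilar ones: -4 - 3 - 4 = -11
```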

Practical BLAST • BLAST itself is a bit of software that can be run on almost any computer, but the database needed for a good cross-species comparison is quite large. • The database is called “nr” for “non-redundant”, and it contains at least 20 Gb of sequence data. • We are going to use the BLAST service at UniProt, an international consortium that maintains a comprehensive collection of protein sequences. • http://www.uniprot.org/ • Nearly all of these protein sequences are derived from DNA sequences: direct sequencing of proteins is difficult. • Terminology: your sequence, which you paste into the box on the web site, is the query sequence. Sequences in the database that match yours are called subject sequences.

A Sequence to BLAST • This is a more-or-less randomly chosen gene from B. meg. • It is 174 amino acids long. • It is written in “fasta” format: the first line starts with > and is immediately followed by an identifier (ORF00135), and then some miscellaneous comments. • After that the sequence is written without spaces or other marks.

>ORF00135 |chromosome 538197-538721 revcomp
MKAKLIQYVYDAECRLFKSVNQHFDRKHLNRFLRLLTHAGGATFTIVIACLLLFLYPSSVAYACAFSLAVSHIPVAIAKKLYPRKRPYIQLKHTKVLENPLKDHSFPSGHTTAIFSLVTPLMIVYPAFAAVLLPLAVMVGISRIYLGLHYPTDVMVGLILGIFSGAVALNIFLT
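Reading a fasta record takes only a few lines of Python. The sketch below splits the record above into identifier, comment, and sequence, and checks the stated length against the chromosome coordinates. It assumes that the coordinate range includes the stop codon; that is an interpretation of the header, not something the header says. The function name is illustrative.

```python
RECORD = """>ORF00135 |chromosome 538197-538721 revcomp
MKAKLIQYVYDAECRLFKSVNQHFDRKHLNRFLRLLTHAGGATFTIVIACLLLFLYPSSV
AYACAFSLAVSHIPVAIAKKLYPRKRPYIQLKHTKVLENPLKDHSFPSGHTTAIFSLVTP
LMIVYPAFAAVLLPLAVMVGISRIYLGLHYPTDVMVGLILGIFSGAVALNIFLT
"""


def parse_fasta_record(text):
    """Split one fasta record into (identifier, comment, sequence)."""
    lines = [line.strip() for line in text.strip().splitlines() if line.strip()]
    if not lines or not lines[0].startswith(">"):
        raise ValueError("a fasta record must start with '>'")
    identifier, _, comment = lines[0][1:].partition(" ")
    # The sequence may be wrapped over several lines; join them back together.
    return identifier, comment.strip(), "".join(lines[1:])


identifier, comment, protein = parse_fasta_record(RECORD)

# 525 bases = 175 codons; minus the stop codon (assumed included) = 174 amino acids.
start, end = 538197, 538721
expected_length = (end - start + 1) // 3 - 1

print(identifier, len(protein), expected_length)
```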

Results

BLAST Scores • Results are arranged with the best ones on top. • The most important score is the Expect value, or E-value, which can be defined as the number of hits that a random sequence (with the same length as yours) would be expected to have in the database by chance. • E-values for good hits are usually written something like 3e-42, which is the same as 3 x 10^-42, a very small number. • Bad hits are very common, and they have E-values in a more familiar form: for example, 0.004 or 1.2. • A really good E-value is less than 1e-180, which is too small for the program to represent, so it is written as 0.0. • E-values are affected by the length of the query sequence as well as the size of the database, so even perfect matches with short sequences give poor E-values. • In this case we see many hits with good E-values, and the top E-values are all quite similar. • Before we can conclude that our protein is a homologue of the proteins BLAST matches it with, we would like them to have roughly the same length and a high percentage of identical amino acids. • The lengths of the query and subject sequences should be within 20% of each other. • There should be at least 30% identical amino acids. • In this case we can be quite sure we have a good match. • BLAST also returns a fourth value, the bit score, which we are going to ignore.
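The two rules of thumb above (lengths within 20%, at least 30% identity) are easy to write down as a check. This Python sketch is one possible reading: it measures the length difference relative to the longer sequence, which is an assumption, and the example hit values are made up for illustration.

```python
def passes_homology_checks(query_length, subject_length, percent_identity):
    """Apply the length (within 20%) and identity (at least 30%) rules of thumb."""
    longer = max(query_length, subject_length)
    length_difference = abs(query_length - subject_length) / longer
    return length_difference <= 0.20 and percent_identity >= 30.0


# E-values in "e" notation are ordinary floating point numbers.
good_e_value = float("3e-42")
bad_e_value = 0.004
print(good_e_value < bad_e_value)  # True: smaller is better

# Hypothetical hits against our 174 amino acid query.
print(passes_homology_checks(174, 180, 55.0))  # True
print(passes_homology_checks(174, 400, 55.0))  # False: lengths too different
print(passes_homology_checks(174, 170, 25.0))  # False: too few identical amino acids
```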

Gene Names • Genes are mostly named for the function of their protein. • At some point, some related genes had their function determined through lab work: by examining the effects of mutations in the gene, by isolating and studying the protein produced by the gene, etc. • Typical functions: enzymes (names end in –ase), transport across the cell membrane, genetic information processing (DNA->RNA->protein), structural proteins, sporulation and germination, and more! • Many genes (maybe 1/4 of them in a typical genome) have no known function, although they are found in several different species: these are conserved hypothetical genes. • Every new genome has some genes that are unique: no matching BLAST hits in the database. • Are they real genes? Sometimes there is evidence in the form of messenger RNA, but usually we don’t know. • We call them hypothetical genes. • “Putative” means that we think we know the gene’s function but we aren’t sure. Putative should be followed by the function name.

More Gene Names • One question of interest: do the names of the top BLAST hits agree with each other? They should, allowing for some sloppiness due to the different naming conventions practiced by different scientists, but there are always annotation errors, and our knowledge of gene function increases over time. • Here we have a classic case of mis-naming. Why is the top hit ribosomal protein S2, with no other hit having this name? • Ribosomal proteins are highly conserved in evolution. • Some checking on my part showed that no homology exists between this gene and the ribosomal protein S2 found in any other Bacillus species. • The other names are similar, although not identical. • What is “PAP2”? A quick Google search shows that it stands for “phosphatidic acid phosphatase”, which fits the other names well. • There is probably some uncertainty about its exact function, given the variety of names and the “family protein” designation in several of them.

Horizontal and Vertical Gene Transfer • We are accustomed to thinking of genes being passed from parent to offspring, always staying within the species, with very occasional splitting of one species into two. • This is called vertical gene transfer. • But we know that some genes are transferred across species lines, not by the standard genetic mechanisms. • This is called horizontal gene transfer. • It is rare in humans and other higher organisms. • In bacteria 10% or more of genes have been transferred in horizontally. • B. meg genes that come from vertical descent have other Bacillus species (or another closely related species) as the closest BLAST hit. • Horizontally transferred genes can come from almost anywhere: other bacteria, Archaea, or eukaryotes (plants, animals, fungi). • The general mechanisms are well known, including conjugation (direct transfer of DNA between two bacteria), transduction (transfer of DNA using a virus as a carrier), and transformation (the bacteria pick up DNA molecules from their environment).

Bacillus Phylogeny • “Kings Play Chess On Fine Ground Sand” • Bacteria is the domain • Firmicutes is the phylum • Bacilli is the class • Bacillales is the order • Bacillaceae is the family • Bacillus is the genus.

Our Example • Most of the top hits are from various Bacillus species: there is little doubt that this gene is the result of normal, vertical gene flow. • What about “Anoxybacillus flavithermus”? • Click on the accession number to get more information, including its phylogeny. • Taxonomic lineage = Bacteria > Firmicutes > Bacillales > Bacillaceae > Anoxybacillus. • Same family as B. meg.
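The comparison of lineages can be done mechanically. This Python sketch takes the B. meg lineage from the phylogeny slide and the Anoxybacillus lineage as displayed above, and reports the deepest group they share. The function name is illustrative, and matching by name (rather than by position) is a simplification that tolerates a rank missing from one of the lists.

```python
B_MEG_LINEAGE = ["Bacteria", "Firmicutes", "Bacilli", "Bacillales", "Bacillaceae", "Bacillus"]
ANOXYBACILLUS_LINEAGE = ["Bacteria", "Firmicutes", "Bacillales", "Bacillaceae", "Anoxybacillus"]


def deepest_shared_taxon(lineage_a, lineage_b):
    """Return the most specific group that appears in both lineages, or None."""
    names_in_b = set(lineage_b)
    shared = [taxon for taxon in lineage_a if taxon in names_in_b]
    return shared[-1] if shared else None


print(deepest_shared_taxon(B_MEG_LINEAGE, ANOXYBACILLUS_LINEAGE))  # Bacillaceae
```

A hit whose deepest shared group is only the domain or phylum would be a candidate for horizontal transfer and worth a closer look.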

Aligned Sequences • You can see the aligned sequences by clicking on the “Local alignment” diagrams. • Query sequence on top, subject below. • Identical amino acids are shown in the middle line of the alignment, and similar ones have a + sign. • Gaps (regions where one sequence has amino acids not found in the other sequence) are indicated with ---. • This protein is very typical in that the best matches are in the middle of the protein, with fewer identical amino acids near the ends. • Also, the match doesn’t quite make it to the very beginning of the proteins, although they are almost identical in length. • The active site of most enzymes is in the middle. • The ends of proteins are often not well conserved.
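The percent identity of an alignment can be counted directly from the two aligned lines. The Python sketch below counts identical positions over the whole alignment length, with gap positions counting as mismatches, which is one common convention and is an assumption here. The two short aligned sequences are made up for illustration.

```python
def percent_identity(aligned_query, aligned_subject):
    """Percent of alignment columns where both sequences have the same amino acid."""
    if len(aligned_query) != len(aligned_subject):
        raise ValueError("aligned sequences must have the same length")
    identical = sum(
        1 for q, s in zip(aligned_query, aligned_subject) if q == s and q != "-"
    )
    return 100.0 * identical / len(aligned_query)


def count_gap_columns(aligned_query, aligned_subject):
    """Number of columns where either sequence has a gap character."""
    return sum(1 for q, s in zip(aligned_query, aligned_subject) if "-" in (q, s))


# Hypothetical alignment: 5 identical columns out of 8, with 1 gap in the query.
query = "MKAK-LIQ"
subject = "MRAKGLVQ"
print(percent_identity(query, subject))   # 62.5
print(count_gap_columns(query, subject))  # 1
```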

Local Alignment Result

Graphical Overview • Click on Graphical Overview (just under the BLAST box on the left) to get an overview of all the aligned sequences • The extent of the matching region is shown with the colored boxes, with non-matching regions drawn as a line. • Color indicates percent of identical amino acids • You can see that mostly our query and the various subjects (matches) line up along almost all of their lengths. • This is a good way to check whether our start site is reasonable. • A few odd ones lower down. • Genes, and pieces of genes, can move to new locations in the genome, fuse with other genes, break apart, etc. Always subject to natural selection: if the altered gene doesn’t work, the organism will die and we won’t see it. • And of course, sequencing and annotation errors occur.

The Basic Points • DNA can be read in 3 different reading frames, a consequence of the genetic code (3 bases = 1 amino acid). • Genes are found in long open reading frames, areas where there are no stop codons. • BLAST is the tool we use to compare sequences between species. • BLAST scores (E-values) estimate how many equally good matches a random sequence would be expected to have in the database; the smaller the E-value, the better the hit. • Gene sequences are conserved between species by natural selection. • DNA sequences outside of genes are much less conserved. • Most genes are transferred vertically, from parent to offspring, but a significant number are transferred horizontally, from unrelated species.

End

Other Stuff • Within-species BLAST--are there duplicate genes? Do their names match? What is most closely related species? Present in both strains? • Are nearby genes related by subsystem?
