Computer Supermodels: Understanding Model Systems With Bioinformatics

Computer Supermodels: Understanding Model Systems With Bioinformatics George Bell & Robert Latek Bioinformatics and Research Computing Whitehead Institute for Biomedical Research http://web.wi.mit.edu/bio/

Aims • Bioinformatics To Study Model Systems • Phenotypic Evolutionary Trees • Comparative Genomics • Homologene • Sequence Phylogenetic Trees • Future Directions WIBR Bioinformatics, © Whitehead Institute 2004

Bioinformatics ? • Definition • Integration of computational and biological methods to promote biological discovery • Combination of Biology, Statistics, CS, Clinical Research • Purpose • Decipher Similarities, Predict Functions, Visualize Results • Methodology • Data Mining and Comparisons [Figure: example protein sequence MSRKGPRAEVCADCSAPDPGWASISRGVLVCDECCSVHRLGRHISIVKHLRHSAWPPTLLQMVHTLASNGANSIWEHSLLDPAQVQSGRRKAN, labeled "Data Visualization"; credit G. Bell]

Bioinformatics :-) • Biological Comparisons (Evolutionary Analysis) • How closely related are two species/populations/sequences? • Gene Function Prediction • How and why does Gene X function in human compared to Gene X in worm? • Pharmaceutical Design & In Silico Testing

Traditional Evolutionary Trees • Rooted and un-rooted trees • Based on PHYSICAL (phenotypic) characteristics [Figure: a rooted tree and an un-rooted tree relating Human, Mouse, Fly, and Worm; labels include "Ancestor"]

Species Trees • Tree Of Life • http://tolweb.org/tree/phylogeny.html • Theory Of Human Evolution At The SI • http://www.mnh.si.edu/anthro/humanorigins/ha/a_tree.html

Sequences As Modules • Proteins are derived from a limited number of basic building blocks (Modules) • Evolution has shuffled these modules, giving rise to a diverse repertoire of protein sequences • As a result, proteins can share a global relationship or a local relationship specific to a particular DOMAIN [Figure: global vs. local relationships]

Comparative Genomics
For many of us, Homo sapiens is our favorite species. We're very interested in studying human biology, ideally to make us healthier and perhaps even to improve our quality of life. We know the human genome has been sequenced. So why study any other organisms? One reason could be that we happen to like mice, rats, dogs, cats, and other animals. Many biologists, however, study these animals to try to learn more about human biology.
As researchers do large-scale DNA sequencing of an organism, they can begin to study its genomics. Genomics is the study of an entire genome, which usually means either all of its DNA or all of its genes. Comparative genomics is the analysis and comparison of genomes from different species. Biologists have been comparing different organisms for a very long time, and comparative genomics simply extends the idea that we can learn more from comparison than from studying one organism (or car or work of literature) by itself.
Since comparative genomics involves studying a lot of data, it almost always requires computers. Many people who study comparative genomics need to know a lot about bioinformatics and computational biology (two terms that refer to the use of computers to study biology). This exercise will introduce you to some popular bioinformatics tools and, through them, to comparative genomics.
To do this comparative genomics exercise, we'll be looking at MSH2, one member of a family of mismatch repair genes. These so-called "spellchecker" genes help to preserve the integrity of the genetic code during DNA replication. When DNA is copied (replicated) imperfectly, these mismatch repair genes can recognize the mistake so it can be corrected. When MSH2 (one of the mismatch repair genes in humans) contains mutations, it can no longer act as a spellchecker for other genes. The resulting mutation most often seen in the genome is instability of regions containing short (di- or tri-nucleotide) repeats. Thus a mutation in the MSH2 gene causes errors to accumulate in other genes and can lead to one form of colon cancer.

Finding Homologous Genes
An important part of comparative genomics is finding the "same gene" in different organisms. Instead of talking about genes being "the same", biologists prefer to talk about homologous genes. Homologous genes are similar genes in one or more species that have a very specific property: they came from the same ancestor gene. Human and mouse homologs, for example, would have come from the same gene about 75 million years ago in an organism that eventually evolved in two directions, into both humans and mice.
To find homologs of human MSH2, we could search a database of DNA sequences, like GenBank's nr (virtually all nucleotide sequences) database. Alternatively, we could go to NCBI's LocusLink page of all known genes in some well-studied species and search for MSH2. But, in fact, biologists have already compiled many sets of homologous genes into the NCBI HomoloGene database.
Go to HomoloGene, enter MSH2 as the search term, and hit Go. The top hit should be "HomoloGene:210. Gene conserved in Eukaryota". Click on the link next to the "1" to go to the page (#210) for the MSH2 homologs. On the left, the page shows MSH2 genes in 11 different species. On the right are the 11 proteins encoded by these genes. MSH2 is somewhat unusual in being conserved from human to yeast; many genes, for example, are found only in more closely related species. Interestingly, MSH2 was first identified in yeast and later identified in the human as a potential cause of some cases of colon cancer, as shown in this summary, which comes from one of the articles in the bottom half of the page under "Recommended Reading."
Next to the protein IDs are figures showing the domain architecture of each protein. Clicking on a domain takes you to a description of this functional element found in several proteins. Note that many of the MSH2 proteins contain the same domains.
To get the actual protein sequences, go to the top of the page where "Display HomoloGene" is shown and change HomoloGene to "FASTA" (a simple format for DNA and protein sequences). Those same sequences are here (with the description of each sequence simplified to the species of origin). In either case, the letters represent amino acids, shown in their single-letter code. The origin of this code shows the reasoning behind the choice of letters.
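To make the FASTA format concrete, here is a minimal Python sketch of how such a file can be read. It assumes the common layout in which each record starts with a ">" description line followed by one or more lines of sequence. The function name parse_fasta and the two toy records are made up for illustration; they are not real MSH2 sequences.

```python
def parse_fasta(text):
    """Parse FASTA-formatted text into a dict mapping description -> sequence."""
    records = {}
    header = None
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue  # skip blank lines
        if line.startswith(">"):
            # A ">" line starts a new record; the rest of the line is its description.
            header = line[1:].strip()
            records[header] = []
        elif header is not None:
            # Sequence lines are collected and joined later.
            records[header].append(line)
    return {name: "".join(parts) for name, parts in records.items()}


# Toy, made-up records (hypothetical, NOT real MSH2 protein sequences).
example = """>species_A
MSTLKQ
AVRE
>species_B
MSTLRQ
"""

sequences = parse_fasta(example)
for name, seq in sequences.items():
    print(name, len(seq), seq)
```

Note that in this sketch two records with the same description line would overwrite each other, so simplified descriptions (such as the species of origin) should be unique.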

Multiple Sequence Alignments
To compare these MSH2 homologs, we'll do a multiple sequence alignment of the protein sequences. The alignment will help show exactly how well the protein is conserved between species. The alignment will also be used to create a phylogenetic tree (described in more detail below) that will show the relative similarity of each MSH2 protein to those in the other species.
To start, copy all of the MSH2 proteins and go to ClustalW, a popular tool for multiple sequence alignment. ClustalW, by the way, can be used to align a set of either protein or DNA sequences, but we'll only be using it today to align protein sequences. In the big box under "Enter or Paste a set of Sequences in any supported format," paste all of the MSH2 sequences. You can leave all of the options at their default settings, and the email can be left blank. Then click on "Run" at the bottom. You'll need to wait about a minute for the alignment to load automatically on the next page.
Look at the "ClustalW Results" page. Scroll down to the "Scores Table." Note that there's a score for each pair of sequences. ClustalW starts by comparing each sequence to every other sequence in a pair-wise manner to get an alignment score for that simple alignment, and those are the scores that appear in this table. Either look through the list or try "Sort by" "Alignment Score." Between which two species is the MSH2 protein most similar?
To see the actual alignment in color, you have two choices: • Alignment view: Scroll further down to the "Alignment" section and click on Show Colors. • Jalview: Scroll to the top of the page and click on the "JalView" box.
The colors are set to show the chemical properties of the amino acids, and Jalview only colors those amino acids which are conserved across the set of sequences. The Alignment view shows some symbols like ".:: *" underneath each part of the alignment. The asterisk shows an amino acid that's conserved across all sequences; two dots show high conservation; and one dot shows lower conservation.
If using Jalview, you can even edit the alignment by selecting an amino acid and sliding it to the left or right. This can be helpful, since automatic alignments are generally not perfect, and sometimes you can see an obvious way to improve an alignment by hand. (But be careful if doing this with your own sequences; it's very easy to completely mess it up.)
If you have time at the end of this session, you can experiment with the following on the Jalview pull-down menus: • Color: Change the color scheme to show different information. • Calculate > Pairwise alignments: See an alignment between any two sequences.
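The asterisk row and the idea of a pairwise score can be approximated with a short Python sketch. This is a simplification under stated assumptions, not ClustalW's actual scoring: it marks only fully conserved columns with "*" (the two-dot and one-dot symbols depend on amino acid similarity groups and are left out), and it treats a pairwise score as percent identity over columns where neither sequence has a gap. The function names and the three toy aligned sequences are made up for illustration.

```python
def conserved_columns(aligned):
    """Return a marker string with '*' under every fully conserved column."""
    length = len(aligned[0])
    if any(len(seq) != length for seq in aligned):
        raise ValueError("aligned sequences must all have the same length")
    marks = []
    for column in zip(*aligned):
        residues = set(column)
        # Conserved means one shared residue and no gap character in the column.
        fully_conserved = len(residues) == 1 and "-" not in residues
        marks.append("*" if fully_conserved else " ")
    return "".join(marks)


def percent_identity(a, b):
    """Percent of identical residues over columns where neither sequence has a gap."""
    pairs = [(x, y) for x, y in zip(a, b) if x != "-" and y != "-"]
    if not pairs:
        return 0.0
    matches = sum(1 for x, y in pairs if x == y)
    return 100.0 * matches / len(pairs)


# Toy, made-up alignment ("-" is a gap); NOT real MSH2 data.
alignment = [
    "MSTLKQAV-E",
    "MSTLRQAVRE",
    "MATLKQ-VRE",
]

marks = conserved_columns(alignment)
for seq in alignment:
    print(seq)
print(marks)
print(percent_identity(alignment[0], alignment[2]))
```

Sorting all pairs by a score like this is one way to answer which two sequences in a set are most similar.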

Phylogenetic Sequence Trees
Scroll down to the bottom of the ClustalW results page to the "Phylogram" section. This figure shows the distance (based on the alignment we performed) between each protein and the others. Is the shape of the tree what you'd expect? To answer this question, look at human, mouse, and rat. Does the relationship between the MSH2 proteins in these species agree with what you know about these organisms? What MSH2 protein is closest to the mosquito protein? Note that the phylogram shows the relationship between the species based on only one protein. How could you figure out the relationship between these organisms in more general terms?
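To see how a tree can follow from pairwise distances, here is a minimal Python sketch of UPGMA, a simple clustering method that repeatedly joins the two closest groups. This is a swapped-in technique chosen for brevity and is not presented as the method ClustalW uses for its phylogram. The function name upgma and the distance values are made up for illustration; they are not measured MSH2 distances.

```python
from itertools import combinations


def upgma(names, dist):
    """Join the closest clusters repeatedly; return a nested-parentheses tree.

    names: list of leaf labels.
    dist: dict mapping frozenset({a, b}) -> distance between labels a and b.
    """
    sizes = {name: 1 for name in names}  # cluster label -> number of leaves
    d = dict(dist)
    while len(sizes) > 1:
        # Find the pair of current clusters with the smallest distance.
        a, b = min(combinations(sorted(sizes), 2), key=lambda p: d[frozenset(p)])
        merged = f"({a},{b})"
        new_size = sizes[a] + sizes[b]
        for other in sizes:
            if other in (a, b):
                continue
            # Distance to the merged cluster is the size-weighted average.
            d[frozenset((merged, other))] = (
                sizes[a] * d[frozenset((a, other))]
                + sizes[b] * d[frozenset((b, other))]
            ) / new_size
        del sizes[a], sizes[b]
        sizes[merged] = new_size
    return next(iter(sizes)) + ";"


# Hypothetical distances (e.g. 1 minus fractional identity); made-up values.
toy = {
    ("human", "mouse"): 0.2,
    ("human", "rat"): 0.2,
    ("mouse", "rat"): 0.1,
    ("human", "fly"): 0.6,
    ("mouse", "fly"): 0.6,
    ("rat", "fly"): 0.6,
}
dist = {frozenset(pair): value for pair, value in toy.items()}

tree = upgma(["human", "mouse", "rat", "fly"], dist)
print(tree)
```

With these toy numbers, mouse and rat are joined first, then human, then fly. The sketch returns only the branching order, not branch lengths.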

Phylogenetic (Sequence) Trees • A Graph Representing The Evolutionary History Of Sequences • Relationship of sequences to one another (How everything is connected) • Dissect the order of appearance of insertions, deletions, and mutations • Identify Related Sequences, Predict Function, Observe Epidemiology (Analyze changes in viral strains) [Figure: sequence tree with leaves A, B, C, D]

MSH2 Sequence Trees • Can you put these organisms in order according to their MSH2 similarity? • Human • Malaria parasite • Mosquito • Mouse • Yeast (S. cerevisiae) • Rat • Rice blast fungus • Yeast (S. pombe) • Fruit fly • Red bread mold • Thale cress plant

Future Directions
Comparative genomics involves the study of both proteins and genes. You can perform the same type of multiple sequence alignment using genes instead of proteins. Go back to the HomoloGene page for MSH2 and select "Display" "Nucleotide links" (and click on Display). Once on the new Entrez Nucleotide page, select Display FASTA and click on "Send to" Text. This file of multiple MSH2 genes can be pasted into the ClustalW page as you did earlier.
Comparative genomics will only become more important as more and more species are sequenced. Supermodels have been or will soon be completely sequenced. With that information, supermodels will continue to be powerful for both laboratory research and computational biology. As a result, there will be a continued need for biologists who can investigate gene function using both the laboratory and the computer.
