Taking the Bite (Byte?) Out of Phylogeny Jennifer Galovich Lucy Kluckhohn Jones Holly Pinkart
Introduction • Goal is to produce an exercise that will engage allied health students and • Strengthen math skills and decrease math phobia • Decrease molecular data phobia • Increase bioinformatics literacy
Prerequisites • The following will be presented to students prior to this project • Basic evolutionary concepts and use of 16S rRNA in determining relationships between prokaryotes • Introduction to Biology Workbench, BLAST and tree construction
Approach • Use the theme of food poisoning to engage both nursing and nutrition student populations • Utilize mathematics and bioinformatics tools
Approach • Students will pick a week in which food poisoning is likely (Christmas, 4th of July, Thanksgiving, etc.) • Students will • identify a source of food poisoning (e.g., Salmonella), and check the Morbidity and Mortality Weekly Report tables for the number of cases in a specific state or region • calculate the proportion of all reported cases represented by that region • answer “Is this number of cases unusual based on the data presented for this time period? How can you tell?”
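The proportion calculation in this step can be sketched in a few lines of Python. This is only an illustration: the case counts and population below are made-up placeholder numbers, not MMWR data, and the function name `proportion` is hypothetical.

```python
def proportion(part, whole):
    """Return part / whole as a fraction, refusing a non-positive denominator."""
    if whole <= 0:
        raise ValueError("whole must be positive")
    return part / whole

# Hypothetical placeholder values (NOT taken from MMWR):
region_cases = 42               # cases reported in the chosen state that week
national_cases = 600            # cases reported nationwide that week
region_population = 5_000_000   # residents of the chosen state

# Share of all reported cases that came from the chosen region
share = proportion(region_cases, national_cases)

# Cases per 100,000 residents, a common way to compare regions of different size
rate_per_100k = proportion(region_cases, region_population) * 100_000

print(f"Share of national cases: {share:.1%}")
print(f"Cases per 100,000 residents: {rate_per_100k:.2f}")
```

Expressing the count per 100,000 residents is one possible way to make regions of different size comparable; students may prefer a plain percentage.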
Approach • Students will then address the questions • “Without culturing the organism, how might you track it in humans or in a food supply?” • “What relationships (if any) exist between various strains of this organism?” • “Can this type of data be used to find the original strain?”
Approach • Students will • obtain sequence data from NCBI’s GenBank for the organism (or virus) of interest • BLAST the sequence to find organisms with related sequences • Collect 8-13 of the closest BLAST results to perform a global alignment, and construct a tree
Questions • Students choose a time period (week) and search MMWR (Morbidity and Mortality Weekly Report) for the number of cases of a particular disease for that week. 1. Given the chosen disease, how many cases of the disease occurred in a particular state (or other locale) during the week?
More Questions about the Scene 2a. How many persons are involved? Is there an index case? 2b. What percent of the population has the disease? 3. What other question might you ask from these data? 4. What microbe causes the disease? What strain, if appropriate?
Now What? (Questions about the microbe) 5. If you want to determine the specific strain of the microbe, can you find the genetic sequence? 6. How has the strain evolved? 7. What is its phylogeny, and what are the closest neighbors?
And Then. . . (Questions to Investigate) 8a. Why is the answer to the previous question of interest to you if you are a nurse, a dietician, a parent, the mayor, the hospital director, the first responder, a restaurant owner, a cruise ship director, a public health inspector, or other interested person (you choose)? 8b. What other questions are of interest to you in this role?
Finding the Microbe • Search MMWR Morbidity Tables http://www.cdc.gov/mmwr/distrnds.html
Choose a Week http://wonder.cdc.gov/mmwr/mmwrmorb.asp
Choose a Disease http://wonder.cdc.gov/mmwr/mmwr_reps.asp?mmwr_year=2006&mmwr_week=07&mmwr_table=2F
What Percent of the Residents are Sick? http://wonder.cdc.gov/mmwr/mmwr_reps.asp?mmwr_year=2006&mmwr_week=01&mmwr_table=2F
Find a Microbe • Use your text, class notes, or other resources to determine the causative agent of the disease you have chosen. • Choose a microbe, then find its family tree. • For the Salmonellosis example, we have chosen Salmonella enterica, a microbe with many variants, called serovars.
Basics of Tree Construction • Preliminary Exercises • Goal • Students will practice with small examples before trying to construct a tree • Students will learn phylogenetics notation and terminology (also see Glossary at end)
From Sequences to Pairwise Alignment The Needleman-Wunsch Method
The Needleman-Wunsch Method • We make a table of residue scores, S(i,j). The number S(i,j) is computed by comparing residue i in sequence (1) with residue j in sequence (2), using previously chosen values for matches and mismatches. • Each alignment matrix entry, H(i,j), gives the score of the best alignment of the first i residues in sequence (1) with the first j residues of sequence (2). • We have one row for each residue in sequence (2) and one column for each residue in sequence (1). • To get started, we add a 0th row and a 0th column. The upper left corner is position (0,0). We set H(0,0) = 0. The rest of the values in the top row are (reading across) -g, -2g, -3g, etc., where g is the gap penalty. Similarly, the rest of the values in the leftmost column are (reading down) -g, -2g, -3g, etc. • To compute the value of H(i+1,j+1) we first consider the values north, west and northwest. We then find these three numbers:
• S(i+1,j+1) + (the value immediately northwest)
• (the value just north) – g
• (the value just west) – g
Distance Matrix • Then we choose the largest of these three numbers to be H(i+1,j+1) and draw an arrow from position (i+1,j+1) to the position that gave us the value of H(i+1,j+1). • Example: Let match = 1, mismatch = -1 and g = 2. Consider the sequences (1) G A A T T C (2) G G A T

           G    A    A    T    T    C
      0   -2   -4   -6   -8  -10  -12
G    -2    1   -1
G    -4   -1
A    -6
T    -8
Try This Exercise (at home ok) • (a) Complete the table and then follow the arrows to determine the alignment: • A diagonal arrow corresponds to aligning the two letters. • A horizontal arrow corresponds to aligning a letter from (1) with a gap, because sequence (1) runs across the columns. • A vertical arrow corresponds to aligning a letter from (2) with a gap, because sequence (2) runs down the rows. • (Note that if you have ties, you may have more than one arrow, and so more than one “best” alignment.) • (b) Redo this exercise with your own choice of match, mismatch and gap values. Experiment with these values to obtain alignments different from the ones you got in part (a).
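For checking hand-filled tables, the fill rule and the arrow-following step can be sketched in Python. This is a minimal sketch, not the authors' code: the function name `needleman_wunsch` is hypothetical, the table is stored as `H[row][column]` with sequence (2) down the rows and sequence (1) across the columns, and ties are broken in the arbitrary order diagonal, vertical, horizontal, so only one of possibly several best alignments is returned.

```python
def needleman_wunsch(seq1, seq2, match=1, mismatch=-1, gap=2):
    """Fill the alignment table and trace back one best global alignment.

    seq1 labels the columns, seq2 labels the rows (as on the slide).
    """
    rows, cols = len(seq2) + 1, len(seq1) + 1
    H = [[0] * cols for _ in range(rows)]

    # 0th row and 0th column: -g, -2g, -3g, ...
    for c in range(1, cols):
        H[0][c] = -gap * c
    for r in range(1, rows):
        H[r][0] = -gap * r

    # Each cell is the largest of: northwest + S, north - g, west - g
    for r in range(1, rows):
        for c in range(1, cols):
            s = match if seq1[c - 1] == seq2[r - 1] else mismatch
            H[r][c] = max(H[r - 1][c - 1] + s,
                          H[r - 1][c] - gap,
                          H[r][c - 1] - gap)

    # Follow the arrows from the bottom-right corner back to (0,0)
    top, bottom = [], []
    r, c = len(seq2), len(seq1)
    while r > 0 or c > 0:
        s = match if (r > 0 and c > 0 and seq1[c - 1] == seq2[r - 1]) else mismatch
        if r > 0 and c > 0 and H[r][c] == H[r - 1][c - 1] + s:
            top.append(seq1[c - 1]); bottom.append(seq2[r - 1])  # diagonal arrow
            r -= 1; c -= 1
        elif r > 0 and H[r][c] == H[r - 1][c] - gap:
            top.append("-"); bottom.append(seq2[r - 1])          # vertical arrow
            r -= 1
        else:
            top.append(seq1[c - 1]); bottom.append("-")          # horizontal arrow
            c -= 1
    return H, "".join(reversed(top)), "".join(reversed(bottom))


H, aligned1, aligned2 = needleman_wunsch("GAATTC", "GGAT")
for row in H:
    print(" ".join(f"{value:4d}" for value in row))
print(aligned1)
print(aligned2)
```

Changing `match`, `mismatch` and `gap` in the call is a quick way to try part (b) of the exercise after working it by hand.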
From Pairwise Alignment to Multiple Alignment • Idea of global progressive alignment: Sequences are aligned in order of their similarity, starting with the most alike pair. A consensus is determined and then aligned to the next most similar sequence. The determination of “next most similar” is made using phylogenetic information (a guide tree).
From Alignment to Distance Matrix • There are many different ways of computing the distance between pairs of sequences in a multiple alignment. Each uses different assumptions, which may or may not be reasonable for a given situation. • For example, the simplest model, Jukes-Cantor, assumes that mutation occurs at a constant rate, and that each nucleotide is equally likely to mutate into any other nucleotide (at that rate). • For protein sequences, the calculation is (even) more complicated.
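As an illustration of the Jukes-Cantor model, the standard correction converts the observed fraction p of differing sites into a distance d = -(3/4) ln(1 - 4p/3). The sketch below assumes two already-aligned nucleotide strings of equal length and assumes that columns containing a gap are skipped; the function names and the toy strings are hypothetical.

```python
import math

def p_distance(seq_a, seq_b):
    """Fraction of compared columns at which two aligned sequences differ.

    Columns where either sequence has a gap ('-') are skipped (an assumption).
    """
    if len(seq_a) != len(seq_b):
        raise ValueError("aligned sequences must have equal length")
    compared = differing = 0
    for a, b in zip(seq_a.upper(), seq_b.upper()):
        if a == "-" or b == "-":
            continue
        compared += 1
        if a != b:
            differing += 1
    if compared == 0:
        raise ValueError("no ungapped columns to compare")
    return differing / compared

def jukes_cantor(p):
    """Jukes-Cantor distance for an observed fraction p of differing sites."""
    if p >= 0.75:
        return math.inf  # the correction is undefined at or beyond saturation
    return -0.75 * math.log(1.0 - 4.0 * p / 3.0)

# Toy aligned strings, made up for illustration only
p = p_distance("GAATTC", "GGA-T-")
print(p, jukes_cantor(p))
```

The corrected distance is always at least as large as p, because the model allows for repeated mutations at the same site that a simple count cannot see.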
From Distance Matrix to Tree • Again, there are many different methods available. • Biology Workbench uses ClustalW to construct multiple alignments. • Clustal uses the neighbor joining method to find the guide tree. • The final tree produced by Workbench is a compilation of these guide trees.
Clustering Methods • The UPGMA (Unweighted Pair Group Method with Arithmetic Mean) method • + easy to describe; produces an ultrametric (and hence additive) tree • - assumptions (molecular clock; all species evolve at the same rate) • General idea: • Step 1. Find the two closest taxa. • Step 2. Treat the two closest as a new combined taxon, and make a new matrix, calculating distances from the combined taxon to the others using the average of all the pairwise distances involved. • Iterate these two steps until the tree is completed.
Construct the UPGMA tree for the following distance matrix:

      A    B    C    D
A     0    9    7    5
B     9    0    8   10
C     7    8    0    8
D     5   10    8    0

Observe: A and D are closest. Next, update the matrix:

       A/D     B      C
A/D     0    19/2   15/2
B              0      8
C                     0

Now the A/D cluster and C are closest.
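The two UPGMA steps can be sketched in Python to check a hand computation. This is a minimal sketch under stated assumptions: the function name `upgma` and the data layout (a dictionary keyed by pairs of taxon names) are hypothetical, and the function reports only the order of merges and the average distance at each merge, not branch lengths.

```python
from itertools import combinations

def upgma(names, dist):
    """Return the list of merges (cluster1, cluster2, average distance).

    dist maps frozenset({a, b}) to the distance between taxa a and b.
    Each cluster is a tuple of the original taxon names it contains.
    """
    clusters = [(name,) for name in names]
    merges = []

    def average(c1, c2):
        # Average of all pairwise distances between members of the two clusters
        total = sum(dist[frozenset((a, b))] for a in c1 for b in c2)
        return total / (len(c1) * len(c2))

    while len(clusters) > 1:
        # Step 1: find the two closest clusters
        c1, c2 = min(combinations(clusters, 2), key=lambda pair: average(*pair))
        merges.append((c1, c2, average(c1, c2)))
        # Step 2: replace them with the combined cluster
        clusters = [c for c in clusters if c not in (c1, c2)] + [c1 + c2]
    return merges

# The distance matrix from the slide
dist = {
    frozenset("AB"): 9, frozenset("AC"): 7, frozenset("AD"): 5,
    frozenset("BC"): 8, frozenset("BD"): 10, frozenset("CD"): 8,
}

for c1, c2, d in upgma("ABCD", dist):
    print("/".join(c1), "joins", "/".join(c2), "at distance", d)
```

Averaging over all original pairwise distances, rather than averaging the previous matrix entries, is what makes the method "unweighted" with respect to taxa.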
Exercises • 1. Finish constructing this tree. • 2. The tree is ultrametric, but the data are not. (Why not?) How would the data have to be changed in order that they be ultrametric? • 3. The tree is additive. Are the data? • 4. Now, redo questions 1 – 3 in case the BD distance is 12 instead of 10.
Neighbor Joining (NJ) • + additive (but not ultrametric); computationally efficient • - unrooted. Prior knowledge is needed to decide how to root the tree. • Note: the species which are closest according to the distance matrix need NOT be neighbors. That’s why we need a modified distance formula • Exercise: Draw a picture of a tree on four taxa that illustrates the problem described in the note above.
Constructing a Neighbor Joining Tree • Step 1: Find the two taxa which are closest using the modified distance formula below. Join them. • To find the modified distance from node i to node j: • Let N be the number of taxa. • Let R_i = (sum of the distances from node i to all others except node j) / (N – 2). • Let R_j = (sum of the distances from node j to all others except node i) / (N – 2). • Let D(i,j) = matrix distance. • Calculate the modified distance, D*, from i to j as D*(i,j) = D(i,j) – R_i – R_j. • For example, using the distance matrix we used earlier, R_A = (7 + 5)/2 = 6 and R_B = (8 + 10)/2 = 9, so D*(A,B) = 9 – 6 – 9 = -6.
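The modified distance can be sketched in Python exactly as it is defined on this slide, with R_i leaving out the distance to j. The function name `modified_distance` and the nested-dictionary layout are hypothetical. Note as an aside that many textbook statements of neighbor joining include the distance to j in the sum, which can rank pairs differently; that variant is not shown here.

```python
from itertools import combinations

# The distance matrix used earlier, as a nested dictionary (no self-distances)
D = {
    "A": {"B": 9, "C": 7, "D": 5},
    "B": {"A": 9, "C": 8, "D": 10},
    "C": {"A": 7, "B": 8, "D": 8},
    "D": {"A": 5, "B": 10, "C": 8},
}

def modified_distance(D, i, j):
    """D*(i,j) = D(i,j) - R_i - R_j, following the slide's definition."""
    n = len(D)
    # R_i: distances from i to every other node except j, divided by N - 2
    r_i = sum(d for k, d in D[i].items() if k != j) / (n - 2)
    # R_j: distances from j to every other node except i, divided by N - 2
    r_j = sum(d for k, d in D[j].items() if k != i) / (n - 2)
    return D[i][j] - r_i - r_j

for i, j in combinations(sorted(D), 2):
    print(i, j, modified_distance(D, i, j))

# The pair with the smallest modified distance is joined first
best = min(combinations(sorted(D), 2), key=lambda pair: modified_distance(D, *pair))
print("join", best)
```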
NJ (continued) • Step 2: Suppose that nodes i and j give the smallest value of D*. Start the tree by joining those nodes to a new node. Call the new node (ij). We now have two fewer taxa and one more internal node, for a net of one fewer node than we started with. • Step 3: Now, as in the UPGMA method, we make a new matrix showing the distances to all the nodes except i and j. Problem: the new internal node (ij) is not in the original matrix.
This problem can be solved • Step 4: To update the matrix, you will need to compute the distance from the new internal node (ij) to the remaining nodes. For each remaining node k, compute the new distance as ½ [D(i,k) + D(j,k) – D(i,j)] • Step 5: Apply steps 1 – 4 to the revised matrix.
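The matrix update in Step 4 can be sketched as follows. This is an illustrative sketch only: the function name `join` and the nested-dictionary layout are hypothetical, the example joins A and D from the four-taxon matrix used earlier, and branch lengths are not computed.

```python
def join(D, i, j):
    """Return a new distance matrix in which nodes i and j are replaced by (ij)."""
    new = f"({i}{j})"
    rest = [k for k in D if k not in (i, j)]

    # Distances among the remaining nodes are copied unchanged
    reduced = {k: {m: D[k][m] for m in rest if m != k} for k in rest}
    reduced[new] = {}

    # Distance from the new internal node to each remaining node k
    for k in rest:
        d = 0.5 * (D[i][k] + D[j][k] - D[i][j])
        reduced[new][k] = d
        reduced[k][new] = d
    return reduced

# The four-taxon matrix used earlier (no self-distances)
D = {
    "A": {"B": 9, "C": 7, "D": 5},
    "B": {"A": 9, "C": 8, "D": 10},
    "C": {"A": 7, "B": 8, "D": 8},
    "D": {"A": 5, "B": 10, "C": 8},
}

reduced = join(D, "A", "D")
for node in sorted(reduced):
    print(node, reduced[node])
```

Step 5 then amounts to applying the modified distance and this update again to the reduced matrix until only two nodes remain.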
Exercises • Practice the NJ method on the matrix we had earlier. • Now try both methods using the matrix to the right. • Why do you get different trees?
Final Approach • Use the theme of food poisoning to engage both nursing and nutrition student populations • Utilize mathematics and bioinformatics tools
Find the Microbial Gene • NCBI Search http://www.ncbi.nlm.nih.gov/entrez/query.fcgi?db=Nucleotide
Choose a Strain http://www.ncbi.nlm.nih.gov/entrez/query.fcgi?db=nucleotide&cmd=search&term=Salmonella+enterica+16s+ribosomal+RNA+gene
BLAST • Basic Local Alignment Search Tool http://www.ncbi.nlm.nih.gov/BLAST/
Paste Sequence, BLAST off! http://www.ncbi.nlm.nih.gov/BLAST/Blast.cgi?CMD=Web&LAYOUT=TwoWindows&AUTO_FORMAT=Semiauto&ALIGNMENTS=50&ALIGNMENT_VIEW=Pairwise&CLIENT=web&DATABASE=nr&DESCRIPTIONS=100&ENTREZ_QUERY=%28none%29&EXPECT=10&FILTER=L&FORMAT_OBJECT=Alignment&FORMAT_TYPE=HTML&NCBI_GI=on&PAGE=Nucleotides&PROGRAM=blastn&SERVICE=plain&SET_DEFAULTS.x=34&SET_DEFAULTS.y=8&SHOW_OVERVIEW=on&END_OF_HTTPGET=Yes&SHOW_LINKOUT=yes&GET_SEQUENCE=yes
GenBank http://www.ncbi.nlm.nih.gov/entrez/viewer.fcgi?db=nucleotide&val=88604678
FASTA http://www.ncbi.nlm.nih.gov/entrez/viewer.fcgi?db=nucleotide&qty=1&c_start=1&list_uids=88604678&dopt=fasta&dispmax=5&sendto=&from=begin&to=end&extrafeatpresent=1&ef_CDD=8&ef_MGC=16&ef_HPRD=32&ef_STS=64&ef_tRNA=128
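For students who want to inspect downloaded sequences before pasting them into Workbench, a FASTA file can be read with a few lines of Python. This is a minimal sketch: the function name `parse_fasta` is hypothetical, and the two records below are made-up placeholders, not real GenBank entries.

```python
def parse_fasta(text):
    """Return a dict mapping each header line (without '>') to its sequence."""
    records = {}
    header = None
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            header = line[1:]          # a new record starts at each '>' line
            records[header] = []
        elif header is not None:
            records[header].append(line)  # sequence may span several lines
    return {h: "".join(parts) for h, parts in records.items()}

# Made-up placeholder records for illustration only
example = """>placeholder_1 hypothetical 16S fragment
GAATTCGGAT
ACGT
>placeholder_2 hypothetical 16S fragment
GGATACGT
"""

for header, sequence in parse_fasta(example).items():
    print(header, len(sequence))
```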
Constructing a Tree • Add sequences • http://seqtool.sdsc.edu/CGI/BW.cgi#!
Clustal W • Choose the Multiple Sequence Alignment http://seqtool.sdsc.edu/CGI/BW.cgi#!
Choose a Tree Type • Choose Rooted and/or Unrooted • Submit http://seqtool.sdsc.edu/CGI/BW.cgi#!
Voila! • Unrooted Tree http://seqtool.sdsc.edu/CGI/BW.cgi#!
Rooted Tree • Which species are the most closely related? http://seqtool.sdsc.edu/CGI/BW.cgi#!
Final Questions • How are the data helpful if you are a • Parent? • Restaurant owner? • Hospital director? • Public health inspector?
Assessment • Student Learning Outcomes • More comfortable with computation • Using the tools to answer questions • Empowerment (we hope!)
References -- Texts • Emphasis on algorithms: • Neil C. Jones and Pavel A. Pevzner, An Introduction to Bioinformatics Algorithms • Michael S. Waterman, Introduction to Computational Biology • Bio/Math Balanced: • Paul G. Higgs and Teresa K. Attwood, Bioinformatics and Molecular Evolution • The Bible of Phylogenetics: • Joseph Felsenstein, Inferring Phylogenies
References -- Websites • http://mbi.ohio-state.edu/2005/tutorials2005.html (tutorial on tree construction) • http://bioalgorithms.info/courses.php (list of links to bioinformatics course websites) • http://tree-thinking.org/ (resources for learning and teaching)