MBoMS Genomics of Model Microbes Lab 4: Multiple sequence alignment

Multiple Sequence Alignments • Your next task will be to create multiple sequence alignments • This is one of the most challenging steps in the study of molecular evolution • Too often, a person will simply feed several sequences into an alignment program and accept whatever is produced • In MANY cases, the alignment will contain serious errors and may include sequences that are not homologs

Overview of Multiple Alignments • If you ask an alignment program to align two sequences, it will • That does not mean the sequences are homologs, just that the algorithm has found an alignment which maximizes the number of identical and similar residues or nucleotides • If you are searching for similar domains or motifs, GREAT…the alignment program may have done a decent job • However, in many, many cases, the alignment program will only have created a starting point -- you will have to use your knowledge of the molecule or intuition or instincts to optimize the alignment

Use Your Brain • Once you have an alignment, look at it • Do the gaps seem reasonable, or does it look like you could have done a better job by eye? • If yes, then you should try to do a better job • Are there large regions absent in one sequence? • If yes, you may need to delete those regions in the input file • Is one sequence much shorter? • If yes, you should crop the others or eliminate the short one from consideration
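The "is one sequence much shorter?" check can be done before aligning. The sketch below is only an illustration: the function name `flag_short` and the 80% cutoff are assumptions made for this example, not values from the lab, and the decision to crop or drop a sequence still needs your own judgment.

```python
from statistics import median


def flag_short(seqs: dict[str, str], min_fraction: float = 0.8) -> list[str]:
    """Return names of sequences shorter than min_fraction of the median length.

    min_fraction is an arbitrary, hypothetical cutoff chosen for illustration.
    """
    med = median(len(s) for s in seqs.values())
    return [name for name, s in seqs.items() if len(s) < min_fraction * med]


# Hypothetical placeholder sequences, only their lengths matter here.
demo = {"full1": "A" * 100, "full2": "A" * 98, "fragment": "A" * 40}
print(flag_short(demo))
```

Any name this reports is a candidate for the "crop the others or eliminate the short one" decision described above.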

More On Alignments • The ideal result is an alignment in which all aligned residues or bases are homologous and all insertions and deletions represent real events in the evolution of the sequences • Since we cannot know the true history, we settle for a reasonable approximation by assigning and adjusting gap penalties • With unlimited, unpenalized gaps, we could align any two unrelated sequences perfectly • Alignment programs prevent that by penalizing the alignment score for each gap and for each additional residue in a gap
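The "penalty per gap plus penalty per additional residue" scheme is often called an affine gap penalty. The sketch below shows the arithmetic on one already-aligned sequence. The function names and the penalty values (10.0 and 0.5) are assumptions chosen for illustration and are not the defaults of any particular program.

```python
def gap_runs(aligned: str, gap_char: str = "-") -> list[int]:
    """Return the length of each consecutive run of gap characters."""
    runs, current = [], 0
    for ch in aligned:
        if ch == gap_char:
            current += 1
        elif current:
            runs.append(current)
            current = 0
    if current:  # a gap run that reaches the end of the sequence
        runs.append(current)
    return runs


def affine_gap_cost(aligned: str, gap_open: float, gap_extend: float) -> float:
    """Charge gap_open once per gap and gap_extend per additional residue in it."""
    return sum(gap_open + gap_extend * (n - 1) for n in gap_runs(aligned))


one_long_gap = "MKV---LTA"      # hypothetical aligned sequence with one gap of length 3
three_short_gaps = "M-KV-L-TA"  # hypothetical aligned sequence with three gaps of length 1
print(affine_gap_cost(one_long_gap, gap_open=10.0, gap_extend=0.5))      # 11.0
print(affine_gap_cost(three_short_gaps, gap_open=10.0, gap_extend=0.5))  # 30.0
```

Both sequences contain three gap characters, but the single long gap is much cheaper. That is the intended effect: one insertion or deletion event is more plausible than three independent ones.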

Gap penalties • Can we improve our alignment? • We begin by increasing or decreasing gap penalties and multiple alignment penalties in the program • Each time, we should print out the alignment and see how it compares to the previous one • Although time consuming, this is the single most important thing you can do to ensure you have the best possible alignment with that data set

Advice on alignments • Treat them cautiously • Can usually be improved by eye • Often helps to have color coding • Depending on use, the user should be able to make a judgment on those regions that are reliable or not • For phylogenetic reconstruction, only use those positions whose hypothesis of positional homology is unimpeachable

Exercise 1 • The first target gene is efp • Go to the NCBI main page, type efp into the search engine, and click Go • You will be taken to Entrez, and there will be lots of information about the gene • Learn a bit about the gene and encoded protein • Try to find the gene in a genome from each of your two species • Try genome BLAST (found on the microbial genome resource page, in the tool box) • Do both species have the gene? • Is the gene similar in the two species? • Copy and paste the protein sequences into a Word document

Exercise 2 • The next exercise involves creating multiple alignments for this target protein • These alignments should be fairly simple, since the proteins should be either identical or very similar when compared from within a species

Exercise 2, cont. • Start with Efp • You have already found this protein in the genome of each of your two species • Now, find the same protein in two more genomes for each species • Paste into the word file all three copies of the target protein from your first species • Repeat with a new word file and paste into the file all three copies of the target protein from your second species

Exercise 2, cont. • Now edit each word file to look exactly like this (except use YOUR species designation, not E. coli’s):
>Ecefp1
PROTEIN SEQUENCE ***IN ALL CAPS***
BLANK LINE
>Ecefp2
PROTEIN SEQUENCE
BLANK LINE
>Ecefp3
PROTEIN SEQUENCE

Sample input file
>Ecefp1
YQHVKPGKGAAFVRAKIKSFLDGKVIEKTFHAGDKCEEPNLVEKTMQYLY
HDGDTYQFMDIESYEQIALNDSQVGEASKWMLDGMQVQVLLHNDKAISVD
VPQVVALKIVETAPNFKGDTSSASKKPATLETGTVV

>Ecefp2
YQHVKPGKGAAFVRAKIKSFLDGKVIEKTFHAGDKCEEPNLVEKTMQYLY
HDGDTYQFMDIESYEQIALNDSQVGEAKKWMLDGMQVQVLLHNDKAISVD
VPQVVALKIVETAPNFKGDTSSASKKPATLETGAVV

>Ecefp3
YQHVKPGKGAAFVRAKIKSFLDGKVIEKTFHAGDKCEEPNLVEKTMQYLY
HDGDTYQFMDIESYEQIALNDSQVGEASKLMLDGMQVQVLLHNDKAISVD
VPQVVALKIVETAPNFKGDTSSASKKPATLETGAVV
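Before pasting a file into the alignment program, it can help to confirm that it really follows the layout above. The following is a minimal sketch of such a check. The helper names `parse_fasta` and `check_records` and the lowercase record `demo_bad` are hypothetical and exist only for this example.

```python
def parse_fasta(text: str) -> dict[str, str]:
    """Parse FASTA-style text into {name: sequence}, joining wrapped lines."""
    records: dict[str, str] = {}
    name = None
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue  # blank lines only separate records
        if line.startswith(">"):
            name = line[1:].strip()
            if name in records:
                raise ValueError(f"duplicate sequence name: {name}")
            records[name] = ""
        elif name is None:
            raise ValueError("sequence data found before the first '>' header")
        else:
            records[name] += line
    return records


def check_records(records: dict[str, str]) -> list[str]:
    """Return human-readable problems: empty or not-all-caps sequences."""
    problems = []
    for name, seq in records.items():
        if not seq:
            problems.append(f"{name}: empty sequence")
        elif not (seq.isalpha() and seq.isupper()):
            problems.append(f"{name}: sequence must be letters in ALL CAPS")
    return problems


sample = """>Ecefp1
YQHVKPGKGAAFVRAKIKSFLDGKVIEKTFHAGDKCEEPNLVEKTMQYLY
HDGDTYQFMDIESYEQIALNDSQVGEASKWMLDGMQVQVLLHNDKAISVD

>demo_bad
yqhvkpgkgaafvrak
"""
records = parse_fasta(sample)
print({name: len(seq) for name, seq in records.items()})
print(check_records(records))
```

The check is deliberately loose: it does not verify that every letter is a valid amino acid code, only that the formatting rules from the exercise are met.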

Exercise 3 • Now go to the CLUSTALW web site: • http://www.ebi.ac.uk/Tools/clustalw2/index.html • This software provides a robust sequence alignment algorithm • At the top of the page, there is a box to insert your sequences • Simply cut and paste the three sequences from the first word document and click run

Sample Clustal Output
Results of search
Number of sequences: 3
Alignment score: 2403
Sequence format: pearson
Sequence type: aa
Jalview tab: start Jalview
Output file: clustalw2-20080328-01104802.output
Alignment file: clustalw2-20080328-01104802.aln
Guide tree file: clustalw2-20080328-01104802.dnd
Your input file: clustalw2-20080328-01104802.input

Sample Clustal Output
SeqA  Name    Len(aa)  SeqB  Name    Len(aa)  Score
=============================================================
1     Ecefp1  136      2     Ecefp2  136      98
1     Ecefp1  136      3     Ecefp3  136      98
2     Ecefp2  136      3     Ecefp3  136      98
=============================================================
Ecefp1 YQHVKPGKGAAFVRAKIKSFLDGKVIEKTFHAGDKCEEPNLVEKTMQYLYHDGDTYQFMD 60
Ecefp2 YQHVKPGKGAAFVRAKIKSFLDGKVIEKTFHAGDKCEEPNLVEKTMQYLYHDGDTYQFMD 60
Ecefp3 YQHVKPGKGAAFVRAKIKSFLDGKVIEKTFHAGDKCEEPNLVEKTMQYLYHDGDTYQFMD 60
       ************************************************************
Ecefp1 IESYEQIALNDSQVGEASKWMLDGMQVQVLLHNDKAISVDVPQVVALKIVETAPNFKGDT 120
Ecefp2 IESYEQIALNDSQVGEASKLMLDGMQVQVLLHNDKAISVDVPQVVALKIVETAPNFKGDT 120
Ecefp3 IESYEQIALNDSQVGEAKKWMLDGMQVQVLLHNDKAISVDVPQVVALKIVETAPNFKGDT 120
       *****************.* ****************************************
Ecefp1 SSASKKPATLETGTVV 136
Ecefp2 SSASKKPATLETGAVV 136
Ecefp3 SSASKKPATLETGAVV 136
       *************:**
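One way to sanity-check output like this by hand is to count identical positions yourself. The sketch below assumes that the pairwise Score column is approximately the percent identity truncated to a whole number. That is an assumption for this illustration, as are the function names. The sequences are copied from the sample alignment above.

```python
from itertools import combinations

# Aligned sequences copied from the sample output; the first 60 columns are shared.
HEAD = "YQHVKPGKGAAFVRAKIKSFLDGKVIEKTFHAGDKCEEPNLVEKTMQYLYHDGDTYQFMD"
aligned = {
    "Ecefp1": HEAD
    + "IESYEQIALNDSQVGEASKWMLDGMQVQVLLHNDKAISVDVPQVVALKIVETAPNFKGDT"
    + "SSASKKPATLETGTVV",
    "Ecefp2": HEAD
    + "IESYEQIALNDSQVGEASKLMLDGMQVQVLLHNDKAISVDVPQVVALKIVETAPNFKGDT"
    + "SSASKKPATLETGAVV",
    "Ecefp3": HEAD
    + "IESYEQIALNDSQVGEAKKWMLDGMQVQVLLHNDKAISVDVPQVVALKIVETAPNFKGDT"
    + "SSASKKPATLETGAVV",
}


def percent_identity(a: str, b: str, gap: str = "-") -> float:
    """Percent of identical residues over columns where neither sequence has a gap."""
    if len(a) != len(b):
        raise ValueError("aligned sequences must have the same length")
    compared = identical = 0
    for x, y in zip(a, b):
        if x == gap or y == gap:
            continue
        compared += 1
        identical += x == y
    return 100.0 * identical / compared if compared else 0.0


def conserved_columns(seqs: list[str], gap: str = "-") -> int:
    """Count columns where every sequence has the same non-gap residue ('*' columns)."""
    return sum(1 for col in zip(*seqs) if len(set(col)) == 1 and col[0] != gap)


for (name_a, seq_a), (name_b, seq_b) in combinations(aligned.items(), 2):
    print(name_a, name_b, int(percent_identity(seq_a, seq_b)))
print("fully conserved columns:", conserved_columns(list(aligned.values())))
```

Each pair differs at two of 136 positions, which is consistent with the score of 98 reported for every pair, and the three variable columns match the three non-"*" symbols in the consensus lines.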

Sample CLUSTAL Output: Phylogenetic tree [Figure: tree with leaves Ecefp1, Ecefp2, Ecefp3]

Exercise 4 • Gap Penalties and Extensions • CLUSTAL employs a set of default values for gap and extension penalties • What are these? • Your next task is to try several larger and smaller values for the gap and extension penalties. • Did your alignment change with changes in these penalties? • Decide what are the best values for each and write a paragraph in your lab notebook about how you decided what was best
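To build intuition for why the alignment changes when the penalties change, here is a toy pairwise aligner. It is a deliberately simplified sketch and not CLUSTAL's method: it aligns only two sequences, uses a single linear gap penalty rather than separate opening and extension penalties, and scores matches and mismatches with made-up values instead of a substitution matrix. The sequences and all parameter values are hypothetical.

```python
def global_align(a: str, b: str, match: int = 1, mismatch: int = -1, gap: int = -2):
    """Needleman-Wunsch global alignment with a linear gap penalty.

    Returns (aligned_a, aligned_b, score). gap should be zero or negative.
    """
    n, m = len(a), len(b)
    score = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        score[i][0] = i * gap
    for j in range(1, m + 1):
        score[0][j] = j * gap
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            sub = match if a[i - 1] == b[j - 1] else mismatch
            score[i][j] = max(
                score[i - 1][j - 1] + sub,  # align a[i-1] with b[j-1]
                score[i - 1][j] + gap,      # gap in b
                score[i][j - 1] + gap,      # gap in a
            )
    # Trace back from the bottom-right corner to recover one optimal alignment.
    out_a, out_b = [], []
    i, j = n, m
    while i > 0 or j > 0:
        sub = match if i > 0 and j > 0 and a[i - 1] == b[j - 1] else mismatch
        if i > 0 and j > 0 and score[i][j] == score[i - 1][j - 1] + sub:
            out_a.append(a[i - 1]); out_b.append(b[j - 1]); i -= 1; j -= 1
        elif i > 0 and score[i][j] == score[i - 1][j] + gap:
            out_a.append(a[i - 1]); out_b.append("-"); i -= 1
        else:
            out_a.append("-"); out_b.append(b[j - 1]); j -= 1
    return "".join(reversed(out_a)), "".join(reversed(out_b)), score[n][m]


# Sweep the gap penalty and compare the alignments side by side.
for penalty in (0, -1, -5, -100):
    top, bottom, total = global_align("AAAT", "TAAA", gap=penalty)
    print(f"gap={penalty:>4}  score={total:>3}  {top} / {bottom}")
```

With a free gap the aligner shifts the sequences to line up every match, while a very harsh penalty forces an ungapped alignment that accepts mismatches. Your task in this exercise is to find where between those extremes the alignment looks biologically reasonable.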

Exercise 5 • Repeat this exercise with each of the target proteins and with both of your species • Gene names: infB, rpoC, accA, thyA, purA • Find the genes, paste the encoded protein sequence into your word files • If one or both of your species are missing any of these genes/proteins, here are some alternates: atpA, tpiA, pheS • Input/Output • A total of 12 alignment files will be made • Proteins 1 - 6 for species A • Proteins 1 - 6 for species B • A total of 12 alignments and 12 trees will be produced

Exercise 6 • For the next lab, think about the following points, which we will discuss: • The entire class will discuss the challenges they encountered with their multiple alignments • We need to reach a consensus on how we aligned these proteins in different species • Are some species more difficult than others? • Why might that be? • Are certain proteins not as useful in this exercise? • Why might that be? • Do we need to add or subtract any of our proteins?
