Practical on phylogenetic trees based on sequence alignments
Kyrylo Bessonov
November 26th, 2013

Talk plan
• How to build phylogenetic trees of types
  • Unrooted
  • Rooted
• Context
  • comparison of viral proteins of dengue virus
• Examples on phylogenetic tree building
  • Dengue virus

Building a phylo tree using ape
• Ape - Analyses of Phylogenetics and Evolution
  • Functions to create and manipulate phylo trees
  • Graphical exploration of phylogenetic data
• To build a phylogenetic tree
  • Download protein sequences from DB
  • Align sequences
  • Calculate pairwise distance using ape
  • Visualize a phylogenetic tree
Building an unrooted phylogenetic tree (1)
# install required libraries
install.packages("seqinr")
install.packages("muscle")
install.packages("ape")
library("seqinr")
library("muscle")
library("ape")

multipleSeqAlignment <- function(seqnames, seqs) {
  # umax is an object of class fasta from the muscle package (used here as a template)
  fasta_seqs_Object <- umax
  tmp <- data.frame(V1 = rep(0, length(seqs)), V2 = rep(0, length(seqs)))
  for (i in 1:length(seqs)) {
    tmp[i, 1] <- seqnames[i]                       # sequence name
    tmp[i, 2] <- paste(seqs[[i]], collapse = "")   # sequence as a single string
  }
  fasta_seqs_Object$seqs <- tmp
  # multiple sequence alignment
  # remove the conflicting ape library from memory first
  try(detach("package:ape"), silent = TRUE)
  alignment <- muscle(seqs = fasta_seqs_Object, out = NULL)
  alignment_ape <- ape::as.alignment(matrix(alignment$seqs[, 2]))
  alignment_ape$nam <- alignment$seqs[, 1]
  return(alignment_ape)
}
Building an unrooted phylogenetic tree (2)
# main part of the code
choosebank("swissprot")   # selects database for query
seqnames <- c("P06747", "P0C569", "O56773", "Q5VKP1")
seqs <- list()
for (i in 1:length(seqnames)) {
  query <- query(paste("AC=", seqnames[i], sep = ""))
  seqs[i] <- getSequence(query)
}
# multipleSeqAlignment() is defined on the previous slide
alignment_ape <- multipleSeqAlignment(seqnames, seqs)
mydist <- dist.alignment(alignment_ape)
# nj() performs the neighbor-joining tree estimation of Saitou and Nei
mytree <- nj(mydist)
mytree$tip.label <- c("Q5VKP1-\nWestern Caucasian bat virus\nphosphoprotein",
                      "P06747-\nrabies virus\nphosphoprotein",
                      "P0C569-\nMokola virus\nphosphoprotein",
                      "O56773-\nLagos bat virus\nphosphoprotein")
plot.phylo(mytree, type = "u", edge.color = "blue", edge.width = 3,
           cex = 0.8, no.margin = TRUE, srt = 50)
Unrooted Phylogenetic Tree
• Phylogenetic tree showing the distances between 4 viral protein sequences
• The genetic distance between O56773 and P0C569 is the smallest

Unrooted phylogenetic tree (1)
• The lengths of the branches in the plot of the tree are proportional to the amount of evolutionary change (estimated by the number of mutations) along the tree branches
• This is an unrooted phylogenetic tree, as it does not contain an outgroup sequence, that is, a sequence of a protein that is known to be more distantly related to the other proteins in the tree than they are to each other.

Unrooted phylogenetic tree (2)
• As a result, we cannot tell which direction evolutionary time ran in along the internal branches of the tree. For example, we cannot tell whether the node representing the common ancestor of (O56773, P0C569) was an ancestor of the node representing the common ancestor of (Q5VKP1, P06747), or the other way around.
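The two points above (branch lengths carry the amount of change, and a neighbour-joining tree has no root) can be checked on a small example. The following is a minimal sketch, assuming the ape package is installed; the four taxa and their distances are hypothetical and are not taken from the slides.

```r
library(ape)  # assumes the ape package is installed

# Hypothetical pairwise distances between four taxa (not from the slides)
taxa <- c("A", "B", "C", "D")
m <- matrix(c(0, 2, 7, 8,
              2, 0, 7, 8,
              7, 7, 0, 3,
              8, 8, 3, 0),
            nrow = 4, dimnames = list(taxa, taxa))

tree <- nj(as.dist(m))              # neighbour-joining tree
rooted_flag <- is.rooted(tree)      # nj() returns an unrooted tree
branch_lengths <- tree$edge.length  # one length per branch
tip_to_tip <- cophenetic(tree)      # path lengths between tips along the branches

print(rooted_flag)
print(round(tip_to_tip, 3))
```

Because the toy distances are additive, the tip-to-tip path lengths reproduce the input matrix, which is what "branch lengths proportional to evolutionary change" means in practice. Real distance matrices are rarely exactly additive, so the match is usually approximate.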
Distance matrix
• Inspecting the calculated distance matrix between the aligned sequences confirms the results seen in the phylogenetic tree
• The closest pair is the O56773 and P0C569 proteins
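To make the distance matrix less of a black box, here is a minimal base-R sketch of an identity-based distance. It follows the square root of (1 - identity) definition documented for seqinr's dist.alignment(); the three aligned fragments are hypothetical, and gap handling is simplified (gaps would be compared like any other character), so the numbers need not match the package output.

```r
# Toy aligned protein fragments (hypothetical, all of equal length)
aln <- c(seqA = "DMGYWIESEK",
         seqB = "DMGYWIESAK",
         seqC = "DMGCVVSWNG")

# Fraction of identical positions between two aligned strings
identity_fraction <- function(a, b) {
  x <- strsplit(a, "")[[1]]
  y <- strsplit(b, "")[[1]]
  mean(x == y)
}

# Pairwise distance: square root of (1 - identity)
n <- length(aln)
d <- matrix(0, n, n, dimnames = list(names(aln), names(aln)))
for (i in seq_len(n)) {
  for (j in seq_len(n)) {
    d[i, j] <- sqrt(1 - identity_fraction(aln[[i]], aln[[j]]))
  }
}

print(round(as.dist(d), 3))
```

Here seqA and seqB differ at one position out of ten, so they get the smallest distance, in the same way that the closest pair is read off the real matrix.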
Rooted phylogenetic tree
• In order to convert the unrooted tree into a rooted tree, we need to add an outgroup sequence
• Outgroup
  • a taxon outside the group of interest
  • will branch off at the base of the phylogeny
  • Caenorhabditis elegans (UniProt accession Q10572) and Caenorhabditis remanei (UniProt E3M2K8)
• If we were to build a phylogenetic tree of the Fox-1 homologues in vertebrates, the distantly related sequence from worms would probably be a good choice of outgroup, since the protein is from a different taxon (worms)
Building a rooted phylogenetic tree (1)
# BUILDING A ROOTED TREE OF PROTEIN SEQUENCES (FOX1)
# Q9NWB1 - Human
# Q17QD3 - Cow
# Q95KI0 - Monkey
# A1A5R1 - Rat
# Q10572 - Worm C.elegans (Root)
# E1G4K8 - Eye worm
seqnames <- c("Q9NWB1", "Q17QD3", "Q95KI0", "A1A5R1", "Q10572", "E1G4K8")
choosebank("swissprot")   # selects database for query
seqs <- list()
for (i in 1:length(seqnames)) {
  query <- query(paste("AC=", seqnames[i], sep = ""))
  seqs[i] <- getSequence(query)
}
alignment_ape <- multipleSeqAlignment(seqnames, seqs)
mydist <- dist.alignment(alignment_ape)
Building a rooted phylogenetic tree (2)
library("ape")
mytree <- nj(mydist)
mytree$tip.label <- c("E1G4K8-Eye worm ", "Q10572-C.elegans(Root)", "A1A5R1-Rat",
                      "Q9NWB1-Human", "Q17QD3-Cow", "Q95KI0-Monkey")
# root the tree on the outgroup; r=TRUE is matched to the resolve.root argument
myrootedtree <- root(mytree, outgroup = "Q10572-C.elegans(Root)", r = TRUE)
#Phylogenetic tree with 6 tips and 5 internal nodes.
#Tip labels:
#[1] "E1G4K8" "Q8WS01" "Q9VT99" "A8NSK3" "Q10572" "E3M2K8"
#Rooted; includes branch lengths.
plot.phylo(myrootedtree, edge.color = "blue", edge.width = 3, type = "p")
Rooted tree of FOX1 proteins
• The invertebrates are grouped together
• Worms form a distinct group, yet with a large genetic distance
• Human FOX1 is closest to the monkey and cow sequences
Figure label: outgroup (worms)

Distance matrix
Table legend:
Q9NWB1 - Human
Q95KI0 - Monkey
Q10572 - Worm C.elegans (Root)
Q17QD3 - Cow
A1A5R1 - Rat
E1G4K8 - Eye worm
• As expected, eye worms are the most distantly related species to vertebrates
• Cow and monkey have the closest relationship and the lowest genetic distance

Rooted tree
• Time runs from left to right
• Monkey, Cow and Human have common ancestor 3
• Ancestor 1 is common to ancestors 2 and 3
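The effect of rooting on an outgroup can be reproduced on a toy tree. This is a minimal sketch, assuming the ape package is installed; the Newick string, tip names and branch lengths are hypothetical and only loosely echo the FOX1 example.

```r
library(ape)  # assumes the ape package is installed

# Hypothetical unrooted tree: a three-way split at the base means "no root" in ape
unrooted <- read.tree(text = "(human:1,monkey:1,(rat:2,worm:6):1);")

# Root on the outgroup; resolve.root = TRUE makes the root a proper two-way split
rooted <- root(unrooted, outgroup = "worm", resolve.root = TRUE)

print(c(before = is.rooted(unrooted), after = is.rooted(rooted)))

# Once rooted, the remaining tips form a single clade descending from the root
ingroup_clade <- is.monophyletic(rooted, c("human", "monkey", "rat"))
print(ingroup_clade)
```

With the root in place, the order of the internal nodes can be read as a time sequence, which is the information the unrooted tree lacks.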
Exercises on phylogenetic tree building
• Q1. Calculate the genetic distances between the following NS1 proteins from different Dengue virus strains: Dengue virus 1 NS1 protein (UniProt: Q9YRR4), Dengue virus 2 NS1 protein (UniProt: Q9YP96), Dengue virus 3 NS1 protein (UniProt: B0LSS3), and Dengue virus 4 NS1 protein (UniProt: Q6TFL5). Which viruses are the most closely related, and which are the least closely related, based on the genetic distances?
  Note: Dengue virus causes Dengue fever, which is classified by the WHO as a neglected tropical disease. There are four main types of Dengue virus: Dengue virus 1, Dengue virus 2, Dengue virus 3, and Dengue virus 4.
• Q2. Build an unrooted phylogenetic tree of the NS1 proteins from Dengue virus 1, Dengue virus 2, Dengue virus 3 and Dengue virus 4, using the neighbour-joining algorithm. Which are the most closely related proteins, based on the tree?
Exercises on phylogenetic tree building
• Q3. The Zika virus is related to Dengue viruses, but is not a Dengue virus, and so can be used as an outgroup in phylogenetic trees of Dengue virus sequences. UniProt accession Q32ZE1 is a sequence with similarity to the Dengue NS1 protein, so it seems to be a related protein from Zika virus. Build a rooted phylogenetic tree of the Dengue NS1 proteins based on an alignment, using the Zika virus protein as the outgroup. Which are the most closely related Dengue virus proteins, based on the tree? What extra information does this tree tell you, compared to the unrooted tree in Q2?
Answers
Question 1:
Summary of viral proteins and UniProt accession numbers:
UniProt: Q9YRR4 Dengue virus 1 NS1 protein
UniProt: Q9YP96 Dengue virus 2 NS1 protein
UniProt: B0LSS3 Dengue virus 3 NS1 protein
UniProt: Q6TFL5 Dengue virus 4 NS1 protein

seqnames <- c("Q9YRR4", "Q9YP96", "B0LSS3", "Q6TFL5")
choosebank("swissprot")   # selects database for query
seqs <- list()
for (i in 1:length(seqnames)) {
  query <- query(paste("AC=", seqnames[i], sep = ""))
  seqs[i] <- getSequence(query)
}
alignment_ape <- multipleSeqAlignment(seqnames, seqs)
mydist <- dist.alignment(alignment_ape)
mydist
Answers
• Q1. The distance matrix is as follows (shown as a figure on the slide). The most distant are Q9YP96(V2) and Q6TFL5(V4), with a genetic distance of 0.33, while the most closely related are Q9YP96(V1) and B0LSS3(V3), with a genetic distance of 0.227
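Reading the closest and most distant pairs off a matrix by eye is error-prone, so a small helper can do it. The sketch below is base R only; the function name extreme_pairs and the toy distances are hypothetical, and the values are not the Dengue results.

```r
# Hypothetical helper: report the closest and most distant pair in a dist object
extreme_pairs <- function(d) {
  m <- as.matrix(d)
  m[upper.tri(m, diag = TRUE)] <- NA   # keep each pair only once
  lo <- which(m == min(m, na.rm = TRUE), arr.ind = TRUE)[1, ]
  hi <- which(m == max(m, na.rm = TRUE), arr.ind = TRUE)[1, ]
  list(closest  = c(rownames(m)[lo["row"]], colnames(m)[lo["col"]]),
       farthest = c(rownames(m)[hi["row"]], colnames(m)[hi["col"]]))
}

# Toy distances between four placeholder sequences (not real data)
labels <- paste0("s", 1:4)
toy <- as.dist(matrix(c(0.0, 0.2, 0.5, 0.6,
                        0.2, 0.0, 0.4, 0.7,
                        0.5, 0.4, 0.0, 0.3,
                        0.6, 0.7, 0.3, 0.0),
                      nrow = 4, dimnames = list(labels, labels)))

res <- extreme_pairs(toy)
print(res)
```

The same call should work on the mydist object from the answer code, since dist.alignment() returns an object of class dist; ties are resolved by taking the first match.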
Answers
Question 2:
library("ape")
mytree <- nj(mydist)
# plot the unrooted tree based on the alignment of whole protein sequences
plot.phylo(mytree, type = "u", edge.color = "blue", edge.width = 3,
           cex = 1.2, no.margin = TRUE, srt = 0)

# trim each sequence to the well-aligned region between the DMGY and GEDG motifs
seqs_trim <- seqs
for (i in 1:length(seqs)) {
  start_pos <- regexpr("DMGY", paste(seqs_trim[[i]], collapse = ""))[1]
  stop_pos <- regexpr("GEDG", paste(seqs_trim[[i]], collapse = ""))[1]
  seqs_trim[[i]] <- seqs_trim[[i]][start_pos:stop_pos]
}
Answers
Question 2 (continued):
alignment_ape <- multipleSeqAlignment(seqnames, seqs_trim)
mydist <- dist.alignment(alignment_ape)
mydist
library("ape")
mytree <- nj(mydist)
# tree based on the best aligned portion
plot.phylo(mytree, type = "u", edge.color = "blue", edge.width = 3,
           cex = 1.2, no.margin = TRUE, srt = 0)
Answers
• The resulting Q2 unrooted tree (figure: built using the full lengths of proteins)
This unrooted tree agrees with the genetic distance matrix calculated in Q1. The tree suggests that B0LSS3 and Q9YP96 are the most closely related proteins.
To improve the quality of the tree, it is best to select a region that has a minimal number of gaps between the protein sequences. In the alignment below there are regions with many gaps. Let’s build another tree based on the bolded (most conserved) region to see if it is the same.
Alignment of proteins:
Q6TFL5 DMGCVVSWNGKELKC…KDQKAVHADMGYWIESSKNQTWQIEKASLIEVKTCLWPKTHTL…GMEIRPLSEKEENMVKSQVTA
Q9YRR4 ------------------------DMGYWIESEKNETWKLARASFIEVKTCIWPKSHTL…GMEI-----------------
Q9YP96 DSGCVVSWKNKELKC…KDNRAVHADMGYWIESALNDTWKIEKASFIEVKNCHWPKSHTL…GMEIRPLKEKEENLVNSLVTA
B0LSS3 --------------------ASHADMGYWIESQKNGSWKLEKASLIEVKTCTWPKSHTL…------------------------

Answers
• The resulting tree looks the same, but we achieved better overall resolution between the proteins
Figure panels: whole protein sequences used; best aligned portion of protein sequences used (built using the bolded region)
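The trimming step used for the second tree can be isolated into a small function. This is a minimal base-R sketch; the function name trim_between and the test string are hypothetical. Unlike the loop in the answer code, which cuts at the first residue of the end motif, this version keeps the whole end motif and returns NA when either motif is missing.

```r
# Hypothetical helper: keep the part of a sequence from the start motif
# through the end of the end motif
trim_between <- function(s, start_motif, end_motif) {
  start <- regexpr(start_motif, s, fixed = TRUE)[1]
  end <- regexpr(end_motif, s, fixed = TRUE)
  # regexpr() returns -1 when the motif is not found
  if (start < 0 || end < 0 || end < start) return(NA_character_)
  substr(s, start, end + attr(end, "match.length") - 1)
}

# Toy sequence (not a real NS1 protein)
trimmed <- trim_between("AAADMGYWIESGEDGKKK", "DMGY", "GEDG")
missing_motif <- trim_between("AAAA", "DMGY", "GEDG")

print(trimmed)
print(missing_motif)
```

Checking for a missing motif matters in practice: with the original loop, a sequence lacking DMGY or GEDG would give an index of -1 and a confusing subsetting error.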
Answers
Question 3:
# Q3: building a rooted tree with Q89277 (yellow fever virus) as the outgroup
library("seqinr")
library("muscle")
library("ape")
seqnames <- c("Q9YRR4", "Q9YP96", "B0LSS3", "Q6TFL5", "Q89277")
choosebank("swissprot")   # selects database for query
seqs <- list()
for (i in 1:length(seqnames)) {
  query <- query(paste("AC=", seqnames[i], sep = ""))
  seqs[i] <- getSequence(query)
}
alignment_ape <- multipleSeqAlignment(seqnames, seqs)
mydist <- dist.alignment(alignment_ape)
mydist
library("ape")
mytree <- nj(mydist)
myrootedtree <- root(mytree, outgroup = "Q89277", r = TRUE)
plot.phylo(myrootedtree, type = "p", edge.color = "blue", edge.width = 3,
           cex = 1.2, no.margin = TRUE, srt = 0)
Answers
• Q3. The rooted tree here is built using the yellow fever virus protein (Q89277) as the outgroup, rather than the Zika virus protein (Q32ZE1) named in the question
• Most closely related viruses:
  • B0LSS3 and Q9YP96
• This rooted tree tells you which of the Dengue virus NS1 proteins branched off the earliest from the ancestors. An unrooted tree does not provide ancestry information (i.e. a time sequence)
Figure label: outgroup
References • Ape library for phylogenetic trees and ancestry with bootstrap methods http://cran.r-project.org/web/packages/ape/ape.pdf