
Trying to reconstruct the history of genes families
Roberto Marangoni*^, Nadia Pisanti*, Paolo Ferragina*^, Antonio Frangioni*, Fabrizio Luccio*^
*Dept. of Informatics, University of Pisa, Italy
^C.I.S.S.C. (Interdisciplinary Center for Complex Systems Study), University of Pisa, Italy



E-mail: marangon@di.unipi.it

1. Evolution, Information and Complexity
• These three concepts are hard to define when referred to a biological organism. We can give only "working" definitions:
  • Evolution: recalling the classic Darwinian definition, it is "descent with modification" (i.e., offspring are not identical to their parents); to date, however, there is no generally accepted definition of biological evolution.
  • Information: for a biological organism, we can define it as the information stored in the genome, although this is not entirely accurate, since the development of an organism is specified not only by the DNA but also by maternal RNA, proteins and other factors.
  • Complexity: a tentative definition follows an "algorithmic" approach. One can ask: how many words are enough to describe a bacterium? And how many to describe a human? Many more words are needed in the latter case, so one can say that a human is more complex than a bacterium.
• Biological evolution has generated more and more complex organisms, and high complexity corresponds to high information content. The general problem for biological evolution therefore becomes:
• How does information increase during evolution?

2. Duplications, gene families and paralogs
• Two kinds of mechanisms described in the literature are able to create new information in genomes:
  • Exogenous mechanisms, like horizontal transfers and transfections, whose final result is the insertion into a genome of a DNA segment coming from another species.
    Although this kind of process is important, its quantitative contribution to a genome is limited.
  • Endogenous mechanisms, mostly represented by duplications. Whole-genome, large-segment, tandem and single-gene duplications have all been described. Duplications make it possible to cluster genomes into gene families. Members of a gene family usually share high sequence homology and, when they are functionally active, perform very similar biological functions: they are called paralog genes, or simply paralogs. Endogenous mechanisms are quantitatively the most important process leading to an increase in genomic information.
• In order to better understand the mechanism by which genomes have increased their size and multiplied their functional capabilities, it is necessary to study the behavior of duplication events. The first step is to investigate the history of a gene family: how many duplication events have occurred, in which order, etc.
• How to reconstruct the history of gene families?

3. Building a paralogy tree
• Reconstructing the history of a gene family, under the hypothesis that every family member derives from the duplication of another member, means arranging the set of members into a tree, which we call a paralogy tree, in which the root represents the most ancient gene of the family and each directed arc represents the matrix-copy relationship in a duplication event.
• This is not a phylogenetic study and this is not a phylogenetic tree!
• Unlike phylogenetic studies, in which one measures the similarity between two or more sequences in order to infer their possible common ancestor, this kind of study needs an asymmetric function to compare two sequences: one able to express which sequence could have been the matrix and which the copy in a hypothetical duplication process.
• This kind of function has to address two basic biological requirements, which derive from the presently known duplications:
  • Copies are usually shorter than matrices, since the insertion of a segment after a duplication is a rare event.
  • Inserting segments has a metabolic cost, while deleting segments has none.
• We have developed a method for paralogy tree construction based on the Transformation Distance (TD) [1] as the basic function to compare two sequences, since it addresses the biological requirements stated above.
• How to obtain the paralogy tree?

4. PaTre
• The method we present, called PaTre, consists of the following steps:
  • Input: all the paralog sequences of a family.
  • Computation of the TD values for each possible pair of paralogs in the input set, and construction of the directed graph (see Fig. 1) that expresses, for each pair, the probability of the matrix-copy/copy-matrix relationship.
  • Extraction of the Lightest Spanning Arborescence (LSA) by means of Edmonds' algorithm [2,3].
  • We take the extracted LSA as the paralogy tree (Fig. 2).
• We ask PaTre to output not only the optimal solution but also the sub-optimal ones, which are useful in what follows.

Fig. 1: example of a directed graph

5. Testing PaTre
Unfortunately, there are no documented histories of gene families in nature, so we used a simulation procedure to test PaTre. We developed a simulator that takes a gene as input and generates (by iterating the duplication-with-modification mechanism) a family of simulated paralogs whose history is, of course, known (Fig. 3).

5b. Testing the simulator
To test how similar the simulated data are to real ones, we ran the simulation on different gene families, starting from different sequences, and then applied a standard clustering algorithm to an input containing both simulated and real sequences.
The results show that, for several of the most widespread gene families, the generated clusters contain both simulated and real sequences, demonstrating a good degree of "mimetic" capability of the simulator. An example is shown in Fig. 4, where the simulated sequences are named "str##" and the real sequences are named "AF…".

Fig. 4: similarity tree computed on simulated and real sequences
Fig. 2: example of PaTre output
Fig. 3: example of simulator output

6. Applications on simulated data
Applying PaTre to simulated families, we always get the correct tree. Fig. 5 compares the simulated tree with the one reconstructed by PaTre: they coincide completely. We have tested PaTre on more than 60 families in 20 different organisms.
PaTre passes the test on simulated data.

7. Using similarity-based algorithms
If we use a simulated family as input for a similarity-based algorithm like ClustalW and try to generate something like a phylogenetic tree from those data, we get a tree that is completely different from the true one. Fig. 6 shows the output of ClustalW obtained on the same input set as the example in Fig. 5: it is not the expected tree.
Similarity-based algorithms are not suitable for reconstructing the history of gene families.

Fig. 5: the simulated paralogy tree for the Ribosomal Proteins family of M. pneumoniae, and the output from PaTre for the same simulated family
Fig. 6: the tree reconstructed by ClustalW on the same data as Fig. 5

8. Applications to real cases
• We have applied PaTre to some real cases in which experimental evidence has suggested the possible history of gene families. In particular, we have tested:
  • Bacterial duplications, in which PaTre has always identified a duplication process linking two genes known to be duplicates.
  • The Shaggy/GSK3 family in Arabidopsis thaliana, where the evidence of some duplication events [4] has been confirmed by the paralogy tree reconstructed by PaTre (Fig. 7a and 7b).
• The reliability of PaTre is also supported by experimental evidence.

Fig. 7: a) the Shaggy/GSK3 family of Arabidopsis thaliana in a similarity tree: the clouds identify duplication events; b) the paralogy tree reconstructed by PaTre

9. Open problems
• There are still several open problems concerning, in particular:
  • a detailed study of the robustness of PaTre;
  • a method to take Steiner points into account;
  • the design of an optimal distance to use in the all-against-all comparison.
• Further work is required, of course.

10. Future developments
• Tree comparisons:
  • the probability of choosing a gene as the matrix for a new duplication: does it depend on the gene "age" or not? (different answers lead to different trees…);
  • use of paralogy trees built for the same family in different organisms to extract phylogenetic information.
• If we have grants, we will do everything!

References
[1] J.S. Varré, J.P. Delahaye, E. Rivals, The Transformation Distance: a dissimilarity measure based on movements of segments, German Conference on Bioinformatics, Köln, 1998.
[2] J. Edmonds, Optimum branchings, J. Res. Nat. Bur. Standards, 71B, 233-240, 1967.
[3] R.E. Tarjan, Finding optimum branchings, Networks, 7, 25-35, 1977.
[4] R. Tavares, Contribution à la caractérisation de la sous-famille des protéines sérine/thréonine kinases du type SHAGGY/GSK-3 chez Arabidopsis thaliana, University of Paris-Sud, 2000.
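
Illustrative sketch (not part of the original poster)
The graph-construction and LSA-extraction steps of PaTre (section 4) can be illustrated with a small Python sketch. It is not the authors' implementation: the arc weights are made-up numbers standing in for TD values, the function names (find_cycle, min_arborescence, lightest_spanning_arborescence) are hypothetical, and trying every gene as root and keeping the lightest result is an assumption, since the poster does not say how the root is chosen. The recursion follows the usual contract-and-expand formulation of the Chu-Liu/Edmonds algorithm [2,3].

```python
def find_cycle(parent):
    """Return the nodes of one cycle in a {child: parent} map, or None."""
    for start in parent:
        path, seen, v = [], set(), start
        while v in parent and v not in seen:
            seen.add(v)
            path.append(v)
            v = parent[v]
        if v in seen:
            return path[path.index(v):]
    return None


def min_arborescence(nodes, arcs, root):
    """Cheapest spanning arborescence rooted at `root` (Chu-Liu/Edmonds).

    arcs maps (matrix, copy) -> weight. Returns {copy: matrix} or None.
    """
    # Step 1: cheapest incoming arc for every non-root node.
    best = {}
    for (u, v), w in arcs.items():
        if v != root and u != v and (v not in best or w < arcs[(best[v], v)]):
            best[v] = u
    if any(v not in best for v in nodes if v != root):
        return None  # some gene cannot be reached from this root
    cycle = find_cycle(best)
    if cycle is None:
        return best
    # Step 2: contract the cycle into one super node and re-weight arcs.
    in_cycle, super_node = set(cycle), tuple(cycle)
    new_arcs, origin = {}, {}
    for (u, v), w in arcs.items():
        if u in in_cycle and v in in_cycle:
            continue
        if v in in_cycle:
            key, cost = (u, super_node), w - arcs[(best[v], v)]
        elif u in in_cycle:
            key, cost = (super_node, v), w
        else:
            key, cost = (u, v), w
        if key not in new_arcs or cost < new_arcs[key]:
            new_arcs[key], origin[key] = cost, (u, v)
    new_nodes = [n for n in nodes if n not in in_cycle] + [super_node]
    sub = min_arborescence(new_nodes, new_arcs, root)
    if sub is None:
        return None
    # Step 3: expand. Cycle arcs are kept except the one displaced by the
    # arc that enters the cycle from outside.
    tree = {v: best[v] for v in cycle}
    for v, u in sub.items():
        orig_u, orig_v = origin[(u, v)]
        tree[orig_v] = orig_u
    return tree


def lightest_spanning_arborescence(nodes, arcs):
    """Assumption: the root is unknown, so try each gene and keep the lightest."""
    candidates = []
    for root in nodes:
        tree = min_arborescence(nodes, arcs, root)
        if tree is not None:
            weight = sum(arcs[(u, v)] for v, u in tree.items())
            candidates.append((weight, root, tree))
    return min(candidates, key=lambda c: c[0])


# Hypothetical TD-like weights: td[(x, y)] is the cost of "x is the matrix of y".
genes = ["A", "B", "C", "D"]
td = {
    ("A", "B"): 2, ("B", "A"): 5, ("A", "C"): 3, ("C", "A"): 6,
    ("A", "D"): 7, ("D", "A"): 8, ("B", "C"): 4, ("C", "B"): 4,
    ("B", "D"): 6, ("D", "B"): 6, ("C", "D"): 1, ("D", "C"): 2,
}
weight, root, tree = lightest_spanning_arborescence(genes, td)
print(root, weight, tree)
```

With these toy weights the sketch selects A as the root, with arcs A→B, A→C and C→D (total weight 6). The cheapest incoming arcs alone would create the cycle C↔D, which is why a contraction step is needed rather than a purely greedy choice.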
