610 likes | 821 Views
HOGENOM a phylogenomic database. Simon Penel , Pascal Calvat, Jean-Francois Dufayard , Vincent Daubin , Laurent Duret , Manolo Gouy , Dominique Guyot, Daniel Kahn, Vincent Miele, Vincent Navratil , Guy Perrière, Rémi Planel. Several phylogenomic databases developed at LBBE/PRABI.
E N D
HOGENOM a phylogenomicdatabase Simon Penel, Pascal Calvat,Jean-FrancoisDufayard, Vincent Daubin, Laurent Duret,Manolo Gouy, Dominique Guyot,Daniel Kahn, Vincent Miele,Vincent Navratil, Guy Perrière, Rémi Planel
Severalphylogenomicdatabasesdevelopedat LBBE/PRABI • HOVERGEN • VerterbrateProteinsfromUniProt • ClusteringwithSiLiX • HOMOLENS • ProteinsfromEnsembl Complete Genomes • ClusteringfromEnsembl • Treescalculated and annoated (S,D,L) with new methods (PhylDog,LBBE) • HOGENOM • Proteinsfrom all availablecompletegenomes • (Bacteria, Eukaroyota, Archaea) • ClusteringwithSiLiX and post-processingwithHiFiX • Treeswillbeannotated (S,D,L,T)
HOGENOM characteristics • all completegenomesfrom the wholetree of life (not restricted to particular phylum) • Propose « genefamilies » : full lengthhomologoussequences (different of « domainfamilies »)
Domain vs.genefamilies Proteindomainfamily Families of homologous protein domains (ProDom): - Evolution by domain shuffling (duplication, loss, translocation)
Domain vs.genefamilies Homologousgenefamily Families of homologous protein domains (ProDom): - Evolution by domain shuffling (duplication, loss, translocation) Homologous Gene families (HOGENOM): - Evolution of homologous genes by speciation or by gene duplication, or horizontal transfer - Sequences are homologous over their entire length (or almost)
Orthologs and paralogs in HOGENOM HOGENOM is centered on phylogenetic trees of gene families. Information on orthologs and paralogs can be deduced from gene trees: - from the annotation of gene trees (Duplication, Speciation, Transfer) - from query tools such as tree-pattern matching
Building Compare all proteinsagainsteachother (BLAST) Cluster homologoussequencesintofamilies (SILIX + HIFIX) Compute multiple alignments for eachfamily Computephylogenetictrees for eachfamily Annotatephylogenetictrees (gene duplications, losses, transfers)
Compare all proteinsagainsteachother • Iterative BLAST calculation • Use of a non-redundantproteinsequencedatabase … (all know proteins , about 20,000,000 non redondant sequences) … associatedwitha resultingBLAST hits database (fromwhich blast hits maybeextracted) • Cluster, grid and cloudcomputing
Building Compare all proteinsagainsteachother (BLAST) Cluster homologoussequencesintofamilies (SILIX + HIFIX) Compute multiple alignments for eachfamily Computephylogenetictrees for eachfamily Annotatephylogenetictrees (gene duplications, losses, transfers)
Local pairwise alignments Proteindatabase SiLiX 1ststep : similaritysearch BLASTP BLOSUM62 E ≤ 10-4
SiLiX 2ndstep : SiLiXclustering Use the all-against-all BLAST hits
S3 S1 S2 S4 Seq. A Seq. B S1’ S2 Seq. A Seq. B ∆lg3 ∆lg1 lgHSP1 ∆lg2 lgHSP2 SiLiX SiLiX : Selection of consistent HSPs
A A B C HSP ≥ 80% length Identity ≥ 35% A Cluster A, B, C B C SiLiX • SiLiX : single linkage clustering
SiLiX SiLiX : single linkage clusteringwithalignmentcoverageconstraints (Mièle et al. BMC Bioinformatics 2011) • Computing efficiency: • Ultra-fast • Memory efficient • Scalable (parallel architecture) • Clustering quality: • At least as good as the previously published methods
However … • Because of over-extension of BLAST alignments, some sequences that share only partial homology may be clustered in a same family • The risk of alignment over-extension is low, but becomes a problem for very large protein families • Use more stringent clustering criteria ? No : optimal clustering criteria are not the same for all families
HiFiX • The mode and tempo of evolutionisspecific to eachproteinfamily • A multiple alignmentprovides information about the specific pattern of evolution of a family • => thiscanbeused to decidewhether or not a new sequencebelongs to thatfamily
HiFiX • Step 1: rapidclustering (SiLiX) • pre-families • Step2: sub-clustering of pre-families into homogeneousproteinclusters • sub-families • Step3: progressive merging of sub-familiesintofamilies, withevaluation of multiple alignmentqualityateachstep • families
Results of clustering NumberSequencesNumber of Families • at least 2 296,920 • 2:10 242,398 • 10:500 53,450 • 500:2000 1,026 • more than 2000 79 About 7,000,000 proteinsclusteredinto 300,000 families Family size distribution:
Building Compare all proteinsagainsteachother (BLAST) Cluster homologoussequencesintofamilies (SILIX + HIFIX) Compute multiple alignments for eachfamily Computephylogenetictrees for eachfamily Annotatephylogenetictrees (gene duplications, losses, transfers)
Compute multiple alignments All alignments ( ~ 300, 000) have been calculated withClustalΩ
Building Compare all proteinsagainsteachother (BLAST) Cluster homologoussequencesintofamilies (SILIX + HIFIX) Compute multiple alignments for eachfamily Computephylogenetictrees for eachfamily Annotatephylogenetictrees (gene duplications, losses, transfers)
Computephylogenetictree Question: what about the alternative splicing ?
Alternative splicing In eukaryotes, due to alternative splicing , one unique genemaybebetranscriptedintoseveraltranscripts
Transcripts in HOGENOM6 Weselectedall the transcripts for eachgene. Becausethe longesttranscriptis not allways the best!
Because: Selection of a representaitiveisoform in HOGENOM Wedon’twantseveralproteins for a samegene in a phylogenetictree: maybeseen as a duplication Wewant 1 protein per gene for statisticcomparison amongorganisms
Selection of a representativeisoform : how ? Eukarya 1 or more transcripts per gene Archaea and bacteria 1 transcript per gene
Selection of a representativeisoform : how ? Eukarya clustering Archaea and bacteria
First step:when a gene has isoforms in differentfamilies ( ), choose a family for the gene Selection of a representativeisoform : how ?
We select the familywith the highestnumber of eukaryoticgenes (and not proteins) 1 1 1 Selection of a representativeisoform : how ? 2 2 2 3 2 genes 2 genes 3 genes
We select the familywith the highestnumber of eukaryoticgenes (and not proteins) 1 1 1 Selection of a representativeisoform : how ? 2 2 If the number of eukaryoticgenesare identical, we select the familywith the highestnumber of eukaryoticproteins 2 3 2 genes 2 genes 3 genes
We select the familywith the highestnumber of eukaryoticgenes (and not proteins) 1 1 1 Selection of a representativeisoform : how ? 2 2 If the number of eukaryoticgenesare identical, we select the familywith the highestnumber of eukaryoticproteins 2 3 If the number of eukaryoticproteinsare identical, we select the familywith the highestnumber of proteins 2 genes 2 genes 3 genes
We select the familywith the highestnumber of eukaryoticgenes (and not proteins) 1 1 1 Selection of a representativeisoform : how ? 2 2 If the number of eukaryoticgenesare identical, we select the familywith the highestnumber of eukaryoticproteins 2 3 If the number of eukaryoticproteinsare identical, we select the familywith the highestnumber of proteins 2 genes 2 genes 3 genes The « rejected » isoforms are called « ISOFORMEX » SOME FAMILIES MAY FINALLY BE EMPTY AFTER THIS
Second step:when a gene has isoforms in a family, choose a representativeisoform for the gene 1 1 1 Selection of a representativeisoform : how ? 2 2 2 3 2 genes 2 genes 3 genes
Second step: when a gene has isoforms in a family, choose a representativeisoform for the gene 1 1 1 Selection of a representativeisoform : how ? 2 2 2 3 2 genes ? 2 genes ? 3 genes
We use the alignment Selection of a representativeisoform : how ?
We use the alignment Selection of a representativeisoform : how ? Suppression of ISOFORMEX
We use the alignment Selection of a representativeisoform : how ? Selection positions with < 50% gap
Selection of a representativeisoform : how ? For each isoform of a given gene, for each position, we count for 1 each time the residue is identical to the residue in at least one of the isoforms of all other eukaryotic genes. The isoform with the highest total is selected, the other isoforms being tagged as ISOFORMIN 1 2 2 2
Selection of a representativeisoform : how ? For each isoform of a given gene, for each position, we count for 1 each time the residue is identical to the residue in at least one of the isoforms of all other eukaryotic genes. The isoform with the highest total is selected, the other isoforms being tagged as ISOFORMIN 1 2 1 2 2
Selection of a representativeisoform : how ? For each isoform of a given gene, for each position, we count for 1 each time the residue is identical to the residue in at least one of the isoforms of all other eukaryotic genes. The isoform with the highest total is selected, the other isoforms being tagged as ISOFORMIN 1 2 1 2 2 2 2
Treecalculation isformin isformin a b c isformin d isformex e f g
Treecalculation isformin isformin a b c isformin d isformex e f g
Gblocks Phyml, FastTree Treecalculation d isformin a isformin e f g b a c b c d isformex e f g
Building Compare all proteinsagainsteachother (BLAST) Cluster homologoussequencesintofamilies (SILIX + HIFIX) Compute multiple alignments for eachfamily Computephylogenetictrees for eachfamily Annotatephylogenetictrees (gene duplications, losses, transfers)
Annotatephylogenetictrees • Severalmethods are currentlydeveloped in the ANCESTROM project • Speciation, Duplication and Loss • Speciation, Duplication, Transfert and Loss • See Vincent Daubin talk tomorow