HOGENOM a phylogenomic database

HOGENOM a phylogenomicdatabase Simon Penel, Pascal Calvat,Jean-FrancoisDufayard, Vincent Daubin, Laurent Duret,Manolo Gouy, Dominique Guyot,Daniel Kahn, Vincent Miele,Vincent Navratil, Guy Perrière, Rémi Planel

Severalphylogenomicdatabasesdevelopedat LBBE/PRABI • HOVERGEN • VerterbrateProteinsfromUniProt • ClusteringwithSiLiX • HOMOLENS • ProteinsfromEnsembl Complete Genomes • ClusteringfromEnsembl • Treescalculated and annoated (S,D,L) with new methods (PhylDog,LBBE) • HOGENOM • Proteinsfrom all availablecompletegenomes • (Bacteria, Eukaroyota, Archaea) • ClusteringwithSiLiX and post-processingwithHiFiX • Treeswillbeannotated (S,D,L,T)

HOGENOM characteristics • all completegenomesfrom the wholetree of life (not restricted to particular phylum) • Propose « genefamilies » : full lengthhomologoussequences (different of « domainfamilies »)

Domain vs.genefamilies Proteindomainfamily Families of homologous protein domains (ProDom): - Evolution by domain shuffling (duplication, loss, translocation)

Domain vs.genefamilies Homologousgenefamily Families of homologous protein domains (ProDom): - Evolution by domain shuffling (duplication, loss, translocation) Homologous Gene families (HOGENOM): - Evolution of homologous genes by speciation or by gene duplication, or horizontal transfer - Sequences are homologous over their entire length (or almost)

Orthologs and paralogs in HOGENOM HOGENOM is centered on phylogenetic trees of gene families. Information on orthologs and paralogs can be deduced from gene trees: - from the annotation of gene trees (Duplication, Speciation, Transfer) - from query tools such as tree-pattern matching

Building Compare all proteinsagainsteachother (BLAST) Cluster homologoussequencesintofamilies (SILIX + HIFIX) Compute multiple alignments for eachfamily Computephylogenetictrees for eachfamily Annotatephylogenetictrees (gene duplications, losses, transfers)

Compare all proteinsagainsteachother • Iterative BLAST calculation • Use of a non-redundantproteinsequencedatabase … (all know proteins , about 20,000,000 non redondant sequences) … associatedwitha resultingBLAST hits database (fromwhich blast hits maybeextracted) • Cluster, grid and cloudcomputing

 Local pairwise alignments Proteindatabase SiLiX 1ststep : similaritysearch BLASTP BLOSUM62 E ≤ 10-4

SiLiX 2ndstep : SiLiXclustering Use the all-against-all BLAST hits

S3 S1 S2 S4 Seq. A Seq. B S1’ S2 Seq. A Seq. B ∆lg3 ∆lg1 lgHSP1 ∆lg2 lgHSP2 SiLiX SiLiX : Selection of consistent HSPs

A A B C HSP ≥ 80% length Identity ≥ 35% A Cluster A, B, C B C SiLiX • SiLiX : single linkage clustering

SiLiX SiLiX : single linkage clusteringwithalignmentcoverageconstraints (Mièle et al. BMC Bioinformatics 2011) • Computing efficiency: • Ultra-fast • Memory efficient • Scalable (parallel architecture) • Clustering quality: • At least as good as the previously published methods

However … • Because of over-extension of BLAST alignments, some sequences that share only partial homology may be clustered in a same family • The risk of alignment over-extension is low, but becomes a problem for very large protein families • Use more stringent clustering criteria ? No : optimal clustering criteria are not the same for all families

HiFiX • The mode and tempo of evolutionisspecific to eachproteinfamily • A multiple alignmentprovides information about the specific pattern of evolution of a family • => thiscanbeused to decidewhether or not a new sequencebelongs to thatfamily

HiFiX • Step 1: rapidclustering (SiLiX) • pre-families • Step2: sub-clustering of pre-families into homogeneousproteinclusters • sub-families • Step3: progressive merging of sub-familiesintofamilies, withevaluation of multiple alignmentqualityateachstep • families

HiFiX

Results of clustering NumberSequencesNumber of Families • at least 2 296,920 • 2:10 242,398 • 10:500 53,450 • 500:2000 1,026 • more than 2000 79 About 7,000,000 proteinsclusteredinto 300,000 families Family size distribution:

Compute multiple alignments All alignments ( ~ 300, 000) have been calculated withClustalΩ

Computephylogenetictree Question: what about the alternative splicing ?

Alternative splicing In eukaryotes, due to alternative splicing , one unique genemaybebetranscriptedintoseveraltranscripts

Transcripts in HOGENOM6 Weselectedall the transcripts for eachgene. Becausethe longesttranscriptis not allways the best!

Because: Selection of a representaitiveisoform in HOGENOM Wedon’twantseveralproteins for a samegene in a phylogenetictree: maybeseen as a duplication Wewant 1 protein per gene for statisticcomparison amongorganisms

Selection of a representaitiveisoform : how ?

Selection of a representativeisoform : how ? Eukarya 1 or more transcripts per gene Archaea and bacteria 1 transcript per gene

Selection of a representativeisoform : how ? Eukarya clustering Archaea and bacteria

First step:when a gene has isoforms in differentfamilies ( ), choose a family for the gene Selection of a representativeisoform : how ?

We select the familywith the highestnumber of eukaryoticgenes (and not proteins) 1 1 1 Selection of a representativeisoform : how ? 2 2 2 3 2 genes 2 genes 3 genes

We select the familywith the highestnumber of eukaryoticgenes (and not proteins) 1 1 1 Selection of a representativeisoform : how ? 2 2 If the number of eukaryoticgenesare identical, we select the familywith the highestnumber of eukaryoticproteins 2 3 2 genes 2 genes 3 genes

We select the familywith the highestnumber of eukaryoticgenes (and not proteins) 1 1 1 Selection of a representativeisoform : how ? 2 2 If the number of eukaryoticgenesare identical, we select the familywith the highestnumber of eukaryoticproteins 2 3 If the number of eukaryoticproteinsare identical, we select the familywith the highestnumber of proteins 2 genes 2 genes 3 genes

We select the familywith the highestnumber of eukaryoticgenes (and not proteins) 1 1 1 Selection of a representativeisoform : how ? 2 2 If the number of eukaryoticgenesare identical, we select the familywith the highestnumber of eukaryoticproteins 2 3 If the number of eukaryoticproteinsare identical, we select the familywith the highestnumber of proteins 2 genes 2 genes 3 genes The « rejected » isoforms are called « ISOFORMEX » SOME FAMILIES MAY FINALLY BE EMPTY AFTER THIS

Second step:when a gene has isoforms in a family, choose a representativeisoform for the gene 1 1 1 Selection of a representativeisoform : how ? 2 2 2 3 2 genes 2 genes 3 genes

Second step: when a gene has isoforms in a family, choose a representativeisoform for the gene 1 1 1 Selection of a representativeisoform : how ? 2 2 2 3 2 genes ? 2 genes ? 3 genes

We use the alignment Selection of a representativeisoform : how ?

We use the alignment Selection of a representativeisoform : how ? Suppression of ISOFORMEX

We use the alignment Selection of a representativeisoform : how ? Selection positions with < 50% gap

Selection of a representativeisoform : how ? For each isoform of a given gene, for each position, we count for 1 each time the residue is identical to the residue in at least one of the isoforms of all other eukaryotic genes. The isoform with the highest total is selected, the other isoforms being tagged as ISOFORMIN 1 2 2 2

Selection of a representativeisoform : how ? For each isoform of a given gene, for each position, we count for 1 each time the residue is identical to the residue in at least one of the isoforms of all other eukaryotic genes. The isoform with the highest total is selected, the other isoforms being tagged as ISOFORMIN 1 2 1 2 2

Selection of a representativeisoform : how ? For each isoform of a given gene, for each position, we count for 1 each time the residue is identical to the residue in at least one of the isoforms of all other eukaryotic genes. The isoform with the highest total is selected, the other isoforms being tagged as ISOFORMIN 1 2 1 2 2 2 2

Treecalculation

Treecalculation isformin isformin a b c isformin d isformex e f g

Gblocks Phyml, FastTree Treecalculation d isformin a isformin e f g b a c b c d isformex e f g

Annotatephylogenetictrees • Severalmethods are currentlydeveloped in the ANCESTROM project • Speciation, Duplication and Loss • Speciation, Duplication, Transfert and Loss • See Vincent Daubin talk tomorow

HOGENOM a phylogenomic database

HOGENOM a phylogenomic database

Presentation Transcript

Structural Phylogenomic Analysis

Using a database

Creating a Database

Creating a Database

Managing a Database

Creating A Database

Building a Database

Creating a database

DESIGNING A DATABASE

Searching a Database

Designing a database

The Evolution of Microbes and Their Genomes: A Phylogenomic Perspective

QUERYING A DATABASE

Querying a Database

First release of HOGENOM, a database of homologous genes from complete genome

Creating a Database

Creating a Database

Creating a Database

A Database of

Structural Phylogenomic Analysis

Maintaining A Database

Administrating a Database