430 likes | 528 Views
Identify Archaeal and Bacterial Phylogenetic Markers. Dongying Wu. We have more than 700 compete genome sequences: Select 100 representatives Build gene families Identify families that present in all organisms with equal numbers
E N D
Identify Archaeal and Bacterial Phylogenetic Markers Dongying Wu
We have more than 700 compete genome sequences: • Select 100 representatives • Build gene families • Identify families that present in all organisms with equal numbers • Hmm building and phylogenetic analysis to identify the true makers Firmicutes Proteobacteria Phylogenetic Tree of Bacteria (built from 31 concatenate marker alignments)
Gene Family Classification Blastp: E value cutoff 1e-10, report 10000 hits Only blastp hits that span 80% of the lengths of both genes are kept as links 313,139 genes from 100 genomes => 28,710,015 links MCL Clustering Algorithm Links (matrix of sequence similarities) Expansion Inflation (I=2) equilibrium state
73686 Singletons, 23336 families(239453 genes) • Rules for Families of Markers: • The family has to cover all 100 genomes (high universality) • Each genome has to have equal numbers (high evenness) Ni: the number of the gene family members from the genome i; Nm: the medium of Ni of the 100 genomes; Ng: the total genome number; Universality is the genome number a family involves
Phylogenetic Marker Identification Out of the 502 families with high universality: * 31 phylogenetic markers from AMPHORA * 39 marker candidates with high evenness number (>=80) (25 families are either single copied in each genome or double copied in one genome that co-branched in phylogenetic trees) Build PHYML trees with the AMPHORA markers and 25 marker candidates, and compare the tree topologies with the genome tree
NODAL distance (TOPD/FMTS)
Split (Robinson-Foulds) Distance (TOPD/FMTS) G F A F A G bad edge bad edge bad edge bad edge good edge good edge good edge good edge E E B B D D C C ratio of the internal edges being bad (0-1)
Distances between gene trees and the AMPHORA concatenated genome tree rpmA coaE coaE rpmA trmD rplL rpsS rpsQ radA rplR rplD rplQ tsf rpsH frr smpB ttf rpsO rplR rplP rplM rpsS rplI rplV rpsB rplT rpsO rplO mraW rpsP rpsH rpsK rplQ rplU rplL tsf rplT trmD rplE rplS rpsP ttf rplC rpsI rplV mraW rplS rpsL infC rpsG rpsM rplM rplO rplI rplU pyrH rpsL rpsM rpsQ ruvA guaA radA rpsG purA smpB rplK priA rplD rpsK infC rplK rplC serS rplE rplA rplA rplF frr ruvA rplF rpsC serS rplN rplN rplP guaA rpsE ruvB pyrH rpsB rpsI rpsJ secY rRNA16S rpsJ secY purA rplB rplB priA nusA rpsE ruvB rpsC rRNA16S nusA 0 1 2 3 4 5 6 0 0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9 NODAL distance SPLIT distance AMPHORA marker Ribosomal protein Transcription/translation related protein DNA repair protein Protein of other function Distance between the genome tree and 100 random trees (average standard deviation)
SACCHAROPHAGUS DEGRADANS 2 40 PSEUDOMONAS SYRINGAE PV SYRINGAE B728A HAHELLA CHEJUENSIS KCTC 2396 Genome Tree rpmA NITROSOCOCCUS OCEANI ATCC 19707 ACINETOBACTER SP ADP1 PSYCHROMONAS INGRAHAMII 37 ALCANIVORAX BORKUMENSIS SK2 THIOMICROSPIRA CRUNOGENA XCL 2 ESCHERICHIA COLI K12 COLWELLIA PSYCHRERYTHRAEA 34H PSYCHROMONAS INGRAHAMII 37 ESCHERICHIA COLI K12 COLWELLIA PSYCHRERYTHRAEA 34H gamma SACCHAROPHAGUS DEGRADANS 2 40 THIOMICROSPIRA CRUNOGENA XCL 2 METHYLOCOCCUS CAPSULATUS STR BATH XYLELLA FASTIDIOSA TEMECULA1 HAHELLA CHEJUENSIS KCTC 2396 NITROSOCOCCUS OCEANI ATCC 19707 ALCANIVORAX BORKUMENSIS SK2 METHYLOCOCCUS CAPSULATUS STR BATH PSEUDOMONAS SYRINGAE PV SYRINGAE B728A LEGIONELLA PNEUMOPHILA STR LENS NEISSERIA MENINGITIDIS Z2491 COXIELLA BURNETII RSA 493 BURKHOLDERIA MALLEI ATCC 23344 1 FRANCISELLA TULARENSIS SUBSP TULARENSIS FSC198 NITROSOMONAS EUTROPHA C91 NEISSERIA MENINGITIDIS Z2491 ZYMOMONAS MOBILIS SUBSP MOBILIS ZM4 NITROSOMONAS EUTROPHA C91 beta MAGNETOSPIRILLUM MAGNETICUM AMB 1 METHYLIBIUM PETROLEIPHILUM PM1 ROSEOBACTER DENITRIFICANS OCH 114 BURKHOLDERIA MALLEI ATCC 23344 NITROBACTER HAMBURGENSIS X14 MAGNETOSPIRILLUM MAGNETICUM AMB 1 GLUCONOBACTER OXYDANS 621H GLUCONOBACTER OXYDANS 621H CAULOBACTER CRESCENTUS CB15 ZYMOMONAS MOBILIS SUBSP MOBILIS ZM4 BARTONELLA HENSELAE STR HOUSTON 1 alpha BARTONELLA HENSELAE STR HOUSTON 1 CANDIDATUS PROTOCHLAMYDIA AMOEBOPHILA UWE25 CYTOPHAGA HUTCHINSONII ATCC 33406 NITROBACTER HAMBURGENSIS X14 PORPHYROMONAS GINGIVALIS W83 ROSEOBACTER DENITRIFICANS OCH 114 GRAMELLA FORSETII KT0803 CAULOBACTER CRESCENTUS CB15 COXIELLA BURNETII RSA 493 HYPHOMONAS NEPTUNIUM ATCC 15444 LEGIONELLA PNEUMOPHILA STR LENS MYXOCOCCUS XANTHUS DK 1622 FRANCISELLA TULARENSIS SUBSP TULARENSIS FSC198 SORANGIUM CELLULOSUM SO CE 56 HYPHOMONAS NEPTUNIUM ATCC 15444 BDELLOVIBRIO BACTERIOVORUS HD100 delta METHYLIBIUM PETROLEIPHILUM PM1 GEOBACTER SULFURREDUCENS PCA ACINETOBACTER SP ADP1 SYNTROPHUS ACIDITROPHICUS SB XYLELLA FASTIDIOSA TEMECULA1 DESULFOTALEA PSYCHROPHILA LSV54 PROCHLOROCOCCUS MARINUS SUBSP PASTORIS STR CCMP1986 DESULFOVIBRIO VULGARIS SUBSP VULGARIS STR HILDENBOROUGH SALINIBACTER RUBER DSM 13855 NITRATIRUPTOR SP SB155 2 BIFIDOBACTERIUM LONGUM NCC2705 SULFURIMONAS DENITRIFICANS DSM 1251 STREPTOMYCES COELICOLOR A3 2 ARCOBACTER BUTZLERI RM4018 ARTHROBACTER AURESCENS TC1 SULFUROVUM SP NBC37 1 Epsilon CLAVIBACTER MICHIGANENSIS SUBSP MICHIGANENSIS NCPPB 382 CAMPYLOBACTER JEJUNI SUBSP JEJUNI NCTC 11168 NOCARDIA FARCINICA IFM 10152 HELICOBACTER HEPATICUS ATCC 51449 SALINISPORA TROPICA CNB 440 HELICOBACTER PYLORI J99 FRANKIA ALNI ACN14A TREPONEMA DENTICOLA ATCC 35405 CORYNEBACTERIUM EFFICIENS YS 314 Spirochaetes LEPTOSPIRA BORGPETERSENII SEROVAR HARDJO BOVIS L550 PROPIONIBACTERIUM ACNES KPA171202 Planctomycetes RHODOPIRELLULA BALTICA SH 1 CHLOROBIUM TEPIDUM TLS CANDIDATUS PROTOCHLAMYDIA AMOEBOPHILA UWE25 TREPONEMA DENTICOLA ATCC 35405 Chlamydiae LEPTOSPIRA BORGPETERSENII SEROVAR HARDJO BOVIS L550 1 CHLOROBIUM TEPIDUM TLS Chlorobi THERMOTOGA MARITIMA MSB8 SALINIBACTER RUBER DSM 13855 AQUIFEX AEOLICUS VF5 CYTOPHAGA HUTCHINSONII ATCC 33406 THERMUS THERMOPHILUS HB27 Bacteroidetes PORPHYROMONAS GINGIVALIS W83 HELICOBACTER PYLORI J99 GRAMELLA FORSETII KT0803 HELICOBACTER HEPATICUS ATCC 51449 Fusobacteria FUSOBACTERIUM NUCLEATUM SUBSP NUCLEATUM ATCC 25586 ARCOBACTER BUTZLERI RM4018 CORYNEBACTERIUM EFFICIENS YS 314 NITRATIRUPTOR SP SB155 2 NOCARDIA FARCINICA IFM 10152 CAMPYLOBACTER JEJUNI SUBSP JEJUNI NCTC 11168 SALINISPORA TROPICA CNB 440 Actinobacteria SULFUROVUM SP NBC37 1 FRANKIA ALNI ACN14A SULFURIMONAS DENITRIFICANS DSM 1251 STREPTOMYCES COELICOLOR A3 2 DEHALOCOCCOIDES SP CBDB1 ARTHROBACTER AURESCENS TC1 BDELLOVIBRIO BACTERIOVORUS HD100 CLAVIBACTER MICHIGANENSIS SUBSP MICHIGANENSIS NCPPB 382 MYXOCOCCUS XANTHUS DK 1622 PROPIONIBACTERIUM ACNES KPA171202 SORANGIUM CELLULOSUM SO CE 56 BIFIDOBACTERIUM LONGUM NCC2705 SYNTROPHUS ACIDITROPHICUS SB PROCHLOROCOCCUS MARINUS SUBSP PASTORIS STR CCMP1986 RHODOPIRELLULA BALTICA SH 1 SYNECHOCYSTIS SP PCC 6803 Cyanobacteria DESULFOTALEA PSYCHROPHILA LSV54 GLOEOBACTER VIOLACEUS PCC 7421 DESULFOVIBRIO VULGARIS SUBSP VULGARIS STR HILDENBOROUGH DEHALOCOCCOIDES SP CBDB1 CHLOROFLEXUS AURANTIACUS J 10 FL Chloroflexi GEOBACTER SULFURREDUCENS PCA CHLOROFLEXUS AURANTIACUS J 10 FL CARBOXYDOTHERMUS HYDROGENOFORMANS Z 2901 GLOEOBACTER VIOLACEUS PCC 7421 SYNECHOCYSTIS SP PCC 6803 PELOTOMACULUM THERMOPROPIONICUM SI CLOSTRIDIUM DIFFICILE 630 DESULFITOBACTERIUM HAFNIENSE Y51 SYMBIOBACTERIUM THERMOPHILUM IAM 14863 SYMBIOBACTERIUM THERMOPHILUM IAM 14863 DESULFITOBACTERIUM HAFNIENSE Y51 CLOSTRIDIUM DIFFICILE 630 THERMOANAEROBACTER TENGCONGENSIS MB4 CLOSTRIDIUM KLUYVERI DSM 555 CARBOXYDOTHERMUS HYDROGENOFORMANS Z 2901 THERMOANAEROBACTER TENGCONGENSIS MB4 Firmicutes PELOTOMACULUM THERMOPROPIONICUM SI BACILLUS LICHENIFORMIS ATCC 14580 CLOSTRIDIUM KLUYVERI DSM 555 OCEANOBACILLUS IHEYENSIS HTE831 OENOCOCCUS OENI PSU 1 STAPHYLOCOCCUS SAPROPHYTICUS SUBSP SAPROPHYTICUS ATCC 15305 LACTOBACILLUS HELVETICUS DPC 4571 LISTERIA WELSHIMERI SEROVAR 6B STR SLCC5334 LACTOBACILLUS REUTERI F275 LACTOCOCCUS LACTIS SUBSP CREMORIS SK11 LACTOBACILLUS CASEI ATCC 334 LACTOBACILLUS CASEI ATCC 334 LACTOCOCCUS LACTIS SUBSP CREMORIS SK11 LACTOBACILLUS HELVETICUS DPC 4571 LISTERIA WELSHIMERI SEROVAR 6B STR SLCC5334 LACTOBACILLUS REUTERI F275 BACILLUS LICHENIFORMIS ATCC 14580 OENOCOCCUS OENI PSU 1 OCEANOBACILLUS IHEYENSIS HTE831 THERMUS THERMOPHILUS HB27 STAPHYLOCOCCUS SAPROPHYTICUS SUBSP SAPROPHYTICUS ATCC 15305 DEINOCOCCUS RADIODURANS R1 FUSOBACTERIUM NUCLEATUM SUBSP NUCLEATUM ATCC 25586 AQUIFEX AEOLICUS VF5 DEINOCOCCUS RADIODURANS R1 1 THERMOTOGA MARITIMA MSB8 0.2 0.2
Better Tree Comparison is Needed Not all edges are equal We need to know how a marker performs at different taxonomic levels and groups
Actinobacteria: 47 pre-GEBA genomes 26 GEBA genomes(16 completed) Only 63 competed actinobacterial genomes are included in this study Basic rules: Every genome should have only one copy from a family for that family to be counted as marker candidate (plus/minus 1)
63 genome (251585 proteins, 18534 large family-proteins) 20460854 links 38450 MCL clusters 818 cluster (>=62 members and <2000 members) BLASTP (cutoff 1e-10 over 80% span) MCL (I=2) 170 families with 62-64 members: 105 can be marker candidates Universality 100 (size=63-64), 98 (size=62)
ISSUE ONE: Are there any markers embedded in the larger clusters? ObgE YchF GTP-binding protein
Automatic Tree Screening: Pick clades with the desirable number of taxa calculate universality and envenness Generate families and Building HMMs Search the hmm profiles against the entire actinobacterial peptides to see if the families are distinct
818 trees (60-2000 genes/tree, Build by MUSCLE/FastTree) 155 clades with leave-number=63, universality=100, evenness=100 The Good murD UDP-N-acetylmuramoyl-L-alanyl-D-glutamate synthetase
The Bad serS seryl-tRNA synthetase
HMMER3 hmmbuild into 155 profiles Search against the actinobacterial genomes, only keep the following HMMs: Scenario 1: Seeds hit E-value<=1e-20, none-seeds E-value >1e-3 Scenario 2: Best None-seed hit E-value (En) <= 1e-3 The Worst seed hit E-value <= 1e-17*En The Worst seed bit-score is more than twice of the None-seed bit-score extreme value distribution Worst seed Best none-seed
ISSUE TWO: Are there any markers families torn apart in the clustering process? BLASTP links Exclude Marker candidates Large MCL family members (>=1000/family) Single linkage clustering Single Linkage Families
ISSUE THREE: Miss-placed deep branch lepA GTP-binding protein
136 actinobacterial markers from 63 Actinobacterial genomes single-linkage clusters One deletion in one genome One duplication in one genome tree topology correction 2 9 22 18 32 Tree-based picking from MCL clusters 93 96 original MCL clusters One copy/genome
Select completed genomes from IMG for the following group Use wget to get the sequences from the website Gene marker identify pipeline (BLASTP,MCL clustering, tree building, clade evaluation)
Cluster HMM profiles Hmm Profile One Consensus Sequence Consensus Sequences for all HMMs (5133 lineage specific families + 56 Bacterial marker families) All vs All BLASTP (E value cutoff = 1e-3) Single Linkage Clustering -> 570 clusters (size >= 2) -> build 404 trees
Example of A tree (SIN323: carbamoyl-phosphate synthase)
Sampling and Analysis the Tree Automatically Split the tree one edge at a time Evaluate evenness If evenness = 100 and single copied for each group HMM profile building and search against all the consensus sequences
Are the seed peptides distinct? What cutoff to use? A sample of HMM search output (the seeds are marked red) THERMI894 RELAX5_SIN270.tre.ID542.faa.trim 3.8e-57 191.5 THERMO479 RELAX5_SIN270.tre.ID542.faa.trim 4.8e-56 187.9 EPSI254 RELAX5_SIN270.tre.ID542.faa.trim 7.1e-56 187.3 ALPHA58 RELAX5_SIN270.tre.ID542.faa.trim 4.2e-55 184.8 CHLOFL232 RELAX5_SIN270.tre.ID542.faa.trim 8.5e-52 174.0 BARIO164 RELAX5_SIN270.tre.ID542.faa.trim 4.5e-51 171.7 CHLAM424 RELAX5_SIN270.tre.ID542.faa.trim 1.3e-50 170.1 GAMMA93 RELAX5_SIN270.tre.ID542.faa.trim 1.2e-43 147.4 CYANO551 RELAX5_SIN270.tre.ID542.faa.trim 1.4e-42 144.0 ARCH63 RELAX5_SIN270.tre.ID542.faa.trim 7.2e-21 73.3 CYANO18 RELAX5_SIN270.tre.ID542.faa.trim 2.8e-06 25.7 THERMO26 RELAX5_SIN270.tre.ID542.faa.trim 3.1e-06 25.6 EPSI354 RELAX5_SIN270.tre.ID542.faa.trim 6.5e-05 21.3
B = A = e Exponential curve Fitting to identify Hmmsearch E value cutoff lg(Eseed_cutoff/Etop_none_seed) = A e B lg(Etop_none_seed) lg(Eseed_cutoff/Etop_none_seed) Lg(Etop_none_seed) Position 1: [x1=lg(1e-3) y1=lg(1e-15/1e-3)] Position 2: [x2=lg(1e-250) y2=lg(1e-1000/1e-250)]
Get all the potential groups that can be marker candidates One consensus sequence from one phylogenetic group in one clade The sequences are distinct from other sequences Overlap problems and solutions
Example of A tree (SIN323: carbamoyl-phosphate synthase)
684 families that span multiple taxonomic group Accumulative Distribution Family Number Simple Distribution Family Size 383 families that span >=4 taxonomic groups
We have 382 clades (including whole trees) that are potential marker families that each spans at least four different taxonomic groups
Search a group of genomes using a distant HMM profile A group of HMM profiles Combine the seeds and build a new profile HMM Search one group that is missing from the HMM list Get a large number of hits from the top (2 x genome number) and mark the very top hits Search a group of genomes using a HMM profile of insiders Tree building, and evaluate the clades: Must include the very top hits HMM building from the clades to estimate uniqueness *Manual examinations are required in some cases
All the peptides from for a given family MUSCLE Alignment Hmm Profile Hmm search against all the complete genome database Look through the hmm search results and determine if the hmm can distinguish family members from others
Tree building Alignment ZORRO mask
Use 0.1 as the first round ZORRO cutoff Trim the alignments and calculate the second ZORRO mask score
Build PHYML trees for all the families (alignments trimmed by the second ZORRO mask) Ribosomal protein S4 PHYML tree (MF00001)
Monophyletic Analysis A list of taxa that are assumed to be monophyletic can be divided into separate clades A monophyletic value is designed to estimate if given list of taxa are monophyletic or not quantitatively
Shannon entropy measures uncertainty in a dataset 100% All taxa from a phylum form a monophyletic clade: Uncertainty -> 0 p1 Uncertainty increases if Clades number increase Evenness increase All taxa from a phylum spread into N clades: p2 p3 p1+p2+p3=1 Shannon entropy calculation:
Sample number H Monophyletic Value = 100 x Shannon Entropy Calculate Shannon entropy for 100 taxa distributed in N bins (N=2..10) (repeat the calculation for 10,000 random simulations for each N)
Monophyly Value Shannon Entropy
161 families are kept For at least 4 taxonomic groups Universality * Evenness * monophyly >= 90*90*90 PMPROK00023: ribosome recycling factor LIST:ARCH UNIVERSALITY:NA EVENNESS:NA MONOPHYLY:NA LIST:BACT UNIVERSALITY:99.67 EVENNESS:98.68 MONOPHYLY:NA LIST:ACTINO UNIVERSALITY:100.00 EVENNESS:100.00 MONOPHYLY:78.84 LIST:BARIO UNIVERSALITY:100.00 EVENNESS:100.00 MONOPHYLY:59.78 LIST:CHLAM UNIVERSALITY:100.00 EVENNESS:100.00 MONOPHYLY:100.00 LIST:CHLOFL UNIVERSALITY:100.00 EVENNESS:100.00 MONOPHYLY:60.37 LIST:CYANO UNIVERSALITY:100.00 EVENNESS:81.04 MONOPHYLY:100.00 LIST:FIRM UNIVERSALITY:99.06 EVENNESS:100.00 MONOPHYLY:85.98 LIST:SPIRO UNIVERSALITY:100.00 EVENNESS:100.00 MONOPHYLY:53.75 LIST:THERMI UNIVERSALITY:100.00 EVENNESS:100.00 MONOPHYLY:100.00 LIST:THERMO UNIVERSALITY:100.00 EVENNESS:100.00 MONOPHYLY:100.00 LIST:PROTEO UNIVERSALITY:99.69 EVENNESS:100.00 MONOPHYLY:44.61 LIST:ALPHA UNIVERSALITY:100.00 EVENNESS:100.00 MONOPHYLY:63.18 LIST:BETAGAMMA UNIVERSALITY:99.45 EVENNESS:100.00 MONOPHYLY:97.47 LIST:BETA UNIVERSALITY:98.21 EVENNESS:100.00 MONOPHYLY:100.00 LIST:GAMMA UNIVERSALITY:100.00 EVENNESS:100.00 MONOPHYLY:79.67 LIST:DELTA UNIVERSALITY:100.00 EVENNESS:100.00 MONOPHYLY:88.17 LIST:EPSI UNIVERSALITY:100.00 EVENNESS:100.00 MONOPHYLY:100.00
What is next: 1. Search IMG again to update the seqs and accessions 2. Develop CGI scripts to retrieve user defined markers (calculate universality, evenness and monophyly on the fly)