150 likes | 269 Views
WHOLE GENOME PHYLOGENIES USING VECTOR REPRESENTATIONS OF PROTEIN SEQUENCES. Jeffery J. Leader Department of Mathematics Rose-Hulman Institute of Technology Terre Haute, IN 47803. Gary W. Stuart Department of Life Sciences Indiana State University Terre Haute, IN 47809.
E N D
WHOLE GENOME PHYLOGENIES USING VECTOR REPRESENTATIONS OF PROTEIN SEQUENCES Jeffery J. Leader Department of Mathematics Rose-Hulman Institute of Technology Terre Haute, IN 47803 Gary W. Stuart Department of Life Sciences Indiana State University Terre Haute, IN 47809 G-Stuart@indstate.edu Jeffery.Leader@Rose-Hulman.edu
INTRODUCTION • Whole Genome Sequences • Whole genome sequences are now available for 51 prokaryotic bacterial genomes, 5 eukaryotic genomes (yeast, roundworm, fruit fly, human, and A. thaliana), and 160 animal mitochondrial genomes. • Molecular Phylogeny • Gene or protein sequence similarity is an accurate measure of species similarity. • Methods that involve a comparison of whole genomes should improve estimates of phylogenetic relatedness. • Comprehensive methods for calculating, archiving, and accessing gene and species similarity estimates using whole genomes are needed.
METHODS: • Whole Genome Target Sets • Mitochondrial: 13 gene families in 64 species = 832 proteins • Microbial: 84 COG’s in 44 species = 5,871 proteins • Mixed: 19 whole genomes & organelles = 28,653 proteins • Peptide Frequency Matrix • 160,000 overlapping 4-grams x 832 = 133 million elements • 8,000 overlapping 3-grams x 5,871 = 46 million elements • 160,000 overlapping 4-grams x 28,653 = 4.5 billion elements • SVD and Dimension Reduction (DR) • SVD-DR using bls-1:SVDPACK (Berry) • Gene Trees and Species Trees • Gene to species data by vector addition (below) • Cosine-based distances obtained from SVD-DR vectors • Gene/species trees via NJ: PHYLIP (Felsenstein)
METHODS: • 13 n-dimensional Vectors Define each Mitochondrial Genome • 84+ n-dimensional Vectors Define each COG Subset Genome • 102-3 n-dimensional Vectors Define each Complete Microbial Genome • Separate vectors: gene trees • Vector sums: species trees
RESULTS: • Optimal Dimension Reduction for Mitochondrial Dataset • 832 proteins in 13 families placed into 16 groups @ k=73-74
RESULTS: • Gene Tree Example for Mitochodrial Dataset • 832 proteins from 64 species • 13 families formed 16 groups @ k=74 dimensions • Only ND3, ND5, and CYTB failed to group exclusively
RESULTS: • Consensus Species Tree for Mitochondrial Dataset • Optimal trees @ k=64-80 dimensions Perissodactyla Birds Carnivora Mammals Cetartiodactyla Reptiles Rodentia Fish Primates
RESULTS: • Optimal Dimension Reduction for Microbial COG Dataset • 5, 871 proteins from 44 species (including 2 fungi) • 84 gene families placed into less than 1040 groups @ k>256
RESULTS: • Gene Tree Example for Microbial COG Dataset • 84 COG families formed 997 groups @ k=259 • Too big to show! • Consensus Species Tree for COG Dataset • Optimal trees @ 257-280 dimensions • ALL 24 TREES - IDENTICAL TOPOLOGY • Challenging Relationships • Hyperthermophiles: Aaeo & Tmar • eubacteria with archaebacteria • Archaebacterium: Hnrc • archaebacteria with eubacteria • Spirochetes: Tpal & Bbur • widely separated • Bbur (& Baps) like Rpro • mitochondria like? circles: archaebacteria triangles: eubacteria squares: eukaryotes black: hyperthermophiles green: Halobacterium blue: spirochetes
RESULTS: • Consensus Species Tree for Total Genome Dataset • Optimal gene trees too big to analyze (for now?) • For species trees, k = 210-220 large enough? • Challenging Relationships • Chloroplast (Noli) not with Cyanobacteria (Sspp) • Eubacterial thermophiles with archaea • Spirochetes Tpal & Bbur widely separated • Mitochondria (Recl) with Rpro, Bbur circles: archaebacteria triangles: eubacteria black: hyperthermophiles red: mitochondria green: chloroplast blue: spirochetes
DISCUSSION: • How Does It Work? • Singular Value Decomposition (SVD): Ar = UrSr VrT • Matrix “Ar” is the original matrix (peptides x proteins) of rank “r” • Matrix “Ur” is the “peptide” matrix (peptides x singular vectors) • Matrix “Vr” is the “protein matrix (proteins x singular vectors) • Matrix “Sr” is a diagonal matrix of singular values in descending order • Graphically, SVD represents a rotation of description space • In the “old” spaces, either... • peptides are defined by protein vectors (protein vectors = columns of “A”) • proteins are defined by peptide vectors (peptide vectors = rows of “A”) • In the “new” space, • peptides are defined by singular vectors (singular vectors = columns of “U”) • proteins are defined by singular vectors (singular vectors = columns of “V”) • singular values (S) and their vectors are ordered by “importance” (variance) • Dimension Reduction removes “noise” : Ak = UkSk VkT • if smaller singular values set to zero, then k<r • Ak is always the most accurate reconstruction of Ak for a given k • How is the “correct” value of “k” determined? (where does “signal” become “noise”?)
DISCUSSION: • What are the Singular Vectors? What do they represent? • Singular Vectors are Orthonormal Basis Vectors for the “New” Space • the sum of the squares of the elements of each vector sum to exactly “1”. • Singular Vectors of “U” Define Peptides by Presence in Conserved Motifs • the individual elements of these vectors indicate contributions of each peptide • a positive value for an element indicates a positive correlation with a given motif • a negative value indicates a negative correlation with that motif • singular vectors are therefore precisely defined by relative peptide contributions • Singular Vectors of “V” Define Proteins by Presence of Conserved Motifs • the individual elements of these vectors indicate contributions of each protein • a positive value indicates protein contains the motif • a negative value indicates protein has nothing like this motif • Singular Vectors Represent and Define Conserved Motifs • motifs are described as correlated peptides (copep motifs) in the “U” matrix • copep motifs are NOT REQUIRED to have peptides in extended alignments
DISCUSSION: • SVD produces a new descriptive space ... Matrix A: “old” space with peptide bases Matrices U & V: “new” space with copep bases copep k peptide n q' q copep 2 peptide 2 copep 1 peptide1
DISCUSSION: ND3_Sscr: MNIMLTLLTNVTLASLLVLIAFWLPQLNAYSEKTSPYECGFDPMGSARLPFSMKFFLVAITFLLFDLEIALLLPLPWASQTNNLKTMLTMALFLLILLAASLAYEWTQKGLEWAE 2:LLVLIAFWLPQLN 8:FDPMGSA 5:VAITFLL 4:SLAYEW QGGLEWAE:1 LLVL 0.03683679 FDPM 0.04752085 VAIT 0.06136436 SLAY 0.04349018 LVLI 0.04179308 DPMG 0.04346968 AITF 0.06688452 LAYE 0.07069019 VLIA 0.03430444 PMGS 0.04377919 ITFL 0.06649574 AYEW 0.07069019 LIAF 0.04204521 MGSA 0.04145973 TFLL 0.06773809 IAFW 0.06443857 AFWL 0.07452322 FWLP 0.10143487 WLPQ 0.09049735 LPQL 0.05343578 PQLN 0.04795852 ND3_Pdol: MNLLIVMIISTALPIILMLLGFWLPNLNPDNEKVSPYECGFDPLGSARLPFSLKFFLVAILFLLFDLEIAILLPLPWALQYDTPTTAFLIALLILILLTLGLIYEWLQGGLEWAE 0:EKLSPYECGFDPLGSARLPFSMKFFLVAILFLLFDLEIALLLPLPWASQ 3:TLLTLGLIYEWTQ EKLS 0.05915245 TLLT 0.03798160 KLSP 0.06205812 LLTL 0.07589711 LSPY 0.06205812 LTLG 0.05564693 SPYE 0.12096941 TLGL 0.05957197 PYEC 0.13026607 LGLI 0.04020463 YECG 0.13224916 GLIY 0.04413720 ECGF 0.13224916 LIYE 0.04413720 CGFD 0.13080502 IYEW 0.04558135 GFDP 0.13080502 YEWT 0.06092579 FDPL 0.07362187 EWTQ 0.05599396 DPLG 0.06971443 PLGS 0.06871255 LGSA 0.06846775 GSAR 0.11228549 SARL 0.12589498 ARLP 0.12589498 RLPF 0.12456818 LPFS 0.12617765 PFSM 0.06863905 FSMK 0.06474277 SMKF 0.06460109 MKFF 0.06688452 KFFL 0.06847234 FFLV 0.10783106 FLVA 0.10640073 LVAI 0.10613140 VAIL 0.04388013 AILF 0.06119889 ILFL 0.06509992 LFLL 0.07745415 FLLF 0.12525181 LLFD 0.12614454 LFDL 0.12960829 FDLE 0.12960829 DLEI 0.12746527 LEIA 0.12925820 EIAL 0.12133975 IALL 0.13983325 ALLL 0.18434782 LLLP 0.13276736 LLPL 0.10135227 LPLP 0.12080467 PLPW 0.11729774 LPWA 0.08805169 PWAS 0.03825444 WASQ 0.03348659 • An Example Copep Motif ... • Singular Vector 10 • defines copep 10 as square-root probabilities of tetrapeptides • all possible peptides are included in an exhaustive description • only positive values above 0.03 shown here • describes conserved elements within the NADH Dehydrogenase 3 proteins • many peptides appear in aligned sub-motifs (but some do not) • alternative versions of the motif are represented by overlapping sub-motifs • copep motifs are IDEALIZED motifs that may not exist in nature.
Future Goals • Use a transformation on the input data. • normalization of input protein vectors • word entropy transformation or “df” transformation • improves performance in text-retrieval applications • Are n-grams larger than 4 feasible? • 3.2 million possible pentapeptides • Are multiple eukaryotic genomes feasible? • millions of genes? • networked computers • parallel code for SVD (with M. Berry, Tennessee-Knoxville)