Institute of Genetics and Selection of Industrial Microorganisms, Moscow, Russia

Combined network of transcription regulation and protein-protein interaction for inferring genome-wide functional linkages. Institute of Genetics and Selection of Industrial Microorganisms, Moscow, Russia. Russian-Indian Collaborating Project.


Presentation Transcript


  1. Combined network of transcription regulation and protein-protein interaction for inferring genome-wide functional linkages. Institute of Genetics and Selection of Industrial Microorganisms, Moscow, Russia. Novosibirsk, Indian-Russian Meeting, 2008.

  2. Russian-Indian Collaborating Project
     • Kharkevich Institute for Information Transmission Problems, Russian Academy of Sciences
     • State Research Center of Genetics and Selection of Industrial Microorganisms, Moscow, Russia
     • Prof. Shekhar Mande
     • Prof. Mikhail Gelfand

  3. Comparative genomics can show gene functional linkages
     • Co-occurrence in known operons
     • Minimal distance between a pair of genes in a genome (unknown operons)
     • Phylogenetic profiling (similar behaviour of a gene pair in several genomes)
     266 linear genomes allow functional linkages between genes to be evaluated by statistical methods (Yellaboina et al., Genome Research 2007, 17: 527-535).
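
The phylogenetic profiling idea on this slide can be illustrated with a minimal sketch. It assumes each gene is described by a presence/absence vector across genomes and uses the Jaccard index as the similarity measure; the function name, the toy profiles, and the choice of Jaccard are illustrative assumptions, not the statistics used by Yellaboina et al.

```python
def profile_similarity(profile_a, profile_b):
    """Jaccard similarity of two presence/absence profiles (1 = gene present in genome)."""
    if len(profile_a) != len(profile_b):
        raise ValueError("profiles must cover the same set of genomes")
    both = sum(1 for a, b in zip(profile_a, profile_b) if a and b)
    either = sum(1 for a, b in zip(profile_a, profile_b) if a or b)
    return both / either if either else 0.0


# Hypothetical profiles over six genomes.
gene_a = [1, 1, 0, 1, 0, 1]
gene_b = [1, 1, 0, 1, 0, 0]
gene_c = [0, 0, 1, 0, 1, 0]

print(profile_similarity(gene_a, gene_b))  # 0.75: the two genes mostly co-occur
print(profile_similarity(gene_a, gene_c))  # 0.0: the two genes never co-occur
```

A high score suggests a candidate functional linkage; a real analysis would also need a significance estimate over many genomes.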

  4. What are the mechanisms behind “functionally related genes”? Protein-protein interactions obtained by high-throughput experimental methods correlate well with the functional relatedness of genes obtained with bioinformatics (Yellaboina et al., Genome Research 2007, 17: 527-535).
     Protein-protein interaction… BUT …
     • Metabolic pathways
     • Transcription co-regulation
     • Other extravagant mechanisms (direct interactions in genome)
     • Or several simultaneously

  5. Bioinformatics of transcription regulation of bacterial genes
     • Specific promoters
     • RNA-based switches
     • Specific protein transcription regulatory factors (TF)
     TF-mediated regulation is often responsible for the regulation of complex processes: cross-talk, non-trivial concentration dependence (quorum sensing).

  6. DNA signals responsible for TF binding. Bacterial regulatory sites:
     1. Usually long and divergent
     2. Positioning relative to the promoter is often important
     3. Sites for cross-talking proteins may overlap

  7. Integrated database: functionally related genes, protein interactions, transcription-associated genes, metabolism-associated genes

  8. Bioinformatics for the hierarchy of organization levels of biosystems. 12 program components integrated into a single system: TandemSWAN, BASIO, ALEX, SeSiMCMC, STRUSWER, STRUDL, RNA-MBFS, Prophet, Oligomeasure, PSACR, Combinator, KMD.
     [Diagram: DNA (sequence); RNA (sequence, structure); protein (sequence, structure, complex); variation between species and individuals in populations]

  9. Some technical points

  10. Two integrated databases
      • Molecular entities: PathWay Studio, Ariadne Genomics, Inc.
      • Genome annotations: original database of genome annotation and transcription regulation

  11. Integration of data on binding sites and genome annotations
      • All experimental and predicted binding sites and other segment data are mapped onto the genome
      • Filtration of multiple identical entries and obviously irrelevant sites in EcoCyc
      • Site positioning in relation to other genomic structures (repeats, genes)
      • Motifs are represented as lists of allowed words
      • Different experimental sources, as well as comparative genomics studies, are used for motif construction
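
As a rough illustration of mapping a word-list motif onto a genome and filtering identical entries, the sketch below scans both strands for every allowed word and collapses duplicate hits. The function names and the toy sequence are assumptions for illustration; the actual database pipeline is not described on the slide.

```python
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    """Return the reverse complement of an uppercase DNA string."""
    return seq.translate(COMPLEMENT)[::-1]


def map_word_list(genome, words):
    """Map a motif given as a list of allowed words onto both strands of a genome.

    Returns sorted (position, strand, word) tuples; identical entries are
    collapsed because hits are collected in a set.
    """
    hits = set()
    for word in words:
        for strand, pattern in (("+", word), ("-", reverse_complement(word))):
            start = genome.find(pattern)
            while start != -1:
                hits.add((start, strand, word))
                start = genome.find(pattern, start + 1)  # allow overlapping hits
    return sorted(hits)


# Toy example: the word list deliberately contains a duplicate entry.
print(map_word_list("TTGACAGGTGTCAA", ["TGACA", "TGACA"]))
# [(1, '+', 'TGACA'), (8, '-', 'TGACA')]
```

Positions are 0-based starts on the forward strand. A palindromic word would be reported once per strand at the same position, so a real pipeline would need a rule for merging such pairs.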

  12. Viewpoints
      • A database that contains experimental data and computational predictions in an integrated manner
      • XML format for organizing data flow
      • Possible distributed computations
      • Possible platform independence (Ruby & Java)

  13. Unified storage for experimental data from different sources
      [Diagram labels: SELEX; genome; motif models; comparative genomics; footprinting; filtering identical and irrelevant motifs, preprocessing; small-BiSMark (XML-based small language for Biological Sequence Markup); database engine]

  14. Identification of optimal binding motifs using stochastic optimization
      • SeSiMCMC: Gibbs-sampler-based algorithm for identification of a binding motif
      • Multiple local alignment of candidate genomic sequences
      • Optimization of the motif length
      • Modeling of dyads (palindromes and tandem repeats) in motif structures
      • Priors for absent sites and sites at the forward and backward DNA strands
      [Figures: known binding site motif (SELEX, a sequence logo for the Sp1 factor binding site); SeSiMCMC result on a TRANSFAC dataset]
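
To make the Gibbs sampling idea concrete, here is a minimal sketch of a classic site sampler: one site per sequence, fixed motif width, uniform background. It is not SeSiMCMC itself; motif length optimization, dyad modeling, strand handling, and absent-site priors from the slide are all left out. The function name, parameters, and toy sequences are illustrative assumptions.

```python
import random

ALPHABET = "ACGT"


def gibbs_motif_sampler(seqs, width, iterations=200, pseudocount=0.5, seed=0):
    """Sample one motif start per sequence (uppercase ACGT only, len >= width)."""
    rng = random.Random(seed)
    positions = [rng.randrange(len(s) - width + 1) for s in seqs]
    for _ in range(iterations):
        for held_out, seq in enumerate(seqs):
            # Column counts from the current sites of all other sequences.
            counts = [{b: pseudocount for b in ALPHABET} for _ in range(width)]
            for i, (other, pos) in enumerate(zip(seqs, positions)):
                if i == held_out:
                    continue
                for col, base in enumerate(other[pos:pos + width]):
                    counts[col][base] += 1
            total = (len(seqs) - 1) + 4 * pseudocount
            # Likelihood ratio (motif model vs uniform background) for each start.
            weights = []
            for start in range(len(seq) - width + 1):
                ratio = 1.0
                for col, base in enumerate(seq[start:start + width]):
                    ratio *= (counts[col][base] / total) / 0.25
                weights.append(ratio)
            # Resample the held-out site proportionally to the ratios.
            positions[held_out] = rng.choices(range(len(weights)), weights=weights)[0]
    return positions


# Hypothetical input: four short sequences sharing the word TTGACA.
seqs = ["ACGTTGACAGG", "TTGACAACGTA", "GGGTTGACATC", "CATTGACAGGT"]
print(gibbs_motif_sampler(seqs, width=6))
```

The result is a random sample, so the reported positions depend on the seed. In practice the sampler is run several times and the alignment with the best score is kept.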

  15. SeSiMCMC sampler page

  16. Identification of spaced and overlapping motifs

  17. Regulatory regions: different types of architecture
      • Overlapping and spaced binding sites
      • Homotypic clusters
      • Clusters aligned with promoters
      [Figure labels: ArcA sites; promoter]

  18. Statistical validation of selectivity and identification of optimal binding motifs
      • AhoPro algorithm for calculation of the P-value of site binding
      • Comparison of different binding motifs
      • Using different motif models
      • Selection of the optimal motif
      • Direct calculation of motif selectivity for different specificity levels
      • Supported motif models include
        • Position weight matrices
        • Word lists
        • IUPAC strings
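
The three motif models listed above can all be reduced to word lists, which is one plausible way to handle them uniformly. The sketch below expands an IUPAC string and enumerates the words that a position weight matrix scores at or above a threshold. The function names, the matrix layout (one dict of scores per column), and the toy matrix are assumptions for illustration, not the AhoPro interface.

```python
from itertools import product

# Standard IUPAC nucleotide codes.
IUPAC = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "R": "AG", "Y": "CT", "S": "CG", "W": "AT", "K": "GT", "M": "AC",
    "B": "CGT", "D": "AGT", "H": "ACT", "V": "ACG", "N": "ACGT",
}


def expand_iupac(pattern):
    """Expand an IUPAC string into the list of concrete words it allows."""
    return ["".join(word) for word in product(*(IUPAC[c] for c in pattern))]


def words_above_threshold(pwm, threshold):
    """List all words whose summed PWM score reaches the threshold.

    `pwm` is a list of columns, each a dict base -> score. The enumeration is
    exhaustive (4**width words), so it only suits short motifs.
    """
    words = []
    for letters in product("ACGT", repeat=len(pwm)):
        score = sum(column[base] for column, base in zip(pwm, letters))
        if score >= threshold:
            words.append("".join(letters))
    return words


print(expand_iupac("TGAY"))  # ['TGAC', 'TGAT']

toy_pwm = [
    {"A": 1, "C": 0, "G": 0, "T": 0},
    {"A": 0, "C": 1, "G": 0, "T": 1},
]
print(words_above_threshold(toy_pwm, 2))  # ['AC', 'AT']
```

Raising or lowering the threshold changes the size of the word list, which corresponds to the trade-off between motif specificity and selectivity mentioned on the slide.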

  19. AhoPro algorithm: exact P-value calculation
      • Aho-Corasick pattern matching automaton, for example for H1 = {ACC, ACT} and H2 = {AT, CT} [Figure: automaton with a root state and transitions labelled A, C, T]
      • Each state at the i-th step belongs to a class C_i(r1, r2; q), where i is the step (text length), r1 is the number of occurrences of the first motif, r2 is the number of occurrences of the second motif, and q is the longest suffix in the prefix closure of H1 ∪ H2
      • Probabilities to be at each state (probability transducer)
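
The state classes described on this slide can be turned into a small dynamic program. The sketch below is a simplified illustration under stated assumptions: letters are independent (Bernoulli text model), states are stored as explicit strings rather than a compiled Aho-Corasick automaton, and an occurrence of a motif is counted at every text position where one of its words ends. The function name and its parameters are hypothetical; this is not the published AhoPro code.

```python
from collections import defaultdict


def exact_pvalue(h1, h2, text_len, min1=1, min2=1, probs=None):
    """P(at least min1 occurrences of H1 and min2 of H2 in a random text)."""
    probs = probs or {base: 0.25 for base in "ACGT"}
    words = set(h1) | set(h2)
    # Prefix closure of H1 ∪ H2; it always contains the empty string.
    prefixes = {w[:k] for w in words for k in range(len(w) + 1)}

    def step(state, base):
        s = state + base
        hit1 = any(s.endswith(w) for w in h1)
        hit2 = any(s.endswith(w) for w in h2)
        while s not in prefixes:  # fall back to the longest suffix in the closure
            s = s[1:]
        return s, hit1, hit2

    # dist maps (q, r1, r2) -> probability; counts are capped at min1 / min2.
    dist = {("", 0, 0): 1.0}
    for _ in range(text_len):
        nxt = defaultdict(float)
        for (q, r1, r2), p in dist.items():
            for base, pb in probs.items():
                q2, hit1, hit2 = step(q, base)
                key = (q2, min(r1 + hit1, min1), min(r2 + hit2, min2))
                nxt[key] += p * pb
        dist = nxt
    return sum(p for (_, r1, r2), p in dist.items() if r1 == min1 and r2 == min2)


# Motifs from the slide; in a text of length 3 only ACT contains both motifs.
print(exact_pvalue(["ACC", "ACT"], ["AT", "CT"], text_len=3))  # 0.015625 (= 1/64)
```

Capping the counters at the requested minimum keeps the number of classes bounded, so the cost grows linearly with text length for fixed motifs.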

  20. AhoPro: p-value calculator! We developed an algorithm for exact p-value calculation for multiple occurrences of multiple motifs. Boeva, V., J. Clement, M. Regnier, M.A. Roytberg, and V.J. Makeev. 2007. Exact p-value calculation for heterotypic clusters of regulatory motifs and its application in computational annotation of cis-regulatory modules. Algorithms Mol Biol 2: 13.

  21. A data flow for motif model construction
      [Diagram labels: footprinting results; SELEX; ChIP-chip; genome-mapped with correct flanking sequences; short sites or site parts; raw long sequences; may be used as mask; to be used as initial mask; SeSiMCMC (two runs); additional motif length estimation (two steps); Sp1 binding site; motif model (two outputs); verification]

  22. Obtaining clean data from specific sources. Using TRANSFAC as the base data source for binding sites of a selected factor.
      [Diagram labels: small-BiSMark; database engine]

  23. A verification procedure for a created motif model
      [Diagram labels: new motif model; testing sequence set; wisely chosen set of motif-containing sequences; AhoPro; selectivity testing; choosing optimal motif specificity; processed experimental data (via SeSiMCMC); newly discovered motif (by SeSiMCMC or ScanSeq)]

  24. Comparative motif analysis
      [Diagram labels: testing sequence set (footprinting, ChIP-chip data, randomly generated set); new motif model; known motif models 1, 2, 3; AhoPro; selectivity testing; comparative analysis; selecting best motif model]

  25. A genome-wide motif distribution mapping
      [Diagram labels: new motif model; known motif models 1, 2, 3; genome-wide sites of different quality, globally positioned on the chromosome]
      • Possible clustering of sites: different models for one factor, best models for different factors
      • Positioning within specific DNA regions: CRM, CpG islands, etc.

  26. Distributed computations support
      • «Theatre manager»: possible multiple Opera House management for grid computing support (request redirecting and resource balancing only)
      • «Opera House»: single physical machine multi-process remote task execution control service
      • «Opera libretto»: specified scenario execution
      [Diagram labels: physical machine; main database]

  27. Overview of the technical realization of the complex
      • Database level: MySQL
      • Application level: SeSiMCMC and AhoPro, high-speed C++ code
      • Server level: Ruby-powered cross-platform DRb-based server
      • Web-interface level: Ruby-based CGI, Ruby on Rails in future
      • Data-workflow level: Ruby-powered cross-platform scenario scripts
      • small-BiSMark processing: Ruby and Java-based tools (REXML, JAXP, SAXON)

  28. THE END

  29. Acknowledgments
      • GosNIIgenetika group: Vsevolod Makeev, Alexander Favorov, Elizaveta Permina, Valentina Boeva, Ivan Kulakovsky, Dmitry Malko
      • Financial support: Russian Federation State Innovation Project; Russian Foundation of Basic Research; DST India

  30. Biological data analysis components
      • DNA analysis:
        • Basio: large-scale sequence analysis, compositional segmentation
        • TandemSWAN: tandem repeats in DNA sequences
        • SeSiMCMC: DNA motif identification
        • Oligomeasure: DNA structure from DNA sequence

  31. TandemSWAN
      • Finds tandem repeats with substitutions but without indels, with control of the statistical significance of the repeat
      • Finds micro- and minisatellites with substitutions
      Example: tttatttatttatttatttatttatttatttatttatttatttatttatttatttatttattta
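
A minimal sketch of detecting a tandem repeat that tolerates substitutions but not indels: for each candidate period, count how often a letter equals the letter one period later, and keep the period with the highest identity. This is only an illustration of the idea; the function name and `max_period` are assumptions, and the statistical significance control that TandemSWAN provides is not modeled.

```python
def best_period(seq, max_period=10):
    """Return (period, identity) maximizing the fraction of positions i with seq[i] == seq[i + period]."""
    best_identity, best_p = 0.0, None
    for period in range(1, max_period + 1):
        pairs = len(seq) - period
        if pairs <= 0:
            break
        matches = sum(seq[i] == seq[i + period] for i in range(pairs))
        identity = matches / pairs
        if identity > best_identity:  # strict: the shortest of equally good periods wins
            best_identity, best_p = identity, period
    return best_p, best_identity


# The perfect repeat from the slide has the unit "ttta".
print(best_period("ttta" * 16))  # (4, 1.0)
```

Because no gaps are allowed, a single insertion or deletion shifts the phase and lowers the identity for the true period, which is why indel-tolerant repeat finders need an alignment step instead.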

  32. BASIO: BAyesian Segmentation Information Optimizer
      • Performs DNA parsing into segments with a uniform composition
      • Uses Bayesian optimization over all possible segment configurations
      • Uses the Bayesian Information Criterion (BIC) to control segmentation resolution
      Example: atcatatca|ggcggcgcagccgcagcc|tctcttcttc
      [Pipeline diagram labels: input sequence; Split (sequence preprocessing); Basio (basic segmentation algorithm); filter (remove short or redundant segments); Report (select the appropriate output format, format the output); list of segments]
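
The role of BIC in controlling segmentation resolution can be sketched with a single greedy split: a boundary is accepted only if the gain in log-likelihood exceeds the BIC penalty for the extra parameters. This is a simplification under stated assumptions; Basio optimizes over all segment configurations, whereas this sketch tests one boundary, and the parameter count (`free_params=4`: three free frequencies plus one boundary position) is an illustrative choice.

```python
import math
from collections import Counter


def log_likelihood(segment):
    """Maximum log-likelihood of a segment under its own letter frequencies."""
    n = len(segment)
    return sum(c * math.log(c / n) for c in Counter(segment).values())


def best_bic_split(seq, free_params=4):
    """Return the best split position, or None if BIC does not justify a split."""
    n = len(seq)
    if n < 2:
        return None
    whole = log_likelihood(seq)
    best_gain, best_pos = 0.0, None
    for pos in range(1, n):
        gain = log_likelihood(seq[:pos]) + log_likelihood(seq[pos:]) - whole
        if gain > best_gain:
            best_gain, best_pos = gain, pos
    # BIC penalty: (number of extra parameters / 2) * ln(sequence length).
    penalty = 0.5 * free_params * math.log(n)
    return best_pos if best_gain > penalty else None


print(best_bic_split("aaaaaaaaaacccccccccc"))  # 10: two compositionally distinct halves
print(best_bic_split("acgtacgtacgtacgt"))      # None: uniform composition, no split
```

Applying the same test recursively to each accepted segment would give a simple top-down segmentation, with the penalty preventing over-fragmentation.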

  33. SeSiMCMC: Sequence Similarity Markov Chain Monte Carlo
      • SeSiMCMC: Gibbs-sampler-based algorithm for identification of a binding motif
      • Multiple local alignment of candidate genomic sequences
      • Optimal identification of the motif length
      • Analysis of symmetries in motif structures
      • Priors for absent sites and sites at the forward and backward DNA strands
      [Figures: known binding site motif (SELEX, a sequence logo for the Sp1 factor binding site); SeSiMCMC result on a TRANSFAC dataset]

  34. ALEX: Alignment of Exons. Identifies exons in a genomic alignment.
      CTGACGCACAGACCCAAGTGACGACGAGGCCGA
      CGGACGGACAGACCCAAGTGACGACGAGGCCGA

  35. PROTEIN ANALYSIS
      • Struswer: Smith-Waterman aligner taking into account the secondary structure
      • Prophet: secondary structure predictor based on discriminant analysis
      • PSIC: multiple alignment with homologs

  36. STRUSWER: STRUcture extension of Smith-Waterman alignER. Alignment of protein sequences with reference to their secondary structure.
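
One way to picture a structure-aware Smith-Waterman alignment is to add a bonus to the substitution score whenever the aligned residues share a secondary structure label. The sketch below computes only the best local score with a linear gap penalty; the scoring values, the label alphabet (H for helix, E for strand), and the function name are hypothetical and are not taken from STRUSWER.

```python
def struct_smith_waterman(a, b, ss_a, ss_b, match=2, mismatch=-1, gap=-2, ss_bonus=1):
    """Best local alignment score of a and b with a secondary-structure bonus.

    ss_a and ss_b carry one structure label per residue of a and b.
    """
    rows, cols = len(a) + 1, len(b) + 1
    score = [[0] * cols for _ in range(rows)]
    best = 0
    for i in range(1, rows):
        for j in range(1, cols):
            s = match if a[i - 1] == b[j - 1] else mismatch
            if ss_a[i - 1] == ss_b[j - 1]:
                s += ss_bonus  # reward structurally consistent residue pairs
            score[i][j] = max(
                0,                         # start a new local alignment
                score[i - 1][j - 1] + s,   # align residue i with residue j
                score[i - 1][j] + gap,     # gap in b
                score[i][j - 1] + gap,     # gap in a
            )
            best = max(best, score[i][j])
    return best


print(struct_smith_waterman("ACD", "ACD", "HHH", "HHH"))  # 9: sequence and structure agree
print(struct_smith_waterman("ACD", "ACD", "HHH", "EEE"))  # 6: same sequence, different structure
```

A production aligner would use a residue substitution matrix and affine gaps, and would likely penalize gaps inside helices and strands more than gaps in loops.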

  37. Protein Secondary Structure Prediction PROPHET

  38. RNA-MBFS (RNA MultiBranch-Free Structures). Creates an optimal RNA structure without branching.

  39. Integration on the level of computation and data
      • Easily accessible via web interface
      • Integration at data level
      • Cluster and local network distributed computation support
      • Cross-platform

  40. Building complex computational applications
      • Possibility to create individual scenarios for any special task
      • Pipelining support for computational conveyors
      • Simple XML format for scenario and conveyor descriptors

  41. Individual user spaces and profiles
      • Individual user account support
      • Individual result storage and file library

  42. Easy remote administration via web interface

  43. What’s under the hood
      • Technologies and program tools used:
        • MySQL 5 database for result and user space storage
        • JSP for the web interface
        • Apache Tomcat 5 JSP/Servlet container
        • Java 5 and RMI for the distributed computation server and node software

  44. Acknowledgements
      • Financial support:
        • Russian Federation State Contract № 02.434.11.100 (Intellectual technologies 2), Prof. Tumanyan V.G.
        • Russian Academy of Sciences project in Molecular and Cellular Biology
      • Contributors:
        • Institute of Mathematical Problems of Molecular Biology (Moscow Region, Pushchino, Russia)
        • Voronezh State University, Voronezh, Russia
        • State Research Center of Genetics and Selection of Industrial Microorganisms, GosNIIgenetika, Moscow, Russia
