Combined network of transcription regulation and protein-protein interaction for inferring genome-wide functional linkages. Institute of Genetics and Selection of Industrial Microorganisms, Moscow, Russia. Novosibirsk, Indian-Russian Meeting, 2008
Russian-Indian Collaborative Project • Kharkevich Institute for Information Transmission Problems, Russian Academy of Sciences • State Research Center of Genetics and Selection of Industrial Microorganisms, Moscow, Russia • Prof. Shekhar Mande • Prof. Mikhail Gelfand
Comparative genomics can reveal functional linkages between genes • Co-occurrence in known operons • Minimal distance between a pair of genes in a genome (unknown operons) • Phylogenetic profiling (similar behaviour of a gene pair across several genomes) • 266 linear genomes allow functional linkages between genes to be evaluated by statistical methods. Yellaboina et al. Genome Research, 2007 17: 527-535
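As an illustration of phylogenetic profiling, the sketch below scores how similarly two genes are distributed across genomes. It assumes each gene is summarised as a 0/1 presence vector over the same ordered set of genomes and uses the Pearson correlation as the similarity score. The function name and the choice of score are illustrative assumptions, not the method of the cited paper.

```python
from math import sqrt


def profile_similarity(profile_a, profile_b):
    """Pearson correlation between two presence/absence profiles (lists of 0/1).

    Returns 0.0 when either profile is constant (gene present or absent in
    every genome), since such a profile carries no co-occurrence signal.
    """
    n = len(profile_a)
    if n == 0 or n != len(profile_b):
        raise ValueError("profiles must be non-empty and of equal length")
    mean_a = sum(profile_a) / n
    mean_b = sum(profile_b) / n
    cov = sum((a - mean_a) * (b - mean_b) for a, b in zip(profile_a, profile_b))
    var_a = sum((a - mean_a) ** 2 for a in profile_a)
    var_b = sum((b - mean_b) ** 2 for b in profile_b)
    if var_a == 0 or var_b == 0:
        return 0.0
    return cov / sqrt(var_a * var_b)
```

Gene pairs with a score close to 1 occur in the same genomes and are candidates for a functional linkage; a statistical significance threshold would still have to be chosen separately.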
What are the mechanisms behind “functionally related genes”? Protein-protein interactions obtained by high-throughput experimental methods correlate well with the functional relatedness of genes obtained by bioinformatics (Yellaboina et al. Genome Research, 2007 17: 527-535). BUT several mechanisms are possible: • Protein-protein interaction • Metabolic pathways • Transcription co-regulation • Other extravagant mechanisms (direct interactions in genome) • Or several simultaneously
Bioinformatics of transcription regulation of bacterial genes • Specific promoters • RNA-based switches • Specific protein transcription regulatory factors (TF) • TF-mediated regulation is often responsible for the regulation of complex processes: cross-talk, non-trivial concentration dependence (quorum sensing)
DNA signals responsible for TF binding. Bacterial regulatory sites: 1. Usually long and divergent 2. Position relative to the promoter is often important 3. Sites for cross-talking proteins may overlap
Integrated database • Functionally related genes • Protein interactions • Transcription-associated genes • Metabolism-associated genes
Bioinformatics for the hierarchy of organization levels of biosystems • 12 program components integrated into a single system: TandemSWAN, BASIO, ALEX, SeSiMCMC, STRUSWER, STRUDL, RNA-MBFS, Prophet, Oligomeasure, PSACR, Combinator, KMD • Levels: DNA (sequence); RNA (sequence, structure); protein (sequence, structure, complex); variation between species and individuals in populations
Some technical points
Two integrated databases • Molecular entities: Pathway Studio, Ariadne Genomics, Inc. • Genome annotations: original database of genome annotation and transcription regulation
Integration of data on binding sites and genome annotations • All experimental and predicted binding sites and other segment data are mapped onto the genome • Multiple identical entries and obviously irrelevant sites in EcoCyc are filtered out • Sites are positioned relative to other genomic structures (repeats, genes) • Motifs are represented as lists of allowed words • Different experimental sources, as well as comparative genomics studies, are used for motif construction
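A minimal sketch of how a motif stored as a list of allowed words could be mapped onto a genome sequence on both strands. The function names and the hit format (0-based start, strand, matched word) are assumptions made for illustration and do not describe the actual database engine.

```python
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    """Return the reverse complement of an upper-case DNA string."""
    return seq.translate(COMPLEMENT)[::-1]


def scan_word_list(genome, words):
    """Report every occurrence of an allowed word on either strand.

    Each hit is (start, strand, word), where start is the 0-based position
    of the window on the forward strand.
    """
    allowed = {w.upper() for w in words}
    lengths = sorted({len(w) for w in allowed})
    genome = genome.upper()
    hits = []
    for start in range(len(genome)):
        for k in lengths:
            window = genome[start:start + k]
            if len(window) < k:
                continue
            if window in allowed:
                hits.append((start, "+", window))
            rc = reverse_complement(window)
            # Palindromic words are reported once, on the forward strand only.
            if rc in allowed and rc != window:
                hits.append((start, "-", rc))
    return hits
```

Hits produced this way can then be compared with the coordinates of genes and repeats to position each site relative to other genomic structures.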
Viewpoints • A database that contains experimental data and computational predictions in an integrated manner • XML format for organizing data flow • Possible distributed computations • Possible platform independence (Ruby & Java)
Unified storage for experimental data from different sources • Sources: SELEX, genome, motif models, comparative genomics, footprinting • Processing: filtering of identical and irrelevant motifs, preprocessing • small-BiSMark: XML-based small language for Biological Sequence Markup • Database engine
Identification of optimal binding motifs using stochastic optimization • SeSiMCMC: Gibbs-sampler-based algorithm for identification of binding motifs • Multiple local alignment of candidate genomic sequences • Optimization of the motif length • Modeling of dyads (palindromes and tandem repeats) in motif structures • Priors for absent sites and for sites on the forward and backward DNA strands • Figures: known binding site motif (SELEX, a sequence logo for the Sp1 factor binding site); SeSiMCMC result on a TRANSFAC dataset
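The idea of Gibbs sampling for motif identification can be sketched as below. This is a deliberately simplified site sampler, not SeSiMCMC itself: it assumes a fixed motif width, exactly one site per sequence, the forward strand only, and a uniform background, so the length optimization, dyad modeling and strand priors listed above are all left out. Function names and default parameters are illustrative assumptions.

```python
import random

ALPHABET = "ACGT"


def build_counts(sequences, positions, width, skip):
    """Column-wise base counts (with pseudocount 1) from all sites except `skip`."""
    counts = [{b: 1.0 for b in ALPHABET} for _ in range(width)]
    for idx, (seq, pos) in enumerate(zip(sequences, positions)):
        if idx == skip:
            continue
        for j in range(width):
            counts[j][seq[pos + j]] += 1.0
    return counts


def gibbs_motif_search(sequences, width, iterations=200, seed=0):
    """Return one sampled site start per sequence (upper-case ACGT input)."""
    rng = random.Random(seed)
    positions = [rng.randrange(len(s) - width + 1) for s in sequences]
    for _ in range(iterations):
        for i, seq in enumerate(sequences):
            # Motif model built from every other sequence's current site.
            counts = build_counts(sequences, positions, width, skip=i)
            totals = [sum(col.values()) for col in counts]
            weights = []
            for start in range(len(seq) - width + 1):
                w = 1.0
                for j in range(width):
                    # Likelihood ratio of motif model versus uniform background.
                    w *= (counts[j][seq[start + j]] / totals[j]) / 0.25
                weights.append(w)
            # Resample the site in sequence i in proportion to its weight.
            positions[i] = rng.choices(range(len(weights)), weights=weights)[0]
    return positions
```

Because the procedure is stochastic, several runs with different seeds are normally compared and the alignment with the best score is kept.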
SeSiMCMC sampler page
Identification of spaced and overlapping motifs
Regulatory regions: different types of architecture • Overlapping and spaced binding sites • Homotypic clusters • Clusters aligned with promoters (figure: ArcA sites and promoter)
Statistical validation of selectivity and identification of optimal binding motifs • AhoPro algorithm for calculating the P-value of site binding • Comparison of different binding motifs: using different motif models, selection of the optimal motif • Direct calculation of motif selectivity for different specificity levels • Supported motif models: positional weight matrices, word lists, IUPAC strings
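Two of the supported motif models are directly related: an IUPAC string is a compact notation for a word list. The sketch below expands one into the other using the standard IUPAC nucleotide codes; the function name is an illustrative assumption.

```python
from itertools import product

# Standard IUPAC nucleotide ambiguity codes.
IUPAC = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "R": "AG", "Y": "CT", "S": "CG", "W": "AT", "K": "GT", "M": "AC",
    "B": "CGT", "D": "AGT", "H": "ACT", "V": "ACG", "N": "ACGT",
}


def iupac_to_words(pattern):
    """Expand an IUPAC string into the list of concrete DNA words it allows."""
    choices = [IUPAC[symbol] for symbol in pattern.upper()]
    return ["".join(word) for word in product(*choices)]
```

For example, `iupac_to_words("ACY")` gives the two words ACC and ACT. The expansion grows exponentially with the number of ambiguous positions, so it is only practical for short or weakly degenerate motifs.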
AhoPro algorithm: exact P-value calculation • Aho-Corasick pattern matching automaton, e.g. for H1 = {ACC, ACT}, H2 = {AT, CT} (figure: automaton from the root with transitions labelled A, C, T) • Each state at the i-th step is a class C_i(r1, r2; q), where i is the step (text length), r1 is the number of occurrences of the first motif, r2 is the number of occurrences of the second motif, and q is the longest suffix in the prefix closure of H1 ∪ H2 • Probabilities to be at each state are tracked (probability transducer)
AhoPro: p-value calculator. We developed an algorithm for exact p-value calculation for multiple occurrences of multiple motifs. Boeva, V., J. Clement, M. Regnier, M.A. Roytberg, and V.J. Makeev. 2007. Exact p-value calculation for heterotypic clusters of regulatory motifs and its application in computational annotation of cis-regulatory modules. Algorithms Mol Biol 2: 13.
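The principle of propagating probabilities over an Aho-Corasick automaton can be sketched for a reduced case: a single motif given as a word list, an i.i.d. letter model, and the probability of observing at least a given number of occurrences in a random text. This is an assumption-laden simplification and not the published AhoPro code, which tracks occurrence counts for several motifs at once. All function and parameter names are illustrative.

```python
from collections import deque


def build_automaton(words, alphabet="ACGT"):
    """Aho-Corasick automaton with complete transitions.

    Returns (goto, out): goto[s][ch] is the next state, out[s] is the number
    of motif words that end when state s is entered.
    """
    goto, out = [{}], [0]
    for word in words:
        s = 0
        for ch in word:
            if ch not in goto[s]:
                goto.append({})
                out.append(0)
                goto[s][ch] = len(goto) - 1
            s = goto[s][ch]
        out[s] += 1
    fail = [0] * len(goto)
    queue = deque()
    for ch in alphabet:
        if ch in goto[0]:
            queue.append(goto[0][ch])
        else:
            goto[0][ch] = 0
    while queue:
        s = queue.popleft()
        for ch in alphabet:
            if ch in goto[s]:
                t = goto[s][ch]
                fail[t] = goto[fail[s]][ch]
                out[t] += out[fail[t]]  # words ending here via suffix links
                queue.append(t)
            else:
                goto[s][ch] = goto[fail[s]][ch]
    return goto, out


def prob_at_least(words, text_length, min_hits=1, probs=None):
    """Exact P(at least min_hits motif occurrences in an i.i.d. random text)."""
    alphabet = "ACGT"
    probs = probs or {ch: 0.25 for ch in alphabet}
    goto, out = build_automaton(sorted(set(words)), alphabet)
    # dist maps (automaton state, hits capped at min_hits) to probability.
    dist = {(0, 0): 1.0}
    for _ in range(text_length):
        nxt = {}
        for (state, hits), p in dist.items():
            for ch in alphabet:
                t = goto[state][ch]
                h = min(min_hits, hits + out[t])
                nxt[(t, h)] = nxt.get((t, h), 0.0) + p * probs[ch]
        dist = nxt
    return sum(p for (_, h), p in dist.items() if h == min_hits)
```

Capping the hit counter at `min_hits` keeps the number of tracked classes proportional to the automaton size times the threshold, which is what makes an exact calculation feasible for long texts.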
A data flow for motif model construction (example: Sp1 binding site) • Sources: footprinting results, SELEX, ChIP-chip • Data types: genome-mapped sites with correct flanking sequences; short sites or site parts; raw long sequences • Mask annotations in the diagram: "may be used as mask", "to be used as initial mask" • Two parallel branches: SeSiMCMC, then additional motif length estimation, then motif model • Final step: verification
Obtaining clean data from specific sources • Using TRANSFAC as the base data source for binding sites of a selected factor • small-BiSMark • Database engine
A verification procedure for a created motif model • New motif model: processed experimental data (via SeSiMCMC) or a newly discovered motif (by SeSiMCMC or ScanSeq) • Testing sequence set: wisely chosen set of motif-containing sequences • AhoPro: selectivity testing, choosing the optimal motif specificity
Comparative motif analysis • Testing sequence set: footprinting, ChIP-chip data, randomly generated set • Models compared: new motif model, known motif models 1, 2, 3 • AhoPro: selectivity testing, comparative analysis, selecting the best motif model
Genome-wide motif distribution mapping • Models: new motif model, known motif models 1, 2, 3 • Sites of different quality, positioned globally on the chromosome • Possible clustering of sites: different models for one factor, best models for different factors • Positioning within specific DNA regions: CRM, CpG islands, etc.
Distributed computations support • «Theatre manager»: possible management of multiple Opera Houses for grid computing support (request redirecting and resource balancing only) • «Opera House»: multi-process remote task execution control service on a single physical machine • «Opera libretto»: a specified scenario to be executed • Main database
Overview of the technical realization of the complex • Database level: MySQL • Application level: SeSiMCMC and AhoPro, high-speed C++ code • Server level: Ruby-powered cross-platform DRb-based server • Web-interface level: Ruby-based CGI, Ruby on Rails in future • Data-workflow level: Ruby-powered cross-platform scenario scripts • small-BiSMark processing: Ruby and Java-based tools (REXML, JAXP, SAXON)
THE END
Acknowledgments • GosNIIgenetika group: Vsevolod Makeev, Alexander Favorov, Elizaveta Permina, Valentina Boeva, Ivan Kulakovsky, Dmitry Malko • Financial support: Russian Federation State Innovation Project, Russian Foundation for Basic Research, DST India
Biological data analysis components: DNA analysis • Basio: large-scale sequence analysis, compositional segmentation • TandemSWAN: tandem repeats in DNA sequences • SeSiMCMC: DNA motif identification • Oligomeasure: DNA structure from DNA sequence
TandemSWAN • Tandem repeats with substitutions but without indels, with control of the statistical significance of the repeat • Finds micro- and minisatellites with substitutions • Example: tttatttatttatttatttatttatttatttatttatttatttatttatttatttatttattta
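Since indels are excluded, a tandem repeat with period p shows up as agreement between a sequence and a copy of itself shifted by p. The sketch below uses this to pick the best period. It is a naive illustration of the idea and not the TandemSWAN algorithm: it omits the statistical significance control, and the function names and the `max_period` default are assumptions.

```python
def period_match_fraction(seq, period):
    """Fraction of positions i where seq[i] equals seq[i + period]."""
    pairs = len(seq) - period
    if pairs <= 0:
        raise ValueError("period must be shorter than the sequence")
    matches = sum(1 for i in range(pairs) if seq[i] == seq[i + period])
    return matches / pairs


def best_period(seq, max_period=10):
    """Period with the highest match fraction; ties go to the shorter period."""
    max_period = min(max_period, len(seq) - 1)
    return max(
        range(1, max_period + 1),
        key=lambda p: (period_match_fraction(seq, p), -p),
    )
```

Applied to a perfect ttta repeat such as the one above, the best period is 4 with a match fraction of 1.0; a substitution lowers the fraction slightly without shifting the frame, which is why substitutions are easier to tolerate than indels.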
BASIO: BAyesian Segmentation Information Optimizer • Pipeline components: Split (sequence preprocessing); Basio (basic segmentation algorithm); Filter (remove short or redundant segments); Report (select the appropriate output format, format the output) • Input: sequence; output: list of segments • Parses DNA into segments with a uniform composition • Uses Bayesian optimization over all possible segment configurations • Uses the Bayesian Information Criterion (BIC) to control segmentation resolution • Example: atcatatca|ggcggcgcagccgcagcc|tctcttcttc
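How BIC controls segmentation resolution can be shown with a single boundary. The sketch below accepts a split only if the two-segment model has a lower BIC than the one-segment model. It is a simplified stand-in and not the Basio algorithm, which optimizes over all segment configurations. The parameter counts (3 free base frequencies per segment for a 4-letter alphabet, plus 1 for the boundary) and the function names are assumptions of this sketch.

```python
from math import log


def segment_log_likelihood(seq):
    """Maximum log-likelihood of a segment under its own base composition."""
    n = len(seq)
    total = 0.0
    for base in set(seq):
        count = seq.count(base)
        total += count * log(count / n)
    return total


def best_single_split(seq):
    """Return the boundary position that minimises BIC, or None if no split helps.

    BIC = -2 * logL + k * ln(n), with k = 3 for one segment and
    k = 3 + 3 + 1 = 7 for two segments and their boundary.
    """
    n = len(seq)
    best_bic = -2 * segment_log_likelihood(seq) + 3 * log(n)
    best_cut = None
    for cut in range(1, n):
        log_l = segment_log_likelihood(seq[:cut]) + segment_log_likelihood(seq[cut:])
        bic = -2 * log_l + 7 * log(n)
        if bic < best_bic:
            best_bic, best_cut = bic, cut
    return best_cut
```

The penalty term is what prevents over-segmentation: a boundary is kept only when the gain in likelihood outweighs the cost of the extra parameters.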
SeSiMCMC: Sequence Similarity Markov Chain Monte Carlo • Gibbs-sampler-based algorithm for identification of binding motifs • Multiple local alignment of candidate genomic sequences • Optimal identification of the motif length • Analysis of symmetries in motif structures • Priors for absent sites and for sites on the forward and backward DNA strands • Figures: known binding site motif (SELEX, a sequence logo for the Sp1 factor binding site); SeSiMCMC result on a TRANSFAC dataset
ALEX: Alignment of Exons. Identifies exons in a genomic alignment. Example: CTGACGCACAGACCCAAGTGACGACGAGGCCGA CGGACGGACAGACCCAAGTGACGACGAGGCCGA
PROTEIN ANALYSIS • Struswer: Smith-Waterman aligner that takes secondary structure into account • Prophet: secondary structure predictor based on discriminant analysis • PSIC: multiple alignment with homologs
STRUSWER: STRUcture extension of Smith-Waterman alignER. Alignment of protein sequences with reference to their secondary structure
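One way to let a Smith-Waterman aligner take secondary structure into account is to add a bonus whenever the aligned residues share the same structure label. The sketch below does this with a linear gap penalty and returns only the best local score. The scoring scheme, parameter values and function name are assumptions for illustration; the actual STRUSWER scoring is not described on the slide.

```python
def structure_aware_sw(seq_a, ss_a, seq_b, ss_b,
                       match=2, mismatch=-1, gap=-2, ss_bonus=1):
    """Best local alignment score with a bonus for matching structure labels.

    ss_a and ss_b are per-residue secondary structure strings (for example
    H, E, C) of the same lengths as seq_a and seq_b.
    """
    if len(seq_a) != len(ss_a) or len(seq_b) != len(ss_b):
        raise ValueError("each sequence needs one structure label per residue")
    rows, cols = len(seq_a) + 1, len(seq_b) + 1
    score = [[0] * cols for _ in range(rows)]
    best = 0
    for i in range(1, rows):
        for j in range(1, cols):
            s = match if seq_a[i - 1] == seq_b[j - 1] else mismatch
            if ss_a[i - 1] == ss_b[j - 1]:
                s += ss_bonus
            score[i][j] = max(
                0,                        # start a new local alignment
                score[i - 1][j - 1] + s,  # align the two residues
                score[i - 1][j] + gap,    # gap in seq_b
                score[i][j - 1] + gap,    # gap in seq_a
            )
            best = max(best, score[i][j])
    return best
```

With these defaults, identical residues in matching structure score 3 per position and identical residues in different structure score 2, so structurally consistent alignments are preferred.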
Protein Secondary Structure Prediction: PROPHET
RNA-MBFS (RNA MultyBranch-Free Structures). Creates an optimal RNA structure without branching
Integration at the level of computation and data • Easily accessible via web interface • Integration at the data level • Support for distributed computation on clusters and local networks • Cross-platform
Building complex computational applications • Individual scenarios can be created for any special task • Pipelining support for computational conveyors • Simple XML format for scenario and conveyor descriptors
Individual user spaces and profiles • Individual user account support • Individual result storage and file library
Easy remote administration via web interface
What’s under the hood: technologies and software tools used • MySQL 5 database for result and user space storage • JSP for the web interface • Apache Tomcat 5 JSP/Servlet container • Java 5 and RMI for the distributed computations server and node software
Acknowledgements • Financial support: Russian Federation State Contract № 02.434.11.100 (Intellectual technologies 2), Prof. Tumanyan V.G.; Russian Academy of Sciences project in Molecular and Cellular Biology • Contributors: Institute of Mathematical Problems of Molecular Biology (Pushchino, Moscow Region, Russia); Voronezh State University, Voronezh, Russia; State Research Center of Genetics and Selection of Industrial Microorganisms, GosNIIgenetika, Moscow, Russia