10 likes | 115 Views
A1Y A2X SPECIES X. A1Y A2Y SPECIES Y. Database Schema of Colipage. Proteins. Elementary families. Paralogous modules. Groups of elem. families. PID gene name length function sequence paralogue Y/N famnumber. famnumber number of members list of members group number. pidmod
E N D
A1Y A2X SPECIES X A1Y A2Y SPECIES Y Database Schema of Colipage Proteins Elementaryfamilies Paralogous modules Groups of elem. families PID gene name length function sequence paralogue Y/N famnumber famnumber number of members list of members group number pidmod PID gene name length function famnumber group number group number number of families list of families famnumber number of members gene name length function pidmod PID gene name length function PID1 PID2 gene name1 gene name 2 length1 length2 position1 position2 alignment1 aligment2 pam distance pam variance identify module1 module2 famnumber1 famnumber2 famnumber cluster number group Y/N number of members list of members group number pidmod PID gene name length function elem famnumber cluster number module class PID gene name modf famnbf groupnrf mod1 famnb1 groupnr1 mod2 famnb2 groupnr2 mod3 famnb3 groupnr3 PID gene name group number modular type nb of modules mod1 cluster1 mod2 cluster2 mod3 cluster3 modf clusterf pidmod PID gene name length function Refined families Refined families Modules in prot Modular struct Pairs Main experimental step Programme Language Written by 1. Collecting protein dataset and translation in SGML language sgml C Karim Abou-Merhy A Time 2. Detecting homologous proteins Darwin AllAll Maple and C Renaud De Rosa A1A2 Speciation Speciation 3. Detecting homologous modules Module Maple Hichem Arfaoui 4. Assembling homologous modules in families Families C Karim Abou-Merhy 4a. Grouping families sharing adjacent unrelated modules AdjacentFamilies C Adrien BazureauKarim Abou-Merhy • A1X, A2X, A1Y and A2Y homologue • A1X and A1Y orthologue • A1X and A2X paralogue • A1Y and A2Y paralogue 5. Inferring distant modules in multimodular proteins after grouping families and clustering them in new refined families Module SorterModule ClusterSortClust CCC++ Stéphanie LangevinIsabelle Montaland 6. Challenging deduced modules by computing distance matrix and evolutionary tree for each refined family DistTree C Adrien Bazureau Defining the different kinds of homology • Among the unique (non homologous) genes • The species-specific • The orthologues • Among the paralogues • The species-specific • The orthologues PROTEINS pid_prot gene_name length_prot funct seq right_end left_end orientation species 1 PAIRS position1 position2 alignment1 alignment2 pam_distance pam_variance identity Is compared to User’s queries page : in the text area you can directly enter your own SQL query . 0..1 1 Is a module of 1 MODULES pid_mod type_mod origin_mod family_mod 0..n Web site main page The left bar provides access to the functionalities of the whole site and is always visible. Structure of the Genopage database Find out all orthologous proteins made of two modules and separated by an evolutionary distance bounded by 50 and 75 PAM units Select pid_prot1, pid_prot2, p1.species as species1, p2.species as species2 From PAIRS, PROTEINS P1, PROTEINS P2 Where pam_distance >= 50 And pam_distance <= 75 And pid_prot1 in (Select pid_prot From MODULES Where MODULES.type_mod = 'mod12' Or MODULES.type_mod = 'mod22' ) And pid_prot2 in (Select pid_prot From MODULES Where MODULES.type_mod = 'mod12‘ Or MODULES.type_mod = 'mod22' ) And P1.pid_prot = pid_prot1 And P2.pid_prot = pid_prot2 And P1.species <> P2.species Local Client (database updating and access to site) PostgreSQL Server ProC Local Client (direct access to web site ) IGM HTML PHP IGM Users Web Server (Apache) HTML : data flow External client (access to web site only) What is the modular structure of each protein having an unknown function ? Select * From MODULES, PROTEINS Where PROTEINS.pid_prot = MODULES.pid_prot And PROTEINS.funct = ‘unknown’ A Database of all protein modules encoded by completely sequenced genomes Sarah Cohen Boulakia, Christine Froidevaux, Emmanuel Waller (LRI) Bernard Labedan (IGM) Laboratoire de Recherche en Informatique UMR CNRS 8623 Institut de Génétique et de Microbiologie UMR CNRS 8621 Abstract Comparative analysis of the whole set of proteins encoded by completely sequenced genomes allowed to identify segments of homology (modules) in many proteins. These modules are then clustered into families in order to progressively trace back the history of present-day proteins. We plan to assemble all these modules and their families in a relational database. In this paper, we present a brief description of the structure and functionalities of this base, Genopage. The database schema contains all required information in a concise structure built using PostgreSQL. Yet, it allows to pose complex queries which can be efficiently answered, providing relevant understanding of the biological phenomena. • Experimental Design • Finding out all segments of homology (modules) • Clustering of modules in family (numbering the ancestral modules) • Establishing the modular structure for each protein • Defining the orthologous relationships for unique and paralogous • genes in each genome Intra genomic data Inter genomic data Genopage What is Colipage? Colipage is the database of paralogous genes present in Escherichia coli. Colipage is based on the work initiated by Monica RILEY (MBL, Woods Hole,USA) and Bernard LABEDAN (IGM, Université Paris-Sud, Orsay, France). Colipage is based on an approach which has been designed in the group of "Evolution Moleculaire et Genomique" leaded by Bernard Labedan in order to find out all modules present in the paralogous proteins. What is Genopage ? Genopage is an extension of Colipage. Genopage is a database of the parental relationships found between genes present in microbial genomes which have been completely sequenced. Our aim is to assemble the maximum of intergenomic comparisons in order that, at the end, Genopage will display for each genome - beside the list of the genes which are unique to this genome - the list of the genes which have homologues in either the same genome (paralogues) or another genome (orthologues). Accordingly, four classes of genes will be delineated for each genome : • DBMS : File Maker Pro • Easy filling of the database • Able to store the whole set of data for one proteome such as E. coli • Allows simple queries • Simple CGI development tools (significant for the Web site) A new architecture Web site www-colipage.igmors.u-psud.fr The whole data storage is done with a PostgreSQL server. Data are updated using file import programs in ProC. These files are generated by sequences analysis algorithms. Data are accessible on a Web site that is used both for internal and external accesses. Web site HTML pages are generated from database using PHP technology. Now, we want to manage Very complex data Very numerous data Very expressive queries This query is equivalent to the one expressed in the form below. • Our choices • PostgreSQL (close to Oracle) is a DBMS (DataBase Management System) • able to store very numerous data (up to several hundred bacterial genomes) • able to formulate complex queries using full SQL expressiveness • PHP : language to generate dynamically HTML pages Increased need of visualization Example of queries • Perspectives • Improving the biological significance • Adding more information on genome organization • Introducing new data • gene context, lateral gene transfer (LGT), • species relationships, phylogeny of procaryotes… • Adding more information on proteins • (i) genealogical trees for each family of modules • (ii) more dynamical graphical representation of • the modular structure • Evolution of Genopage towards a knowledge base • Introducing more precise terms using an ontology approach • to conceive a knowledge base. Conclusion We think that Genopage might be useful not only to the community of genomicists but to many other biologists. This database is expected to further evolve into a knowledge base in order to become even more valuable. Protein form Visualization of proteins modular structure Université Paris Sud, Centre d’Orsay