Unveiling new biological knowledge from multiresolution structural proteomics data: A Data Base and Pattern Recognition Approach
José María Carazo
BioComputing Unit, Centro Nacional de Biotecnología, Madrid, Spain

(Who am I?) Research areas:
• Helicase structure/function analysis
• Image processing
• Structural databases

Hypothesis: Medium-resolution EM data represent a rich biological information resource. Therefore:
• Step 1) Keep them organized (institutionally) in a new structural database (do not lose them; keep them organized and accessible)
• Step 2) Extract the macro-architecture features that are now appearing (realize the general organizational principles of large assemblies)
• Step 3) Make the “link” to structural proteomics at the amino acid level (go from “density blobs” to defined protein structures; “connect” atomic-resolution information with “medium-resolution” information)
• Step 4) Integrate this new structural information with other information sources

Step 1: Motivated by the impetus in cryo-EM: “Construct the EM Data Base (EMDB)”
• The work started in 1997 with the “BioImage” project of the EU, a pilot study among research groups
• The work continued through 2000-2003 in the IIMS project, creating the EM Data Base as part of the core facilities of the EBI (European Bioinformatics Institute)

EMDB
• IIMS: to integrate the results of three-dimensional electron microscopy (3D-EM) with models from X-ray and NMR methods.
• Part of the MSD (Macromolecular Structure Database)
The project is funded by the European Commission as the IIMS, contract no. QLRI-CT-2000-31237, under the RTD programme "Quality of Life and Management of Living Resources"

EMDB
• Relational Data Model
• Fully integrated in the MSD, together with PDB data
• XML-based Data Model
• EMDep, the Electron Microscopy Deposition Tool
• Dictionary driven

IIMS Workshop, November 15-16, 2002
We note that the European Bioinformatics Institute (EBI) through the Macromolecular Structure Database (MSD) now provides a permanent resource for the deposition of three-dimensional maps derived by electron microscopy (see www.ebi.ac.uk/msdsrv/emdep). In addition, coordinate data derived from these maps are deposited in the PDB archive for macromolecular structural data. We intend to use these facilities for the routine deposition of maps and coordinate data produced by our work. These databases are open to the international community and will become part of a family of linked databases in biomedical research. We encourage our colleagues to follow our example by submitting maps, at the stage of publication, to these archival databases.

Sending data to EMD

… more than a hundred EM structures are now being published in the journals in a typical year. Without EMDB, these data would not be archived for future general use. So the size and usefulness of the database are likely to increase dramatically. Nature Structural Biology is strongly supportive of the general principle that scientific data should be professionally maintained and freely accessible, and so its editors will from now on encourage scientists to deposit their work in EMDB when papers describing EM structures are published in the journal.

Step 2: Discover biological knowledge: “Extract information on general organizational principles”
• Goal: Since EM provides information on (potentially) quite large specimens, devise ways to automatically extract topological and geometrical information about the assemblies
• Driving principle: To close the gaps between different structure determination techniques such as X-ray methods and cryo-EM, develop techniques able to work transparently across multiple resolution levels (here come the “alternative representations”)

FEMME Database
Purpose: to store, in a universal data model, the topological and geometric features of 3D-reconstructed macromolecules regardless of the resolution achieved.
Methodology: Vector quantization and alpha-shape representation theory (J. Struct. Biol., 2004)
Final aim: Automatic detection of general organizational principles. Query by content in structural databases.

Methodology
Original dataset:
• Set of multimeric proteins coming from the PDB/PQS databases (high resolution): macromolecular topology given by the atomic coordinates (Liang et al. 1998)
• 3D-EM (medium resolution): macromolecular topology given by the selection of a set of pseudo-atoms (De-Alarcón et al. 2002)
Both feed the alpha complex, which is used for the identification, extraction and characterisation of channels/cavities/(protrusions).
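As a rough illustration of how a set of pseudo-atoms can be selected from a medium-resolution map by vector quantization, here is a minimal sketch. It uses plain density-weighted k-means as a stand-in quantizer; the quantizer actually used in the cited work may differ, and the function name `pseudo_atoms` and the input layout are assumptions made for this example only.

```python
import random

def pseudo_atoms(voxels, k, iterations=50, seed=0):
    """Place k pseudo-atoms in a density map by density-weighted k-means.

    voxels: list of ((x, y, z), density) pairs with density > 0.
    Returns a list of k (x, y, z) code vectors (the pseudo-atoms).
    """
    rng = random.Random(seed)
    # Start from k voxel positions picked at random.
    centers = [pos for pos, _ in rng.sample(voxels, k)]
    for _ in range(iterations):
        sums = [[0.0, 0.0, 0.0, 0.0] for _ in range(k)]  # x, y, z, total weight
        for pos, weight in voxels:
            # Assign each voxel to its nearest pseudo-atom.
            nearest = min(
                range(k),
                key=lambda i: sum((a - b) ** 2 for a, b in zip(pos, centers[i])),
            )
            acc = sums[nearest]
            for d in range(3):
                acc[d] += weight * pos[d]
            acc[3] += weight
        # Move each pseudo-atom to the density-weighted centroid of its voxels.
        new_centers = [
            tuple(acc[d] / acc[3] for d in range(3)) if acc[3] > 0 else centers[i]
            for i, acc in enumerate(sums)
        ]
        if new_centers == centers:
            break
        centers = new_centers
    return centers
```

The resulting points could then be handed to an alpha-shape library in place of atomic coordinates, which is what lets the same feature extraction run at any resolution.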

FEMME contents
• Around 140 entries corresponding to alpha-shape representations of macromolecules and macromolecular structural features from data at any resolution level
• Detailed description of the number and kind of structural features contained in the macromolecule
• One of the possible applications: detection of shape similarities among complexes

Final aim
Structurally characterised macromolecule (examples shown: tricorn protease, CCT, ribosome, actin)
• Several descriptors of the macromolecule structure: shape, size, protrusions, channels, cavities, ...
• FEMME database storage
• Query by content

Step 3: Discover biological knowledge: “Make the ‘link’ at the amino acid level” (quantitative “visualization” of fine features)
• Goal: Bridging from atomic resolution to medium resolution
• Motivation: At some point the link from “density blobs” to defined amino acids has to be made, in order to “attach” biochemical and functional information to the medium-resolution structures.
• Note: There are many substeps here; we will concentrate on “superfamily recognition” (in cooperation with other groups in the field, such as Chiu’s group)

Superfamily recognition
A working definition of superfamily recognition (in order of increasing difficulty):
• Identification of the SSEs (secondary structure elements) of a protein
• Their spatial distribution and connectivity (topology)
• Assignment of a structural family to the protein
• Assignment of a sequence family to the protein
• Assignment of a function
Information that can be used:
• Protein sequence/atomic-resolution information: a range of methods (neural networks, threading, etc.)
• Medium-resolution views of the protein = 3D-EM maps
• Is surface information enough to detect a fold?
• Can we detect the fold present in a 3D-EM map just by docking other known fold maps into it?
• Can some form of flexible docking using SSEs be of help?

What are we doing?
• Is surface information enough to help assign a superfamily?
  Application of the spin-image-representation method by De Alarcon, P.A. and Pascual-Montano, A.
• Can we assign a superfamily in a 3D-EM map just by docking other known fold maps into it?
  Application of the COAN docking method by Volkmann, N. within a new Bayesian scheme
• Can we assign a superfamily by some form of flexible docking, possibly using SSEs?
  Work in progress

Superfamily assignment using surface information
• Surface information can reveal similarity between different folds.
• Surface comparison can be performed using techniques derived from the field of computer vision.
• Our studies reveal that folds that are similar according to the CATH classification (belonging to the same superfamily) also have similar surfaces at resolutions ranging from 8 to 12 Å.
• Similarities in the surface are related to similarities in the amino acid sequence of the fold.
• The surface information can be used to detect folds or entire proteins in large assemblies.

Spin image representation (s.i.r.) of 3D-EM maps
• s.i.r. principle: project every point x of the surface with respect to the plane defined by a point p and its normal n.
[Figure, panels A-D: spin-image representation of a 3D object]
• A 3D object with a point and its normal n.
• Points of a surface projected into a plane.
• Spin image obtained from the binning of the projected surface points.
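The projection and binning described on this slide can be sketched as follows. This follows the usual spin-image definition from computer vision, where each surface point is described by its radial distance from the normal line (alpha) and its height above the tangent plane (beta). The bin size, image size and function name are assumptions for illustration, not parameters taken from the talk.

```python
import math

def spin_image(points, p, n, bin_size, size):
    """2D histogram of surface points in the (alpha, beta) frame of the oriented point (p, n).

    alpha: distance of x from the line through p along n
    beta:  signed height of x above the tangent plane at p
    """
    norm = math.sqrt(sum(c * c for c in n))
    n = tuple(c / norm for c in n)  # make sure the normal has unit length
    image = [[0] * size for _ in range(size)]  # rows index beta, columns index alpha
    for x in points:
        d = tuple(a - b for a, b in zip(x, p))
        beta = sum(a * b for a, b in zip(d, n))
        alpha = math.sqrt(max(sum(a * a for a in d) - beta * beta, 0.0))
        col = int(alpha // bin_size)
        row = int((beta + size * bin_size / 2) // bin_size)  # beta = 0 sits in the middle row
        if 0 <= row < size and 0 <= col < size:
            image[row][col] += 1
    return image
```

Because alpha and beta do not depend on the angle around n, the image is unchanged when the object is rotated about the normal, which is what makes these images comparable across objects.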

Applications: partial matching
• Local patches of the query object can be highlighted according to local similarity with objects in the database.
[Figure: query plane with coloured patches, followed by the 1st, 2nd and 3rd matches]

Proteins instead of airplanes... (dealing with multiple domains)
• Possibility of docking isolated domains into entire maps
• Take into account the surface information
• Speed
• Modularity

Fold recognition using fitting information
• Docking information can be used to detect the CATH superfamily of a single fold present in an electron microscopy map.
• Repeated cross-correlation experiments and a Bayesian probability framework have been used.
• The results show that the use of multiple dockings can overcome the uncertainty when the fold present in the 3D-EM map is unknown.
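To make the cross-correlation score concrete, here is a minimal sketch of a normalized cross-correlation between two density maps, plus an exhaustive translation search on a 1D profile. This is a generic score and a toy search, not the COAN method; the function names and the flat-list map layout are assumptions for this example.

```python
import math

def normalized_cross_correlation(map_a, map_b):
    """Correlation coefficient of two density maps given as flat lists of equal length."""
    if len(map_a) != len(map_b) or not map_a:
        raise ValueError("maps must be non-empty and have the same number of voxels")
    mean_a = sum(map_a) / len(map_a)
    mean_b = sum(map_b) / len(map_b)
    num = sum((a - mean_a) * (b - mean_b) for a, b in zip(map_a, map_b))
    den = math.sqrt(
        sum((a - mean_a) ** 2 for a in map_a) * sum((b - mean_b) ** 2 for b in map_b)
    )
    return num / den if den > 0 else 0.0  # a flat map carries no shape signal

def best_offset(target, probe):
    """Slide the probe along a 1D target profile; return (offset, score) of the best match."""
    scored = [
        (normalized_cross_correlation(target[o:o + len(probe)], probe), o)
        for o in range(len(target) - len(probe) + 1)
    ]
    score, offset = max(scored)
    return offset, score
```

A real docking run searches over 3D translations and rotations; the set of scores obtained from docking several members of a superfamily is what the Bayesian step then works on.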

Fold recognition using docking information and Bayesian probability
Bayesian probability: $P(f_i \mid D) = \dfrac{P(D \mid f_i)\,P(f_i)}{\sum_j P(D \mid f_j)\,P(f_j)}$
• $P(f_i \mid D)$: the probability of having fold $i$ given a density map $D$.
• $P(f_i)$: background probability of having an individual fold $i$, computed as the frequency of realizations of that fold in the total data set of structures to dock.
• $P(D \mid f_i)$: probability of having a density map given a fold $i$, computed as follows:
  • A set of elements of the CATH superfamily that represents the fold is docked to the density map.
  • The probability that the density map belongs to that fold is computed as the probability that the sample values of cross-correlation came from the same population as the sample of cross-correlations from the elements of the CATH superfamily.
  • This test of homogeneity is done by a chi-squared test.
The fold with the highest value of $P(f_i \mid D)$ is assigned to the map.
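The assignment rule can be sketched in a few lines. The sketch below combines priors and likelihoods with Bayes' rule and also computes a chi-squared homogeneity statistic for two binned samples of cross-correlation values. Turning that statistic into a probability requires a chi-squared distribution function, which is left out here; the function names and the dictionary layout are assumptions for illustration.

```python
def fold_posteriors(priors, likelihoods):
    """P(fold | map) for every fold, from P(fold) and P(map | fold)."""
    joint = {fold: priors[fold] * likelihoods[fold] for fold in priors}
    total = sum(joint.values())
    if total == 0:
        raise ValueError("all folds have zero probability")
    return {fold: value / total for fold, value in joint.items()}

def chi_squared_statistic(counts_a, counts_b):
    """Homogeneity statistic for two samples histogrammed on the same bins."""
    total_a, total_b = sum(counts_a), sum(counts_b)
    if total_a == 0 or total_b == 0:
        raise ValueError("both samples must contain observations")
    grand = total_a + total_b
    stat = 0.0
    for a, b in zip(counts_a, counts_b):
        pooled = a + b
        if pooled == 0:
            continue  # empty bins contribute nothing
        expected_a = pooled * total_a / grand
        expected_b = pooled * total_b / grand
        stat += (a - expected_a) ** 2 / expected_a + (b - expected_b) ** 2 / expected_b
    return stat

def assign_fold(priors, likelihoods):
    """Return the fold with the highest posterior probability."""
    posteriors = fold_posteriors(priors, likelihoods)
    return max(posteriors, key=posteriors.get)
```

A small statistic means the two cross-correlation samples look like draws from one population, which corresponds to a high likelihood for that fold.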

Fold recognition using docking information. Results:
• At 12 Å resolution the information content is a highly discriminating measure.
• 8 of 9 experiments detect the corresponding family with the best value.
[Figure: example at 12 Å]

Fold recognition: extension of the work to multidomain maps
Can a single fold be detected in the entire electron microscopy map? The cross-correlation approach fails in many cases.
[Figure: correct position vs. position found by cross-correlation]

Fold recognition: flexible docking
• By flexible docking we mean deforming certain points in the fold to better resemble what we see in the medium-resolution density.
• The points chosen for deformation are those located at the ends of the secondary structure elements of the fold.
• To allow for deformations we need to consider different alternatives for each point and choose those that best respect the architecture of the fold superfamily, although it does not need to be exactly the same.

Step 4: Discover biological knowledge: “Integrate information”
• Goal: Integrate structural information at all levels of resolution with other sources of information
• Means: Semantic mediation over heterogeneous data sources
• Obviously, this is a necessary step towards powerful new data mining approaches, and in data mining the “user” should be in the analysis loop via some graphical interface

Motivating example: DNA-binding macromolecules
Multimeric structures containing the DNA clamp fold and with a central channel
[Diagram: PQS database (multimeric structure), CATH/SCOP databases (DNA clamp fold), FEMME database (central channel)]
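The motivating query amounts to a three-way join across the sources. A minimal sketch, assuming each source has already been reduced to a small in-memory dictionary keyed by a shared entry identifier; the record layouts and the function name are hypothetical and do not reflect the real schemas of PQS, CATH or FEMME.

```python
def dna_clamp_candidates(pqs, cath, femme, fold_name="DNA clamp"):
    """Entries that are multimeric, carry the named fold, and show a central channel.

    pqs:   entry id -> number of chains in the biological assembly
    cath:  entry id -> set of fold names found in the entry
    femme: entry id -> set of structural features detected in the map
    """
    multimeric = {entry for entry, chains in pqs.items() if chains > 1}
    with_fold = {entry for entry, folds in cath.items() if fold_name in folds}
    with_channel = {entry for entry, feats in femme.items() if "central channel" in feats}
    return sorted(multimeric & with_fold & with_channel)
```

A semantic mediator would express the same rule declaratively and resolve the identifier mapping between sources, rather than assuming a shared key as this sketch does.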

Ultimate means: Semantic Data Mediation
• Programmable integrator: interleaves information access and algorithm execution
• Semantic mediator: encodes and executes domain-specific expert rules for data joining

Extended Domain Map in a Structural Biology Context
[Diagram: concept graph whose nodes are linked by Has, Derive and Found_in relations. Nodes: Cavity/Channels, Protrusion, 3D Point (X, Y, Z), Alpha-shape, Curvature, Triangulated Surface, Connectivity, Normal, Properties (area, …), CATH Superfamily, Superfamily detector, Medium-Resolution 3D Image, Fold Instance (name), Fold hunter, Helix hunter, Beta hunter, SSE, My_function, My_Polypeptide chain, My_protein, PDB, PQS, Enzyme Database, InterPro, Swissprot]
Red-framed boxes require visualization tools!!

Current state: PLAN, a Language for a Programmable Integrator
• XML-based language
• XQuery
PLAN example: Retrieve those folds in CATH corresponding to proteins which contain a given InterPro motif (IPR001198)
[Diagram: InterPro (http://www.ebi.ac.uk/interpro), SwissProt matches, BLASTp search, PDB chains, CATH Domain Description File, CATH codes]

W.S.J. Valdar, J.M. Thornton, Protein–Protein Interfaces: Analysis of Amino Acid Conservation in Homodimers. PROTEINS: Structure, Function, and Genetics 42:108–124 (2001)
• the protomer to be studied must form a stable, symmetric complex with one other protomer to which it is identical (or nearly identical), such that the oligomer is homodimeric and the conservation of only one chain need be considered;
• the full wild-type complex must be available in PDB or PQS;
• of all the structures available for the complex, the structure chosen must have the best combination of the following properties: high resolution, inclusion of any bound cofactors that occur naturally, and the inclusion of a ligand similar in size and shape to the natural substrate;
• to enable the robust identification of a diverse set of homologues, the protomer should be represented in CATH;
• the protomer sequence must have non-fragment homologues in SwissProt that are numerous (>10) and diverse (<70% mean pairwise sequence identity) and that, by their annotation, share its function and multimeric state.

Data sources, operations and criteria
• PQS: 1. The oligomer is homodimeric
• CATH: 2. Available in CATH
• BLAST: 3. Group by protein (3a. Numerous distant homologues; 3b. Wild-type protein)
• SwissProt: 4. Share multimeric state
• PDB, ENZYME: 5. Final selection
Operations: collection, filtering
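The numeric part of the homologue criterion quoted from Valdar and Thornton (more than 10 homologues, mean pairwise sequence identity below 70%) is easy to express as a filter. A minimal sketch, assuming pairwise identities have already been computed and stored in a dictionary keyed by unordered pairs; the function names and data layout are hypothetical.

```python
from itertools import combinations

def mean_pairwise_identity(identity, names):
    """Mean percent identity over all unordered pairs of the named sequences."""
    pairs = list(combinations(names, 2))
    return sum(identity[frozenset(pair)] for pair in pairs) / len(pairs)

def passes_homologue_criteria(homologues, identity, min_count=10, max_mean_identity=70.0):
    """True if the homologues are numerous (> min_count) and diverse (< max_mean_identity %)."""
    if len(homologues) <= min_count:
        return False
    return mean_pairwise_identity(identity, homologues) < max_mean_identity
```

The remaining criteria (wild-type complex, shared function and multimeric state) depend on annotation and would need lookups in the corresponding databases.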

PLAN Example (I)
URL constructor:
<QUERY>
  <result>
    LET $x := set("","ipr","IPR001198"),
        $x := set($x,"display","n"),
        $x := set($x,"dmax","20000"),
        $y := constructURL("GET","http://www.ebi.ac.uk/interpro/ISpy",$x)
    RETURN $y
  </result>
</QUERY>
<TRAVERSE>POP</TRAVERSE>
<QUERY>
  <result>
    <DATA NAME="InterProMatches" TYPE="Add">
      RETURN stream()
    </DATA>
  </result>
</QUERY>
Callouts on the slide: wrapper call; internal data buffer (allows XML filtering)
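For readers unfamiliar with PLAN, the URL-construction step has a close analogue in ordinary Python. The sketch below assumes that the successive `set(...)` calls accumulate query parameters and that `constructURL` appends them to the base address of a GET request; that reading is an interpretation of the slide, and `construct_url` is a hypothetical helper, not part of PLAN. No request is sent.

```python
from urllib.parse import urlencode

def construct_url(base, params):
    """Append URL-encoded query parameters to a base address."""
    return f"{base}?{urlencode(params)}"

# Same parameters as in the PLAN example above.
url = construct_url(
    "http://www.ebi.ac.uk/interpro/ISpy",
    {"ipr": "IPR001198", "display": "n", "dmax": "20000"},
)
print(url)
```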

PLAN Example (II)
Working register is…
<XMLBUFFER NAME="InterproMatches" />
Nesting requests:
<WHILE>
  <CONDITION>
    <STACK>
      <CONDITION>NONEMPTY</CONDITION>
    </STACK>
  </CONDITION>
  <DO>
    <TRAVERSE>POP</TRAVERSE>
    <QUERY>
      <result>
        <DATA NAME="spToPdb" TYPE="Add">
          RETURN stream()
        </DATA>
      </result>
    </QUERY>
  </DO>
</WHILE>
<CONSTRUCT>
  <DATA NAME="r1" />
</CONSTRUCT>
<DELETE FILE="./resultFiles/q1_IPR001198.xml" />
<PRINTOUT FILE="./resultFiles/q1_IPR001198.xml" />
Save result data in a file
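The control flow of this second example reads like a loop that pops pending requests from a stack until it is empty and appends each response to a named buffer. A minimal Python sketch of that reading follows; the mapping of PLAN elements to Python statements is an assumption, and `fetch` stands for a hypothetical wrapper call supplied by the caller.

```python
def drain_requests(stack, fetch):
    """Pop requests until the stack is empty and collect the responses in order."""
    results = []
    while stack:                        # <CONDITION>NONEMPTY</CONDITION>
        request = stack.pop()           # <TRAVERSE>POP</TRAVERSE>
        results.append(fetch(request))  # <DATA ... TYPE="Add"> RETURN stream()
    return results
```

In a real integrator `fetch` would call a database wrapper for each request; here any callable works, which keeps the sketch free of network access.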

Final remark: “Infrastructures”
• All our software is public domain, and we have a sustained tradition of making it truly accessible (XMIPP, BPR…)

Acknowledgements
The CNB Biocomputing Unit: L.E. Donate Mikel Valle Carmen San Martin María Gómez Yolanda Robledo Rafael Núñez Yacob Monica Chagoyen Roberto Marabini Alberto Pascual Carlos-Oscar Sanchez Natalia Jiménez-Lozano Javier A. Velázquez-Muriel Pedro Carmona David Elguero Jesus Cuenca
Extramural: The EBI Team, Herbert Edelsbrunner, Wah Chiu’s Lab, SDSC (Gupta’s Lab), Ioannis Kakadiaris’s Lab, Niels Volkmann, Gruss and Cheng Lab, Mark Ellisman Lab (and MANY other interactions)
