320 likes | 453 Views
the European Macromolecular Structure Database (EMSD). http://www.ebi.ac.uk/msd/roadshow.html MSD Roadshow Co-ordinator . Janet Copeland 2 nd November 2005 Oxford University. http://www.ebi.ac.uk/msd/education/Tutorial.html. USERNAME: cal PASSWORD: warthog.
E N D
the European Macromolecular Structure Database (EMSD). http://www.ebi.ac.uk/msd/roadshow.html MSD Roadshow Co-ordinator . Janet Copeland 2nd November 2005 Oxford University http://www.ebi.ac.uk/msd/education/Tutorial.html
USERNAME: cal PASSWORD: warthog
Introduction to MSD and to Quaternary Structures/Assemblies as Basis of MSD database • SSM Fold recognition • PISA Surface and assembly toolkit • MSDchem Chemistry reference data • MSDlite/MSDpro generalised search systems • MSDsite Active sites • MSDmotif small structural motifs
Visualisation and Patterns • Intergration Projects with Sequence and Domain data • Validation/Deposition • Clustering methods used at MSD • MSDmine – generalised data access to the MSD • PIMS – Protein Information System • Targets – Workflow for Target selection tools • NMR – NMR tools and data at MSD • Data Mining and an example MSDtemplates • DataBases at MSD including data warehouse technologies • DataBase Replication
Bioinformatics Genomes Literature Expression- profiling Metabolic data Proteome data Biochemistry Bioinformatics Comparative genomics Mutant/RNAi data Hypotheses and in silico models
Role of Bioinformatics To Support Experimental Biology To Collect and Archive Data To provide Framework and Integration To give Easy Access to Data To make New Discoveries through Data Analysis
Databanks and Databases The PDB Archive is a “databank” A series of flat files that have a format originally designed for Fortran card readers The MSD provides “databases” Collections of data (1000’s attributes) organized into relational tables and held with a RDMS.
Linking to Domain data, eFamily Sequence Mapping, SIFTS MSDchem ligand data PQS biological assemblies Electron Density Visualisation AstexViewer MSDPro, MSDlite MSDsite Active sites SSM fold matching Surface Matching
Data & information ATOM 2567 N PHE B 175 7.821 -25.530 -22.848 1.00 8.71 ATOM 2568 CA PHE B 175 8.845 -25.172 -21.877 1.00 9.41 ATOM 2569 C PHE B 175 9.449 -23.798 -22.169 1.00 10.02 ATOM 2570 O PHE B 175 10.664 -23.613 -22.103 1.00 10.37 ATOM 2571 CB PHE B 175 9.928 -26.251 -21.848 1.00 9.53 ATOM 2572 CG PHE B 175 10.969 -26.137 -22.982 1.00 10.03 ATOM 2573 CD1 PHE B 175 12.356 -25.819 -22.988 1.00 10.51 ATOM 2574 CD2 PHE B 175 11.725 -27.211 -23.402 1.00 10.25 ATOM 2575 CE1 PHE B 175 11.821 -27.095 -22.869 1.00 11.17 ATOM 2576 CE2 PHE B 175 12.282 -26.086 -24.008 1.00 10.95 ATOM 2577 CZ PHE B 175 10.953 -26.335 -23.622 1.00 11.38
MSD service provider We provide a service to the scientific community 24/7 (almost) : parallel DB with fail-over, etc. Service “ping” baseline check several times/day Data is incremented with new data weekly Systems are extensible
Query capabilities Browsing (click and read) Simple search select records with some constraints More elaborate search select specific fields of some records with constraints on some fields Complex querying ability to return an answer that results from a "live" computation, and was not part of any record of the database
What we cannot do well “Give us sequence, we do rest”
What is the function of this structure? What is the function of this sequence? What is the function of this motif? • the fold provides a scaffold, which can be decorated in different ways by different sequences to confer different functions - knowing the fold & function allows us to rationalise how the structure effects its function at the molecular level
ATPase 1H8E (ADP.ALF4)2(ADP.SO4) BOVINE F1-ATPASE (ALL THREE CATALYTIC SITES OCCUPIED) MENZ, R.I., WALKER, J.E., LESLIE, A.G.W.
Ground rules for bioinformatics Don't always believe what programs tell you they're often misleading & sometimes wrong! Don't always believe what databases tell you they're often misleading & sometimes wrong! Don't always believe what lecturers tell you they're often misleading & sometimes wrong! In short, don't be a naive user when computers are applied to biology, it is vital to understand the difference between mathematical & biological significance computers don’t do biology - they do sums quickly!
General Evaluation Criteria Be sceptical and cynical! • When you are searching for information you need to judge its quality and suitability. • Think critically about each piece of information you find and how you found it. • Relevance: • Does the information you have found adequately support your research? • Does it answer the question, or support one of your arguments? • How general or specific is the information about the topic?
http://harvester.embl.de/ “Harvester” collects information from selected public databases
Appreciate how difficult it is to draw a complex 3-D object and appreciate the complexity of the requirements for storing sequence and structural information of molecules in a database. • There are a lot of interrelated pieces of information about a biomolecule, such as • sequence similarities • genome location • protein structure • Expression • chemistry
Some of the obstacles of searching databases are: • Data formats are not standard. • The nomenclature is not standard. • There is more than one database offering the same information (data redundancy). • Links between databases may not be easy to follow. • The number of databases available makes it confusing to choose from
Accuracy or Validity You need to determine whether the information is reliable or not
Quality Control Issues The quality of archived data is no better than the data determined in the contributing laboratories. Curation of the data can help to identify errors. Disagreement between duplicate determinations is a clear warning of an error in one or the other. Similarly, results that disagree with established principles may contain errors. It is useful, for instance, to flag deviations from expected stereochemistry in protein structures, but such ``outliers'' are not necessarily wrong.
Data quality Data Consistency Data Models Reliability Evidences ? Level of confidence ? Assignation of function by similarity recursive process propagation of errors
Data quality It’s hard to judge whether something “makes sense”. The lack of labeling on many web pages makes it hard to know the source. Calculations based on databases are even harder to deal with Logical deductions may be worse. “tacR gene regulates the human nervous system” “tacQ gene is similar to tacR but is found in E. coli” “so tacQ gene regulates the E. coli nervous system”
Who spotted ? E. coli nervous system
Significance Appreciating that mathematical & biological significance are different is crucial Important in understanding the limitations of database search algorithms multiple sequence alignment algorithms pattern recognition techniques functional site & structure prediction tools Contrary to popular opinion, there is currently still no biologically-reliable automatic multiple alignment algorithm no infallible pattern-recognition technique no reliable gene, function or structure prediction algorithm
As a result, we will have to give up the ``safe'' idea of a stable databank composed of entries that are correct when they are first distributed in mature form and stay fixed thereafter. Databanks are dynamic in information content and growing in size, and maturing in quality. Maintaining local copies – largely “top up” this is not sufficient. Proliferation of various copies in various states with out-of-date linkages New Problems