This presentation provides an overview of the problem of connecting various resources in chemical informatics and the solution of building a distributed computing environment. It explores the use of web services and grid resources, as well as domain-specific tools and standards.
Building a Chemical Informatics Grid Marlon Pierce Community Grids Laboratory Indiana University
Acknowledgments • CICC researchers and developers who contributed to this presentation: • Prof. Geoffrey Fox, Prof. David Wild, Prof. Mookie Baik, Prof. Gary Wiggins, Dr. Jungkee Kim, Dr. Rajarshi Guha, Sima Patel, Smitha Ajay, Xiao Dong • Thanks also to Prof. Peter Murray-Rust and the WWMM group at Cambridge University • More info: www.chembiogrid.org and www.chembiogrid.org/wiki.
Chemical Informatics and the Grid An overview of the basic problem and solution
Chemical Informatics as a Grid Application • Chemical Informatics is the application of information technology to problems in chemistry. • Example problems: managing data in large-scale drug discovery and molecular modeling • Building blocks (chemical informatics resources): • Chemical databases maintained by various groups • NIH PubChem, NIH DTP • Application codes (both commercial and open source) • Data mining, clustering • Quantum chemistry and molecular modeling • Visualization tools • Web resources: journal articles, etc. • A Chemical Informatics Grid will need to integrate these into a common, loosely coupled, distributed computing environment.
Problem: Connecting It Together • The problem is defining an architecture for tying all of these pieces into a distributed computing system. • A “Grid” • How can I combine application codes, web resources, and databases to solve a particular problem that interests me? • Specifically, how do I build a runtime environment that can connect the distributed services I need to solve an interesting problem? • For academic and government researchers, how can I do all of this in an open fashion? • Data and services can come from anywhere • That is, I must avoid proprietary infrastructure.
NIH Roadmap for Medical Research http://nihroadmap.nih.gov/ • The NIH recognizes chemical and biological information management as critical to medical research. • Federally funded high-throughput screening centers. • 100-200 HTS assays per year on small molecules. • Hundreds of thousands of small molecules analyzed • Data published, publicly available through the NIH PubChem online database. • What do you do with all of this data?
High-Throughput Screening Testing perhaps millions of compounds in a corporate collection to see if any show activity against a certain disease protein
High-Throughput Screening • Traditionally, small numbers of compounds were tested for a particular project or therapeutic area • About 10 years ago, technology was developed that enabled large numbers of compounds to be assayed quickly • High-throughput screening can now test 100,000 compounds a day for activity against a protein target • Maybe tens of thousands of these compounds will show some activity against the protein • The chemist needs to intelligently select the 2-3 classes of compounds that show the most promise as drugs for follow-up
Informatics Implications • Need to be able to store chemical structure and biological data for millions of data points • Computational representation of 2D structure • Need to be able to organize thousands of active compounds into meaningful groups • Group similar structures together and relate to activity • Need to extract as much information as possible (data mining) • Apply statistical methods to the structures and related information • Need to use molecular modeling to gain direct chemical insight into reactions.
The Solution, Part I: Web Services • Web Services provide the means for wrapping databases, applications, web scavengers, etc., with programming interfaces. • WSDL definitions specify how to write clients that talk to databases, applications, etc. • Web Service messaging through SOAP • Discovery services such as UDDI, MDS, and so on. • Many toolkits available • Axis, .NET, gSOAP, SOAP::Lite, etc. • Web Services can be combined with each other into workflows • Workflow == use case scenario • More about this later.
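A minimal sketch of what consuming such a service looks like from Python, using the zeep SOAP toolkit; the WSDL URL and the toInChI operation are hypothetical placeholders, not an actual CICC service:

```python
# Minimal sketch of a SOAP Web Service client using the zeep library.
# The WSDL URL and the operation name (toInChI) are hypothetical placeholders.
from zeep import Client

# The client reads the WSDL and builds a typed proxy for each operation.
client = Client("http://example.org/cheminformatics/ConvertService?wsdl")

# Invoke a (hypothetical) structure-conversion operation over SOAP.
inchi = client.service.toInChI(smiles="CCO")
print(inchi)
```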
Basic Architectures: Servlets/CGI and Web Services • [Architecture diagram: in the servlet/CGI model, a browser talks to a web server via HTTP GET/POST, and the server reaches a database through JDBC; in the Web Services model, browsers and GUI clients invoke WSDL-described services over SOAP, and those services talk to databases via JDBC or to other WSDL/SOAP services.]
Solution Part II: Grid Resources • Many Grid tools provide powerful backend services • Globus: uniform, secure access to computing resources (like TeraGrid) • File management, resource allocation management, etc. • Condor: job scheduling on computer clusters and collections • SRB: data grid access • OGSA-DAI: uniform Grid interface to databases. • These have Web Service as well as other interfaces (or equivalently, protocols).
Solution, Part III: Domain-Specific Tools and Standards → More Services • For Chemical Informatics, we have a number of tools and standards. • Chemical string representations • SMILES, InChI • Chemical Markup Language (CML) • XML language for describing and exchanging chemical data. • JUMBO 5: a CML parser and library • Glue tools and applications • Chemistry Development Kit (CDK) • OpenBabel • These are the basis for building interoperable Chemical Informatics Web Services • Analogous situations exist for other domains • Astronomy, Geosciences, Biology/Bioinformatics
Solution Part IV: Workflows • Workflow engines allow you to connect services together into interesting composite applications. • This allows you to directly encode your scientific use case scenario as a graph of interacting services. • There are many workflow tools • We’ll briefly cover these later. • General guidance is to build web services first and then use workflow tools on top of these services. • Don’t get married to a particular workflow technology yet, unless someone pays you.
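As an illustration of the "services first, workflow on top" advice, a use case scenario can be encoded as a plain script that chains service calls; both endpoints and operation names below are hypothetical placeholders:

```python
# A workflow is just a graph of service calls; the simplest case is a pipeline.
# Both service endpoints and operation names are hypothetical placeholders.
from zeep import Client

def similarity_workflow(smiles: str, threshold: float = 0.8):
    """Canonicalize a structure, then search a database for similar compounds."""
    convert = Client("http://example.org/ConvertService?wsdl")
    search = Client("http://example.org/SimilarityService?wsdl")

    # Step 1: canonicalize the query structure.
    canonical = convert.service.canonicalize(smiles=smiles)
    # Step 2: feed the result of step 1 into the next service.
    return search.service.findSimilar(query=canonical, cutoff=threshold)
```

A dedicated workflow engine adds value over such a script mainly through graphical composition, provenance capture, and fault handling, which is why the services themselves should stay engine-agnostic.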
Solution Part V: User Interfaces • Web Services allow you to cleanly separate user interfaces from backend services. • Model-view-controller pattern for web applications • Client environments include • Grid and web service scripting environments • Desktop tools like Taverna and Kepler • Portlet-based Web portal systems • Typically, desktop tools like Taverna are used by power users to define interesting workflows. • Portals are for running canned workflows.
Next steps • Next we will review the online database resources that are available to us. • Databases come in two varieties: • Journal databases • Data databases • As we will discuss, it is useful to build services and workflows for automatically interacting with both types.
MEDLINE: Online Journal Database • MEDLINE (Medical Literature Analysis and Retrieval System Online) is an international literature database of life sciences and biomedical information. • It covers the fields of medicine, nursing, dentistry, veterinary medicine, and health care. • MEDLINE covers much of the literature in biology and biochemistry, and fields with no direct medical connection, such as molecular evolution. • It is accessed via PubMed. http://en.wikipedia.org/wiki/Medline
PubMed: Journal Search Engine • PubMed is a free search engine offered by the United States National Library of Medicine as part of the Entrez information retrieval system. • The PubMed service allows searching the MEDLINE database. • MEDLINE covers over 4,800 journals published in the United States and more than 70 other countries, primarily from 1966 to the present. • In addition to MEDLINE, PubMed also offers access to: • OLDMEDLINE for pre-1966 citations. • Citations to articles that are out of scope (e.g., general science and chemistry) from certain MEDLINE journals • In-process citations, which provide a record for an article before it is indexed with MeSH and added to MEDLINE • Citations that precede the date that a journal was selected for MEDLINE indexing • Some life science journals http://www.ncbi.nlm.nih.gov/entrez/query/static/overview.html
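For programmatic access, MEDLINE/PubMed can be queried through the NCBI Entrez E-utilities; a minimal sketch (the search term is only an example, and NCBI's usage policies apply):

```python
# Minimal sketch: querying PubMed programmatically via the NCBI E-utilities
# (esearch). The search term is only an example; see NCBI's E-utilities
# documentation for the full parameter set and usage policies.
import json
import urllib.parse
import urllib.request

term = "high-throughput screening AND cheminformatics"
url = ("https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi?"
       + urllib.parse.urlencode({"db": "pubmed", "term": term,
                                 "retmax": 5, "retmode": "json"}))

with urllib.request.urlopen(url) as resp:
    result = json.load(resp)

# The response lists matching PubMed IDs (PMIDs).
print(result["esearchresult"]["idlist"])
```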
PubChem: Chemical Database • PubChem is a database of chemical molecules. • The system is maintained by the National Center for Biotechnology Information (NCBI), which belongs to the United States National Institutes of Health (NIH). • PubChem can be accessed for free through a web user interface • And through Web Services for programmatic access • PubChem contains mostly small molecules with a molecular mass below 500. • Anyone can contribute • The database is free to use, but it is not curated, so the quality of the information for any specific compound can be questionable. • NIH-funded HTS results are (intended to be) available through PubChem. http://pubchem.ncbi.nlm.nih.gov/
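A minimal sketch of programmatic access; the example below uses PubChem's PUG REST interface, which postdates the SOAP services of this era, and caffeine is just an example query:

```python
# Minimal sketch of programmatic PubChem access through the PUG REST interface.
# Caffeine is used purely as an example compound.
import json
import urllib.request

url = ("https://pubchem.ncbi.nlm.nih.gov/rest/pug/compound/name/caffeine/"
       "property/MolecularFormula,MolecularWeight/JSON")

with urllib.request.urlopen(url) as resp:
    data = json.load(resp)

for prop in data["PropertyTable"]["Properties"]:
    print(prop["CID"], prop["MolecularFormula"], prop["MolecularWeight"])
```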
NIH DTP Database • Part of NIH’s Developmental Therapeutics Program. • Screens up to 3,000 compounds per year for potential anticancer activity. • Utilizes 59 different human tumor cell lines, representing leukemia, melanoma and cancers of the lung, colon, brain, ovary, breast, prostate, and kidney. • DTP screening results are part of PubChem and also available as a separate database. http://dtp.nci.nih.gov/
Example screening results. Positive results (red bar to the right of the vertical line) indicate greater-than-average toxicity of the tested agent to that cell line. http://dtp.nci.nih.gov/docs/compare/compare.html
DTP and COMPARE • COMPARE is an algorithm for mining DTP result data to find and rank-order compounds with similar DTP screening results. • Why COMPARE? • Discovered compounds may be less toxic to humans but just as effective against cancer cell lines. • May be much easier/safer to manufacture. • May be a guide to deeper understanding of experiments http://dtp.nci.nih.gov/docs/compare/compare_methodology.html
Many Other Online Databases • Complementary protein information • Indiana University: Varuna project • Discussed in this presentation • University of Michigan: Binding MOAD • “Mother of All Databases” • Largest curated database of protein-ligand complexes • Subset of protein databank • Prof. Heather Carlson • University of Michigan: PDBBind • Provides a collection of experimentally measured binding affinity data (Kd, Ki, and IC50) exclusively for the protein-ligand complexes available in the Protein Data Bank (PDB) • Dr. Shaomeng Wang
The Point Is… • All of these databases can be accessed online with human-usable interfaces. • But that’s not so important for our purposes • More importantly, many of them are beginning to define Web Service interfaces that let other programs interact with them. • Plenty of tools and libraries can simulate browsers, so you can also build your own service. • This allows us to remotely analyze databases with clustering and other applications without modifying the databases themselves. • Can be combined with text mining tools and web robots to find out who else is working in the area.
Chemical Machine Languages • Interestingly, chemistry has defined three simple languages for encoding chemical information. • InChI, SMILES, CML • Can generate these by hand or automatically • InChIs and SMILES can represent molecules as a single string/character array. • Useful as keys for databases and for search queries in Google. • You can convert between SMILES and InChIs • OpenBabel, OELib, JOELib • CML is an XML format, and more verbose, but benefits from XML community tools
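A minimal sketch of interconverting the line notations using Open Babel's Python bindings (assumes an Open Babel 3.x build with InChI support compiled in):

```python
# Minimal sketch of SMILES <-> InChI conversion with Open Babel's Python
# bindings (assumes Open Babel 3.x with InChI support).
from openbabel import pybel

mol = pybel.readstring("smi", "CCO")   # ethanol, read from a SMILES string
print(mol.write("inchi").strip())      # the standard InChI for ethanol
print(mol.write("can").strip())        # canonical SMILES
```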
SMILES: Simplified Molecular Input Line Entry Specification • Language for describing the structure of chemical molecules using ASCII strings. http://www.daylight.com/dayhtml/doc/theory/theory.smiles.html
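A few standard illustrative examples of the notation (lowercase letters denote aromatic atoms, parentheses denote branches):

```python
# A few textbook SMILES examples, purely for illustration.
examples = {
    "methane": "C",
    "ethanol": "CCO",
    "benzene": "c1ccccc1",       # aromatic ring written with lowercase atoms
    "acetic acid": "CC(=O)O",    # branch in parentheses, double bond as '='
}
for name, smiles in examples.items():
    print(f"{name}: {smiles}")
```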
InChI: International Chemical Identifier • IUPAC and NIST standard similar to SMILES • Encodes structural information about compounds • Based on an open standard and open algorithms. http://wwmm.ch.cam.ac.uk/inchifaq/
InChI in Public Chemistry Databases • US National Institute of Standards and Technology (NIST) - 150,000 structures • NIH/NCBI/PubChem project - >3.2 million structures • Thomson ISI - 2+ million structures • US National Cancer Institute(NCI) Database - 23+ million structures • US Environmental Protection Agency(EPA)-DSSToX Database - 1450 structures • Kyoto Encyclopaedia of Genes and Genomes (KEGG) database - 9584 structures • University of California at San Francisco ZINC - >3.3 million structures • BRENDA enzyme information system (University of Cologne) - 36,000 structures • Chemical Entities of Biological Interest (ChEBI) database of the European Bioinformatics Institute - 5000 structures • University of California Carcinogenic Potency Project - 1447 structures • Compendium of Pesticide Common Names - 1437 (2005-03-03) structures
Journals and Software Using InChI • Journals • Nature Chemical Biology. • Beilstein Journal of Organic Chemistry • Software • ACD/Labs ACD/ChemSketch. • ChemAxon Marvin. • SciTegic Pipeline Pilot. • CACTVS Chemoinformatics Toolkit by Xemistry, GmbH. http://wwmm.ch.cam.ac.uk/inchifaq/
Chemical Markup Language • CML is an XML markup language for encoding chemical information. • Developed by Peter Murray-Rust, Henry Rzepa, and others. • Actually dates from the SGML days, before XML • More verbose than InChI and SMILES • But inherits XML schemas, namespaces, parsers, XPath, language binding tools like XMLBeans, etc. • Not limited to structural information • Has OpenBabel support. http://cml.sourceforge.net/, http://cml.sourceforge.net/wiki/index.php/Main_Page
InChI Compared to SMILES • SMILES is proprietary and different algorithms can give different results. • Seven different unique SMILES for caffeine on Web sites: • [c]1([n+]([CH3])[c]([c]2([c]([n+]1[CH3])[n][cH][n+]2[CH3]))[O-])[O-] • CN1C(=O)N(C)C(=O)C(N(C)C=N2)=C12 • Cn1cnc2n(C)c(=O)n(C)c(=O)c12 • Cn1cnc2c1c(=O)n(C)c(=O)n2C • N1(C)C(=O)N(C)C2=C(C1=O)N(C)C=N2 • O=C1C2=C(N=CN2C)N(C(=O)N1C)C • CN1C=NC2=C1C(=O)N(C)C(=O)N2C On the other hand, some claim SMILES are more intuitive for human readers. http://wwmm.ch.cam.ac.uk/inchifaq/
A CML Example http://www.medicalcomputing.net/xml_biosciences.html
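Since the worked example itself lives behind the link above, here is a schematic, hand-written CML fragment (element and attribute names follow the CML schema; hydrogens and most attributes are omitted for brevity), parsed with Python's standard ElementTree:

```python
# Schematic CML fragment for a single molecule (heavy atoms of methanol),
# parsed with Python's standard ElementTree. Hydrogen counts and most other
# attributes are omitted for brevity.
import xml.etree.ElementTree as ET

cml = """
<molecule id="methanol" xmlns="http://www.xml-cml.org/schema">
  <atomArray>
    <atom id="a1" elementType="C"/>
    <atom id="a2" elementType="O"/>
  </atomArray>
  <bondArray>
    <bond atomRefs2="a1 a2" order="S"/>
  </bondArray>
</molecule>
"""

ns = {"cml": "http://www.xml-cml.org/schema"}
root = ET.fromstring(cml)
for atom in root.findall(".//cml:atom", ns):
    print(atom.get("id"), atom.get("elementType"))
```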
Clustering Techniques, Computing Requirements, and Clustering Services Computational techniques for organizing data
The Story So Far • We’ve discussed managing screening assay output as the key problem we face • Must sift through mountains of data in PubChem and DTP to find interesting compounds. • NIH-funded high-throughput screening will make this very important in the near future. • We now need a way to organize and analyze the data.
Clustering and Data Analysis • Clustering is a technique that can be applied to large data sets to find similarities • Popular technique in chemical informatics • Data sets are segmented into groups (clusters) in which members of the same cluster are similar to each other. • Clustering is distinct from classification: • There are no pre-determined characteristics used to define the membership of a cluster, • although items in the same cluster are likely to have many characteristics in common. • Clustering can be applied to chemical structures, for example, in the screening of combinatorial or Markush compound libraries in the quest for new active pharmaceuticals. • We also note that these techniques are fairly primitive • More interesting clustering techniques exist but apparently are not well known in the chemical informatics community.
Non-Hierarchical Clustering • Clusters form around centroids, the number of which can be specified by the user. • All clusters rank equally and there is no particular relationship between them. http://www.digitalchemistry.co.uk/prod_clustering.html
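A minimal sketch of non-hierarchical clustering using scikit-learn's KMeans on toy binary fingerprint vectors (the data is random and purely illustrative):

```python
# Minimal sketch of non-hierarchical (k-means) clustering on toy binary
# fingerprints, using scikit-learn. The data is random and only illustrative.
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
fingerprints = rng.integers(0, 2, size=(100, 64))   # 100 compounds, 64-bit keys

# The user chooses the number of centroids; clusters form around them.
km = KMeans(n_clusters=5, n_init=10, random_state=0).fit(fingerprints)
print(np.bincount(km.labels_))   # size of each cluster
```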
Hierarchical Clustering • Clusters are arranged in hierarchies • Smaller clusters are contained within larger ones; the bottom of the hierarchy consists of individual objects in "singleton" clusters, while the top of it consists of one cluster containing all the objects in the dataset. • Such hierarchies can be built either from the bottom up (agglomerative) or the top downwards (divisive) http://www.digitalchemistry.co.uk/prod_clustering.html
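A minimal sketch of agglomerative (bottom-up) hierarchical clustering with SciPy, using Ward's linkage on toy descriptor vectors:

```python
# Minimal sketch of agglomerative hierarchical clustering with SciPy; the toy
# data stands in for compound descriptor vectors.
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

rng = np.random.default_rng(0)
descriptors = rng.random((50, 8))                 # 50 compounds, 8 descriptors

tree = linkage(descriptors, method="ward")        # full hierarchy, singletons up
labels = fcluster(tree, t=5, criterion="maxclust")  # cut the tree into 5 clusters
print(labels)
```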
Fingerprinting and Dictionaries--What Is Your Parameter Space? • Clustering algorithms require a parameter space • Clusters defined along coordinate axes. • Coordinate axes defined by a dictionary of chemical structures. • Use binary on/off for fingerprinting a particular compound against a dictionary. http://www.digitalchemistry.co.uk/prod_fingerprint.html
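A deliberately oversimplified sketch of the idea: each bit of the fingerprint records whether a dictionary fragment is present in the compound. Real fingerprints use graph-based substructure matching; plain substring tests on SMILES are used here only to show the on/off bit-vector concept:

```python
# Highly simplified dictionary-based fingerprinting: each bit records whether a
# dictionary fragment occurs in the compound. Real fingerprints use graph
# substructure matching; substring tests on SMILES are illustrative only.
dictionary = ["C(=O)O", "c1ccccc1", "N", "Cl"]   # toy fragment "dictionary"

def fingerprint(smiles: str) -> list[int]:
    return [1 if frag in smiles else 0 for frag in dictionary]

print(fingerprint("CC(=O)Oc1ccccc1C(=O)O"))   # aspirin -> [1, 1, 0, 0]
```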
Cluster Analysis and Chemical Informatics • Used for organizing datasets into chemical series, to build predictive models, or to select representative compounds • Clustering Methods • Jarvis-Patrick and variants • O(N²), single partition • Ward’s method • Hierarchical, regarded as best, but at least O(N²) • K-means • < O(N²), requires a preset number of clusters, a little “messy” • Sphere-exclusion (Butina) • Fast, simple, similar to Jarvis-Patrick • Kohonen network • Clusters arranged in a 2D grid, ideal for visualization
Limitations of Ward’s method for large datasets (>1M) • Best algorithms have an O(N²) time requirement (RNN) • Requires random access to fingerprints • hence substantial memory requirements (O(N)) • Problem of selecting the best partition • can select the desired number of clusters • Easily hit the 4 GB memory addressing limit on 32-bit machines • Approximately 2M compounds
Scaling up clustering methods • Parallelization • Clustering algorithms can be adapted for multiple processors • Some algorithms are more appropriate than others for particular architectures • Ward’s has been parallelized for shared-memory machines, but the overhead is considerable • New methods and algorithms • Divisive (“bisecting”) K-means method • Hierarchical divisive • Approximately O(N log N)
Divisive K-means Clustering • New hierarchical divisive method • Hierarchy built from the top down, instead of the bottom up • Divide the complete dataset into two clusters • Continue dividing until all items are singletons • Each binary division is done using the K-means method • Originally proposed for document clustering • “Bisecting K-means” • Steinbach, Karypis and Kumar (Univ. Minnesota): http://www-users.cs.umn.edu/~karypis/publications/Papers/PDF/doccluster.pdf • Found to be more effective than agglomerative methods • Forms more uniformly sized clusters at a given level
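A sketch of the bisecting strategy built on an ordinary k-means step (scikit-learn is used for the 2-means divisions; choosing the largest cluster for the next division is only one of the selection options listed on the next slide):

```python
# Sketch of divisive ("bisecting") k-means: repeatedly split the largest
# current cluster in two with an ordinary k-means step until k clusters remain.
import numpy as np
from sklearn.cluster import KMeans

def bisecting_kmeans(X: np.ndarray, k: int) -> np.ndarray:
    labels = np.zeros(len(X), dtype=int)
    for next_label in range(1, k):
        # Choose the largest current cluster for the next binary division.
        sizes = np.bincount(labels)
        target = int(np.argmax(sizes))
        idx = np.flatnonzero(labels == target)
        # Binary division of the chosen cluster via 2-means.
        halves = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X[idx])
        labels[idx[halves == 1]] = next_label
    return labels

X = np.random.default_rng(0).random((500, 16))   # toy descriptor matrix
print(np.bincount(bisecting_kmeans(X, 8)))       # sizes of the 8 clusters
```

Recent scikit-learn releases also ship a built-in BisectingKMeans estimator that implements the same idea.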
BCI Divkmeans • Several options for detailed operation • Selection of next cluster for division • size, variance, diameter • affects selection of partitions from hierarchy, not shape of hierarchy • Options within each K-means division step • distance measure • choice of seeds • batch-mode or continuous update of centroids • termination criterion • Have developed parallel version for Linux clusters / grids in conjunction with BCI • For more information, see Barnard and Engels talks at: http://cisrg.shef.ac.uk/shef2004/conference.htm
Comparative execution times • [Chart: NCI subsets clustered on a 2.2 GHz Intel Celeron processor; reported timings of 7h 27m, 3h 06m, 2h 25m, and 44m across the compared runs.]
Divisive K-means: Conclusions • Much faster than Ward’s, speed comparable to K-means, suitable for very large datasets (millions) • Time requirements approximately O(N log N) • Current implementation can cluster 1M compounds in under a week on a low-power desktop PC • Clusters 1M compounds in a few hours on a 4-node parallel Linux cluster • Better balance of cluster sizes than Ward’s or K-means • Visual inspection of clusters suggests better assembly of compound series than other methods • Better clustering of actives together than previously studied methods • Memory requirements minimal • Experiments using the AVIDD cluster and TeraGrid forthcoming (50+ nodes)
Conclusions • Effective exploitation of large volumes and diverse sources of chemical information is a critical problem to solve, with a potentially huge impact on the drug discovery process • Most information needs of chemists and drug discovery scientists are conceptually straightforward but complex to implement • All of the technology is now in place to implement many of these information-need “use cases”: the four-level model using service-oriented architectures together with smart clients looks like a neat way of doing this • In conjunction with grid computing, rapid and effective organization and visualization of large chemical datasets is feasible in a web service environment • Some pieces are missing: • Chemical structure search of journals (wait for InChI) • Automated patent searching • Effective dataset organization • Effective interfaces, especially visualization of large numbers of 2D structures
Divisive K-Means as a Web Service • The previous exercise was intended to show that Divisive K-Means is a classic example of a Grid application. • Needs to be parallelized • Should run on TeraGrid • How do you make this into a service? • We’ll go on a small tour before getting back to our problem.
Wrapping Science Applications as Services • Science Grid services typically must wrap legacy applications written in C or Fortran. • You must handle such problems as: • Specifying several input and output files • These may need to be staged in • Launching executables and monitoring their progress • Specifying environment variables • Often there are also shell scripts that perform miscellaneous tasks • How do you convert this to WSDL? • Or (equivalently) how do you automatically generate the XML job description for WS-GRAM?
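A rough sketch of what such a wrapper has to do before any WSDL is written: stage inputs, set environment variables, launch the executable, and collect outputs. The program name (legacy_qm), its flags, and the file names below are hypothetical:

```python
# Sketch of the wrapping problem: stage input files, set environment variables,
# launch a legacy executable, and collect its output. The program name
# (legacy_qm), its flags, and the file names are hypothetical.
import os
import shutil
import subprocess
import tempfile

def run_legacy_code(input_file: str) -> str:
    workdir = tempfile.mkdtemp(prefix="job_")                     # per-job scratch
    shutil.copy(input_file, os.path.join(workdir, "input.dat"))   # stage in

    env = dict(os.environ, SCRATCH=workdir)                       # required env vars
    subprocess.run(["legacy_qm", "input.dat", "-o", "output.dat"],
                   cwd=workdir, env=env, check=True)              # launch & monitor

    with open(os.path.join(workdir, "output.dat")) as f:
        return f.read()                                           # stage out result

# A Web Service wrapper (e.g., an Axis- or WS-GRAM-style front end) would expose
# a method like run_legacy_code in WSDL and handle file transfer for remote clients.
```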