E-Chemistry and Web 2.0

E-Chemistry and Web 2.0 Marlon Pierce mpierce@cs.indiana.edu Community Grids Lab Indiana University

One Talk, Two Projects • NIH funded Chemical Informatics and Cyberinfrastructure Collaboratory (CICC) @ IU: Geoffrey Fox, Gary Wiggins, Rajarshi Guha, David Wild, Mookie Baik, Kevin Gilbert, and others • Proposed Microsoft-Funded Project, E-Chemistry: Carl Lagoze (Cornell), Lee Giles (PSU), Steve Bryant (NIH), Jeremy Frey (Soton), Peter Murray-Rust (Cambridge), Herbert Van de Sompel (Los Alamos), Geoffrey Fox (Indiana), and others

CICC Infrastructure Vision • Chemical Informatics: drug discovery and other academic chemistry, pharmacology, and bioinformatics research will be aided by powerful, modern, open information technology. • NIH PubChem and PubMed provide unprecedented open, free data and information. • We need a corresponding open service architecture (i.e. avoid stove-piped applications) • CICC is set up as distributed cyberinfrastructure in the eScience model • Web clients (user interfaces) to distributed databases, results of high throughput screening instruments, results of computational chemical simulations and other analyses. • Composed of clients to open service APIs (mash-ups) • Aggregated into portals • Web services manipulate this data and are combined into workflows. • So our main agenda items: create interesting databases and build lots of Web services and clients.

CICC Databases • Most of our databases aim to add value to PubChem or link into PubChem • 1D (SMILES) and 2D structures • 3D structures (MMFF94) • Searchable by CID, SMARTS, 3D similarity • Docked ligands (FRED, Autodock) • 906K drug-like compounds into 7 ligands • Will eventually cover ~2000 targets • Philosophy: we have big computers, so let’s calculate everything ahead of time and put the results in a DB.

Building Up the Infrastructure • Our SOA philosophy: use standard Web services. • Mostly stateless • Some cluster, HPC work needed but these populate databases • Services are aggregate-able into different workflows. • Taverna, Pipeline Pilot, … • You can also build lots of Web clients. • See http://www.chembiogrid.org/wiki/index.php/CICC_Web_Resources for links and details. • Not so far from Web 2.0….

Sample Services

Web Client Interfaces

More Clients…

Example: PubDock • Database of approximately 1 million PubChem structures (the most drug-like) docked into proteins taken from the PDB • Available as a web service, so structures can be accessed in your own programs, or using workflow tools like Pipeline Pilot • Several interfaces developed, including one based on Chimera (right) which integrates the database with the PDB to allow browsing of compounds in different targets, or different compounds in the same target • Can be used as a tool to help understand the molecular basis of activity in cellular or image based assays

Example: R Statistics applied to PubChem data • By exposing the R statistical package, and the Chemistry Development Kit (CDK) toolkit as web services and integrating them with PubChem, we can quickly and easily perform statistical analysis and virtual screening of PubChem assay data • Predictive models for particular screens are exposed as web services, and can be used either as simple web tools or integrated into other applications • Example uses DTP Tumor Cell Line screens - a predictive model using Random Forests in R makes predictions of probability of activity across multiple cell lines.

Example assay screening workflow: finding cell-protein relationships • A protein implicated in tumor growth with a known ligand is selected (in this case HSP90 taken from the PDB 1Y4 complex). • The screening data from a cellular HTS assay is similarity searched for compounds with 2D structures similar to the ligand. • Structures similar to the ligand can be browsed using client portlets. • Similar structures are filtered for druggability, converted to 3D, and automatically passed to the OpenEye FRED docking program for docking into the target protein. • Once docking is complete, the user visualizes the high-scoring docked structures in a portlet using the Jmol applet. • Docking results and activity patterns are fed into R services for building activity models and correlations (Least Squares Regression, Random Forests, Neural Nets).

Relevance to Web 2.0 • Some Web 2.0 Key Features • REST Services • Use of RSS/Atom feeds • Client interfaces are “mashups” • Gadgets, widgets for portals aggregate clients • So… • We provide RSS as an alternative WS format. • We have experimented with RSS feeds, using Yahoo Pipes to manipulate multiple feeds. • CICC Web interfaces can be easily wrapped as universal gadgets in iGoogle, Netvibes. • Alternative to classic science gateways.

RSS Feeds/REST Services • Provide access to DB's via RSS feeds • Feeds include 2D/3D structures in CML • Viewable in Bioclipse, Jmol as well as Sage etc. • Two feeds currently available • SynSearch – get structures based on full or partial chemical names • DockSearch – get best N structures for a target • Really hampered by size of DB and Postgres performance.
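A minimal sketch of how a client might consume such a feed, using only the Python standard library. The feed content, the example.org links, and the `parse_feed` helper are assumptions for illustration; the actual SynSearch and DockSearch feed layouts are not shown in this talk and may differ.

```python
import xml.etree.ElementTree as ET

# Hypothetical RSS 2.0 feed; the real feeds would carry CML structures as well.
SAMPLE_FEED = """<rss version="2.0">
  <channel>
    <title>Example structure feed</title>
    <item>
      <title>Compound A</title>
      <link>http://example.org/compound/a</link>
      <description>Placeholder entry</description>
    </item>
    <item>
      <title>Compound B</title>
      <link>http://example.org/compound/b</link>
      <description>Placeholder entry</description>
    </item>
  </channel>
</rss>"""


def parse_feed(xml_text):
    """Return a list of (title, link) pairs, one per RSS item, in feed order."""
    root = ET.fromstring(xml_text)
    items = []
    for item in root.iter("item"):
        title = item.findtext("title", default="")
        link = item.findtext("link", default="")
        items.append((title, link))
    return items


entries = parse_feed(SAMPLE_FEED)
for title, link in entries:
    print(title, link)
```

Because the feed is plain XML over HTTP, the same few lines would work from any language with an XML parser, which is the appeal of offering RSS next to SOAP.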

Tools and mashups based on web service infrastructure http://www.chembiogrid.org/projects/proj_tools.html

Mining information from journal articles • Until now SciFinder / CAS has been the only chemistry-aware portal into journal information • We can access the full text of journal articles online (with subscription) • ACS does not make full text available … but there are ways round that! • RSC is now marking up with SMILES and GO/Goldbook terms! • www.projectprospect.org • Having SMILES or InChI means that we can build a similarity/structure searchable database of papers: e.g. “find me all the papers published since 2000 which contain a structure with >90% similarity to this one” • In the absence of full text, we can at least use the abstract
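The ">90% similarity" query above can be sketched as follows. This assumes that structures have already been turned into fingerprints (here, sets of "on" bit positions) and that similarity means the Tanimoto coefficient; the talk does not say which measure is used. The paper index and the function names are hypothetical.

```python
def tanimoto(fp_a, fp_b):
    """Tanimoto coefficient of two fingerprints given as sets of 'on' bits."""
    if not fp_a and not fp_b:
        return 0.0
    shared = len(fp_a & fp_b)
    return shared / (len(fp_a) + len(fp_b) - shared)


def similar_papers(query_fp, paper_index, threshold=0.9, since_year=2000):
    """Return IDs of papers from since_year on with a structure above threshold."""
    hits = []
    for paper_id, year, fp in paper_index:
        if year >= since_year and tanimoto(query_fp, fp) > threshold:
            hits.append(paper_id)
    return hits


# Made-up index of (paper ID, publication year, fingerprint) records.
query = set(range(20))
index = [
    ("p1", 2005, set(range(19))),  # 19 shared bits of 20: similarity 0.95
    ("p2", 1998, set(range(20))),  # identical structure, but published too early
    ("p3", 2003, set(range(10))),  # 10 shared bits of 20: similarity 0.5
]
hits = similar_papers(query, index)
print(hits)
```

A production service would run the comparison inside the database rather than looping in the client, but the filter logic is the same.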

Text Mining: OSCAR • A tool for shallow, chemistry-specific natural language parsing of chemical documents (e.g. journal articles). • It identifies (or attempts to identify): • Chemical names: singular nouns, plurals, verbs etc., also formulae and acronyms. • Chemical data: Spectra, melting/boiling point, yield etc. in experimental sections. • Other entities: Things like N(5)-C(3) and so on. • Part of the larger SciBorg effort • See http://www.cl.cam.ac.uk/~aac10/escience/sciborg.html • http://wwmm.ch.cam.ac.uk/wikis/wwmm/index.php/Oscar3

Mash-Up: What published compounds might bind to this protein? • Create a database containing the text of all recent PubMed abstracts (2006-2007 = ~500,000) • Use OSCAR to extract all of the chemical names referred to in the abstracts and convert them to SMILES • DATABASE SERVICE + DOCKING SERVICE • Convert molecules to 3D and dock into a protein of interest • Visualize top docked molecules in a Google-like interface

E-Chemistry and Digital Libraries We can’t wait to get started….

E-Chemistry and Digital Libraries • Key problem with our SOA-based e-Science is information management. • Where is the service that I need? • What does it do? • We may consider our data-centric services to be digital libraries. • Data is diverse • Documents • Not just computational information like structures. • Another point of view: how can I link together publications, results, workflows, etc? • That is, I need to manage digital documents.

Digital Libraries • Open Archives Initiative Object Reuse and Exchange Project (OAI-ORE) • Developing standardized, interoperable, and machine-readable mechanisms to express information about compound information objects on the web. • Graph-based representations of connected digital objects. • Objects may be encoded in (for example) RDF or XML • Retrievable via repositories with REST service interfaces (cf. Atom Publishing Protocol) • Obtain, harvest, and register
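To make "graph-based representations of connected digital objects" concrete, here is a small sketch of a compound object as a list of triples, with a traversal that finds everything linked from one document. All URIs and predicate names are made up for illustration and are not taken from the OAI-ORE specification.

```python
# (subject, predicate, object) triples; every name here is hypothetical.
TRIPLES = [
    ("http://example.org/object/1", "aggregates", "http://example.org/paper.pdf"),
    ("http://example.org/object/1", "aggregates", "http://example.org/workflow.xml"),
    ("http://example.org/object/1", "aggregates", "http://example.org/results.cml"),
    ("http://example.org/paper.pdf", "cites", "http://example.org/results.cml"),
    ("http://example.org/results.cml", "producedBy", "http://example.org/workflow.xml"),
]


def reachable(start, triples):
    """Return resources reachable from start by following edges, in discovery order."""
    seen = []
    stack = [start]
    while stack:
        node = stack.pop()
        for subject, _predicate, obj in triples:
            if subject == node and obj not in seen:
                seen.append(obj)
                stack.append(obj)
    return seen


linked = reachable("http://example.org/paper.pdf", TRIPLES)
print(linked)
```

Starting from the paper, the traversal reaches the results file and then the workflow that produced it, which is the kind of link between publications, results, and workflows discussed on the previous slide.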

Challenges for E-Chemistry • Can digital library principles be applied to data as well as documents? • Can you link your workflow to your conference paper? • Can we engineer a publishing framework and message formats around Web 2.0 principles? • REST, Atom Publishing Protocol, Atom Syndication Format, JSON, Microformats • Can we do this securely? • Access control, provenance, identity federation are key problems.

More Information • Project Web Site: www.chembiogrid.org • Project Wiki: www.chembiogrid.org/wiki • Contact me: mpierce@cs.indiana.edu

Chemical Informatics and Cyberinfrastructure Collaboratory • Funded by the National Institutes of Health • www.chembiogrid.org • CICC Combines Grid Computing with Chemical Informatics • Large Scale Computing Challenges • Science and Cyberinfrastructure: CICC is an NIH funded project to support the chemical informatics needs of High Throughput Cancer Screening Centers. The NIH is creating a data deluge of publicly available data on potential new drugs. Chemical informatics is a non-traditional area of high performance computing, but many new, challenging problems may be investigated. • Chemical informatics text analysis programs can process hundreds of thousands of abstracts of online journal articles to extract chemical signatures of potential drugs. • OSCAR-mined molecular signatures can be clustered, filtered for toxicity, and docked onto larger proteins. These are classic “pleasingly parallel” tasks. Top-ranking docked molecules can be further examined for drug potential. • Big Red (and the TeraGrid) will also enable us to perform time consuming, multi-stepped Quantum Chemistry calculations on all of PubMed. Results go back to public databases that are freely accessible by the scientific community. • CICC supports the NIH mission by combining state of the art chemical informatics techniques with • World class high performance computing • National-scale computing resources (TeraGrid) • Internet-standard web services • International activities for service orchestration • Open distributed computing infrastructure for scientists world wide • Diagram labels: NIH PubMed DataBase, OSCAR Text Analysis, Cluster Grouping, Toxicity Filtering, Docking, Initial 3D Structure Calculation, Molecular Mechanics Calculations, Quantum Mechanics Calculations, NIH PubChem DataBase, IU’s Varuna DataBase, POVRay Parallel Rendering • Indiana University Department of Chemistry, School of Informatics, and Pervasive Technology Laboratories

MLSCN Post-HTS Biology Decision Support • Percent Inhibition or IC50 data is retrieved from HTS • Question: Was this screen successful? Workflows encoding plate & control well statistics, distribution analysis, etc. • Question: What should the active/inactive cutoffs be? Workflows encoding distribution analysis of screening results • Question: What can we learn about the target protein or cell line from this screen? Workflows encoding statistical comparison of results to similar screens, docking of compounds into proteins to correlate binding with activity, literature search of active compounds, etc. • Compounds submitted to PubChem • Grids can link data analysis (e.g. image processing developed in existing Grids), traditional cheminformatics tools, as well as annotation tools (Semantic Web, del.icio.us) and enhance lead ID and SAR analysis • A Grid of Grids linking collections of services at PubChem, ECCR centers, and MLSCN centers • Diagram labels: PROCESS, CHEMINFORMATICS, GRIDS

R Web Services

Why? • Need access to math and stat functionality • Did not want to recode algorithms • Wanted latest methods • Needed a distributed approach to computation • Keep computation on a powerful machine • Access it from a smaller machine

Why R? • Free, open-source • Many cutting edge methods available • Flexible programming language • Interfaces with many languages • Python • Perl • Java • C

The R Server • R can be run as a remote compute server • Requires the Rserve package • Allows authenticated access over TCP/IP • Connections can maintain state • Client libraries for Java & C

R as a Web Service • On its own the R server is not a web service • We provide Java frontends to specific functionalities • The frontend classes are hosted in a Tomcat web container • Accessible via SOAP • Full Javadocs for all available WS’s

Flowchart

Functionality • Two classes of functionality • General functions • Allows you to supply data and build a predictive model • Sample from various distributions • Obtain scatter plots and histograms • Model development functions use a Java front-end to encapsulate model specific information

Functionality • Two classes of functionality • Model deployment • Allows you to build a model outside of the infrastructure • Place the final model in the infrastructure • Becomes available as a web service • Each model deployed requires its own front end class • In general, these classes are identical - could be autogenerated

Available Functionality • Predictive models - OLS, RF, CNN, LDA • Clustering - k-means • Statistical distributions • XY plot and scatter plots • Model deployment for single model types and ensemble model types

Deployed Models • Since deployed models are visible as web services we can build a simple web front end for them • Examples • NCI anti-cancer predictions • Ames mutagenicity predictions

Applications • The R WS is not restricted to ‘atomic’ functionality • Can write a whole R program • Load it on the R compute server • Provide a Java WS frontend • Examples • Feature selection • Automated model generation • Pharmacokinetic parameter calculation

Data Input/Output • Most modeling applications require data matrices • Depending on client language we can use • SOAP array of arrays (2D matrices) • SOAP array (1D vector form of a 2D matrix) • VOTables
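The two SOAP encodings above carry the same data. A minimal sketch of converting between an array of arrays and the 1D vector form is shown below; it assumes row-major order and that the dimensions travel with the vector, neither of which is stated in the talk. Note that R itself stores matrices column-major, so client and service would have to agree on the order. The function names are hypothetical.

```python
def flatten(matrix):
    """Flatten a 2D matrix (list of equal-length rows) into (vector, n_rows, n_cols)."""
    n_rows = len(matrix)
    n_cols = len(matrix[0]) if n_rows else 0
    if any(len(row) != n_cols for row in matrix):
        raise ValueError("all rows must have the same length")
    vector = [value for row in matrix for value in row]  # row-major order
    return vector, n_rows, n_cols


def unflatten(vector, n_rows, n_cols):
    """Rebuild the 2D matrix from its row-major 1D vector form."""
    if len(vector) != n_rows * n_cols:
        raise ValueError("vector length does not match dimensions")
    return [vector[r * n_cols:(r + 1) * n_cols] for r in range(n_rows)]


data = [[1.0, 2.0, 3.0], [4.0, 5.0, 6.0]]
vector, n_rows, n_cols = flatten(data)
print(vector, n_rows, n_cols)
print(unflatten(vector, n_rows, n_cols))
```

The 1D form is the safer choice for client toolkits that handle nested SOAP arrays poorly, at the cost of passing the dimensions separately.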

Data Input/Output • Some R web services can take a URL to a VOTables document • Conversion to R or Java matrices is done by a local VOTables Java library • R also has basic support for VOTables directly • Ignores binary data streams

Interacting With R WS’s • Traditional WS’s do not maintain state • Predictive models are different • A model is built at one time • May be used for prediction at another time • Need to maintain state • State is maintained by serialization to R binary files on the compute server • Clients deal with model ID’s

Interacting with R WS’s • Protocol • Send data to model WS • Get back model ID • Get various information via model ID • Fitted values • Training statistics • New predictions
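The protocol above, together with state kept as serialized files on the server, can be sketched with a toy in-process stand-in. This is not the CICC service or its API: the `ModelStore` class, its method names, the use of Python pickle files in place of R binary files, and the one-variable least-squares model are all assumptions for illustration.

```python
import os
import pickle
import tempfile
import uuid


class ModelStore:
    """Toy stand-in for the compute server: models persist as files keyed by ID."""

    def __init__(self, directory):
        self.directory = directory

    def _path(self, model_id):
        return os.path.join(self.directory, model_id + ".pkl")

    def build(self, x, y):
        """Fit y = intercept + slope * x by least squares; return a model ID."""
        n = len(x)
        mean_x = sum(x) / n
        mean_y = sum(y) / n
        sxx = sum((xi - mean_x) ** 2 for xi in x)
        sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
        slope = sxy / sxx
        model = {"slope": slope, "intercept": mean_y - slope * mean_x, "x": list(x)}
        model_id = uuid.uuid4().hex
        with open(self._path(model_id), "wb") as handle:
            pickle.dump(model, handle)  # state lives on disk, not in the client
        return model_id

    def _load(self, model_id):
        with open(self._path(model_id), "rb") as handle:
            return pickle.load(handle)

    def fitted_values(self, model_id):
        """Predictions for the training data of a stored model."""
        model = self._load(model_id)
        return self.predict(model_id, model["x"])

    def predict(self, model_id, new_x):
        """New predictions from a stored model, looked up by its ID."""
        model = self._load(model_id)
        return [model["intercept"] + model["slope"] * xi for xi in new_x]


store = ModelStore(tempfile.mkdtemp())
model_id = store.build([1.0, 2.0, 3.0], [2.0, 4.0, 6.0])  # send data, get ID back
print(store.fitted_values(model_id))
print(store.predict(model_id, [4.0]))
```

The client holds only the model ID between calls, so the build step and later predictions can happen in separate sessions or from different machines.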

Cheminformatics at Indiana University School of Informatics David J. Wild djwild@indiana.edu Associate Director of Chemical Informatics & Assistant Professor Indiana University School of Informatics, Bloomington http://djwild.info

Cheminformatics education at Indiana • M.S. in Chemical Informatics • 2 years, 36 semester hours • Includes a 6-hour capstone / research project • Opportunity to work in Laboratory Informatics (IUPUI) or closely with Bioinformatics (IUB) • Currently 9 students enrolled • Ph.D. in Informatics, Cheminformatics Specialty • 90 credit hours, including 30 hours dissertation research. Usually 4 years. • Research rotations expose students to research in related areas • Currently 4 students enrolled • Graduate Certificate • 4 courses, all available by Distance Education • I571 Chemical Information Technology • I572 Computational Chemistry & Molecular Modeling • I573 Programming for Science Informatics • I553 Independent Study in Chemical Informatics • D.E. students pay in-state fees! (~$800 per class) • See http://cheminfo.informatics.indiana.edu for more information, or a general review of cheminformatics education in Drug Discovery Today 11, 9&10 (May 2006), pp436-439
