Enhancing Cheminformatics Research Through Cyberinfrastructure

Explore the intersection of biology, chemistry, and computer science in cheminformatics research using cutting-edge cyberinfrastructure technologies. Learn about grid computing, web services, and collaboration tools shaping the future of cheminformatics.


Presentation Transcript


  1. Joint meeting of the Molecular Libraries Screening Centers Network (MLSCN) and the Exploratory Centers for Cheminformatics Research (ECCR): Talk I. July 17, 2006. Geoffrey Fox; Computer Science, Informatics, Physics; Pervasive Technology Laboratories, Indiana University, Bloomington IN 47401. gcf@indiana.edu http://www.infomall.org http://www.chembiogrid.org With apologies for my credentials: I have written a few papers on Biology, Chemistry and Crystallography while at Cambridge, Caltech and Syracuse, mostly on applications of parallel computing.

  2. Start-up and Organization • Local Teams, successful Prototypes and International Collaboration set up in 3 major focus areas • “Tool and Data” Cyberinfrastructure • “Archival Database and Simulation” Cyberinfrastructure • Education • Wiki chosen to support project as a shared editable web space • Web site http://www.chembiogrid.org • Building Collaboratory involving PubChem – Global Information System accessible anywhere and at any time – enhance PubChem with distributed tools (clustering, simulation, annotation etc.) and data • Initial results discussed at conferences/workshops/papers • Gordon Conferences, ACS, SDSC tutorial • First new Cheminformatics courses offered • Advisory board set up and met • Videoconferencing-based meetings with Peter Murray-Rust and group at Cambridge roughly every 2-3 weeks • Good interactions with NIH DTP, Lilly and Michigan ECCR

  3. http://www.chembiogrid.org

  4. CICC Senior Personnel • Peter T. Cherbas • Mehmet M. Dalkilic • Charles H. Davis • A. Keith Dunker • Kelsey M. Forsythe • Kevin E. Gilbert • John C. Huffman • Malika Mahoui • Daniel J. Mindiola • Santiago D. Schnell • William Scott • Craig A. Stewart • David R. Williams • Geoffrey C. Fox • Mu-Hyun (Mookie) Baik • Dennis B. Gannon • Marlon Pierce • Beth A. Plale • Gary D. Wiggins • David J. Wild • Yuqing (Melanie) Wu From Biology, Chemistry, Computer Science, Informatics at IU Bloomington and IUPUI (Indianapolis)

  5. CICC Advisory Board • Alan D. Palkowitz (Eli Lilly) • Andrew Martin (Kalypsys) • David Spellmeyer (IBM) • Dimitris K. Agrafiotis (Johnson & Johnson) • Horst Hemmerle (Eli Lilly) • James M. Caruthers (Purdue University) • Jeremy G. Frey (University of Southampton) • Joel Saltz (Ohio State University/University of Maryland/Johns Hopkins University) • John M. Barnard (Digital Chemistry) • John Reynders (Eli Lilly) • Peter Murray-Rust (University of Cambridge) • Peter Willett (University of Sheffield) • Thompson Doman (Eli Lilly) • Val Gillet (University of Sheffield). Industry and Academia. Met October 2005; will meet this fall.

  6. Publications Baik says he is especially productive due to Cyberinfrastructure

  7. Our Meetings are on the Web

  8. Varuna environment for molecular modeling (Baik, IU). Diagram labels: Chemical Concepts; Researcher; Papers etc.; Experiments; ChemBioGrid; Simulation Service (FORTRAN Code, Scripts); DB Service (Queries, Clustering, Curation, etc.); ReactionDB; QM Database; Condor; PubChem, PDB, NCI, etc.; QM/MM Database; TeraGrid Supercomputers; “Flocks”

  9. Cyberinfrastructure and Grids • These support eScience or distributed Computers, Databases, Instruments, Sensors and People • Grids use large scale managed Web services, the current major technology building on modern Industry enterprise and Internet systems • W3C, OASIS and OGF, the Open Grid Forum (Fox VP for eScience), develop standards ensuring distributed resources interoperate • Cheminformatics benefits from 2 styles of Grids • TeraGrid typifies Grid support of large scale computation of parallel simulations • Bioinformatics (BIRN, caBIG, MyGrid …), Earth Science and Astronomy Grids illustrate integration of real-time and archival data(bases) and computation • Well-designed Grids run faster than older approaches

  10. Cheminformatics Grids Need • Broad System standards such as WSDL, SOAP, WSRM, JSDL, BPEL • Domain specific data structures • CML Cheminformatics • GML Earth Science • CellML, SBML Biology • VOQL Astronomy • Use of specific Grid/Web service technologies such as • Web services directly for tools • Web service proxies for large simulation codes – ANYTHING can be made a Web service efficiently if execution/network access time ≥ 20ms • Portals/Portlets for user interfaces • Workflow for composition • Access to data and compute resources
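The 20 ms rule of thumb on slide 10 can be read as an overhead argument: a fixed per-call cost matters less as the wrapped computation gets longer. The sketch below is only an illustration of that arithmetic. The function name and the 2 ms overhead figure are assumptions made for the example and do not come from the talk.

```python
def service_efficiency(work_ms: float, overhead_ms: float) -> float:
    """Fraction of one call's wall-clock time spent on useful work."""
    if work_ms < 0 or overhead_ms < 0:
        raise ValueError("times must be non-negative")
    total = work_ms + overhead_ms
    return work_ms / total if total > 0 else 0.0


# ASSUMPTION: hypothetical fixed messaging overhead per Web service call.
ASSUMED_OVERHEAD_MS = 2.0

if __name__ == "__main__":
    for work_ms in (1.0, 20.0, 1000.0):
        eff = service_efficiency(work_ms, ASSUMED_OVERHEAD_MS)
        print(f"{work_ms:7.1f} ms of work: efficiency {eff:.3f}")
```

Under that assumed overhead, a 20 ms call spends about 91% of its time on useful work, while a 1 ms call spends only about a third.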

  11. TeraGrid: Integrating NSF Cyberinfrastructure. Map labels: Buffalo, Wisc, UC/ANL, Cornell, Utah, Iowa, PU, NCAR, PSC, IU, NCSA, Caltech, ORNL, USC-ISI, UNC-RENCI, SDSC, TACC. TeraGrid is a facility that integrates computational, information, and analysis resources at the San Diego Supercomputer Center, the Texas Advanced Computing Center, the University of Chicago / Argonne National Laboratory, the National Center for Supercomputing Applications, Purdue University, Indiana University, Oak Ridge National Laboratory, the Pittsburgh Supercomputing Center, and the National Center for Atmospheric Research.

  12. Indiana University has Highest Performance U.S. Academic Computer System: 20 Teraflops peak. Top500 Supercomputers in the world

  13. Products and Demonstrations: www.chembiogrid.org. Note mixture of In-house, Out of House, Commercial, Academic

  14. Next steps? • Define WSDL interfaces to enable global production of compatible Web services; refine CML • Ready to try “Prototype Production” • Develop more training material • Refine/go into production with key services including tools, workflows and TeraGrid style simulations in capacity and capability modes • In-house algorithm work for new services in clustering, diversity analysis, QSAR methodologies. CICC Prototype Web Services, Key Ideas: • Add value to PubChem with additional distributed services and databases • Wrapping existing code in web services is not difficult • Provide “core” (CDK) services and exemplars of typical tools • Provide access to key databases via a web service interface • Provide access to major Compute Grids. Basic cheminformatics services: Molecular weights; Molecular formulae; Tanimoto similarity; 2D Structure diagrams; Molecular descriptors; 3D structures; InChI generation/search; CMLRSS. Application based services: Compare (NIH); Toxicity predictions (ToxTree); Literature extraction (OSCAR3); Clustering (BCI Toolkit); Docking, filtering, ... (OpenEye); Varuna simulation
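Tanimoto similarity, listed among the basic cheminformatics services on slide 14, is the ratio of shared to total "on" bits in two molecular fingerprints. A minimal sketch follows, assuming fingerprints are represented as sets of on-bit positions. This is not the CDK API or the CICC service interface, and the bit positions are made up for the example.

```python
from typing import AbstractSet


def tanimoto(a: AbstractSet[int], b: AbstractSet[int]) -> float:
    """Tanimoto coefficient of two fingerprints given as sets of on-bit positions."""
    shared = len(a & b)
    union = len(a) + len(b) - shared
    # Convention chosen for this sketch: two empty fingerprints count as identical.
    return shared / union if union else 1.0


if __name__ == "__main__":
    # Hypothetical fingerprints: 2 shared bits out of 5 distinct bits.
    fp_a = {1, 4, 7, 9}
    fp_b = {1, 4, 8}
    print(tanimoto(fp_a, fp_b))
```

Here the two toy fingerprints share 2 of 5 distinct bits, so the coefficient is 2/5.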

  15. Web Service Locations. Cambridge University: • InChI generation / search • CMLRSS • OpenBabel. Indiana University: • Clustering • VOTables • OSCAR3 • Toxicity classification • Database services. SDSC: Typical TeraGrid Site. InfoChem: • SPRESI database. NIH: PubChem ….. Compare ….. Penn State University, CDK based services: • Fingerprints • Similarity calculations • 2D structure diagrams • Molecular descriptors

  16. Usage of Open Source Projects • A number of open source projects are used in our infrastructure • CDK provides the underlying cheminformatics toolkit • R provides the back-end modeling capabilities • OSCAR is used for literature mining • ToxTree is used to provide toxicity classification • Open data and standards as promoted by the Blue Obelisk project

  17. Contributions to Open Source Projects • We also contribute functionality to these projects • Molecular descriptor development to the CDK • Modifications of various CDK functionality to make them suitable for web service usage • Infrastructure for accessing R from the CDK • Packages to use the CDK from within R • Quality control, testing and documentation. References: Steinbeck, C. et al.; Curr. Pharm. Des., 2006, 12(17), 2110-2120. Guha, R.; CDK News, 2005, 2(1), 7-13.

  18. Workflows Using Chemical Literature. Diagram labels: Find similar documents; Bulk download of PubMed abstracts; Find similar molecules; PDBBind; OSCAR3 Service; OSCAR3 program; PubChem; Local DTP database; Extract chemical structures; Searchable (structure/similarity) Grid database; Clustering of documents linked to clustering of chemicals. Extracted table (SMILES, NAME, PubMed ID): CCC, propane, 1425356; CC, ethane, 3546453; ..... Note: All of PubMed “just” takes about a day to run through OSCAR3 on 2048 node Big Red.
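The link between documents and chemicals on slide 18 rests on a table of (SMILES, name, PubMed ID) rows. One way to picture how such a table supports "find similar documents" is a pair of inverted indexes, sketched below. The function names are hypothetical, the third row is an invented addition so that two documents share a compound, and exact SMILES matching stands in for the structure/similarity search the slide describes.

```python
from collections import defaultdict

# Rows in the (SMILES, name, PubMed ID) layout shown on the slide.
ROWS = [
    ("CCC", "propane", "1425356"),
    ("CC", "ethane", "3546453"),
    ("CC", "ethane", "1425356"),  # ASSUMPTION: extra row added for illustration only
]


def build_indexes(rows):
    """Return (SMILES -> set of PubMed IDs, PubMed ID -> set of SMILES)."""
    docs_by_smiles = defaultdict(set)
    smiles_by_doc = defaultdict(set)
    for smiles, _name, pmid in rows:
        docs_by_smiles[smiles].add(pmid)
        smiles_by_doc[pmid].add(smiles)
    return docs_by_smiles, smiles_by_doc


def documents_sharing_chemicals(pmid, docs_by_smiles, smiles_by_doc):
    """Other documents that mention at least one chemical found in `pmid`."""
    related = set()
    for smiles in smiles_by_doc.get(pmid, ()):
        related |= docs_by_smiles[smiles]
    related.discard(pmid)
    return related


if __name__ == "__main__":
    by_smiles, by_doc = build_indexes(ROWS)
    print(documents_sharing_chemicals("3546453", by_smiles, by_doc))
```

Replacing the exact-match lookup with a fingerprint similarity search would move this toward the "searchable (structure/similarity)" database named on the slide.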

  19. Diagram labels: MyResearchDatabase; Bibliographic Database; Web service Wrappers; Document-enhanced Cyberinfrastructure; Del.icio.us; Windows Live Academic Search; Traditional Cyberinfrastructure; Export: RSS, Bibtex, Endnote etc.; CiteULike; Google Scholar; Connotea; Citeseer; Bibsonomy; Science.gov; Biolicious; PubChem; Generic Document Tools; CMT Conference Management; PubMed; Community Tools; Manuscript Central; Integration/Enhancement User Interface etc.; Existing User Interface; New Document-enhanced Research Tools including Web2.0, Mashups, Annotation; Existing Document-based Research Tools

  20. Products and Demonstrations II

  21. Example HTS workflow: organization & flagging (Taverna Workflow) • A biological screen is selected. The activity results for all the compounds are extracted from the database (currently the DTP Tumor Cell Line database) • OpenEye FILTER is used to calculate biological and chemical properties of the compounds that are related to their potential effectiveness as drugs • The compounds are clustered on chemical structure similarity, to group similar compounds together • The compounds, along with property and cluster information, are converted to VOTABLES format and displayed in VOPLOT. David Wild, Research Overview, July 2006
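The four stages of the HTS workflow on slide 21 (extract, compute properties, cluster, tabulate) can be sketched as a small pipeline. Everything below is a stand-in chosen for illustration: the `Compound` records are toy data, the molecular-weight cutoff is a hypothetical substitute for OpenEye FILTER, a simple leader algorithm substitutes for the clustering service, and a list of dictionaries substitutes for VOTable output. None of it reflects the actual Taverna workflow or vendor APIs.

```python
from dataclasses import dataclass
from typing import Dict, FrozenSet, List


@dataclass(frozen=True)
class Compound:
    cid: str
    fingerprint: FrozenSet[int]  # on-bit positions (toy data)
    mol_weight: float
    activity: float


def tanimoto(a: FrozenSet[int], b: FrozenSet[int]) -> float:
    shared = len(a & b)
    union = len(a) + len(b) - shared
    return shared / union if union else 1.0


def passes_filter(c: Compound, max_weight: float = 500.0) -> bool:
    # ASSUMPTION: a single weight cutoff standing in for a real property filter.
    return c.mol_weight <= max_weight


def leader_cluster(compounds: List[Compound], threshold: float = 0.5) -> Dict[str, int]:
    """Assign each compound to the first cluster leader it resembles, else start a new cluster."""
    leaders: List[Compound] = []
    labels: Dict[str, int] = {}
    for c in compounds:
        for idx, leader in enumerate(leaders):
            if tanimoto(c.fingerprint, leader.fingerprint) >= threshold:
                labels[c.cid] = idx
                break
        else:
            leaders.append(c)
            labels[c.cid] = len(leaders) - 1
    return labels


def run_workflow(screen: List[Compound]) -> List[dict]:
    kept = [c for c in screen if passes_filter(c)]
    labels = leader_cluster(kept)
    return [
        {"cid": c.cid, "activity": c.activity,
         "mol_weight": c.mol_weight, "cluster": labels[c.cid]}
        for c in kept
    ]


# Hypothetical screen results.
SCREEN = [
    Compound("A", frozenset({1, 2, 3, 4}), 320.0, 0.9),
    Compound("B", frozenset({1, 2, 3, 5}), 410.0, 0.7),
    Compound("C", frozenset({10, 11}), 280.0, 0.1),
    Compound("D", frozenset({1, 2, 3, 4}), 720.0, 0.8),
]

if __name__ == "__main__":
    for row in run_workflow(SCREEN):
        print(row)
```

In this toy run, compound D is dropped by the weight cutoff, A and B fall into one cluster, and C forms its own.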

  22. Run Workflow Load Workflow Result Output URL Result Output Current Process

  23. Lilly very interested in our new educational programs

  24. Total Grad Enrollment: Chem-, Lab, Bio-, Health Informatics, Fall 2005. Red = Expected, Chem, Fall 2006

  25. Formal Cheminformatics Courses • I571 Chemical Information Technology (3 cr.) • Distance Ed section had 10 students in Fall 2005, from California to Connecticut • I572 Computational Chemistry and Molecular Modeling (3 cr.) • I573 Programming Techniques for Chemical and Life Science Informatics (3 cr.) • I553 Independent Study in Chemical Informatics (3 cr.) • Above courses required for the new Graduate Certificate Program in Chemical Informatics • Also I533 (Cheminformatics seminar)

  26. More detailed Slides not used

  27. TeraGrid Hardware and Software • TeraGrid is coordinated at the University of Chicago and includes 8 partner facilities • NCSA, SDSC, PSC, ORNL, IU, PU, TACC, UC/ANL • TeraGrid hardware totals > 102 teraflops of computing power. • Comprehensive information available from http://www.teragrid.org/userinfo/hardware/overview.php. • Systems are primarily Linux clusters. • Grid software and services (Globus, MyProxy, etc) provide a uniform means for accessing TeraGrid resources. • Scheduling, running and monitoring jobs • Monitoring resources • Moving and managing remote files. • Common service APIs simplify the process for building remote tools.

  28. Web Service to generate custom force fields. Prototype CICC Project: Controlling the TGFb pathway. Collaboration between Baik & Zhang at IU. Diagram labels: Simulations in-house; Molecules in Varuna; AutoGeFF; VARUNA; Conceptual Understanding of TGFb Inhibition; Inactive TGFb; Active TGFb; With inhibitor 1IAS; PubChem; Experiments in the Zhang Lab; PDB. Questions: • What molecular feature controls inhibitor binding? • How do mutations impact binding?

  29. MLSCN Data: How services and workflows are used • PubChem interfaces to workflows via SOAP • Data is stored in PubChem • MLSCN submits HTS data to PubChem and/or sends directly to workflow for real-time feedback • Workflows perform different kinds of analysis on the MLSCN data; the variety of workflows is limitless • End-user applications and interfaces utilize the information streams from the workflows for human interaction with the data and analysis
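Slide 29 says workflows are reached via SOAP. As a rough picture of what such a message looks like, the sketch below builds a SOAP 1.1 envelope with Python's standard library. The envelope namespace is the standard SOAP 1.1 one, but the service namespace, the `AnalyzeAssay` operation and its element names are hypothetical and are not PubChem's or CICC's actual interface.

```python
import xml.etree.ElementTree as ET

SOAP_NS = "http://schemas.xmlsoap.org/soap/envelope/"  # SOAP 1.1 envelope namespace
SERVICE_NS = "urn:example:hts-workflow"  # ASSUMPTION: hypothetical service namespace


def build_request(assay_id: str, compound_ids: list) -> str:
    """Serialize a SOAP request for a hypothetical AnalyzeAssay operation."""
    ET.register_namespace("soap", SOAP_NS)
    envelope = ET.Element(f"{{{SOAP_NS}}}Envelope")
    body = ET.SubElement(envelope, f"{{{SOAP_NS}}}Body")
    operation = ET.SubElement(body, f"{{{SERVICE_NS}}}AnalyzeAssay")
    ET.SubElement(operation, f"{{{SERVICE_NS}}}assayId").text = assay_id
    for cid in compound_ids:
        ET.SubElement(operation, f"{{{SERVICE_NS}}}compoundId").text = cid
    return ET.tostring(envelope, encoding="unicode")


if __name__ == "__main__":
    print(build_request("assay-1", ["c1", "c2"]))
```

In practice a WSDL description, as proposed in the "Next steps" slide, would fix the operation and element names so that clients can generate such messages automatically.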
