Patrick McConnell Duke Comprehensive Cancer Center patrick.mcconnell@duke Shannon Hastings

caGrid Version 0.5 Reference Implementation
RProteomics
caBIG Architecture Workspace Face to Face
Georgetown University
August 16th-18th, 2005
Patrick McConnell, Duke Comprehensive Cancer Center, patrick.mcconnell@duke.edu
Shannon Hastings, Ohio State University, hastings@bmi.osu.edu

Outline
• High Level Overview of Proteomics
• Data Model
• Project Architecture
• Process of getting to “Silver” level compliance
• Functionality Exposed to Grid
• Process of Grid Enablement
• Demo/Screenshots
• Lessons Learned / Technical Difficulties / Wish List
• Acknowledgements

Proteomics Overview
• Goal
  • Find biomarkers
  • Build a predictive model
• Proteins are split into peptide fragments
• Mass is measured by time-of-flight (TOF)
• Mass of peptides can be used to identify proteins
• Peptides can undergo a second MS to help identification
http://www.appliedbiosystems.com/catalog/myab/StoreCatalog/products/CategoryDetails.jsp?hierarchyID=101&category3rd=112051&trail=no
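As a rough illustration of the time-of-flight bullet above, the sketch below assumes an idealized linear TOF analyzer (no reflectron, no delayed extraction, no calibration terms). An ion of charge z accelerated through a potential U gains kinetic energy z·e·U and then drifts a length L in time t, which gives m/z = 2·e·U·t² / L². The function names, the 1 m drift length, and the 20 kV voltage are assumptions made for this example and do not come from RProteomics or any specific instrument.

```python
import math

E_CHARGE = 1.602176634e-19   # elementary charge, coulombs
DALTON = 1.66053906660e-27   # unified atomic mass unit, kilograms

def mz_from_flight_time(t_seconds, drift_length_m, accel_voltage_v):
    """Idealized linear TOF: z*e*U = (1/2)*m*(L/t)^2, solved for m/z in Da per charge."""
    mass_per_charge_kg = (
        2.0 * E_CHARGE * accel_voltage_v * t_seconds ** 2 / drift_length_m ** 2
    )
    return mass_per_charge_kg / DALTON

def flight_time_from_mz(mz, drift_length_m, accel_voltage_v):
    """Inverse relation: flight time in seconds for a given m/z (Da per charge)."""
    return drift_length_m * math.sqrt(mz * DALTON / (2.0 * E_CHARGE * accel_voltage_v))

# Hypothetical instrument settings, chosen only for illustration.
DRIFT_LENGTH_M = 1.0
ACCEL_VOLTAGE_V = 20000.0

t = flight_time_from_mz(1000.0, DRIFT_LENGTH_M, ACCEL_VOLTAGE_V)
recovered_mz = mz_from_flight_time(t, DRIFT_LENGTH_M, ACCEL_VOLTAGE_V)
```

Because flight time grows with the square root of m/z, a peptide four times heavier arrives twice as late under these assumptions.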

Proteomics Data
• A modest study can be on the order of 10 GB of data
[Figure residue: axis label "Time"]

Project Overview
• RProteomics is a development project in the Proteomics SIG of the ICR Workspace
• Developing analytical routines for proteomics data
  • Denoising, background removal, peak identification, spectral alignment, normalization, peptide quantitation
• Focus is on analytics
  • NOT databases, LIMS, protein identification
• RProteomics is a critical step in the proteomics pipeline
  • LIMS -> repository -> RProteomics -> classification -> protein identification
• RProteomics provides integration
  • Q5 classification has been integrated
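To make the list of analytical routines more concrete, here is a minimal sketch of four of them chained together: denoising, background removal, normalization, and peak identification. These are deliberately simple stand-ins (moving average, rolling-minimum baseline, total-intensity scaling, median-based threshold), not the statistical methods RProteomics implements. The parameter names window_size and threshold_multiplier echo the WindowSize and ThreshholdMultiplier service parameters listed under Data Model, but how RProteomics uses those parameters is an assumption here.

```python
from statistics import median

def moving_average(intensities, window_size):
    """Denoise: average each point with its neighbours (window clipped at the edges)."""
    half = window_size // 2
    n = len(intensities)
    smoothed = []
    for i in range(n):
        lo, hi = max(0, i - half), min(n, i + half + 1)
        smoothed.append(sum(intensities[lo:hi]) / (hi - lo))
    return smoothed

def remove_background(intensities, window_size):
    """Background removal: subtract a rolling-minimum baseline estimate."""
    half = window_size // 2
    n = len(intensities)
    return [
        intensities[i] - min(intensities[max(0, i - half):min(n, i + half + 1)])
        for i in range(n)
    ]

def normalize_total_intensity(intensities):
    """Normalization: scale so that the intensities sum to 1."""
    total = sum(intensities)
    return [v / total for v in intensities] if total else list(intensities)

def find_peaks(intensities, threshold_multiplier):
    """Peak identification: local maxima above threshold_multiplier * median intensity."""
    threshold = threshold_multiplier * median(intensities)
    return [
        i for i in range(1, len(intensities) - 1)
        if intensities[i] > threshold
        and intensities[i] > intensities[i - 1]
        and intensities[i] >= intensities[i + 1]
    ]

# Toy spectrum: flat background of 10 with two peaks centred at indices 3 and 9.
raw = [10, 10, 20, 40, 20, 10, 10, 10, 30, 70, 30, 10, 10]

smoothed = moving_average(raw, window_size=3)
corrected = remove_background(smoothed, window_size=7)
normalized = normalize_total_intensity(corrected)
peaks = find_peaks(normalized, threshold_multiplier=2)
```

Each step takes and returns a plain list of intensities, so steps can be reordered or swapped for a better method without touching the others.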

Statistics: Background Removal

Statistics: Denoising

Statistics: Spectral Alignment

Statistics: Protein Quantitation

Data Model
• mzXML
  • Encodes raw spectra data (mz-intensity pairs)
  • Some metadata about instrumentation
  • Utilizes base64 encoding for binary data
• scanFeatures
  • Encodes analysis results as a set of features
  • Some metadata about the experiment
  • Utilizes base64 encoding for binary data
• Service parameters
  • JpegImage (GWSDL=scanFeatures.xsd, )
  • Lsid (GWSDL=scanFeatures.xsd, )
  • WindowSize (GWSDL=scanFeatures.xsd, )
  • ThreshholdMultiplier (GWSDL=scanFeatures.xsd, )
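The base64 bullets above can be illustrated with a small round trip. This sketch assumes the peak list is stored as interleaved (m/z, intensity) pairs of 32-bit floats in network (big-endian) byte order, which is a common mzXML layout; a real file declares its precision and byte order in its own attributes, and those should be honoured instead. The function names are hypothetical.

```python
import base64
import struct

def encode_peaks(pairs):
    """Pack (m/z, intensity) pairs as big-endian 32-bit floats, then base64-encode."""
    flat = [value for pair in pairs for value in pair]
    packed = struct.pack(">%df" % len(flat), *flat)
    return base64.b64encode(packed).decode("ascii")

def decode_peaks(text):
    """Reverse of encode_peaks: base64 text back to a list of (m/z, intensity) tuples."""
    raw = base64.b64decode(text)
    count = len(raw) // 4  # 4 bytes per 32-bit float (assumed precision)
    values = struct.unpack(">%df" % count, raw[:count * 4])
    return list(zip(values[0::2], values[1::2]))

# Values chosen to be exactly representable as 32-bit floats.
pairs = [(1000.5, 12.0), (1001.25, 340.0)]
encoded = encode_peaks(pairs)
decoded = decode_peaks(encoded)
```

Two pairs occupy 16 bytes of binary data, which base64 expands to 24 ASCII characters; that roughly one-third overhead is part of why large studies become awkward to move as XML.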

Project Architecture

Process of getting to “Silver” level compliance
• Programming and messaging interfaces
  • Apache Axis for web services
  • Wrapped functionality with Java interfaces that “made sense”
• Vocabularies, terminologies, and ontologies
• Data elements
  • Wrote tool for XML Schema to XMI conversion
  • Manually curated UML
  • Went through semantic connecting process
• Information models
  • XML Schema to begin with, so information models were easy

Functionality Exposed to the Grid
• Analytical service: no security requirements
  • Discuss its input and output and what it does scientifically
• Functionality to be exposed:
  • 20+ more statistical methods
  • Data access methods, translation methods (planned, not yet in scope)

Process of Grid Enablement
• Process
  • Create or extract the data types using XML Schema
  • Upload the data types into the caGrid GME
  • Use the Analytical Toolkit Portal to create and modify the grid service interface
  • Implement the generated server stub by making the appropriate calls into the original non-grid-enabled RProteomics application
  • Compile and deploy
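The stub-implementation step can be sketched as a thin delegation layer. The real services were generated as Java code for Globus and Axis; the Python below only illustrates the pattern, and every name in it (legacy_denoise, Spectrum, WindowSize, DenoiseServiceStub, DenoiseServiceImpl) is hypothetical. It also shows service parameters wrapped in small objects rather than passed as bare primitives.

```python
from dataclasses import dataclass
from typing import List

# Existing, already tested non-grid code (stand-in for an RProteomics routine).
def legacy_denoise(intensities: List[float], window: int) -> List[float]:
    half = window // 2
    n = len(intensities)
    out = []
    for i in range(n):
        lo, hi = max(0, i - half), min(n, i + half + 1)
        out.append(sum(intensities[lo:hi]) / (hi - lo))
    return out

# Data types that would be described in XML Schema (hypothetical shapes).
@dataclass
class WindowSize:
    value: int

@dataclass
class Spectrum:
    lsid: str
    intensities: List[float]

# Stub as a code generator might emit it: the interface exists, the body does not.
class DenoiseServiceStub:
    def denoise(self, spectrum: Spectrum, window_size: WindowSize) -> Spectrum:
        raise NotImplementedError("generated stub: implement by calling the application")

# Hand-written part: unwrap the grid types, delegate, wrap the result again.
class DenoiseServiceImpl(DenoiseServiceStub):
    def denoise(self, spectrum: Spectrum, window_size: WindowSize) -> Spectrum:
        result = legacy_denoise(spectrum.intensities, window_size.value)
        return Spectrum(lsid=spectrum.lsid, intensities=result)

service = DenoiseServiceImpl()
request = Spectrum(lsid="urn:lsid:example.org:spectrum:1", intensities=[1.0, 4.0, 1.0])
response = service.denoise(request, WindowSize(3))
```

Keeping the implementation class this thin means the statistical code can still be tested without any grid infrastructure in place.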

Demo and/or Screenshots
• Demonstration of RProteomics GUI with grid functionality

Lessons Learned / Technical Difficulties / Wish List
• Think grid from the beginning
  • Have an idea what the service interface will be ahead of time
  • Wrap parameters with objects
• Technology is complex
  • XML, Schema, CDEs, Globus, Web Services, etc.
• Installation is complex
  • Have to have working knowledge of Tomcat, Axis, Ant, environment variables, etc.
  • Need to have compatible versions of each component, esp. Java 1.4.2_04
• Wish list
  • Wizard for grid-enabling existing code
  • Documentation of every aspect of installation and functionality
  • Clone Shannon for each development project

Lessons Learned / Technical Difficulties / Wish List
• Starting with a non-grid-enabled application that was already tested and stable made the grid service wrapper easier to debug.
• Need a standard mechanism for dealing with large data objects.
  • Some sort of lazy-loaded object/pointer would be sufficient.
• Integrating the toolkit portal into some standard IDEs might make development even easier.
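One way to picture the lazy-loaded object/pointer on the wish list is a handle that carries only a reference and fetches the data on first use. This is a sketch of the general idea, not a caGrid mechanism; the class name, the loader callback, and the in-memory store standing in for a remote repository are all assumptions.

```python
class LazySpectrumHandle:
    """Carries a reference to a large object and loads it only when first needed."""

    def __init__(self, reference, loader):
        self.reference = reference   # e.g. an identifier for the remote object
        self._loader = loader        # callable that fetches the data for a reference
        self._data = None
        self._loaded = False

    @property
    def is_loaded(self):
        return self._loaded

    def get(self):
        """Fetch on first call, then return the cached data on later calls."""
        if not self._loaded:
            self._data = self._loader(self.reference)
            self._loaded = True
        return self._data

# Stand-in for a remote repository; a real loader would stream from storage.
FAKE_STORE = {"urn:lsid:example.org:spectrum:1": [10.0, 20.0, 40.0, 20.0, 10.0]}
load_calls = []

def load_from_store(reference):
    load_calls.append(reference)
    return FAKE_STORE[reference]

handle = LazySpectrumHandle("urn:lsid:example.org:spectrum:1", load_from_store)
loaded_before_use = handle.is_loaded
first = handle.get()
second = handle.get()
```

Only the small handle would need to travel in a service message, while the bulk data moves once, and only if the receiving side actually asks for it.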

Acknowledgements
• Duke, ICR Developer
  • Patrick McConnell, Project lead
  • Richard Haney, Architect and developer of statistical systems
  • Salvatore Mungal, Middle-tier Java developer
  • Mark Peedin, Database developer
• Northwestern University, Collaborator
  • Simon Lin, Proteomics domain expert
• Oregon Health Sciences University, ICR Adopter
  • Shannon McWeeney
  • Veena Rajaraman
• University of Pennsylvania, ICR Adopter
  • David Fenstermacher
  • Craig Street
• University of North Carolina, Collaborator
  • Cristoph Borchers, Proteomics scientist
• OSU, caGRID Team
  • Shannon Hastings
  • Scott Oster
  • Stephen Langella
  • Tahsin Kurc
  • Joel Saltz
• Architecture
  • Arumani Manisundaram
  • Avinash Krishnakant
• VCDE
  • Brian Davis, Workspace Lead
  • George Komatsoulis, VCDE lead
  • Claire Wolfe, VCDE curator
  • Salvatore Mungal, VCDE mentor
