NeSC Data Projects and Initiatives Dr. Dave Berry Research Manager
Contents • The Data Deluge • Web Services • The DAI vision • The OGSA-DAI Project and GGF • The OGSA-DAI Software • Edikt • Other relevant projects in the UK
Acknowledgements This talk includes material prepared by: • The OGSA-DAI project • The e-Diamond project • The BRIDGES project • The GGF OGSA Working Group • and others…
The Data Deluge • Entering an age of data • CERN: LHC will generate 1 GB/s = 10 PB/y • VLBA (NRAO) generates 1 GB/s today • Pixar generates 100 TB/movie • Data stored in many different ways • Relational databases • XML databases • Flat files • Need ways to facilitate • Data discovery • Data access • Data integration • (Image labels: Mont Blanc (4810 m), Downtown Geneva)
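The LHC figure on this slide equates 1 GB/s with 10 PB/y. A quick check, sketched below, shows that this holds if the machine records data for roughly 10^7 seconds a year rather than continuously. The 10^7-second running time is an assumption made here for illustration, not a number from the talk, and the function name is hypothetical.

```python
# Back-of-the-envelope check of "1 GB/s = 10 PB/y".
# ASSUMPTION: about 10**7 seconds of data taking per year (not stated on the slide).

SECONDS_PER_CALENDAR_YEAR = 365 * 24 * 3600  # 31,536,000 s
ASSUMED_RUNNING_SECONDS = 10**7              # assumed data-taking time per year


def petabytes_per_year(rate_gb_per_s, seconds):
    """Convert a sustained rate in GB/s into PB over the given number of seconds."""
    gigabytes = rate_gb_per_s * seconds
    return gigabytes / 1_000_000  # decimal units: 1 PB = 10**6 GB


continuous = petabytes_per_year(1, SECONDS_PER_CALENDAR_YEAR)
duty_cycled = petabytes_per_year(1, ASSUMED_RUNNING_SECONDS)

print(f"Recording all year:      {continuous:.1f} PB")
print(f"Recording for 10^7 s:    {duty_cycled:.1f} PB")
```

Under the continuous assumption the yearly volume would be about three times the slide's figure, so the 10 PB/y number is best read as a duty-cycled estimate.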
Astronomical Databases • No. & sizes of data sets as of mid-2002, grouped by wavelength • 12 waveband coverage of large areas of the sky • Total about 200 TB data • Doubling every 12 months • Largest catalogues near 1 billion objects • Data and images courtesy of Alex Szalay, Johns Hopkins
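To make "doubling every 12 months" concrete, the sketch below extrapolates from the slide's mid-2002 total of about 200 TB. The extrapolation itself is only an illustration: the talk does not give projected totals, and the function name is hypothetical.

```python
# Illustrative projection of a data volume that doubles every 12 months.
# The 200 TB baseline (mid-2002) is from the slide; the projections are not.

BASELINE_TB = 200        # total as of mid-2002
DOUBLING_MONTHS = 12     # doubling period quoted on the slide


def projected_size_tb(months_after_baseline):
    """Return the projected total in TB after the given number of months."""
    return BASELINE_TB * 2 ** (months_after_baseline / DOUBLING_MONTHS)


for years in range(0, 4):
    print(f"mid-{2002 + years}: {projected_size_tb(12 * years):.0f} TB")
```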
Bioinformatics Databases • Bibliographic (MedLine, …) • Amino Acid Seq (SWISS-PROT, …) • 3D Molecular Structure (PDB, …) • Nucleotide Seq (GenBank, EMBL, …) • Biochemical Pathways (KEGG, WIT…) • Molecular Classifications (SCOP, CATH,…) • Motif Libraries (PROSITE, Blocks, …) • (Chart: PDB Content Growth)
Web Services • Using the protocols and ideas that have made the web a success for humans… • And applying them to distributed programming • HTTP • Single networking port • Autonomy & Failure handling • Open standards • Tools & Platforms • Apache Axis • WebSphere, .NET, Oracle Application Server, Sun ONE, …
Open Grid Services Architecture • The architecture of the Global Grid Forum • Built on Web Services and Grid Protocols • Share resource • Access resource • Manage resource • Continuous Availability • Applications on demand • Resources on demand • Secure and universal access • Global Accessibility • Business integration • Vast resource scalability
GGF11: OGSA specification informational document • Service categories: Context Services, Information Services, Data Services, Execution Mgmt Services, Infrastructure Services, Self Mgmt Services, Resource Mgmt Services, Security Services • Capabilities shown in the diagram: Cataloging, Provisioning, VO Mgmt, Integration, Policy Mgmt, Access, Troubleshooting, Event Mgmt, Discovery, Logging, Application Mgmt, Workflow Mgmt, Workload Mgmt, Execution Planning, Job Mgmt, WSRF, WSN, WSDM, Naming, Reservation, Configuration, Deployment, Heterogeneity Mgmt, Authentication, Optimization, Authorization, Service Level Attainment, Integrity, QoS Mgmt, Boundary Traversal
Data Access and Integration • Web Services for querying and integrating structured data resources • The foundation framework for: • Building tailored DAI applications • Higher-level services: • Replication: Data located in multiple locations • Federation: Composition of multiple sources • Provenance: How was data generated?
The OGSA-DAI Project • Funded by the Grid Core Programme • OGSA-DAI • £3 million, 18 months, from Feb 2002 • Three major releases, three interim releases • DAIT (DAI-Two) • Keep the OGSA-DAI brand name • £1.5 million, 24 months, from Oct 2003 • Four major releases
DAI in GGF and OGSA • Data Access and Integration Services WG • Strong involvement from OGSA-DAI members • Standardise the interfaces – WS-DAI • OGSA-DAI a reference implementation • Experience informing specification work • OGSA WG Data Design Team • Designing the data-oriented aspects of OGSA • Created after GGF10 (March 2004) • Led by NeSC
OGSA Design Teams • Data Service design team • Information Service design team • EMS design team • Naming design team • Self Mgmt design team • Resource Mgmt design team • Security Service design team • Core (roadmap) design team • (Diagram: the OGSA-WG design teams overlaid on the OGSA service categories: Context, Info, Data, Execution Mgmt, Infra, Self Mgmt, Rsrc Mgmt and Security Services)
Data Services design team • Informal domain expert groups within OGSA • May include co-chairs of other WG/RGs • Output is included in OGSA specification • (Diagram: the OGSA Data Service Design team sits within OGSA-WG and connects to DAIS-WG, GSM-WG, GFS-WG, Info-D WG, ADF, OREP, … via telecons and F2F meetings)
OGSA v2 Document Deliverables • Root Documents: Glossary, Use case doc, Architecture v2 • Design team Documents: Service descriptions, Scenarios • Working Group Specifications: GGF Recommendation documents
How OGSA-DAI works • 1a. Request to Registry for sources of data about “x” • 1b. Registry responds with Factory handle • 2a. Request to Factory for access to database • 2b. Factory creates GridDataService (GDS) to manage access • 2c. Factory returns handle of GDS to client • 3a. Client queries GDS with XPath, SQL, etc • 3b. GDS interacts with database • 3c. Results of query returned to client as XML • (Diagram: Client, Registry, Factory, Grid Data Service, XML / Relational database; arrow types: SOAP/HTTP, service creation, API interactions)
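The registry, factory and data-service interaction on this slide can be sketched in a few lines. This is an in-process toy, not the OGSA-DAI API: the class and method names are hypothetical, the services are plain Python objects rather than SOAP endpoints, and SQLite stands in for the database.

```python
import sqlite3
import xml.etree.ElementTree as ET


class GridDataService:
    """Hypothetical stand-in for a GDS: manages access to one database."""

    def __init__(self, connection):
        self._connection = connection

    def query(self, sql):
        # 3b. The GDS interacts with the database.
        cursor = self._connection.execute(sql)
        columns = [description[0] for description in cursor.description]
        # 3c. Results are returned to the client as XML.
        root = ET.Element("results")
        for row in cursor:
            row_element = ET.SubElement(root, "row")
            for name, value in zip(columns, row):
                ET.SubElement(row_element, name).text = str(value)
        return ET.tostring(root, encoding="unicode")


class Factory:
    """Hypothetical factory: creates a data service per client request."""

    def __init__(self, connection):
        self._connection = connection

    def create_service(self):
        # 2b. Create the service; 2c. hand it back to the client.
        return GridDataService(self._connection)


class Registry:
    """Hypothetical registry: maps a topic to the factory that serves it."""

    def __init__(self):
        self._factories = {}

    def register(self, topic, factory):
        self._factories[topic] = factory

    def find(self, topic):
        # 1a. Client asks about a topic; 1b. registry answers with a factory.
        return self._factories[topic]


def demo():
    database = sqlite3.connect(":memory:")
    database.execute("CREATE TABLE stars (name TEXT, magnitude REAL)")
    database.execute("INSERT INTO stars VALUES ('Vega', 0.03)")

    registry = Registry()
    registry.register("stars", Factory(database))

    factory = registry.find("stars")      # steps 1a, 1b
    service = factory.create_service()    # steps 2a, 2b, 2c
    return service.query("SELECT name, magnitude FROM stars")  # steps 3a to 3c


if __name__ == "__main__":
    print(demo())
```

The client never touches a driver or connection string; it only holds handles, which is the point the slide makes.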
OGSA-DAI compared to JDBC • Language independence at the client end • Platform independence • Do not have to worry about connection technology, drivers, etc • Can handle XML resources • Can embed additional functionality at the service end • Transformations • Third party delivery • Avoiding unnecessary data movement • Provision of Metadata is powerful • Usefulness of the Registry for service discovery • Dynamic service binding process
Future DAI Services • 1a. Request to Registry for sources of data about “x” & “y” • 1b. Registry responds with Factory handle • 2a. Request to Factory for access and integration from resources Sx and Sy • 2b. Factory creates GridDataServices network • 2c. Factory returns handle of GDS to client • 3a. Client submits sequence of scripts, each has a set of queries to GDS with XPath, SQL, etc • 3b. Client tells analyst • 3c. Sequences of result sets returned to analyst as formatted binary described in a standard XML notation • (Diagram: Application Code, Client, Analyst, Registry, Data Access & Integration master, a network of GDS and GDTS instances, XML database, Relational database; arrow types: SOAP/HTTP, service creation, API interactions)
Activities are the drivers • Express a task to be performed by a GDS • Three broad classes of activities: • Statement • Transformation • Delivery • Extensible: • Easy to add new functionality • Does not require modification to the service interface • Extensions operate within the OGSA-DAI framework • Functionality: • Implemented at the service • Work where the data is (no need to move data back to the client)
Building Applications • Activities are grouped together • Perform document • Data can flow between activities • Optimisation • Avoids multiple message exchanges • Can deliver to other GDSs • Prerequisite for data integration • Base middleware for projects requiring data access • Some capability for data integration
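The idea of grouping activities into one perform document, with data flowing from a statement through a transformation to a delivery, can be sketched as a simple pipeline. This is a toy under stated assumptions: the function names and the list-based "document" are hypothetical and do not follow the real OGSA-DAI perform document schema, and the sample table is made up.

```python
import gzip
import sqlite3


def sql_statement(sql):
    """Statement activity: run a query and emit the rows as CSV bytes."""
    def run(service, _input):
        rows = service["db"].execute(sql).fetchall()
        text = "\n".join(",".join(str(value) for value in row) for row in rows)
        return text.encode("utf-8")
    return run


def gzip_transform():
    """Transformation activity: compress whatever the previous activity produced."""
    def run(_service, data):
        return gzip.compress(data)
    return run


def deliver_to(destination):
    """Delivery activity: hand the data to a named destination (here, a dict)."""
    def run(service, data):
        service["outbox"][destination] = data
        return data
    return run


def perform(service, document):
    """Execute every activity at the service, piping each output to the next."""
    data = None
    for activity in document:
        data = activity(service, data)
    return data


def demo():
    database = sqlite3.connect(":memory:")
    database.execute("CREATE TABLE genes (symbol TEXT, chromosome INTEGER)")
    database.executemany(
        "INSERT INTO genes VALUES (?, ?)", [("ACE", 17), ("AGT", 1)]
    )
    service = {"db": database, "outbox": {}}

    # One "perform document": a single request carrying three chained activities.
    document = [
        sql_statement("SELECT symbol, chromosome FROM genes ORDER BY symbol"),
        gzip_transform(),
        deliver_to("third-party"),
    ]
    perform(service, document)
    return gzip.decompress(service["outbox"]["third-party"]).decode("utf-8")


if __name__ == "__main__":
    print(demo())
```

Because all three steps run at the service, the client sends one message and the intermediate result never crosses the network, which matches the optimisation the slide describes.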
Release 4, April 2004 • Provides Data Access components, an extensible framework for building applications and some integration components • Built on top of Globus Toolkit 3.2 • Supports relational, XML and some file formats • MySQL, Oracle, DB2, SQL Server, Postgres, Xindice, CSV • Supports various delivery options • SOAP, FTP, GridFTP, HTTP, files, email, inter-service • Supports various transforms • XSLT, ZIP, GZip • Supports message level security using X.509 certificates • Client Toolkit library for application developers • GUI data browser (contributed by FirstDIG project) • Separate Distributed Query Processing components • Comprehensive documentation and tutorials in XHTML format
Downloads by Release 2746 downloads (~4.7 downloads a day)
Downloads by country 792 registered users @ 23/8/04
Release 5, October 2004 • Re-engineered interface-independent core OGSA-DAI functionality. • Improved dependability and security integration. • New file data resources representing flat files queried using full text searches (e.g. EMBL format). • Installation and Configuration Wizard, including “all-in-one installer” • Improved Data Browser which allows XPath querying. • Set of standard benchmarks. • JSP Quick View interface. • Support for other databases (e.g. Access, Exist, HSQL).
Release 6, April 2006 • Data Integration applications supporting identified scenarios • OGSA-DQP as an integrated part of release • Fully compliant JDBC Driver for OGSA-DAI • Support for WS-Security implementations • Support for stored procedures on all supported databases • Improved support for different database specific SQL types • SQL translation between vendor dialects for subset of queries • Support for XQuery data resources • We expect to comply with a version of the emerging DAIS specification at this release.
Who is Using OGSA-DAI? • N2Grid (http://www.cs.univie.ac.at/institute/index.html?project-80=80) • Bridges (http://www.brc.dcs.gla.ac.uk/projects/bridges/) • BioSimGrid (http://www.biosimgrid.org/) • INWA (http://www.epcc.ed.ac.uk/projects/inwa/) • BioGrid (http://www.biogrid.jp/) • AstroGrid (http://www.astrogrid.org/) • eDiaMoND (http://www.ediamond.ox.ac.uk/) • OGSA-DAI (http://www.ogsadai.org.uk) • GEON (http://www.geongrid.org/) • myGrid (http://www.mygrid.org.uk/) • MCS (http://www.isi.edu/~deelman/MCS/) • ODD-Genes (http://www.epcc.ed.ac.uk/oddgenes/) • OGSA-WebDB (http://www.gtrc.aist.go.jp/dbgrid/) • GridMiner (http://www.gridminer.org/) • FirstDig (http://www.epcc.ed.ac.uk/~firstdig/) • GeneGrid (http://www.qub.ac.uk/escience/projects.php#genegrid) • IU RGRBench (http://www.cs.indiana.edu/~plale/projects/RGR/OGSA-DAI.html)
Edikt • The team: 8 professional software engineers, support staff, project manager, commercialisation manager, architect, and SAB • SHEFC funded research and development grant • 3 years funding: May 2002 – 2005 • +3 years funding upon successful project and review • (Diagram: the Edikt project in the centre, surrounded by Standards, E-Science Apps, CS Research, Grid Services for e-Science Data Management, and Commercial SW components and skills; activities: Requirements analysis, Technology matchmaking, Gap filling, Rigorous engineering)
ELDAS – Data Access Service • Implemented using Enterprise Java Beans • Data Access Components (DACs) interface to distinct DBMSs • Accessible as a grid data service or a web data service • Another (partial) implementation of the GGF WS-DAI specifications • (Diagram: Web User1 connects through a Web Servlet, and Grid User1 and Grid User2 through a Grid Proxy, to the ELDAS EJB DAS in a Java framework; DACs connect it to Xindice, MySQL, DB2 and Oracle 9i databases)
BinX – accessing legacy binary data • The Problem: • Many binary data files • Applications must “know” the data format • Binary data formats are machine-specific • The Solution: • Write a “stand-aside” format description in XML • Provide a library to • Interpret the description • Provide file access across different machines • Build higher-level services • (Diagram: a BinX file describes binary file structure; simulations produce binary data files, and the e-Science application reads them through the BinX Library)
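The "stand-aside description" idea can be sketched as follows: an XML document states the byte order and field types, and a small reader uses it to decode a binary record. The XML vocabulary below is invented for illustration and is not the real BinX schema; the function name and field names are likewise hypothetical.

```python
import struct
import xml.etree.ElementTree as ET

# ASSUMPTION: a made-up description format, not the actual BinX language.
DESCRIPTION = """
<layout byteOrder="bigEndian">
  <field name="count" type="int32"/>
  <field name="temperature" type="float64"/>
</layout>
"""

# Map the description's vocabulary onto Python struct format codes.
TYPE_CODES = {"int32": "i", "float64": "d"}
ORDER_CODES = {"bigEndian": ">", "littleEndian": "<"}


def read_record(description_xml, payload):
    """Decode one binary record according to the XML description."""
    layout = ET.fromstring(description_xml)
    fields = layout.findall("field")
    fmt = ORDER_CODES[layout.get("byteOrder")] + "".join(
        TYPE_CODES[field.get("type")] for field in fields
    )
    values = struct.unpack(fmt, payload)
    return {field.get("name"): value for field, value in zip(fields, values)}


# A record written on a big-endian machine decodes the same way anywhere,
# because the byte order comes from the description, not from the host.
payload = struct.pack(">id", 3, 21.5)
record = read_record(DESCRIPTION, payload)
print(record)
```

The application code contains no knowledge of the file layout; changing the description is enough to read a differently structured or differently ordered file.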
Mammography • A prototype of a national database of mammographic images in support of the UK breast screening programme • Mammograms have different appearances, depending on image settings and acquisition systems • Standard Mammo Format • Temporal mammography • Computer Aided Detection • 3D View
(Architecture diagram: four sites, CHU, KCL, UED and UCL. Layers shown per site: Data Load and Training App; Core & Training API; Core Services; Content Manager; DB2; OGSA-DAI. Also shown: Training Application, Core API, Training API, Training Services, DB2 Federation, Files, Database.)
The BRIDGES Project • Biomedical Research Informatics Delivered by Grid Enabled Services • NeSC (Edinburgh and Glasgow) and IBM • www.brc.dcs.gla.ac.uk/projects/bridges • Supporting project for CFG project • Generating data on hypertension • Rat, Mouse, Human genome databases • Variety of tools used • BLAST, BLAT, Gene Prediction, visualisation, … • Variety of data sources and formats • Microarray data, genome DBs, project partner research data, medical records, … • Aim is integrated infrastructure supporting • Data federation • Security
(Diagram labels: Information Integrator; OGSA-DAI; SyntenyGrid Service; blast +; BRIDGES; VO Authorisation)
INWA Project • Innovation Node Western Australia • Informing Business & Regional Policy: Grid-enabled fusion of global data and local knowledge • Involved 10 partners (6 UK + 4 Australia) • Aim • Data mine commercially sensitive data • Security an absolute MUST • Employ Grid technologies • Need access to data and computational resources • OGSA-DAI • Access data resources • SunDCG's TOG (Transfer-queue Over Globus) • Handle job submission to analyse micro array data
INWA • (Deployment diagram: two sites, EPCC, UK and Curtin, Australia, each with TOG, Grid Engine and OGSA-DAI services; users user@edinburgh and user@australia, each with a Data Browser; data sources: Bank data, Telco data, UK property, Australian property)
Further Information on OGSA-DAI • The OGSA-DAI Project Site: • http://www.ogsadai.org.uk • The DAIS-WG site: • http://cs.man.ac.uk/grid-db • OGSA-DAI Users Mailing list • users@ogsadai.org.uk • General discussion on grid DAI matters • Formal support for OGSA-DAI releases • http://www.ogsadai.org.uk/support • support@ogsadai.org.uk • OGSA-DAI training courses