Grid-Enabling Data: Sticking Plaster, Sellotape, & Chewing Gum?
Colin C. Venters
c.venters@ncess.ac.uk
National Centre for e-Social Science
University of Manchester

Terms of Reference
• Data: numbers, characters, or images that can be processed and transmitted by [humans] and [machines].
  • Unstructured.
  • Semi-structured.
  • Structured.
• Database Management System (DBMS): a suite of programs that manages the storage and retrieval of large, structured sets of persistent data.
• Database: one or more large, structured sets of persistent data; one component of a database management system.
• Federated databases: data integration using middleware.

What’s in a Grid?
• Computational Grids - high performance computing resources.
• Data Grids - access to heterogeneous datasets.
• Access Grid - advanced video conferencing-based collaborative environment.
• The Grid makes it possible to share heterogeneous, distributed resources over a network.

The Grid Metaphor
[Figure: diagram labelled "Grid Middleware" with the resources Mobile Access; Supercomputer, PC-Cluster; Workstation; DBMS, Sensors, Experiments; Visualization; Networks.]

Data Integration
• Unimpeded use of distributed, heterogeneous, autonomous data resources.
• Integrated view of the data resources that allows users to interact with them as if they constituted a single, global, integrated data resource.
• Data integration fosters collaboration - one of the fundamental goals of e-research.
• Limited DBMS support for Grid integration.

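The idea of an integrated view can be illustrated with a minimal sketch. It assumes two autonomous sources, modelled here as in-memory SQLite databases, whose table and column names are hypothetical and chosen only for illustration. A small mediator maps each local schema onto one global schema so that a user can query the sources as if they were a single resource. It is not taken from any particular Grid middleware.

```python
import sqlite3

# Two autonomous sources with different (hypothetical) local schemas.
survey = sqlite3.connect(":memory:")
survey.execute("CREATE TABLE respondents (area_code TEXT, income INTEGER)")
survey.executemany("INSERT INTO respondents VALUES (?, ?)",
                   [("W01", 21000), ("W02", 34000)])

census = sqlite3.connect(":memory:")
census.execute("CREATE TABLE households (ward TEXT, earnings_gbp INTEGER)")
census.executemany("INSERT INTO households VALUES (?, ?)",
                   [("W02", 28000), ("W03", 19000)])

# Mediator: each entry maps a local schema onto the global schema (area, income).
SOURCES = [
    (survey, "SELECT area_code, income FROM respondents"),
    (census, "SELECT ward, earnings_gbp FROM households"),
]

def integrated_view():
    """Return rows from every source as if they came from one table."""
    rows = []
    for conn, query in SOURCES:
        rows.extend(conn.execute(query).fetchall())
    return sorted(rows)

def mean_income_by_area():
    """Aggregate over the integrated view, ignoring which source a row came from."""
    by_area = {}
    for area, income in integrated_view():
        by_area.setdefault(area, []).append(income)
    return {area: sum(values) / len(values) for area, values in by_area.items()}
```

Calling `mean_income_by_area()` combines rows for ward `W02` from both sources without the caller knowing that they live in different databases.
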
Grid-Enabling: Grid Middleware
• GridFTP
  • High-performance data transfer protocol.
• Storage Resource Broker (SRB)
  • Uniform interface to a virtual distributed data storage resource.
• Open Grid Services Architecture Data Access and Integration (OGSA-DAI)
  • Grid Data Service (GDS).
    • Standard interface for database access.
  • Grid Data Service Factory (GDSF).
    • Establishes a database service instance.
  • Database Access and Integration Service Group Registry (DAISGR).
    • Identifies available database services.
• OGSA-DQP
  • Distributed Query Processing, i.e. searching across multiple databases.

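The registry, factory, and service roles listed above can be sketched as a plain design pattern. The class and method names below are hypothetical and are not the OGSA-DAI or OGSA-DQP API; in-memory SQLite databases stand in for real data resources. The `distributed_query` function is a deliberately naive stand-in for distributed query processing: it runs the same query on every registered database and merges the rows, with no query planning.

```python
import sqlite3

class DataService:
    """Plays the role of a Grid Data Service: one uniform query interface."""
    def __init__(self, conn):
        self._conn = conn

    def perform(self, sql, params=()):
        return self._conn.execute(sql, params).fetchall()

class DataServiceFactory:
    """Plays the role of a GDSF: creates a service instance for one database."""
    def __init__(self, setup_statements):
        self._setup = setup_statements

    def create_service(self):
        conn = sqlite3.connect(":memory:")
        for statement in self._setup:
            conn.execute(statement)
        return DataService(conn)

class Registry:
    """Plays the role of a DAISGR: lets clients discover available factories."""
    def __init__(self):
        self._factories = {}

    def register(self, name, factory):
        self._factories[name] = factory

    def list_services(self):
        return sorted(self._factories)

    def lookup(self, name):
        return self._factories[name]

def distributed_query(registry, sql):
    """Run one query against every registered database and merge the rows."""
    results = []
    for name in registry.list_services():
        service = registry.lookup(name).create_service()
        results.extend((name, *row) for row in service.perform(sql))
    return results

# Usage with two hypothetical databases sharing a table layout.
registry = Registry()
registry.register("census_1991", DataServiceFactory([
    "CREATE TABLE t (ward TEXT, pop INTEGER)",
    "INSERT INTO t VALUES ('W01', 1200)",
]))
registry.register("survey", DataServiceFactory([
    "CREATE TABLE t (ward TEXT, pop INTEGER)",
    "INSERT INTO t VALUES ('W02', 300)",
]))
merged = distributed_query(registry, "SELECT ward, pop FROM t")
```

The client only ever talks to the registry and to the uniform `perform` interface, which is the point of the pattern: the location and type of each underlying database stay hidden.
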
ConvertGrid
• ESRC pilot demonstrator project (PDP) in the e-Social Science Programme.
• Research problem: investigating complex research questions that require the combination of datasets from multiple sources.
• Data management:
  • Access to multiple datasets.
• Data fusion:
  • Multiple geo-referenced datasets, i.e. different target geographies, e.g. 1991 Wards, 1991 Postcode Sectors.
  • Converts data sources with different native geographies to a common Target Geography.
  • CSV or XML format.
  • Results returned as a string or streams (FTP/HTTP/GridFTP).

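Converting values from a native geography to a common target geography can be sketched as a weighted reallocation. The slides do not state which method ConvertGrid uses, so the proportional weighting below is an assumption, and the ward names, postcode sectors, and weights are invented for illustration. Each source ward's value is split across target sectors according to a lookup table of shares that sum to 1 per ward, and the result is written out as CSV.

```python
import csv
import io

# Hypothetical lookup table: share of each source ward that falls in each
# target postcode sector. The shares for one ward sum to 1.
WARD_TO_SECTOR = {
    "WardA": {"M13 9": 0.75, "M14 4": 0.25},
    "WardB": {"M14 4": 1.0},
}

def convert_geography(source_counts, weights):
    """Reallocate counts from source areas to target areas using the weights."""
    target = {}
    for ward, count in source_counts.items():
        for sector, share in weights[ward].items():
            target[sector] = target.get(sector, 0.0) + count * share
    return target

def to_csv(target_counts):
    """Serialise converted values as CSV text, one row per target area."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator="\n")
    writer.writerow(["postcode_sector", "value"])
    for sector in sorted(target_counts):
        writer.writerow([sector, target_counts[sector]])
    return buffer.getvalue()

# Usage: two wards converted to postcode sectors.
ward_counts = {"WardA": 400, "WardB": 100}
sector_counts = convert_geography(ward_counts, WARD_TO_SECTOR)
csv_text = to_csv(sector_counts)
```

Because the shares for each ward sum to 1, the total across target sectors equals the total across source wards, which gives a simple sanity check on any conversion table.
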
Challenges
• Scalability:
  • Performance and capacity requirements.
• Security:
  • Use of Grid Security Infrastructure (GSI) at the Grid service client level is a non-trivial problem.
• Heterogeneity:
  • Infrastructural.
  • Syntactic.
  • Semantic.
• Metadata:
  • Adds context to data, aiding identification, location, and interpretation.

Further Reading
• Watson, P. (2003). Databases and the Grid. In: Grid Computing: Making The Global Infrastructure a Reality, F. Berman, G. Fox, and A. J. G. Hey (eds.), Wiley, pp. 363-384.
• Cole, K. et al. (2003). Grid Enabling Quantitative Social Science Datasets: A Scoping Study. ESRC.
• Atkinson, M. et al. (2004). Data Access, Integration, and Management. In: The Grid 2: Blueprint for a New Computing Infrastructure, I. Foster and C. Kesselman (eds.), Elsevier, pp. 391-429.

Acknowledgements
• ConvertGrid Team, University of Manchester
  • Keith Cole, Jon McLaren, Pascal Ekin, Linda Mason, Stephen Pickles, and Justin Hayes.
• Paul Watson, University of Newcastle
• Alvaro Fernandes, University of Manchester
• Mike Mineter, National e-Science Centre, University of Edinburgh