Efficient access and query, data integration
Group 4
Group coordinators: Alok Choudhary, Rob Ross
Parallel and Random I/O
• I/O stacks
  • High-level I/O libraries (PnetCDF, HDF5, SILO)
  • I/O middleware (MPI-IO)
  • Parallel file systems (Lustre, GPFS, PVFS)
  • Other shared file systems (CXFS, GFS, Panasas, QFS)
• Solutions may exist
  • Performance/scalability are “ok”
  • Will these scale to next-generation systems (e.g. BG/L, Red Storm)?
• Random I/O
  • Query metadata for optimizing seemingly random accesses
• Research and development
  • Scale! Not just an engineering problem.
  • DB-like query operations (more later)
  • Recognizing and/or passing on access pattern information, then acting on it
  • Related to metadata issues
  • Execution of application code at the I/O server (active disk)
  • (User) metadata as file system constructs
• Hardening and packaging
  • Large FC configurations
  • Fault tolerance
  • System support
• Deployment and maintenance
  • Low-BW, serial applications in good shape
  • High-BW, embarrassingly parallel, task farming
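One way to picture "recognizing access pattern information, then acting on it" is request coalescing: if the I/O layer sees that many small reads sit close together, it can issue fewer, larger reads instead. The sketch below is illustrative only and is not taken from any of the libraries named on the slide; the function name `coalesce`, the `max_gap` parameter, and the strided example pattern are all assumptions made for the illustration.

```python
from typing import List, Tuple

Request = Tuple[int, int]  # (offset, length) in bytes


def coalesce(requests: List[Request], max_gap: int = 0) -> List[Request]:
    """Merge byte-range requests separated by at most max_gap bytes."""
    merged: List[Request] = []
    for offset, length in sorted(requests):
        if merged:
            prev_off, prev_len = merged[-1]
            prev_end = prev_off + prev_len
            # Overlapping or nearly adjacent: extend the previous request.
            if offset - prev_end <= max_gap:
                new_end = max(prev_end, offset + length)
                merged[-1] = (prev_off, new_end - prev_off)
                continue
        merged.append((offset, length))
    return merged


# Hypothetical strided pattern: four 100-byte reads, 150 bytes apart.
strided = [(i * 150, 100) for i in range(4)]
print(coalesce(strided, max_gap=0))   # gaps of 50 bytes are too large: 4 reads remain
print(coalesce(strided, max_gap=64))  # [(0, 550)]: one read that also fetches the gaps
```

The trade-off is that a larger `max_gap` reads some bytes nobody asked for in exchange for fewer requests; where that pays off depends on the file system and the workload.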
Parallel and Random I/O
• Gaps with Priority
  • Scaling of the parallel I/O stack
    • Both scaling of the number of clients, and
    • Scaling of the size of the file system (number of files/objects)
  • APIs for passing more information to the system
    • Already there to some extent in MPI-IO and some PFSs, but not adequate; support is also needed at the high-level I/O library
  • Management of large-scale storage
    • Fault tolerance
    • Autonomic (self-managing, etc.) storage
  • Connecting PFSs to hierarchical storage systems efficiently
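To make "APIs for passing more information to the system" concrete, here is a single-node analogy rather than an MPI-IO example: on POSIX systems, `posix_fadvise` lets an application tell the kernel whether it expects sequential or random access to a file. MPI-IO carries comparable information as hints in an info object passed at file open. The helper name `open_with_hint` and the pattern labels are assumptions made for this sketch, the hint is advisory only, and the code assumes a Unix-like platform (`os.pread`).

```python
import os
import tempfile


def open_with_hint(path: str, pattern: str) -> int:
    """Open a file read-only and pass an access-pattern hint to the OS.

    `pattern` is a label chosen for this sketch: "sequential" or "random".
    """
    advice_names = {
        "sequential": "POSIX_FADV_SEQUENTIAL",
        "random": "POSIX_FADV_RANDOM",
    }
    advice_name = advice_names[pattern]  # fail early on an unknown label
    fd = os.open(path, os.O_RDONLY)
    # posix_fadvise is not available on every platform; skip the hint if absent.
    if hasattr(os, "posix_fadvise"):
        # Offset 0 and length 0 apply the advice to the whole file.
        os.posix_fadvise(fd, 0, 0, getattr(os, advice_name))
    return fd


# Create a small scratch file so the example is self-contained.
with tempfile.NamedTemporaryFile(delete=False) as tmp:
    tmp.write(b"x" * 4096)
    path = tmp.name

fd = open_with_hint(path, "random")
try:
    data = os.pread(fd, 16, 1024)  # read 16 bytes at offset 1024
finally:
    os.close(fd)
    os.unlink(path)
```

The gap named on the slide is that this kind of information is only partly expressible today and is mostly lost before it reaches the high-level I/O library or the file system.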
Large-Scale, Feature-Based Queries
• Lots of dimensions
  • Existing indexing techniques aren’t particularly good for this
  • Not worth building an index at all in some instances
• Research and development
  • Parallel update problem with existing representations
  • When to linear scan, streaming
  • Hardware-assisted searching (e.g. Netezza, NexQL, Seisint)
• Hardening and packaging
  • Bitmapped indexing, in some use
• Deployment and maintenance
  • Relational DBs
  • Object DBs
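For readers unfamiliar with the bitmapped indexing mentioned above, the following is a minimal, uncompressed sketch of the idea: one bitmap per distinct (or binned) value of an attribute, with multi-attribute queries answered by bitwise AND/OR. It uses Python integers as bitsets; the class name `BitmapIndex`, its methods, and the sample records are assumptions made for illustration, and production bitmap indexes typically add compression and binning that are omitted here.

```python
from collections import defaultdict
from typing import Dict, Hashable, Iterable, List


class BitmapIndex:
    """One bitmap per distinct value of an attribute; bit i is set if row i has it."""

    def __init__(self) -> None:
        self.bitmaps: Dict[Hashable, int] = defaultdict(int)
        self.n_rows = 0

    def append(self, value: Hashable) -> None:
        """Add one row; only the bitmap for `value` changes."""
        self.bitmaps[value] |= 1 << self.n_rows
        self.n_rows += 1

    def equals(self, value: Hashable) -> int:
        return self.bitmaps.get(value, 0)

    def any_of(self, values: Iterable[Hashable]) -> int:
        result = 0
        for v in values:
            result |= self.equals(v)
        return result


def rows(bitmap: int) -> List[int]:
    """Decode a bitmap into a sorted list of row numbers."""
    out, i = [], 0
    while bitmap:
        if bitmap & 1:
            out.append(i)
        bitmap >>= 1
        i += 1
    return out


# Hypothetical records: (binned temperature, phase).
records = [("hot", "gas"), ("cold", "liquid"), ("hot", "liquid"),
           ("warm", "gas"), ("hot", "gas")]
temp_idx, phase_idx = BitmapIndex(), BitmapIndex()
for temp, phase in records:
    temp_idx.append(temp)
    phase_idx.append(phase)

# temp == "hot" AND phase == "gas"
print(rows(temp_idx.equals("hot") & phase_idx.equals("gas")))            # [0, 4]
# temp in {"hot", "warm"} AND phase == "gas"
print(rows(temp_idx.any_of(["hot", "warm"]) & phase_idx.equals("gas")))  # [0, 3, 4]
```

Each attribute gets its own index, so adding a dimension adds bitmaps rather than reshaping one multi-dimensional structure, which is part of why the approach is attractive when there are "lots of dimensions".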
Large-Scale, Feature-Based Queries
• Gaps with Priorities
  • Scalability of techniques, such as indexing, as a solution to this problem
  • Support for runtime feature extraction
  • Concurrent update (addition) to indices
    • Only for some groups
Query Processing over Files
• DB-like operations on files
  • Structured data files such as HDF5, PnetCDF, SILO
  • Alternative APIs, file format independent
    • Java database objects, ODMG
• Research and development
  • What should the API look like?
  • Protocols for accessing databases in distributed environments with arbitrary backends (e.g., GGF DAIS group)
• Hardening and packaging
  • Ad-hoc Query package (LLNL work)
    • Range queries over SILO mesh data
  • ROOT (HEP community)
    • Operates on files in its internal file format
• Deployment and maintenance
  • Nothing
Query Processing over Files
• Gaps with Priorities
  • Determining the API for this query processing
    • What capabilities are needed from this API?
  • Implementing this API for common file formats
  • Appropriate underlying optimizations may impact all of the I/O stack (e.g. query optimizations, cache management, etc.)
  • Extensible, parallel runtime for aiding in the use of this API, constructing queries, etc.
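The slides leave "What should the API look like?" as an open question, so the following is only one possible shape, sketched to make the idea of a file-format-independent query layer concrete. Everything here is hypothetical: the `Dataset` interface, the `CsvDataset` stand-in backend (used in place of a real HDF5, PnetCDF, or SILO reader), the `range_query` function, and the sample data.

```python
import csv
import io
from typing import Dict, List, Protocol, Sequence, Tuple


class Dataset(Protocol):
    """Hypothetical minimal reader interface that each file format would implement."""

    def variables(self) -> List[str]: ...
    def read(self, name: str) -> Sequence[float]: ...


class CsvDataset:
    """Stand-in backend: one column per variable, parsed from CSV text."""

    def __init__(self, text: str) -> None:
        reader = csv.DictReader(io.StringIO(text))
        self._cols: Dict[str, List[float]] = {n: [] for n in reader.fieldnames or []}
        for row in reader:
            for name, value in row.items():
                self._cols[name].append(float(value))

    def variables(self) -> List[str]:
        return list(self._cols)

    def read(self, name: str) -> Sequence[float]:
        return self._cols[name]


def range_query(ds: Dataset, bounds: Dict[str, Tuple[float, float]]) -> List[int]:
    """Return indices of records where every named variable lies in [lo, hi]."""
    columns = {name: ds.read(name) for name in bounds}
    n = min((len(c) for c in columns.values()), default=0)
    # Plain linear scan; an index or block-level metadata could replace this step.
    return [
        i for i in range(n)
        if all(lo <= columns[name][i] <= hi for name, (lo, hi) in bounds.items())
    ]


text = "pressure,temperature\n1.0,300\n2.5,310\n4.0,295\n3.0,320\n"
ds = CsvDataset(text)
hits = range_query(ds, {"pressure": (2.0, 4.0), "temperature": (300, 330)})
print(hits)  # [1, 3]
```

Because `range_query` depends only on the `Dataset` interface, supporting another format would mean writing another reader class, which mirrors the "implementing this API for common file formats" gap listed above.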
Data Integration
• Digital libraries, federations and warehousing
• Research and development
  • Tools for aiding in creation of warehouses, ontology creation
  • Fine-grained access control
  • Security in federated/distributed environments (pharma, etc.)
    • Applies even to the queries, not just the data itself
• Hardening and packaging
  • Digital libraries (SRB)
  • Many one-off instances of domain-specific integrations
• Deployment and maintenance
  • DiscoveryLink (IBM), other commercial packages: frameworks for doing data integration with their DB offerings
  • Linking similar (R)DBs together isn’t too difficult
Data Integration
• Gaps with Priorities
  • Converging on a language for describing metadata for communities
  • Tools to support wrapping and integrating complex data
    • From arbitrary sources (free text, mesh data, etc.), including files
    • For this domain (community exists looking at bio domain)
  • Provenance
  • Security
    • Cross-domain access and authentication
    • Encryption of both queries and data
    • Authentication of data sources