This report outlines the workshop's key points on dynamic data storage, robust data movers, and dataflow automation in large-scale environments. It covers placement strategies, quality of service, access controls, and future data management challenges including costs, gaps, and priorities.
Storage, Data Movement, Grid, Network subgroup, DOE Data Management Workshop, Day 1, 5-22-04
Plan • Outline of the report • Additions, refinements, and corrections • Costs and gaps • Classification into development, hardening, deployment
Storage, Data Movement, Grid, Network (initial) • Dynamic Data Storage and Caching • Robust Terabyte-scale data movers • Dataflow automation between components • Multi-resolution Data movement
The whole system environment in the large (even across a WAN) • Placement depending on • Separate (more abstract, more high-level) • Mechanism (apropos to layering) • Policy (apropos to layering) • Includes “Storage Management” • Placement QoS and QoS derived from policy • Management of replicas • Access • Robust, performant, at large volumes of data (x500-x1000 in 5-10 yrs) • N.B.: growing faster than the evolution of disk speed • Dynamic Data Storage and Caching • Includes pre-staging • Support for sending the function and/or query to the data • Access QoS • Security: authorization, authentication, and access control • Dataflow automation between components • E.g., apropos to workflows and systematic integration
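To put the x500-x1000 growth figure in perspective, the implied compound annual growth factor can be worked out directly. This is only a sketch of the arithmetic; the function name is illustrative and the inputs are simply the two ends of the range quoted on the slide.

```python
def annual_growth_factor(total_factor: float, years: float) -> float:
    """Return the compound annual factor g such that g ** years == total_factor."""
    if total_factor <= 0 or years <= 0:
        raise ValueError("total_factor and years must be positive")
    return total_factor ** (1.0 / years)


# The slide's range: x500 to x1000 over 5 to 10 years.
slowest = annual_growth_factor(500, 10)   # smallest factor over the longest horizon
fastest = annual_growth_factor(1000, 5)   # largest factor over the shortest horizon
print(f"implied annual growth: x{slowest:.2f} to x{fastest:.2f} per year")
```

Under these assumptions the data volume grows by roughly x1.9 to x4.0 every year, which is the rate that the N.B. on the slide compares against the evolution of disk speed.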
Specialized specific needs • Multi-resolution Data movement • Fine-grained object access and latencies
Gaps, costs, priorities • Cost • $: >100,000 • $$: >1,000,000 • $$$: >10,000,000 • Priority – low, med, high • High – barrier to science • Med – substantial cost or waste • Low – annoying • Type of work • RD, HP, DS
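The rating scheme above can be captured in a small data structure so that gaps are sortable. This is a minimal sketch, not something the workshop produced: the class and function names are hypothetical, the work-type codes are kept exactly as they appear on the slides, and the three sample entries use ratings taken from the later "Gaps, Costs, and Priorities" slides.

```python
from dataclasses import dataclass

# Lower bound in dollars for each cost marker, as given on the slide.
COST_FLOOR = {"$": 100_000, "$$": 1_000_000, "$$$": 10_000_000}
# Sort order for priorities: H (barrier to science) first, L (annoying) last.
PRIORITY_ORDER = {"H": 0, "M": 1, "L": 2}


@dataclass(frozen=True)
class Gap:
    name: str
    cost: str        # one of "$", "$$", "$$$"
    priority: str    # one of "H", "M", "L"
    work: tuple      # work-type codes as written on the slides, e.g. ("RD", "DS")

    def __post_init__(self):
        if self.cost not in COST_FLOOR:
            raise ValueError(f"unknown cost marker: {self.cost!r}")
        if self.priority not in PRIORITY_ORDER:
            raise ValueError(f"unknown priority: {self.priority!r}")


def rank(gaps):
    """Order gaps by priority (H first), then by cost floor (largest first)."""
    return sorted(gaps, key=lambda g: (PRIORITY_ORDER[g.priority], -COST_FLOOR[g.cost]))


gaps = [
    Gap("Security and access control", "$", "M", ("RD", "DS")),
    Gap("Management of replicas", "$$", "H", ("RD", "HP", "DS")),
    Gap("Access (movement)", "$$$", "H", ("RD", "HP", "DS")),
]
ranked = rank(gaps)
for g in ranked:
    print(f"{g.priority} {g.cost:<3} {g.name}")
```

Sorting by cost within a priority is an assumption made for illustration; the slides do not say how ties between equally rated gaps should be broken.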
Gaps, Costs, Priorities • Placement depending on… ($$$; H; RD, HP, DS) • Storage Management (storage space availability, quality, etc.) • Permanence at the archival scale • Investigation of how to do this apropos to scientific storage systems • Analogs to industry – information life cycle management • Appropriate mix of exposed interfaces and hints, with a preference for standard interfaces (as opposed to parochial, per-system interfaces) • Automatic and manual configurations need to be investigated • Including hints about future accesses
Physical Considerations • How to deal with the increase of capacity per device ($$; H; RD?, DS?) • No aspects of performance expand with Moore’s law • Possibly mitigated by placement strategy: • Mixed “temperature” on the same spindle
Gaps, Costs, and Priorities • Management of replicas ($$; H; RD, HP, DS) • Movement of files • Movement of namespaces • Less-than-whole-file level replication • Consistency of replicated files • Write-once (immutable file) case is an important use case • Investigation of utility of mutable files • Trade-off of version management vs. mutable files
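For the write-once (immutable file) case, consistency of replicas reduces to checking that every copy still matches a digest recorded when the file was created. The following is a minimal sketch of that idea, assuming replicas are reachable as local paths and using SHA-256 as an illustrative digest; both function names are hypothetical.

```python
import hashlib


def file_digest(path, chunk_size=1 << 20):
    """Return the SHA-256 hex digest of a file, read in chunks to bound memory use."""
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            h.update(chunk)
    return h.hexdigest()


def inconsistent_replicas(reference_digest, replica_paths):
    """Return the replica paths whose content no longer matches the reference digest."""
    return [p for p in replica_paths if file_digest(p) != reference_digest]
```

Mutable files do not admit this shortcut, since a differing digest may be a legitimate newer version rather than corruption; that is the version management vs. mutable files trade-off listed above.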
Gaps, Costs, and Priorities • Access (movement) ($$$; H; RD, HP, DS) • Access requirements are increasing faster than the evolution of disk speed • Exploitation of IP and non-IP based networking • Access contention on physically large volumes • Latency vs. small-grained access • Investigate supporting sending the function and/or query to the data • Investigate supporting virtual data techniques • Investigation of choice of copies and choice of path • Investigation of where to put compression in system architectures
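The "choice of copies and choice of path" item and the "latency vs. small-grained access" item interact: the best source for a small object is not necessarily the best source for a large one. A minimal sketch under a deliberately simple cost model (fixed latency plus size divided by bandwidth); the class, the function names, and the site figures are all hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Source:
    site: str
    latency_s: float               # fixed per-request cost in seconds
    bandwidth_bytes_per_s: float   # sustained transfer rate along this path


def estimated_time(src: Source, size_bytes: int) -> float:
    """Simple model: fixed latency plus transfer time at the sustained rate."""
    return src.latency_s + size_bytes / src.bandwidth_bytes_per_s


def choose_source(sources, size_bytes):
    """Pick the copy/path with the lowest estimated time for this request size."""
    return min(sources, key=lambda s: estimated_time(s, size_bytes))


sources = [
    Source("nearby-cache", latency_s=0.01, bandwidth_bytes_per_s=10e6),
    Source("remote-archive", latency_s=0.2, bandwidth_bytes_per_s=100e6),
]
small = choose_source(sources, 1_000)           # latency dominates
large = choose_source(sources, 1_000_000_000)   # bandwidth dominates
print(small.site, large.site)
```

With these made-up numbers the nearby copy wins for the 1 KB request and the remote, higher-bandwidth copy wins for the 1 GB request. A real selector would also have to account for contention and for staging delays at the source, which this model ignores.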
Gaps, costs… • Security: authorization, authentication, and access control ($; M; RD, DS) • Investigate expression of access control • And how it moves with the data • Dataflow automation between components ($$; H; RD, hs, DS) • API for wide-area distributed computing, exposing, as appropriate, many of the items mentioned • Scheduling, access optimization analogous to query optimization
Gaps, Costs, Priorities • Multi-resolution Data movement • Restricted to framework and not solving specific problems ($, ?, ?)? • Important use case for Office of Science • Investigate whether this is a special case of moving functions to the data (appropriate framework)