The Sequential Access Model for Run II Data Management and Delivery

Lee Lueking, Frank Nagy, Heidi Schellman, Igor Terekhov, Julie Trumbo, Matt Vranicar, Rich Wellner, Vicky White. URL: www-d0.fnal.gov/~lueking/sam/sequential.html. CHEP98, Sept. 3, 1998

What is the Sequential Access Model (SAM)? • Sequential events: Data is stored in files as sequential events. • Data Tiers: Each event is stored in each of several data tiers. • The Event Data Unit (EDU) is the unit of data stored in each tier. • Physical event size: EDU5 = 5 kB/event, EDU50 = 50 kB/event, et cetera. • Physical streaming (clustering): Data categories based on trigger or reconstruction information. • Database catalog: File, Event and Processing Database; information about the data at event level, file level and run level, plus processing information, both static and dynamic.
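As a rough illustration of what the EDU sizes imply for storage, the arithmetic can be sketched as below. Only the 5 kB and 50 kB per-event sizes come from the slide; the event count of one billion and the helper name `tier_size_tb` are hypothetical, chosen only for the example.

```python
# Per-event sizes from the slide (EDU5 and EDU50), in kilobytes.
EDU_SIZES_KB = {"EDU5": 5, "EDU50": 50}


def tier_size_tb(n_events, kb_per_event):
    """Total size of one data tier in terabytes (taking 1 TB = 1e9 kB)."""
    return n_events * kb_per_event / 1e9


# Hypothetical sample size, assumed for illustration only.
N_EVENTS = 1_000_000_000

sizes = {name: tier_size_tb(N_EVENTS, kb) for name, kb in EDU_SIZES_KB.items()}
for name, tb in sizes.items():
    print(f"{name}: {tb:.0f} TB")
```

Under that assumed event count, the tier size scales linearly with the EDU size, which is why the smaller tiers are the natural candidates for disk-resident access.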

Data Organization [Figure: user and physics group (derived) data; File & Event Database; event information tiers; warm cache; physical clustering.]

How Do I Access Data? • Pipelines: Data access channels tailored for particular processing and analysis patterns. • Pipeline segments: Tapes, drives + Automated Tape Library + Storage Management System, network, group-shared and/or user-private analysis disk. • Example access modes: • Database: Access to event, trigger & other FEDB info. • Thumbnail: Disk-resident sketch of each event. • Freight Train: Large data stream file server. • Event Picking: Random event selection from any data tier. • Small Data-set: One or a few files from any data tier.

Data Access [Figure: pipelines (File & Event DB, Thumbnail, Freight Train, Pick Event, User File) connecting mass storage to consumers. Legend: group of users, single user, data flow, file, event, disk storage, tape storage, pipeline name.]

D0 Specifications • Data sizes • Further details • 10-15 exclusive streams preferred. Based on L3 and/or Reconstruction information. • 10% warm (tape or disk) caches of Raw and Medium EDU data. • Possible on-demand reconstruction.

Will SAM Scale to Run II?

Exclusive Streaming See Talk #182: Heidi Schellman, “Assurance of Data Integrity in a Petabyte Data Sample”

Data Handling System Buffer and Cache

SAM Design Details • Network distributed. • Easily scalable. • Works for all access modes. • Uses CORBA interfaces between modules. • Modules being written in JAVA, Python and C++. • File, Event and Processing Database uses ORACLE 8. • Not tightly coupled to: • Tape Mass Storage System. • CPU availability or Batch processing facilities on Farm or Analysis machines. • The D0 event data model.

Main Components • File and Event Database: Info about data location and processing details. (See poster session #127: Vicky White, “Use of ORACLE in Run II for D0”.) • Global Optimizer: Optimizes tape access and regulates bandwidth to various stations and activities. • Station: Management for a set of processing resources, including buffer and data I/O. • Project Master: Responsible for managing projects, which are lists of files to process. • Consumer/producer: Actual data processing. • GUI and API user interfaces: Allow users to access data and administrators to control the system.

Components of SAM [Figure: Stations A through F, each serving consumers/producers, connected to the Project Master, the Global Optimizer, the DB and Information Servers, the Mass Storage System, and the User & Admin. Interface (API and GUI).]

File and Event Database [Figure: schema diagram. Events: ID, Event Number, Trigger L1, Trigger L2, Trigger L3, Off-line Filter, Thumbnail. Files: ID, Name, Format, Size, # Events. Other entities: Run, Volume, Data Tier, Physical Data Stream, Trigger Configuration, Project, Event-File Catalog, Processing Info.]
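A minimal sketch of how an event-file catalog of this kind might be queried is given below. It is not the SAM schema: the slides say the real database uses ORACLE 8, whereas this example swaps in Python's built-in `sqlite3`, and every table name, column name and sample row is a hypothetical adaptation of the labels on the slide.

```python
import sqlite3

# Hypothetical schema: names are adapted from the slide labels, not from SAM.
SCHEMA = """
CREATE TABLE files (
    id INTEGER PRIMARY KEY,
    name TEXT NOT NULL,
    data_tier TEXT,
    stream TEXT,
    n_events INTEGER
);
CREATE TABLE events (
    id INTEGER PRIMARY KEY,
    run INTEGER,
    event_number INTEGER,
    trigger_l3 TEXT
);
CREATE TABLE event_file_catalog (
    event_id INTEGER REFERENCES events(id),
    file_id INTEGER REFERENCES files(id)
);
"""


def build_catalog():
    """Create an empty in-memory catalog."""
    db = sqlite3.connect(":memory:")
    db.executescript(SCHEMA)
    return db


def files_for_event(db, run, event_number):
    """Return (file name, data tier) for every file holding the given event."""
    return db.execute(
        """SELECT f.name, f.data_tier
           FROM files f
           JOIN event_file_catalog c ON c.file_id = f.id
           JOIN events e ON e.id = c.event_id
           WHERE e.run = ? AND e.event_number = ?
           ORDER BY f.name""",
        (run, event_number),
    ).fetchall()


db = build_catalog()
# Made-up rows: one event stored in two data tiers.
db.execute("INSERT INTO files VALUES (1, 'raw_0001.dat', 'raw', 'stream_a', 1000)")
db.execute("INSERT INTO files VALUES (2, 'thumb_0001.dat', 'thumbnail', 'stream_a', 1000)")
db.execute("INSERT INTO events VALUES (1, 100, 42, 'hypothetical_l3_bit')")
db.executemany("INSERT INTO event_file_catalog VALUES (?, ?)", [(1, 1), (1, 2)])

result = files_for_event(db, 100, 42)
print(result)
```

The join table reflects the statement that each event is stored in several data tiers, so one event maps to more than one file.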

(Mass Storage System Needs) • Provide access to data through file-level semantics. • Manage all tape activity within the ATL(S) and to/from shelf. • Allow data to be physically clustered in tape groupings or “file families”. • Provide a mechanism for sending priorities with file requests, to allow control over allocation of resources for various activities. • Optimize the use of resources such as arm time and tape mounts. • Provide retry and fail-over features for failed tape read/write activities. • Use an open tape format to allow removal of tapes and exchange of data with other sites. • Offer reliable and unattended operation. See ENSTORE presentation #126: Don Petravick, “ENSTORE - An Alternative Data Storage System”
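The two requirements on priorities and tape mounts interact, and one simple way to combine them can be sketched as follows. This is an illustrative heuristic only, not the ENSTORE or SAM Global Optimizer algorithm; the `FileRequest` type, the `plan_mounts` function, the volume labels and the convention that a lower number means higher priority are all assumptions made for the example.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class FileRequest:
    file_name: str
    volume: str    # tape volume label (hypothetical)
    priority: int  # lower number = more urgent (assumed convention)


def plan_mounts(requests):
    """Group requests by tape volume so each tape is mounted once,
    then order the volumes by their most urgent request."""
    by_volume = defaultdict(list)
    for req in requests:
        by_volume[req.volume].append(req)
    ordered = sorted(
        by_volume.items(),
        key=lambda item: (min(r.priority for r in item[1]), item[0]),
    )
    return [
        (vol, [r.file_name for r in sorted(reqs, key=lambda r: (r.priority, r.file_name))])
        for vol, reqs in ordered
    ]


requests = [
    FileRequest("a.dat", "VOL1", 5),
    FileRequest("b.dat", "VOL2", 1),
    FileRequest("c.dat", "VOL1", 2),
    FileRequest("d.dat", "VOL2", 9),
]
plan = plan_mounts(requests)
for volume, names in plan:
    print(volume, names)
```

Here four requests are served with two mounts. The trade-off is that a low-priority file on an already-mounted tape (d.dat) is read ahead of a higher-priority file on another tape (c.dat), which a real scheduler may or may not accept.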

Access to Data through SAM • User or group defines a “project” by sending a list of constraints or a file list to the Database Server. • DB Server returns a summary of the project (number of files, size and availability). • User is provided a list of possible “stations” where the project might run, and chooses one. • User registers with the station for a given (new or existing) project and is given a unique “key” to use. • User’s client “consumer/producer” sends the “key” to the “project master” on the chosen station, and is given the next available file in the “project”.
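The steps above can be sketched as a small in-process simulation. This is not the SAM API: the real modules talk over CORBA, and every class, method and field name here (`DatabaseServer`, `Station`, `ProjectMaster`, `define_project`, `next_file`, the catalog entries) is hypothetical, introduced only to make the handshake concrete.

```python
import uuid


class DatabaseServer:
    """Hypothetical stand-in for the DB server: resolves constraints to files."""

    def __init__(self, catalog):
        self._catalog = catalog  # list of dicts with name, tier, size_mb

    def define_project(self, tier):
        files = [f for f in self._catalog if f["tier"] == tier]
        summary = {"n_files": len(files), "size_mb": sum(f["size_mb"] for f in files)}
        return [f["name"] for f in files], summary


class ProjectMaster:
    """Hands out the files of one project, one at a time, to registered consumers."""

    def __init__(self, files):
        self._pending = list(files)
        self._keys = set()

    def register(self):
        key = uuid.uuid4().hex  # unique key for this consumer
        self._keys.add(key)
        return key

    def next_file(self, key):
        if key not in self._keys:
            raise PermissionError("unknown consumer key")
        return self._pending.pop(0) if self._pending else None


class Station:
    def __init__(self, name):
        self.name = name
        self._projects = {}

    def register(self, project_name, files):
        # Reuse the project master if the project already exists on this station.
        if project_name not in self._projects:
            self._projects[project_name] = ProjectMaster(files)
        master = self._projects[project_name]
        return master, master.register()


catalog = [
    {"name": "f1", "tier": "thumbnail", "size_mb": 100},
    {"name": "f2", "tier": "raw", "size_mb": 900},
    {"name": "f3", "tier": "thumbnail", "size_mb": 120},
]
db = DatabaseServer(catalog)
files, summary = db.define_project(tier="thumbnail")  # define project, get summary
station = Station("station_a")                        # user picks a station
master, key = station.register("my_project", files)   # register, receive key

processed = []
while (name := master.next_file(key)) is not None:    # consumer pulls files by key
    processed.append(name)
print(summary, processed)
```

Because the project master holds the pending list, several consumers registered on the same project would each receive different files, which is one plausible reading of "the next available file".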

Consumer- Read from Storage

Producer - Write to Storage

SAM Prototype • Status: Being built, ready early October. • Goals: • Populate and exercise the SAM database. • Specify projects - data to be accessed for processing or analysis. • Attach to a “Station” which makes files for that Project accessible. • Interface to ENSTORE - get/put files - using SAM “Global Optimizer”. • Build analysis programs using the D0 framework. • Demonstrate multiple Stations, Projects and Analysis consumers. • Testing: Further testing in fall with SAM PC test-bed. • Beta version: Plan to make MC data available through SAM late '98.

SAM Prototype PC test-bed [Figure: example configuration. Enstore, Warehouse, Network HUB, SAM Station Servers, Consumers/Producers, Main Backbone, link to Database Server.]

Summary • Dzero plans to use a file-based Sequential Access Model for Run II data access. • The design is network distributed with CORBA communication between modules written in JAVA, PYTHON and C++. ORACLE 8 is used for the DB. • A SAM prototype is being built now and will be ready in early October. • Hardware to construct a SAM test-bed will be assembled this fall to more fully test and understand the system. • We plan to employ the system for MC data by the end of '98, and to perform large-scale testing with Run II hardware in the first part of next year.
