The SAMGrid Data Handling System
Wyatt Merritt (FNAL/CD/DØ) for the SAMGrid Team (Talk created by Adam Lyon)

Outline:
• What Is SAMGrid?
• Use Cases for SAMGrid in Run II Experiments
• Current Operational Load
• Stress Testing
• Future Project Plans
What is SAMGrid?
• Data handling system for Run II DØ and CDF
• SAMGrid manages file storage (replica catalogs)
  • Data files are stored in tape systems at FNAL and elsewhere (most use ENSTORE at FNAL)
  • Files are cached around the world for fast access
• SAMGrid manages file delivery
  • Users at FNAL and remote sites retrieve files from file storage; SAMGrid handles caching for efficiency
  • You don't care about file locations
• SAMGrid manages file metadata cataloging
  • The SAMGrid DB holds metadata for each file; you don't need to know the file names to get data
• SAMGrid manages analysis bookkeeping
  • SAMGrid remembers which files you ran over, which files you processed successfully, which applications you ran, and when and where you ran them
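To make the metadata-cataloging point concrete: a query over file metadata resolves to concrete file names, so the user never handles names or tape locations directly. Below is a minimal sketch with a made-up in-memory catalog; the field names and query form are illustrative only, not the real SAMGrid schema or client API (the actual Python & C++ client API is mentioned under Projects below).

```python
# Minimal sketch of metadata-driven file lookup. The catalog contents
# and field names are invented for illustration, not the real SAMGrid
# schema or API.
CATALOG = [
    {"name": "d0_raw_000001", "data_tier": "raw",       "run_number": 78904},
    {"name": "d0_tmb_000001", "data_tier": "thumbnail", "run_number": 78904},
    {"name": "d0_tmb_000002", "data_tier": "thumbnail", "run_number": 78905},
]

def files_matching(**dimensions):
    """Resolve a metadata query to concrete file names."""
    return [f["name"] for f in CATALOG
            if all(f.get(k) == v for k, v in dimensions.items())]

# The user asks for data by physics content, never by file name or location.
print(files_matching(data_tier="thumbnail", run_number=78904))
# -> ['d0_tmb_000001']
```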
SAMGrid Terms and Concepts
• A project runs on a station and requests delivery of a dataset to one or more consumers on that station.
• Station: processing power + disk cache + (connection to tape storage) + network access to the SAMGrid catalog and other station caches
  Example: a Linux analysis cluster at DØ
• Dataset: a metadata description which is resolved through a catalog query to a list of files. Datasets are named.
  Examples (syntax not exact):
  • data_type physics and run_number 78904 and data_tier raw
  • request_id 5879 and data_tier thumbnail
• Consumer: user application (one or many exe instances)
  Examples: script to copy files; reconstruction job
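The relationships among these terms can be sketched as plain data. All class and field names here are illustrative stand-ins, not actual SAMGrid objects:

```python
# Illustrative-only model of the term relationships: a project runs on a
# station and delivers a dataset's files to one or more consumers.
from dataclasses import dataclass, field
from typing import List

@dataclass
class Station:
    name: str        # e.g. a Linux analysis cluster at DØ
    cache_gb: int    # local disk cache fronting tape storage

@dataclass
class Dataset:
    name: str        # datasets are named
    query: str       # metadata query resolved to a file list

@dataclass
class Project:
    station: Station
    dataset: Dataset
    consumers: List[str] = field(default_factory=list)  # exe instances

proj = Project(
    station=Station(name="clued0", cache_gb=10_000),
    dataset=Dataset(name="raw78904",
                    query="data_type physics and run_number 78904 and data_tier raw"),
    consumers=["reco_exe.1", "reco_exe.2"],
)
```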
Physics Analysis Use Cases
• Process Simulated Data
• Process Unskimmed Collider Data
• Process Skimmed Collider Data
• Process Missed/New Data
• Analyze Sets of Individual Events ("Pick Events")
• Note that there are more use cases, for production of simulated data and reconstruction, not covered here
I. Process Simulated Data
• Look up the simulation request with the parameters of interest
  • e.g. Request 5874 uses Pythia with m_t = 174 GeV/c²
• Define a dataset (via command line or GUI):
  • request_id 5874 and data_tier thumbnail
• Submit a project to a SAMGrid station and submit executable instance(s) to the batch system (our tools make that easy)
• Consumer is started
• Station delivers files to the executable instance(s)
• Station marks which files were delivered and consumed successfully and which had errors
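The consumer side of the last three steps is essentially a pull loop: ask the station for the next file, process it, and report consumed or error so the station can do its bookkeeping. A hedged sketch with a hypothetical station interface and an in-memory fake so the loop actually runs; the real client API calls differ:

```python
# Hypothetical consumer loop; function names are illustrative stand-ins
# for the real SAMGrid client's next-file / release calls.
def run_consumer(station, project_id, process):
    while True:
        path = station.get_next_file(project_id)   # blocks until delivery
        if path is None:                           # snapshot exhausted
            break
        try:
            process(path)
            station.release_file(project_id, path, status="consumed")
        except Exception:
            # the station records the error so the file can be redelivered
            station.release_file(project_id, path, status="error")

class FakeStation:
    """In-memory stand-in so the loop above can actually run."""
    def __init__(self, files):
        self.files = list(files)
        self.log = []
    def get_next_file(self, project_id):
        return self.files.pop(0) if self.files else None
    def release_file(self, project_id, path, status):
        self.log.append((path, status))

station = FakeStation(["tmb_0001.raw", "tmb_0002.raw"])
run_consumer(station, project_id=5874, process=print)
print(station.log)  # [('tmb_0001.raw', 'consumed'), ('tmb_0002.raw', 'consumed')]
```

The same delivery loop serves the other use cases below; only the dataset definition changes.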
II. Process Unskimmed Collider Data
• Define a dataset by describing the files of interest (not listing file names) using the command line or GUI
  • data_tier thumbnail and version p14.06.01 and run_type physics and run_qual_group MUO and run_quality GOOD
• Submit a project to a SAMGrid station and submit executable instance(s) to the batch system (our tools make that easy)
• Consumer is started
• Station delivers files to the executable instance(s)
• Station marks which files were delivered and consumed successfully and which had errors
III. Process Skimmed Collider Data
• Someone (a physics group, the Common Skimming Group, or an individual) has produced skimmed files
• They created a dataset that describes these files
• You can...
  • Submit a project/jobs using their dataset name, OR
  • Create a new dataset based on theirs with additional constraints:
    __set__ DiElectronSkim and run_number 168339
• Submission is the same as before
IV. Process Missed/New Data
• The set of files that satisfy the dataset query at a given time is a snapshot and is remembered with the SAMGrid project information
• One can make new datasets with:
  • Files that satisfy the dataset definition but are newer than the snapshot (new since the project ran)
  • Files that should have been processed by the original project but were not delivered or not consumed:
    __set__ myDataSet minus (project_name myProject and consumed_status consumed and consumer lyon)
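The underlying bookkeeping is plain set algebra over file lists, which the dataset language expresses with minus and consumed_status. A sketch with Python sets (file names invented):

```python
# Illustrative snapshot bookkeeping with Python sets; SAMGrid expresses
# the same idea in dataset-definition syntax ("minus", consumed_status).
snapshot = {"f1", "f2", "f3"}         # files when the project started
current  = {"f1", "f2", "f3", "f4"}   # files satisfying the query now
consumed = {"f1", "f3"}               # marked consumed by the project

new_files    = current - snapshot     # new since the project ran
missed_files = snapshot - consumed    # not delivered or not consumed

print(sorted(new_files))     # ['f4']
print(sorted(missed_files))  # ['f2']
```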
V. Analyze Individual Events (Pick Events)
• Users want to analyze individual events with lower-level data tiers (e.g. raw)
  • Event displays
  • Special investigations
• The SAMGrid catalog keeps information on each raw event (run #, event #, trigger bits, file)
• Users run a pick-events tool that creates a dataset and submits an executable to extract and process these events
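A pick-events tool reduces to an event-index lookup: map (run, event) pairs to the raw files that contain them, then build a dataset over exactly those files. An illustrative sketch; the index shape and names are made up, though the catalog does keep run/event/trigger/file information per raw event:

```python
# Illustrative pick-events lookup; the index contents are fabricated.
EVENT_INDEX = {
    (168339, 1054): "d0_raw_000123",
    (168339, 2871): "d0_raw_000124",
    (168440,   17): "d0_raw_000981",
}

def files_for_events(events):
    """Return the distinct raw files holding the requested events."""
    return sorted({EVENT_INDEX[ev] for ev in events if ev in EVENT_INDEX})

wanted = [(168339, 1054), (168339, 2871)]
print(files_for_events(wanted))  # ['d0_raw_000123', 'd0_raw_000124']
```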
Current SAMGrid Production Configurations
• Large SMP
  • A 10 TB central cache with 128 attached processors
  • Mostly used for Pick Events
  • Local job submission (not Grid)
• Linux production farms
  • Reconstruction, Monte Carlo production, reprocessing
  • Various cache arrangements: distributed on worker nodes; NFS-mounted from a central head node; routed through a central head node and then distributed
  • Local or Grid (Condor) job submission
• Linux analysis clusters
  • Analysis jobs
  • Various cache arrangements: distributed on worker nodes; routed through a central head node and then distributed; cached on the central node only
  • Local job submission or remote submission (with experiment-specific tools, not Condor)
SAMGrid Statistics (DØ)
[Chart: files delivered by month, 1999–2003, with the start of Run II marked]
SAMGrid Statistics - Usage Data
Data from January 6 until February 24 at DØ
• 9000 projects!
• 233 different users!
SAMGrid Statistics - Usage Data
[Chart: ~500K files delivered; "~1%" annotated on the plot]
SAMGrid Statistics - Usage Data
Data from January 6 until February 24 at DØ
• 256 TB delivered (raw, thumbnails, + …)
• 8.3 billion events!
SAMGrid Statistics - Operations Data
• Time between "Request Next File" and "Open File"
• For the CAB and CABSRV1 analysis-farm stations:
  • 50% of enstore (tape) transfers occur within 10 minutes
  • 75% within 20 minutes
  • 95% within 1 hour
• For CENTRAL-ANALYSIS (SMP station) and CLUED0 (desktop cluster station):
  • 95% of enstore (tape) transfers within 10 minutes
Stress Testing
• There are many station parameters to tune:
  • Maximum parallel transfers
  • Maximum concurrent enstore requests
  • Configuration of cache disks
  • …
• We're moving away from d0mino to Linux
  • How robust are these Linux machines?
  • How many projects can they run?
  • How many concurrent file transfers can they handle?
• Running a test harness on a small cluster to explore the SAMGrid parameter space (see the sketch below)
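The shape of such a harness is a parameter sweep: fix a workload, vary one station parameter, and record delivery time or throughput. A toy sketch; the timing model here is invented purely to show the harness structure, not to reproduce the measurements on the plots that follow:

```python
# Toy parameter sweep over "max parallel transfers". The per-file time
# and overhead are made-up constants, not measured SAMGrid numbers.
def simulated_delivery_time(n_files, max_transfers,
                            per_file_s=60.0, overhead_s=5.0):
    # files move in waves of at most max_transfers; each wave pays overhead
    waves = -(-n_files // max_transfers)   # ceiling division
    return waves * (per_file_s + overhead_s)

for max_transfers in (1, 5):
    t = simulated_delivery_time(n_files=100, max_transfers=max_transfers)
    print(f"max_transfers={max_transfers}: {t / 60:.1f} min for 100 files")
```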
SAMGrid Stress Testing
[Plots: delivery performance with max transfers = 5 vs. max transfers = 1]
SAMGrid Projects
• Recently completed:
  • Python & C++ client API
  • Improved DB schema
  • Batch-system adapters
  • 1st-generation monitoring
• Active & future:
  • DB-middleware improvements
  • 2nd-generation monitoring
  • Improved query language
  • Grid submission
  • Conversion to the SRM interface for cache functions
Summary
• SAMGrid has been successfully used at DØ for all data handling
• Over the past year, 44 stations consumed (remote = outside Fermilab):
  • 3.6 million files (0.46 million files remote)
  • 40 billion events (3 billion events remote)
  • 1.6 petabytes of data (137 TB of data remote)
• ~25 million MC events produced remotely with SAMGrid
• ~90 million events reprocessed remotely with SAMGrid
• SAMGrid will soon become the data handling system for CDF and, later, MINOS
• SAMGrid deployment on grids is contributing improvements to Grid tools & interfaces
ENSTORE Statistics
• 0.6 petabytes in tape storage!
• Only 5 files unrecoverable (5 GB total; 8 ppm loss)!
  • One of them was a RAW file
Some SAMGrid buzzwords
• Dataset Definition
  • A set of requirements to obtain a particular set of files
  • e.g. data_tier thumbnail and run_number 181933
  • Datasets can change over time: more files that satisfy the definition may be added to SAMGrid
• Snapshot
  • The files that satisfy a dataset at a particular time (e.g. when you start an analysis job)
  • Snapshots are static
• Project
  • The running of an executable over files in SAMGrid
  • Consists of the dataset definition, the snapshot from that dataset definition, and application information
  • Bookkeeping data is kept: how many files did you successfully process, where did your job run, how long did it take?
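Tying the three terms together, a project's bookkeeping record pairs the (dynamic) dataset definition with the (static) snapshot it resolved to, plus application information. An illustrative record; all field names here are made up:

```python
# Illustrative project record; the field names are invented, but the
# content mirrors the bookkeeping described above.
project = {
    "dataset_definition": "data_tier thumbnail and run_number 181933",
    "snapshot": ["tmb_181933_001", "tmb_181933_002"],  # frozen file list
    "application": {"name": "my_analysis", "version": "p14.06.01"},
    "files_consumed": 2,          # how many files processed successfully
    "node": "clued0-worker-042",  # where the job ran
    "wall_time_s": 5400,          # how long it took
}
print(project["snapshot"])  # the snapshot never changes after creation
```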