Virtual Data Management for CMS Simulation Production: A GriPhyN Prototype
Goals
• Explore
  • virtual data dependency tracking
  • data derivability
  • integration of virtual data catalog functionality
  • use of DAGs in virtual data production
• Identify
  • architectural issues: planners, catalogs, interfaces
  • hard issues in executing real production physics applications
• Create prototypes
  • tools that can go into the VDT
• Test virtual data concepts on something “real”
Which Part of GriPhyN?
[Architecture figure: the planner turns an abstract DAG (aDAG) into a concrete DAG (cDAG) for the executor; components are annotated with the initial solution that is operational:]
• Catalog Services: MCAT; GriPhyN catalogs
• Monitoring / Info Services: MDS
• Replica Management: GDMP
• Executor: DAGMan, Kangaroo
• Policy/Security: GSI, CAS
• Compute Resource: Globus GRAM
• Storage Resource: GridFTP; GRAM; SRM
• Also shown: Planner, Reliable Transfer Service
What Was Done
• Created:
  • A virtual data catalog
  • A catalog schema for an RDBMS
  • A “virtual data language” (VDL)
  • A VDL command interpreter
  • Simple DAGs for the CMS pipeline
  • Complex DAGs for a canonical test application: the “Kanonical executable for GriPhyN” (keg)
• These DAGs actually execute on a Condor-Globus Grid
The CMS Challenge
• Remember Rick’s slides and the complexity!
  • Types of executables (4)
  • Parameters, inputs, and outputs
  • Templates of parameter lists
  • Sensitivities of binaries
  • Dynamic libraries
  • Environment variables
• Condor-related environment issues – less obvious
The VDL

begin v /bin/cat arg -n file i filename1 file i filename2 stdout filename3 env key=value
end

[Figure: filename1 and filename2 feed into “setenv …; /bin/cat -n”, which produces filename3.]
Dependent Programs

begin v /bin/phys1 arg -n file i f1 file i f2 stdout f3 env key=value
end

begin v /bin/phys2 arg -m file i f1 file i f3 file o f4 env key=value
end

…note that dependencies can be complex graphs
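The dependency between phys1 and phys2 (f3 is produced by one and consumed by the other) is what the catalog walks to decide what must run. A minimal sketch of that recursive resolution, using a simplified record format invented here rather than the actual VDL syntax:

```python
# Hypothetical simplification of the two derivations above: each record
# names its executable, logical input files, and logical output files.
derivations = [
    {"app": "/bin/phys1", "inputs": ["f1", "f2"], "outputs": ["f3"]},
    {"app": "/bin/phys2", "inputs": ["f1", "f3"], "outputs": ["f4"]},
]

def producers(derivs):
    """Map each logical file name to the derivation that produces it."""
    return {out: d for d in derivs for out in d["outputs"]}

def dependencies(target, derivs):
    """Recursively collect, in run order, the derivations needed for `target`."""
    by_output = producers(derivs)
    needed, seen = [], set()
    def visit(lfn):
        d = by_output.get(lfn)
        if d is None or id(d) in seen:   # raw input, or already scheduled
            return
        seen.add(id(d))
        for inp in d["inputs"]:          # produce prerequisites first
            visit(inp)
        needed.append(d)
    visit(target)
    return needed

# Producing f4 requires running phys1 (to make f3) before phys2.
order = [d["app"] for d in dependencies("f4", derivations)]
```

The same walk, stopped at files that already exist, is what makes data “derivable” rather than merely stored.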
The Interpreter
• How program invocations are formed
  • Environment variables
  • Regular parameters
  • Input files
  • Output file
• How DAGs are formed
  • Recursive determination of dependencies
  • Parallel execution
• How scripts are formed
  • Recursive determination of dependencies
  • Serial execution (now); parallel is possible
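Invocation forming can be sketched concretely. This is an illustrative reconstruction, not the interpreter's actual code: field names (`app`, `args`, `inputs`, `stdout`, `env`) are assumptions standing in for the VDL record:

```python
def make_invocation(rec):
    """Form a shell command line from a VDL-like record (hypothetical fields)."""
    env = " ".join(f"{k}={v}" for k, v in rec.get("env", {}).items())
    args = " ".join(rec.get("args", []))
    files = " ".join(rec.get("inputs", []))
    cmd = " ".join(part for part in (rec["app"], args, files) if part)
    if rec.get("stdout"):
        cmd += f" > {rec['stdout']}"     # output file via stdout redirection
    return f"{env} {cmd}".strip()

# The /bin/cat example from the VDL slide:
cmd = make_invocation({
    "app": "/bin/cat", "args": ["-n"],
    "inputs": ["filename1", "filename2"],
    "stdout": "filename3", "env": {"key": "value"},
})
# cmd == "key=value /bin/cat -n filename1 filename2 > filename3"
```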
Virtual Data Catalog: Relational Database Structure As Implemented
Virtual Data Catalog: Conceptual Data Structure

[Figure: linked records in the catalog]
• TRANSFORMATION: /bin/physapp1, version 1.2.3b(2), created on 12 Oct 1998, owned by physbld.orca
• DERIVATION: points to its transformation (^transformation) and its parameter list (^paramlist)
• PARAMETER LIST: the ordered parameters of a derivation
  • PARAMETER (i, file): filename1
  • PARAMETER (p): -g
  • PARAMETER (e): PTYPE=muon
  • PARAMETER (o, file): filename2
• FILE: LFN=filename1, with replicas PFN1=/store1/1234987, PFN2=/store9/2437218, PFN3=/store4/8373636, linked back to its producing derivation (^derivation)
• FILE: LFN=filename2, with replicas PFN1=/store1/1234987, PFN2=/store9/2437218, linked back to its producing derivation (^derivation)
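The relational shape behind that figure can be sketched in a few tables. Table and column names here are illustrative assumptions, not the implemented schema:

```python
import sqlite3

# Sketch of a catalog schema in the spirit of the conceptual structure
# above; every name below is invented for illustration.
db = sqlite3.connect(":memory:")
db.executescript("""
CREATE TABLE transformation (
    id INTEGER PRIMARY KEY,
    path TEXT, version TEXT, created TEXT, owner TEXT
);
CREATE TABLE derivation (
    id INTEGER PRIMARY KEY,
    transformation_id INTEGER REFERENCES transformation(id)
);
CREATE TABLE parameter (
    derivation_id INTEGER REFERENCES derivation(id),
    kind TEXT,           -- 'i' input file, 'o' output file, 'p' plain, 'e' env
    value TEXT
);
CREATE TABLE file (
    lfn TEXT,            -- one logical name ...
    pfn TEXT             -- ... many physical replicas
);
""")
db.execute("INSERT INTO transformation VALUES "
           "(1, '/bin/physapp1', '1.2.3b(2)', '12 Oct 1998', 'physbld.orca')")
db.execute("INSERT INTO file VALUES ('filename1', '/store1/1234987')")
db.execute("INSERT INTO file VALUES ('filename1', '/store9/2437218')")
replicas = db.execute(
    "SELECT pfn FROM file WHERE lfn = 'filename1'").fetchall()
```

The LFN-to-PFN table is the replica-catalog half of the design; the transformation/derivation/parameter tables are the provenance half.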
DAGs & Data Structures: DAGMan Example (Diamond DAG)
• TOP (random) generates an even random number → f.a
• LEFT and RIGHT (half) each divide it by 2 → f.b, f.c
• BOTTOM (sum) sums them → f.d
DAGs & Data Structures II

begin v random stdout f.a
end
begin v half stdin f.a stdout f.b
end
begin v half stdin f.a stdout f.c
end
begin v sum file i f.b file i f.c stdout f.d
end
rc f.a out.a
rc f.b out.b
rc f.c out.c
rc f.d out.d

[Figure: random → f.a, consumed by both half nodes → f.b and f.c → sum.]
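The diamond's arithmetic doubles as a self-check: halving an even number twice and summing the halves reproduces the original. A local stand-in for the four nodes (node roles as on the slide; the code itself is a sketch, not the grid jobs):

```python
import random

def run_diamond(seed=0):
    """Run the diamond DAG's four nodes in-process."""
    rng = random.Random(seed)
    a = rng.randrange(0, 100, 2)   # TOP: even random number -> f.a
    b = a // 2                     # LEFT:  half -> f.b
    c = a // 2                     # RIGHT: half -> f.c
    d = b + c                      # BOTTOM: sum -> f.d
    return a, d

a, d = run_diamond()
# For any even a, d == a -- a cheap correctness probe for the whole DAG.
```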
Abstract and Concrete DAGs
• Abstract DAGs
  • Resource locations unspecified
  • File names are logical
  • Data destinations unspecified
• Concrete DAGs
  • Resource locations determined
  • Physical file names specified
  • Data delivered to and returned from physical locations
• Translation is the job of the “planner”
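The planner's translation can be sketched as a lookup: bind each logical file name to a physical replica at the chosen site. The replica-catalog contents and field names here are invented for illustration:

```python
# Hypothetical replica catalog: logical name -> {site: physical name}.
replica_catalog = {
    "f.a": {"siteA": "/store1/f.a", "siteB": "/store9/f.a"},
}

def concretize(abstract_job, site):
    """Turn an abstract job (logical names, no location) into a concrete one."""
    physical = [replica_catalog[lfn][site] for lfn in abstract_job["inputs"]]
    return {"app": abstract_job["app"], "site": site, "inputs": physical}

# An abstract "half" node bound to siteA:
job = concretize({"app": "half", "inputs": ["f.a"]}, "siteA")
```

A real planner would also pick the site (cost, policy, data locality) rather than take it as an argument; that choice is exactly the “architectural issue” the prototype set out to expose.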
What We Tested
• DAG structures
  • Diamond DAG
  • Canonical “keg” app in complex DAGs
  • The CMS pipeline
• Execution environments
  • Local execution
  • Grid execution via DAGMan
Generality: simple fabric → very powerful DAGs. DAGs of this pattern with >260 nodes were run.
What We Have Learned
• UNIX program execution semantics is messy but manageable
• Command-line execution is manageable
• File accesses can be trapped and tracked
• Dynamic loading makes reproducibility more difficult – it should be avoided if possible
• Object handling *obviously* needs concentrated research effort
Future Work
• Working with OO databases
  • Handling navigational access
• Refining the notion of signatures
  • Dealing with fuzzy dependencies and equivalence
• Cost tracking and calculations (with the planner)
• Automating the cataloging process
• Integration with portals
• Uniform execution language
  • Analysis of scripts (shell, Perl, Python, Tcl)
• Refinement of data staging paradigms
• Handling shell details
  • Pipes, 3>&1 (fd #s)
Future Work II: Design of Staging Semantics
• What files need to be moved where to start a computation?
• How do you know (exactly) where the computation will run, and how to get the file “there” (NFS, local, etc.)?
• How/when to get the results back
• How/when to trust the catalog
  • Double-check a file’s existence/safe arrival when you get there to use it
  • DB marking of file existence – schema, timing
  • Mechanisms to audit and correct consistency of catalog vs. reality
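The “double-check on arrival” idea above can be sketched: after staging, verify that the file exists and matches the checksum the catalog recorded, then mark the catalog entry. The transfer is simulated with a local write, and the entry's field names (`md5`, `verified`) are assumptions for illustration:

```python
import hashlib
import os
import tempfile

def stage_in(src_bytes, dest_path, catalog_entry):
    """Stage data to dest_path, then verify existence and checksum
    against the catalog before trusting it (all names illustrative)."""
    with open(dest_path, "wb") as f:
        f.write(src_bytes)               # stand-in for the actual transfer
    ok = (os.path.exists(dest_path) and
          hashlib.md5(open(dest_path, "rb").read()).hexdigest()
              == catalog_entry["md5"])
    catalog_entry["verified"] = ok       # "DB marking of file existence"
    return ok

data = b"event data"
entry = {"md5": hashlib.md5(data).hexdigest(), "verified": False}
with tempfile.TemporaryDirectory() as d:
    ok = stage_in(data, os.path.join(d, "f.a"), entry)
```

Auditing catalog-vs.-reality consistency is then the same check run in bulk: re-verify each marked entry and correct the ones that fail.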