
Virtual Data Management for CMS Simulation Production




Presentation Transcript


  1. Virtual Data Management for CMS Simulation Production
     A GriPhyN Prototype

  2. Goals
     • Explore
       • virtual data dependency tracking
       • data derivability
       • integrate virtual data catalog functionality
       • use of DAGs in virtual data production
     • Identify
       • architectural issues: planners, catalogs, interfaces
       • hard issues in executing real production physics applications
     • Create prototypes
       • tools that can go into the VDT
     • Test virtual data concepts on something “real”

  3. Which Part of GriPhyN
     [Architecture diagram. Legend: marked components = initial solution is operational]
     Components: Application, aDAG, Catalog Services, Monitoring, Planner, Info Services, cDAG, Repl. Mgmt., Executor, Policy/Security, Reliable Transfer Service, Compute Resource, Storage Resource
     Technology labels: MCAT; GriPhyN catalogs; MDS; MDS; GDMP; DAGMAN, Kangaroo; GSI, CAS; Globus GRAM; GridFTP; GRAM; SRM

  4. What Was Done
     • Created:
       • A virtual data catalog
       • A catalog scheme for an RDBMS
       • A “virtual data language” (VDL)
       • A VDL command interpreter
       • Simple DAGs for the CMS pipeline
       • Complex DAGs for a canonical test application
         • Kanonical executable for GriPhyN (keg)
     • These DAGs actually execute on a Condor-Globus Grid

  5. The CMS Challenge
     • Remember Rick’s slides and the complexity!
     • Types of executables (4)
     • Parameters, inputs, and outputs
     • Templates of parameter lists
     • Sensitivities of binaries
       • Dynamic libraries
       • Environment variables
       • Condor-related environment issues (less obvious)

  6. The VDL
     begin v /bin/cat
       arg -n
       file i filename1
       file i filename2
       stdout filename3
       env key=value
     end
     Data flow: filename1, filename2 → setenv … /bin/cat -n → filename3
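To make the block structure on this slide concrete, here is a minimal Python sketch that reads `begin`/`end` blocks into records. The grammar is inferred only from the examples in these slides; the `Invocation` class and `parse_vdl` function are hypothetical names, and the token after `begin` (`v`) is ignored because the slides do not explain it.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class Invocation:
    """One begin/end block: an executable plus its declared arguments and files."""
    executable: str
    args: list = field(default_factory=list)
    inputs: list = field(default_factory=list)
    outputs: list = field(default_factory=list)
    stdin: Optional[str] = None
    stdout: Optional[str] = None
    env: dict = field(default_factory=dict)


def parse_vdl(text):
    """Parse begin/end blocks into Invocation records (assumed grammar)."""
    invocations, current = [], None
    for raw in text.splitlines():
        words = raw.split()
        if not words:
            continue
        key, rest = words[0], words[1:]
        if key == "begin":              # e.g. "begin v /bin/cat"
            current = Invocation(executable=rest[-1])
        elif current is None:
            continue                    # skip lines outside a block (e.g. "rc ...")
        elif key == "end":
            invocations.append(current)
            current = None
        elif key == "arg":
            current.args.extend(rest)
        elif key == "file":             # "file i NAME" (input) or "file o NAME" (output)
            target = current.inputs if rest[0] == "i" else current.outputs
            target.append(rest[1])
        elif key in ("stdin", "stdout"):
            setattr(current, key, rest[0])
        elif key == "env":              # "env KEY=VALUE"
            name, _, value = rest[0].partition("=")
            current.env[name] = value
    return invocations


EXAMPLE = """
begin v /bin/cat
  arg -n
  file i filename1
  file i filename2
  stdout filename3
  env key=value
end
"""

inv = parse_vdl(EXAMPLE)[0]
print(inv.executable, inv.args, inv.inputs, inv.stdout, inv.env)
```

Keeping inputs, outputs, and standard streams as separate fields is what later allows dependencies between blocks to be read off from file names alone.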

  7. Dependent Programs
     begin v /bin/phys1
       arg -n
       file i f1
       file i f2
       stdout f3
       env key=value
     end
     begin v /bin/phys2
       arg -m
       file i f1
       file i f3
       file o f4
       env key=value
     end
     …note that dependencies can be complex graphs
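The dependency between the two programs on this slide is implied by a shared file: phys1 writes f3 and phys2 reads it. A minimal sketch of how such edges could be derived, assuming each invocation has already been reduced to its lists of files read and written (the dictionary layout and `find_dependencies` are illustrative, not part of the prototype):

```python
# Files read and written by each invocation, taken from the two blocks on the slide.
derivations = {
    "phys1": {"reads": ["f1", "f2"], "writes": ["f3"]},
    "phys2": {"reads": ["f1", "f3"], "writes": ["f4"]},
}


def find_dependencies(derivations):
    """Return (producer, consumer, file) edges implied by shared file names."""
    producer = {}
    for name, d in derivations.items():
        for f in d["writes"]:
            producer[f] = name
    edges = set()
    for name, d in derivations.items():
        for f in d["reads"]:
            # Files with no producer (f1, f2) are treated as primary inputs.
            if f in producer and producer[f] != name:
                edges.add((producer[f], name, f))
    return sorted(edges)


print(find_dependencies(derivations))  # [('phys1', 'phys2', 'f3')]
```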

  8. The Interpreter
     • How program invocations are formed
       • Environment variables
       • Regular Parameters
       • Input files
       • Output file
     • How DAGs are formed
       • Recursive determination of dependencies
       • Parallel execution
     • How scripts are formed
       • Recursive determination of dependencies
       • Serial execution (now); parallel is possible
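The "recursive determination of dependencies" for serial scripts can be sketched as a depth-first walk from the requested file back to its inputs. This is an illustration under assumptions, not the interpreter's actual code: the `CATALOG` layout is hypothetical, input files are assumed to be passed as positional arguments, and every output is assumed to be captured from stdout.

```python
import shlex

# Hypothetical catalog: each derivable file maps to the invocation that produces it.
CATALOG = {
    "f.a": {"exe": "random"},
    "f.b": {"exe": "half", "stdin": "f.a"},
    "f.c": {"exe": "half", "stdin": "f.a"},
    "f.d": {"exe": "sum", "inputs": ["f.b", "f.c"]},
}


def command_line(target, entry):
    """Form one shell command: environment, executable, parameters, inputs, output."""
    env = [f"{k}={shlex.quote(v)}" for k, v in entry.get("env", {}).items()]
    words = [entry["exe"], *entry.get("args", []), *entry.get("inputs", [])]
    cmd = " ".join(env + [shlex.quote(w) for w in words])
    if entry.get("stdin"):
        cmd += " < " + shlex.quote(entry["stdin"])
    return cmd + " > " + shlex.quote(target)


def serial_script(target, catalog, done=None):
    """Return commands that derive `target`, with prerequisites emitted first."""
    done = set() if done is None else done
    if target in done or target not in catalog:
        return []                      # already scheduled, or a primary input
    done.add(target)
    entry = catalog[target]
    needed = list(entry.get("inputs", []))
    if entry.get("stdin"):
        needed.append(entry["stdin"])
    lines = []
    for dep in needed:
        lines.extend(serial_script(dep, catalog, done))
    lines.append(command_line(target, entry))
    return lines


print("\n".join(serial_script("f.d", CATALOG)))
```

The same walk yields DAG edges instead of script lines if each (dependency, target) pair is recorded rather than flattened, which is why serial and parallel output can share one traversal.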

  9. Virtual Data Catalog
     Relational Database Structure: As Implemented

  10. Virtual Data Catalog
     Conceptual Data Structure
     • TRANSFORMATION
       • /bin/physapp1
       • version 1.2.3b(2)
       • created on 12 Oct 1998
       • owned by physbld.orca
     • DERIVATION
       • ^paramlist
       • ^transformation
     • PARAMETER LIST
       • PARAMETER i filename1
       • PARAMETER p -g
       • PARAMETER E PTYPE=muon
       • PARAMETER O filename2
     • FILE
       • LFN=filename1
       • PFN1=/store1/1234987
       • PFN2=/store9/2437218
       • PFN3=/store4/8373636
       • ^derivation
     • FILE
       • LFN=filename2
       • PFN1=/store1/1234987
       • PFN2=/store9/2437218
       • ^derivation
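As an illustration of this conceptual structure, the records can be modelled as linked Python dataclasses. This is a sketch only: the class and field names are assumptions, and the relational schema actually implemented in the prototype is not shown in the transcript.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class Transformation:
    executable: str
    version: str
    created: str
    owner: str


@dataclass
class Parameter:
    kind: str    # "i" input file, "p" plain parameter, "E" environment, "O" output file
    value: str


@dataclass
class Derivation:
    transformation: Transformation                    # the ^transformation pointer
    paramlist: list = field(default_factory=list)     # the ^paramlist pointer


@dataclass
class LogicalFile:
    lfn: str
    pfns: list = field(default_factory=list)          # known physical replicas
    derivation: Optional[Derivation] = None           # the ^derivation pointer


physapp1 = Transformation("/bin/physapp1", "1.2.3b(2)", "12 Oct 1998", "physbld.orca")
run = Derivation(physapp1, [
    Parameter("i", "filename1"),
    Parameter("p", "-g"),
    Parameter("E", "PTYPE=muon"),
    Parameter("O", "filename2"),
])
filename2 = LogicalFile("filename2", ["/store1/1234987", "/store9/2437218"], run)


def is_materialized(f):
    """A file is materialized if at least one physical replica is recorded."""
    return bool(f.pfns)


def is_derivable(f):
    """A file is derivable if the catalog records how it was (or can be) produced."""
    return f.derivation is not None


print(is_materialized(filename2), is_derivable(filename2))  # True True
```

A file that is derivable but not materialized is "virtual": the catalog knows the recipe even though no replica exists yet.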

  11. DAGs & Data Structures
     DAGMan Example: Diamond DAG
     • TOP generates even random number
     • LEFT and RIGHT divide number by 2
     • BOTTOM sums
     Data flow: random → f.a → half, half → f.b, f.c → sum → f.d
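The diamond can be written down as a DAGMan input file, which uses `JOB <name> <submit file>` lines followed by `PARENT … CHILD …` lines. The sketch below generates that text in Python; the submit file names (`random.sub` and so on) are placeholders invented for the example, and `dagman_text` is a hypothetical helper.

```python
def dagman_text(jobs, edges):
    """Render a DAGMan input file from job names and (parent, child) edges."""
    lines = [f"JOB {name} {submit}" for name, submit in jobs.items()]
    children = {}
    for parent, child in edges:
        children.setdefault(parent, []).append(child)
    for parent, kids in children.items():
        lines.append(f"PARENT {parent} CHILD {' '.join(kids)}")
    return "\n".join(lines) + "\n"


# Node names follow the slide; the submit file names are assumed.
jobs = {
    "TOP": "random.sub",
    "LEFT": "half_left.sub",
    "RIGHT": "half_right.sub",
    "BOTTOM": "sum.sub",
}
edges = [("TOP", "LEFT"), ("TOP", "RIGHT"), ("LEFT", "BOTTOM"), ("RIGHT", "BOTTOM")]

print(dagman_text(jobs, edges), end="")
```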

  12. DAGs & Data Structures II
     begin v random
       stdout f.a
     end
     begin v half
       stdin f.a
       stdout f.b
     end
     begin v half
       stdin f.a
       stdout f.c
     end
     begin v sum
       file i f.b
       file i f.c
       stdout f.d
     end
     rc f.a out.a
     rc f.b out.b
     rc f.c out.c
     rc f.d out.d
     Data flow: random → f.a → half, half → f.b, f.c → sum

  13. DAGs & Data Structures III

  14. Abstract and Concrete DAGs
     • Abstract DAGs
       • Resource locations unspecified
       • File names are logical
       • Data destinations unspecified
     • Concrete DAGs
       • Resource locations determined
       • Physical file names specified
       • Data delivered to and returned from physical locations
     • Translation is the job of the “planner”
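A toy version of the abstract-to-concrete translation may help fix the idea. This sketch assumes a single execution site, a replica table mapping logical to physical names, and nodes listed in dependency order; the site name, paths, and the `concretize` function are all hypothetical, and a real planner would also choose among sites and replicas.

```python
import posixpath


def concretize(abstract_nodes, replicas, site, workdir):
    """Turn abstract nodes (logical names only) into concrete ones."""
    known = dict(replicas)           # LFN -> PFN, extended as outputs are placed
    concrete = []
    for node in abstract_nodes:      # assumed to be listed in dependency order
        missing = [lfn for lfn in node["inputs"] if lfn not in known]
        if missing:
            raise LookupError(f"no replica or producer for {missing}")
        outputs = {lfn: posixpath.join(workdir, lfn) for lfn in node["outputs"]}
        concrete.append({
            "exe": node["exe"],
            "site": site,
            "inputs": {lfn: known[lfn] for lfn in node["inputs"]},
            "outputs": outputs,
        })
        known.update(outputs)
    return concrete


abstract = [
    {"exe": "/bin/phys1", "inputs": ["f1", "f2"], "outputs": ["f3"]},
    {"exe": "/bin/phys2", "inputs": ["f1", "f3"], "outputs": ["f4"]},
]
replicas = {"f1": "/store1/f1", "f2": "/store9/f2"}   # made-up physical names

for node in concretize(abstract, replicas, "site-a.example.org", "/scratch/run1"):
    print(node)
```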

  15. What We Tested
     • DAG structures
       • Diamond DAG
       • Canonical “keg” app in complex DAGs
       • The CMS pipeline
     • Execution environments
       • Local execution
       • Grid execution via DAGMan

  16. Generality
     simple fabric → very powerful DAGs
     DAGs of this pattern with >260 nodes were run.

  17. What We Have Learned
     • UNIX program execution semantics is messy but manageable
       • Command line execution is manageable
       • File accesses can be trapped and tracked
       • Dynamic loading makes reproducibility more difficult and should be avoided if possible
     • Object handling *obviously* needs concentrated research effort

  18. Future Work
     • Working with OO Databases
     • Handling navigational access
     • Refining notion of signatures
     • Dealing with fuzzy dependencies and equivalence
     • Cost tracking and calculations (w/ planner)
     • Automating the cataloging process
     • Integration with portals
     • Uniform execution language
     • Analysis of scripts (shell, Perl, Python, Tcl)
     • Refinement of data staging paradigms
     • Handling shell details
       • Pipes, 3>&1 (fd #s)

  19. Future Work II
     Design of Staging Semantics
     • What files need to be moved where to start a computation
     • How do you know (exactly) where the computation will run, and how to get the file “there” (NFS, local, etc.)
     • How/when to get the results back
     • How/when to trust the catalog
       • Double-check file’s existence / safe arrival when you get there to use it
       • DB marking of file’s existence: schema, timing
       • Mechanisms to audit and correct consistency of catalog vs. reality
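The "catalog vs. reality" audit mentioned above can be illustrated with a small check that walks the catalog and tests whether each recorded physical file exists. This is a sketch assuming replicas are paths on a locally visible file system; `audit_catalog` and the demo entries are hypothetical, and remote storage would need a different existence test.

```python
import os
import tempfile


def audit_catalog(catalog):
    """Split (LFN, PFN) catalog entries into those found on disk and those not."""
    present, missing = [], []
    for lfn, pfns in sorted(catalog.items()):
        for pfn in pfns:
            (present if os.path.isfile(pfn) else missing).append((lfn, pfn))
    return present, missing


# Demo: one entry points at a real file, the other at a file that never arrived.
with tempfile.TemporaryDirectory() as tmp:
    good = os.path.join(tmp, "out.a")
    stale = os.path.join(tmp, "out.b")
    with open(good, "w") as fh:
        fh.write("42\n")
    catalog = {"f.a": [good], "f.b": [stale]}
    present, missing = audit_catalog(catalog)

print("present:", [lfn for lfn, _ in present])
print("missing:", [lfn for lfn, _ in missing])
```

Entries reported as missing could then be removed from the catalog or queued for re-derivation, which is one way to "correct" consistency rather than only detect the drift.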
