DIAL: Distributed Interactive Analysis of Large datasets
SC2003, Phoenix
David Adams, BNL
November 17, 2003
Contents
• Goals of DIAL
• What is DIAL?
• Design
• JDL
• Datasets
• Grid2003
• More information
Goals of DIAL
• 1. Demonstrate the feasibility of interactive analysis of large datasets
  • How much data can we analyze interactively?
• 2. Set requirements for GRID services
  • In particular those specific to interactive analysis
  • Job definition: application, task, dataset
  • Gathering and relaying results
  • Real time monitoring (partial results)
  • Resource management: discovery, allocation, sharing
• 3. Provide ATLAS with a useful analysis tool
  • For current and upcoming data challenges
  • Like to add another experiment to show generality
What is DIAL?
• DIAL provides a connection between
  • Interactive analysis framework
    • Fitting, presentation graphics, …
    • E.g. ROOT, JAS, …
  • and Data processing application
    • Natural to the data of interest
    • E.g. athena for ATLAS
• DIAL distributes processing
  • Among sites, farms, nodes
  • To provide user with desired response time
• Look to other projects to provide most infrastructure
Design
• DIAL has the following major components
  • Dataset describing the data of interest
  • Application defined by experiment/site
  • Task is user extension to the application
  • Job uses application and task to process a dataset
  • Result is the output of a job
  • Scheduler creates and manages jobs
• Together these define a high-level JDL (job definition language)
• Figure shows how these components interact →
[Figure: component interaction. The user analysis (e.g. ROOT) 1. creates or locates a Dataset, 2. selects an Application (e.g. athena), 3. creates or selects a Task (code), 4. selects a Scheduler, and 5. calls submit(app, tsk, ds). The Scheduler 6. splits the Dataset (Dataset 1, Dataset 2), 7. creates Jobs, and 8. calls run(app, tsk, ds1) and run(app, tsk, ds2); 9. each Job fills a Result, and 10. the Scheduler gathers the Results.]
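As a rough sketch of how these components might fit together in code, assuming the component names map onto classes roughly one-to-one (all signatures below are invented for illustration, not the actual DIAL API):

#include <string>
#include <vector>

// Illustrative stand-ins for the DIAL components; the names follow the slide,
// the signatures are assumptions.
struct Dataset {
    std::string id;
    std::vector<Dataset> split(int n) const { return std::vector<Dataset>(n, *this); }
};
struct Application { std::string name; };   // e.g. athena
struct Task        { std::string code; };   // user extension to the application
struct Result      { double value = 0; void fill(double v) { value += v; } };

struct Job {
    Result result;
    void run(const Application&, const Task&, const Dataset&) { result.fill(1.0); }  // 8-9: run, fill
};

struct Scheduler {
    // submit() covers steps 6-10: split the dataset, create and run jobs, gather results.
    Result submit(const Application& app, const Task& tsk, const Dataset& ds) {
        Result gathered;
        for (const Dataset& sub : ds.split(2)) {   // 6. split, 7. create
            Job job;
            job.run(app, tsk, sub);                // 8. run(app, tsk, dsN), 9. fill
            gathered.fill(job.result.value);       // 10. gather
        }
        return gathered;
    }
};

int main() {
    Dataset ds{"dataset-1"};                   // 1. create or locate a dataset
    Application app{"athena"};                 // 2. select an application
    Task tsk{"user_task.C"};                   // 3. create or select a task
    Scheduler scheduler;                       // 4. select a scheduler
    Result r = scheduler.submit(app, tsk, ds); // 5. submit(app, tsk, ds)
    return r.value > 0 ? 0 : 1;
}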
Analysis job rates
• At what rate is a site processing sub-jobs?
• Assume 1000 CPUs at a “site”
• Continuum of requests with the following extremes
  • Large scale data production with 3–30 hours/job:
    • 30–300 jobs/hour (1 job/minute)
    • Fine for batch and grid schedulers
  • Interactive analysis with 1–10 seconds/job:
    • 100–1000 jobs/sec (10000 jobs/minute)
    • Challenging for grid and batch schedulers
• Handle with hierarchy of schedulers
  • Each scheduler handles a fraction of the rate
  • But each level adds latency
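The rates quoted above are simply the assumed 1000 CPUs divided by the time per sub-job; a minimal sketch of that arithmetic (the CPU count and job times are the slide's assumptions):

#include <cstdio>

int main() {
    const double nCpus = 1000.0;  // assumed number of CPUs at a "site"

    // Batch-like production: 3-30 hours per sub-job.
    const double hoursPerJob[] = {3.0, 30.0};
    for (double h : hoursPerJob)
        std::printf("%4.0f h/job -> %5.0f jobs/hour\n", h, nCpus / h);

    // Interactive analysis: 1-10 seconds per sub-job.
    const double secondsPerJob[] = {1.0, 10.0};
    for (double s : secondsPerJob)
        std::printf("%4.0f s/job -> %5.0f jobs/sec\n", s, nCpus / s);

    return 0;
}

This reproduces the extremes on the slide: roughly 30–300 jobs/hour for production and 100–1000 jobs/sec for interactive analysis.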
DIAL scheduler hierarchy
[Figure.]
JDL
• High level job definition language
  • Enable users to specify task without reference to executables, data files or sites
  • Scheduler decides where and how to process data
  • Analysis implies user is easily able to customize task
• Common language
  • Enable different experiments and non-HEP activities to share schedulers
• PPDG activity to define such a language
  • Led by Gabriele Carcassi (STAR)
  • Similar to DIAL (application, task, dataset, …)
  • XML based
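For concreteness only, a hypothetical XML job description in the spirit of such a language might look like the fragment below (held in a C++ string literal); the element and attribute names are invented and do not reflect the actual PPDG or DIAL schema:

#include <iostream>
#include <string>

int main() {
    // Hypothetical high-level job description: the application, task and
    // dataset are named abstractly; no executables, data files or sites appear.
    const std::string jobXml =
        "<job>\n"
        "  <application name=\"athena\"/>\n"
        "  <task name=\"electron_mass\"/>\n"
        "  <dataset name=\"grid2003.higgs\"/>\n"
        "</job>\n";
    std::cout << jobXml;
    return 0;
}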
JDL (DIAL perspective)
[Figure.]
Datasets
• Want to provide a high-level data view
• Unit of processing is called “dataset”
• Many properties beyond data location
• Location is not just a list of files (physical or logical)
  • Multiple logical file set representations
  • Representation might be tables in an RDB
  • Or object list in an ODB
  • Or …
• Properties and categories follow
Dataset properties
• 0. Identity
  • Dataset must have a unique index and/or name
• 1. Content
  • Description of the type of data in the dataset
  • Event or non-event data
  • Simulation, reconstruction, …
  • ESD, AOD, …
  • Jets, tracks, electrons, …
• 2. Location
  • Where to find the data
  • Logical files, physical files, site, …
• 3. Mapping
  • Which content is at which location?
Dataset properties (cont)
• 4. Provenance
  • Prescription for creating the data
  • E.g. input dataset and transformation
• 5. History
  • Details of production beyond provenance
  • How production was split into jobs
  • Processing node and time for each job, …
• 6. Labels
  • Assigned metadata outside other categories, e.g.
    • Integrated luminosity
    • Result of quality checks
    • Flag indicating ok for use in published analyses
Dataset properties (cont)
• 7. Mutability
  • May dataset be modified?
  • Possible states: locked, unlocked, extensible, …
• 8. Compositeness
  • Dataset made up of other datasets
  • Two cases:
    • Construction: provenance is the list of sub-datasets
      • E.g. the summer dataset is defined to be the union of the June, July and August datasets.
    • Assignment: factorization into sub-datasets
      • Typically to reflect data placement
      • E.g. a representation of a global dataset might include sub-datasets in New York, Paris and Moscow.
Dataset categories
• Categorize datasets according to the extent of their location information
• Virtual dataset (VDS)
  • No location
• Nonvirtual dataset (NVDS)
  • Logical dataset (LDS)
    • Collection of logical files
  • Physical dataset (PDS)
    • Collection of physical files
  • Staged dataset
    • NVDS with mapping of sub-datasets to CPU or process
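Purely as an illustration of the property list and categories above, a dataset record might be sketched in C++ as follows; every type, field and name here is an assumption made for the sketch, not the actual DIAL dataset classes:

#include <map>
#include <string>
#include <vector>

// Extent of location information, following the category names above (illustrative).
enum class DatasetCategory { Virtual, Logical, Physical, Staged };

// Illustrative dataset record carrying the numbered properties.
struct DatasetSketch {
    std::string id;                                    // 0. Identity: unique index and/or name
    std::string content;                               // 1. Content: e.g. "AOD electrons"
    std::vector<std::string> files;                    // 2. Location: logical/physical file names
    std::map<std::string, std::string> contentToFile;  // 3. Mapping: content -> location
    std::vector<std::string> parentIds;                // 4. Provenance: input datasets ...
    std::string transformation;                        //    ... and the transformation applied
    std::map<std::string, std::string> history;        // 5. History: production details
    std::map<std::string, std::string> labels;         // 6. Labels: luminosity, quality flags, ...

    enum class Mutability { Locked, Unlocked, Extensible };
    Mutability mutability = Mutability::Locked;        // 7. Mutability

    std::vector<std::string> subDatasetIds;            // 8. Compositeness: sub-datasets

    DatasetCategory category = DatasetCategory::Virtual;  // no location info -> virtual
};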
Dataset category associations (example)
[Figure: one virtual dataset with two logical representations, each with several physical representations:
  VDS 1
    LDS 1-1 {LF1 LF2}
      PDS 1-1-1 {PF1A PF2A}
      PDS 1-1-2 {PF1B PF2B}
      PDS 1-1-3 {PF1A PF2B}
    LDS 1-2 {LF3}
      PDS 1-2-1 {PF3A}
      PDS 1-2-2 {PF3B}]
Dataset implementation
• Present dataset implementation includes
  • Virtual dataset (VDS)
    • Portable representation of dataset without location
  • Logical dataset (LDS) and physical dataset (PDS)
    • Add location expressed in terms of logical files
  • Dataset database (DDB)
    • Repository of (immutable) datasets indexed by ID
  • Dataset selection catalog (DSC)
    • Enables users to select a VDS
  • Dataset replica catalog (DRC)
    • Enables “system” to locate an NVDS representation of a VDS
  • Dataset file catalog (DFC)
    • Maps single-file datasets to LFNs
• Implementation: the dataset classes are C++ classes with an XML representation; the DDB stores them as files indexed by name; the catalogs (DSC, DRC, DFC) are MySQL tables
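A rough sketch of the catalog roles above as C++ interfaces; the class and method names are assumptions made for this illustration, not the actual DIAL API:

#include <string>
#include <vector>

// Assumed identifiers: dataset IDs and logical file names (LFNs) as strings.
using DatasetId = std::string;
using Lfn = std::string;

// Dataset selection catalog (DSC): lets a user go from a dataset name to the
// ID of the current virtual dataset.
struct DatasetSelectionCatalog {
    virtual DatasetId currentId(const std::string& name) const = 0;
    virtual ~DatasetSelectionCatalog() = default;
};

// Dataset database (DDB): repository of immutable datasets indexed by ID.
struct DatasetDatabase {
    virtual std::string fetchXml(const DatasetId& id) const = 0;  // XML representation
    virtual ~DatasetDatabase() = default;
};

// Dataset replica catalog (DRC): locates nonvirtual (logical/physical)
// representations of a given virtual dataset.
struct DatasetReplicaCatalog {
    virtual std::vector<DatasetId> replicasOf(const DatasetId& vds) const = 0;
    virtual ~DatasetReplicaCatalog() = default;
};

// Dataset file catalog (DFC): maps a single-file dataset to its LFN.
struct DatasetFileCatalog {
    virtual Lfn lfnOf(const DatasetId& singleFileDataset) const = 0;
    virtual ~DatasetFileCatalog() = default;
};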
Dataset implementation (cont)
[Figure.]
Grid2003 datasets
• Define dataset
  • Provenance, # events and list of LFNs
  • Assign dataset name
  • Create entry in DSC
• Produce data with GCE/Pegasus/Chimera
• Transfer files to BNL disk directory
• Poll destination directory for new files
  • Register files in Magda
  • Create single-file dataset (LDS) for each registered file
  • Store each dataset in DDB (dataset database)
  • Register LFN-to-dataset association in DFC
Grid2003 datasets (cont)
• At regular intervals
  • Merge the current set of single-file datasets to create the latest merged LDS
  • Use this LDS to create a VDS
  • Register the VDS-LDS association in the DRC
  • Update DSC entry with the new VDS
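A minimal sketch of the registration and merge bookkeeping described on these two slides; all classes, calls and names (including the file and dataset names in main) are in-memory stand-ins invented for illustration, not the real Magda, DDB or catalog interfaces:

#include <string>
#include <vector>

// In-memory stand-ins for the real services; names and signatures are invented.
struct Magda            { void registerFile(const std::string&) {} };
struct DatasetDb        { void store(const std::string&, const std::string&) {} };
struct FileCatalog      { void mapLfn(const std::string&, const std::string&) {} };
struct ReplicaCatalog   { void associate(const std::string&, const std::string&) {} };
struct SelectionCatalog { void update(const std::string&, const std::string&) {} };

// A new file arrives: register it in Magda, wrap it in a single-file LDS,
// store that dataset in the DDB and record the LFN association in the DFC.
std::string registerNewFile(const std::string& lfn, Magda& magda,
                            DatasetDb& ddb, FileCatalog& dfc) {
    magda.registerFile(lfn);
    const std::string ldsId = "lds." + lfn;   // single-file dataset ID (illustrative)
    ddb.store(ldsId, "<dataset .../>");       // placeholder XML representation
    dfc.mapLfn(ldsId, lfn);
    return ldsId;
}

// At regular intervals: merge the single-file datasets into the latest LDS,
// create the corresponding VDS, and update the replica and selection catalogs.
void mergeAndPublish(const std::string& datasetName,
                     const std::vector<std::string>& singleFileLds,
                     DatasetDb& ddb, ReplicaCatalog& drc, SelectionCatalog& dsc) {
    const std::string ldsId = datasetName + ".lds";
    const std::string vdsId = datasetName + ".vds";

    std::string ldsXml = "<dataset>";
    for (const std::string& sub : singleFileLds) ldsXml += "<sub id=\"" + sub + "\"/>";
    ldsXml += "</dataset>";

    ddb.store(ldsId, ldsXml);                 // merged logical dataset
    ddb.store(vdsId, "<dataset .../>");       // virtual dataset for the same data
    drc.associate(vdsId, ldsId);              // DRC: VDS -> LDS association
    dsc.update(datasetName, vdsId);           // DSC: name now points to the new VDS
}

int main() {
    Magda magda; DatasetDb ddb; FileCatalog dfc; ReplicaCatalog drc; SelectionCatalog dsc;
    std::vector<std::string> ldsIds = { registerNewFile("file1.root", magda, ddb, dfc) };
    mergeAndPublish("grid2003.higgs", ldsIds, ddb, drc, dsc);
    return 0;
}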
Grid2003 analysis
• User selects dataset from DSC
  • Dataset by name will change as data comes in
  • Dataset by ID is snapshot at time of selection
• User submits job
  • If by name, use DSC to get current dataset ID
  • Use ID to extract VDS from DDB
  • Submit dataset and task to DIAL scheduler
• Scheduler
  • Uses DRC to find LDS corresponding to VDS
    • In principle, the “best” choice for the given task
  • Splits LDS into single-file sub-datasets
  • Processes each, gathers and merges results
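The same flow written as one illustrative function; again the catalog types, lookups and the dataset name are placeholders with invented names, not the real DIAL or catalog interfaces:

#include <iostream>
#include <string>
#include <vector>

// Minimal stand-ins for the catalogs; the names and lookups are invented.
struct Dsc { std::string currentId(const std::string& name) { return name + ".vds"; } };
struct Ddb { std::string fetch(const std::string& id)       { return "<dataset id=\"" + id + "\"/>"; } };
struct Drc { std::string bestLds(const std::string& vds)    { return vds + ".lds"; } };

struct Result { double value = 0; };

// One sub-job: run the task on a single-file sub-dataset (placeholder work).
Result runSubJob(const std::string&) { Result r; r.value = 1.0; return r; }

// The flow from the slide: name -> current VDS ID -> VDS -> "best" LDS ->
// single-file sub-datasets -> sub-jobs -> gathered, merged result.
Result analyze(const std::string& datasetName, Dsc& dsc, Ddb& ddb, Drc& drc) {
    const std::string vdsId  = dsc.currentId(datasetName);  // DSC: name -> current dataset ID
    const std::string vdsXml = ddb.fetch(vdsId);             // DDB: ID -> VDS representation
    const std::string ldsId  = drc.bestLds(vdsId);           // DRC: VDS -> "best" LDS

    // Split the LDS into single-file sub-datasets (placeholder split; real
    // code would parse the dataset representation in vdsXml).
    (void)vdsXml;
    const std::vector<std::string> subDatasets = {ldsId + ".1", ldsId + ".2"};

    // Process each sub-dataset, then gather and merge the results.
    Result merged;
    for (const std::string& sub : subDatasets) merged.value += runSubJob(sub).value;
    return merged;
}

int main() {
    Dsc dsc; Ddb ddb; Drc drc;
    std::cout << "merged result: " << analyze("grid2003.higgs", dsc, ddb, drc).value << "\n";
    return 0;
}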
Discovery!
• The first Grid2003 dataset was analyzed and a mass was calculated from the four leading electrons
  • 48k events
  • Simulated mass was 130 GeV
  • Electron ET > 5 GeV
More information
• DIAL
  • http://www.usatlas.bnl.gov/~dladams/dial
• Datasets
  • http://www.usatlas.bnl.gov/~dladams/dataset
• Grid2003 Datasets
  • http://www.usatlas.bnl.gov/~dladams/dataset/grid3