The Grid Collector: Using an Event Catalog to Speed up User Analysis in Distributed Environment

Wei-Ming Zhang (Kent State University); Kesheng Wu and Arie Shoshani (Lawrence Berkeley National Laboratory); Victor Perevoztchikov and Jerome Lauret (Brookhaven National Laboratory).

Presentation Transcript


  1. The Grid Collector: Using an Event Catalog to Speed up User Analysis in Distributed Environment
     Wei-Ming Zhang, Kent State University
     Kesheng Wu, Arie Shoshani, Lawrence Berkeley National Laboratory
     Victor Perevoztchikov, Jerome Lauret, Brookhaven National Laboratory
     CHEP 2004

  2. Problem to Resolve: A View of a Typical Analysis Process
     • Users want to analyze “some” (not all) events
     • Events are stored in millions of files
     • Files are distributed on many storage systems
     • To perform an analysis, a user needs to
       • Prepare an analysis
         • Write the analysis code
         • Specify the events of interest
       • Run an analysis
         • Locate the files containing the events of interest
         • Prepare disk space for the files
         • Transfer the files to the disks
         • Recover from any errors
         • Read the events of interest from files
         • Remove the files

  3. Run an Analysis
     • Locate the files containing the events of interest
       • Need a catalog over events
     • Prepare disk space for the files
       • Need a component to manage space automatically
     • Transfer the files to the disks
       • Need to automate file streaming into available space
       • Need automated transfers from mass storage
     • Recover from any errors
       • Need automatic recovery from transient failures
     • Read the events of interest from files
       • Need automatic iteration over events
     • Remove the files
       • Need automatic garbage collection
     Alternative: static disk population, which is not optimal for large datasets or constrained resources

  4. Design Goals of Grid Collector
     Primary goal: make analysts more productive by
     • Allowing them to specify events of interest using meaningful physical quantities
       • numberOfPrimaryTracks > 1000 AND SumOfPt > 20
     • Reading only events of interest
     • Automating the management of distributed files and disks
     Secondary goals
     • Working with the existing ROOT-based analysis framework
     • Using a minimal amount of storage
     • Benefiting a majority of the users

  5. Components of the Grid Collector
     Legend: red = new components, purple = existing components
     • Locate the files containing the events of interest
       • Event Catalog, file & replica catalogs
     • Prepare disk space and transfer
       • Prepare disk space for the files
         • Disk Resource Manager (DRM)
       • Transfer the files to the disks
         • Hierarchical Resource Manager (HRM) to access HPSS
         • On-demand transfers from HRM to DRM
     • Recover from any errors
       • HRM recovers from HPSS failures
       • DRM recovers from network transfer failures (Track 4, 344 for SRM/HRM/DRM usage)
     • Read the events of interest from files
       • Event Iterator with fast forward capability
     • Remove the files
       • DRM performs garbage collection using pinning and lifetime
     Consistent with other SRM-based strategies and tools
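
The slide only says that the DRM performs garbage collection "using pinning and lifetime". The sketch below is one plausible reading of that phrase, not the DRM interface: a file is protected while it holds an unexpired pin, and becomes removable once the pin is released or its lifetime runs out. The class name, method names and file names are all hypothetical, and time is passed in explicitly to keep the example deterministic.

```python
class DiskCacheSketch:
    """Toy stand-in for a disk manager; not the real DRM interface."""

    def __init__(self):
        # file name -> time at which its pin expires (None means not pinned)
        self.pin_expiry = {}

    def add(self, name):
        """Register a file that has been transferred to local disk."""
        self.pin_expiry[name] = None

    def pin(self, name, now, lifetime):
        """Protect a file from removal until now + lifetime."""
        self.pin_expiry[name] = now + lifetime

    def release(self, name):
        """Drop the pin early, e.g. when the analysis is done with the file."""
        self.pin_expiry[name] = None

    def collect(self, now):
        """Remove every file whose pin is absent or expired; return their names."""
        removable = sorted(
            name for name, expiry in self.pin_expiry.items()
            if expiry is None or expiry <= now
        )
        for name in removable:
            del self.pin_expiry[name]
        return removable


# Hypothetical file names, used only for illustration.
cache = DiskCacheSketch()
for name in ("run1.MuDst.root", "run2.MuDst.root"):
    cache.add(name)
cache.pin("run1.MuDst.root", now=0.0, lifetime=100.0)
cache.pin("run2.MuDst.root", now=0.0, lifetime=10.0)
removed_first = cache.collect(now=50.0)   # run2's pin has expired by now
cache.release("run1.MuDst.root")
removed_second = cache.collect(now=60.0)  # run1 was released explicitly
```

The lifetime acts as a safety net: if a client crashes without releasing its pins, the space is still reclaimed once the lifetime passes.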

  6. Grid Collector: Architecture
     [Architecture diagram; labels recovered from the figure]
     • Servers and Clients
     • Grid Collector
     • Administrator: fetch tag file, load subset, rollback, commit
     • Index Builder (In: STAR tag file; Out: bitmap index)
     • Event Catalog (In: conditions; Out: logical files, event IDs)
     • File Locator (In: logical name; Out: physical location)
     • File Scheduler (In: physical file)
     • Replica Catalog
     • Analysis code, new query, Event Iterator
     • HRM 1, HRM 2, DRM, NFS, local disk

  7. Event Catalog
     • The Event Catalog is built from the information in the “tag files”
       • Tag contents are arbitrary (user defined) and arranged as arrays
       • e.g., Run #, Event #, production time, sum of Pt…
     • The Event Catalog (EC) also contains persistent logical file names for each event
       • In STAR, the EC persistent logical file names are composed of the tag file name and the production tag
     • As a tag file is registered with the File Catalog, its content can also be placed in the Event Catalog
     • Main operations to build the Event Catalog include: fetch tag files, load new events into the EC, roll back, and commit
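
The load, roll back and commit operations listed above suggest a staged update: new events are loaded into a pending area and become visible only on commit. The sketch below illustrates that pattern under this assumption. The class, its fields and the tag file names are hypothetical, fetching tag files is left out, and events are passed as dictionaries rather than the arrays a real tag file would hold.

```python
class EventCatalogSketch:
    """Toy catalog with staged loads; not the Grid Collector implementation."""

    def __init__(self):
        self.committed = []  # rows visible to queries
        self.pending = []    # rows loaded but not yet committed

    def load(self, logical_file, events):
        """Stage the events of one tag file; each event is a dict of tag values."""
        for event_id, tag_values in enumerate(events):
            row = {"file": logical_file, "event": event_id}
            row.update(tag_values)
            self.pending.append(row)

    def rollback(self):
        """Discard everything staged since the last commit."""
        self.pending.clear()

    def commit(self):
        """Make all staged rows visible to queries."""
        self.committed.extend(self.pending)
        self.pending.clear()


catalog = EventCatalogSketch()
catalog.load("tagfile_A", [{"SumOfPt": 12.5}, {"SumOfPt": 31.0}])
catalog.rollback()  # e.g. the tag file turned out to be incomplete
catalog.load("tagfile_B", [{"SumOfPt": 8.0}])
catalog.commit()    # only tagfile_B's event is now visible
```

Keeping the logical file name and event number on every row is what lets a query return "logical files, event IDs" rather than tag values alone.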

  8. Performance of Event Catalog
     • The Event Catalog uses compressed bitmap indices
     • The most commonly used index is B-tree
     • The most efficient one is often the projection index
     • The following table (not preserved in this transcript) reports the size and the average query processing time
       • 1-attribute, 2-attribute, and 5-attribute refer to the number of attributes in a query
     • Compressed bitmap indices are about half the size of B-trees, and are 10 times faster
     • Compressed bitmap indices are larger than projection indices, but are 3 times faster
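
To show why a bitmap index answers multi-attribute range queries cheaply, here is a minimal sketch of a binned bitmap index. It is uncompressed, so it leaves out the compression the Event Catalog relies on, and the bin edges, tag values and function names are made up for illustration. Each attribute gets one bitmap per bin (a Python integer used as a bit set), and a range condition is answered by OR-ing bins and AND-ing across attributes.

```python
def build_bitmap_index(values, bin_edges):
    """One bitmap per bin [edge[b], edge[b+1]); bit i is set if values[i] is in the bin.

    Values outside the outermost edges are not indexed in this sketch.
    """
    bitmaps = [0] * (len(bin_edges) - 1)
    for i, value in enumerate(values):
        for b in range(len(bitmaps)):
            if bin_edges[b] <= value < bin_edges[b + 1]:
                bitmaps[b] |= 1 << i
                break
    return bitmaps


def at_least(bitmaps, bin_edges, threshold):
    """Answer `value >= threshold`, where threshold must be one of the bin edges."""
    first_bin = bin_edges.index(threshold)
    result = 0
    for bitmap in bitmaps[first_bin:]:
        result |= bitmap
    return result


def to_event_ids(bitmap):
    """List the positions of the set bits."""
    return [i for i in range(bitmap.bit_length()) if (bitmap >> i) & 1]


# Hypothetical tag values for four events.
tracks = [1200, 800, 1500, 1001]
sum_pt = [25.0, 30.0, 10.0, 20.5]
track_edges = [0, 1000, 2000]
pt_edges = [0, 20, 40]

track_index = build_bitmap_index(tracks, track_edges)
pt_index = build_bitmap_index(sum_pt, pt_edges)

# numberOfPrimaryTracks >= 1000 AND SumOfPt >= 20
hits = at_least(track_index, track_edges, 1000) & at_least(pt_index, pt_edges, 20)
selected = to_event_ids(hits)
```

The query touches only the bitmaps, never the raw values, and combining conditions is a single bitwise AND. A threshold that falls inside a bin would need a check of the raw values for that bin, which this sketch does not do.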

  9. Event Catalog is Fast
     [Figure: log-log plot of query processing time for queries of different sizes]
     • The compressed bitmap index is at least 10X faster than the B-tree and 3X faster than the projection index

  10. Grid Collector: Works with Remote Clients
      • Main servers
        • Grid Collector for coordination and query interpretation
        • File Catalogs for locating multiple copies of files
        • HRM for managing storage sites including HPSS
      • Client side requires a DRM for managing local disk storage
        • An Event Iterator can access local files with or without a DRM
        • Multiple Event Iterators can work on the same set of events
      [Diagram labels: Servers, Clients]

  11. GC Requires Minimal Disk Space
      • The Event Catalog has the same information as the tag files plus some extra information
      • Since the tag files are much smaller than the other files, the size of the Event Catalog is also relatively small
      • Time to build the catalog is much less than the time to generate the tag files
      • For 13 million events in a 62 GeV production (STAR 2004):
        • Event Catalog size: 27 GB (tags: 6.0 GB, MuDST: 4.1 TB, event: 8.6 TB, raw: 14.6 TB)
        • Production time: 3.5 months, 300+ CPUs
        • Time to build the catalog: 5 days on one machine

  12. Grid Collector Speeds up Analyses
      • Test machine: 2.8 GHz Xeon, 27 MB/s read speed
      • Without Grid Collector, an analysis typically reads all events of a run
      • Speedup = time to read all events in a run / time to read selected events with Grid Collector
      • Using Grid Collector is preferred: speedup ≥ 1
      • When searching for rare events, say, selecting one event out of 1000, using GC is 20 to 50 times faster
      • Using GC to read 1/2 of the events gives speedup > 1.5; reading 1/10 of the events gives speedup > 2
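
The speedup definition on this slide can be written compactly as below. The symbols and the timings in the worked example are illustrative assumptions, not measurements from the talk. They are chosen so that the result falls inside the 20 to 50 range quoted for rare-event searches.

```latex
S = \frac{T_{\text{all}}}{T_{\text{GC}}}, \qquad
\text{e.g. } T_{\text{all}} = 100\,\text{s},\; T_{\text{GC}} = 4\,\text{s}
\;\Rightarrow\; S = \frac{100}{4} = 25
```

Here T_all is the time to read every event in a run and T_GC is the time to read only the selected events through Grid Collector.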

  13. Success Stories
      • Searching for anti-3He (Lee Barnby, Birmingham)
        • Previous studies identified collision events that possibly contain anti-3He and need further analysis
      • Searching for strangelets (Aihong Tang, BNL)
        • Previous studies identified events that behave close to strangelets and need further investigation
      • Without Grid Collector, one has to retrieve every file from mass storage systems and scan them for the wanted events, which may take weeks or months
      • With Grid Collector, both jobs completed within a day

  14. Main Benefits of Using Grid Collector
      • If you gather statistics on lots of events
        • Grid Collector allows you to work with files not already on disk
      • If you search for rare events, Grid Collector allows you to
        • Specify the events with ease
        • Access only relevant files
        • Read only selected events
      • If you want to try some analysis ideas outside of the main computer centers,
        • Grid Collector allows you to select the wanted events easily, and manages files and space for you

  15. How To Use Grid Collector
      • In STAR, use of an abstract interface StIOMaker
        • StIOMaker can now handle all files including MuDST
        • StIOMaker uses StFile (interface) -> StFileI (implementation)
      • Replace StFileI with StGridCollector
        • StIOMaker requires a StFile object
        • One currently uses “new StFile(…)” to create a StFileI object (default mode)
        • Grid Collector provides a new way, “StGridCollector::Create(SELECT geant, event WHERE …)”
      • Iterate through events as usual

  16. How To Select Events
      • Basic syntax
        • SELECT [MuDst|event|…] WHERE SumOfPt > 10 AND chargedMultiplicity > 300 AND…
      • The SELECT clause identifies the type of files to analyze
      • The WHERE clause defines the events of interest
        • The WHERE clause consists of range conditions joined with the logical operators AND, OR, NOT
        • All tags and a few MetaData Catalog key words can be used in the WHERE clause
        • Variables with multiple values can be addressed with an index, e.g., scaAnalysisMatrix[7]
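
To make the two clauses concrete, the sketch below splits a query string into the file types named after SELECT and the range conditions in the WHERE clause. It is a simplified illustration and not the Grid Collector parser: it handles only conditions joined by AND, with the comparison operators <, <=, > and >=. The function name and the returned structure are assumptions.

```python
import re

CONDITION = re.compile(r"([\w\[\]]+)\s*(<=|>=|<|>)\s*([-+.\deE]+)")


def parse_query(query):
    """Return (file_types, conditions) for 'SELECT <types> WHERE <a> AND <b> ...'."""
    match = re.fullmatch(
        r"\s*SELECT\s+(.+?)\s+WHERE\s+(.+?)\s*", query, flags=re.IGNORECASE
    )
    if match is None:
        raise ValueError("expected 'SELECT <file types> WHERE <conditions>'")
    file_types = [name.strip() for name in match.group(1).split(",")]
    conditions = []
    for term in re.split(r"\s+AND\s+", match.group(2)):
        parsed = CONDITION.fullmatch(term.strip())
        if parsed is None:
            raise ValueError("unsupported condition: " + term)
        name, operator, value = parsed.groups()
        conditions.append((name, operator, float(value)))
    return file_types, conditions


file_types, conditions = parse_query(
    "SELECT MuDst WHERE SumOfPt > 10 AND chargedMultiplicity > 300"
)
```

Indexed variables such as scaAnalysisMatrix[7] pass through as plain names here; resolving the index against the tag arrays would be the catalog's job.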

  17. Related Work
      • The initial concept of Grid Collector was developed in the Storage Access Coordination System (STACS)
        • STACS was a monolithic system
      • Other projects with similar objectives or a similar vision
        • PROOF: parallel ROOT, designed for tight clusters, requires files on disk
        • JAS3: uses a dataset catalog
        • ARDA: jobs are specified in terms of files
      • Most of these either read all events in a file or require manual file management

  18. Summary
      • Grid Collector works with the existing ROOT-based analysis framework to speed up analysis jobs
        • It is efficient and requires minimal additional storage
        • It automatically retrieves files from remote mass storage
        • Analysis tasks read only the selected events
      • Software status
        • Currently in use by STAR users (testers)
        • Capable of indexing all new events as they are produced
      • Contact information
        • John Wu, John.Wu@nersc.gov
        • Jerome Lauret, lauret@bnl.gov
        • Wei-Ming Zhang, zhang@hpaq.kent.edu
