A prototype for an extended PROOF
G. Ganis / CERN PH-SFT, June 2005
• What is PROOF?
    • ROOT analysis model …
    • … on a multi-tier architecture
• Status
• New development
    • Prototype based on XRD
    • Demo

The ROOT analysis model: Trees
• Main data structure in ROOT, extending the concept of the PAW ntuple
• Collection of independent entries
• Organized in
    • Leaves (basic type, array, C++ object)
    • Branches (collection of leaves / branches)

The ROOT analysis model: Trees (cont'd)
• Efficient access to portions of the entry data
• Several facilities to work with trees
    • Tree friends (TTree::AddFriend):
        • extend an existing tree without touching it
        • e.g. an experiment's read-only tree with user-specific branches / leaves
    • Tree chains (TChain):
        • list of trees that makes the tree size virtually unbounded (typical size of a single tree is < 2 GB)
• In all cases the result behaves exactly as a single tree

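How a list of trees can behave as a single tree can be sketched in plain C++, without ROOT. The `ChainIndex` class below is hypothetical and is not how TChain is implemented; it only illustrates, under that assumption, how a global entry number can be mapped onto a (tree, local entry) pair from the per-tree entry counts.

```cpp
#include <algorithm>
#include <cstddef>
#include <stdexcept>
#include <utility>
#include <vector>

// Hypothetical helper (not part of ROOT): maps a global entry number in a
// chain onto the index of the tree that holds it and the entry number
// inside that tree.
class ChainIndex {
public:
    // Register a tree with the given number of entries.
    void AddTree(long long nEntries) {
        if (nEntries < 0) throw std::invalid_argument("negative entry count");
        long long last = fOffsets.empty() ? 0 : fOffsets.back();
        fOffsets.push_back(last + nEntries);  // cumulative end offset
    }

    // Total number of entries, as if the chain were one tree.
    long long GetEntries() const {
        return fOffsets.empty() ? 0 : fOffsets.back();
    }

    // Returns {tree index, local entry}; throws if 'entry' is out of range.
    std::pair<std::size_t, long long> Locate(long long entry) const {
        if (entry < 0 || entry >= GetEntries())
            throw std::out_of_range("entry outside the chain");
        // First tree whose cumulative end offset is strictly above 'entry'.
        auto it = std::upper_bound(fOffsets.begin(), fOffsets.end(), entry);
        std::size_t tree = static_cast<std::size_t>(it - fOffsets.begin());
        long long first = (tree == 0) ? 0 : fOffsets[tree - 1];
        return {tree, entry - first};
    }

private:
    std::vector<long long> fOffsets;  // cumulative entry counts
};
```

With trees of 100 and 50 entries, global entry 100 resolves to tree 1, local entry 0, so the caller never needs to know where one file ends and the next begins.
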
The ROOT analysis model: Selector
• TSelector: main tool to define the data processing strategy
• Simple structure
• Framework automatically generated for a tree
    • tree->MakeSelector("MySelector")
• Read only what is needed by the algorithm

    void MySelector::Begin(TTree *tree)
    {
       // Method called before starting the event loop
       fPtBranch = tree->GetBranch("pt");
       fPtBranch->SetAddress(&fPt);
       fPtHist = new TH1F("Pt", "Pt", 100, 0., 400.);
    }

    Bool_t MySelector::Process(Long64_t entry)
    {
       // Method called for each entry in the tree
       fPtBranch->GetEntry(entry);   // read only the "pt" branch
       fPtHist->Fill(fPt);
       return kTRUE;
    }

    void MySelector::Terminate()
    {
       // Method called when the event loop is over
       fPtHist->Draw();
    }

The ROOT analysis model: h1 analysis example

    {
       // localProcessing.C
       // Define the data set
       TChain a("h42");
       a.Add("/home/ganis/rootdata/dstarmb.root");
       a.Add("/home/ganis/rootdata/dstarp1a.root");
       a.Add("/home/ganis/rootdata/dstarp1b.root");
       a.Add("/home/ganis/rootdata/dstarp2.root");
       // Process the selector
       a.Process("h1analysis.C");
    }

    root [0] .x localProcessing.C
    Starting h1analysis with process option:
    Starting h1analysis with process option:
    Processing file: /home/ganis/rootdata/dstarmb.root
    Processing file: /home/ganis/rootdata/dstarp1a.root
    Processing file: /home/ganis/rootdata/dstarp1b.root
    Processing file: /home/ganis/rootdata/dstarp2.root
     FCN=70.4023 FROM MIGRAD    STATUS=CONVERGED     220 CALLS         221 TOTAL
                         EDM=1.37834e-08    STRATEGY= 1      ERROR MATRIX ACCURATE
      EXT PARAMETER                                   STEP         FIRST
      NO.   NAME      VALUE            ERROR          SIZE      DERIVATIVE
       1  p0           9.59988e+05   9.07051e+04   7.92857e+01  -2.69331e-09
       2  p1           3.51130e-01   2.32881e-02   4.69706e-05   5.29292e-03
       3  p2           1.18502e+03   5.95938e+01   6.72112e-01   2.29626e-06
       4  p3           1.45569e-01   5.93851e-05   8.69320e-07  -1.75027e+00
       5  p4           1.24388e-03   6.63103e-05   7.86533e-07  -6.72432e-01
    Real time 0:00:17.563133, CP time 5.880

PROOF
• Why?
    • The data to be analyzed can only rarely all be local
    • Transferring full data sets takes time
• Goal: provide a tool for interactive analysis on a heterogeneous cluster
    • exploit the independence of the entries in a tree
    • basic parallelism achieved by splitting the data into packets of variable size, distributed to the participating nodes
• Focus on:
    • Transparency
        • same selectors, … on PROOF as in a local session
    • Scalability
        • linear scaling up to a large number of workers (tested up to 1000)
    • Adaptability
        • cope automatically with different cluster configurations and varying running conditions / performance
• Motto: bring the KiloBytes to the PetaBytes and not the PetaBytes to the KiloBytes

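The idea of packets of variable size can be sketched in standalone C++. This is a minimal model under stated assumptions, not the PROOF packetizer: the names `Packet` and `Packetizer`, the `targetSeconds` parameter and the rule "packet size = measured rate × target time" are all hypothetical choices made for illustration. Workers pull a packet whenever they are idle, so faster workers naturally end up processing more entries.

```cpp
#include <algorithm>
#include <vector>

// A contiguous range of tree entries handed to one worker.
struct Packet {
    long long first;  // first entry of the packet
    long long num;    // number of entries (0 means "no more work")
};

// Hypothetical pull-based packetizer: an idle worker asks for work and gets
// a packet sized so that it should take about 'targetSeconds' to process.
class Packetizer {
public:
    Packetizer(long long totalEntries, double targetSeconds, long long minSize)
        : fNext(0),
          fTotal(std::max<long long>(0, totalEntries)),
          fTarget(targetSeconds),
          fMin(std::max<long long>(1, minSize)) {}

    // 'rate' is the worker's measured speed in entries per second.
    Packet NextPacket(double rate) {
        long long wanted = static_cast<long long>(rate * fTarget);
        long long size = std::max(fMin, wanted);   // never below the minimum
        size = std::min(size, fTotal - fNext);     // do not run past the end
        Packet p{fNext, size};
        fNext += size;
        return p;
    }

    bool Done() const { return fNext >= fTotal; }

private:
    long long fNext;   // first entry not yet assigned
    long long fTotal;  // total number of entries in the data set
    double fTarget;    // desired processing time per packet
    long long fMin;    // smallest packet worth sending
};
```

Because the entries are independent, any such partition gives the same result; the sizing rule only affects load balance.
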
PROOF: connection layer
[Diagram: the client connects to proofd on the master; the master connects to proofd on slave 1 … slave n. On each node the parent proofd fork()s a child proofd, which execv()s into proofserv (master) or proofslave (slaves).]
• parent proofd: always running
• child proofd: transforms into proofserv / proofslave
• proofserv / proofslave: TProofServ instances

PROOF: data access strategies
• Each slave is assigned, as far as possible, packets representing data in local files
• If there is no (more) local data, it gets remote data via (x)rootd, rfiod or dCache (needs a good LAN, e.g. Gigabit Ethernet)
• In case of SAN/NAS, just use a round-robin strategy

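The local-first rule can be illustrated with a small standalone C++ function. `FileSlot` and `NextFileFor` are hypothetical names, and the fallback policy (take the first unassigned remote file) is an assumption made for this sketch rather than the behaviour of the actual PROOF code.

```cpp
#include <cstddef>
#include <string>
#include <vector>

// One file of the data set and the node that stores it.
struct FileSlot {
    std::string host;  // node holding the file
    std::string path;
    bool assigned;     // true once a worker has taken the file
};

// Hypothetical locality-first choice: returns the index of the file the
// worker running on 'workerHost' should process next, or -1 if nothing is
// left. Local files are preferred; remote ones are only a fallback.
int NextFileFor(const std::string &workerHost, std::vector<FileSlot> &files) {
    int remote = -1;
    for (std::size_t i = 0; i < files.size(); ++i) {
        if (files[i].assigned) continue;
        if (files[i].host == workerHost) {
            files[i].assigned = true;  // local data: take it at once
            return static_cast<int>(i);
        }
        if (remote < 0) remote = static_cast<int>(i);  // remember first remote
    }
    if (remote >= 0) files[remote].assigned = true;    // read over the network
    return remote;
}
```

A worker on a node that holds two of three files would receive its two local files first and only then the remote one.
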
PROOF: processing algorithms
• TSelector adapted to PROOF
• Natural additions
    • Input list: code to be run, …
    • Output list: results
    • Methods to initialize and finalize processing within a slave
    • Method to init a tree
• The output list defines the objects wanted back
• Objects with a Merge() method are automatically merged in Terminate
• The modified TSelector also works in non-PROOF sessions

    void MySelector::Begin(TTree *tree)
    {
       // Called in the client for local inits
    }

    void MySelector::SlaveBegin(TTree *tree)
    {
       // Called in each slave before processing
       fPtHist = new TH1F("Pt", "Pt", 100, 0., 400.);
       fOutput->Add(fPtHist);   // defines the list of objects wanted back
    }

    void MySelector::Init(TTree *tree)
    {
       // Called at each tree change
       fPtBranch = tree->GetBranch("pt");
       fPtBranch->SetAddress(&fPt);
    }

    Bool_t MySelector::Process(Long64_t entry)
    {
       // Called for each entry in the tree
       fPtBranch->GetEntry(entry);
       fPtHist->Fill(fPt);
       return kTRUE;
    }

    void MySelector::SlaveTerminate()
    {
       // Called in each slave after processing
    }

    void MySelector::Terminate()
    {
       // Called in the client after processing; the merged histogram
       // is looked up in the output list before drawing
       fPtHist = (TH1F *) fOutput->FindObject("Pt");
       if (fPtHist) fPtHist->Draw();
    }

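The merging of output lists can be sketched without ROOT. In the standalone C++ below, `Hist`, `OutputList` and `MergeOutputs` are hypothetical stand-ins for a mergeable object (such as a histogram), the selector's output list and the merging step; the sketch assumes that objects are matched by name and that only objects with identical binning can be merged.

```cpp
#include <cstddef>
#include <map>
#include <stdexcept>
#include <string>
#include <vector>

// Toy fixed-binning histogram standing in for a mergeable output object.
struct Hist {
    double lo, hi;
    std::vector<long long> bins;

    Hist(std::size_t n, double l, double h) : lo(l), hi(h), bins(n, 0) {}

    void Fill(double x) {
        if (x < lo || x >= hi) return;  // under/overflow ignored in this sketch
        std::size_t i =
            static_cast<std::size_t>((x - lo) / (hi - lo) * bins.size());
        if (i >= bins.size()) i = bins.size() - 1;  // guard against rounding
        ++bins[i];
    }

    // Add the contents of another histogram with identical binning.
    void Merge(const Hist &other) {
        if (other.bins.size() != bins.size() || other.lo != lo || other.hi != hi)
            throw std::invalid_argument("incompatible binning");
        for (std::size_t i = 0; i < bins.size(); ++i) bins[i] += other.bins[i];
    }
};

using OutputList = std::map<std::string, Hist>;

// Combine the output lists returned by the slaves into a single list:
// objects with the same name are merged, new names are copied.
OutputList MergeOutputs(const std::vector<OutputList> &perSlave) {
    OutputList merged;
    for (const OutputList &out : perSlave) {
        for (const auto &kv : out) {
            auto it = merged.find(kv.first);
            if (it == merged.end()) merged.insert(kv);
            else it->second.Merge(kv.second);
        }
    }
    return merged;
}
```

Since bin contents simply add up, the merged histogram is the same as the one a single local session would have filled.
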
PROOF: the data
• Data set: dedicated class TDSet
• Specifies a collection of files with objects
• Understands logical file names
• Could be returned by a query to a database or file catalog or …
• API very close to TChain

    {
       // proofProcessing.C
       // Define the data set
       TDSet a("TTree", "h42");
       a.Add("root://oplapro62.cern.ch//tmp/dstarmb.root");
       a.Add("root://oplapro62.cern.ch//tmp/dstarp1a.root");
       a.Add("root://oplapro62.cern.ch//tmp/dstarp1b.root");
       a.Add("root://oplapro62.cern.ch//tmp/dstarp2.root");
       // Process the selector
       a.Process("h1analysis.C");
    }

PROOF: running the query
Executing …

    root[0] gROOT->Proof("pcepsft43.cern.ch")
    PROOF set to parallel mode (10 slaves)
    root[1] .x proofProcessing.C
    Starting h1analysis with process option:
    Starting h1analysis with process option:
    Processing file: /tmp/ganis/rootdata/dstarp1a.root
    Processing file: /tmp/ganis/rootdata/dstarp2.root
    Starting h1analysis with process option:
    Processing file: //tmp/ganis/rootdata/dstarmb.root
    Processing file: //tmp/ganis/rootdata/dstarp1b.root
    Processing file: //tmp/ganis/rootdata/dstarp2.root
     FCN=70.4023 FROM MIGRAD    STATUS=CONVERGED     220 CALLS         221 TOTAL
                         EDM=1.37834e-08    STRATEGY= 1      ERROR MATRIX ACCURATE
      EXT PARAMETER                                   STEP         FIRST
      NO.   NAME      VALUE            ERROR          SIZE      DERIVATIVE
       1  p0           9.59988e+05   9.07051e+04   7.92857e+01  -2.69331e-09
       2  p1           3.51130e-01   2.32881e-02   4.69706e-05   5.29292e-03
       3  p2           1.18502e+03   5.95938e+01   6.72112e-01   2.29626e-06
       4  p3           1.45569e-01   5.93851e-05   8.69320e-07  -1.75027e+00
       5  p4           1.24388e-03   6.63103e-05   7.86533e-07  -6.72432e-01
    root[2]

PROOF: additional features
• Possibility to upload and / or build additional packages
    • packed as PAR files (Proof ARchive, like Java JAR …)
    • gProof->UploadPackage("MyPackage.par")
    • gProof->EnablePackage("MyPackage")
• Cache system to minimize the number of file transfers
    • File identity and integrity checked using message digest technology
• Feedback information at configurable time intervals

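How a digest can save file transfers is shown in the standalone C++ sketch below. The `FileCache` class is hypothetical, and the FNV-1a hash is used only as a simple stand-in for a real message digest: it is an assumption of this example and says nothing about which digest PROOF uses.

```cpp
#include <cstdint>
#include <map>
#include <string>

// FNV-1a 64-bit hash, used here only as a simple stand-in for a real
// message digest; it is NOT cryptographically strong.
std::uint64_t Fnv1a(const std::string &data) {
    std::uint64_t h = 14695981039346656037ULL;  // FNV offset basis
    for (unsigned char c : data) {
        h ^= c;
        h *= 1099511628211ULL;                  // FNV prime
    }
    return h;
}

// Hypothetical cluster-side cache: remembers the digest of each file it
// already holds so that identical uploads can be skipped.
class FileCache {
public:
    // Returns true if the content had to be transferred, false on a hit.
    bool Upload(const std::string &name, const std::string &content) {
        std::uint64_t digest = Fnv1a(content);
        auto it = fDigests.find(name);
        if (it != fDigests.end() && it->second == digest) return false;
        fDigests[name] = digest;  // new or modified file: store it
        ++fTransfers;
        return true;
    }

    int Transfers() const { return fTransfers; }

private:
    std::map<std::string, std::uint64_t> fDigests;  // file name -> digest
    int fTransfers = 0;                             // number of real uploads
};
```

Uploading the same package twice costs one transfer; changing its content changes the digest and triggers a new one.
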
PROOF: realtime feedback
[Screenshot with two annotations:]
• Chain definition (header) is fetched from the PROOF master
• Feedback histogram, updated every (e.g.) 1 second

PROOF on clusters
• PROOF can use “resource brokers” to find out where to start the slaves
• PROOF can use file catalogs to locate the files to be analysed
• Concrete examples:
    • Interface with the Condor Computing-On-Demand system
        • the master starts the slaves as COD jobs
    • PEAC: PROOF-Enabled Analysis Cluster
        • Complete event analysis solution: data catalog, resource broker, PROOF
    • TGrid: abstract Grid interface for all Grid services
        • Concrete implementation for AliEn

    // Connect
    TGrid *alien = TGrid::Connect("alien");
    // Query
    TGridResult *res = alien->Query("lfn:///alice/simulation/2001-04/V0.6*.root");
    // Data set
    TDSet *treeset = new TDSet("TTree", "AOD");
    treeset->Add(res);
    // Use files in result set to find remote nodes
    gROOT->Proof(res);
    treeset->Process("myselector.C");

PROOF: current limitations
• Originally intended for short queries
    • TDSet::Process blocks until the query is done
    • Stateful connection
        • everything is lost if the connection is lost or cut
• Originally designed for a local cluster
    • static configuration
• Robustness of some components
    • Interrupt control flow based on Out-Of-Band messages
    • Authentication when different protocols are required at different steps
    • Sandbox when a user account is not available
• Documentation

PROOF: team for new developments
• Maarten Ballintijn
• Marek Biskup
• Rene Brun
• Derek Feichtinger (ARDA)
• G.G.
• Guenter Kickinger
• Andreas Peters (ARDA)
• Fons Rademakers

PROOF: new development fields
• Interactive batch
    • stateless connection
    • non-blocking queries
• Robustness
    • Get rid of OOB messages
• Setup / configuration issues
    • zero-config setup
    • allow slaves to come and go
• Grid interfacing
    • efficient use of grid information (catalogs, resource brokers, …)
• Performance issues
    • targeted read-ahead, improved caching, query estimators
• Authentication
    • Adopt the XROOTD framework
• Analysis issues:
    • Tree friends, event lists, indices
• GUI, browsing

XPD: communication layer for PROOF based on XROOTD
• Transfer of state from the client to the PROOF cluster requires a manager on the cluster side keeping track of existing sessions and query submissions
• XROOTD (in ROOT since v 4.01.02) provides a generic main component (xrd) for handling networking issues and protocol scheduling, and utility tools (forking, error handling, security, …) on which the manager can be based
• Candidate to introduce
    • interactive-batch mode: possibility to leave a session if a query takes too long and reconnect later to pick up the results
    • non-blocking query submission: possibility to detach from the query while it is being processed (even for potentially short queries)
    • a more robust authentication system

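The role of a cluster-side manager can be sketched in standalone C++. `SessionManager`, `Query` and `QueryState` are hypothetical names; the sketch only assumes that the query state lives on the cluster, so that submission returns at once and the result can be fetched after a reconnect. Processing itself is simulated by an explicit `Complete()` call.

```cpp
#include <map>
#include <stdexcept>
#include <string>

enum class QueryState { Submitted, Done };

struct Query {
    std::string selector;  // what the client asked to run
    QueryState state;
    std::string result;    // filled when the query completes
};

// Hypothetical cluster-side manager: it owns the query state, so a client
// can disconnect after Submit() and come back later for the result.
class SessionManager {
public:
    // Non-blocking submission: registers the query and returns its id.
    int Submit(const std::string &selector) {
        int id = fNextId++;
        fQueries[id] = Query{selector, QueryState::Submitted, ""};
        return id;
    }

    // Called by the processing side; completion is simulated in this sketch.
    void Complete(int id, const std::string &result) {
        Query &q = Find(id);
        q.state = QueryState::Done;
        q.result = result;
    }

    QueryState State(int id) { return Find(id).state; }

    // Pick up the result after reconnecting; throws if it is not ready yet.
    std::string Retrieve(int id) {
        Query &q = Find(id);
        if (q.state != QueryState::Done)
            throw std::runtime_error("query not finished");
        return q.result;
    }

private:
    Query &Find(int id) {
        auto it = fQueries.find(id);
        if (it == fQueries.end()) throw std::out_of_range("unknown query id");
        return it->second;
    }

    std::map<int, Query> fQueries;  // survives client disconnections
    int fNextId = 1;
};
```

The client keeps only the query id; everything else stays on the cluster, which is what makes leaving and re-attaching possible.
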
How does XROOTD work
• Multi-component server based on a multi-threaded architecture
• xrd component: provides networking, thread management, protocol scheduling
• Minimal set of threads:
    • Acceptor: opens the connection; matches the protocol; submits a job to the scheduler
    • Pollers: react to any activity on open links; submit jobs to the scheduler
    • Scheduler: schedules the work to be done (jobs)
    • Worker(s): wait for jobs to execute
    • Buffer manager: dynamically optimizes the use of memory buffers
• Workers are created / destroyed as needed
• Links are not attached to a specific worker: the first free worker takes the job
• Jobs ≡ data/information to be processed for a given link

How does XROOTD work
[Diagram of the XROOTD server; labels: accept, poller, links, scheduler, XrdJob, WN, BM, XrdXrootdProtocol, files]
• one XrdXrootdProtocol instance per physical connection (i.e. per client session)
• client gateway to the files: used to communicate with all the files the client wants to access on that specific server

How does XPROOFD work
[Diagram of the XPROOFD server; labels: accept, poller, links, scheduler, XrdJob, WN, static area, XrdProofdProtocol, proofserv]
• one XrdProofdProtocol instance per physical connection (i.e. per client session)
• client gateway to proofserv
• static area keeps all the relevant information about a user and their activities on the cluster

XPROOFD: communication layer
[Diagram: the client talks to XrdProofd on the master, which fork()s proofserv; proofserv talks to XrdProofd on slave 1 … slave n, each of which fork()s a proofslave. Other labels in the diagram: xc, PO, TXPSocket, XRD pollers.]

Basic ingredients
• Client side:
    • new class TXPSocket
        • TSocket interface understanding the new communication protocol
    • new class TXProofMgr
        • reflects the status of a client vis-à-vis a given cluster
        • start / attach sessions, described by TProof instances (no longer unique)
• Server side:
    • new implementation of XrdProtocol, XrdProofdProtocol
        • client gateway to the cluster, one-to-one relation to TXProofMgr
        • static area describing the persistent information (server lifetime)
    • new class XrdProofSrv
        • proxy to the external processor (proofserv), submitted queries, results, …
        • one per external processor

TXPSocket
• Separate thread for receiving messages
• Intensive use of unsolicited messages
    • normal asynchronous messages (i.e. in Collect)
    • interrupts (no OOB)
    • ping functionality
• Synchronous and asynchronous messages posted in separate queues
• Interrupt handler woken up with an internal SIGURG (from reader to main thread)
• Ping treated as a special interrupt (level 0)

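The routing of incoming messages can be modelled in a few lines of standalone C++. This is a single-threaded sketch under stated assumptions: `Demux`, `Message` and `MsgKind` are hypothetical names, there is no reader thread and no signal, and a non-ping interrupt is merely recorded where the real socket would wake up the main thread.

```cpp
#include <cstddef>
#include <deque>
#include <string>

enum class MsgKind { Sync, Async, Interrupt };

struct Message {
    MsgKind kind;
    int level;         // only meaningful for interrupts; 0 means ping
    std::string body;
};

// Hypothetical model of the reader's dispatch step: synchronous and
// asynchronous messages go to separate queues, interrupts are recorded,
// and a level-0 interrupt is only counted as a ping.
class Demux {
public:
    void OnReceive(const Message &m) {
        switch (m.kind) {
        case MsgKind::Sync:
            fSync.push_back(m);
            break;
        case MsgKind::Async:
            fAsync.push_back(m);
            break;
        case MsgKind::Interrupt:
            if (m.level == 0) ++fPings;                   // ping
            else fPendingInterrupts.push_back(m.level);   // to be handled
            break;
        }
    }

    // Pop the oldest queued message; return false if the queue is empty.
    bool NextSync(Message &out) { return Pop(fSync, out); }
    bool NextAsync(Message &out) { return Pop(fAsync, out); }

    int Pings() const { return fPings; }
    std::size_t PendingInterrupts() const { return fPendingInterrupts.size(); }

private:
    static bool Pop(std::deque<Message> &q, Message &out) {
        if (q.empty()) return false;
        out = q.front();
        q.pop_front();
        return true;
    }

    std::deque<Message> fSync, fAsync;
    std::deque<int> fPendingInterrupts;  // levels of unhandled interrupts
    int fPings = 0;
};
```

Keeping the two queues apart means that a caller waiting for a synchronous reply is never handed an unsolicited asynchronous message by mistake.
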
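TXPSocket: Reader thread
[Diagram: the reader thread recv()s from the TCP connection and dispatches what it reads: interrupts raise SIGURG; sync and async messages are posted to their queues (post event).]
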
XPD: Demo!
• Results achieved with the realistic prototype
    • Multi-sessions
    • Disconnect / Reconnect
    • Process: blocking query
    • Submit: non-blocking query
    • Finalize results from different sessions
    • Archive results to /afs using the same daemon as file server

XPD: what next
• Deep test of the communication layer
    • latencies
    • synchronization problems
• Test with a large, realistic number of slaves
• Alternatives for the internal connection
• Enable authentication
• XROOTD load balancing?

Other studies
• Advanced prototype using a communication layer based on memory-mapped message queue technology (A. Peters, D. Feichtinger):
    • full state in message queues
        • nice recovery features
    • multi-thread master
        • queue insertion, configuration, scheduler, packetizer
    • client frontend
    • slave splitting in supervisor and processors
        • not attached to a specific user
        • better use of resources

Summary
• A lot of activity is going on to improve the PROOF system
• A working prototype with a communication layer based on XROOTD exists
    • interactive batch, multi-session, reconnect
• Alternative studies may provide good solutions for some issues
• Goal: have the new system in good shape for ROOT05