ACAT'03 PROOF is a collaboration between CERN and MIT Heavy Ion Group to design a system for interactive analysis of large sets of ROOT data files on clusters. It employs parallelism to speed up query processing and aims for transparency, scalability, and adaptability.
PROOF and Condor Fons Rademakers http://root.cern.ch ACAT'03
PROOF – Parallel ROOT Facility • Collaboration between the core ROOT group at CERN and the MIT Heavy Ion Group • Part of and based on the ROOT framework • Makes heavy use of ROOT networking and other infrastructure classes ACAT'03
Main Motivation • Design a system for the interactive analysis of very large sets of ROOT data files on a cluster of computers • The main idea is to speed up the query processing by employing parallelism • In the GRID context, this model will be extended from a local cluster to a wide area “virtual cluster”. The emphasis in that case is not so much on interactive response as on transparency • With a single query, a user can analyze a globally distributed data set and get back a “single” result • The main design goals are: • Transparency, scalability, adaptability ACAT'03
Parallel Chain Analysis
[Diagram: a local PC runs ROOT against a remote PROOF cluster. The #proof.conf file lists the slaves (slave node1 … slave node4); one proof server acts as master, the others as slave servers, each reading its local *.root files through TFile, while the client reaches the cluster via TNetFile; results (stdout/objects) are returned to the client.]
Session on the local PC, shown in three steps:
$ root
root [0] tree.Process("ana.C")
root [1] gROOT->Proof("remote")
root [2] chain.Process("ana.C")
ACAT'03
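The session above assumes a chain has already been defined. A minimal sketch of how such a chain might be built before being processed via PROOF; the tree name "AOD" and the file URLs are hypothetical, not from the slide:

// Build a chain of trees spread over the cluster's files, then run the
// same analysis locally or, after gROOT->Proof(), in parallel on the cluster.
TChain chain("AOD");                        // hypothetical tree name
chain.Add("root://node1//data/run1.root");  // files served by rootd on the nodes
chain.Add("root://node2//data/run2.root");
gROOT->Proof("remote");                     // connect to the remote PROOF cluster
chain.Process("ana.C");                     // PROOF distributes the work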
PROOF - Architecture • Data Access Strategies • Local data first, also rootd, rfio, dCache, SAN/NAS • Transparency • Input objects copied from the client • Output objects merged, returned to the client • Scalability and Adaptability • Vary packet size (specific workload, slave performance, dynamic load) • Heterogeneous Servers • Support for multi-site configurations ACAT'03
Workflow For Tree Analysis – Pull Architecture
[Sequence diagram: the master sends Process("ana.C") to Slave 1 … Slave N; after initialization the master's packet generator serves GetNextPacket() requests, handing out entry ranges such as (0,100), (100,100), (200,100), (300,40), (340,100), (440,50), (490,100), (590,60) to whichever slave asks next; each slave processes its packet and asks again. When the data set is exhausted, each slave returns its results with SendObject(histo); the master adds the histograms, displays them, and master and slaves wait for the next command.]
ACAT'03
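A toy, self-contained sketch of the pull model in the diagram (not PROOF code; the entry count, fixed packet size and round-robin "slave" are invented for illustration, whereas real PROOF packets vary in size and go to whichever slave is idle):

#include <cstdio>

// Toy packet generator: hands out entry ranges of at most 100 entries
// until the total number of entries is exhausted.
struct PacketGenerator {
   long fNext  = 0;
   long fTotal = 650;                           // invented total number of entries
   bool GetNextPacket(long &first, long &n) {
      if (fNext >= fTotal) return false;        // no more work: slaves return results
      first = fNext;
      n = (fTotal - fNext > 100) ? 100 : fTotal - fNext;
      fNext += n;
      return true;
   }
};

int main() {
   PacketGenerator master;
   long first, n;
   int slave = 0;
   while (master.GetNextPacket(first, n)) {     // in PROOF, the idle slave asks
      printf("slave %d processes entries [%ld, %ld)\n", slave + 1, first, first + n);
      slave = (slave + 1) % 2;                  // pretend there are two slaves
   }
   printf("slaves send their histograms; master merges and displays them\n");
   return 0;
}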
Data Access Strategies • Each slave gets assigned, as far as possible, packets representing data in files local to that slave • If there is no (more) local data, it gets remote data via rootd, rfiod or dCache (needs a good LAN, e.g. Gigabit Ethernet) • In the SAN/NAS case, a simple round-robin strategy is used (see the sketch below) ACAT'03
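A minimal sketch of the local-data-first rule described above (not the actual PROOF packetizer; FileSlice and NextSliceFor are invented names): when a slave asks for work, prefer a remaining slice of a file stored on that slave, otherwise fall back to a remote file.

#include <string>
#include <vector>

// One contiguous chunk of entries still to be processed from one file.
struct FileSlice {
   std::string host;   // node that holds the file locally
   std::string path;   // file name
   long        nLeft;  // entries left in this slice
};

// Pick the next slice for 'slave': a local file if one remains, else any
// remote file (to be read via rootd/rfiod/dCache). Returns nullptr when done.
FileSlice *NextSliceFor(const std::string &slave, std::vector<FileSlice> &slices) {
   FileSlice *remote = nullptr;
   for (auto &s : slices) {
      if (s.nLeft <= 0) continue;       // slice already finished
      if (s.host == slave) return &s;   // local data first
      if (!remote) remote = &s;         // remember a remote fallback
   }
   return remote;
}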
Additional Issues • Error handling • Death of master and/or slaves • Ctrl-C interrupt • Authentication • Globus, ssh, kerb5, SRP, clear passwd, uid/gid matching • Sandbox and package manager • Remote user environment ACAT'03
Running a PROOF Job • Specify a collection of TTrees or files with objects:
root [0] gROOT->Proof("cluster.cern.ch");
root [1] TDSet *set = new TDSet("TTree", "AOD");
root [2] set->AddQuery("lfn:/alice/simulation/2003-04", "V0.6*.root");
…
root [10] set->Print("a");
root [11] set->Process("mySelector.C");
• Returned by a DB or File Catalog query etc. • Use logical filenames ("lfn:…") ACAT'03
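For comparison, a data set can also be assembled without a catalog query; a minimal sketch using explicit files (the file URLs and tree name are hypothetical, not from the slide):

root [0] gROOT->Proof("cluster.cern.ch");
root [1] TDSet *set = new TDSet("TTree", "AOD");
root [2] set->Add("root://node1//data/aod_001.root");   // physical file names
root [3] set->Add("root://node2//data/aod_002.root");
root [4] set->Process("mySelector.C");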
The Selector • Basic ROOT TSelector • Created via TTree::MakeSelector()
// Abbreviated version
class TSelector : public TObject {
protected:
   TList *fInput;    // objects sent from the client to each slave
   TList *fOutput;   // objects returned by the slaves and merged
public:
   void   Init(TTree*);
   void   Begin(TTree*);
   void   SlaveBegin(TTree*);
   Bool_t Process(int entry);
   void   SlaveTerminate();
   void   Terminate();
};
ACAT'03
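A hedged sketch of what a user selector such as mySelector.C from the previous slide might look like against the abbreviated interface above; the class name, histogram and "pt" branch are invented for illustration, and the real generated skeleton differs in detail:

#include "TSelector.h"
#include "TTree.h"
#include "TH1F.h"
#include "TList.h"

class MySelector : public TSelector {
   TTree  *fTree;   // tree/chain being processed on this slave
   TH1F   *fHpt;    // output histogram
   Float_t fPt;     // branch buffer
public:
   MySelector() : fTree(0), fHpt(0), fPt(0) { }
   void Init(TTree *tree) {                  // a new file/tree is attached
      fTree = tree;
      fTree->SetBranchAddress("pt", &fPt);
   }
   void Begin(TTree *) { }                   // runs once on the client
   void SlaveBegin(TTree *) {                // runs once on every slave
      fHpt = new TH1F("hpt", "p_{T}", 100, 0., 10.);
      fOutput->Add(fHpt);                    // register for merging on the master
   }
   Bool_t Process(int entry) {               // called for each entry of a packet
      fTree->GetEntry(entry);
      fHpt->Fill(fPt);
      return kTRUE;
   }
   void SlaveTerminate() { }                 // per-slave cleanup
   void Terminate() {                        // runs on the client after merging
      TH1F *h = (TH1F*) fOutput->FindObject("hpt");
      if (h) h->Draw();
   }
};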
PROOF Scalability • Data set: 8.8 GB in 128 files, 9 million events; each node holds its share locally (4 files, 277 MB per node) • 1 node: 325 s • 32 nodes in parallel: 12 s, a speedup of about 27 (~85% parallel efficiency) • Hardware: 32 nodes, each with dual 1 GHz Itanium II CPUs, 2 GB RAM, 2 x 75 GB 15K SCSI disks, Fast Ethernet ACAT'03
PROOF and Data Grids • Many services are a good fit • Authentication • File Catalog, replication services • Resource brokers • Job schedulers • Monitoring • Use abstract interfaces (see the sketch below) ACAT'03
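One way to read "use abstract interfaces": PROOF talks to each Grid service through an abstract base class, and every Grid provides a concrete implementation. A purely illustrative sketch, with invented class and method names (not a real ROOT interface):

#include <string>
#include <vector>

// Hypothetical abstract file-catalog interface: PROOF code would only see
// this base class; AliEn, SAM, or any other catalog would implement it.
class VirtualFileCatalog {
public:
   virtual ~VirtualFileCatalog() { }
   // Resolve a logical file name ("lfn:...") to the URLs of its replicas.
   virtual std::vector<std::string> GetReplicas(const std::string &lfn) = 0;
};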
The Condor Batch System • Full-featured batch system • Job queuing, scheduling policy, priority scheme, resource monitoring and management • Flexible, distributed architecture • Dedicated clusters and/or idle desktops • Transparent I/O and file transfer • Based on 15 years of advanced research • Platform for ongoing CS research • Production quality, in use around the world, pools of 100s to 1000s of nodes • See: http://www.cs.wisc.edu/condor ACAT'03
COD - Computing On Demand • Active, ongoing research and development • Share batch resources with interactive use • Most of the time: normal Condor batch use • An interactive job "borrows" the resource for a short time • Integrated into the Condor infrastructure • Benefits • Large amount of resources for interactive bursts • Efficient use of resources (100% utilization) ACAT'03
COD - Operations
[Diagram: state of a node as COD operations are applied. Normal batch: batch job running. Request claim: batch job keeps running. Activate claim: COD job runs, batch job suspended. Suspend claim: batch job runs again. Resume: COD job runs, batch job suspended. Deactivate: batch job runs. Release: node back to normal batch use.]
ACAT'03
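A schematic model of the operation sequence in the diagram, expressed as node-state transitions (not Condor code; names and messages are invented):

#include <cstdio>

enum NodeState { kBatchRunning, kCodClaimed, kCodRunning };

struct Node {
   NodeState state = kBatchRunning;
   void Request()    { state = kCodClaimed;   printf("claim requested; batch job still running\n"); }
   void Activate()   { state = kCodRunning;   printf("COD job running; batch job suspended\n"); }
   void Suspend()    { state = kCodClaimed;   printf("COD job suspended; batch job resumes\n"); }
   void Resume()     { state = kCodRunning;   printf("COD job resumed; batch job suspended\n"); }
   void Deactivate() { state = kCodClaimed;   printf("COD job stopped; batch job resumes\n"); }
   void Release()    { state = kBatchRunning; printf("claim released; node fully back to batch\n"); }
};

int main() {
   Node n;   // walk through the sequence shown on the slide
   n.Request(); n.Activate(); n.Suspend(); n.Resume(); n.Deactivate(); n.Release();
   return 0;
}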
PROOF and COD • Integrate PROOF and Condor COD • Great cooperation with the Condor team • The master starts the slaves as COD jobs • Standard connection from master to slaves • The master resumes and suspends the slaves as needed around queries (see the sketch below) • Use Condor or an external resource manager to allocate nodes (Condor VMs) ACAT'03
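A rough sketch of the resume/suspend-around-queries idea above (invented names; not the actual PROOF master code): the master wakes its COD-claimed slaves only for the duration of a query, so the nodes do normal batch work in between.

#include <cstdio>
#include <vector>

// Stand-in for a slave held via a Condor COD claim.
struct CodSlave {
   void Resume()  { printf("COD claim resumed: slave active, batch job suspended\n"); }
   void Suspend() { printf("COD claim suspended: batch job gets the node back\n"); }
};

void RunQuery(std::vector<CodSlave> &slaves) {
   for (size_t i = 0; i < slaves.size(); i++) slaves[i].Resume();
   printf("query processed in parallel on %zu slaves\n", slaves.size());
   for (size_t i = 0; i < slaves.size(); i++) slaves[i].Suspend();
}

int main() {
   std::vector<CodSlave> slaves(3);
   RunQuery(slaves);   // one interactive query
   return 0;
}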
PROOF and COD
[Diagram: a client connects to the PROOF master; master and slaves run on Condor nodes, side by side with Condor batch jobs on the same machines.]
ACAT'03
PROOF and COD Status • Status • Basic implementation finished • Successfully demonstrated at SC’03 with 45 slaves as part of PEAC • TODO • Further improve interface between PROOF and COD • Implement resource accounting ACAT'03
PEAC – PROOF Enabled Analysis Cluster • Complete event analysis solution • Data catalog and data management • Resource broker • PROOF • Components used: SAM catalog, dCache, new global resource broker, Condor+COD, PROOF • Multiple computing sites with independent storage systems ACAT'03
PEAC System Overview ACAT'03
PEAC Status • Successful demo at SC'03 • Four sites, up to 25 nodes • Real CDF StNtuple-based analysis • COD tested with 45 slaves • Doing a post mortem and planning the next design and implementation phases • Available manpower will determine the time line • Plan to use a 250-node cluster at FNAL • Another cluster at UCSD ACAT'03
Conclusions • PROOF is maturing • A lot of interest from experiments with large data sets • COD is essential for sharing batch and interactive work on the same cluster • Maximizes resource utilization • PROOF turns out to be a powerful application for exercising Grid middleware and showing its power to the full extent • See tomorrow's talk by Andreas Peters on PROOF and AliEn ACAT'03