210 likes | 229 Views
Grid for gravitational wave search: Virgo C. Palomba (INFN Roma) for the Virgo Collaboration
E N D
Grid for gravitational wave search: Virgo C. Palomba (INFN Roma) for the Virgo Collaboration • Virgo is a detector for gravitational waves located in Cascina, near Pisa (Italy), funded by the French CNRS (Centre National de la Recherche Scientifique) and the Italian INFN (Istituto Nazionale di Fisica Nucleare) • 3km-arms Michelson interferometer with Fabry-Perot cavities + recycling cavity • 15 labs from 3 countries (France, Italy, Holland)
Several detectors operating in the world GEO600, Hannover, Germany LIGO – Hanford, WA VIRGO, Pisa, Italy LIGO – Livingston, LA TAMA300, Japan
The sensitivity is approaching the design one • Agreement signed between Virgo, LIGO and GEO600 detectors to share data, coordinate upgrade activities and the planning of future developments (May, 2007): • false alarm probability reduction; - source position measurement; - better sky coverage; - longer observation time
Experiments for the detection of GW, like Virgo, are a challenge from many points of view including data management and the computational effort needed for the analysis of data. • Continuous stream of data (differently from hep experiments); data flow ~0.5TB/day from ~1000 channels • Big amount of data distributed among 3 main sites + tens of users accessing/downloading subsets of the data • data flow: ~7MB/s (~0.5TB/day) • 70TB buffer Cascina (Tier-0) LIGO • data exchange with LIGO • 0.375MB/s (RDS) • long term archive (HPSS) Lyon (Tier-1) Bologna (Tier-1) • 2 years of data on disk • (partial) backup on Castor • users want to access data
Grid tools not currently used for bulk data transfer • A bbftp-based tool is to perform bulk data transfer among Tier-0,1 • User access to data mainly takes place with non-grid tools/protocols: Unix commands, rfio, Xrootd,… • grid tools (gridFTP + LFC - LCG File Catalogue) routinely used for the CW search see later • A file and metadata catalogue, VDB (Virgo Data Base), has been developed. A link to the EGEE metadata catalogue (AMGA) should be studied to provide transparent data access.
Under investigation: - EGEE glite FTS (File Transfer Service) for bulk transfer; requires optimization work to perform well. Not completely satisfactory tests, mainly due to the shortage of manpower - SRB (Storage Resource Broker); SRM interface under development - LDR (Lightweight Data Replicator, based on gridFTP + Globus), tool developed in LIGO. Interoperability with EGEE is an issue.
Very weak signals buried into the noise, with known or unknown shape depending on the kind of source • Different analysis pipelines for different kinds of source: coalescing binaries, burst (e.g. supernova explosion), continuous waves (e.g. pulsars), stochastic background • On-line analysis done in Cascina; off-line analysis mainly in Bologna and Lyon CC • Some of the pipeline requires large computational resources • One of these (CW) already runs on the grid, other could follow (SB group has started tests) ~ 60% of all Virgo jobs run in CC (2007) were on the grid (mainly in Bologna)
Where the grid can play a role: - Data transfer from experiment site to computing centers - Data exchange with other GW experiments (LIGO) - User access to data - Data selection using metadata • CPU intensive simulation/data analysis • What the grid can offer: - High performance, transparent movement and access to data, independent of the local storage system - Generic metadata catalogue - Transparent access to large computing resources, independent of the local batch system - Security (authorization and authentication)
Some possible use cases: • the user wants to select a set of data, e.g. covering a gps time interval around a GRB event, and download them to his laptop for the analysis without having to know where the data are taken from, and on which kind of storage they are, but with the certainty that they are up-to-date and safe; • the user wants to make a coincidence analysis between event candidates found in the data of two or more detectors without knowing where the data are located but just using the proper metadata information; • the user wants to effectively make an analysis which uses a big amount of data by running jobs ‘near’ data location, in order to reduce network traffic. • CW search: relatively small amount of data but many CPUs; embarrassingly parallel problem see later
Low Mass X-Ray Binaries Wobbling Neutron Star Bumpy Neutron Star Virgo CW search on the grid • Need to ‘integrate’ over months of data to increase snr • Need to explore a huge number of points in the parameter space: the signal shape that must be searched into the data depends on source position in the sky, source frequency and frequency derivative • The larger is the available computing power, the larger is the portion of source parameter space that can be covered
Data analysis pipeline Hough transform: from time-frequency plane to source parameter space (frequency, frequency derivative, position in the sky) calibrated data Data quality SFDB Average spect rum estimation Data quality SFDB Average spect rum estimation Embarrassingly parallel problem: the whole frequency band to be searched for (0-2kHz) can be split in several sub-bands each analyzed in a job completely independent on the others peak map peak map It has been easily adapted to run on the grid hough transf. hough transf. candidates coincidences candidates coherent step Input data Analysis Job Output data events
Virgo CW search on the grid: basic steps • Input files are created locally on the UI • Input files are registered in the file catalog (LFC) and replicated to various SEs (lcg-cr). - job gets its input file from the ‘best’ SE. • Output directory is created in various SEs (gridFTP). • A test job is submitted to the target sites. - Select sites more appropriate for the analysis (those satisfying some criteria, like number of free CPUs or ‘closeness’ to SEs having data, or having a given software version, etc). • Analysis jobs are submitted to the selected sites (i.e. those which passed the test).
6. The job downloads locally (i.e. on the WN), or accesses (e.g. via GFAL), the input file. - No need to know where the file physically is (use logical file name). 7. Job runs on the WN. - Analysis software installed on the shared area /opt/exp_software/ on the SE 8. Output file saved into the SEs (directly from within the executable, via lcg-cp). - More copies of the output against SE failures.
We can access resources, both CPU and storage, of the INFN Production Grid available to the Virgo VO, including also Lyon CC ~9000 CPU ...to be shared with other experiments
In 2007, ~40,000 jobs have been submitted to the grid with a total CPU time of ~30 years • Heavier use is expected in 2008, due to the VSR1 run analysis • The grid is now much more robust than in the past and offers some minimal fault tolerance but this is not enough if it must be used intensively (say, ~O(10000) jobs submitted at each run, ~ O(10000) input/output files moved) • Main problems: matchmaking failures, WMS overloading ‘bare’ grid efficiency of ~95% • An efficient exploitation of the grid needs some tool not only to automatize all the tasks but also to face failures in order to minimize human intervention and time overhead in the completion of the analysis
The Supervisor program • To this purpose we have developed a Supervisor program which takes care of: - preparing the environment, installing analysis software; - registering and replicating input data among SEs; - submitting test jobs and selecting sites; - creating jdl files, submitting analysis jobs, monitoring job status; - managing job failures (failed job resubmission), managing core services failures (when possible! - e.g. use a different Resource Broker). • The user can follow the status of the analysis and is notified (via e-mail/sms) when some important event takes place • The fraction of lost jobs is reduced to ~0 (unless there is some major failure of the grid…)
The Supervisor consists of a set of C/C++/scripts daemons running on the UI • A Supervisor’s supervisor has also been developed and tested (but still not used in production): - it is basically an ‘hearth beat’ application which allows to monitor the status of the UI where the SV is running; - in case of crash of the UI, it is replaced by a spare UI where the SV starts; - thanks to a check-pointing mechanism, the analysis management resumes from where it was arrived.
Enhanced LIGO/Virgo+ 2009 Virgo/LIGO 108 ly Adv. Virgo/Adv. LIGO 2014 Credit: R.Powell, B.Berger Future trends and conclusions • The last generation of GW detectors has started taking data at or near their target sensitivity with an aggregate volume of data of ~1PB/year (in case of continuous data taking) • In the next years a new generation of interferometric detectors will come on-line
They will act as a network of detectors • Computing and data distribution tools will play an increasingly important role and this pushes toward the creation of a largely distributed computing and data storage framework. • The grid can make this possible. The fact that many experiments are relying more and more upon computing grids for their simulation/analysis activities drives the evolution of the grid itself, pushing the emerging of standards that replace, or at least co-operate, with ‘home made’ solutions. • This would contribute to an extension, in the GW community, of the concept of single machine in which not only data but also the resources needed to store and to analyze them are shared in coordinate way. • A key issue is the interoperability among different grids, in particular EU and US grids.
While US grids heavily rely on the Globus software for many core services, EU grids are partially replacing it due to the poor performances of some components (like RLS) • They are basically not interoperable: • a file registered in one catalogue is not ‘seen’ in the other; • a job cannot be directed to a WN of the other grid • On the other hand, the main grid projects (e.g. EGEE and OSG) are developing services in a framework and according to standards defined by the Global Grid Forum, which should foster, on a time scale of very few years, the birth of a “global grid”, really a single distributed computing environment.
What to do in the meanwhile? Wait…. • …or try to build a bridge between the two worlds (as other experiments have done) • Few operative questions we should start with (having enough manpower…): • can Condor be used as a local resource management system of a CE? (Condor is used by LIGO computing farms) • is there a simple way to interface RLS (used in LIGO) with LFC (used in Virgo)? • how to interface metadata catalogues?