Implementing Metadata Using RLS/LCG
James Cunha Werner
University of Manchester
http://www.hep.man.ac.uk/u/jamwer/
BaBar Experiment
• The BaBar experiment studies the differences between matter and antimatter, to throw light on the problem, posed by Sakharov, of how the matter-antimatter symmetric Big Bang can have given rise to today's matter-dominated universe.
• High energy collisions between electrons and positrons produce other elementary particles, giving tracks and clusters which are recorded by several high granularity detectors and from which the properties of the short-lived particles can be deduced.
Each recorded collision, called an event, comprises a large volume of data, and thousands of millions of events are recorded, giving a total dataset size of hundreds of thousands of gigabytes (hundreds of terabytes).
Sources of Data in BaBar (diagram)
Amount of data
• SuperBaBar! Systematic errors >>> statistical errors.
• The same amount of Monte Carlo generated data is required!
Data Structure
• The user interface to the eventstore is the event "collection". Each collection represents an ordered series of N events, and a user can choose to read the events from the first one in the sequence or from any given offset into the sequence.
• Data components:
  • hdr - event header
  • usr - user data
  • tag - tag information
  • cnd - candidate information
  • aod - "analysis object data"
  • tru - MC truth data (only in MC data)
  • esd - "event summary data"
  • sim - "sim" data from BgsApp or MooseApp, such as GHits/GVertices (only in MC data)
  • raw - subset of the raw data from xtc persisted in the Kanga eventstore
Data organisation
How data are stored (level of detail):
• micro = hdr + usr + tag + cnd + aod (+ tru)
• mini = micro + esd
Data access:
• collections - "logical" names that users use to configure their jobs. These are site-independent, so (assuming the site has imported the data) the same collection name should work at any site.
• logical file names (LFNs) - site-independent names given to all files in the eventstore. Any references within the event data itself must use LFNs so that they remain valid when files are moved from site to site.
• physical file names (PFNs) - file names that vary from site to site. In practice they are usually derived from the LFNs by adding a prefix that encapsulates how the data is accessed at that site (see the sketch below).
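The LFN-to-PFN mapping can be pictured with a minimal sketch, assuming a hypothetical site whose storage is exposed under a rootd-style access prefix; the host name, path, and LFN below are illustrative, not taken from the talk:

#!/bin/bash
# Illustrative only: derive a site-specific PFN from a site-independent LFN
# by prepending the site's access prefix (host and path are hypothetical).
LFN="AllEventsSkim-Run4-OnPeak-R14-micro"
SITE_PREFIX="root://se.example.ac.uk//babar/eventstore"
PFN="${SITE_PREFIX}/${LFN}"
echo "lfn:${LFN} -> ${PFN}"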
Feeding RLS with metadata
Generation of the basic metadata file with file selection:

#!/bin/bash
# List the datasets known at the local site
BbkDatasetTcl --dbsite=local > MetaLista.txt
# Build and run a script that produces one .tcl metadata file per dataset
cat MetaLista.txt | awk '// {print "BbkDatasetTcl --site local --nolocal \""$1"\"";}' >> geratcl
chmod 700 geratcl
./geratcl

Feeding RLS with the basic files:

#!/bin/bash
# Register each metadata file in the RLS with the dataset name as LFN;
# the identifier returned by edg-rm cr is kept in a .rlstok token file
ls *.tcl | awk '// {split($1,a,"."); print "edg-rm --vo babar cr file:///home/jamwer/PgmCM2/MetaData/"$1 " -l lfn:"a[1] " > " a[1]".rlstok";}' >> alimrls
chmod 700 alimrls
./alimrls
Conformity CE catalogue
Run evaluation software to establish CE conformity and perform the catalogue update.

#!/bin/bash
# Query the BDII for every CE that accepts the babar VO
ldapsearch -x -H ldap://lcgbdii02.gridpp.rl.ac.uk:2170 -b 'Mds-vo-name=local,o=Grid' '(&(objectClass=GlueCE)(GlueCEAccessControlBaseRule=VO:babar))' | grep "GlueCEUniqueID:" > cenames.txt
# Submit the evaluation job (./catal) to each CE found
cat cenames.txt | awk '// {print "./catal "$2;}' > subload.sh
chmod 700 subload.sh
./subload.sh
cat loadrlssubm >> $1.histo
# Extract each job handle (https URL) into a per-job token file
cat $1.histo | awk ' /Sub/ {FileName=$2} /https/ {HandleName=$2; print "echo " HandleName "> " FileName".tok " }' >> gridtok
chmod 700 gridtok
./gridtok
Conformity validation
Verify that the site follows the experiment standards:

#!/bin/bash
echo Hostname `/bin/hostname`
echo Start time: `/bin/date`
echo
local=`pwd`
echo "Babar initialisation"
. $VO_BABAR_SW_DIR/babar-grid-setup-env.sh
echo
echo "Environment variables"
printenv
echo
cd $local
echo Available files: $local
ls
echo
echo " - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - "
echo
cd $BFDIST/releases/14.5.2
srtpath 14.5.2 Linux24RH72_i386_gcc2953
cd $local
BbkDatasetTcl --dbsite=local > MetaLista.txt
cat MetaLista.txt | awk '// {print "BbkDatasetTcl --site local \""$1"\"";}' >> geratcl
chmod 700 geratcl
./geratcl
export CE_NAME=$1
# For each dataset found locally, add an RLS alias <dataset>.<CE name>
ls *.tcl | awk -v site=$CE_NAME '// {split($1,a,"."); print "edg-rm --vo babar addAlias `cat " $1"` lfn:"a[1]"."site ;}' >> alimrls
chmod 700 alimrls
./alimrls
echo
echo " - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - "
echo
echo End time: `/bin/date`
Analysis Submission to Grid (Prototype)
• Single command: ./easygrid dataset_name
• Performs handler management and submission.
• Configurable to meet the user's requirements.
• Software based on a state machine (see the sketch after this list):
  • Verify whether skimData is available:
    • If not available, run BbkDatasetTcl to generate skimData. Each file will be a job.
  • Verify whether there are handlers pending:
    • If not, generate the scripts (gera.c) with edg-job-submit and ClassAds, and execute them. This is the point where submission policy and optimisation are applied.
    • If yes, verify the job status. When all jobs have ended, recover the results into the user's folder.
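A hedged sketch of this state machine follows; the file names, the three-state layout, and the BbkDatasetTcl invocation are illustrative rather than taken from easygrid itself, while edg-job-submit, edg-job-status, and edg-job-get-output are the standard EDG/LCG user-interface commands:

#!/bin/bash
# Hedged sketch of the easygrid state machine (illustrative, not easygrid's code)
DATASET=$1
if [ ! -f ${DATASET}.tcl ]; then
  # State 1: skimData not available - generate it (each file becomes one job)
  BbkDatasetTcl --site local "${DATASET}"
elif [ ! -f ${DATASET}.handlers ]; then
  # State 2: nothing submitted yet - submit the generated JDLs, collecting
  # one handler per job in ${DATASET}.handlers
  for jdl in ${DATASET}_*.jdl; do
    edg-job-submit --vo babar -o ${DATASET}.handlers ${jdl}
  done
else
  # State 3: jobs submitted - check their status; when all have ended,
  # recover the results into the user's folder
  edg-job-status -i ${DATASET}.handlers > ${DATASET}.status
  if ! grep -q "Running\|Scheduled\|Waiting\|Ready" ${DATASET}.status; then
    edg-job-get-output -i ${DATASET}.handlers --dir $HOME/${DATASET}_results
  fi
fi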
Job Submission system, metadata and data (diagram)
Metadata/Event files and Computing Elements
For each dataset there is a metadata file containing the names of the event files. These physical files are registered with the RLS, with several logical file names in the format datasetname_CEJobQueue assigned to them as aliases, showing the CEs which hold copies of that dataset. Searching all the aliases for a dataset name provides a list of CEs to which jobs can be submitted.
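As a hedged illustration of this alias convention: the dataset name and the aliases.txt listing below are placeholders (the talk does not give the exact RLS query used to enumerate aliases), but the naming pattern is the one described above.

#!/bin/bash
# Hedged sketch: recover the CE list for a dataset from its RLS aliases.
# aliases.txt stands in for the alias listing obtained from the RLS; each
# alias follows the datasetname_CEJobQueue convention described above.
DATASET="SP-1237-Run4"
grep "^${DATASET}_" aliases.txt | sed "s/^${DATASET}_//" > celist.txt
# celist.txt now holds the CE job queues to which this analysis can be sent
cat celist.txt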
Managing large files in Grid
• The analysis executable is stored on the SE and its logical file name (LFN) is also catalogued in the RLS, so any WN needs to download it only once (see the sketch below).
• Metadata is used not only for data, but to support other files as well.
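A minimal sketch of that idea, with an illustrative path and LFN, and assuming the edg-rm copy subcommand (cp) is available alongside the cr call used in the earlier scripts:

#!/bin/bash
# Register the analysis executable once, from the UI (path and LFN illustrative)
edg-rm --vo babar cr file:///home/jamwer/analysis/MyAnalysisApp -l lfn:MyAnalysisApp
# On a WN, fetch it once before running (assumes edg-rm "cp" is available)
edg-rm --vo babar cp lfn:MyAnalysisApp file://`pwd`/MyAnalysisApp
chmod +x MyAnalysisApp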
Gera
• Generates all information necessary to submit the jobs on the Grid:
  • Job Description Language (JDL) files,
  • the script with all tasks necessary to run the analysis remotely on a WN,
  • some grid-dependent analysis parameters.
• The JDL files define the input sandbox with all the files to be transferred.
• The WN load-balancing algorithm matches requirements so the task is performed optimally (a sample JDL is sketched below).
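A hedged sketch of what the gera step might emit for one job: the JDL attribute names are the standard EDG/LCG ones, but the values, file names, and arguments are illustrative rather than taken from gera.c.

#!/bin/bash
# Generate one illustrative JDL file for job $2 of dataset $1, targeted at CE $3
DATASET=$1
JOB=$2
CE=$3
cat > ${DATASET}_${JOB}.jdl <<EOF
Executable      = "runanalysis.sh";
Arguments       = "${DATASET} ${JOB}";
StdOutput       = "std.out";
StdError        = "std.err";
InputSandbox    = {"runanalysis.sh", "${DATASET}_${JOB}.tcl"};
OutputSandbox   = {"std.out", "std.err", "results.root"};
VirtualOrganisation = "babar";
Requirements    = other.GlueCEUniqueID == "${CE}";
EOF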
Running analysis programs
When the task arrives on the WN, scripts start running to initialise the specific BaBar environment, and the analysis software is downloaded (a minimal sketch follows).
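A minimal sketch of such a WN wrapper, reusing the environment set-up lines from the conformity script above and the same edg-rm copy call as in the earlier sketch; the executable name, argument, and log file are illustrative:

#!/bin/bash
# Hedged WN wrapper sketch: set up the BaBar environment, then fetch and run
# the analysis executable catalogued in the RLS (names are illustrative).
WORKDIR=`pwd`
. $VO_BABAR_SW_DIR/babar-grid-setup-env.sh
cd $BFDIST/releases/14.5.2
srtpath 14.5.2 Linux24RH72_i386_gcc2953
cd $WORKDIR
edg-rm --vo babar cp lfn:MyAnalysisApp file://$WORKDIR/MyAnalysisApp
chmod +x MyAnalysisApp
./MyAnalysisApp $1.tcl > analysis.log 2>&1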
Benchmarks
Behavior of particles in the BaBar Electromagnetic Calorimeter (EMC)
• The different behavior of electrons, hadrons, and muons can be distinguished.
• Performing this analysis takes 7 days using one computer 24 hours a day.
• Using 10 CPUs in parallel, accessed via the Grid, it took only 8 hours.
Pi+- N Pi0 decays, with N = 1, 2, 3 and 4
• Invariant masses of pairs of gammas from Pi0 decay, as measured by the EMC, produce a mass peak at 135 MeV (the peak in the plot). All other combinations are spread randomly over all energies (background).
• There were 81,700,000 events in the dataset and it took 4 days to run in production, with 26 jobs in parallel; to run it on one single computer would take more than 3 months.
Summary
• Easygrid is working and provides the complete job submission structure using the LCG grid, RLS and metadata management.
• It provides handler management transparently to the user.
• Easy to use!
• Configurable to meet the user's requirements, and possibly usable by other experiments as well.
• See the homepage http://www.hep.man.ac.uk/u/jamwer/ for more details.
Thanks for the opportunity!