
Re: Jobs reading wrong input files

This report discusses the issue of jobs in ALICE reading wrong input files and provides a solution to fix the problem. It also highlights the scope of the problem and future steps to prevent it.



Presentation Transcript


  1. Re: Jobs reading wrong input files From: <Costin.Grigoras@cern.ch>

  2. Initial reports
     Detailed reports from Redmer, Ruediger, Marta and others after the winter break:
     • Wrong folders in the analysis train results
     • Jobs saving the output in the wrong place
     • Jobs analyzing more files than indicated in their input data list
     ALICE Weekly offline meeting: Jobs reading wrong input files

  3. Fix for the apparent problem
     • Several jobs were assigned the same access token in the task queue
     • The random number generator was not correctly initialized in AliEn, starting with the same sequence in each new thread
     • A fix was deployed by Miguel, and since then there have been no duplicate tokens
     • JobAgent code was also hardened to make sure the jobs run in their expected sandboxes
     • However, the problem at hand was not solved by this
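The duplicate-token failure mode can be illustrated outside AliEn. The following is a minimal C++ sketch, not AliEn code: `make_token` and the two seeding helpers are hypothetical names, and the fixed seed 12345 is an arbitrary stand-in for "every new thread starts from the same generator state".

```cpp
#include <cassert>
#include <cstdint>
#include <random>

// Hypothetical helper: draws one 64-bit token from the given generator.
std::uint64_t make_token(std::mt19937_64 &gen) {
    return gen();
}

// Problematic pattern: each new worker builds its generator from the same
// fixed seed, so the first token is identical for every worker.
std::uint64_t first_token_fixed_seed() {
    std::mt19937_64 gen(12345);  // same starting state every time (assumed seed)
    return make_token(gen);
}

// Safer pattern: seed each generator from the platform entropy source.
std::uint64_t first_token_entropy_seed() {
    std::random_device rd;
    std::seed_seq seq{rd(), rd(), rd(), rd()};
    std::mt19937_64 gen(seq);
    return make_token(gen);
}
```

Calling `first_token_fixed_seed()` twice returns the same value both times, which is the duplicate-token effect; `first_token_entropy_seed()` is expected to differ between calls on platforms where `std::random_device` is non-deterministic.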

  4. Next suspect
       // connect to AliEn and make the chain
       if (!TGrid::Connect("alien://")) return;
       // Set temporary merging directory to current one
       gSystem->Setenv("TMPDIR", gSystem->pwd());
       …
       // load base root libraries
       gSystem->Load("libTree");
       …
       // Add additional AliRoot libraries
       gSystem->Load("libPWGflowBase.so");
       gSystem->Load("libPWGflowTasks.so");
       …
       // read the analysis manager from file
       AliAnalysisManager *mgr = AliAnalysisAlien::LoadAnalysisManager("lego_train.root");
       if (!mgr) return;
       mgr->PrintStatus();
       AliLog::SetGlobalLogLevel(AliLog::kError);
       TChain *chain = CreateChain("wn.xml", anatype);
       mgr->StartAnalysis("localfile", chain);
       …

  5. strace of the same train
       ...
       [pid 18747] open("wn.xml", O_RDONLY) = 44
       [pid 18747] open("/tmp/aliencollection.fdaf55aa-8538-11e3-9717-0101007fbeef", O_RDWR|O_CREAT, 0644) = 46
       [pid 18747] open("/tmp/aliencollection.fdaf55aa-8538-11e3-9717-0101007fbeef", O_RDONLY) = 44
       ...
       ...
       [pid 19037] open("wn.xml", O_RDONLY) = 44
       [pid 19037] open("/tmp/aliencollection.2d5b112c-8539-11e3-9717-0101007fbeef", O_RDWR|O_CREAT, 0644) = 46
       [pid 19037] open("/tmp/aliencollection.2d5b112c-8539-11e3-9717-0101007fbeef", O_RDONLY) = 44
       ...

  6. So it seems to come from
     • TAlienCollection
       • Using /tmp explicitly
       • Always "downloading" wn.xml to it, even if it is local
       • Using a time-based UUID as file name suffix
         • 100 ns granularity
     • Jobs on the same worker node synchronizing on downloading the libraries via CVMFS
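One common way to avoid name collisions of this kind is to let the operating system pick the file name atomically instead of deriving it from a timestamp. The following is a minimal sketch, assuming a POSIX system; `create_unique_collection_file` is a hypothetical helper, not the TAlienCollection code, and the `aliencollection.` prefix is reused only for illustration.

```cpp
#include <cassert>
#include <cstdlib>
#include <string>
#include <unistd.h>
#include <vector>

// Hypothetical helper: creates a uniquely named, empty file under `dir`
// (for example the job's own working directory rather than a shared /tmp)
// and returns its path. Returns an empty string on failure.
std::string create_unique_collection_file(const std::string &dir) {
    const std::string tmpl = dir + "/aliencollection.XXXXXX";
    std::vector<char> buf(tmpl.begin(), tmpl.end());
    buf.push_back('\0');

    // mkstemp replaces the XXXXXX suffix and creates the file exclusively,
    // so two processes cannot end up with the same file.
    const int fd = mkstemp(buf.data());
    if (fd < 0) {
        return std::string();
    }
    close(fd);
    return std::string(buf.data());
}
```

Two consecutive calls such as `create_unique_collection_file(".")` yield two different paths, regardless of how closely in time the calls happen.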

  7. Scope of the problem
     • Comparing the contents of fileinfo.log and the JDLs of the trains from the last 2 months:
       • 838 trains, 41.2K masterjobs, 3.02M subjobs
       • 659 affected subjobs (0.022%)
     • Only the initial analysis could be assessed, since fileinfo.log is not available for other types of jobs
       • Though the pattern should be similar
     • This was also affecting the merging, causing the reported crashes and the unexpected content
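The quoted fraction follows directly from the subjob counts. A small sketch of the arithmetic, with `affected_percent` as a hypothetical helper name:

```cpp
#include <cassert>

// Percentage of affected subjobs among all subjobs.
double affected_percent(double affected, double total) {
    assert(total > 0.0);  // guard against division by zero
    return 100.0 * affected / total;
}
```

`affected_percent(659, 3.02e6)` evaluates to about 0.0218, which rounds to the 0.022% quoted on the slide.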

  8. Solution for this issue
     • Andrei has committed a fix for this in ROOT
     • Until the next tag is available, Alina will put the patch in the build system and use it for the next AliRoot tags
     • For the moment, the extra debugging added to the train logs will be kept
     • A continuous check of the fileinfo.log consistency with the JDLs will be implemented
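The consistency check in the last bullet amounts to a set comparison. The sketch below assumes both sides have already been parsed into sets of file names; the actual fileinfo.log and JDL formats are not shown in the slides, and `unexpected_inputs` is a hypothetical helper name.

```cpp
#include <algorithm>
#include <cassert>
#include <iterator>
#include <set>
#include <string>
#include <vector>

// Returns the files a job actually read (per fileinfo.log) that were not
// listed in its expected input (per the JDL). An empty result means the
// job only touched files it was supposed to read.
std::vector<std::string> unexpected_inputs(const std::set<std::string> &expected,
                                           const std::set<std::string> &read) {
    std::vector<std::string> extra;
    // std::set iterates in sorted order, as set_difference requires.
    std::set_difference(read.begin(), read.end(),
                        expected.begin(), expected.end(),
                        std::back_inserter(extra));
    return extra;
}
```

A job would be flagged as affected whenever the returned list is non-empty; swapping the two arguments would instead list expected files that were never read.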

  9. Further on
     • /tmp is used in other places as well
     • In particular, the train used for debugging does some 400Hz* of /tmp file creation
     • Andrei was volunteered
       #2 0x00007f1bab9c0fa4 in __new_tmpfile () at tmpfile.c:47
       #3 0x00007f1baaefa2b1 in G__process_cmd () …/v5-34-08/lib/libCint.so
       #4 0x00007f1bac8d7dc6 in TCint::ProcessLine(char const*, TInterpreter::EErrorCode*) () …/v5-34-08/lib/libCore.so
       #5 0x00007f1bac82702a in TApplication::ProcessLine(char const*, bool, int*) () …/v5-34-08/lib/libCore.so
       #6 0x00007f1bac8741b9 in TROOT::ProcessLine(char const*, int*) () …/v5-34-08/lib/libCore.so
       #7 0x00007f1bac83e8fb in TDirectory::CloneObject(TObject const*, bool) () …/v5-34-08/lib/libCore.so
       #8 0x00007f1b97128261 in AliFemtoEventReaderAOD::CopyAODtoFemtoTrack (this=0x5100e20, tAodTrack=0x153cc980, tFemtoTrack=0x15f79dc0) at …/AliRoot-v5-05-61-AN/PWGCF/FEMTOSCOPY/AliFemto/AliFemtoEventReaderAOD.cxx:859
