Re: Jobs reading wrong input files From: <Costin.Grigoras@cern.ch>
Initial reports
Detailed reports from Redmer, Ruediger, Marta and others after the winter break:
• Wrong folders in the analysis train results
• Jobs saving the output in the wrong place
• Jobs analyzing more files than indicated in their input data list
ALICE Weekly offline meeting: Jobs reading wrong input files
Fix for the apparent problem
• Several jobs were assigned the same access token in the task queue
• The random number generator was not correctly initialized in AliEn, so each new thread started with the same sequence
• Miguel deployed a fix, and there have been no duplicate tokens since
• The JobAgent code was also hardened to make sure jobs run in their expected sandboxes
• However, this did not solve the problem at hand
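The duplicate-token failure mode can be illustrated with a minimal C++ sketch. This is not the AliEn code or Miguel's fix: the token format, the fixed seed value and all function names are hypothetical, and seeding from `std::random_device` is just one common remedy.

```cpp
#include <cassert>
#include <cstdint>
#include <random>
#include <string>

// Hypothetical token format: 16 lowercase hex digits taken from one 64-bit draw.
std::string make_token(std::mt19937_64 &rng) {
    static const char hex[] = "0123456789abcdef";
    std::uint64_t value = rng();
    std::string token(16, '0');
    for (int i = 15; i >= 0; --i) {
        token[i] = hex[value & 0xF];
        value >>= 4;
    }
    assert(value == 0);  // all 64 bits have been consumed
    return token;
}

// Failure mode: every worker builds its generator from the same fixed seed,
// so every worker hands out the same first token.
std::string first_token_fixed_seed() {
    std::mt19937_64 rng(12345);  // hypothetical seed, identical in every thread
    return make_token(rng);
}

// One possible remedy: seed each worker's generator from the OS entropy source.
std::string first_token_entropy_seed() {
    std::random_device entropy;
    std::seed_seq seed{entropy(), entropy(), entropy(), entropy()};
    std::mt19937_64 rng(seed);
    return make_token(rng);
}
```

Calling `first_token_fixed_seed()` from any number of threads always yields the same string, which is the pattern of several jobs sharing one token. `first_token_entropy_seed()` is expected to differ between calls on platforms where `std::random_device` is non-deterministic.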
Next suspect
    // connect to AliEn and make the chain
    if (!TGrid::Connect("alien://")) return;
    // Set temporary merging directory to current one
    gSystem->Setenv("TMPDIR", gSystem->pwd());
    …
    // load base root libraries
    gSystem->Load("libTree");
    …
    // Add additional AliRoot libraries
    gSystem->Load("libPWGflowBase.so");
    gSystem->Load("libPWGflowTasks.so");
    …
    // read the analysis manager from file
    AliAnalysisManager *mgr = AliAnalysisAlien::LoadAnalysisManager("lego_train.root");
    if (!mgr) return;
    mgr->PrintStatus();
    AliLog::SetGlobalLogLevel(AliLog::kError);
    TChain *chain = CreateChain("wn.xml", anatype);
    mgr->StartAnalysis("localfile", chain);
    …
strace of the same train
    ...
    [pid 18747] open("wn.xml", O_RDONLY) = 44
    [pid 18747] open("/tmp/aliencollection.fdaf55aa-8538-11e3-9717-0101007fbeef", O_RDWR|O_CREAT, 0644) = 46
    [pid 18747] open("/tmp/aliencollection.fdaf55aa-8538-11e3-9717-0101007fbeef", O_RDONLY) = 44
    ...
    ...
    [pid 19037] open("wn.xml", O_RDONLY) = 44
    [pid 19037] open("/tmp/aliencollection.2d5b112c-8539-11e3-9717-0101007fbeef", O_RDWR|O_CREAT, 0644) = 46
    [pid 19037] open("/tmp/aliencollection.2d5b112c-8539-11e3-9717-0101007fbeef", O_RDONLY) = 44
    ...
So it seems to come from
• TAlienCollection
  • Using /tmp explicitly
  • Always “downloading” wn.xml to it, even if it is local
  • Using a time-based UUID as file name suffix
    • 100 ns granularity
• Jobs on the same worker node synchronizing on downloading the libraries via CVMFS
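A common way to avoid this kind of name clash is to let the operating system pick the file name atomically rather than deriving it from a timestamp. The sketch below uses POSIX `mkstemp()` and prefers `TMPDIR` over a hard-coded /tmp. It is an illustration only, not the fix committed to ROOT, and the helper names are hypothetical.

```cpp
#include <cassert>
#include <stdlib.h>   // mkstemp, getenv (POSIX)
#include <unistd.h>   // close
#include <string>
#include <vector>

// Scratch directory for the job: TMPDIR if set and non-empty, else the current directory.
std::string job_scratch_dir() {
    const char *tmpdir = getenv("TMPDIR");
    return (tmpdir && *tmpdir) ? std::string(tmpdir) : std::string(".");
}

// Create an empty, uniquely named file "<dir>/<prefix>.XXXXXX" and return its path.
// mkstemp() creates the file exclusively, so two processes asking at the same
// instant cannot end up with the same file. Returns an empty string on failure.
std::string create_unique_file(const std::string &dir, const std::string &prefix) {
    assert(!dir.empty());
    const std::string pattern = dir + "/" + prefix + ".XXXXXX";
    std::vector<char> buffer(pattern.begin(), pattern.end());
    buffer.push_back('\0');          // mkstemp needs a writable C string
    const int fd = mkstemp(buffer.data());
    if (fd < 0) return std::string();
    close(fd);                       // only the name is needed in this sketch
    return std::string(buffer.data());
}
```

With the macro shown earlier setting `TMPDIR` to the job's working directory, `create_unique_file(job_scratch_dir(), "aliencollection")` would place the copy inside the job's own sandbox instead of the /tmp shared by every job on the worker node.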
Scope of the problem
• Comparing the contents of fileinfo.log and the JDLs of the trains from the last 2 months:
  • 838 trains, 41.2K masterjobs, 3.02M subjobs
  • 659 affected subjobs (0.022%)
• Only the initial analysis could be assessed, since fileinfo.log is not available for other types of jobs
  • Though the pattern should be similar
• This was also affecting the merging, causing the reported crashes and the unexpected content
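For reference, the quoted fraction follows directly from the rounded counts on the slide:

```latex
\[
\frac{659}{3.02 \times 10^{6}} \approx 2.18 \times 10^{-4} \approx 0.022\%
\]
```

That is roughly one affected subjob in every 4,600.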
Solution for this issue
• Andrei has committed a fix for this in ROOT
• Until the next tag is available, Alina will put the patch in the build system and use it for the next AliRoot tags
• For the moment, the extra debugging added to the train logs will be kept
• Continuous checking of the fileinfo.log consistency with the JDLs will be implemented
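The core of such a consistency check could look like the sketch below. It assumes the input list of the JDL and the files recorded in fileinfo.log have already been parsed into lists of file names. The parsing, the actual file formats and the function names are not taken from the slides and are purely hypothetical.

```cpp
#include <algorithm>
#include <cassert>
#include <iterator>
#include <string>
#include <vector>

// Files that the job read (per fileinfo.log) but that its JDL did not list.
// Both arguments are taken by value so they can be sorted locally.
std::vector<std::string> unexpected_inputs(std::vector<std::string> jdl_inputs,
                                           std::vector<std::string> files_read) {
    std::sort(jdl_inputs.begin(), jdl_inputs.end());
    std::sort(files_read.begin(), files_read.end());
    // Report a file only once, even if it was opened several times.
    files_read.erase(std::unique(files_read.begin(), files_read.end()),
                     files_read.end());

    std::vector<std::string> extra;
    std::set_difference(files_read.begin(), files_read.end(),
                        jdl_inputs.begin(), jdl_inputs.end(),
                        std::back_inserter(extra));
    assert(std::is_sorted(extra.begin(), extra.end()));
    return extra;
}

// A subjob counts as affected if it read at least one file outside its JDL.
bool is_affected(const std::vector<std::string> &jdl_inputs,
                 const std::vector<std::string> &files_read) {
    return !unexpected_inputs(jdl_inputs, files_read).empty();
}
```

Running `is_affected` over every subjob of a train and counting the hits would give the kind of tally reported for the last two months of trains.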
Further on
• /tmp is used in other places as well
• In particular, the train used for debugging does some 400 Hz* of /tmp file creation
• Andrei was volunteered
    #2 0x00007f1bab9c0fa4 in __new_tmpfile () at tmpfile.c:47
    #3 0x00007f1baaefa2b1 in G__process_cmd () …/v5-34-08/lib/libCint.so
    #4 0x00007f1bac8d7dc6 in TCint::ProcessLine(char const*, TInterpreter::EErrorCode*) () …/v5-34-08/lib/libCore.so
    #5 0x00007f1bac82702a in TApplication::ProcessLine(char const*, bool, int*) () …/v5-34-08/lib/libCore.so
    #6 0x00007f1bac8741b9 in TROOT::ProcessLine(char const*, int*) () …/v5-34-08/lib/libCore.so
    #7 0x00007f1bac83e8fb in TDirectory::CloneObject(TObject const*, bool) () …/v5-34-08/lib/libCore.so
    #8 0x00007f1b97128261 in AliFemtoEventReaderAOD::CopyAODtoFemtoTrack (this=0x5100e20, tAodTrack=0x153cc980, tFemtoTrack=0x15f79dc0) at …/AliRoot-v5-05-61-AN/PWGCF/FEMTOSCOPY/AliFemto/AliFemtoEventReaderAOD.cxx:859