This update includes the release of version 5.1.1c running in production, fixes in CdfMetModule.cc, CprClusterMaker.cc, CprWireCollectionMaker.cc, PlugStripMaker.cc, PlugStripClusterMaker.cc, and KalZ3DVertexFinder.cc, as well as resolved crashes and errors.
CDF Offline Operations
• Status:
• 5.1.1c running in Production:
• Remote database/monitor logging turned off
• Fix in CdfMetModule.cc: check for multiple deletes.
• "-1" events gone!
• Fixed uninitialised variables in:
  • CprClusterMaker.cc
  • CprWireCollectionMaker.cc
5.1.1c_maxopt
• Got rid of severe error messages in:
  • PlugStripMaker.cc
  • PlugStripClusterMaker.cc
• Found infinite loop in KalZ3DVertexFinder.cc (Kurt and Thorsten):

    for (unsigned l3=l2+1; l3<l1; ++l3) {
      double leastdist = 1.0e10;
      int nearest = -1;
      for (unsigned int kh=0; kh< layerList[l3].size(); ++kh) {
        hit3 = layerList[l3][kh];
        zsearch = hit2->z() + (hit3->r()-hit2->r())*
                  (hit1->z() - hit2->z())/(hit1->r() - hit2->r());
        if(fabs(hit3->z() - zsearch)<=leastdist){
          leastdist=fabs(hit3->z() - zsearch);
          nearest=kh;
        }
      }
    }

• All other crashes (>95%) duplicate events.
Hang and Crash
• Job hangs in SimpleExtrapolatedTrack::helixZ (Bob and Beate):

    0x8de1be5 in SimpleExtrapolatedTrack::helixZ (this=0xbfff9510, zCoord=185.39999389648438)
        at /home/cdfsoft/dist/packages/ElectronObjects/V00-00-70/src/SimpleExtrapolatedTrack.cc:356
    356       while (_phi > 2.0*M_PI) { _phi -= 2.0*M_PI; }
    (gdb) where
    #0  0x8de1be5 in SimpleExtrapolatedTrack::helixZ (this=0xbfff9510, zCoord=185.39999389648438)
        at /home/cdfsoft/dist/packages/ElectronObjects/V00-00-70/src/SimpleExtrapolatedTrack.cc:356
    #1  0x8ddef11 in SimpleExtrapolatedTrack::extrapolateZ (this=0xbfff9510, zCoord=185.39999389648438)
        at /home/cdfsoft/dist/packages/ElectronObjects/V00-00-70/src/SimpleExtrapolatedTrack.cc:204
    #2  0x8d9c9db in CdfEmObject::maxPtTrack (this=0xd791d3c__T165106692=0xbfff9ce0)
        at /home/cdfsoft/dist/packages/ElectronObjects/V00-00-70/src/CdfEmObject.cc:767
    (gdb) p _phi
    $1 = 6.7514747645567823e+28
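Why the loop at line 356 hangs: at _phi = 6.75e+28 the spacing between adjacent doubles is about 8.8e12, far larger than 2π, so `_phi -= 2.0*M_PI` rounds back to the same value and the `while` condition never becomes false. The sketch below shows one possible way to wrap an angle in constant time with `std::fmod`. It is not the fix that went into SimpleExtrapolatedTrack.cc: the function name `normalizePhi`, the constant `kTwoPi`, the target range [0, 2π) and the handling of non-finite input are all assumptions made for illustration. It also does not explain how _phi became so large in the first place, which is the underlying problem.

```cpp
#include <cassert>
#include <cmath>

// Hypothetical constant; M_PI is not guaranteed by the C++ standard.
constexpr double kTwoPi = 6.283185307179586476925286766559;

// Hypothetical helper: maps any finite angle into [0, 2*pi) without looping.
// Non-finite input (NaN, +/-inf) is returned unchanged so the caller can detect it.
inline double normalizePhi(double phi) {
    if (!std::isfinite(phi)) return phi;
    double wrapped = std::fmod(phi, kTwoPi);   // lies in (-2*pi, 2*pi), sign of phi
    if (wrapped < 0.0) wrapped += kTwoPi;      // shift negative results up
    if (wrapped >= kTwoPi) wrapped -= kTwoPi;  // guard: tiny negatives can round up to 2*pi
    return wrapped;
}

// Usage sketch: the value printed by gdb now wraps immediately.
inline void normalizePhiExample() {
    const double phi = normalizePhi(6.7514747645567823e+28);
    assert(phi >= 0.0 && phi < kTwoPi);
}
```

A result obtained from such a large input carries no physical meaning, so in practice a range check or warning on the input may be more useful than the wrapped value itself.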
Valgrind
• Run valgrind over the other crashes:

    ==18449== Conditional jump or move depends on uninitialised value(s)
    ==18449==    at 0x420A6879: __mktime_internal (in /lib/i686/libc-2.2.5.so)
    ==18449==    by 0x420A6EBE: timelocal (in /lib/i686/libc-2.2.5.so)
    ==18449==    by 0x9B0D0C1: DateUtil::time_from_string(char const *) (/home/cdfsoft/dist/packages/DBObjects/V00-00-72/src/TimeStamp.cc:264)
    ==18449==    by 0x904C794: ChipStatus::__ct(std::basic_string<char,std::char_traits<char>,std::allocator<char>>, int) (/home/cdfsoft/dist/packages/TrackingObjects/V00-01-73/src/ChipStatus.cc:54)
    ==18449==    by 0x8F94AE5: PedestalUpdator::changed(void) (/home/cdfsoft/dist/packages/SvxDaqObjects/V00-00-74/src/PedestalUpdator.cc:226)

• Other: (Jason)

    ==18449== Conditional jump or move depends on uninitialised value(s)
    ==18449==    at 0x904EFBB: ChipStatus::putBit(char *, int, int) (/home/cdfsoft/dist/packages/TrackingObjects/V00-01-73/src/ChipStatus.cc:133)
    ==18449==    by 0x904F372: ChipStatus::sortBitString(int, int, char *) (/home/cdfsoft/dist/packages/TrackingObjects/V00-01-73/src/ChipStatus.cc:252)
    ==18449==    by 0x904EC15: ChipStatus::makeMap(int) (/home/cdfsoft/dist/packages/TrackingObjects/V00-01-73/src/ChipStatus.cc:212)
    ==18449==    by 0x904C8CC: ChipStatus::__ct(std::basic_string<char,std::char_traits<char>,std::allocator<char>>, int) (/home/cdfsoft/dist/packages/TrackingObjects/V00-01-73/src/ChipStatus.cc:67)
    ==18449==    by 0x8F94AE5: PedestalUpdator::changed(void) (/home/cdfsoft/dist/packages/SvxDaqObjects/V00-00-74/src/PedestalUpdator.cc:226)
Valgrind
• Still there (1X) (Aseet)

    ==6977== Conditional jump or move depends on uninitialised value(s)
    ==6977==    at 0x914484D: PadSqz::Huffman_T::operator<<( (PadSqz::BitStream_T &)) (/home/cdfsoft/dist/packages/PADSObjects/V00-00-23/src/Huffman.cc:368)
    ==6977==    by 0x9145E4C: PadSqz::PadRawBank::Fluff( (int)) (/home/cdfsoft/dist/packages/PADSObjects/V00-00-23/src/PadRawBank.cc:173)
    ==6977==    by 0x84CF42C: PadRawModule<PadSqz::COTQ>::event(EventRecord *) (/home/cdfsoft/dist/releases/5.1.1/include/PADSMods/PadRawModule.icc:57)
Valgrind
• Valgrind error in DB

    ==4539== Invalid read of size 2
    ==4539==    at 0x40705BBC: lxpe2i (in /home/cdfsoft/products/oracle_client/v8_1_7_lite/Linux+2/lib/libclntsh.so.8.0)
    ==4539==    by 0x406F83A5: lxhci2h (in /home/cdfsoft/products/oracle_client/v8_1_7_lite/Linux+2/lib/libclntsh.so.8.0)
    ==4539==    by 0x405E9899: ttclxr (in /home/cdfsoft/products/oracle_client/v8_1_7_lite/Linux+2/lib/libclntsh.so.8.0)
    ==4539==    by 0x403A6217: OCISessionBegin (in /home/cdfsoft/products/oracle_client/v8_1_7_lite/Linux+2/lib/libclntsh.so.8.0)
    ==4539==    by 0x9B1918B: otl_connect::rlogon(char const *) (/home/cdfsoft/dist/packages/DBObjects/V00-00-72/src/otl/utilsOTL.cc:420)
    ==4539==    by 0x9B14B12: OTLCon::getConnection(void) (/home/cdfsoft/dist/packages/DBObjects/V00-00-72/src/otl/dbOTL.cc:328)
    ==4539==    by 0x9AEB5FC: OTLDriverInfo::checkConnection(void) (/home/cdfsoft/dist/packages/CalibDB/V00-00-85/src/OTL/OTLDriverInfo.cc:95)
    ==4539==    by 0x97C2A39: PASSESOTL::doGet(std::basic_string<char,std::char_traits<char>,std::allocator<char>> const &, std::vector<PASSES,std::allocator<PASSES>> *&) (/home/cdfsoft/dist/releases/5.1.1/tmp/Linux2-KCC_4_0/DBViews/PASSES.OTL.cc:106)
    ==4539== Address 0x57AFEE62 is 2 bytes after a block of size 200 alloc'd
DB Error messages

    ==19003== 1420 bytes in 5 blocks are still reachable in loss record 76 of 105
    ==19003==    at 0x40166BA0: malloc (vg_clientfuncs.c:103)
    ==19003==    by 0x4044B13F: ntpaini (in /home/cdfsoft/products/oracle_client/v8_1_7_lite/Linux+2/lib/libclntsh.so.8.0)
    ==19003==    by 0x4044AFEF: ntgblini (in /home/cdfsoft/products/oracle_client/v8_1_7_lite/Linux+2/lib/libclntsh.so.8.0)
    ==19003==    by 0x40432BEA: nsgblini (in /home/cdfsoft/products/oracle_client/v8_1_7_lite/Linux+2/lib/libclntsh.so.8.0)
    ==19003==    by 0x4035A7DF: kpuatch (in /home/cdfsoft/products/oracle_client/v8_1_7_lite/Linux+2/lib/libclntsh.so.8.0)
    ==19003==    by 0x403A61C7: OCIServerAttach (in /home/cdfsoft/products/oracle_client/v8_1_7_lite/Linux+2/lib/libclntsh.so.8.0)
    ==19003==    by 0x9B18FEF: otl_connect::rlogon(char const *) (/home/cdfsoft/dist/packages/DBObjects/V00-00-72/src/otl/utilsOTL.cc:367)
    ==19003==    by 0x9B14B12: OTLCon::getConnection(void) (/home/cdfsoft/dist/packages/DBObjects/V00-00-72/src/otl/dbOTL.cc:328)
Daily checking
• New cron job checks the log files for severe errors every hour.
• Found the usual problems:
  • %ERLOG-s : *Fluffed bank(s) != original(s) PadRawBanks
  • %ERLOG-s L3 Trigger Bits not in event: no Level3Results or TL3D run = 159288 event = 1033557
  • %ERLOG-s ROOT/TFile:error writing to file ./JET_CALIB_18651_temp_0 (No space left on device) JET_CALIB:write failed, event not written.
  • %ERLOG-s CalDataMaker: unpack HATD bank : more than 8 hits in WHA (changed TDCs)
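The hourly check can be pictured as a simple filter over each log file. The sketch below is not the actual cron job, whose implementation is not shown on the slide. It assumes that severe messages are exactly the lines carrying the `%ERLOG-s` tag seen in the examples above. The function names `findSevereErrors` and `findSevereErrorsExample` are hypothetical.

```cpp
#include <cassert>
#include <istream>
#include <sstream>
#include <string>
#include <vector>

// Hypothetical helper: returns every line of `log` that contains the
// "%ERLOG-s" tag (assumed here to mark severe errors).
inline std::vector<std::string> findSevereErrors(std::istream& log) {
    std::vector<std::string> hits;
    std::string line;
    while (std::getline(log, line)) {
        if (line.find("%ERLOG-s") != std::string::npos) {
            hits.push_back(line);
        }
    }
    return hits;
}

// Usage sketch on an in-memory log; a real job would open the file instead.
inline void findSevereErrorsExample() {
    std::istringstream log(
        "processing run 159288\n"
        "%ERLOG-s CalDataMaker: unpack HATD bank : more than 8 hits in WHA\n");
    const std::vector<std::string> hits = findSevereErrors(log);
    assert(hits.size() == 1);
}
```

Matching on the tag alone keeps the check independent of the message text, so new kinds of severe errors are reported without changing the filter.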
Farms
• Farms are running out of disk space.
• Worse for Stream G (13 output streams) than for Stream C (3 output streams).
Farms
• 10 nodes hang every day.
• Over 25 hung over the weekend.
• Running out of disk space for concatenation.
Production
• Statistics of reprocessing with EXE: 5.1.1_maxopt

    Stream     To be processed  Processed last day    Today      Total
    ------------------------------------------------------------------
    Stream a          20521173                   0        0          0
    Stream b          80915268                   0        0          0
    Stream c          57487182                   0        0   57180498
    Stream d          35100306                   0        0          0
    Stream e          67452861                   0        0          0
    Stream g         101170413             4674100  1813007   78111329
    Stream h         155508683                   0        0          0
    Stream j          70459709                   0        0          0
    ------------------------------------------------------------------
    Total            588615595             4674100  1813007  135291827
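To read these numbers as progress, the sketch below computes the fraction done and the events remaining. It assumes that "To be processed" is the full size of each stream and that "Total" is the number of events processed so far; the slide does not state this explicitly. The struct and function names are hypothetical. Under that reading, Stream c is essentially complete, Stream g is about 77% done, and the whole reprocessing is about 23% done.

```cpp
#include <cassert>
#include <cmath>
#include <cstdint>

// Hypothetical record for one row of the reprocessing table.
struct StreamProgress {
    std::int64_t toBeProcessed;   // assumed: full number of events in the stream
    std::int64_t processedTotal;  // assumed: events processed so far ("Total" column)
};

// Fraction of the stream already processed, in [0, 1]; 0 for an empty stream.
inline double fractionDone(const StreamProgress& s) {
    if (s.toBeProcessed <= 0) return 0.0;
    return static_cast<double>(s.processedTotal) /
           static_cast<double>(s.toBeProcessed);
}

// Events still waiting to be processed.
inline std::int64_t eventsRemaining(const StreamProgress& s) {
    return s.toBeProcessed - s.processedTotal;
}

// Usage sketch with the Stream g and Total rows from the table.
inline void progressExample() {
    const StreamProgress streamG{101170413, 78111329};
    const StreamProgress total{588615595, 135291827};
    assert(eventsRemaining(total) == 453323768);
    assert(std::fabs(fractionDone(streamG) - 0.772) < 0.001);
    assert(std::fabs(fractionDone(total) - 0.230) < 0.001);
}
```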
History
• [Plots: processing history for Stream C and Stream G]
Meeting
• Meeting on Monday with CDF farms
• Many ideas about the cause of the hangups (no real hint yet):
  • Power distribution
  • Temperature
  • Network
  • Linux kernel
  • …
• Immediate solution: reboot machines automatically
• Already monitoring each node every 10 min.
• Try to get fbs log files
Plans
• Before the end of this week:
• Steve Timm's group will deploy the autoreboot for hung nodes.
• This will run once a day, probably at midnight, as a cron job.
• Suen et al. will figure out how to increase the space available to dfarm.
• Steve Timm's group has already implemented a way of saving the CDF code status when a node hangs, i.e. fbsng no longer cleans everything up before we can take a look at it.
• They will provide CDF with some examples so that we can try to figure out what in the CDF software might trigger the hangs.
Plans
• Farms history:
• CDF requested a list of dates when significant upgrades to the farms OS (or dfarm) were made.
• This list should go back to May 2003. CDF will try to do a statistical analysis of hangs vs. OS version etc.
• A hang is defined as a software failure as reported on OSS's uptime web page.
Plans
• Early next week, we will add the 3 fileservers fcdfdata053,55,57 to the production farm in order to get more stable operating conditions. The nodes need to be physically moved from FCC1 to FCC2 because of networking issues. Space and power need to be found.
• The goal of this is to increase the chances that at least 1 copy of each file in dfarm is always accessible, even if many nodes hang.
Data taking
• New data expected soon; preparing for it.
• Cosmic runs have been processed.