DCC Out of Sync Problems Stan Durkin, Ohio State
In recent high-rate cosmic runs (July 18-23, 2010), DCCs have gone into an Out-of-Sync condition 7 times.

FMM   W     B   S  E
750   82    28  1  0
752   0     0   0  0
754   1023  14  6  0
756   107   33  2  0

Analysis: study run 141291 (specifically 490 s to 540 s)
• 4,230,000 events through each RUI
• 5102 events in CMSSW data (~0.1% of events saved)
• Rate (from slopes): 79.5 kHz

[Plot: L1As vs. time (seconds)]
DCC FIFO Overflows at High Data Rates
• CSC DCC and DDU headers carry FMM information
• SLINK FIFO: 1 MB
• Input_FIFO: 248 KB
CSC DCC sTTS state machine:
• SLINK_FIFO goes to Half_Full: set WARNING; reset WARNING when it drops back to Almost_Empty
• IN_FIFO goes to Half_Full and L1A Buffer is in WARNING: set BUSY
• IN_FIFO goes to Half_Full, but SLINK_FIFO is not in WARNING: set WARNING
• IN_FIFO stays Half_Full for more than 3.2 ms: set BUSY
• IN_FIFO reaches Almost_Full: set Out_Of_Sync
• IN_FIFO or SLINK_FIFO reaches Full: set Out_Of_Sync
• L1A Buffer >1536: set WARNING; reset WARNING when it drops to 1280
• L1A Buffer >1920: set BUSY; reset BUSY when it drops to 1536
• L1A Buffer >2016: set Out_Of_Sync

- WARNING and BUSY stop L1A triggers (latency ~1 sec)
- Out_Of_Sync stops the run for a resync
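The rules above can be read as a small priority-ordered state machine. The following is only a minimal Python sketch of that reading, not the DCC firmware: the class and field names (`DccStatus`, `SttsSketch`, `in_half_full_ms`, and so on) are hypothetical, the priority order (Out_Of_Sync over BUSY over WARNING) is an assumption, and so is the choice to latch Out_Of_Sync until a resync.

```python
from dataclasses import dataclass

READY, WARNING, BUSY, OUT_OF_SYNC = "READY", "WARNING", "BUSY", "OUT_OF_SYNC"


@dataclass
class DccStatus:
    """Hypothetical snapshot of FIFO flags and L1A buffer occupancy."""
    slink_half_full: bool = False
    slink_almost_empty: bool = True
    slink_full: bool = False
    in_half_full: bool = False
    in_half_full_ms: float = 0.0  # how long IN_FIFO has been Half_Full
    in_almost_full: bool = False
    in_full: bool = False
    l1a_count: int = 0


class SttsSketch:
    """Sketch of the listed sTTS rules (assumed priority: OOS > BUSY > WARNING)."""

    def __init__(self):
        self.slink_warning = False
        self.l1a_warning = False
        self.l1a_busy = False
        self.out_of_sync = False

    def resync(self):
        # Assumption: a resync clears every latched condition.
        self.__init__()

    def step(self, s: DccStatus) -> str:
        # Hysteresis: SLINK WARNING set at Half_Full, cleared at Almost_Empty.
        if s.slink_half_full:
            self.slink_warning = True
        elif s.slink_almost_empty:
            self.slink_warning = False
        # Hysteresis: L1A WARNING set above 1536, cleared at 1280.
        if s.l1a_count > 1536:
            self.l1a_warning = True
        elif s.l1a_count <= 1280:
            self.l1a_warning = False
        # Hysteresis: L1A BUSY set above 1920, cleared at 1536.
        if s.l1a_count > 1920:
            self.l1a_busy = True
        elif s.l1a_count <= 1536:
            self.l1a_busy = False

        # Out_Of_Sync conditions; latched until resync() in this sketch.
        if s.in_almost_full or s.in_full or s.slink_full or s.l1a_count > 2016:
            self.out_of_sync = True
        if self.out_of_sync:
            return OUT_OF_SYNC

        if self.l1a_busy:
            return BUSY
        if s.in_half_full and (self.l1a_warning or s.in_half_full_ms > 3.2):
            return BUSY
        # IN_FIFO Half_Full raises WARNING whether or not SLINK already did.
        if self.slink_warning or self.l1a_warning or s.in_half_full:
            return WARNING
        return READY
```

Feeding it a sequence of snapshots (for example an L1A count that rises past 1536 and then falls to 1400) shows the hysteresis: the state stays in WARNING until the count drops to 1280.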
FMM Throttling Seems to be Working

FMM log, run 141491:

t (s)          dt (s)       FMM
139.384429875  0.436721600  1
139.386232725  0.001802850  8
140.119162225  0.732929500  1
140.120998750  0.001836525  8
144.130565900  4.009567150  1
144.132397975  0.001832075  8
146.057188825  1.924790850  1
146.058872650  0.001683825  8
148.779290350  2.720417700  1
148.781143125  0.001852775  8
152.496441950  3.715298825  1
152.498013425  0.001571475  8
152.817810300  0.319796875  1
152.819979975  0.002169675  8
153.590204650  0.770224675  1
153.592016100  0.001811450  8
154.189867650  0.597851550  1
154.191494650  0.001627000  8
… repeats 90 times …
191.300884525  0.001097700  8
191.301140075  0.000255550  1
191.303430625  0.002290550  2

[Plot: time FMM 1 asserted, in msec; annotated 1.8 msec]

Transition FMM 1 -> 2: 2.290 ± 0.005 msec
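A log in this three-column form (t, dt, FMM) can be summarized with a few lines of Python. This is a sketch under assumptions: the function name `asserted_durations` is hypothetical, `dt` is taken to be the time since the previous transition, and FMM 8 is taken to be the un-throttled state that FMM 1 returns to, as the log rows suggest.

```python
def asserted_durations(log_text, asserted=1, released=8):
    """Return the time (s) spent in `asserted` before each return to `released`.

    Each row is "t dt fmm"; dt is assumed to be the time since the
    previous transition, so the dt on a `released` row that follows an
    `asserted` row is the length of that assertion.
    """
    rows = []
    for line in log_text.strip().splitlines():
        t, dt, fmm = line.split()
        rows.append((float(t), float(dt), int(fmm)))

    durations = []
    for prev, cur in zip(rows, rows[1:]):
        if prev[2] == asserted and cur[2] == released:
            durations.append(cur[1])
    return durations


# First rows of the log above.
SAMPLE = """
139.384429875 0.436721600 1
139.386232725 0.001802850 8
140.119162225 0.732929500 1
140.120998750 0.001836525 8
"""

durations = asserted_durations(SAMPLE)
mean_ms = 1e3 * sum(durations) / len(durations)
print(f"{len(durations)} assertions, mean {mean_ms:.2f} msec")
```

Run over the full log, the same loop would give the distribution of assertion times that the slide's plot summarizes.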
Data Rates Aren’t Large Enough to be Causing Overflows

Average event sizes:

RUI  Size (bytes)
750  884
751  993
752  861
753  1129
754  843
755  1163
756  821
757  988

• 78.5 kHz: ~78.5 MB/s
• To fill the SLINK FIFO (1 Mbyte) in 2.29 msec requires >200 MB/s, even if output is stopped
• Diagram labels: 600 MB/s, 480 MB/s

[Plot: theoretical probability of >50 events in queue; Log10(P)*106 vs. rate (MB/s)]
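The rate argument can be checked with back-of-the-envelope arithmetic. The sketch below makes several assumptions that are not stated on the slide: 1 Mbyte is taken as 10^6 bytes, the per-RUI rate is taken as event size times L1A rate, and the "start half full" case is a guess motivated by the state machine setting WARNING at Half_Full. The function names are hypothetical.

```python
# Average event sizes (bytes) per RUI, from the table above.
EVENT_SIZE_BYTES = {
    750: 884, 751: 993, 752: 861, 753: 1129,
    754: 843, 755: 1163, 756: 821, 757: 988,
}
L1A_RATE_HZ = 78.5e3          # from the slide
SLINK_FIFO_BYTES = 1_000_000  # assumption: 1 Mbyte = 10^6 bytes
TRANSITION_S = 2.29e-3        # measured FMM 1 -> 2 transition time


def rui_rate_mb_s(size_bytes, l1a_hz=L1A_RATE_HZ):
    """Data rate of one RUI in MB/s: event size times trigger rate."""
    return size_bytes * l1a_hz / 1e6


def fill_rate_mb_s(fifo_bytes, seconds, start_fraction=0.0):
    """Net input rate (MB/s) needed to fill the rest of a FIFO in `seconds`,
    with output stopped and the FIFO already `start_fraction` full."""
    return fifo_bytes * (1.0 - start_fraction) / seconds / 1e6


largest = max(rui_rate_mb_s(b) for b in EVENT_SIZE_BYTES.values())
from_empty = fill_rate_mb_s(SLINK_FIFO_BYTES, TRANSITION_S)
from_half = fill_rate_mb_s(SLINK_FIFO_BYTES, TRANSITION_S, start_fraction=0.5)

print(f"largest per-RUI rate: {largest:.1f} MB/s")
print(f"fill from empty:      {from_empty:.0f} MB/s")
print(f"fill from half full:  {from_half:.0f} MB/s")
```

Under these assumptions even the half-full case needs more than twice the largest per-RUI rate, which is in line with the slide's conclusion that normal data rates should not overflow the FIFO.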
60 Events in Run 141491 CMSSW Data Show Bad Transmission

Two independent 3.2 Gbit links (each labeled 3.2 GB/s in the diagram)

Bad data, 0xBC50 idle code:

  1960 826d bc50 bc50
  0000 8000 bc50 bc50
  0080 0000 bc50 bc50
  8000 8000 bc50 bc50
  0000 0000 bc50 bc50
  0080 2c1e bc50 bc50
  c0de c000 bc50 bc50

Good data:

  1560 826d 6d0f 5080
  0000 8000 0001 8000
  0080 0000 1014 3f7f
  8000 8000 ffff 8000
  0000 0000 0000 2000
  0080 2210 0006 a000

Transfer problem on the 3.2 Gbit backplane
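Events of this kind could be flagged automatically by scanning the dumped 16-bit words for the idle code. The following is a minimal sketch, assuming the event is available as a whitespace-separated hex dump with four words per row; the helper names are hypothetical, and a real check would also need to rule out payload words that happen to equal 0xBC50.

```python
IDLE_WORD = 0xBC50  # idle code seen in the bad events


def parse_words(dump):
    """Parse a whitespace-separated hex dump into a list of 16-bit words."""
    return [int(token, 16) for token in dump.split()]


def idle_positions(words, idle=IDLE_WORD):
    """Return the indices of all words equal to the idle code."""
    return [i for i, w in enumerate(words) if w == idle]


BAD = """
1960 826d bc50 bc50
0000 8000 bc50 bc50
0080 0000 bc50 bc50
8000 8000 bc50 bc50
0000 0000 bc50 bc50
0080 2c1e bc50 bc50
c0de c000 bc50 bc50
"""

GOOD = """
1560 826d 6d0f 5080
0000 8000 0001 8000
0080 0000 1014 3f7f
8000 8000 ffff 8000
0000 0000 0000 2000
0080 2210 0006 a000
"""

bad_hits = idle_positions(parse_words(BAD))
good_hits = idle_positions(parse_words(GOOD))

# Column index within each 4-word row; shows which half of the row is idle.
bad_columns = sorted({i % 4 for i in bad_hits})
print(len(bad_hits), "idle words in bad dump, columns", bad_columns)
print(len(good_hits), "idle words in good dump")
```

In the bad dump the idle words occupy only the last two columns of every row, which is what one would expect if one of the two links delivered idle code while the other carried data.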
How do we prove these events are causing the problem?

Last column shift:

  f308 7342 76b2 5164
  01f0 5ae0 0e36 d900
  1960 734d 5064 c0de
  0000 8000 8000 76b2
  0080 0000 3f7f 0001
  8000 8000 8000 1014
  0000 0000 2000 ffff
  0080 be16 a000 0000
  c0de c000 c000 0006
  1960 86bd 5064 c0de
  0000 8000 8000 76b3
  0080 0000 3f7f 0001
  8000 8000 8000 1014
  0000 0000 2000 ffff
  0080 2a10 a000 0000
  c0de c000 c000 0006
  1960 916d 5064 c0de
  0000 8000 8000 76b4
  0080 0000 3f7f 0001
  8000 8000 8000 1014
  0000 0000 2000 ffff
  0080 5039 a000 0000
  c0de c000 c000 0006
  1960 960d 5064 c0de

Viewed several hundred bad transmission events. Only a small number of DDU->DCC links gave problems:

RUI  DDU  Bad events
755  25   most
757  33   many
751  7    a few
751  3    a few
756  35   one
755  16   one

We will swap DDU 25 and see if the problems go away.
Possible Remedies to Problem
• Fix problem boards
• Reconfigure Xilinx RocketIOs
• Channel bonding: lock-step data transmissions
• 16 bit -> 32 bit transfers: keep data packets together
• Change clock frequency in firmware (divide by 2); we don’t need 800 Mbyte/s
• This is not urgent. We will proceed with caution.
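The "we don't need 800 Mbyte/s" remark can be illustrated with simple arithmetic. This sketch assumes the 800 Mbyte/s figure is the raw capacity of the two 3.2 Gbit links (no encoding overhead subtracted) and that dividing the clock by 2 halves it; the function name and the comparison against the ~78.5 MB/s data rate quoted earlier are illustrative only.

```python
def link_bandwidth_mb_s(n_links, gbit_per_link, clock_divider=1):
    """Raw bandwidth in MB/s of `n_links` serial links.

    Assumption: raw line rate, 8 bits per byte, no encoding overhead.
    """
    return n_links * gbit_per_link * 1e9 / 8 / 1e6 / clock_divider


NEEDED_MB_S = 78.5  # approximate data rate quoted earlier in the talk

full = link_bandwidth_mb_s(2, 3.2)
halved = link_bandwidth_mb_s(2, 3.2, clock_divider=2)

print(f"full clock:   {full:.0f} MB/s")
print(f"clock / 2:    {halved:.0f} MB/s")
print(f"headroom at half clock: {halved / NEEDED_MB_S:.1f}x")
```

Under these assumptions the halved clock still leaves several times the quoted data rate, which is consistent with treating the clock change as a safe option.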