C-RORC PRR ALICE / ATLAS ROS team
Agenda • Introduction • ALICE, by H. Engel • ATLAS • Concluding remarks
Introduction • C-RORC: hardware designed by ALICE • Types of firmware • Test firmware used during production, mainly developed by ALICE; test procedures discussed between ALICE and ATLAS (the loopback connector used for the tests was developed by ATLAS) • ALICE specific • ATLAS specific • RobinNP: C-RORC to become the Gen-III ROS ROBIN • “Dozolar”: data source for 12 S-links, for testing the S-link inputs of RobinNP • RoIBuilder: if the C-RORC replaces the VME-based RoI Builder, specific firmware may be needed; it is not excluded that the RobinNP firmware can be used
C-RORC • [Picture of the C-RORC with some explanation; annotation on one component: could be removed to improve air circulation (will be discussed later)]
This review • Production readiness of C-RORC hardware • First prototypes produced by Cerntech (Hungary) • PCBs from Exception PCB, UK • After tendering, the production contract was awarded to Hapro (Norway) • PCBs from Suntak, China • 20 pre-production cards under test since mid-February
Hapro and Cerntech C-RORC • PCB build differs: copper balancing on the Cerntech board (better spread of heat during manufacturing of the board) • Cooler + fan differ; the Hapro board is within the PCIe height limit • Hapro FPGA: commercial grade (0 to 85 °C); Cerntech FPGA: industrial grade (-40 to 100 °C) • [Photos: Cerntech and Hapro cards]
Pre-series test at contractor’s site and tests by ALICE • Described in the presentation by H. Engel
Tests performed by ATLAS • Visual inspection of the pre-series cards • Already mentioned by H. Engel: on one card 3 LEDs were soldered on one side only • Fixed by the CERN SMD Workshop • Card at Nikhef: some vias filled with solder • [Photos: Hapro and Cerntech cards]
Tests performed by ATLAS I • With RobinNP firmware: • robinnpbist program: • Checks register contents • Measures FPGA temperature • Sets clock frequencies for S-links • On-board memory tests • DMA speed tests • Interrupt tests, including performance benchmarking • Tests of speed and data integrity for page handling and transfers into buffer memory • Temperature measurements, read out via PCIe or via JTAG using ChipScope
Tests performed by ATLAS II • Standard data-taking environment using ReadoutApplication: • “Indexing” incoming data and managing buffer memory pages • Receiving requests via the network from the ROSTester program • Forwarding requests for data to RobinNP • Sending data via the network to the ROSTester program • Data generated by the internal test generator, by DOLARs or by MDT RODs • For short fragments (50 words), stable running has been seen over periods of 11 hours (limited by ROSTester) • With a request fraction of 100%, fragments larger than ~180 words cause a lockup of the firmware after a short time (10 to 50 s). The cause is a logic error in the FPGA's internal arbitration for access to shared resources. There is no obvious dependence on features of the C-RORC hardware. A fix for the lockups has been found, consisting of minor (but clearly significant) changes to a couple of state transitions in the Memory Controller's finite state machines. The memory is being operated at 303 MHz DDR; it is likely that with more work this can be scaled up
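The quoted memory clock of 303 MHz DDR can be related to the "DDR3@606" label on the FPGA core temperature slide with a short calculation. The sketch below is only illustrative: the function names are made up for this example, and the 8-byte (64-bit) module data bus is an assumption that is not stated in the slides.

```python
def ddr_transfer_rate_mt_s(clock_mhz):
    """DDR memory transfers data on both clock edges: two transfers per cycle."""
    return 2 * clock_mhz


def peak_bandwidth_gb_s(clock_mhz, bus_width_bytes=8):
    """Theoretical peak bandwidth of one module in GB/s (10**9 bytes/s).

    bus_width_bytes=8 is an ASSUMPTION (64-bit data bus), not from the slides.
    """
    return ddr_transfer_rate_mt_s(clock_mhz) * 1e6 * bus_width_bytes / 1e9


rate = ddr_transfer_rate_mt_s(303)   # memory clock quoted in the slides
bandwidth = peak_bandwidth_gb_s(303)
print(f"{rate} MT/s, theoretical peak {bandwidth:.3f} GB/s per module")
```

This is an upper bound only; the sustained rate seen by the firmware depends on the access pattern and on the memory controller.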
Test setups • Nikhef: Intel dual-CPU server, 2 C-RORCs, 2 dual-port 10 GbE NICs, 1 40 GbE NIC • RHUL: single-CPU server, 2 C-RORCs, 2 dual-port 10 GbE NICs • CERN: • 2 C-RORCs used as Dozolar • 2 Gen-III candidate PCs with 2 RobinNPs each • 1 PC with 3 DOLAR cards
Observations • Current for a few cards is ~10% higher than for the other cards, but these cards function normally • Boards from Cerntech seem to be less sensitive to air flow • With good air flow and a functioning fan, the temperature of the FPGA is not a problem (< ~65 °C)
C-RORC FPGA Core Temperatures • Accuracy of the FPGA temperature sensor: ± 4 °C • RHUL: 2 x DDR3@606, single rank, 100 MHz oscillator; measurements for a system without lid • Nikhef: 100 MHz oscillator, 1 subROB configuration; 2 subROBs: ~ +5 °C
Infrared photos • Hapro C-RORC in a machine with Supermicro MB at Nikhef, 4U-high machine with lid open • ALICE test firmware: FPGA sensor ~ 70 °C • RobinNP firmware: FPGA sensor ~ 64 °C
Temperature • High data rates: no significant change in the FPGA temperature • No relation with the presence or absence of QSFPs • Reporting and monitoring of fan failure & over-temperature: via Icinga (Nagios), with automatic flushing of the FPGA configuration to reduce power dissipation • To be implemented • Discuss a common solution with ALICE
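Since the monitoring is still to be implemented, the following is only a minimal sketch of what an Icinga/Nagios-style check could look like. The exit codes follow the usual plugin convention (0 OK, 1 WARNING, 2 CRITICAL, 3 UNKNOWN). The function name and the warning threshold are hypothetical; the critical threshold reuses the 85 °C upper limit of the commercial-grade FPGA range and the margin reuses the ± 4 °C sensor accuracy, both quoted in the slides, but using them this way is an assumption.

```python
# Conventional Nagios/Icinga plugin exit codes.
OK, WARNING, CRITICAL, UNKNOWN = 0, 1, 2, 3

SENSOR_ACCURACY_C = 4.0  # sensor accuracy quoted in the slides (+/- 4 degC)
WARN_C = 75.0            # HYPOTHETICAL warning threshold
CRIT_C = 85.0            # ASSUMED: top of the commercial-grade FPGA range


def check_fpga_temperature(temp_c, fan_ok):
    """Return (exit_code, message) for one card.

    temp_c is the FPGA sensor reading in degC (None if unavailable);
    fan_ok is False when a fan failure has been detected.
    """
    if temp_c is None:
        return UNKNOWN, "UNKNOWN - no temperature reading"
    if not fan_ok:
        return CRITICAL, f"CRITICAL - fan failure, FPGA at {temp_c:.0f} C"
    # Be pessimistic: the true temperature may exceed the reading.
    worst_case = temp_c + SENSOR_ACCURACY_C
    if worst_case >= CRIT_C:
        return CRITICAL, f"CRITICAL - FPGA at {temp_c:.0f} C"
    if worst_case >= WARN_C:
        return WARNING, f"WARNING - FPGA at {temp_c:.0f} C"
    return OK, f"OK - FPGA at {temp_c:.0f} C"


# A real plugin would print the message and exit with the code; here the
# result is only printed. A CRITICAL result is where the automatic flushing
# of the FPGA configuration mentioned above could be triggered.
code, message = check_fpga_temperature(64.0, fan_ok=True)
print(code, message)
```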
Identification of cards • DNA id of FPGA: unique number • Hapro serial number printed on PCB of card • ATLAS number • No registration of all QSFPs / memory modules, but ROS team will keep a record of malfunctioning devices
Number of cards to be produced • Total, including pre-production: 210 for ATLAS, 170 for ALICE • ATLAS: • Sub-detectors have been asked if they would like to purchase C-RORCs for test setups; deadline for requests: 15 April. Two requests received so far • ATLAS with 210 C-RORCs: about 10% spares + ~10 cards for the validation system at CERN and test systems at developer labs • Need for a (small) additional batch of C-RORCs, to be discussed • Plan to have complete Gen-III ROS PCs available as spares (at least 4, depends on plans with pre-series)
Testing upon arrival • Repeat Hapro test on a small sample • Subset of Hapro test for all cards (no loopback, no FMC) • robinnpbist testing with RobinNP firmware • Run a test partition with Dolars or Dozolar sending test data to the C-RORC under test, together with the ReadoutApplication and ROSTester programs • After installation of the Gen-III ROS PCs, run again with the test partition and verify that loading new firmware is OK
Deployment environment • USA-15 • 2U-high server PCs • 2 C-RORCs + 2 dual-port 10 GbE NICs per PC • Purchase contract for PCs not yet awarded; tendering closed; two candidate PCs under test in building 4
S-Link tolerance test • QSFP related: • Set up a ROL between a DOLAR and a RobinNP and measure with a variable attenuator at what attenuation the link starts to fail • LC-MPO fan-out will be tested at the same time
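The attenuation scan described above could be organised along the following lines. This is a sketch under stated assumptions: `find_failure_attenuation`, the 0 to 20 dB range and the 0.5 dB step are all hypothetical, and `simulated_link_ok` is a stand-in for the real measurement (setting the attenuator and checking the link for errors), with an invented 9.2 dB margin.

```python
def find_failure_attenuation(link_ok, start_db=0.0, stop_db=20.0, step_db=0.5):
    """Step the attenuation upward and return the first value (in dB) at
    which link_ok(attenuation_db) reports a failure, or None if the link
    survives the whole range."""
    steps = int(round((stop_db - start_db) / step_db))
    for i in range(steps + 1):
        attenuation_db = start_db + i * step_db
        if not link_ok(attenuation_db):
            return attenuation_db
    return None


def simulated_link_ok(attenuation_db, margin_db=9.2):
    """Stand-in for the real measurement; margin_db is an invented value."""
    return attenuation_db <= margin_db


failure_db = find_failure_attenuation(simulated_link_ok)
print(f"link starts to fail at {failure_db} dB (simulated)")
```

Repeating the same scan with the LC-MPO fan-out in the path would show how much of the margin the fan-out consumes.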
Schedule slippages • There have been some significant slippages in the schedule. In particular: • Delivery of the pre-series C-RORC cards was delayed, initially by a change of FPGA fan (to meet the PCIe thickness spec) and then more significantly by changes in the PCB build requested by the company (NB: without the efforts of Tivadar Kiss these probably could not have been solved). • The RobinNP firmware has taken longer to produce than expected. Although it now all exists, there are still issues remaining: a fix has been found for the issue in the buffer handling for full-size fragments from multiple channels, but optimization and further checking of the firmware are needed. • Procurement of the Gen-III ROS PCs has been delayed, mainly in getting the tender launched, so that tests of the candidate PCs are only just starting
Effect on testing of schedule slippages • Thus not yet able to start a long-duration stability test using pre-series cards in the final configuration • But there is a growing body of evidence from tests by ALICE and by ourselves at CERN, RHUL and Nikhef that the C-RORC H/W works reliably • Thus we no longer plan to run a long-duration (6-week) stability test prior to the main C-RORC production: the risk of not running the test is small and is outweighed by the consequences of the extra delay it would cause
Support • 5-year warranty by Hapro • Test setup at CERN for first diagnosis, remote access by experts possible • Test setups at RHUL and Nikhef for further investigations
Installation schedule • Boundary conditions: • The ROS system has to be stable and tested by 1 February 2015 • In case of a major problem with the Gen-III, re-installing and re-testing the Gen-II H/W takes ~6 weeks
Concluding remarks • RobinNP firmware not yet finalized, but to the best of our knowledge there are no hardware-related issues • ALICE is happy with starting the production • If we do not start production now, deployment of the Gen-III ROS for 2015 is not likely to be possible
Backup
[Photos: Cerntech and Hapro cards]
Test machine at Nikhef: Intel server with S2600CP motherboard
Test setup at Nikhef • Intel server with 2 C-RORCs • VME crate with 12 MRODs and SBC • Rack with Gen-I and Gen-II ROS PCs with Dolars and 10 GbE NICs, and with an E5-1620 based machine with a 40 GbE dual-port NIC and 10 GbE NICs
Test machine at RHUL with Supermicro X9SRL-F board
Machine with Supermicro MB at CERN • Fan may not be optimally positioned for max. air flow over PCIe cards • Picture from the Supermicro web site; the machine at CERN has 1 CPU
Test setup configuration (Nikhef) • 1 word = 4 Bytes • [Block diagram; labelled components: ROS PC running ReadoutApplication (2 x E5-2690 CPU, only 1 CPU used, SLC6 64-bit) with Cerntech C-RORC and Hapro C-RORC; two Gen-II ROS PCs running ROSTester; Gen-I ROS PC; PC with E5-1620 CPU; DOLARs; Intel 2-port 10 Gb/s NICs; 12 S-links, 1 subROB]
Test with fix for lockup • 10% readout fraction, 12 x 150-word fragments • 4 x 10^9 events generated: ROSTester stops
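To give a feeling for the data volume behind this run, the numbers on the slide can be combined with the "1 word = 4 Bytes" convention from the Nikhef setup slide. The sketch below is a back-of-the-envelope calculation, not a measurement; the function names are made up, and it assumes that the readout fraction is the fraction of events whose fragments are requested and that the quoted fragment size is the full fragment.

```python
BYTES_PER_WORD = 4  # "1 word = 4 Bytes", as stated on the Nikhef setup slide


def event_size_bytes(links, words_per_fragment):
    """Bytes buffered per event: one fragment per S-link."""
    return links * words_per_fragment * BYTES_PER_WORD


def data_volumes_tb(events, links, words_per_fragment, readout_fraction):
    """Return (buffered_TB, read_out_TB), with 1 TB = 10**12 bytes.

    ASSUMPTION: readout_fraction is the fraction of events whose
    fragments are requested by ROSTester.
    """
    buffered = events * event_size_bytes(links, words_per_fragment)
    return buffered / 1e12, buffered * readout_fraction / 1e12


size = event_size_bytes(12, 150)
buffered_tb, read_tb = data_volumes_tb(4 * 10**9, 12, 150, 0.10)
print(f"{size} bytes/event, {buffered_tb:.1f} TB buffered, {read_tb:.2f} TB read out")
```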
Test with fix for lockup • 55% readout fraction, 12 x 350-word fragments
Test with fix for lockup • 55% readout fraction, 12 x 250-word fragments
Test with fix for lockup • 55% readout fraction, 12 x 200-word fragments
Test with fix for lockup • 4 x 10^9 events • 45% readout fraction, 12 x 200-word fragments
Test with fix for lockup • 40% readout fraction, 12 x 200-word fragments
Test with fix for lockup • 70% *) readout fraction, 12 x 200-word fragments • *) 1 ROSTester requesting 100% of fragments, the other ROSTester requesting 40% of fragments • Slide corrected on 15 April
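The 70% figure is consistent with a plain average of the two ROSTester request fractions. The sketch below shows that arithmetic; it assumes that both ROSTesters cover equal shares of the traffic, which the slide does not state, and the function name and optional weights are purely illustrative.

```python
def combined_readout_fraction(fractions, weights=None):
    """Weighted mean of per-ROSTester request fractions.

    With weights=None every ROSTester is ASSUMED to cover an equal share.
    """
    if weights is None:
        weights = [1.0] * len(fractions)
    total = sum(weights)
    return sum(f * w for f, w in zip(fractions, weights)) / total


# One ROSTester requesting 100% of fragments, the other 40%.
combined = combined_readout_fraction([1.00, 0.40])
print(f"combined readout fraction: {combined:.0%}")
```

If the two ROSTesters did not cover equal shares, the weights would change the result, so the 70% should be read with that caveat.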
FPGA temperature for test of previous slide