P33: DØ Online Monitoring and Automatic DAQ Recovery

B. Angstadt, G. Brooijmans, D. Chapin, D. Charak, M. Clements, S. Fuess, A. Haas, R. Hauser, D. Leichtman, S. Mattingly, A. Kulyavtsev, M. Mulders, P. Padley, D. Petravick, R. Rechenmacher, G. Watts, D. Zhang

DØ and L3 home pages: http://www-d0.fnal.gov and http://www-d0online.fnal.gov/www/groups/l3daq/
daqAI home page: http://d0.epe.phys.washington.edu/Projects/daqAI/

The DØ Experiment and Run II at the FNAL Tevatron

The DØ experiment is one of two colliding-beam detectors at the Fermi National Accelerator Laboratory. It sits at one of the four collision points of the 1.96 TeV Tevatron proton-antiproton accelerator. Before the start of Run II in March 2001, the DØ detector underwent numerous upgrades to extend its physics reach. One of the major upgrades was the DAQ system, which was rebuilt from scratch; the projects described on this poster were spawned from that work.

Online Monitoring

It was realized early on that monitoring the health of the new DAQ system would require a subsystem of its own. Once developed, it slowly expanded to encompass other online subsystems. The design goals were:

• Extensibility: the system had to accommodate new monitor items and new data sources throughout its life without code changes.
• Crash Recovery: any component (data source, display, or server) should be able to crash and restart without anyone having to touch any other component in the system. This was a big aid during development!
• Uniform Access: the wire protocol had to be uniform and simple so that it could slot into preexisting programs.
• Scalability: large numbers of displays should not put undue load on the data sources. This was accomplished with caching in the server.
• Offsite Access: it must be possible to run all displays both locally at Fermilab and remotely. This means dealing with both latency and security issues.

What is a Monitor Item?

A monitor item is supplied by a data source and requested by a display. Each monitor item is uniquely addressed by the tuple (computer-name, client-type, item-name). For example, the luminosity for the DØ experiment is represented by (d0l3mon2.fnal.gov, luminosity, d0_luminosity).

Many Copies of One Data Source

Some data sources run in many places at once. For example, the l3xnode client runs as part of every Level 3 trigger node, and so runs on almost 100 different nodes. The query format allows displays to request a single monitor item from all clients at once.

The Data Format: XML

After some debate of binary versus text, we settled on text, and then on XML. The main reason was the ease with which one can add new monitor items to an already existing structure. A simple request is shown below; the reply XML is similar to the request.

Monitor System Data Flow

[Figure: displays and data sources connected through the Monitor Server, with the numbered steps below marked 1-6.]

1. The display forms an XML request containing a list of the monitor items and the data sources it wants data from.
2. The Monitor Server looks in its internal cache for monitor data that is recent enough to satisfy the request. Only data not in the cache is requested in step 3.
3. The Monitor Server assembles an XML request for each data source that must be contacted. A request may ask for several data items, and the requests are sent to the data sources in parallel.
4. Each data source generates an XML reply that contains all the requested data.
5. The Monitor Server caches all data in the reply and builds a complete reply for the display.
6. The Monitor Server sends the XML data back to the display, which parses the data and displays or otherwise processes it. Many standard packages can be used by a display to parse the monitor data; we have used Xerces (open source), MSXML, and roll-your-own parsers.

An example request:

    <monitor_query>
      <monitor_server>
        <all_names>
          <Queries/>
        </all_names>
      </monitor_server>
      <luminosity>
        <all_names>
          <d0_luminosity/>
          <beam_energy/>
          <pbar_stack_size/>
          <outside_temperature/>
          <store_number/>
        </all_names>
      </luminosity>
    </monitor_query>

Each top-level element (monitor_server, luminosity) names a client type; the all_names wrapper asks for the listed monitor items (Queries, d0_luminosity, and so on) from any node that has that client running on it. The reply uses the same layout, with the requested values filled in.
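To make the request format concrete, here is a short sketch that builds the query above programmatically. It is an illustration only, not the DØ client library: the function is ours, the element names are taken from the example, and it stops at serializing the XML (the actual wire framing and server address are not shown here).

    # Illustrative sketch only: builds a monitor_query document like the
    # example above. The element names come from the example request; the
    # wire framing and server address are not part of this sketch.
    import xml.etree.ElementTree as ET

    def build_monitor_query(items_by_client):
        """items_by_client maps a client type (e.g. 'luminosity') to the
        list of monitor item names wanted from every node running it."""
        query = ET.Element("monitor_query")
        for client_type, item_names in items_by_client.items():
            all_names = ET.SubElement(ET.SubElement(query, client_type), "all_names")
            for name in item_names:
                ET.SubElement(all_names, name)  # empty element = "send this item"
        return ET.tostring(query, encoding="unicode")

    print(build_monitor_query({
        "monitor_server": ["Queries"],
        "luminosity": ["d0_luminosity", "beam_energy", "pbar_stack_size",
                       "outside_temperature", "store_number"],
    }))

Because a new monitor item is just a new element, a data source can start publishing one without any change to client code like this, which is the extensibility goal listed above.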
Complex Data

Monitor data can be of arbitrary complexity, formatted in XML or straight text (binary would not work). If XML control characters need to be included, the standard XML escape mechanism can be used: <![CDATA[data-goes-here]]>. We encourage the use of structured XML data, as automatic parsing tools can then extract the data without specialized encoding.

Data Flow in the Monitor Server

[Figure: each display and each data source is serviced by its own handler thread; all requests funnel through a single-threaded query processor backed by the data cache.]

Performance

• 160 data sources, 30 displays
• Distributes 1 MB/second of monitor information
• 50% cache hit rate
• Uses 10% of a dual-CPU Linux node
• Uses 4% of 512 MB of RAM
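The 50% cache hit rate above comes from the freshness test in step 2 of the data flow. The sketch below is hypothetical (the class, the key layout, and the two-second freshness window are all invented); it shows only the policy: answer from the cache when an entry is recent enough, batch the misses to the data source, and remember what comes back.

    # Hypothetical sketch of the Monitor Server's cache policy (step 2 above).
    # Names and the freshness window are invented for illustration; the real
    # server is multithreaded with a single-threaded query processor.
    import time

    class MonitorCache:
        def __init__(self, max_age_seconds=2.0):
            self.max_age = max_age_seconds
            self.entries = {}  # (node, client_type, item_name) -> (timestamp, value)

        def lookup(self, key):
            """Return a cached value if it is recent enough, else None."""
            hit = self.entries.get(key)
            if hit is not None and time.time() - hit[0] < self.max_age:
                return hit[1]
            return None

        def store(self, key, value):
            self.entries[key] = (time.time(), value)

    def answer_query(cache, keys, fetch_from_source):
        """Serve each requested item from cache when possible; batch the
        misses into one request to the data source (step 3)."""
        reply, misses = {}, []
        for key in keys:
            value = cache.lookup(key)
            if value is None:
                misses.append(key)
            else:
                reply[key] = value
        for key, value in fetch_from_source(misses).items():
            cache.store(key, value)
            reply[key] = value
        return reply

    cache = MonitorCache()
    print(answer_query(cache, [("d0l3mon2.fnal.gov", "luminosity", "d0_luminosity")],
                       lambda misses: {k: 42.0 for k in misses}))

Caching in the server, rather than in each display, is what keeps large numbers of displays from putting undue load on the data sources.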
Getting Data Into the System

The monitor system is only as good as the data fed into it. How hard it is to add a new data source or data item depends on how much work one wants to do. There are C++ template container objects that let one monitor an item with a single declarative line. The following C++ creates a new monitor item called count and sets it to 5; any query from the monitor server for count will then return 5:

    l3_monitor_util_reporter_ACE monitor_callback ("tester");
    l3_monitor_object_op<int> counter ("count");
    counter = 5;

We have similar code for Python as well.

Getting Data Out of the System

We also have code to get data out of the system: frameworks for C++, Python, Java, HTTP, and C#. It is also possible to program directly to the TCP/IP protocol. The following Python code requests the item created above (error checking removed):

    disp = l3xmonitor_util_module.monitor_display()
    test_item = disp.get_item("tester", "count")
    disp.query_monitor_server()
    print "Count value is %s" % test_item[0]

Security

The DØ online system is a critical system and is declared inaccessible from the outside. We petitioned for a single hole through the firewall and run a repeater on the other side. This allows us to view monitoring data offsite and has been crucial for debugging the L3/DAQ system.

Automatic DAQ Recovery

There are moments in the control room when shifters, addressing the same malfunction for the tenth time in a row, swear, "Why can't the computer do this?" The daqAI project attempts to address these sorts of problems.

daqAI is a display of the monitor system (see the Monitor System Data Flow figure above). It uses a facts-and-rules inference engine to detect problems, scanning the monitor data once a second. Based on what it sees and the actions of its rules, daqAI can issue simple commands to the run control system or speak to the shifters using a synthesized voice. See the CLIPS section below.

daqAI Architecture

daqAI cycles once per second through a simple set of linear tasks:

[Figure: Gather Monitor Data -> Convert to Facts -> Add Persistent and Timer Facts -> Process All Rules until no more fire -> Process new log messages, commands, and persistent facts.]

The monitor data is converted to facts and the facts are passed to the expert system. Some rules, when they fire, call embedded procedures to request log entries, run timers, set persistent facts, or send a command to the DAQ system. After all the rules have run, these requests are acted upon. The expert system retains no memory of any previous run; it starts fresh each time. This was a design decision (it is also difficult to retract a fact). Thus, if a problem occurs, the rule system will try to fix it every second, even if the fix requires 4 or 5 seconds; only new requests for commands or log messages are acted upon.

The daqAI rule script currently contains 77 rules. The rules assign DAQ dead time to approximately 8 different sources and try to fix 5 of them. A single iteration takes about half a second of real time.
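In outline, the cycle looks like the loop below. This is a sketch, not the real daqAI code (which embeds the CLIPS engine); every class here is a stub standing in for one stage named in the figure.

    # Outline of daqAI's one-second cycle, as described above. Illustration
    # only: the real daqAI embeds the CLIPS engine, and every class here is
    # a stub standing in for one piece of that system.
    import time

    class Engine:                        # stand-in for the embedded CLIPS engine
        def reset(self): self.facts = []
        def assert_fact(self, f): self.facts.append(f)
        def run(self): pass              # fire rules until no more fire

    class Monitor:                       # stand-in for the monitor-system query
        def gather(self): return [("l3xetg-n_runs", 1)]

    class Actions:                       # log entries, timers, DAQ commands
        def persistent_and_timer_facts(self): return []
        def process_requests(self): pass # only *new* requests are acted upon

    engine, monitor, actions = Engine(), Monitor(), Actions()
    for _ in range(3):                   # the real daemon loops forever
        engine.reset()                   # start fresh: no memory of the previous run
        for fact in monitor.gather() + actions.persistent_and_timer_facts():
            engine.assert_fact(fact)     # monitor data becomes facts
        engine.run()                     # process all rules until no more fire
        actions.process_requests()       # act on requested logs, commands, persistent facts
        time.sleep(1.0)                  # daqAI cycles once per second

Resetting the engine each cycle is why an unfixed problem is retried every second: the rules simply fire again on the same facts until the symptom disappears.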
CLIPS: An Embeddable Expert System

daqAI's inference engine is CLIPS (http://www.ghg.net/clips/CLIPS.html), whose rules pattern-match on facts. Monitor data is used to define facts:

    (l3xetg-n_runs 1)
    (l3xetg-rate_events 82.677)
    (muon_ro_crate 20 "running OK")
    (muon_ro_crate 22 "running OK")

These facts then fire rules:

    (defrule daq_is_active "DAQ Configured?"
      (l3xetg-n_runs ?num&:(!= ?num 0))
      =>
      (assert (b_daq_is_active ?num)))

If a fact l3xetg-n_runs n is present, where n is not zero, this rule fires and asserts a new fact, b_daq_is_active n, which can then be used by other rules.

A more complex rule:

    (defrule log_l2scli_missing "Missing Muon SLIC"
      (s_slic_missing_input ?ct ?channel)
      (not (problem_reason ?w))
      =>
      (log_reason "L2 SLIC Input Missing (crate 0x" (tohex ?ct)
                  ", channel 0x" (tohex ?channel) ")")
      (talk "slick input missing. Muon Crate " (tohex ?ct) ".")
      (issue_coor_request "scl_init"))

If the s_slic_missing_input fact is present and no problem_reason fact is, then downtime is assigned to the muon SLIC, the control room is notified by voice, and the DAQ system is sent an init command to resync its timing.

Adding a New Problem to daqAI

Once a problem for daqAI to handle has been identified, the steps to implement it are fairly simple:

• Make sure the monitor data is available to uniquely identify this problem. In our current system this often means adding new data sources or monitor items.
• If the problem is to be fixed by daqAI, make sure the commands to effect the change are accessible. This is sometimes a hard pill for a detector group to swallow: giving up control to an automated system.
• Write the CLIPS rules to identify the problem, assign a downtime reason, and, perhaps, issue the commands to make the fix. Also communicate with the shifters; this is especially crucial when issuing commands.

Assigning Down Time

daqAI was originally designed to fix the problems it found. Once it was running, however, it was realized that it could also classify downtime even when it did nothing to fix the cause. The result is the shift report below:

    daqAI Shift report covering the period 2003-03-19-16:00:00 to 2003-03-20-00:00:00
    5 times 'Bad A/O BOT Rate in term(193): muon' took 00:13:33 (162.6 secs on avg)
    1 times 'Crate 0x67 is FEB' took 00:00:20 (20 secs on avg)
    1 times 'L2 Crate 0x21 lost sync' took 00:00:07 (7 secs on avg)
    2 times 'Muon Crate 0x30 has fatal error' took 00:00:25 (12.5 secs on avg)
    7 times 'Other' took 00:06:18 (54 secs on avg)
    Timer Status:
      Timer in_store:             06:36:34 (off)
      Timer l3_configured:        06:30:28 (off)  98.4618% of in_store
      Timer daq_configured:       06:22:43 (off)  96.5075% of in_store, 98.0152% of l3_configured
      Timer good_data_flow_to_l3: 06:16:21 (off)  94.9021% of in_store, 96.3847% of l3_configured, 98.3365% of daq_configured

How Successful is daqAI?

daqAI helped increase the DØ DAQ live time from 75% to about 85%, mostly because it could detect and respond to frequent problems much faster than a human shifter. That is not to say there were no difficulties:

• Problem symptoms can change, which requires code updates.
• Each new problem requires significant changes to the rules code.
• There can be feedback loops, especially if a monitoring source is faulty.

Looking Ahead

Problem identification may benefit from an automatic data classification algorithm; all the data is maintained, long term, in an Oracle DB… How long to the next CHEP?

Probably the most important thing for an upcoming experiment is to create a centralized control and monitoring system. This will greatly aid in developing systems like this one.