Control in ATLAS TDAQ Dietrich Liko on behalf of the ATLAS TDAQ Group
Overview • The ATLAS TDAQ System • Dataflow & HLT • Control Subsystem of the Online Software • Architecture • TDAQ Wide Run Control Group • Technology Choice • CLIPS • Design & Implementation • Expert System Framework • Run Control, Supervision & Verification • Testing & Verification • Test beam • Scalability Tests Control of the ATLAS TDAQ system
The ATLAS TDAQ System [Diagram: Dataflow (ROD, ROS), LVL1, HLT (LVL2, Event Filter), Online System (operation), DCS (detector control)] • Test beam: see [331] • Event Building Performance: see [217]
Control Aspects • Dataflow • Fixed configuration • Synchronization, classical Run Control • Error handling • High Level Triggers • Flexible configuration • Synchronization • Error handling
ATLAS Online Software • Component Architecture • Object Oriented, C++ and Java • Distributed system (CORBA) • XML for Configuration • Specialized services for a TDAQ system • Information sharing, Message Reporting, Configuration • Iterative Development Model • Prototype already in use • Laboratories, Test beam, Scalability tests • Evolving toward the initial ATLAS system
Online Software Architecture • In the context of the iterative development cycle and the Technical Design Review • Reevaluation of requirements and architecture • Several high level packages & corresponding subsystems • Control • Supervision, Verification • Databases: see [130] • Configuration, Conditions • Information Sharing: see [166] • Information Service, Message Service, Monitoring
Control Subsystem • In the following, only the Supervision subsystem is discussed.
Supervision • Initialization and Shutdown is responsible for: • initialization of TDAQ hardware and software components; • re-initialization of a part of the TDAQ partition when necessary; • shutting the TDAQ partition down gracefully; • TDAQ process supervision. • Run Control is responsible for: • controlling the run by accepting commands from the user and sending commands to TDAQ sub-systems; • analyzing the status of controlled sub-systems and presenting the status of the whole TDAQ to the operator. • Error Handling is concerned with: • analyzing run-time error messages coming from TDAQ sub-systems; • diagnosing problems, proposing recovery actions to the operator, or performing automatic recovery if requested.
TDAQ Wide Run Control group • Examines the requirements from the subsystem side • Dataflow, HLT • Hierarchical concept • Follows the overall organization of the TDAQ system • Controller central element • All control functionality in combined controller • State machine concept for synchronization • Flexibility in error handling • User customization
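The hierarchical concept with a state machine for synchronization can be sketched as follows. This is only an illustration, not the ATLAS implementation: the state names (idle, configured, running) are the ones used later in the talk, while the command names, the `Controller` class, and the example tree are hypothetical.

```python
from dataclasses import dataclass, field

# Assumed transition table: command -> (required state, resulting state).
# The state names come from the slides; the command names are hypothetical.
TRANSITIONS = {
    "configure": ("idle", "configured"),
    "start": ("configured", "running"),
    "stop": ("running", "configured"),
    "unconfigure": ("configured", "idle"),
}


@dataclass
class Controller:
    """One node of a run control tree (illustrative only)."""
    name: str
    children: list = field(default_factory=list)
    state: str = "idle"

    def command(self, cmd):
        required, target = TRANSITIONS[cmd]
        if self.state != required:
            raise RuntimeError(f"{self.name}: cannot {cmd} from state {self.state}")
        # Forward the command down the tree first; the parent changes state
        # only after every child has completed the transition.
        for child in self.children:
            child.command(cmd)
        self.state = target

    def walk(self):
        """Yield this controller and all controllers below it."""
        yield self
        for child in self.children:
            yield from child.walk()


def build_tree():
    """Build a small hypothetical tree: root -> two segments -> leaves."""
    dataflow = Controller("dataflow", [Controller(f"ros-{i}") for i in range(2)])
    hlt = Controller("hlt", [Controller(f"lvl2-{i}") for i in range(3)])
    return Controller("root", [dataflow, hlt])
```

A single `root.command("configure")` then brings every controller in the tree to the configured state, and a command that does not match the current state is rejected before anything is forwarded.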
Initial Design & Technology Choice • The Run Control implementation is based on a State Machine model and uses the State Machine compiler, CHSM, as underlying technology. • P.J. Lucas, An Object-Oriented language system for implementing concurrent, hierarchical, finite state machines, MS Thesis, University of Illinois (1993) • The Supervisor is mainly concerned with process management. It has been built using the Open Source expert system CLIPS. • CLIPS, A tool for building expert systems, http://www.ghg.net/clips/CLIPS.html • The Verification system (DVS) performs tests and provides diagnosis. It is also based on CLIPS.
Experiences • PLUS • Scalability test in 2002 demonstrated that a system of the size of the ATLAS TDAQ system can be controlled • MINUS • Lack of flexibility (CHSM)
Technologies • CLIPS • Production system, standard open source expert system • The Rete algorithm drives the evaluation of rules against a set of facts • In-house experience • General purpose scripting language, OO features • C language bindings • Alternatives • Jess: Java based, very similar to CLIPS • Eclipse: Commercial evolution of CLIPS • SMI++ • State Machine • No general purpose scripting language • Difficult to integrate in our environment • Python • Excellent scripting language • No expert system
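To make the idea of a production system concrete, here is a minimal sketch of rules being evaluated against a set of facts. It uses a naive match loop rather than the Rete algorithm (Rete avoids re-testing every rule on every pass), it is written in Python rather than CLIPS, and all fact and rule contents are invented for illustration.

```python
def run_rules(facts, rules):
    """Fire rules until no rule adds a new fact (naive forward chaining).

    Each rule is a pair (condition, conclusion): when every fact in
    `condition` is present, `conclusion` is added to the fact set.
    """
    facts = set(facts)
    changed = True
    while changed:
        changed = False
        for condition, conclusion in rules:
            if condition <= facts and conclusion not in facts:
                facts.add(conclusion)
                changed = True
    return facts


# Hypothetical knowledge base; the names are not taken from the ATLAS software.
RULES = [
    # Fires only once an earlier rule has derived ("partition", "running").
    (frozenset({("partition", "running")}), ("operator", "notified")),
    (frozenset({("child-a", "running"), ("child-b", "running")}),
     ("partition", "running")),
    (frozenset({("process-x", "dead")}), ("action", "restart process-x")),
]
```

Calling `run_rules({("child-a", "running"), ("child-b", "running")}, RULES)` derives the partition state first and then the operator notification on the next pass, which is the chaining behaviour that an expert system provides and a plain scripting language does not.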
Design & Implementation • General Framework embedding CLIPS in a CORBA server • Periodic evaluation of knowledge base • Extension mechanism • Online Software Components embedded as plug-ins • Control functionality fully described by CLIPS rules
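The combination of periodic evaluation and a plug-in extension mechanism might look roughly like the sketch below. The `Framework` class, its method names, and the example plug-in are assumptions made for illustration; the real framework embeds CLIPS in a CORBA server, which is not modelled here.

```python
import time


class Framework:
    """Toy stand-in for a framework that periodically evaluates rules."""

    def __init__(self):
        self.plugins = {}  # plug-in name -> callable returning a set of facts
        self.rules = []    # callables mapping a fact set to a list of actions

    def register_plugin(self, name, collect):
        """Extension mechanism: a plug-in contributes facts on each cycle."""
        self.plugins[name] = collect

    def add_rule(self, rule):
        self.rules.append(rule)

    def evaluate_once(self):
        """Collect facts from all plug-ins, then let every rule react."""
        facts = set()
        for collect in self.plugins.values():
            facts |= collect()
        actions = []
        for rule in self.rules:
            actions.extend(rule(facts))
        return actions

    def run(self, cycles, period_s=1.0):
        """Evaluate the knowledge base `cycles` times, once per period."""
        log = []
        for _ in range(cycles):
            log.append(self.evaluate_once())
            time.sleep(period_s)
        return log
```

For example, a hypothetical process-management plug-in could report `("proc-1", "dead")`, and a rule could answer with a restart action on every evaluation cycle until the fact disappears.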
Proxy Objects • Represent external entities • Other controllers, processes, etc. • Member attributes exposed to expert system as facts • Member functions implement functionality in terms of Online Software components • Example • Proxy objects represent child controllers • State of the object corresponds to state of the child (idle, configured, running) • Commands are forwarded to child controllers
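The child-controller example can be sketched as a small proxy class. The class name, the fact layout, and the `send` callable (which stands in for the remote call to the child) are hypothetical; only the states idle, configured, and running come from the slide.

```python
class ChildControllerProxy:
    """Local stand-in for a remote child controller (illustrative only)."""

    def __init__(self, name, send):
        self.name = name
        self.state = "idle"  # mirrors the state of the child
        self._send = send    # hypothetical callable performing the remote call

    def as_facts(self):
        """Expose member attributes to the rule engine as facts."""
        return {(self.name, "state", self.state)}

    def forward(self, command):
        """Forward a command to the child and record the state it reports."""
        self.state = self._send(self.name, command)


def fake_send(name, command):
    """Test double for the remote call: returns the state after `command`."""
    return {"configure": "configured", "start": "running"}[command]
```

Rules then only ever see facts such as `("hlt", "state", "configured")`, while the proxy hides how the child is actually reached.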
Controller • Rules drive interactions between objects [Diagram: Controller with Proxy Objects for Other Controllers and External processes]
Status • Supervisor • Uses Framework • Run Control • Uses Framework • Verification system • CLIPS based • Choice of a common technology drives the path to a unified control system based on Controllers
Scalability Test 2004 • Test bed • Up to 330 PCs of the CERN IT LXSHARE • 600 to 800 MHz to 2.4 GHz Dual Pentium III • 256 to 512 MB • Linux RedHat 7.3 • Only control aspect verified • No Dataflow network • Various configurations • Servers on standard machines • Servers on dedicated high end machines
Supervisor – Process Management • One Supervisor • PMG Agents • Startup limited by initialization of processes • Enhanced recovery procedures [Diagram: Supervisor managing processes (P)]
Startup with 1000 Controllers & 3000 processes in 40 to 100 seconds • Several configurations: mon_standard has two additional processes per controller
Run Control • Usual RC tree • Actually 10 controllers on the lowest level • Variation of the number of intermediate nodes • Some central infrastructure • Name Service (IPC) • Information Sharing
Transitions • 7 internal phases • With 1000 Controllers: 2 to 6 seconds • No “real life” actions • Again: more flexible error handling
Combined Testbeam 2004 • Stable operation from the start – Advantage of the component model
Conclusions • New assessment of requirements • Overall Architecture • Controller studied in detail • CLIPS confirmed as technology choice • Design and implementation of a new framework • First test of new systems • Test beam • Scalability test • We can control a system of the size of the ATLAS TDAQ system • Much more flexible system • Common technology in various control components • Unified controllers in the future