NGOP Prototype Status Report
T. Levshina
Next Generation Operation GROUP
Integrated Systems Development Department: Krzysztof Genser, Terry Jones, Tanya Levshina, Igor Mandrichenko, Don Petravick
Operating Systems Support Department: Troy Dawson, Jim Fromm, Lisa Giacchetti, Marc Mengel, Ken Schumacher, Steven Timm
Computing Services Department: Rick Thies, Rich Thompson
ngop@fnal.gov
Presentation Highlights
• NGOP project phases
• Status of the Framework
• Status of the prototype deployment
• Near-future milestones
NGOP Project Phases (since last HEPIX)
• December 2000: The first prototype implementation was released.
• January 2001: The prototype was installed on the farms. Classes were held for farm administrators.
• February 2001: An NGOP server node was installed in the operator console area. Monitoring by operators started.
• March 2001: New release (“Swatch” and “PlugIns” Agents). NGOP was evaluated by system administrators, operators and others. A strategy meeting was held.
• April 2001: The “Xfalive” service (low-level ping) was provided for all nodes monitored by the Computing Services Department.
NGOP Architecture
[Figure: NGOP architecture diagram. Components shown: Central Server, Configuration File Management Service (Persistent Config. Data), Archive Service (Archive), Performance Storage Service (Performance Data), Report Generator, Data Analyzer, Router, Monitor, Administrator, Action Client, and Monitoring Agents (MA) on Cluster A and Cluster B (with sub-clusters B1 and B2).]
Legend:
• Monitored Objects: Host, Element, Cluster, System
• NGOP Components: Sensor Agent, Monitoring Agent, Server, Data Storage, Clients
• Connections: TCP, UDP, connection between Monitored Element and MA
• Some items are marked as not implemented in the prototype yet
Data Flow and NGOP Components Interaction
[Figure: data flow diagram. Components shown: Monitored Elements, Monitoring Agents (MA), Central Server, Monitors (issuing Action Requests), Action Client, Configuration Service, CVS, Archiver.]
Sample events shown in the figure:
• ID=swap.nodeA State=Up Value=98 SevLevel=Error Dscrb=“swap > 95 %”
• ID=syslogd.nodeB State=Down Dscrb=“syslogd is down”
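The two sample events on this slide read as flat key=value records. The sketch below shows one way such a record could be parsed, assuming the field names visible on the slide (ID, State, Value, SevLevel, Dscrb). The `Event` class and the `parse_event` helper are hypothetical names introduced for illustration; the slides do not show the actual NGOP message format or API, and straight quotes are used here in place of the slide's typographic quotes.

```python
import shlex
from dataclasses import dataclass
from typing import Optional


@dataclass
class Event:
    """Hypothetical container for one monitoring event (not the NGOP API)."""
    id: str
    state: str
    value: Optional[str] = None
    sev_level: Optional[str] = None
    description: Optional[str] = None


def parse_event(text: str) -> Event:
    """Parse 'Key=Value' tokens; quoted values may contain spaces."""
    fields = {}
    for token in shlex.split(text):
        key, _, val = token.partition("=")
        fields[key] = val
    return Event(
        id=fields["ID"],
        state=fields["State"],
        value=fields.get("Value"),        # optional on the slide
        sev_level=fields.get("SevLevel"),  # optional on the slide
        description=fields.get("Dscrb"),
    )


swap_event = parse_event(
    'ID=swap.nodeA State=Up Value=98 SevLevel=Error Dscrb="swap > 95 %"'
)
daemon_event = parse_event('ID=syslogd.nodeB State=Down Dscrb="syslogd is down"')
```

Fields missing from a record (for example Value and SevLevel in the syslogd event) are simply left as `None`.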
Status of Framework (Implemented Components)
• Monitoring Agent:
  • MA API (only Python binding)
  • PlugIns Agent (XML configuration is required)
• Several types of MAs are provided in the NGOP Prototype:
  • Linux Node "health":
    • System daemons presence
    • Critical file systems presence and size
    • CPU load
    • Memory utilization
    • Swap utilization
    • Number of users
    • Number of users’ processes
    • Number of processors
    • Baseboard temperature
    • Fan speed
  • “Xfalive”:
    • Node availability (low-level ping)
    • Node reset
  • FBS:
    • FBS daemons presence
    • Resources (“cpu” and scratch disk availability)
  • “Swatch”:
    • Watches a log file (e.g. syslog or console log) for lines matching a regular expression
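The “Swatch” agent is described as watching a log file for lines that match a regular expression. A minimal sketch of that idea follows, using only Python's standard `re` module. The function name `match_log_lines`, the pattern, and the sample log lines are all assumptions made for illustration; they are not taken from the NGOP code or its XML configuration.

```python
import re
from typing import Dict, Iterable, Iterator, Tuple


def match_log_lines(
    lines: Iterable[str], patterns: Dict[str, str]
) -> Iterator[Tuple[str, str]]:
    """Yield (pattern_name, line) for every log line matching a pattern."""
    compiled = [(name, re.compile(rx)) for name, rx in patterns.items()]
    for line in lines:
        for name, rx in compiled:
            if rx.search(line):
                yield name, line.rstrip("\n")


# Hypothetical pattern and log lines, for illustration only.
patterns = {"nfs_timeout": r"nfs: server \S+ not responding"}
sample_log = [
    "Apr  2 10:00:01 nodeA kernel: nfs: server fs1 not responding, timed out\n",
    "Apr  2 10:00:05 nodeA sshd[123]: session opened\n",
]
hits = list(match_log_lines(sample_log, patterns))
```

Because the function accepts any iterable of lines, an open file object can be passed in place of the sample list.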
Status of Framework (Implemented Components)
• NGOP Central Server (NCS):
  • Gathers events from MAs
  • Scalable (so far ~512 nodes)
  • Provides users with requested information
  • Handles multiple users
  • Primitive locking mechanism to prevent simultaneous actions
  • Action broadcasting
  • Stores information locally and forwards it to Archive Storage
• NGOP Configuration File Management Service:
  • Provides a central repository for system configuration and monitoring rules
  • Performs configuration sanity checks
  • Provides clients with component subscription lists
  • Allows dynamic reconfiguration
  • Notifies clients about new configurations
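The slide mentions a primitive locking mechanism that prevents simultaneous actions but does not say how it works. The sketch below is one possible reading, assuming one lock per action target (for example a node name) that is held by a single user at a time. The `ActionLocks` class and its methods are hypothetical and are not the NCS implementation.

```python
import threading
from typing import Dict


class ActionLocks:
    """Hypothetical per-target lock table: one action per target at a time."""

    def __init__(self) -> None:
        self._guard = threading.Lock()      # protects the owner table
        self._owners: Dict[str, str] = {}   # target -> user holding the lock

    def acquire(self, target: str, user: str) -> bool:
        """Return True if the user obtained the lock, False if it is taken."""
        with self._guard:
            if target in self._owners:
                return False
            self._owners[target] = user
            return True

    def release(self, target: str, user: str) -> bool:
        """Release the lock; only the current holder may release it."""
        with self._guard:
            if self._owners.get(target) != user:
                return False
            del self._owners[target]
            return True


locks = ActionLocks()
first = locks.acquire("nodeA", "operator1")    # lock granted
second = locks.acquire("nodeA", "operator2")   # refused while the lock is held
released = locks.release("nodeA", "operator1")
third = locks.acquire("nodeA", "operator2")    # granted after the release
```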
Status of Framework (Implemented Components)
• Archive Server:
  • Handles archive storage (Oracle)
  • Provides a means to read and query the data (FNAL web interface: MISWEB)
  • Performs data roll out
  • Performs the clean-up procedure
• Action Client:
  • Performs centralized actions
  • Verifies user authorization to perform the action
  • Notifies NCS about the action exit status
• Monitoring Client:
  • Allows configuration of custom-built system views
  • Defines rules that determine the status of the system and its components
  • Requests and receives information about monitored objects
  • Determines the status of the system based on the rules and the obtained information
  • Initiates requests to perform actions
• All configuration files are written in XML
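The Monitoring Client is said to determine the status of a system from rules and from the information it receives about monitored objects. The slides do not show the rule language (elsewhere they note that it is XML, partly MathML), so the sketch below swaps in a simple "worst status wins" roll-up over a nested system view. The severity names, their ordering, and the `rollup` function are assumptions for illustration only.

```python
from typing import Dict, Union

# Hypothetical severity scale, ordered from best to worst.
SEVERITY_ORDER = ["Good", "Warning", "Error", "Bad"]

# A view is a nested mapping: system -> components -> status string.
View = Dict[str, Union[str, "View"]]


def rollup(view: View) -> str:
    """Return the worst severity found anywhere in the nested view."""
    worst = SEVERITY_ORDER[0]
    for child in view.values():
        level = rollup(child) if isinstance(child, dict) else child
        if SEVERITY_ORDER.index(level) > SEVERITY_ORDER.index(worst):
            worst = level
    return worst


# Hypothetical farm view built from per-component statuses.
farm = {
    "nodeA": {"swap": "Error", "cpu": "Good"},
    "nodeB": {"syslogd": "Warning"},
}
farm_status = rollup(farm)
```

A real rule set would likely weigh components differently (for example, tolerating a few failed worker nodes), which this roll-up does not attempt.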
Status of Framework (Not Yet Implemented Components)
• Sensor Agent: an agent that collects performance data and generates events at a higher rate than a monitoring agent.
• Performance Data Storage Service: a service that provides persistent storage of performance data, as well as a means to read and query the data. Performance data will need to be consolidated.
• Looping Monitoring Agent: an agent that is capable of receiving information from NCS, analyzing it, deriving new events and sending them back to NCS.
CFMS Admin
NGOP Monitor (Configuration)
NGOP Monitor (Display)
NGOP Monitor (Display)
Prototype Statistics
• Some implementation details:
  • Written primarily in Python (some modules in C)
  • ~10,000 lines of Python code and ~1,000 lines of C code
  • Uses XML (and partially MathML) for all configuration files
  • ~600 configuration files
• Some deployment details:
  • Monitoring 512 nodes, checking for node down and node reset
  • Monitoring four farms (CDF, D0, two Fixed Target experiment farms), 270 nodes out of 512
  • Number of Monitoring Agents: ~557 (270 local MAs monitor operating system and sensor data on the farms, 270 local MAs monitor syslog on the farms, 4 MAs monitor FBS on the corresponding farms, 13 MAs perform the “xfalive” service)
  • Number of Monitored Objects: ~6,500
  • About 5 instances of “ngop monitor” (GUI) are running simultaneously
  • A local event log has been kept since January 12
  • Event rate is ~13 events per hour
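The agent count quoted above can be checked directly from the per-type figures on the slide. The short sketch below does that arithmetic and also derives the share of monitored nodes that belong to the farms; the dictionary keys are labels chosen here for readability, not NGOP identifiers.

```python
# Per-type Monitoring Agent counts as quoted on the slide.
agents = {
    "os_and_sensors": 270,  # local MAs for operating system and sensor data
    "syslog": 270,          # local MAs watching syslog
    "fbs": 4,               # one FBS MA per farm
    "xfalive": 13,          # MAs providing the low-level ping service
}
total_agents = sum(agents.values())

farm_nodes = 270
all_nodes = 512
farm_share_percent = round(100 * farm_nodes / all_nodes, 1)
```

The sum comes to 557, matching the quoted total, and the farms account for roughly half of the 512 monitored nodes.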
Current Configuration
[Figure: deployment diagram of the current NGOP configuration.]
Labels recoverable from the figure:
• CDF Farm: cdffarm1, fncdf 1 - 90, MA (CDF_FBS), MA (OSHealth), Swatch
• FixTarget Farm: fnsfo, fnpc 201 - 250, MA (FT_FBS), MA (OSHealth), Swatch
• D0 Farm: d0bbin, fnd0 1 - 100, MA (D0_FBS), MA (OSHealth), Swatch
• Old FixTarget Farm: fnsfh, fnpc 1 - 37, MA (OFT_FBS), MA (OSHealth), Swatch
• NGOP MAs (Ping): PPD, MISCOMP, CMS, CDF, D0, Kerberos, FNALU, Division Servers, SDSS, License Servers, Mail Servers, KTEV, MINOS, HPPC, ODS, BTEV, Enstore
• FNCDUH: Config File Management Server, NGOP Central Server, Action Client, Archive Service, WWW
• User Nodes: NGOP Monitor
Summary of Occurred Events
• Detected Problems:
  • Node reset
  • Node is down
  • One CPU is missing after reboot
  • File system not mounted
  • System daemon is dead
  • FBS Batch Manager is down
• Raised Alarms:
  • Memory usage is high
  • Swap usage is high
  • CPU load is high
  • File system is full
  • Baseboard temperature is high
  • Specific messages found in syslog: NFS timeouts, drive timeouts …
Report Generator (MISCOMP Web Query Interface)
Next Milestone: From Prototype to Production System (for ~600 nodes)
• Goal 1: Gradually give the System Managers a Framework in which to develop and evolve tools that monitor their systems locally and send filtered information to the CSD operators.
• Goal 2: Make sure all production systems can be supported by NGOP (excluding Windows 2000 in the first phase).
Wish List: Improve the Production System
• Provide a Monitoring Client API
• Implement Looping Agents
• Implement historical rules and escalating alarms
• Implement a “snapshot” (“give me the updated system status now”) feature
• Provide Monitoring Agent API bindings for languages other than Python
• Fully Kerberize
• Provide standard Windows 2000 Monitoring Agents
• Design and provide dynamic handling of configuration changes for the Monitoring Client
• Allow for easier handling of multiple configurations
• Improve the Admin (Configuration Client) GUI
• Provide a Configuration GUI (hoping for a good free XML editor though)
• Provide a Performance Data Framework
• Redesign/rewrite the GUI (for scalability and friendliness)
• Provide a GUI for non-Linux platforms if really needed
• Work on scalability up to 10,000 hosts