
ATLAS MonALISA-Based DA Monitoring & Its Scalability Tests

  1. ATLAS MonALISA-Based DA Monitoring & Its Scalability Tests DOSAR Meeting, Aug. 17, 2006 Jae Yu, for the UT Arlington Grid Team

  2. OUTLINE • What is MonALISA and why use it? • MonALISA Mechanism • MonALISA’s Interactive Clients • Methods of collecting information • MonALISA Scalability test results • Numerical value testing without MonALISA repository • Numerical and string values w/ repository • Resource usage measurements • Current status • Conclusions ATLAS Application Monitoring w/ MonALISA J. Yu, UTA

  3. What is MonALISA? • MonALISA: Monitoring Agents using a Large Integrated Services Architecture • Developed by the group led by Harvey Newman at Caltech, together with other groups in the global grid community • Developers are predominantly from CMS • Working closely with the grid community: OSG, LCG, EGEE, etc. • Supported by DOE and NSF grants, and possibly some European funding • Provides complete monitoring, control, and global optimization services for complex systems ATLAS Application Monitoring w/ MonALISA J. Yu, UTA

  4. Why MonALISA? • Has strong support from US funding agencies • Continued support from numerous institutions in the US and Europe • Provides all the resource monitoring needed for DA • Flexible and modular in adding monitoring parameters • Monitoring information is available for efficient brokering • Unified graphical grid monitoring across the experiments • CMS has implemented its monitoring system through BOSS (Batch Object Submission System) • BOSS is an interface to the local batch system that submits and monitors jobs in real time and stores the information in a local database • BOSS, to my mind, plays the same role as, for example, the Panda executor or Eowyn • Michael Thomas at Caltech is the lead developer of the monitoring system and is willing to work with us ATLAS Application Monitoring w/ MonALISA J. Yu, UTA

  5. MonALISA Mechanism • (Diagram: several hosts running the MonALISA service, each with monitoring modules and an ApMon, connected to lookup services at monalisa.caltech.edu and monalisa.cern.ch and accessed by interactive clients.) ATLAS Application Monitoring w/ MonALISA J. Yu, UTA

  6. MonALISA’s Interactive Client • Provides a graphical user interface for global monitoring of the resources ATLAS Application Monitoring w/ MonALISA J. Yu, UTA

  7. MonALISA: Monitoring Modules • (Diagram: the MonALISA service is fed by existing modules such as the PBS jobs module, the Hawkeye module, Ganglia modules, and OSG modules, as well as by ApMon.) • In addition to the already existing modules, one can obtain additional information through ApMon ATLAS Application Monitoring w/ MonALISA J. Yu, UTA

  8. ApMon (Application Monitoring Module) • ApMons act as information extractors and senders, passing information from applications to MonALISA services ATLAS Application Monitoring w/ MonALISA J. Yu, UTA
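As a concrete illustration (not from the slides), here is a minimal sketch of how an application might report values through the ApMon Python module distributed with MonALISA; the host, port, cluster, node, and parameter names are placeholders, and the exact constructor and method signatures should be checked against the ApMon documentation:

    import apmon

    # Point ApMon at the UDP listener of a MonALISA service (placeholder host:port)
    apm = apmon.ApMon(['monalisa-host.example.org:8884'])

    # Send a batch of numerical parameters for one node of one cluster
    apm.sendParameters('TestCluster_000', 'Node_000_1',
                       {'Parameter_0': 0.75, 'Parameter_1': 42})

    # Send a single string parameter
    apm.sendParameter('TestCluster_String_000', 'Node_0', 'Parameter_9', 'job running')

    apm.free()  # stop ApMon's background threads when done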

  9. MonALISA Repositories • A non-graphical MonALISA client • Stores the received parameter values in a local database for the long term • This database could reside within the cluster or within the region • It can be queried by other MonALISA service agents • Has a SOAP interface to the MonALISA server for retrieving any type of value, especially strings ATLAS Application Monitoring w/ MonALISA J. Yu, UTA

  10. Scalability Tests • MonALISA is designed to handle monitoring of large-scale, grid-based computing resources • Can it do the job in the ATLAS environment? • 1 million simultaneous jobs, each reporting 15 parameters every 15 minutes at random times? • Can MonALISA handle 1 million hits per minute? (See the back-of-the-envelope check below.) • Two tests • Without the repository and with only numerical values • With the repository and with both numerical and string values ATLAS Application Monitoring w/ MonALISA J. Yu, UTA
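A quick back-of-the-envelope check of that target rate (my own arithmetic, not from the slides):

    jobs = 1_000_000          # simultaneous jobs
    params_per_job = 15       # parameters reported by each job
    period_min = 15           # reporting period in minutes

    hits_per_min = jobs * params_per_job / period_min   # 1,000,000 hits per minute
    hits_per_sec = hits_per_min / 60                    # roughly 16,700 datagrams per second
    print(hits_per_min, hits_per_sec)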

  11. Test Configuration • (Diagram: ApMon-01 ... ApMon-N on PC1 send data to the MonALISA Service on PC2, which feeds the MonALISA Repository on PC3; JINI lookup services run at monalisa.caltech.edu / monalisa.cern.ch.) • Built a test cluster using our old farm nodes at UTA • Three dual P-III 900 MHz machines • PC1: simulates worker nodes and clusters where jobs run, sending numerical values only • PC2: simulates the MonALISA service client • PC3: runs the MonALISA Repository ATLAS Application Monitoring w/ MonALISA J. Yu, UTA

  12. Test w/ Numerical Parameters and w/o Repository • (Diagram: same setup as before, with interactive clients reading data from the MonALISA service.) • Gradually increased the number of ApMons running simultaneously, checking the CPU usage of each machine (PC1, PC2, and PC3) • The test stops when any one of these machines reaches its capacity • This defines the maximum scalability of MonALISA ATLAS Application Monitoring w/ MonALISA J. Yu, UTA

  13. Simulated Hit Rate in the Numerical Test • On PC1 (hepfm018) • Each ApMon program simulates a farm with 10 nodes; each node has 10 parameters, for a total of 100 parameters per farm • Every second, one datagram is sent for each parameter of each node • So a total of 100 datagrams are sent per second from each farm, represented by one ApMon (see the sketch below) • Activated 10 such ApMon programs → 10 farms • A total of 1000 datagrams sent per second • A hit rate of 1 kHz is simulated, 1/15 of the target rate ATLAS Application Monitoring w/ MonALISA J. Yu, UTA
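A sketch of what one such load-generating ApMon might look like (my own illustration, not the actual test code); the destination host and the naming scheme are placeholders, and the ApMon Python call signatures are assumed from the ApMon module shipped with MonALISA:

    import random
    import time
    import apmon

    # One simulated farm: 10 nodes x 10 parameters, one datagram per parameter
    # per node per second, i.e. about 100 datagrams/s from this ApMon.
    apm = apmon.ApMon(['pc2.example.org:8884'])   # placeholder MonALISA service host
    FARM = 'TestCluster_000'

    while True:
        for node in range(10):
            for param in range(10):
                apm.sendParameter(FARM, 'Node_000_%d' % node,
                                  'Parameter_%d' % param, random.random())
        time.sleep(1)   # next burst of 100 datagrams one second later

Running 10 copies of such a program gives the 1 kHz aggregate rate quoted above.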

  14. MonALISA Client Display • (Screenshots: the client selection box used for the numerical test and the time series of one parameter.) ATLAS Application Monitoring w/ MonALISA J. Yu, UTA

  15. CPU Usage from the Initial Test • Average CPU usage: • PC1 (ApMons): less than 50% on average over a 4-hour period • The expected load on a real worker node due to ApMon is about 100 times less than this, since this one machine carried the ApMon traffic of 100 simulated nodes • PC2 (ML Service): less than 5% on average over a 4-hour period • This depends heavily on the usage of the monitoring interface ATLAS Application Monitoring w/ MonALISA J. Yu, UTA

  16. Test w/ MonALISA Repository + String Values • Implemented the MonALISA Repository • To simulate the complete MonALISA configuration • Configured it to accept custom values (strings) • Configured the ML Repository’s web interface to display the time series of the numerical values generated by our ApMons • Implemented the MonALISA Web Service clients • Configured them to show the custom values stored in the MonALISA Repository ATLAS Application Monitoring w/ MonALISA J. Yu, UTA

  17. Example Time Series from ML Repository ATLAS Application Monitoring w/ MonALISA J. Yu, UTA

  18. Issues with MonALISA-based Monitoring • Issues • In what form do we need non-numerical values displayed? • How do we make the monitoring more selective? • Solutions suggested by the MonALISA team: • Use a command-line-based ML Web Service client: • It can interrogate the ML Web Service of the ML Repository • It can show both numerical and non-numerical values • It can show only the selected values we care about ATLAS Application Monitoring w/ MonALISA J. Yu, UTA

  19. Test parameters w/ ML Repository • An example test with the command-line client: • To see the values of Parameter_0 of Node_006_3 of TestCluster_006 over the past minute, we can run: • ./run_client "heptest017" "TestCluster_006" "Node_006_3" "Parameter_0" -60000 0 • Arguments: • farm name • cluster name • node name • parameter name • from time: in milliseconds; -60000 means 60 seconds ago • to time: in milliseconds; 0 means now • URL of the ML Web Service (optional): default is http://localhost:8080/axis/services/MLWebService ATLAS Application Monitoring w/ MonALISA J. Yu, UTA
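If such queries need to be scripted, one could wrap the command shown above; the following sketch is purely illustrative (the run_client call and argument order come from the slide, everything else is an assumption):

    import subprocess

    def query_repository(farm, cluster, node, parameter, from_ms=-60000, to_ms=0):
        # Call the command-line ML Web Service client described on this slide
        cmd = ['./run_client', farm, cluster, node, parameter, str(from_ms), str(to_ms)]
        result = subprocess.run(cmd, capture_output=True, text=True)
        return result.stdout

    # Last minute of Parameter_0 on Node_006_3 of TestCluster_006
    print(query_repository('heptest017', 'TestCluster_006', 'Node_006_3', 'Parameter_0'))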

  20. ATLAS Application Monitoring w/ MonALISA J. Yu, UTA

  21. String Value Test w/ ML Repository • The string parameter values look like: • “This is the 46 value of Parameter_9 on Node_0 of TestCluster_String_000” • A counter is embedded in the value to check for lost values (see the sketch below) • This test value is generated for each parameter of each node in each cluster and sent to the repository several times • We count how many times each value arrives to make sure no values were lost • Unlike in the numerical test, no values were lost • But we observed a duplicate display ATLAS Application Monitoring w/ MonALISA J. Yu, UTA
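A sketch of how such counter-tagged string values could be generated with ApMon (my own illustration, not the actual test code; the host and the call signatures are assumptions, while the string format follows the slide):

    import apmon

    apm = apmon.ApMon(['pc2.example.org:8884'])   # placeholder MonALISA service host

    # Embed a running counter in each string value so that lost or duplicated
    # values can be spotted when the repository contents are inspected.
    for count in range(100):
        for node in range(10):
            for param in range(10):
                value = ('This is the %d value of Parameter_%d on Node_%d '
                         'of TestCluster_String_000' % (count, param, node))
                apm.sendParameter('TestCluster_String_000', 'Node_%d' % node,
                                  'Parameter_%d' % param, value)

    apm.free()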

  22. ATLAS Application Monitoring w/ MonALISA J. Yu, UTA

  23. CPU Usage w/ MonALISA Repository • CPU usage in the string-value test: • PC1 (ApMon): less than 50% on average • A good level of load for worker nodes • PC2 (ML Client Services): less than 5% on average • Depends on the load of monitoring activities • PC3 (ML Repository): less than 30% on average • The level of CPU usage when a single repository serves a region of 10 farms of 10 nodes each ATLAS Application Monitoring w/ MonALISA J. Yu, UTA

  24. Testing the scalability extent of MonALISA • The first test setup had a total of 20 ApMons, run separately 10 at a time: • 10 ApMon programs generating numeric values • 10 ApMon programs generating string values • Added the following programs to generate values to be monitored by the service and added to the repository: • 10 more ApMon programs generating numeric values • 10 more ApMon programs generating string values • The new test runs the entire experiment with all 40 programs running concurrently and generating parameters ATLAS Application Monitoring w/ MonALISA J. Yu, UTA

  25. Scalability Test Setup • 40 copies of ApMon programs running concurrently. • Each ApMon copy simulates a cluster with 10 nodes. • Each node generates 10 numeric or string parameters. • There is a 10-ms interval between any two transfers of values of parameters from a given node. • In total, each cluster produces 200,000 datagrams. • In 1s, approximately 2000 datagrams are sent. ATLAS Application Monitoring w/ MonALISA J. Yu, UTA

  26. Comparison of two scalability tests ATLAS Application Monitoring w/ MonALISA J. Yu, UTA

  27. Results of Scalability Test • (Plot: CPU usage, as a percentage of a 900 MHz P-III CPU, for the machine with ApMons, the machine with the Repository, and the machine with the MonALISA server.) ATLAS Application Monitoring w/ MonALISA J. Yu, UTA

  28. Results of Scalability Test in SI95 • (Plot: CPU usage expressed in SI95 units for the machine with ApMons, the machine with the Repository, and the machine with the MonALISA server.) ATLAS Application Monitoring w/ MonALISA J. Yu, UTA

  29. ATLAS DA Dashboard • LCG sites report to one MonALISA service and one repository • CERN colleagues have implemented an ATLAS DA dashboard • OSG sites are different • Extremely democratic: each site has its own MonALISA server and repository • An ApMon is developed for each job to report to the MonALISA server • The MonALISA server responds when prompted by the dashboard • A design for ATLAS OSG sites has been proposed ATLAS Application Monitoring w/ MonALISA J. Yu, UTA

  30. Proposed MonALISA-Based Panda Monitoring System • (Diagram of the proposed system, with components: client side (pilot, job, scheduler); MonALISA logging mechanism (ApMons); logging mechanism (file, HTTP); MonALISA service; MonALISA repository or Web Service client; dashboard DB; dashboard.) ATLAS Application Monitoring w/ MonALISA J. Yu, UTA

  31. ATLAS DA Dashboard (PanDashboard?) • A test setup is being implemented at DPCC@UTA • The two machines previously used for the initial testing are being prepared to act as the ATLAS DA MonALISA server and repository • A node at DPCC has been taken out to serve as a Panda resource and run the ApMon • The node needs to advertise itself to Panda • The goal is to complete the initial implementation of the OSG dashboard by the September software workshop • This would complete the first integrated ATLAS DA dashboard ATLAS Application Monitoring w/ MonALISA J. Yu, UTA

  32. Conclusions • This effort resulted from the DOSAR workshop at SPRACE, in collaboration with CMS colleagues • MonALISA has been chosen as a complementary tool for ATLAS DA monitoring • The scalability tests demonstrate no rate issues • The display of string values still needs to be worked out • Submitted two documents • Feasibility of MonALISA-based application monitoring • Scalability tests • A MonALISA-based DA dashboard has been implemented for LCG • Implementation on OSG is in progress • We have been working very closely with the CERN and BNL teams ATLAS Application Monitoring w/ MonALISA J. Yu, UTA
