Agile Infrastructure Update Monitoring. Pedro Andrade – IT/GT, 6th July 2012, IT Technical Forum. Overview: Introduction (Motivation, Challenge, Architecture); Status Update (Producers, Messaging, Storage/Analysis, Visualization); Milestones and Next Steps; Summary and Conclusions.
Agile Infrastructure Update
Monitoring
Pedro Andrade – IT/GT
6th July 2012
IT Technical Forum
Overview
• Introduction
  • Motivation, Challenge, Architecture
• Status Update
  • Producers, Messaging, Storage/Analysis, Visualization
• Milestones and Next Steps
• Summary and Conclusions
Introduction
• Motivation
  • Several independent monitoring activities in IT
    • Based on different tool-chains but sharing the same limitations
  • High level services are interdependent
    • Combination of data and complex analysis necessary
• Challenge
  • Find a shared architecture and tool-chain components
  • Adopt existing tools and avoid home grown solutions
  • Aggregate monitoring data in a large data store
  • Correlate monitoring data and make it easy to access
Architecture
[Diagram labels: Portal, Report, Alarm Portal, Splunk, Storage/Analysis, Hadoop, Application Specific, Oracle, Analysis/Storage Feed, Alarm Feed, Custom Feed, Aggregation, Apollo, Publisher, Sensor, Lemon]
Status Update
[Diagram labels: Castor Cockpit, SNOW, Splunk, Quattor + Puppet, Hadoop, Castor Hdp. Consumer, Castor Logs Consumer, Lemon Snow Consumer, Lemon Spk. Consumer, Apollo, Security Netlog, Castor Logs Producer, Lemon Producer]
Producers
• Lemon Producer implemented and tested
  • Lemon Agent + Lemon Forwarder
  • Supports publication of notifications and metrics
  • Retrieves local metadata via Puppet (notification targets)
  • Tested on approximately 500 Quattor nodes
  • Mocked publication of notifications from all Quattor nodes
• Castor Log Producer implemented and tested
  • Publishes parsed Castor logs to the messaging broker
  • Generic producer supporting different input sources
  • Tested on 30 development nodes
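As an illustration of what a producer publication could look like on the wire, here is a minimal sketch that wraps a notification in a STOMP 1.1 SEND frame. The JSON body, the field names, the host name and the destination are all hypothetical; the talk does not specify the message format, and the real producers use the CERN messaging libraries rather than hand-built frames.

```python
import json


def build_stomp_send(destination, body, content_type="application/json"):
    """Build a STOMP 1.1 SEND frame as bytes.

    Note: header values are not escaped here, so they must not contain
    ':' , newlines or backslashes (fine for this illustration).
    """
    payload = body.encode("utf-8")
    header_lines = [
        "SEND",
        f"destination:{destination}",
        f"content-type:{content_type}",
        f"content-length:{len(payload)}",
    ]
    # Frame layout: command, headers, blank line, body, NUL terminator.
    return ("\n".join(header_lines) + "\n\n").encode("utf-8") + payload + b"\x00"


# Hypothetical notification; none of these field names come from the talk.
notification = {
    "host": "node001.example.org",
    "metric": "exception.high_load",
    "state": "raised",
    "target": "FE-example",
}

frame = build_stomp_send(
    "/topic/monitoring.notifications",  # hypothetical destination
    json.dumps(notification, sort_keys=True),
)
```

Carrying the notification target inside the message is one way the metadata retrieved on the node could travel with the notification to the consumers.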
Messaging Broker
• Apollo broker deployed and tested
  • A few initial problems found, service running smoothly now
  • Our deployment scenario hit a bug (fast feedback)
• Currently exploited by two monitoring apps
  • Lemon: 5 msg/sec, avg size of 3 KB (500 hosts)
  • Castor Logs: 120 msg/sec, avg size of 11 KB (30 hosts)
• Producers and consumers using CERN msg tools
  • mig-admin-utils-stompclt, messaging (python, perl)
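The message rates and sizes quoted above translate into broker load as follows. This is a rough worked calculation, assuming the quoted rates are totals across all hosts of each application and that KB and GB are decimal units; neither assumption is stated on the slide.

```python
# Rates and sizes are the figures quoted on the slide; the rest is assumed.
SECONDS_PER_DAY = 24 * 60 * 60
KB_PER_GB = 1_000_000  # assumption: decimal units (1 GB = 10^6 KB)

streams = {
    "Lemon": {"msg_per_sec": 5, "avg_size_kb": 3},
    "Castor Logs": {"msg_per_sec": 120, "avg_size_kb": 11},
}


def throughput_kb_per_sec(stream):
    """Average payload throughput: message rate times average size."""
    return stream["msg_per_sec"] * stream["avg_size_kb"]


def daily_volume_gb(stream):
    """Payload volume accumulated over one day at the average rate."""
    return throughput_kb_per_sec(stream) * SECONDS_PER_DAY / KB_PER_GB


for name, stream in streams.items():
    print(f"{name}: {throughput_kb_per_sec(stream)} KB/s, "
          f"{daily_volume_gb(stream):.1f} GB/day")
```

Under these assumptions Lemon contributes about 15 KB/s (roughly 1.3 GB per day) and Castor Logs about 1320 KB/s (roughly 114 GB per day), so the 30 Castor nodes dominate the broker traffic despite the much smaller host count.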
Storage and Analysis
• Small Hadoop cluster deployed and tested
  • Being upgraded to latest CDH v4
• Currently exploited by two monitoring apps
  • Security Netlog
    • Data being imported directly to Hadoop via Fuse
    • Goal is to run analysis on network activity
  • Castor Logs
    • Development of messaging consumer to Hadoop just started
• Testing Hadoop components: HBase and Flume
  • Hive and Sqoop?
Visualization and Alarms
• Splunk deployed and tested
  • So far only exploited by Lemon
  • Lemon notifications continuously added to Splunk using the Lemon Splunk consumer
  • Lemon metrics manually imported: 1.5 years of data with 1335 metrics for all CC nodes (8.5 TB)
  • Splunk playground available: https://lxfssm4508.cern.ch
• SNOW integration implemented and tested
  • Lemon notifications delivered as SNOW tickets assigned to the correct target (FE) using the Lemon SNOW consumer
  • Ticket routing being improved
Milestones
Next Steps
[Diagram labels: Castor Cockpit, Snow, Splunk, Correlation Engine, Dashboards, Hadoop, Generic Consumer, Tests for Production, Castor Hdp. Consumer, Hadoop Consumer, Castor Logs Consumer, Lemon Snow Consumer, Lemon Spk. Consumer, Apollo, Security Netlog, Castor Logs Producer, Lemon Producer, Other Producers]
Summary
• All layers of the proposed monitoring architecture successfully tested with an initial set of tools
  • Apollo, Hadoop, Splunk deployed and tested
  • New components implemented and tested (messaging)
  • Partial functional and scalability tests
• Several concrete results with real data achieved
  • Castor logs data aggregation via Apollo
  • Lemon notifications aggregation via Apollo
  • Lemon notifications visualization in Splunk
  • Security netlog data stored in Hadoop
  • Notifications mechanism tested with Lemon data
Summary
• Base monitoring for AI nodes ongoing
  • Eating our own dog food
• Several contacts established with other teams
  • Different IT groups attending monitoring meetings
  • BE and GS: discussion of similar projects
  • LHCb and ATLAS online teams: sharing experiences
  • Crucial for uptake by users
• Other monitoring applications are welcome to join
  • More use cases, more data, (more) correlation
Conclusion
• Work progressing as planned
  • Core components of the architecture in place
  • Ready to be used and evaluated
• AI Monitoring needs YOU!
  • Move (part of) your monitoring apps
Thank You!
Introduction
• Motivation
  • Several independent monitoring activities in IT
    • Similar overall approach, different tool-chains, similar limitations
  • High level services are interdependent
    • Combination of data from different groups necessary, but difficult
  • Understanding performance became more important
    • Requires more combined data and complex analysis
  • Move to a virtualized dynamic infrastructure
    • Comes with complex new requirements on monitoring
• Challenge
  • Find a shared architecture and tool-chain components while preserving/improving our investment in monitoring
Monitoring in IT
• More than 30 monitoring applications
  • Number of producers: ~40k
  • Input data volume: ~280 GB per day
• Covering a wide range of different resources
  • Hardware, OS, applications, files, jobs, etc.
• Application-specific monitoring solutions
  • Using different technologies (including commercial tools)
  • Sharing similar needs: aggregate metrics, get alarms, etc.
• Limited sharing of monitoring data
Architecture
• Data
  • Aggregate monitoring data in a large data store
    • For storage and combined analysis tasks
  • Make monitoring data easy to access by everyone
    • Not forgetting possible security constraints
  • Select a simple and well supported data format
• Technology
  • Follow a tool chain approach
    • Each tool can be easily replaced by a better one
  • Select well established solutions
    • Adopt existing tools and avoid home grown solutions
  • Allow a phased transition to the new architecture
Work Summary
• Tested the tools initially selected
  • And the data workflow between the tools
• Worked with concrete monitoring apps and data
  • Lemon (CF), Castor Logs (DSS), Net Logger (DI)
• Defined and tested different monitoring paths
  • Notifications vs. Analysis
• Supported two different environments
  • Puppet nodes to be ready for new AI infrastructure
  • Quattor nodes to run large scale tests
Producers and Sensors
[Diagram labels: Visualization, Analysis, Storage, Sensor, Messaging, Integrated Product]
• Monitoring data generated by all resources
  • Monitoring metadata available at the node
  • Published to messaging using common libraries
  • May also be produced as a result of pre-aggregation or post-processing tasks
• Support and integrate closed monitoring solutions
  • By injecting final results into the messaging layer or exporting relevant data at an intermediate stage
Messaging Broker
• Monitoring data transported via messaging
  • Provide a network of messaging brokers
  • Support for multiple configurations
• The needs of each monitoring application must be clearly analyzed and defined
  • Total number of producers and consumers
  • Size of the monitoring message
  • Rate of the monitoring message
• Realistic testing environments are required to produce reliable performance numbers
• First tests with Apollo (ActiveMQ)
  • Prior positive experience in IT and the experiments
Storage and Analysis
• Monitoring data stored in a common location
  • Eases the sharing of monitoring data and analysis tools
  • Allows already processed data to be fed into the system
• NoSQL technologies are the most suitable solutions
  • Focus on column/tabular solutions
• First tests with Hadoop (Cloudera distribution)
  • Prior positive experience in IT and the experiments
  • Map-reduce paradigm is a good match for the use cases
  • Has been used successfully at scale
  • Several related modules available (Hive, HBase)
• For particular use cases a parallel relational database solution (Oracle) can be considered
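To make the map-reduce point concrete, here is a toy in-memory sketch of the paradigm applied to parsed log records: the map step emits a key per record, the shuffle step groups values by key, and the reduce step sums them. It is plain Python rather than Hadoop code, and the record fields (daemon, level) are hypothetical, not the actual Castor log schema.

```python
from collections import defaultdict


def map_record(record):
    """Map step: emit one ((daemon, level), 1) pair per log record."""
    yield (record["daemon"], record["level"]), 1


def shuffle(pairs):
    """Shuffle step: group all emitted values by key."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_counts(key, values):
    """Reduce step: sum the values collected for one key."""
    return key, sum(values)


def run_job(records):
    """Run map, shuffle and reduce in sequence over an iterable of records."""
    pairs = (pair for record in records for pair in map_record(record))
    return dict(reduce_counts(key, values) for key, values in shuffle(pairs).items())


# Hypothetical parsed log records.
sample_logs = [
    {"daemon": "stagerd", "level": "Error"},
    {"daemon": "stagerd", "level": "Info"},
    {"daemon": "rhd", "level": "Error"},
    {"daemon": "stagerd", "level": "Error"},
]

counts = run_job(sample_logs)
```

Because each record is mapped independently and each key is reduced independently, the same job shape can be split across many nodes, which is what makes the paradigm a natural fit for aggregating large volumes of log data.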
Notifications and Dashboards
• Provide an efficient delivery of notifications
  • Notifications directly sent to correct consumer targets
  • Possible targets: operators, service managers, etc.
• Provide powerful dashboards and APIs
  • Complex queries on cross-domain monitoring data
• First tests with Splunk
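The idea of sending each notification directly to its consumer targets can be sketched as a small dispatcher. This is only an illustration under stated assumptions: the `Router` class, the `targets` field and the target names are hypothetical, and the real system routes through the messaging broker and dedicated consumers rather than in-process callbacks.

```python
from collections import defaultdict


class Router:
    """Toy in-process router: delivers a notification to every consumer
    subscribed to one of the targets listed in the notification."""

    def __init__(self):
        self._consumers = defaultdict(list)

    def subscribe(self, target, consumer):
        """Register a callable to receive notifications for a target."""
        self._consumers[target].append(consumer)

    def dispatch(self, notification):
        """Deliver the notification; return the number of deliveries made."""
        delivered = 0
        for target in notification.get("targets", []):
            for consumer in self._consumers.get(target, []):
                consumer(notification)
                delivered += 1
        return delivered


# Hypothetical targets and consumers that simply record what they receive.
operator_inbox = []
manager_inbox = []

router = Router()
router.subscribe("operators", operator_inbox.append)
router.subscribe("service_managers", manager_inbox.append)

alarm = {"host": "node001.example.org", "state": "raised", "targets": ["operators"]}
deliveries = router.dispatch(alarm)
```

Keeping the target list inside the notification means the router needs no per-host configuration: whoever produces the notification decides who should see it.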
Links
• Monitoring WG TWiki (new location!)
  • https://twiki.cern.ch/twiki/bin/view/MonitoringWG/
• Monitoring WG Report (ongoing)
  • https://twiki.cern.ch/twiki/bin/view/MonitoringWG/MonitoringReport
• Agile Infrastructure TWiki
  • https://twiki.cern.ch/twiki/bin/view/AgileInfrastructure/
• Agile Infrastructure JIRA
  • https://agileinf.cern.ch/jira/browse/AI
Next Steps
[Diagram labels: Castor Cockpit, Snow, Splunk, Lemon, Correlation Engine, Hadoop Cluster, Test in Production, Generic Consumer, Castor Hdp. Consumer, Castor Logs Consumer, Lemon Snow Consumer, Lemon Spk. Consumer, Apollo, Security Net Logger, Castor Logs Producer, Lemon Forwarder, Lemon Agent, Other Producers]