Outline • Part I: Introduction (Pedro A.) • Part II: Technical Solutions (Massimo P., Benjamin F.) • Transport • Long Term Repository • Analytics • Visualization • Part III: Experience by Services (Stefano Z., Spyros L.) • OpenStack Monitoring • Batch LSF Monitoring
History 2012 ITTF slide • Motivation • Several independent monitoring activities in IT • similar overall approach, different tool-chains, similar limitations • High level services are interdependent • combination of data from different groups necessary, but difficult • Understanding performance became more important • requires more combined data and complex analysis • Move to a virtualized dynamic infrastructure • comes with complex new requirements on monitoring • Challenges • Find a shared architecture and tool-chain components while preserving our investment in monitoring
Architecture [Diagram: architecture building blocks: Producers, Transport, application Feeds, Long Term Repository, Analysis, Notifications, Visualization]
Strategy • Adopt open source tools • For each architecture block look outside for solutions • Large adoption and strong community support • Fast to adopt, test, and deliver • Easily replaceable by other (better) future solutions • Integrate with new CERN infrastructure • AI project, OpenStack, Puppet, Roger, etc. • Focus on simple adoption (e.g. puppet modules)
Technology Part II
Community • DB (web apps) • DSS (castor) • OIS (openstack) • PES (batch lsf) • SDC (wlcgmon.) • Sec. (netlog, snoopy) • CF (lemon, syslog) • Same technologies being used by different teams • HDFS: lemon, syslog, openstack, batch, security, castor • ES: lemon, syslog, openstack, batch • Kibana: lemon, syslog, openstack, batch
Part II Transport
Motivation • Scalable transport needed • Collect operations data • lemon metrics and syslog • 3rd party applications • Easy integration with providers/consumers • Apache Flume
Flume • Distributed service for collecting large amounts of data • Robust and fault-tolerant • Horizontally scalable • Many ready-to-be-used input/output plugins • Java-based • Apache license • Cloudera is the main contributor • Using their releases • Less frequent but more stable releases
Data Flow • Flume event • Byte payload + set of string headers • Flume agent • JVM process hosting “source -> channel -> sink” flow(s)
Sources and Sinks • Many ready-to-be-used plugins • Sources • Avro, Thrift, JMS, Spool dir, Syslog, HTTP, … • Custom sources can be easily implemented • we do have a dirq source for our use case • Interceptors • Decorate events, filter events
Sources and Sinks • Many ready-to-be-used plugins • Channels • Memory, File, JDBC • Custom channels can be easily implemented • Sinks • Avro, Thrift, ElasticSearch, Hadoop HDFS & HBase, Java Logger, IRC, File, Null • Custom sinks can be easily implemented
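To make the pieces concrete, below is a minimal, illustrative agent configuration wiring a syslog source to a memory channel and a logger sink. The agent and component names (a1, r1, c1, k1), port, and file path are only examples; the production flows use the custom dirq source and the ElasticSearch/HDFS sinks instead.

cat > /etc/flume-ng/conf/example.conf <<'EOF'
# One agent (a1) hosting a single source -> channel -> sink flow
a1.sources = r1
a1.channels = c1
a1.sinks = k1
# Syslog TCP source listening on port 5140
a1.sources.r1.type = syslogtcp
a1.sources.r1.host = 0.0.0.0
a1.sources.r1.port = 5140
a1.sources.r1.channels = c1
# In-memory channel buffering up to 10000 events
a1.channels.c1.type = memory
a1.channels.c1.capacity = 10000
# Logger sink for testing; a real flow would use an ElasticSearch or HDFS sink
a1.sinks.k1.type = logger
a1.sinks.k1.channel = c1
EOF
flume-ng agent --conf /etc/flume-ng/conf --conf-file /etc/flume-ng/conf/example.conf --name a1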
Other Features • Fan-in and fan-out • Enable load balancing • Contextual routing • Based on logic implemented through selectors • Multi-hops flows • Enable layered topologies • Increase reliability, failure resistance
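As an illustration of contextual routing, a multiplexing selector in the agent configuration can dispatch events to different channels based on a header value; the header name ("type") and the channel names below are purely illustrative.

# Route events to channel c1 or c2 depending on the value of the "type" header
a1.sources.r1.selector.type = multiplexing
a1.sources.r1.selector.header = type
a1.sources.r1.selector.mapping.lemon = c1
a1.sources.r1.selector.mapping.syslog = c2
a1.sources.r1.selector.default = c1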
Limitations • Routing is static • On-demand subscriptions are not possible • Requires reconfiguration and restart • No authentication/authorization features • Secure transport available • Java process on the client side • A smaller memory footprint would be nicer
Our Deployment • Producers • All Puppet nodes • Lemon, Syslog, 3rd party applications • Gateway routing layer • 10 VMs behind DNS load balancer • Elasticsearch sink • 5 VMs behind DNS load balancer • Inserting to ElasticSearch • Hadoop HDFS sink • 5 VMs behind DNS load balancer • Inserting to Hadoop HDFS
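A hedged sketch of how the two sink layers could be configured; the host names, index names, and HDFS path layout are assumptions for illustration, not the actual production settings.

# Sinks declared on the gateway/sink layers (illustrative names)
a1.sinks = es hdfs
# ElasticSearch sink (ES sink layer)
a1.sinks.es.type = org.apache.flume.sink.elasticsearch.ElasticSearchSink
a1.sinks.es.hostNames = es-data-01:9300,es-data-02:9300
a1.sinks.es.clusterName = itmon-es
a1.sinks.es.indexName = flume-lemon
a1.sinks.es.indexType = metric
a1.sinks.es.channel = c1
# HDFS sink (HDFS sink layer), hourly file roll, path bucketed by hostgroup and date
a1.sinks.hdfs.type = hdfs
a1.sinks.hdfs.hdfs.path = hdfs://namenode/monitoring/%{hostgroup}/%Y/%m/%d
a1.sinks.hdfs.hdfs.fileType = DataStream
a1.sinks.hdfs.hdfs.rollInterval = 3600
a1.sinks.hdfs.channel = c2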
Feedback • Needs tuning to correctly size flume layers • Available sources and sinks saved a lot of time
Long Term Repository Part II
Motivation • Store raw operations data • Long term archival required • Allow future data replay to other tools • Feed the real-time engine • Offline processing of collected data • Security data? Syslog data? • Apache Hadoop/HDFS
Apache Hadoop • Framework that allows the distributed processing of large data sets across clusters • HDFS is a distributed filesystem designed to run on commodity hardware • Suitable for applications with large data sets • Designed for batch processing rather than interactive use • High throughput preferred to low latency access
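Day-to-day interaction with HDFS goes through the hadoop/hdfs command line client; a few standard commands are shown below with purely illustrative paths.

# Create a directory and upload a local file
hdfs dfs -mkdir -p /monitoring/lemon/2013/11
hdfs dfs -put lemon-metrics.log /monitoring/lemon/2013/11/
# List the directory and check space usage
hdfs dfs -ls /monitoring/lemon/2013/11
hdfs dfs -du -h /monitoring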
Limitations • Small files not welcome • Block size of 64 MB or 128 MB • Limit of tens of millions of files per cluster • NameNode holds the file map in memory • Transparent compression not available • Raw text could take much less space • Real-time data access is not possible
Our Usage • Cluster provided by IT/DSS • ~500 TB, 13 data nodes • Data stored by hostgroup • Total of 1.8 TB since mid-July 2013 • Daily jobs to aggregate data by month • Large files preferred to many small files by HDFS
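A hedged sketch of what such a monthly aggregation could look like, concatenating and compressing one month of small per-day files into a single archive file; the directory layout and file names are assumptions, not the actual jobs.

# Merge and compress all July 2013 files into one large archive, then drop the originals
hdfs dfs -cat /monitoring/lemon/2013/07/*/* | gzip | hdfs dfs -put - /monitoring/lemon/archive/lemon-2013-07.log.gz
hdfs dfs -rm -r /monitoring/lemon/2013/07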
Part II Analytics
Motivation • Real-time queries, clear API • Limited data retention • Multiple scopes and technologies • Horizontally scalable and easy to deploy • ElasticSearch
ElasticSearch Distributed RESTful search and analytics engine
ElasticSearch Real time • Acquisition: data is indexed in real time • Analytics: explore, understand your data
ElasticSearch • Schema free • No prior data declaration required • but possible, to optimize • Data is injected as-is • Automatic data type discovery • Document oriented (JSON)
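For example, a document can be indexed with a single REST call and no prior mapping; the index, type, and field names here are only illustrative.

curl -XPOST 'http://localhost:9200/flume-lemon-2013-11-20/metric/' -d '{
  "timestamp": "2013-11-20T10:15:00Z",
  "host": "lxbatch042.cern.ch",
  "metric": "LoadAvg",
  "value": 3.2
}'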
ElasticSearch • Full text search • Apache Lucene is used to provide full text search (see the Apache Lucene documentation) • But not only text • integer/long • float/double • boolean • date • binary • ...
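A simple full text query through the same REST API, again with an illustrative index pattern and search terms.

curl -XGET 'http://localhost:9200/flume-syslog-*/_search?pretty' -d '{
  "query": { "query_string": { "query": "error AND host:lxbatch*" } },
  "size": 5
}'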
ElasticSearch • High availability • Shards and replicas auto balanced • RESTful JSON API

[root@es-search-node ~] $ curl -XGET http://localhost:9200/_cluster/health?pretty=true
{
  "cluster_name" : "itmon-es",
  "status" : "green",
  "timed_out" : false,
  "number_of_nodes" : 11,
  "number_of_data_nodes" : 8,
  "active_primary_shards" : 2990,
  "active_shards" : 8970,
  "relocating_shards" : 0,
  "initializing_shards" : 0,
  "unassigned_shards" : 0
}
ElasticSearch • Used by many large companies • SoundCloud • “To provide immediate and relevant results for their online audio distribution platform reaching 180 million people” • GitHub • “20TB of data using ElasticSearch, including 1.3 billion files and 130 billion lines of code” • Foursquare, Stack Overflow, Salesforce, ... • Distributed under the Apache license
Limitations • Requires a lot of RAM (Java) • Especially on data nodes • IO intensive • Take into account when planning the deployment • Shard re-initialisation takes some time (~1h) • Lots of shards and replicas per index, lots of indexes • Not a frequent operation, only after a full cluster reboot • Authentication not built-in (“bricolage”) • Apache+Shibboleth on top of the Jetty plugin
Our Deployment • Fully puppetized • Production cluster • 2 master nodes (no data) • 16 GB RAM, 8 CPU cores • 1 search node (no data) • 16 GB RAM, 8 CPU cores • 8 data nodes • 48 GB RAM, 24 CPU cores • 500 GB SSD • Development cluster • Based on medium and large VMs
Our Deployment • Security: Jetty plugin • Access control, SSL (also requests logging, Gzip) • Monitoring: many plugins • ElasticHQ, BigDesk, Head, Paramedic, ...
Our Deployment • 1 index per day • flume-lemon-YYYY-MM-DD • flume-syslog-YYYY-MM-DD • 90 days TTL • 10 shards per index • 2 replicas per shard
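These per-index settings could be applied automatically to every new daily index with an index template along the lines of the sketch below; the template name and the use of the _ttl mapping for the 90-day retention are assumptions about the implementation.

curl -XPUT 'http://localhost:9200/_template/flume-lemon' -d '{
  "template": "flume-lemon-*",
  "settings": { "number_of_shards": 10, "number_of_replicas": 2 },
  "mappings": {
    "_default_": { "_ttl": { "enabled": true, "default": "90d" } }
  }
}'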
Demo Production Cluster • ElasticHQ • HEAD
Feedback • Easy to deploy and manage • Robust, fast, and rich API • Easy query language (DSL) • More features coming with aggregation framework
Part II Visualisation
Motivation • Dedicated, dynamic and user-friendly dashboards • Horizontally scalable and easy to deploy • Kibana
Kibana Visualize time-stamped data from ElasticSearch
Kibana • “Make sense of a mountain of logs” • Designed to analyze logs • Perfectly fits timestamped data (e.g. metrics) • Profits from ElasticSearch power • Search/analyze features exploited
Kibana • No code required • Simply point & click to build your own dashboard
Kibana • Open source, community driven • Now fully integrated and supported by ElasticSearch • Code/feature contributions provided
Kibana • Built with AngularJS • JavaScript MVC framework for rich client-side applications • Developed and maintained by Google • No backend: the web server delivers only static files • JS queries ElasticSearch directly
Kibana • Easy to install • “git clone” OR “tar -xvzf” OR ElasticSearch plugin • Easy to configure • 1-line config file to point to the ElasticSearch cluster • Saves its own configuration in ElasticSearch itself • Possible to export/import dashboard configurations
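A hedged sketch of the standalone installation path (the deployment described on the next slide uses the ElasticSearch plugin instead); the repository URL, config.js location, and the test web server are illustrative.

# Fetch Kibana and point it at the ElasticSearch cluster
git clone https://github.com/elasticsearch/kibana.git
# Edit the single "elasticsearch:" line in config.js, e.g.
#   elasticsearch: "http://es-cluster.example.ch:9200",
# Any static web server can serve the files, e.g. for a quick test:
cd kibana/src && python -m SimpleHTTPServer 8000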
Our Deployment • Based on the ElasticSearch plugin • To benefit from Jetty authentication • Deployed together with the search node • Public (read-only) and private (read-write) endpoints
Demo • Production Dashboards • Syslog • Lemon • PDUs
Feedback • Easy to deploy and use • Cool user interface • Fits many use cases • Text (syslog), metrics (lemon) • Still limited feature set • Under active development • Very active community and growing
OpenStack Monitoring Part III