
Outline


nedra

Presentation Transcript


  1. Outline • Part I: Introduction (Pedro A.) • Part II: Technical Solutions (Massimo P., Benjamin F.) • Transport • Long Term Repository • Analytics • Visualization • Part III: Experience by Services (Stefano Z., Spyros L.) • OpenStack Monitoring • Batch LSF Monitoring

  2. History 2012 ITTF slide • Motivation • Several independent monitoring activities in IT • similar overall approach, different tool-chains, similar limitations • High level services are interdependent • combination of data from different groups necessary, but difficult • Understanding performance became more important • requires more combined data and complex analysis • Move to a virtualized dynamic infrastructure • comes with complex new requirements on monitoring • Challenges • Find a shared architecture and tool-chain components while preserving our investment in monitoring

  3. Architecture [diagram; blocks shown: Producers, Transport, Feeds, Long Term Repository, Analysis, Notifications, Application, Visualization]

  4. Strategy • Adopt open source tools • For each architecture block look outside for solutions • Large adoption and strong community support • Fast to adopt, test, and deliver • Easily replaceable by other (better) future solutions • Integrate with new CERN infrastructure • AI project, OpenStack, Puppet, Roger, etc. • Focus on simple adoption (e.g. puppet modules)

  5. Technology Part II

  6. Community (Part III) • DB (web apps) • DSS (castor) • OIS (openstack) • PES (batch lsf) • SDC (wlcgmon.) • Sec. (netlog, snoopy) • CF (lemon, syslog) • Same technologies being used by different teams • HDFS: lemon, syslog, openstack, batch, security, castor • ES: lemon, syslog, openstack, batch • Kibana: lemon, syslog, openstack, batch

  7. Part II Transport

  8. Motivation • Scalable transport needed • Collect operations data • lemon metrics and syslog • 3rd party applications • Easy integration with providers/consumers • Apache Flume

  9. Flume • Distributed service for collecting large amounts of data • Robust and fault tolerant • Horizontally scalable • Many ready to be used input/output plugins • Java based • Apache license • Cloudera is the main contributor • Using their releases • Less frequent but more stable releases

  10. Data Flow • Flume event • Byte payload + set of string headers • Flume agent • JVM process hosting “source -> sink” flow(s)
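
The event and agent model on this slide can be illustrated with a minimal Python sketch. This is not Flume code (Flume is Java based) and all class and variable names are hypothetical. The in-memory queue stands in for a channel, the component that sits between a source and a sink; the header keys `type` and `host` are assumptions made for the example.

```python
from dataclasses import dataclass, field
from queue import Queue
from typing import Callable, Dict, Iterable, List


@dataclass
class Event:
    """A Flume-style event: an opaque byte payload plus string headers."""
    body: bytes
    headers: Dict[str, str] = field(default_factory=dict)


class Agent:
    """Toy stand-in for an agent process hosting one source -> sink flow."""

    def __init__(self, source: Iterable[Event], sink: Callable[[Event], None]):
        self.source = source
        self.channel: "Queue[Event]" = Queue()  # in-memory buffer between source and sink
        self.sink = sink

    def run(self) -> int:
        # The source side puts every incoming event on the channel.
        for event in self.source:
            self.channel.put(event)
        # The sink side drains the channel and delivers each event.
        delivered = 0
        while not self.channel.empty():
            self.sink(self.channel.get())
            delivered += 1
        return delivered


# Hypothetical events: header keys and values are illustrative only.
collected: List[Event] = []
source = [
    Event(b"load=0.42", {"type": "lemon", "host": "node01"}),
    Event(b"sshd: session opened", {"type": "syslog", "host": "node02"}),
]
agent = Agent(source, collected.append)
delivered = agent.run()
```

The payload stays opaque bytes throughout; only the headers are meant to be inspected by the transport, which is what makes header-based routing possible.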

  11. Sources and Sinks • Many ready-to-be-used plugins • Sources • Avro, Thrift, JMS, Spool dir, Syslog, HTTP, … • Custom sources can be easily implemented • we do have a dirq source for our use case • Interceptors • Decorate events, filter events
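
As a rough illustration of what "decorate events, filter events" means, the sketch below models an interceptor as a function that returns either a (possibly modified) event or `None` to drop it. It is plain Python rather than the Flume Java API, and every name in it (`add_header`, `drop_if_body_contains`, `apply_chain`, the `site` header) is hypothetical.

```python
from typing import Callable, Dict, List, Optional, Tuple

Event = Tuple[Dict[str, str], bytes]  # (headers, body)
Interceptor = Callable[[Event], Optional[Event]]


def add_header(key: str, value: str) -> Interceptor:
    """Decorating interceptor: returns a copy of the event with one extra header."""
    def intercept(event: Event) -> Optional[Event]:
        headers, body = event
        return ({**headers, key: value}, body)
    return intercept


def drop_if_body_contains(needle: bytes) -> Interceptor:
    """Filtering interceptor: drops events whose payload contains `needle`."""
    def intercept(event: Event) -> Optional[Event]:
        _headers, body = event
        return None if needle in body else event
    return intercept


def apply_chain(events: List[Event], chain: List[Interceptor]) -> List[Event]:
    """Run each event through the interceptors in order; None means 'dropped'."""
    kept: List[Event] = []
    for event in events:
        current: Optional[Event] = event
        for interceptor in chain:
            current = interceptor(current)
            if current is None:
                break
        if current is not None:
            kept.append(current)
    return kept


events: List[Event] = [({}, b"disk ok"), ({}, b"DEBUG noisy message")]
result = apply_chain(events, [drop_if_body_contains(b"DEBUG"), add_header("site", "example")])
```

Ordering matters in such a chain: filtering first avoids decorating events that are about to be dropped.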

  12. Sources and Sinks • Many ready-to-be-used plugins • Channels • Memory, File, JDBC • Custom channels can be easily implemented • Sinks • Avro, Thrift, ElasticSearch, Hadoop HDFS & HBase, Java Logger, IRC, File, Null • Custom sinks can be easily implemented

  13. Other Features • Fan-in and fan-out • Enable load balancing • Contextual routing • Based on logic implemented through selectors • Multi-hops flows • Enable layered topologies • Increase reliability, failure resistance
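
Contextual routing and fan-out can be sketched as a lookup from a header value to one or more destination channels. The Python below only illustrates the idea and is not Flume configuration; the header name `type`, the channel names `es` and `hdfs`, and the mapping itself are assumptions chosen for the example, not the deployment described in this talk.

```python
from collections import defaultdict
from typing import Dict, List

Event = Dict[str, object]  # {"headers": {...}, "body": b"..."}


def select_channels(event: Event, header: str,
                    mapping: Dict[str, List[str]], default: List[str]) -> List[str]:
    """Pick destination channels from the value of one header (contextual routing)."""
    value = event["headers"].get(header)
    return mapping.get(value, default)


def route(events: List[Event], header: str,
          mapping: Dict[str, List[str]], default: List[str]) -> Dict[str, List[Event]]:
    """Copy each event to every selected channel (fan-out when more than one)."""
    channels: Dict[str, List[Event]] = defaultdict(list)
    for event in events:
        for name in select_channels(event, header, mapping, default):
            channels[name].append(event)
    return dict(channels)


# Hypothetical routing table: metrics and logs go to both stores, security data to HDFS only.
MAPPING = {"lemon": ["es", "hdfs"], "syslog": ["es", "hdfs"], "security": ["hdfs"]}

events = [
    {"headers": {"type": "lemon"}, "body": b"load=0.42"},
    {"headers": {"type": "security"}, "body": b"netlog record"},
    {"headers": {}, "body": b"untyped"},
]
routed = route(events, "type", MAPPING, default=["hdfs"])
```

Because the table is fixed when the flow is defined, adding a new consumer means changing the mapping and restarting, which matches the "routing is static" limitation noted on the next slide.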

  14. Limitations • Routing is static • On demand subscriptions are not possible • Requires reconfiguration and restart • No authentication/authorization features • Secure transport available • Java process on client side • Small memory footprint would be nicer

  15. Our Deployment

  16. Our Deployment • Producers • All Puppet nodes • Lemon, Syslog, 3rd party applications • Gateway routing layer • 10 VMs behind DNS load balancer • Elasticsearch sink • 5 VMs behind DNS load balancer • Inserting to ElasticSearch • Hadoop HDFS sink • 5 VMs behind DNS load balancer • Inserting to Hadoop HDFS

  17. Feedback • Needs tuning to correctly size flume layers • Available sources and sinks saved a lot of time

  18. Long Term Repository Part II

  19. Motivation • Store operations raw data • Long term archival required • Allow future data replay to other tools • Feed real-time engine • Offline processing of collected data • Security data? Syslog data? • Apache Hadoop/HDFS

  20. Apache Hadoop • Framework that allows the distributed processing of large data sets across clusters • HDFS is a distributed filesystem designed to run on commodity hardware • Suitable for applications with large data sets • Designed for batch processing rather than interactive use • High throughput preferred to low latency access

  21. Limitations • Small files not welcome • Blocks of 64 MB or 128 MB • Limit of tens of millions of files per cluster • Namenode holds the file map in memory • Transparent compression not available • Raw text could take much less space • Real-time data access is not possible
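
The small-files limitation comes down to arithmetic: the namenode tracks every file and every block in memory, so the same volume of data costs far more metadata when it is split into small files. The sketch below counts those objects for a given file size. The 1 TiB volume and the file sizes are illustrative assumptions, and the function name is hypothetical.

```python
import math

MIB = 1024 ** 2
TIB = 1024 ** 4


def namenode_objects(total_bytes: int, file_bytes: int, block_bytes: int = 128 * MIB) -> int:
    """Count file entries plus block entries needed to hold total_bytes
    when the data is stored as equally sized files of file_bytes each."""
    files = math.ceil(total_bytes / file_bytes)
    blocks_per_file = max(1, math.ceil(file_bytes / block_bytes))
    return files + files * blocks_per_file


small = namenode_objects(TIB, 1 * MIB)      # many 1 MiB files
aligned = namenode_objects(TIB, 128 * MIB)  # files that fill exactly one block
ratio = small // aligned
```

Under these assumptions, storing the data as 1 MiB files needs 128 times as many namenode objects as storing it in block-sized files, which is why aggregating small files pays off.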

  22. Our Usage • Cluster provided by IT/DSS • ~500 TB, 13 data nodes • Data stored by hostgroup • Total 1.8 TB since mid July 2013 • Daily jobs to aggregate data by month • Large files preferred to many small files by HDFS
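
The aggregation step can be pictured as grouping daily files by hostgroup and month, then merging each group into one larger file. The sketch below only builds such a merge plan in memory; the path layout `hostgroup/YYYY-MM-DD.log`, the hostgroup names, and the function names are assumptions, since the slides do not describe the actual layout or job implementation.

```python
from collections import defaultdict
from datetime import date
from typing import Dict, Iterable, List, Tuple


def monthly_target(hostgroup: str, day: date) -> str:
    """Name of the (hypothetical) monthly file a daily file is merged into."""
    return f"{hostgroup}/{day:%Y-%m}.log"


def plan_monthly_merge(daily_files: Iterable[Tuple[str, date]]) -> Dict[str, List[str]]:
    """Map each monthly target to the sorted list of daily files it absorbs."""
    plan: Dict[str, List[str]] = defaultdict(list)
    for hostgroup, day in daily_files:
        plan[monthly_target(hostgroup, day)].append(f"{hostgroup}/{day.isoformat()}.log")
    return {target: sorted(sources) for target, sources in plan.items()}


# Hypothetical hostgroups and dates, for illustration only.
daily = [
    ("batch", date(2013, 7, 30)),
    ("batch", date(2013, 7, 31)),
    ("batch", date(2013, 8, 1)),
    ("cloud", date(2013, 7, 31)),
]
plan = plan_monthly_merge(daily)
```

Each key in `plan` would become one large file, replacing up to 31 small ones per hostgroup and month.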

  23. Part II Analytics

  24. Motivation • Real-time queries, clear API • Limited data retention • Multiple scopes technologies • Horizontally scalable and easy to deploy • ElasticSearch

  25. ElasticSearch Distributed RESTful search and analytics engine

  26. ElasticSearch Real time • Acquisition: data is indexed in real time • Analytics: explore, understand your data

  27. ElasticSearch Schema free • No prior data declaration required • but possible, to optimize • Data is injected as-is • Automatic data type discovery Document oriented (JSON)

  28. ElasticSearch • Full text search • Apache Lucene is used to provide full text search • lucene apache documentation • But not only text • integer/long • float/double • boolean • date • binary • ...

  29. ElasticSearch • High availability • Shards and replicas auto balanced • RESTful JSON API
      [root@es-search-node ~] $ curl -XGET http://localhost:9200/_cluster/health?pretty=true
      {
        "cluster_name" : "itmon-es",
        "status" : "green",
        "timed_out" : false,
        "number_of_nodes" : 11,
        "number_of_data_nodes" : 8,
        "active_primary_shards" : 2990,
        "active_shards" : 8970,
        "relocating_shards" : 0,
        "initializing_shards" : 0,
        "unassigned_shards" : 0
      }
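
The health response above is ordinary JSON, so a monitoring script can interpret it directly. The sketch below parses the values shown on the slide and derives two facts from them: whether the cluster is fully allocated, and how many replicas each primary shard has. The variable names are hypothetical, and the response is embedded as a string rather than fetched over HTTP.

```python
import json

# Response body copied from the slide above.
HEALTH_JSON = """
{
  "cluster_name": "itmon-es",
  "status": "green",
  "timed_out": false,
  "number_of_nodes": 11,
  "number_of_data_nodes": 8,
  "active_primary_shards": 2990,
  "active_shards": 8970,
  "relocating_shards": 0,
  "initializing_shards": 0,
  "unassigned_shards": 0
}
"""

health = json.loads(HEALTH_JSON)

# "green" with no unassigned shards means every primary and replica is allocated.
is_healthy = health["status"] == "green" and health["unassigned_shards"] == 0

# active_shards counts primaries plus replicas, so the quotient is copies per shard.
copies_per_shard, remainder = divmod(health["active_shards"], health["active_primary_shards"])
replicas_per_shard = copies_per_shard - 1

# Nodes that hold no data (master-only and search-only nodes).
non_data_nodes = health["number_of_nodes"] - health["number_of_data_nodes"]
```

With the numbers on the slide, each primary shard has three copies in total, that is, two replicas, and three of the eleven nodes hold no data.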

  30. ElasticSearch • Used by many large companies • Soundcloud • “To provide immediate and relevant results for their online audio distribution platform reaching 180 million people” • Github • “20TB of data using ElasticSearch, including 1.3 billion files and 130 billion lines of code” • Foursquare, Stackoverflow, Salesforce, ... • Distributed under Apache license

  31. Limitations • Requires a lot of RAM (Java) • Especially on data nodes • IO intensive • Take into account when planning deployment • Shards re-initialisation takes some time (~1h) • Lots of shards and replicas per index, lots of indexes • Not frequent operation, only after full cluster reboot • Authentication not built-in (“bricolage”) • Apache+Shibboleth on top of Jetty plugin

  32. Our Deployment • Fully puppetized • Production cluster • 2 master nodes (no data) • 16GB RAM, 8 cores CPU • 1 search node (no data) • 16GB RAM, 8 cores CPU • 8 data nodes • 48GB RAM, 24 cores CPU • 500GB SSD • Development cluster • Based on medium and large VMs

  33. Our Deployment • Security: Jetty plugin • Access control, SSL (also requests logging, Gzip) • Monitoring: many plugins • ElasticHQ, BigDesk, Head, Paramedic, ...

  34. Our Deployment • 1 index per day • flume-lemon-YYYY-MM-DD • flume-syslog-YYYY-MM-DD • 90 days TTL • 10 shards per index • 2 replicas per shard
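
The index layout on this slide can be sketched in a few lines of Python. The naming pattern, shard count, replica count, and 90-day retention come from the slide; `number_of_shards` and `number_of_replicas` are the standard Elasticsearch index settings. Treating the TTL as "drop whole daily indexes older than 90 days" is an assumption, as the slides do not say how retention is enforced, and the function names are hypothetical.

```python
from datetime import date, timedelta
from typing import Dict, List

RETENTION_DAYS = 90

# Settings body one could send when creating a daily index.
INDEX_SETTINGS: Dict[str, Dict[str, int]] = {
    "settings": {"number_of_shards": 10, "number_of_replicas": 2}
}


def index_name(feed: str, day: date) -> str:
    """Build the daily index name, e.g. flume-lemon-YYYY-MM-DD."""
    return f"flume-{feed}-{day:%Y-%m-%d}"


def index_day(name: str) -> date:
    """Recover the day from the trailing YYYY-MM-DD of an index name."""
    return date.fromisoformat(name[-10:])


def expired(names: List[str], today: date) -> List[str]:
    """Indexes older than the retention window (assumed whole-index deletion)."""
    cutoff = today - timedelta(days=RETENTION_DAYS)
    return [name for name in names if index_day(name) < cutoff]


def shard_copies_per_index(body: Dict[str, Dict[str, int]]) -> int:
    """Primaries plus replicas that one index places on the cluster."""
    settings = body["settings"]
    return settings["number_of_shards"] * (1 + settings["number_of_replicas"])


names = [index_name("lemon", date(2013, 10, 1)), index_name("syslog", date(2013, 10, 2))]
to_delete = expired(names, today=date(2013, 12, 31))
copies = shard_copies_per_index(INDEX_SETTINGS)
```

One index per day keeps retention cheap, since expiring data means removing a whole index instead of deleting individual documents.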

  35. Demo Production Cluster • ElasticHQ • HEAD

  36. Feedback • Easy to deploy and manage • Robust, fast, and rich API • Easy query language (DSL) • More features coming with aggregation framework
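
To give a flavour of the query DSL, the sketch below builds a JSON search body combining a full-text `query_string` clause with a `range` clause on a timestamp inside a `bool` query. The field name `@timestamp`, the search text, and the function name are assumptions for illustration; the exact DSL accepted depends on the Elasticsearch version in use. The body would be sent to an index's `_search` endpoint, which is not done here.

```python
import json
from typing import Any, Dict


def recent_messages_query(text: str, since: str = "now-1h", size: int = 20) -> Dict[str, Any]:
    """Search body: documents matching `text` with a timestamp at or after `since`."""
    return {
        "size": size,
        "query": {
            "bool": {
                "must": [
                    {"query_string": {"query": text}},            # full-text part
                    {"range": {"@timestamp": {"gte": since}}},    # assumed timestamp field
                ]
            }
        },
        "sort": [{"@timestamp": {"order": "desc"}}],  # newest first
    }


body = json.dumps(recent_messages_query("sshd AND failed"))
```

The whole request is a plain nested dictionary, so queries can be composed programmatically and serialized with any JSON library.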

  37. Part II Visualisation

  38. Motivation • Dedicated, dynamic and user-friendly dashboards • Horizontally scalable and easy to deploy • Kibana

  39. Kibana Visualize time-stamped data from ElasticSearch

  40. Kibana • “Make sense of a mountain of logs” • Designed to analyze logs • Perfectly fits timestamped data (e.g. metrics) • Profit from ElasticSearch power • Search/analyze features exploited

  41. Kibana • No code required • Simply point & click to build your own dashboard

  42. Kibana • Open source, community driven • Now fully integrated and supported by ElasticSearch • Provided code/feature contribution

  43. Kibana Built with AngularJS • JavaScript MVC framework for client-side rich application • Developed and maintained by Google • No backend: web server delivers only static files • JS directly queries ElasticSearch

  44. Kibana Easy to install • “git clone” OR “tar -xvzf” OR ElasticSearch plugin Easy to configure • 1-line config file to point to the ElasticSearch cluster • Save its own configuration in ElasticSearch itself • Possible to export/import dashboards configuration

  45. Our Deployment Based on ElasticSearch plugin • To profit from Jetty authentication • Deployed together with search node • Public (read only) and private (read write) endpoints

  46. Demo • Production Dashboards • Syslog • Lemon • PDUs

  47. Feedback • Easy to deploy and use • Cool user interface • Fits many use cases • Text (syslog), metrics (lemon) • Still limited feature set • Under active development • Very active community and growing

  48. OpenStack Monitoring Part III

  49. Experience with OpenStack 
