
HADOOP Monitoring and Diagnostics: Challenges and Lessons Learned

HADOOP Monitoring and Diagnostics: Challenges and Lessons Learned. Matthew Jacobs, mj@cloudera.com. About this talk: building monitoring and diagnostic tools for Hadoop; how we think about Hadoop monitoring and diagnostics; interesting problems we have; a few things we've learned in the process.



Presentation Transcript


  1. HADOOP Monitoring and Diagnostics: Challenges and Lessons Learned Matthew Jacobs mj@cloudera.com

  2. About this Talk • Building monitoring and diagnostic tools for Hadoop • How we think about Hadoop monitoring and diagnostics • Interesting problems we have • A few things we've learned in the process

  3. What is Hadoop? • Platform for distributed processing and storage of petabytes of data on clusters of commodity hardware • Operating system for the cluster • Services that interact and are composable • HDFS, MapReduce, HBase, Pig, Hive, ZK, etc... • Open source • Different Apache projects, different communities

  4. Managing the Complexity • Hadoop distributions, e.g. Cloudera's CDH • Packaged services, well tested • Existing tools • Ganglia, Nagios, Chef, Puppet, etc. • Management tools for Hadoop • Cloudera Manager • Deployment, configuration, reporting, monitoring, diagnosis • Used by operators @ Fortune 50 companies

  5. Thinking about Hadoop • Hadoop: services with many hosts, rather than hosts with many services • Tools should be service-oriented • Most general existing management tools are host-oriented

  6. Monitoring • Provide insight into the operation of the system • Challenges: • Knowing what to collect • Collecting, storing efficiently at scale • Deciding how to present data

  7. Hadoop Monitoring Data (1) • Operators care about • Resource and scheduling information • Performance and health metrics • Important log events • Come from • Metrics exposed via JMX (metrics/metrics2) • Logs (Hadoop services, OS) • Operating system (/proc, syscalls, etc.)
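The JMX-exposed metrics mentioned above can be polled over HTTP: Hadoop daemons serve their MBeans as JSON at a `/jmx` servlet. A minimal polling sketch — the host, port, bean name, and attribute names are illustrative and vary by Hadoop version, so treat them as assumptions:

```python
import json
from urllib.request import urlopen


def parse_jmx(jmx_json_text, bean_prefix, keys):
    """Extract selected attributes from a /jmx JSON payload.

    The payload has the shape {"beans": [{"name": "...", attr: value, ...}]}.
    """
    result = {}
    for bean in json.loads(jmx_json_text).get("beans", []):
        if bean.get("name", "").startswith(bean_prefix):
            for key in keys:
                if key in bean:
                    result[key] = bean[key]
    return result


def fetch_namenode_metrics(host="namenode.example.com", port=50070):
    """Poll a NameNode's /jmx endpoint (host and port are hypothetical)."""
    with urlopen("http://%s:%d/jmx" % (host, port)) as resp:
        return parse_jmx(resp.read().decode("utf-8"),
                         "Hadoop:service=NameNode,name=FSNamesystem",
                         ["MissingBlocks", "CorruptBlocks", "CapacityUsed"])
```

A collector would call `fetch_namenode_metrics` once per polling interval and write the result into the time-series store.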

  8. Hadoop Monitoring Data (2) • Choosing what to collect • Not all! Some are just confusing • e.g. DN corrupt replicas vs. blocks with corrupt replicas • We’re filtering for users • Add more when we see customer problems • But… • Interfaces change between versions • Just messy

  9. Example Metric Data, HDFS • I/O metrics, read/write bytes, counts • Blocks, replicas, corruptions • FS info, volume failures, usage/capacity • NameNode info, time since checkpoint, transactions since checkpoint, num DNs failed • Many more...

  10. Hadoop Monitoring, What to show • Building an intuitive user interface is hard • Especially for a complex system like Hadoop • Need service-oriented view • Pre-baked visualizations (charts, heatmaps, etc.) • Generic data visualization capabilities • Experts know exactly what they want to see • e.g. chart number of corrupt DN block replicas by rack

  11. Diagnostics • Inform operators when something is wrong • E.g. datanode has too many corrupt blocks • Hard problem • No single solution • Need multiple tools for diagnosis • Really don't want to be wrong • Operators lose faith in the tool

  12. Health Checks • Set of rule-based checks for specific problems • Simple, stateless, based on metric data • Well targeted, catch real problems • Easy to get 'right' • Learn from real customer problems • Add checks when customers hit hard-to-diagnose problems • E.g. customer saw slow HBase reads • Hard to find! bad switch → packet frame errors

  13. Health Checks, Examples • HDFS missing blocks, corrupt replicas • DataNode connectivity, volume failures • NameNode checkpoint age, safe mode • GC duration, number file descriptors, etc... • Canary-based checks • e.g. can write a file to HDFS,can perform basic HBase operations • Many more...
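Checks like these are simple, stateless threshold rules over the latest polled metrics. A sketch of one such rule — the metric name `VolumeFailures` and the thresholds are illustrative, not Cloudera Manager's actual defaults:

```python
def check_dn_volume_failures(metrics, warn_at=1, fail_at=3):
    """Stateless rule-based check: flag a DataNode with failed volumes.

    `metrics` is a dict of the latest polled values for one DataNode;
    thresholds are illustrative.
    """
    failed = metrics.get("VolumeFailures", 0)
    if failed >= fail_at:
        return ("BAD", "%d failed volumes" % failed)
    if failed >= warn_at:
        return ("CONCERNING", "%d failed volumes" % failed)
    return ("GOOD", "no failed volumes")
```

Because the check is stateless, it can be re-evaluated from scratch on every polling cycle with no history to manage — one reason such checks are "easy to get right."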

  14. Health Checks (2) • Not good for performance, context-aware issues • Have to build manually, time consuming • Can take these further • Add more knowledge about root cause • Taking actions in some cases

  15. Anomaly Detection • Simple statistics, e.g. std deviation • More clever machine learning algorithms • Local outliers in high-dimensional 'metric space' • Streaming algorithm seems feasible • Identify what's abnormal for a particular cluster • Must use carefully – outlier != problem • Measure of 'potential interestingness'
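The "simple statistics" approach can be run as a streaming computation: keep a running mean and variance per metric stream (Welford's online algorithm) and flag points far from the mean, with no history retained. A sketch — the threshold `k` is illustrative, and, as the slide warns, an outlier is only potentially interesting, not necessarily a problem:

```python
import math


class StreamingOutlierDetector:
    """Flag values more than `k` standard deviations from the running mean.

    Uses Welford's online algorithm, so memory is O(1) per metric stream.
    """

    def __init__(self, k=3.0):
        self.k, self.n, self.mean, self.m2 = k, 0, 0.0, 0.0

    def observe(self, x):
        """Return True if x looks like an outlier, then fold it into the stats."""
        outlier = False
        if self.n >= 2:
            std = math.sqrt(self.m2 / (self.n - 1))
            outlier = std > 0 and abs(x - self.mean) > self.k * std
        # Welford's update.
        self.n += 1
        delta = x - self.mean
        self.mean += delta / self.n
        self.m2 += delta * (x - self.mean)
        return outlier
```

Per-stream detectors like this also capture "what's abnormal for a particular cluster," since the statistics are learned from that cluster's own data.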

  16. Other Diagnostic Tools and Challenges • Anomaly detection via log data • Need data across services • E.g. slow HBase reads caused by HDFS latency • Better instrumentation in platform • E.g. Dapper-like tracing through the stack (HBASE-6449) • Future work to extend to HDFS

  17. Challenges: Hadoop Fault Tolerance • Hadoop is built to tolerate failures • E.g. HDFS replication • Not clear when to report a problem • E.g. 1 failed DN may not be concerning enough to alert on

  18. Challenges in Diagnostics (2) • Entities interact • E.g. health of HDFS depends on health of DNs, NNs, etc… • Relations describe graph of computation to evaluate health • Evaluating cluster/service/host health becomes challenging • Data arrives from different sources at different times • When to evaluate health? Every minute? When data changes? • Complete failures >> partial failures
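One way to picture these interactions: service health is a function of role health, with the platform's fault tolerance baked into the roll-up rules. A hypothetical roll-up for HDFS — the states and thresholds are illustrative, not the tool's actual logic:

```python
def hdfs_health(nn_health, dn_healths, dn_bad_fraction=0.1):
    """Derive HDFS service health from the health of its roles.

    Encodes fault tolerance: one bad DataNode is only CONCERNING (HDFS
    re-replicates its blocks), but a bad NameNode, or too many bad
    DataNodes, makes the whole service BAD. Thresholds are illustrative.
    """
    if nn_health == "BAD":
        return "BAD"
    bad_dns = sum(1 for h in dn_healths if h == "BAD")
    if dn_healths and bad_dns / float(len(dn_healths)) > dn_bad_fraction:
        return "BAD"
    if bad_dns > 0 or nn_health == "CONCERNING":
        return "CONCERNING"
    return "GOOD"
```

In the real system this function sits at one node of the dependency graph; evaluating cluster health means evaluating many such functions as their inputs arrive from different sources at different times.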

  19. Challenges Operating at Scale (1) • Building a distributed system to monitor a distributed system • Collect metrics for lots of 'things' (entities) • DataNodes, NameNodes, TaskTrackers, JobTrackers, RegionServers, Regions, etc. • Hosts, disks, NICs, data directories, etc. • Aggregate many metrics too • e.g. aggregate DN metrics → HDFS-wide metrics; aggregate region metrics → table metrics • Cluster-wide, service-wide, rack-wide, etc. • Becomes a big data problem
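The aggregation step can be sketched as a roll-up over entity → aggregate relationships. Entity names are hypothetical, and summing is a simplification — a real system would also track min/max/count per aggregate:

```python
from collections import defaultdict


def aggregate(entity_metrics, parents_of):
    """Roll entity-level metrics up into aggregate metric streams.

    `entity_metrics` maps entity id -> {metric name: value}; `parents_of`
    maps an entity to the aggregates it belongs to (e.g. a DataNode belongs
    to a rack, the HDFS service, and the cluster). Values are summed.
    """
    totals = defaultdict(lambda: defaultdict(float))
    for entity, metrics in entity_metrics.items():
        for parent in parents_of.get(entity, []):
            for name, value in metrics.items():
                totals[parent][name] += value
    return {p: dict(m) for p, m in totals.items()}
```

With hundreds of thousands of entities each belonging to several aggregates, this multiplies the write volume — which is how monitoring itself becomes a big data problem.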

  20. Challenges Operating at Scale (2) • At 1000 nodes... • Hundreds of thousands of entities • Millions of metrics written per minute • Increase polling? Every 30 sec? 10 sec? • Simple RDBMS is OK for a while... • Shard, partition, etc.

  21. Storage for Monitoring Data • Hadoop (HBase) + OpenTSDB is great • But we don't eat our own tail... • Can use other TS databases • Modify HBase, make 'embedded' version • Just a single node, just a single Region • Or use LevelDB • Fast key-value store from Google, open source

  22. LevelDB (or HBase), an example • Data model, simplified • Have time series for many entities: tsId • e.g. DNs, Regions, hosts, disks, etc. • Have many metric streams: metricId • e.g. DN bytes read, JVM gc count, etc. • LevelDB, fast key-value store • Key: byte array of “<tsId><metricId><timestamp>” • Value: data
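The key layout above can be sketched with big-endian packing, which makes lexicographic byte order match numeric order — so a range scan over one (tsId, metricId) prefix returns points in time order. The 4/4/8-byte field widths are an assumption:

```python
import struct


def encode_key(ts_id, metric_id, timestamp):
    """Pack <tsId><metricId><timestamp> into a big-endian byte key.

    Big-endian packing means LevelDB's default byte-wise comparator sorts
    keys first by time series, then by metric, then by time.
    """
    return struct.pack(">IIQ", ts_id, metric_id, timestamp)
```

Reads then become a single sequential scan from `encode_key(ts, m, start)` to `encode_key(ts, m, end)`, which is the access pattern LSM stores like LevelDB and HBase are fast at.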

  23. LevelDB example (2) • Can write many data points per row • Timestamp in key is timestamp base • Write each data point time delta before value • E.g. value: “<delta1><val1><delta2><val2>...”or “<delta1><delta2>...<val1><val2>...” • Will compress well • Very similar to what OpenTSDB does
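A minimal sketch of the first value layout, `<delta1><val1><delta2><val2>...`. The 2-byte delta width is an assumption, meaning one row can cover at most ~65535 time units past its base timestamp:

```python
import struct


def encode_points(base_ts, points):
    """Encode (timestamp, value) points as (delta-from-base, value) pairs.

    Each point packs to 10 bytes: a 2-byte delta plus an 8-byte double.
    Small repeated deltas compress well under LevelDB's block compression.
    """
    out = b""
    for ts, val in points:
        out += struct.pack(">Hd", ts - base_ts, val)
    return out


def decode_points(base_ts, blob):
    """Inverse of encode_points."""
    points = []
    for i in range(0, len(blob), 10):
        delta, val = struct.unpack(">Hd", blob[i:i + 10])
        points.append((base_ts + delta, val))
    return points
```

The second layout on the slide (all deltas, then all values) groups similar bytes together, which typically compresses even better at the cost of slightly more complex decoding.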

  24. Questions?
