How to monitor the $H!T out of Hadoop
Developing a comprehensive open approach to monitoring Hadoop clusters
Relevant Hadoop Information
• Clusters range from 3 to 3,000 nodes
• Hardware and software failures are "common"
• Redundant components: DataNode, TaskTracker
• Non-redundant components: NameNode, JobTracker, SecondaryNameNode
• Fast-evolving technology (best practices are still emerging)
Monitoring Software: Nagios
• Red/yellow/green alerts, escalations
• De facto standard, widely deployed
• Text-based configuration
• Web interface
• Pluggable with shell scripts and external programs
• A check returns 0 for OK, 1 for WARNING, 2 for CRITICAL (see the sketch below)
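To make the plugin contract concrete, here is a minimal sketch of a Nagios-style check in Java; the class name, target path, and thresholds are illustrative, and any executable that prints one status line and exits 0/1/2 would do just as well:

import java.io.File;

// Minimal Nagios-style check: alert when a filesystem is nearly full.
public class CheckDiskFree {
  public static void main(String[] args) {
    File fs = new File(args.length > 0 ? args[0] : "/");
    long total = fs.getTotalSpace();
    long pctFree = total == 0 ? 0 : (fs.getUsableSpace() * 100) / total;
    if (pctFree < 5) {
      System.out.println("CRITICAL - " + pctFree + "% free on " + fs);
      System.exit(2); // 2 = CRITICAL
    } else if (pctFree < 15) {
      System.out.println("WARNING - " + pctFree + "% free on " + fs);
      System.exit(1); // 1 = WARNING
    }
    System.out.println("OK - " + pctFree + "% free on " + fs);
    System.exit(0); // 0 = OK
  }
}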
Cacti
• Performance graphing system
• Front end for RRD/RRA (round-robin databases/archives)
• Slick web interface
• Template system for graph types
• Pluggable inputs:
  • SNMP
  • Shell scripts and external programs (see the sketch below)
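A Cacti data-input program only has to print name:value pairs on stdout. A minimal sketch (the field names here are made up for illustration):

// Cacti data-input sketch: prints space-separated name:value pairs.
public class MemInput {
  public static void main(String[] args) {
    Runtime rt = Runtime.getRuntime();
    long usedKb = (rt.totalMemory() - rt.freeMemory()) / 1024;
    long freeKb = rt.freeMemory() / 1024;
    // Cacti parses this single line into two data-source fields.
    System.out.println("memUsedKb:" + usedKb + " memFreeKb:" + freeKb);
  }
}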
hadoop-cacti-jtg
• JMX-fetching code with kick-off scripts
• Cacti templates for Hadoop
• Premade Nagios check scripts
• Helper/batch/automation scripts
• Apache License
Sample Cluster, Part 1
• NameNode & SecondaryNameNode
  • Hardware RAID
  • 8 GB RAM
  • 1x quad-core CPU
  • Derby DB (Hive) on the SecondaryNameNode
• JobTracker
  • 8 GB RAM
  • 1x quad-core CPU
Sample Cluster, Part 2
• Slaves (hadoopdata1-XXXX)
  • JBOD: 8x 1 TB SATA disks
  • 16 GB RAM
  • 2x quad-core CPUs
Prerequisites
• Nagios (install from the DAG RPMs)
• Cacti (install; several RPMs)
• Liberal network access to the cluster
Alerts & Escalations
• X nodes * Y services = little sleep
• Define a policy:
  • "Wake me up" alerts (SMS)
  • "Don't wake me up" alerts (email)
  • Review it (daily, weekly, monthly)
"Wake Me Up" Alerts
• NameNode
  • Disk full (big, big headache)
  • RAID array issues (failed disk)
• JobTracker
• SecondaryNameNode
  • Don't wait until it is too late to notice it has stopped working
Don’t Wake Me Up’s • Or ‘Wake someone else up’ • DataNode • Warning Currently Failed Disk will down the Data Node (see Jira) • TaskTracker • Hardware • Bad Disk (Start RMA) • Slaves are expendable (up to a point)
Monitoring Battle Plan
• Start with the basics
  • Ping, disk
• Add Hadoop-specific alarms
  • check_data_node
• Add JMX graphing
  • NameNodeOperations
• Add JMX-based alarms
  • FilesTotal > 1,000,000 or LiveNodes < 50%
The Basics: Nagios
• Nagios (all nodes)
  • Host up (ping check)
  • Disk % full (see the example below)
  • Swap usage > 85%
• Load-based alarms are somewhat useless: 389% CPU load is not necessarily a bad thing in Hadoopville
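As a sketch, the disk check can use the stock check_disk plugin, assuming it runs on each host (e.g. via NRPE; shown here in its local form). The thresholds and the hostgroup name are assumptions; adjust them to your layout:

define command {
  command_name check_disk_pct
  command_line $USER1$/check_disk -w 15% -c 5% -p /
}

define service {
  service_description disk_full
  use generic-service
  hostgroup_name hadoop-all
  check_command check_disk_pct
}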
The Basics: Cacti
• Cacti (all nodes)
  • CPU (all cores)
  • RAM/swap
  • Network
  • Disk usage
RAID Tools
• hpacucli: not a Street Fighter move
• Alerts on RAID events (NameNode)
  • Disk failed
  • Rebuilding
• JBOD (DataNode)
  • Failed drive
  • Drive errors
• Dell, Sun: vendor-specific tools
Before You Jump In
• X nodes * Y checks = lots of work
• About 3 nodes into the process...
• "Wait!!! I need some interns!!!"
• Solution: S.I.C.C.T., Semi-Intelligent Configuration-Cloning Tools
  • (I made that up)
  • (for this presentation)
Nagios
• Answers "Is it running?"
• Text-based configuration
Cacti
• Answers "How well is it running?"
• Web-based configuration
• php-cli tools for scripted, batch changes (see the example below)
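Those php-cli scripts are what make cloning a device definition across hundreds of slaves bearable. A sketch, assuming the cli/ scripts bundled with Cacti and a default install path (flags vary between Cacti versions, so check the script's --help output first):

php -q /var/www/cacti/cli/add_device.php --description="hadoopdata42" --ip="hadoopdata42"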
Monitoring Battle Plan: Thus Far
• Start with the basics
  • Ping, disk (done!)
• Add Hadoop-specific alarms
  • check_data_node
• Add JMX graphing
  • NameNodeOperations
• Add JMX-based alarms
  • FilesTotal > 1,000,000 or LiveNodes < 50%
Add Hadoop-Specific Alarms
• Hadoop components with a web interface:
  • NameNode: 50070
  • JobTracker: 50030
  • TaskTracker: 50060
  • DataNode: 50075
• check_http + regex = simple + effective
nagios_check_commands.cfg:

define command {
  command_name check_remote_namenode
  command_line $USER1$/check_http -H $HOSTADDRESS$ -u http://$HOSTADDRESS$:$ARG1$/dfshealth.jsp -p $ARG1$ -r NameNode
}

define service {
  service_description check_remote_namenode
  use generic-service
  host_name hadoopname1
  check_command check_remote_namenode!50070
}

• Catches component failure
• (Future) Newer Hadoop releases will expose XML status pages
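The same pattern covers the other daemons. A sketch for the DataNode on port 50075 (the host name is an example; without a known-stable page to match, start with a bare HTTP port check and add an -r pattern once you know a string the page reliably serves):

define command {
  command_name check_remote_datanode
  command_line $USER1$/check_http -H $HOSTADDRESS$ -p $ARG1$
}

define service {
  service_description check_remote_datanode
  use generic-service
  host_name hadoopdata1
  check_command check_remote_datanode!50075
}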
Monitoring Battle Plan
• Start with the basics
  • Ping, disk (done)
• Add Hadoop-specific alarms
  • check_data_node (done)
• Add JMX graphing
  • NameNodeOperations
• Add JMX-based alarms
  • FilesTotal > 1,000,000 or LiveNodes < 50%
JMX Graphing
• Enable JMX on the Hadoop daemons (see the example below)
• Import the Cacti templates
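A minimal sketch of enabling remote JMX for the NameNode in conf/hadoop-env.sh; the port is arbitrary, and authentication/SSL are disabled here only because the cluster network is assumed to be trusted (use a JMX password file in anything less):

export HADOOP_NAMENODE_OPTS="-Dcom.sun.management.jmxremote \
  -Dcom.sun.management.jmxremote.port=8004 \
  -Dcom.sun.management.jmxremote.authenticate=false \
  -Dcom.sun.management.jmxremote.ssl=false $HADOOP_NAMENODE_OPTS"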
Monitoring Battle Plan: Thus Far
• Start with the basics (done!)
  • Ping, disk
• Add Hadoop-specific alarms (done!)
  • check_data_node
• Add JMX graphing (done!)
  • NameNodeOperations
• Add JMX-based alarms
  • FilesTotal > 1,000,000 or LiveNodes < 50%
Add JMX-Based Alarms
• hadoop-cacti-jtg is flexible:
  • Extend the fetch classes
  • Don't call output()
  • Write your own check logic instead (see the sketch after the next slide)
Quick JMX Base Walkthrough
• url, user, pass, and object are specified on the command line
• wantedVariables and wantedOperations are set by inheritance
• fetch() and output() are provided by the base class
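Rather than the project's actual base classes, here is a self-contained sketch of the same idea in plain JMX, assuming JMX was enabled as above. The host, port, MBean name, attribute, and threshold are all assumptions; verify the names with jconsole against your Hadoop version:

import javax.management.MBeanServerConnection;
import javax.management.ObjectName;
import javax.management.remote.JMXConnector;
import javax.management.remote.JMXConnectorFactory;
import javax.management.remote.JMXServiceURL;

// Nagios-style JMX alarm: go CRITICAL when FilesTotal crosses a threshold.
public class CheckFilesTotal {
  public static void main(String[] args) throws Exception {
    String host = args.length > 0 ? args[0] : "hadoopname1"; // example host
    long threshold = 1000000L; // the battle-plan threshold
    JMXServiceURL url = new JMXServiceURL(
        "service:jmx:rmi:///jndi/rmi://" + host + ":8004/jmxrmi");
    JMXConnector conn = JMXConnectorFactory.connect(url, null);
    long filesTotal;
    try {
      MBeanServerConnection mbs = conn.getMBeanServerConnection();
      // MBean and attribute names vary across Hadoop versions.
      ObjectName fsState =
          new ObjectName("hadoop:service=NameNode,name=FSNamesystemState");
      filesTotal = ((Number) mbs.getAttribute(fsState, "FilesTotal")).longValue();
    } finally {
      conn.close();
    }
    if (filesTotal > threshold) {
      System.out.println("CRITICAL - FilesTotal=" + filesTotal);
      System.exit(2); // Nagios CRITICAL
    }
    System.out.println("OK - FilesTotal=" + filesTotal);
    System.exit(0); // Nagios OK
  }
}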
Monitoring Battle Plan
• Start with the basics (done!)
  • Ping, disk
• Add Hadoop-specific alarms (done!)
  • check_data_node
• Add JMX graphing (done!)
  • NameNodeOperations
• Add JMX-based alarms (done!)
  • FilesTotal > 1,000,000 or LiveNodes < 50%
Review
• File system growth
  • Size
  • Number of files
  • Number of blocks
  • Ratios
• Utilization
  • CPU/memory
  • Disk
• Nightly email (see the cron sketch below)
  • FSCK
  • DFSADMIN
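One way to get that nightly email, assuming a mail-capable NameNode, the hadoop command on the PATH, and a placeholder address and schedule:

# /etc/cron.d/hadoop-nightly: mail the fsck and dfsadmin reports at 02:00
0 2 * * * hadoop (hadoop fsck / ; hadoop dfsadmin -report) 2>&1 | mail -s "hadoop nightly report" ops@example.com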
The Future
• JMX coming to the JobTracker and TaskTracker (0.21)
• Collect and graph running jobs
• Collect and graph map/reduce tasks per node
• Profile specific jobs in Cacti?