Learn the best practices for Hadoop operations from industry experts Chris Nauroth and Suresh Srinivas. Gain insights into support case trends, configuration, documentation, software improvements, and key learnings.
Hadoop Operations – Best Practices from the Field Chris Nauroth email: cnauroth@hortonworks.com twitter: @cnauroth Suresh Srinivas email: suresh@hortonworks.com twitter: @suresh_m_s October 17, 2014
About Us Chris Nauroth • Member of Technical Staff, Hortonworks • Apache Hadoop committer and PMC member • Major contributor to HDFS ACLs, Windows compatibility, and operability improvements • Hadoop user since 2010 • Prior experience deploying, maintaining, and using Hadoop clusters Suresh Srinivas • Architect & Founder at Hortonworks • Long-time Apache Hadoop committer and PMC member • Designed and developed many key Hadoop features • Experience from supporting many clusters • Including some of the world’s largest Hadoop clusters
Agenda • Analysis of Hadoop Support Cases • Support case trends • Configuration • Documentation • Software Improvements • Key Learnings and Best Practices • HDFS ACLs • HDFS Snapshots • YARN Application Timeline Server
Support Cases: Setting the Context • Hortonworks Support • Multiple tiers of support contacts • Support engineers trained and knowledgeable across the entire Hadoop ecosystem • Cases may escalate to subject matter experts for depth in one particular area • Challenging cases may escalate to Apache committers at Hortonworks if additional expertise is required • Apache Community Support • user@hadoop.apache.org for user questions and support • https://issues.apache.org/jira for reporting confirmed bugs • Apache Hadoop users, contributors, committers and PMC members all participate actively in these forums to help resolve issues
Support Case Analysis Methodology • Inspected over 2 years of support case history across hundreds of customers • Broad inclusion of 29 Hadoop ecosystem and related projects • Multiple versions of Hadoop in deployments • 2 major versions: Hadoop 1.x and 2.x • ~3 minor versions within each major version • ~3 patch releases per minor version • ~15 total releases and updates • Distinct deployment environments • Cluster sizes ranging from 10s to 1000s of nodes • Different management environments and operational practices • Various deployment techniques: Ambari, Chef, RPMs, etc.
Support Case Trends – Cases per Month (chart)
Support Case Trends – Cases per Month • What is the spike in May 2014? • More users • More total users means more total support cases • More features • Many upgrades of existing clusters from Hadoop 1 to Hadoop 2 • Many conversions to HA deployments • Many conversions to secured deployments • More integration • Many sites running separate Hadoop 1 and Hadoop 2 clusters simultaneously • Questions around migrating data between clusters at 2 different versions (DistCp)
Support Case Trends – Proportional Cases per Month (chart)
Support Case Trends – Root Cause (chart)
Support Case Trends • Highlights • Core Hadoop components (HDFS, YARN and MapReduce) are used across all deployments, and therefore receive proportionally more support cases than other ecosystem components. • Misconfiguration is the dominant root cause. • Documentation is a close second. • We are constantly improving the code to eliminate operational issues, help with diagnosis and provide increased visibility.
Hardware and Cluster Sizing • Considerations • Larger clusters heal faster after node or disk failures • Machines with huge storage take longer to recover • More racks give more failure domains • Recommendations • Get good-quality commodity hardware • Buy at the pricing sweet spot: 3 TB disks, 96 GB RAM, 8-12 cores • More memory is better; real-time workloads are memory hungry! • Get to 30-40 machines or 3-4 racks before considering fatter machines (1U with 6 disks vs. 2U with 12 disks) • Use a pilot cluster to learn about load patterns • Balance hardware for I/O-, compute-, or memory-bound workloads • More details: http://tinyurl.com/hwx-hadoop-hw
Configuration • Avoid JVM issues • Use a 64-bit JVM for all daemons • Compressed OOPs enabled by default (Java 6u23 and later) • Java heap size • Set starting and max heap size the same: -Xms == -Xmx • Avoid Java defaults; configure -XX:NewSize and -XX:MaxNewSize • Use 1/8 to 1/6 of max heap size for JVMs larger than 4 GB • Configure -XX:PermSize=128m, -XX:MaxPermSize=256m • Use a low-latency GC collector • -XX:+UseConcMarkSweepGC, -XX:ParallelGCThreads=<N> • Use a high <N> on the NameNode and JobTracker or ResourceManager • Important JVM configs to help debugging • -verbose:gc -Xloggc:<file> -XX:+PrintGCDetails • -XX:ErrorFile=<file> • -XX:+HeapDumpOnOutOfMemoryError
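A minimal hadoop-env.sh sketch pulling these flags together for the NameNode (heap sizes, GC thread count, and log paths are illustrative and must be tuned per cluster):

# hadoop-env.sh: NameNode JVM options (values illustrative)
export HADOOP_NAMENODE_OPTS="-Xms8g -Xmx8g -XX:NewSize=1g -XX:MaxNewSize=1g \
  -XX:PermSize=128m -XX:MaxPermSize=256m \
  -XX:+UseConcMarkSweepGC -XX:ParallelGCThreads=8 \
  -verbose:gc -Xloggc:/var/log/hadoop/hdfs/nn-gc.log -XX:+PrintGCDetails \
  -XX:ErrorFile=/var/log/hadoop/hdfs/nn-jvm-error.log \
  -XX:+HeapDumpOnOutOfMemoryError ${HADOOP_NAMENODE_OPTS}"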
Configuration • Multiple redundant dirs for namenode metadata • One of dfs.namenode.name.dir should be on NFS • NFS soft mount: tcp,soft,intr,timeo=20,retrans=5 • Configure the open file descriptor ulimit • Default 1024 is too low • 16K for DataNodes, 64K for master nodes • Use version control for configuration!
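A sketch of redundant metadata directories in hdfs-site.xml plus a matching fd limit in /etc/security/limits.conf (the paths and the local/NFS split are illustrative):

<property>
  <name>dfs.namenode.name.dir</name>
  <value>/data/1/hdfs/namenode,/mnt/nfs/hdfs/namenode</value>
</property>

# /etc/security/limits.conf: raise the open fd limit for the hdfs user
hdfs - nofile 65536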
Configuration • Use disk fail-in-place for DataNodes: dfs.datanode.failed.volumes.tolerated • Disk failure is no longer DataNode failure • Especially important for large-density nodes • Set dfs.namenode.name.dir.restore to true • Restores NN storage directory during checkpointing • Take periodic backups of namenode metadata • Make copies of the entire storage directory • Set aside a lot of disk space for NN logs • It is verbose – set aside multiple GBs • Many installs configure this too small • NN logs roll within minutes – hard to debug issues
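A short hdfs-site.xml sketch of both settings (the tolerated volume count is illustrative and depends on disks per node):

<property>
  <name>dfs.datanode.failed.volumes.tolerated</name>
  <value>2</value>
</property>
<property>
  <name>dfs.namenode.name.dir.restore</name>
  <value>true</value>
</property>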
Monitor Usage • Cluster storage, node count, files and blocks grow over time • Update NN heap, handler count, number of DN xceivers • Tweak other related configs periodically • Monitor the hardware usage for your workload • Disk I/O, network I/O, CPU and memory usage • Use this information when expanding cluster capacity • Monitor the usage with Hadoop metrics • JVM metrics – GC times, memory used, thread status • RPC metrics – especially latency to track slowdowns • HDFS metrics • Used storage, # of files and blocks, total load on the cluster • File system operations • MapReduce metrics • Slot utilization and job status • Tweak configurations during upgrades/maintenance on an ongoing basis
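One simple way to sample these metrics is the NameNode's JMX JSON servlet; a sketch assuming the default Hadoop 2 NameNode HTTP port 50070 (hostname illustrative):

curl 'http://namenode-host:50070/jmx?qry=Hadoop:service=NameNode,name=JvmMetrics'
curl 'http://namenode-host:50070/jmx?qry=Hadoop:service=NameNode,name=FSNamesystem'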
Documentation • Continual Investment in Documentation • Hortonworks Data Platform Documentation • http://docs.hortonworks.com/ • Apache Hadoop Documentation • http://hadoop.apache.org/docs/current/ • Apache Hadoop Documentation • We welcome your requests in Apache jira for documentation improvements. • Create issues with the “documentation” label. • Getting the end user perspective is extremely valuable. • We would be grateful to receive documentation patches. • It’s a great way to get started in the Apache Hadoop open source process. • Search for unresolved issues with the “documentation” label. • https://issues.apache.org/jira/issues/?jql=project%20in%20(HDFS%2C%20HADOOP%2C%20YARN%2C%20MAPREDUCE)%20AND%20resolution%20%3D%20Unresolved%20AND%20labels%20%3D%20documentation
Software Improvements Real Incidents and Software Improvements to Address Them
Don’t edit the metadata files! • Editing can corrupt the cluster state • Might result in loss of data • Real incident • NN misconfigured to point to another NN’s metadata • DNs can’t register due to namespace ID mismatch • System detected the problem correctly • Safety net ignored by the admin! • Admin edits the namenode VERSION file to match IDs. What happens next?
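For context, the VERSION file the admin edited is a small properties file in the NameNode storage directory, roughly like the sketch below (all values illustrative). Hand-editing namespaceID here masked the misconfiguration instead of fixing it:

#Fri Oct 17 10:00:00 PDT 2014
namespaceID=1116399350
clusterID=CID-5a2451d4-6d65-4e64-8b34-5a0f4b341f2a
cTime=0
storageType=NAME_NODE
blockpoolID=BP-919964075-10.0.0.1-1397406612789
layoutVersion=-56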
Improvement • Pause deletion of blocks when the namenode starts up • https://issues.apache.org/jira/browse/HDFS-6186 • Supports configurable delay of block deletions after NameNode startup • Gives an admin extra time to diagnose before deletions begin • Show when block deletion will start after NameNode startup in WebUI • https://issues.apache.org/jira/browse/HDFS-6385 • The web UI already displays the number of pending block deletions • This will enhance the display to indicate when actual deletion will begin
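HDFS-6186 exposes the delay as a configuration key in hdfs-site.xml; a sketch with an illustrative one-hour value:

<property>
  <name>dfs.namenode.startup.delay.block.deletion.sec</name>
  <value>3600</value>
</property>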
Guard Against Accidental Deletion • rm -r deletes the data at the speed of Hadoop! • Ctrl-C of the command does not stop deletion! • Undeleting files on datanodes is hard & time consuming • Immediately shut down NN, unmount disks on datanodes • Recover deleted files • Start namenode without the delete operation in edits • Enable Trash • Real Incident • Customer is running a distro of Hadoop with trash not enabled • Deletes a large dir (100 TB) and shuts down NN immediately • Support person asks for NN to be restarted to see if trash is enabled! What happens next? • Now HDFS has Snapshots!
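Enabling trash is a one-line change in core-site.xml; the retention interval is in minutes (the one-day value below is illustrative). With trash enabled, hdfs dfs -rm -r moves data into the user's .Trash directory instead of scheduling immediate block deletion, leaving a recovery window:

<property>
  <name>fs.trash.interval</name>
  <value>1440</value>
</property>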
Improvement • HDFS Snapshots • https://issues.apache.org/jira/browse/HDFS-2802 • A snapshot is a read-only point-in-time image of part of the file system • A snapshot created before a deletion can be used to restore deleted data • More coverage of snapshots later in the presentation • HDFS ACLs • https://issues.apache.org/jira/browse/HDFS-4685 • Finer-grained control of file permissions can help prevent an accidental deletion • More coverage of ACLs later in the presentation
Unexpected error during HA HDFS upgrade • Background: HDFS HA Architecture • http://hadoop.apache.org/docs/current/hadoop-project-dist/hadoop-hdfs/HDFSHighAvailabilityWithQJM.html • Real Incident • During upgrade, NameNode calls every JournalNode to request backup of metadata directory, which renames “current” directory to “previous.tmp”. • Permissions incorrect on metadata directory for 1 out of 3 JournalNodes. • The hdfs user is not authorized to rename. Backup fails for that JournalNode, so upgrade process aborts with error. What happens next?
Improvement • Improve diagnostics on storage directory rename operations by using native code. • https://issues.apache.org/jira/browse/HDFS-7118 • Logs additional root cause information for rename failures, for example EACCES • Split error checks into separate conditions to improve diagnostics. • https://issues.apache.org/jira/browse/HDFS-7119 • Splits a log message about failure to delete or rename into separate log messages to clarify which specific action failed • When aborting NameNode or JournalNode, write the contents of the metadata directories and permissions to logs. • https://issues.apache.org/jira/browse/HDFS-7120 • Usually the first information asked of the user, so we can automate this • For JournalNode operations that must succeed on all nodes, execute a pre-check to verify that the operation can succeed. • https://issues.apache.org/jira/browse/HDFS-7121 • Prevents the need for manual cleanup on the 2 out of 3 JournalNodes where backup succeeded
Support Case Trends • Highlights Revisited • Core Hadoop components (HDFS, YARN and MapReduce) are used across almost all deployments, and therefore receive proportionally more support cases than other ecosystem components. • Action: Focus efforts on core Hadoop first to improve operability of the platform. • Misconfiguration is the dominant root cause. • Action: Publish configuration best practices and advise on the need for ongoing review of configuration as cluster usage patterns change over time. • Documentation is a close second. • Action: Contribute frequently to product documentation, both in open source Apache Hadoop and in the distro. End user documentation is a gating factor for launching new features. We welcome your requests in Apache jira for documentation improvements, and we welcome your patches! • Code changes can often be implemented to eliminate an operational issue, help with diagnosis or provide increased visibility. • Action: After resolution of each support case, consider potential product improvements. For example, can logging be improved? Small code changes can have a big impact.
Key Learnings and Best Practices Features that Help Improve Production Operations
HDFS ACLs • Existing HDFS POSIX permissions are good, but not flexible enough • Permission requirements may differ from the natural organizational hierarchy of users and groups. • HDFS ACLs augment the existing HDFS POSIX permissions model by implementing the POSIX ACL model. • An ACL (Access Control List) provides a way to set different permissions for specific named users or named groups, not only the file’s owner and the file’s group.
HDFS File Permissions Example • Authorization requirements: • The sales department wants a single user, Maya (department manager), to control all modifications to sales data • Other members of the sales department need to view the data, but must not modify it • Everyone else in the company must not be allowed to view the data • Implemented with traditional permissions: read/write for user maya, read for group sales, no access for others (see the command sketch below)
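A minimal sketch of those traditional permissions as HDFS commands (the /sales-data path matches the ACL examples on the following slides):

hdfs dfs -chown maya:sales /sales-data
hdfs dfs -chmod 640 /sales-data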
HDFS ACLs • Problem • No longer feasible for Maya to control all modifications to the file • New requirement: Maya, Diane and Clark are allowed to make modifications • New requirement: a new group called execs (the executives) should be able to read the sales data • The traditional permissions model only allows permissions for 1 group and 1 user • Solution: HDFS ACLs • Now assign different permissions to different users and groups (diagram: an HDFS directory with separate rwx entries for the owner, several named users and groups, and others; see the sketch below)
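A sketch of how these requirements map to ACL entries (user and group names from the slides; the second command matches the one shown on the next slide):

hdfs dfs -setfacl -m user:diane:rw-,user:clark:rw- /sales-data
hdfs dfs -setfacl -m group:execs:r-- /sales-data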
HDFS ACLs New Tools for ACL Management (setfacl, getfacl) • hdfs dfs -setfacl -m group:execs:r-- /sales-data • hdfs dfs -getfacl /sales-data
# file: /sales-data
# owner: maya
# group: sales
user::rw-
group::r--
group:execs:r--
mask::r--
other::---
• How do you know if a directory has ACLs set? • hdfs dfs -ls /sales-data
Found 1 items
-rw-r-----+ 3 maya sales 0 2014-03-04 16:31 /sales-data
HDFS ACLs Default ACLs • hdfs dfs -setfacl -m default:group:execs:r-x /monthly-sales-data • hdfs dfs -mkdir /monthly-sales-data/JAN • hdfs dfs -getfacl /monthly-sales-data/JAN
# file: /monthly-sales-data/JAN
# owner: maya
# group: sales
user::rwx
group::r-x
group:execs:r-x
mask::r-x
other::---
default:user::rwx
default:group::r-x
default:group:execs:r-x
default:mask::r-x
default:other::---
HDFS ACLs Best Practices • Start with traditional HDFS permissions to implement most permission requirements. • Define a smaller number of ACLs to handle exceptional cases. • A file with an ACL incurs an additional cost in memory in the NameNode compared to a file that has only traditional permissions.
HDFS Snapshots • HDFS Snapshots • A snapshot is a read-only point-in-time image of part of the file system • Performance: snapshot creation is instantaneous, regardless of data size or subtree depth • Reliability: snapshot creation is atomic • Scalability: snapshots do not create extra copies of data blocks • Useful for protecting against accidental deletion of data • Example: Daily Feeds
hdfs dfs -ls /daily-feeds
Found 5 items
drwxr-xr-x - chris supergroup 0 2014-10-13 14:36 /daily-feeds/2014-10-13
drwxr-xr-x - chris supergroup 0 2014-10-13 14:36 /daily-feeds/2014-10-14
drwxr-xr-x - chris supergroup 0 2014-10-13 14:37 /daily-feeds/2014-10-15
drwxr-xr-x - chris supergroup 0 2014-10-13 14:37 /daily-feeds/2014-10-16
drwxr-xr-x - chris supergroup 0 2014-10-13 14:37 /daily-feeds/2014-10-17
HDFS Snapshots • Create a snapshot after each daily load
hdfs dfsadmin -allowSnapshot /daily-feeds
Allowing snaphot on /daily-feeds succeeded
hdfs dfs -createSnapshot /daily-feeds snapshot-to-2014-10-17
Created snapshot /daily-feeds/.snapshot/snapshot-to-2014-10-17
• User accidentally deletes data for 2014-10-16
hdfs dfs -ls /daily-feeds
Found 4 items
drwxr-xr-x - chris supergroup 0 2014-10-13 14:36 /daily-feeds/2014-10-13
drwxr-xr-x - chris supergroup 0 2014-10-13 14:36 /daily-feeds/2014-10-14
drwxr-xr-x - chris supergroup 0 2014-10-13 14:37 /daily-feeds/2014-10-15
drwxr-xr-x - chris supergroup 0 2014-10-13 14:37 /daily-feeds/2014-10-17
HDFS Snapshots • Snapshots to the rescue: the data is still in the snapshot
hdfs dfs -ls /daily-feeds/.snapshot/snapshot-to-2014-10-17
Found 5 items
drwxr-xr-x - chris supergroup 0 2014-10-13 14:36 /daily-feeds/.snapshot/snapshot-to-2014-10-17/2014-10-13
drwxr-xr-x - chris supergroup 0 2014-10-13 14:36 /daily-feeds/.snapshot/snapshot-to-2014-10-17/2014-10-14
drwxr-xr-x - chris supergroup 0 2014-10-13 14:37 /daily-feeds/.snapshot/snapshot-to-2014-10-17/2014-10-15
drwxr-xr-x - chris supergroup 0 2014-10-13 14:37 /daily-feeds/.snapshot/snapshot-to-2014-10-17/2014-10-16
drwxr-xr-x - chris supergroup 0 2014-10-13 14:37 /daily-feeds/.snapshot/snapshot-to-2014-10-17/2014-10-17
• Restore data from 2014-10-16
hdfs dfs -cp /daily-feeds/.snapshot/snapshot-to-2014-10-17/2014-10-16 /daily-feeds
YARN Application Timeline Server • Stores data about YARN application execution • Generic data • YARN container utilization • Metrics related to containers • Application-specific data • MapReduce jobs and their tasks • Tez DAG execution • Provides CLI for accessing data • Useful for ad-hoc queries or scripted analysis • Provides REST API for accessing data • Consumed by UI front-ends such as Apache Ambari
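As a sketch, before querying a specific entity (as on the next slides), the REST API can list stored entities of a type; this assumes the same default timeline server address and port (8188) used in the queries below:

curl 'http://127.0.0.1:8188/ws/v1/timeline/MAPREDUCE_JOB'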
Querying a Map Reduce Job Entity
curl http://127.0.0.1:8188/ws/v1/timeline/MAPREDUCE_JOB/job_1413405332088_0001
{
  "entity": "job_1413405332088_0001",
  "entitytype": "MAPREDUCE_JOB",
  "events": [
    {
      "eventinfo": {
        "FINISHED_MAPS": 2,
        "FINISHED_REDUCES": 1,
        "FINISH_TIME": 1413405349192,
        "JOB_STATUS": "SUCCEEDED"
      },
      "eventtype": "JOB_FINISHED",
      "timestamp": 1413405349194
    }
  ],
  "relatedentities": {
    "MAPREDUCE_TASK": [
      "task_1413405332088_0001_m_000000"
    ]
  },
  "starttime": 1413405339442
}
Querying a Map Task Entity
curl http://127.0.0.1:8188/ws/v1/timeline/MAPREDUCE_TASK/task_1413405332088_0001_m_000000
{
  "entity": "task_1413405332088_0001_m_000000",
  "entitytype": "MAPREDUCE_TASK",
  "events": [
    {
      "eventtype": "TASK_FINISHED",
      "timestamp": 1413405345253
    },
    {
      "eventinfo": {
        "SPLIT_LOCATIONS": "localhost",
        "START_TIME": 1413405340255,
        "TASK_TYPE": "MAP"
      },
      "eventtype": "TASK_STARTED",
      "timestamp": 1413405340258
    }
  ]
}
Summary • Configuration • Prevent garbage collection issues • Configure for redundancy • Retune configuration in response to metrics • Documentation • End user perspective is crucial • Please consider contributing to Apache Hadoop documentation • HDFS ACLs • Implement fine-grained authorization rules on files • Can protect against accidental file manipulations • HDFS Snapshots • Point-in-time image of part of the filesystem • Useful for restoring to a prior state after accidental file manipulation • YARN Application Timeline Server • Provides generic and application-specific data about YARN application execution • Useful for analyzing cluster usage patterns
Thank you, Q&A