BALANCED DATA LAYOUT IN HADOOP
CPS 216: Kyungmin (Jason) Lee, Ke (Jessie) Xu, Weiping Zhang
Background
• How data is laid out on HDFS affects Hadoop MapReduce performance
• Map phase: performance drops when a mapper must fetch its input data from a remote node across the network
• Imbalance compounds across a MapReduce workflow (the output of one job becomes the input to the next), making the problem even worse
• Project goal: minimize the need to fetch map input across the network by balancing input data across nodes
Previous Work
• Reactive solution: HDFS Rebalancer
  • Rebalances the existing HDFS data layout based on storage utilization
  • Reacts to imbalance that already exists; we would like a way to prevent it altogether
• Proactive solution: round-robin (RR) block placement policy (sketched below)
  • On each HDFS write, the target node is chosen in round-robin fashion, so data is guaranteed to stay balanced
  • But it causes unnecessary writes across the network; can we do better?
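A minimal sketch of the round-robin idea, assuming a hypothetical chooseTarget helper over plain node names; this is not the real HDFS BlockPlacementPolicy API:

```java
import java.util.List;

/** Hypothetical sketch of RR placement: a cursor rotates through the
 *  cluster on every write, ignoring where the writer is located. */
class RoundRobinChooser {
    private int cursor = 0;

    /** Pick the next target node, cycling through all live nodes. */
    String chooseTarget(List<String> liveNodes) {
        String target = liveNodes.get(cursor % liveNodes.size());
        cursor++;
        return target;
    }
}
```

The guarantee of balance comes at a cost: roughly (N-1)/N of all writes leave the local node, which is the network traffic the balanced policy tries to avoid.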
Balanced Block Placement Policy
• Write 'greedily' as long as the cluster is 'fairly balanced'
  • 'Greedily' = prioritize target nodes by location: local node > node on same rack > remote node
  • 'Fairly balanced' = the HDFS usage of every node falls within a specified range (windowSize)
• Algorithm (see the sketch after this list):
  • Sort live nodes by HDFS space used; threshold = max - windowSize
  • 1st replica: write to the local node if it is below the threshold, or if all nodes are above the threshold; otherwise write to the least utilized node
  • 2nd replica: least utilized node on a different rack (if possible) than the 1st replica
  • 3rd replica and beyond: least utilized remaining node
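A minimal sketch of the algorithm above; the Node type, its fields, and the chooseTargets signature are hypothetical stand-ins, not HDFS's actual BlockPlacementPolicy interface:

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

/** Hypothetical sketch of the balanced placement algorithm. */
class BalancedChooser {
    static class Node {
        String name; String rack; long hdfsUsed;
        Node(String name, String rack, long used) {
            this.name = name; this.rack = rack; this.hdfsUsed = used;
        }
    }

    private final long windowSize;  // allowed spread in bytes between nodes

    BalancedChooser(long windowSize) { this.windowSize = windowSize; }

    /** Choose targets for one block; local is assumed to be in liveNodes,
     *  and liveNodes is assumed to hold at least numReplicas nodes. */
    List<Node> chooseTargets(Node local, List<Node> liveNodes, int numReplicas) {
        // Sort live nodes by HDFS space used, least utilized first.
        List<Node> sorted = new ArrayList<>(liveNodes);
        sorted.sort(Comparator.comparingLong((Node n) -> n.hdfsUsed));
        long max = sorted.get(sorted.size() - 1).hdfsUsed;
        long threshold = max - windowSize;
        boolean allAboveThreshold = sorted.get(0).hdfsUsed >= threshold;

        List<Node> chosen = new ArrayList<>();
        // 1st replica: stay local if the local node is within the window,
        // or the whole cluster is above threshold; else least utilized node.
        Node first = (local.hdfsUsed < threshold || allAboveThreshold)
                ? local : sorted.get(0);
        chosen.add(first);
        sorted.remove(first);

        // 2nd replica: least utilized node on a different rack, if possible.
        if (numReplicas > 1) {
            Node second = sorted.stream()
                    .filter(n -> !n.rack.equals(first.rack))
                    .findFirst().orElse(sorted.get(0));
            chosen.add(second);
            sorted.remove(second);
        }
        // 3rd replica and beyond: least utilized remaining nodes.
        while (chosen.size() < numReplicas && !sorted.isEmpty()) {
            chosen.add(sorted.remove(0));
        }
        return chosen;
    }
}
```

Re-sorting on every block write keeps the sketch simple; a production policy would track per-node usage incrementally.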
Test Workloads
• 4-node cluster
• Default Policy (DP) vs. Balanced Policy (BP)
• 2 MapReduce jobs:
  • Balanced sort (each reducer produces approximately the same output size)
  • Skewed sort (skewed reducer output sizes)
• 2 workloads:
  • Single run, varying the number of reducers (1, 2, 4, 10, 12)
  • Cascaded workflow: 3 sorts in series, 10 reducers each
• Other parameters:
  • Replication factor (RF): 1 and 3
  • Speculative execution on and off (SE vs. NSE)
• Balance metric: standard deviation of the amount of data written per node; a higher StdDev implies more imbalance (computed as in the sketch below)
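For reference, the balance metric is the ordinary population standard deviation over per-node byte counts; a hypothetical helper might compute it like this:

```java
import java.util.List;

/** Hypothetical helper: StdDev of bytes written per node. */
class BalanceMetric {
    static double stdDev(List<Long> bytesPerNode) {
        double mean = bytesPerNode.stream()
                .mapToLong(Long::longValue).average().orElse(0);
        double variance = bytesPerNode.stream()
                .mapToDouble(b -> (b - mean) * (b - mean))
                .average().orElse(0);
        return Math.sqrt(variance);  // higher StdDev = more imbalance
    }
}
```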
Balanced Sort, Single Run
• DP very skewed for fewer than 4 reducers, as expected
• Otherwise both policies stay well balanced, also as expected
Skewed Sort, Single Run
• DP significantly worse than BP
• RF = 3 shows better balance than RF = 1
• Disabling speculative execution improves balance under BP
Performance
• No significant overhead or improvement observed
Speculative Execution
• Hadoop performance feature: runs the same task on 2 nodes concurrently, keeps the output of whichever finishes first, and discards the other
• Usually kicks in toward the end of a job, causing unintended data imbalance under the balanced policy
• Turning off speculative execution improved data balance (see the snippet below), but in practice we would like to keep the feature on for its performance boost
• Our policy is too greedy; a policy under which each node writes approximately equally to all nodes (round robin) is less affected
• Hybrid policy, where some nodes run round robin and some run the balanced policy? A tradeoff between balance and network traffic
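For reproducibility, one way to switch off speculative execution with the old mapred API, as we did for the NSE runs; exact property names vary across Hadoop versions, so treat this as a sketch:

```java
import org.apache.hadoop.mapred.JobConf;

public class DisableSpeculation {
    public static JobConf configure() {
        JobConf conf = new JobConf();
        // Equivalent to setting mapred.map.tasks.speculative.execution=false
        conf.setMapSpeculativeExecution(false);
        // Equivalent to mapred.reduce.tasks.speculative.execution=false
        conf.setReduceSpeculativeExecution(false);
        return conf;
    }
}
```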
Future Considerations
• The current implementation assumes data stays balanced throughout the cluster's lifetime
  • What if some nodes are down for a period of time and data becomes imbalanced?
• Should each job's output be spread evenly, rather than just the overall data layout?
  • Would need additional knowledge of which job each write belongs to
• Effect of window size on balance and performance?
  • Unable to test due to insufficient funds
Conclusion
• Implemented a new block placement policy that maintains data balance while keeping writes local as much as possible
• Test data showed success at maintaining data balance, with the greatest improvements on skewed outputs
• Performance was not affected; we would expect an improvement for skewed datasets, given the reduction in network usage
• Only tested on a small cluster with small datasets; should be more effective on large datasets
• Data balance weakened by speculative execution; in practice our policy should be tweaked to get the best performance results