Synthesizing Representative I/O Workloads for TPC-H J. Zhang*, A. Sivasubramaniam*, H. Franke, N. Gautam*, Y. Zhang, S. Nagar * Pennsylvania State University, IBM T.J. Watson, Rutgers University
Outline • Motivation • Related Work • Methodology • Arrival Time • Access Pattern • Request Sizes • Accuracy of synthetic traces • Concluding Remarks
Motivation • I/O subsystems are critical for commercial services and in production environments. • Real applications are essential for system design and evaluation. • TPC-H is a decision-support workload for business enterprises.
Disadvantages of Traces • Not easily obtainable • Can be very large • Difficult to get statistical confidence • Very difficult to change workload behavior • Do not isolate the influence of one parameter • On the other hand, a deeper understanding of the workload can: • Help generate a synthetic workload • Help in system design itself.
What do we need to synthesize? • Inter-arrival times (temporal behavior) of disk block requests. • Access pattern (spatial behavior) of blocks being referenced • Size (volume) of each I/O request.
Related work • Scientific Application I/O behavior • Time-series models for arrivals • Sequentiality/Markov models for access pattern • Commercial/production workloads • Self-similar arrival patterns • Sequentiality in TPC-H/TPC-D • No prior complete synthesis of all three attributes for TPC-H
Our TPC-H Workload • Trace Collection Platform • IBM Netfinity 8-way SMP with 2.5GB memory and 15 disks • Linux 2.4.17 • DB2 UDB EE V7.2 • TPC-H Configuration • Power Run of 22 queries • Partitioning tables across the disks • 30 GB dataset
Validation • Run the original I/O traces and the synthetic traces (generated from the identified characteristics) through DiskSim 2.0 and compare the CDFs of response time • Metrics • RMS: root-mean-square error of the differences between the two CDF curves • nRMS: RMS/m, where m is the average response time of the original trace
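The two metrics above can be sketched as follows. The slides do not say how the CDF curves are sampled, so this sketch assumes the difference is taken between response times at matching percentiles (which keeps RMS in time units, consistent with dividing by the mean response time). The function names and the `points` parameter are hypothetical.

```python
import math

def quantiles(samples, points):
    """Response time at `points` evenly spaced percentiles of the empirical CDF."""
    s = sorted(samples)
    n = len(s)
    return [s[min(n - 1, int((k + 0.5) / points * n))] for k in range(points)]

def rms_and_nrms(original, synthetic, points=100):
    """Return (RMS, nRMS) between two response-time CDFs.

    Assumption: the curves are compared percentile by percentile.
    """
    q_orig = quantiles(original, points)
    q_syn = quantiles(synthetic, points)
    rms = math.sqrt(sum((a - b) ** 2 for a, b in zip(q_orig, q_syn)) / points)
    m = sum(original) / len(original)  # average response time of the original trace
    return rms, rms / m
```

A synthetic trace whose response times are uniformly shifted by one time unit gives an RMS of one unit, and an nRMS equal to that unit divided by the original mean.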
Overall Methodology • Arrival pattern characteristics • Investigate correlations • Time series • Self-similar • iid distributions • Access pattern characteristics • Sequentiality/pseudo sequentiality/randomness • Size characteristics • Investigating correlations between time, space and volume to get final synthesis
Arrival pattern • Statistical analysis • Auto-correlation function (ACF) plots • Show the correlation between the current inter-arrival time and the one x steps away
Correlations seem very weak (<0.15 for 12 queries, and <0.30 for the rest) • Errors with time-series models (AR/MA/ARIMA/ARFIMA) are high • No evidence of self-similarity either • Perhaps iid (independent and identically distributed) is not a bad assumption.
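As a rough illustration of the ACF analysis, the sample autocorrelation of a sequence of inter-arrival times can be computed as below. This is the standard sample estimator, not necessarily the exact tool used for the plots; the function name is hypothetical.

```python
def acf(x, max_lag):
    """Sample autocorrelation of x for lags 0..max_lag.

    Assumes x is not constant (otherwise the variance term is zero).
    """
    n = len(x)
    mean = sum(x) / n
    var = sum((v - mean) ** 2 for v in x)
    return [
        sum((x[i] - mean) * (x[i + k] - mean) for i in range(n - k)) / var
        for k in range(max_lag + 1)
    ]
```

Values close to zero at every non-zero lag are what would support treating the inter-arrival times as iid.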
Fitting distributions • Tried hyper-exponential/normal/Pareto • Used Maximum Likelihood Estimation (normal/Pareto) and Expectation Maximization (hyper-exponential) to estimate distribution parameters • Used the Kolmogorov-Smirnov (K-S) test to measure goodness-of-fit • The maximum distance between the fitted distribution and the original CDF was kept below 0.1
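A minimal sketch of the fit-then-check step, covering only the two closed-form MLE cases (normal and Pareto); the EM fit for the hyper-exponential is omitted. All function names are hypothetical, and the Pareto scale is assumed to be the sample minimum.

```python
import math

def fit_normal(x):
    """MLE for a normal distribution: sample mean and population std."""
    mu = sum(x) / len(x)
    sigma = math.sqrt(sum((v - mu) ** 2 for v in x) / len(x))
    return mu, sigma

def normal_cdf(v, mu, sigma):
    return 0.5 * (1.0 + math.erf((v - mu) / (sigma * math.sqrt(2.0))))

def fit_pareto(x):
    """MLE for a Pareto distribution; assumes positive, non-identical values."""
    xm = min(x)
    alpha = len(x) / sum(math.log(v / xm) for v in x)
    return xm, alpha

def pareto_cdf(v, xm, alpha):
    return 0.0 if v < xm else 1.0 - (xm / v) ** alpha

def ks_distance(x, cdf):
    """Maximum gap between the empirical CDF of x and a fitted CDF."""
    s = sorted(x)
    n = len(s)
    return max(
        max(abs((i + 1) / n - cdf(v)), abs(i / n - cdf(v)))
        for i, v in enumerate(s)
    )
```

A fit would then be accepted when `ks_distance(times, lambda v: normal_cdf(v, mu, sigma))` is below the 0.1 threshold quoted above.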
Access Pattern (Location + Size) • Most studies use sequentiality to describe TPC-H • However, this is not always the case. • [Figure: location vs. arrival time, one panel per category] • Cat1: Q10, Q4, Q14 • Cat2: Q12, Q1, Q3, Q5, Q7, Q8, Q15, Q18, Q19, Q21 • Cat3: Q20, Q9, Q17
Category 1: Intermingling sequential streams • Consider the following: • Run: A strictly sequential set of I/O requests • Stream: A pseudo-sequential set of I/O requests that could be interrupted by another stream. • i.e. a stream could have several runs that are interrupted by runs of other streams.
Run and Stream • An example run of 5 requests: 1-4, 5-8, 9-10, 11-14, 15-18 • A stream (pseudo-sequential) of 4 requests: 1-4, 7-8, 9-12, 11-14 • An example trace: • Stream A: 1-4, 7-8, 9-12, 11-14 • Stream B: 100-104, 105-108, 109-112 • Trace: 1-4, 100-104, 7-8, 9-12, 105-108, 109-112, 11-14
Secondary Attributes • Run Length: # of requests in a run • Run Start location: start sector of run • Stream Length: # of requests in a stream • Inter-stream Jump Distance: spatial separation between start of run and previous request • Intra-stream Jump Distance: spatial separation between successive requests within a stream • Number of active streams (at any instant) • Interference Distance: number of requests between 2 successive requests in a stream • Derive empirical distributions for these from the trace
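To make the run attributes concrete, here is a small sketch that splits a trace into runs and reads off run lengths and run start locations. It assumes each request is a hypothetical `(first_sector, last_sector)` pair and that "strictly sequential" means a request starts exactly one sector after the previous request ends. Stream detection is not attempted, since the slides do not define how loose "pseudo-sequential" may be.

```python
def split_runs(trace):
    """Group consecutive requests of a trace into strictly sequential runs.

    Each request is (first_sector, last_sector); a request extends the
    current run only if it starts right after the previous one ends.
    """
    runs = []
    for start, end in trace:
        if runs and start == runs[-1][-1][1] + 1:
            runs[-1].append((start, end))
        else:
            runs.append([(start, end)])
    return runs

def run_attributes(trace):
    """Return (run lengths, run start locations) for the trace."""
    runs = split_runs(trace)
    return [len(r) for r in runs], [r[0][0] for r in runs]
```

Applied to the example trace from the Run and Stream slide, this yields five runs, two of which contain two requests each.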
Location Synthesis - Q10 (time and size from trace) • LocIID: locations are i.i.d. • LocRUN: incorporate run length distribution and run start location distribution. • LocSTREAM: combine all stream and run statistics.
Request Size • Requests are one of: 64, 128, 192, 256, 320, 384, 448, 512 blocks • But the attributes (location, size, time) are not independent!
Correlations between size and location • [Figure: fraction of requests vs. size, for all requests, run start requests, and within-run requests]
Final Synthesis Methodology (Category 1) • Location: use LocSTREAM to generate start locations. Two kinds of requests: a run start request or a request within a run • Time: use Pr(inter-arrival time | run start requests) and Pr(inter-arrival time | within a run requests) to generate times. • Size: • For run start request, use Pr(size | inter-arrival times of run start requests) to generate sizes. • For within a run requests, use Pr(size | within a run requests) to generate sizes.
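The conditional sampling described above might look like the following sketch, which draws from empirical conditional distributions stored as plain lists. The table layout, the `bin_width` used to discretize inter-arrival times, and all names are assumptions made for illustration; the slides do not specify how the conditioning on inter-arrival time is implemented.

```python
import random

def synthesize_request(kind, time_given_kind, size_given_time_bin,
                       size_within_run, bin_width, rng):
    """Draw (inter-arrival time, size) for one request.

    kind: "run_start" or "within_run" (decided by the location synthesis).
    time_given_kind: dict kind -> list of observed inter-arrival times.
    size_given_time_bin: dict time bin -> list of observed run start sizes.
    size_within_run: list of observed sizes for within-run requests.
    """
    dt = rng.choice(time_given_kind[kind])
    if kind == "run_start":
        # Assumption: inter-arrival times are binned to form the condition.
        size = rng.choice(size_given_time_bin[int(dt // bin_width)])
    else:
        size = rng.choice(size_within_run)
    return dt, size
```

Sampling from raw observation lists reproduces the empirical distributions directly; a compact histogram would trade some accuracy for lower storage.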
Can be easily adapted for Category 2 (strictly sequential) and Category 3 (random) queries. • Validation: compare the response time characteristics of the synthesized and real traces.
Storage Requirements • [Figure: nRMS vs. storage fraction (x0.001), two panels]
Contributions • A synthesis methodology to capture • Intermingling streams of requests • Exploiting correlations between request attributes • An application of this methodology to TPC-H • Along the way (for TPC-H), • iid can capture arrival time characteristics • Strict sequentiality is not always the case
LocSTREAM • Use Pr(stream length) to generate stream lengths. • Use Pr(run length | stream length) to generate run lengths for each stream length. • Generate start location for each run: • Use Pr(inter-stream jump dist.) to generate the start location of the first run in the stream. • Use Pr(intra-stream jump distance | this stream) to generate other runs’ start location in this stream. • Use Pr(interference distance) to interleave all streams.
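The first three steps of LocSTREAM (stream length, run lengths, run start locations) can be sketched for a single stream as below. This is a simplified illustration under several assumptions: the empirical distributions are lists of observed values, an intra-stream jump is measured from the previous run's start, and the final interleaving of streams by interference distance is left out. All names are hypothetical.

```python
import random

def generate_stream(stream_lengths, run_lengths_by_stream_length,
                    inter_jumps, intra_jumps, prev_location, rng):
    """Return [(run_start_location, run_length), ...] for one stream."""
    length = rng.choice(stream_lengths)          # Pr(stream length)
    # First run: jump from the previous request (inter-stream jump distance).
    location = prev_location + rng.choice(inter_jumps)
    runs = []
    remaining = length
    while remaining > 0:
        # Pr(run length | stream length), clipped to the requests left.
        drawn = rng.choice(run_lengths_by_stream_length[length])
        run_len = max(1, min(remaining, drawn))
        runs.append((location, run_len))
        remaining -= run_len
        # Later runs: intra-stream jump (assumed relative to the run start).
        location += rng.choice(intra_jumps)
    return runs
```

The run lengths of a generated stream always sum to the sampled stream length, so the stream-level and run-level statistics stay consistent with each other.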