A case for resource discovery in shared distributed platforms
David Oppenheimer
UCB ROC Retreat, 12 January 2005
Introduction
• Application performance is a function of
  • resources available to the application
  • resources needed by the application
    • or, “application sensitivity to resource constraints”
• At summer retreat, described SWORD
  • at app deployment time, find best set of nodes given
    • resources available on a set of distributed nodes
    • application sensitivity to resource constraints
  • assumptions
    • (1) available resources vary among nodes enough to matter
      • spare CPU, mem, disk space; inter-node latency, avail. bw; ...
    • (2) applications are sensitive to resource constraints enough to matter
• Focus of this talk: verify assumption (1)
Introduction (cont.)
• Questions we will address
  • is there enough variation among nodes at any given (deployment) time to justify service placement?
  • is there enough variation over time on a single node to justify periodic task migration?
  • are there correlations between attributes on a single node, or among nodes at the same site?
• All of these questions are important in designing a system for resource discovery and service placement (like SWORD)
Outline
• How much does the available amount of per-node resources vary among nodes at a fixed time?
• How much does the available amount of per-node resources vary over time? How much do inter-node latency and available bandwidth vary over time?
• On a given node, are any per-node attributes strongly correlated? Are inter-node latency and available bandwidth correlated?
Experimental environment
• Per-node attributes: Ganglia, CoMon
  • two-week period (Oct 10-Oct 24, 2004)
  • each node polled every 5 minutes
  • free memory, free swap, free disk, load average, network bytes sent and received/sec, # active slices
• Inter-node latency: all-pairs pings
  • one-month period ending Oct 24, 2004
  • each pair of nodes measured every 15 minutes
• Inter-node bandwidth: Iperf
  • one-month period ending Oct 24, 2004
  • each pair of nodes measured 1-2x/week
• About 250 nodes in the trace each day
Resource heterogeneity: averages
• How much do available resources vary over the trace?
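The heterogeneity question can be made concrete with a small summary over one poll of all nodes. This is only a sketch: the slide does not say which statistics were plotted, so the coefficient of variation (CV) and the `snapshot_spread` helper are assumptions, and the readings are made-up values rather than trace data.

```python
import statistics

def snapshot_spread(values):
    """Summarize how much one attribute varies across nodes at one instant."""
    ordered = sorted(values)
    mean = statistics.fmean(ordered)
    stdev = statistics.pstdev(ordered)
    return {
        "min": ordered[0],
        "median": statistics.median(ordered),
        "max": ordered[-1],
        # CV = stdev / mean; an assumed, unit-free measure of spread.
        "cv": stdev / mean if mean else float("nan"),
    }

# Hypothetical free-memory readings (MB) from five nodes at the same poll.
free_mem_mb = [100.0, 200.0, 300.0, 400.0, 500.0]
summary = snapshot_spread(free_mem_mb)
print(summary)
```

A large CV, or a wide gap between min and max, would suggest that node choice at deployment time matters for that attribute.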
Variability of per-node attributes over time
• Can rank degree of variability of each attribute
  • disk, swap < mem, load < net bytes; # slices moderate to significant
• CDF curve shifts to right as interval length increases
  • attributes vary less over short time periods than long
  • migration interval: find “sweet spot” in curve of variability vs. interval length
• CDF slope decreases as median variability of attribute increases
  • may be able to classify nodes as high/low variability over time for mem, load, net bytes (they have high median variability)
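One way to produce a variability-vs-interval curve like the one described here is sketched below. The slides do not name the variability statistic, so this assumes the coefficient of variation over non-overlapping windows; `median_variability`, `empirical_cdf`, and the two traces are hypothetical.

```python
import statistics

def window_cv(samples):
    """Coefficient of variation of one window (assumed variability measure)."""
    mean = statistics.fmean(samples)
    return statistics.pstdev(samples) / mean if mean else 0.0

def median_variability(series, window):
    """Median CV over consecutive non-overlapping windows of `window` samples."""
    cvs = [window_cv(series[i:i + window])
           for i in range(0, len(series) - window + 1, window)]
    return statistics.median(cvs)

def empirical_cdf(values):
    """Return sorted (value, cumulative fraction) points for plotting a CDF."""
    ordered = sorted(values)
    n = len(ordered)
    return [(v, (i + 1) / n) for i, v in enumerate(ordered)]

# Hypothetical load-average traces, one sample per polling interval.
traces = {
    "node-a": [1.0, 1.0, 1.0, 1.0, 2.0, 2.0, 2.0, 2.0],
    "node-b": [1.0, 3.0, 1.0, 3.0, 1.0, 3.0, 1.0, 3.0],
}
short = {n: median_variability(s, 2) for n, s in traces.items()}
long_ = {n: median_variability(s, 8) for n, s in traces.items()}
cdf_short = empirical_cdf(short.values())
cdf_long = empirical_cdf(long_.values())
```

Repeating this for several window lengths gives one CDF per interval length; a candidate migration interval would sit where lengthening the window stops adding much variability.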
Inter-node latency and BW variation over time
• Most nodes have low latency (and bw) variability even over a month-long trace
  • migration may not be worthwhile
Correlation among per-node attributes
• No strong correlations between different attributes
  • though some one-hour trace segments had some
• Some correlation between nodes at same site
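A pairwise check of this kind can be sketched with a plain Pearson coefficient over each pair of attributes sampled on one node. The slides do not state which correlation measure was used, so Pearson is an assumption here, and the attribute names and values are illustrative only.

```python
import math
from itertools import combinations

def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sample lists."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    if vx == 0 or vy == 0:
        return float("nan")  # a constant series has no defined correlation
    return cov / math.sqrt(vx * vy)

def attribute_correlations(samples):
    """Correlate every pair of attributes; `samples` maps name -> readings."""
    return {(a, b): pearson(samples[a], samples[b])
            for a, b in combinations(sorted(samples), 2)}

# Hypothetical readings from one node at four consecutive polls.
node_samples = {
    "free_mem": [4.0, 3.0, 2.0, 1.0],
    "load": [1.0, 2.0, 3.0, 4.0],
    "free_disk": [5.0, 5.1, 4.9, 5.0],
}
corr = attribute_correlations(node_samples)
for pair, r in corr.items():
    print(pair, round(r, 2))
```

The same function applies to the site-level question by correlating one attribute across two nodes at the same site instead of two attributes on one node.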
Correlation between latency and avail. BW (r = -0.59)
• Moderate inverse power law correlation
• Using latency to estimate BW gives 233% error
  • some nodes are bandwidth-capped, some in weird ways
• Some node pairs showed strong lat-BW correlation
  • 17% within 25%, 56% within 50%
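An inverse power-law relationship of this kind can be estimated with a least-squares line in log-log space, after which the estimate can be scored by the fraction of pairs predicted within a tolerance. This is a sketch under assumptions: the fitting method, the `fit_power_law` and `fraction_within` helpers, and the sample values are not taken from the talk.

```python
import math

def fit_power_law(latencies, bandwidths):
    """Least-squares fit of bw ~ c * latency**k in log-log space; returns (c, k)."""
    lx = [math.log(v) for v in latencies]
    ly = [math.log(v) for v in bandwidths]
    n = len(lx)
    mx, my = sum(lx) / n, sum(ly) / n
    k = (sum((x - mx) * (y - my) for x, y in zip(lx, ly))
         / sum((x - mx) ** 2 for x in lx))
    c = math.exp(my - k * mx)
    return c, k

def fraction_within(latencies, bandwidths, c, k, tolerance):
    """Fraction of pairs whose predicted BW is within `tolerance` (relative) of measured."""
    hits = sum(1 for lat, bw in zip(latencies, bandwidths)
               if abs(c * lat ** k - bw) / bw <= tolerance)
    return hits / len(latencies)

# Hypothetical node pairs that follow bw = 1000 / latency exactly.
lat_ms = [10.0, 20.0, 50.0, 100.0]
bw_mbps = [100.0, 50.0, 20.0, 10.0]
c, k = fit_power_law(lat_ms, bw_mbps)

# A bandwidth-capped pair (second entry) breaks the prediction badly.
capped = fraction_within([10.0, 10.0], [100.0, 10.0], c, k, 0.25)
```

A negative exponent `k` corresponds to the inverse relationship; pairs with bandwidth caps fall far off the fitted curve, which is one plausible source of large estimation error.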
Conclusion
• How much does the available amount of per-node resources vary among nodes at a fixed time?
  • significantly; enough to warrant service placement
• How much does the available amount of per-node resources vary over time? How much do inter-node latency and available bandwidth vary over time?
  • moderate variability; may warrant migration
• On a given node, are any per-node attributes strongly correlated? Are inter-node latency and available bandwidth correlated?
  • no strong correlation between different attributes
  • some correlation for the same attribute across nodes at the same site
  • latency can predict avail. bandwidth
Future work
• Ask same questions but use application model to answer, rather than analysis of raw data
  • different apps have different resource sensitivities
  • different apps have different migration costs
• Can we predict attribute values?
  • give warning before migration
  • or just don’t bother to deploy on “bad” nodes
• How much “better” could we do if SWORD could schedule jobs?