A Cloud Data Center Optimization Approach using Dynamic Data Interchanges
Prof. Stephan Robert, http://www.stephan-robert.ch, University of Applied Sciences of Western Switzerland
IEEE CloudNet, San Francisco, November 2013
Motivation and background • Distributed datacenters in the Cloud have become a popular way to increase data availability and reduce costs • Cloud storage has received a lot of attention with a view to reducing costs: • Minimizing infrastructure and running costs • Allocating data servers to customers • Geo-optimization (using customer locations to decide where to place datacenters)
Datacenter optimization • Research areas on optimizing datacenter operations: • Energy and power management • Cost benefit analysis • Cloud networks versus Grids • Geo-distribution of cloud centers • Multi-level caching
Motivation and background (cont.) • We consider the operational situation once the datacenter locations have been decided. • Is there any other optimization we can perform? • Problem we examine: • Data locality: users are not always near the data -> higher costs • The situation can change over time: we can place our data near the users now, but there is no guarantee this will still be the right placement in the future
Principal idea • We consider a model for actively moving data closer to the current users. • When needed, we move data from one server to a temporary (cache) area in a different server. • In the near future, when users request this particular data, we can serve them from the local cache.
Benefits • Benefits of copying (caching) data to a local server: • We correct the mismatch between where the data is and where the users are. • We copy only once (cost) and read many times (benefit). • We ‘train’ the algorithm using a history of requests to determine the relative frequency with which items are requested (efficiently, since the number of items can be very large).
Model • We consider a combinatorial optimization model to determine the best placement of the data • This model will tell us if we need to copy data from one datacenter to another, in anticipation of user requests. • The optimization aim is to minimize the total expected cost of serving the future user data requests • The optimization constraints are the cache size capacities. • The model accounts for: • The cost of copying data between datacenters • The relative cost/benefit of delivering the data from a remote vs. a local server • The likelihood that particular data will be requested in particular locations in the near future
Model • p: probability that object i will be requested by user u • Cost of copying object i from its default datacenter to another datacenter d • Expected cost of retrieving object i from datacenter d • Decision: whether object i is obtained (cached) at datacenter d • Constraint: the cache size Z of each datacenter must not be exceeded • Constraint: each object must be available in at least one datacenter • (A sketch of one possible formulation follows below.)
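For concreteness, here is a minimal sketch of how such a formulation could be written. The notation is assumed rather than taken from the paper: x_{id} indicates that a copy of object i is held at datacenter d, y_{uid} indicates that user u's requests for object i are served from d, p_{ui} is the request probability, c_{id} the copying cost, r_{id} the expected retrieval cost, s_i the object size, and Z_d the cache capacity. The paper's exact symbols and objective may differ.

\begin{align*}
\min_{x,\,y}\quad & \sum_{i,d} c_{id}\,x_{id} \;+\; \sum_{u,i,d} p_{ui}\,r_{id}\,y_{uid} \\
\text{s.t.}\quad & y_{uid} \le x_{id} && \forall u,i,d && \text{(serve only from a datacenter holding a copy)} \\
& \textstyle\sum_{d} y_{uid} = 1 && \forall u,i && \text{(each potential request is assigned to one datacenter)} \\
& \textstyle\sum_{i} s_i\,x_{id} \le Z_d && \forall d && \text{(cache capacity not exceeded)} \\
& \textstyle\sum_{d} x_{id} \ge 1 && \forall i && \text{(object available in at least one datacenter)} \\
& x_{id},\,y_{uid} \in \{0,1\}
\end{align*}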
Operational aspects • First, we obtain a historical log of requests, including who requested what, where the file was located, and the file size. • We use this information to calculate the access probabilities in the model (in practice, using HBase/Hadoop in a distributed manner; see the sketch below). • The costs in the model have to be decided based on the architecture, etc. (e.g., the relative benefit of serving a particular user from a local server versus a remote one). • Periodically (e.g., daily) we run the algorithm to determine any data duplication that is beneficial. • (Of course, the network must be aware of the local copies and know to use them.)
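As an illustration of the probability-estimation step, a minimal Python sketch follows. The in-memory log format and function name are assumptions for illustration; the actual system computes these counts in a distributed manner on Hadoop/HBase.

from collections import Counter, defaultdict

def estimate_access_probabilities(request_log):
    """Estimate p[user][object] from a historical request log.

    request_log: iterable of (user_id, object_id) pairs; in the real
    system this would come from a distributed log, not from memory.
    """
    counts = defaultdict(Counter)
    totals = Counter()
    for user_id, object_id in request_log:
        counts[user_id][object_id] += 1
        totals[user_id] += 1

    # Relative request frequency per user; only the most frequently
    # requested objects need to be kept to limit the optimizer's input size.
    return {
        user: {obj: c / totals[user] for obj, c in per_user.items()}
        for user, per_user in counts.items()
    }

# Example usage with a toy log
log = [("u1", "objA"), ("u1", "objA"), ("u1", "objB"), ("u2", "objA")]
p = estimate_access_probabilities(log)
print(p["u1"]["objA"])  # 0.666...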
Computational experimentation • Computational experimentation carried out in a simulation environment (no real-life implementation at this stage) • We measured the costs/benefits of obtaining the data directly against using our optimization model to ‘rearrange’ the data periodically • Consistent performance for 3, 5, 10 datacenters.
Computational experimentation • Setup of N datacenters located on a circle • Users placed at random inside the circle • Costs linked to the distance • Data object requests generated from a Zipf distribution (independently for each user) • The first half of the data is used to train the algorithm (historic access log), the second half is used for the simulation. • (A toy version of this setup is sketched below.)
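The following Python sketch reproduces the described setup in a toy form (datacenters on a circle, users at random inside it, Zipf-distributed requests, distance-based costs). The parameter values and function name are illustrative assumptions, not the paper's exact configuration.

import numpy as np

def simulate_setup(n_datacenters=5, n_users=100, n_objects=1000,
                   n_requests=10000, zipf_a=1.2, seed=0):
    """Toy version of the experimental setup."""
    rng = np.random.default_rng(seed)

    # Datacenters placed evenly on the unit circle
    angles = 2 * np.pi * np.arange(n_datacenters) / n_datacenters
    datacenters = np.stack([np.cos(angles), np.sin(angles)], axis=1)

    # Users placed uniformly at random inside the circle
    r = np.sqrt(rng.uniform(0, 1, n_users))
    theta = rng.uniform(0, 2 * np.pi, n_users)
    users = np.stack([r * np.cos(theta), r * np.sin(theta)], axis=1)

    # Cost of serving a user from a datacenter ~ Euclidean distance
    costs = np.linalg.norm(users[:, None, :] - datacenters[None, :, :], axis=2)

    # Each user draws object requests independently from a Zipf distribution
    requests = [(int(u), int(rng.zipf(zipf_a)) % n_objects)
                for u in rng.integers(0, n_users, n_requests)]

    # First half trains the model (historic log), second half is simulated
    split = len(requests) // 2
    return datacenters, users, costs, requests[:split], requests[split:]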
Simulation results • Promising results with ~20% cost reduction on average • Full results appear in the proceedings paper
Practicalities – is the idea feasible in a real system? • More complexities, but also straightforward solutions • Time criticality: no need to run on the live system; object locations can be optimized overnight (“periodic dynamic reconfiguration”) • Metadata storage: need to store object access frequencies to calculate the probabilities p. Implemented metadata storage in HBase on a Hadoop cluster (see the sketch below). -> conclusion: “feasible and easy”
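As a rough illustration of what such metadata storage could look like, here is a sketch using the happybase Python client for HBase. The table name 'object_access', the column family 'cf', and the row-key layout are assumptions for illustration only, not the schema used in the actual implementation.

import happybase  # thin Python client for HBase (Thrift-based)

def record_request(connection, user_id, object_id):
    """Increment the access counter for (object, user).

    Assumed layout: row key = object id, one counter column per user
    in column family 'cf'."""
    table = connection.table('object_access')
    table.counter_inc(object_id.encode(), b'cf:' + user_id.encode())

def read_counts(connection, object_id):
    """Return per-user access counts for one object."""
    table = connection.table('object_access')
    row = table.row(object_id.encode())
    # HBase counters are stored as 8-byte big-endian integers
    return {col.split(b':', 1)[1].decode(): int.from_bytes(val, 'big')
            for col, val in row.items()}

# Hypothetical usage:
# connection = happybase.Connection('hbase-host')
# record_request(connection, 'u1', 'objA')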
Complexity issues • The optimization problem is NP-hard. • We can keep the input size small: we only need to consider the most popular objects. • Currently developing a fast heuristic algorithm based on knapsack methods (see the sketch below). • Standard problems of data • Other complexities: legal issues of moving data across countries (if personal data are involved)
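In that spirit, below is a minimal sketch of a greedy, knapsack-style heuristic for filling one datacenter's cache. Scoring by expected saving per unit of cache space is the classic knapsack approximation and is an assumption here, not necessarily the heuristic being developed.

def greedy_cache_fill(objects, capacity):
    """Greedy knapsack-style heuristic for one datacenter's cache.

    objects: list of (object_id, size, expected_saving), where expected_saving
    is the probability-weighted cost reduction from holding a local copy,
    net of the one-off copying cost.
    capacity: cache size Z of the datacenter.
    Returns the set of object ids to copy into the cache."""
    # Rank by saving per unit of cache space (value density)
    ranked = sorted(objects, key=lambda o: o[2] / o[1], reverse=True)
    chosen, used = set(), 0
    for object_id, size, saving in ranked:
        if saving > 0 and used + size <= capacity:
            chosen.add(object_id)
            used += size
    return chosen

# Example: three candidate objects, cache capacity 10
print(greedy_cache_fill([("a", 4, 8.0), ("b", 6, 5.0), ("c", 5, 9.0)], 10))
# -> {'a', 'c'}  (densities 2.0 and 1.8 beat 0.83)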
Thank you Questions?