
DuraStore – Achieving Highly Durable Data Centers

Design a durable data center storage system that minimizes data loss under correlated failure events, improving durability without sacrificing load balance or scalability. The project spans analysis, modeling, design, implementation, and testing tasks.


Presentation Transcript


  1. DuraStore – Achieving Highly Durable Data Centers
  Presenter: Wei Xie

  2. Data-Intensive Scalable Computing Laboratory
  • Mission statement
    • Broad research interests in parallel and distributed computing, high-performance computing, and cloud computing, with a focus on building scalable computing systems for data-intensive applications
  • Projects (over $2M in external grants in the past 5 years, from NSF/DOE/DOD)
    • OpenSoC HPC: Open Source, Extensible High Performance Computing Platform (DOD/LBNL)
    • Compute on Data Path: Combating Data Movement in HPC (NSF)
    • Unistore: A Unified Storage Architecture (NSF-IUCRC/Nimboxx)
    • Development of a Data-Intensive Scalable Computing Instrument (NSF)
    • Decoupled Execution Paradigm for Data-Intensive High-End Computing (NSF)
    • Active Object Storage for Big Data Applications (DOE/ANL)
  • Faculty/staff members: Dr. Yong Chen, Dr. Yu Zhuang, Dr. Dong Dai, Dr. Jiang Zhou
  • Ph.D. student members: Mr. Ghazanfar Ali, Ms. Elham Hojati, Mr. John Leidel, Ms. Neda Tavakoli, Mr. Xi Wang, Mr. Wei Xie, Mr. Wei Zhang
  • Masters/UG student members: Ms. Priyanka Kumari, Mr. Ali Nosrati, Mr. Yuan Cui, Mr. Frank Conlon, Ms. Yan Mu, Mr. Zachary Hansen, Mr. Yong Wu, Mr. Alex Weaver
  • Website: http://discl.cs.ttu.edu/

  3. Outline
  • Project Goals, Motivations, Challenges
  • Project Overview
  • Project Team Members
  • Background and Related Research
  • Overview of Project Tasks
  • Activities and Outcomes
  • Deliverables and Benefits
  • LIFE Form Input

  4. Project Goals, Motivations, Challenges
  • Project goals
    • Design a data center storage system that is significantly more durable under correlated failure events
  • Motivations
    • Correlated failure events (e.g., a site-wide power outage) take down many nodes at once
    • Random replication, used in mainstream storage systems, has a high probability of data loss under such events
    • Recovering data lost to correlated failures is costly
    • Both industry and research groups have reported this problem
  • Challenges
    • Recently emerged durability-aware data replication schemes improve durability but sacrifice load balance, scalability, and many other features of storage systems

  5. Project Overview
  • Core component: durability-aware data replication/erasure coding
    • Uses copysets and combinatorial theory to minimize the probability of data loss under correlated failures (see the sketch below)
    • Compatible with existing data replication/erasure coding schemes as a plug-in replacement
    • Data store software and storage infrastructure must be modified for compatibility
  • [Architecture diagram: durability-aware data replication/erasure coding sits between a durable distributed/parallel data store (HDFS, GFS, RAMCloud, Ceph) and the durable storage cluster]
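
  To illustrate the copyset idea, below is a minimal sketch of copyset-style replica placement in Python. This is not the project's code: the node count, the replication factor r=3, and the two-permutation scatter width are illustrative assumptions.

      import random
      import zlib

      def build_copysets(nodes, r, permutations=1):
          """Partition the nodes into copysets of size r.

          Each permutation contributes len(nodes) // r copysets; more
          permutations increase scatter width (faster recovery) at the
          cost of more distinct copysets (higher loss probability).
          """
          copysets = []
          for _ in range(permutations):
              shuffled = nodes[:]
              random.shuffle(shuffled)
              copysets += [tuple(sorted(shuffled[i:i + r]))
                           for i in range(0, len(shuffled) - r + 1, r)]
          return copysets

      def place_chunk(chunk_id, copysets):
          """Deterministically map a chunk to one copyset (its replica set)."""
          return copysets[zlib.crc32(chunk_id.encode()) % len(copysets)]

      nodes = list(range(12))                    # 12 storage nodes (illustrative)
      csets = build_copysets(nodes, r=3, permutations=2)
      print(place_chunk("chunk-42", csets))      # e.g., (2, 7, 11)

  Because every chunk's replicas land in one of a small, fixed set of copysets, a correlated failure must knock out an entire copyset to lose data, rather than any r nodes.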

  6. Project Team Members
  • Faculty
    • Dr. Yong Chen, Assistant Professor of CS, Texas Tech University
    • Dr. Jiang Zhou, Postdoctoral Researcher in CS, Texas Tech University
  • Students
    • Wei Xie, CS Ph.D. Candidate, Texas Tech University
  • Expertise
    • Storage systems
    • Parallel/distributed file systems
    • Cloud software stack
    • Data models and data management

  7. Background and Related Research
  • Correlated failures in data centers
    • When a cluster-wide failure event such as a power outage occurs, about 0.5–1% of the nodes fail to reboot (as reported by Yahoo! and LinkedIn)
    • It takes a long time to recover the lost data (locating the lost data is itself slow)
  • Copyset data replication scheme
    • A copyset is the set of servers that a data chunk is replicated to
    • The total number of unique copysets in the system determines the data loss probability under a correlated failure (a worked example follows this slide)
    • Minimizing the total number of unique copysets can dramatically reduce the data loss probability (from nearly 100% to about 0.1%)
  • Geo-replication
    • Replication to a remote site to achieve better durability
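
  To see why the copyset count matters, here is a back-of-the-envelope model (my own sketch with illustrative numbers, not from the slides). If f of n nodes fail together, a given copyset of size r lies entirely inside the failed set with probability C(f, r) / C(n, r); treating copysets as independent gives an approximate loss probability:

      from math import comb

      def loss_probability(n, r, f, num_copysets):
          """Approximate P(some chunk loses all replicas) when f of n
          nodes fail together; assumes copysets fail independently."""
          p_one = comb(f, r) / comb(n, r)
          return 1 - (1 - p_one) ** num_copysets

      n, r, f = 1000, 3, 10   # 1000 nodes, 3 replicas, 1% fail to reboot

      # Random replication with many chunks approaches the full set of
      # C(n, r) possible copysets, so loss becomes near-certain:
      print(loss_probability(n, r, f, num_copysets=comb(n, r)))  # ~1.0

      # A copyset scheme with only n/r distinct copysets:
      print(loss_probability(n, r, f, num_copysets=n // r))      # ~0.0002

  These illustrative numbers reproduce the order of magnitude of the slide's "nearly 100% vs. about 0.1%" contrast.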

  8. Overview of Project Tasks
  • Task 1: Analyze and model mainstream data stores in terms of data durability
  • Task 2: Design a new data replication or erasure coding scheme
  • Task 3: Apply the proposed data replication or erasure coding scheme to the targeted systems
  • Task 4: Test the durability, and conduct experiments to evaluate the load balance, scalability, and overhead of the new scheme

  9. Task 1: Analysis and Modeling
  • Task 1: Analyze and model mainstream data stores in terms of data durability
    • Data redundancy schemes (replication or erasure coding)
    • Correlated-failure durability model (a simulator sketch follows this slide)
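
  One plausible shape for such a durability model is a Monte Carlo simulator: repeatedly sample a correlated failure and check whether any chunk's full replica set falls inside it. This is a sketch of my own, not the project's simulator; the cluster size and failure fraction are illustrative.

      import random

      def simulate_loss_rate(copysets, n_nodes, fail_frac, trials=10000):
          """Estimate P(data loss) when fail_frac of the nodes die together."""
          f = max(1, int(n_nodes * fail_frac))
          losses = 0
          for _ in range(trials):
              failed = set(random.sample(range(n_nodes), f))
              # Data is lost if every node of some copyset failed.
              if any(set(cs) <= failed for cs in copysets):
                  losses += 1
          return losses / trials

      # A minimal copyset layout: 99 nodes partitioned into 33 disjoint triples.
      csets = [tuple(range(i, i + 3)) for i in range(0, 99, 3)]
      print(simulate_loss_rate(csets, n_nodes=99, fail_frac=0.05))

  A simulator like this can cross-check the analytical model above and scales to placement schemes too irregular for closed-form analysis.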

  10. Task 2: Design
  • Task 2: Design a new data replication or erasure coding scheme
    • Enhance data durability without sacrificing load balance and scalability
    • For load balance and scalability, consistent-hashing-based replication is a good candidate (a minimal ring sketch follows this slide)
    • Improving the durability of consistent-hashing-based replication is challenging
    • Solutions will be developed based upon extensive prior R&D
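
  For readers unfamiliar with the starting point, here is a minimal consistent hashing ring with virtual nodes (a sketch, not the project's design; the node names and vnode count are made up):

      import bisect
      import hashlib

      def _h(key):
          """Stable hash onto the ring (md5 keeps it deterministic)."""
          return int(hashlib.md5(key.encode()).hexdigest(), 16)

      class HashRing:
          def __init__(self, nodes, vnodes=100):
              # Each physical node appears at vnodes points on the ring;
              # the many virtual points are what give load balance.
              self.ring = sorted((_h(f"{n}#{v}"), n)
                                 for n in nodes for v in range(vnodes))
              self.keys = [k for k, _ in self.ring]

          def replicas(self, key, r=3):
              """Replicas go to the next r distinct nodes clockwise, so
              node churn only remaps keys near the affected positions."""
              i = bisect.bisect(self.keys, _h(key))
              out = []
              while len(out) < r:  # assumes at least r distinct nodes
                  node = self.ring[i % len(self.ring)][1]
                  if node not in out:
                      out.append(node)
                  i += 1
              return out

      ring = HashRing([f"node{i}" for i in range(12)])
      print(ring.replicas("chunk-42"))

  Note the durability problem the slide alludes to: choosing r successors per key means nearly every distinct r-node neighborhood becomes a copyset, so the copyset count is large; the design challenge is constraining these replica sets without disturbing the ring's balance and incremental scalability.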

  11. Task 3: Implementation
  • Task 3: Apply the proposed data replication or erasure coding scheme to the targeted systems
    • Target systems: Sheepdog, Ceph, and HDFS
    • A prototype system with the newly proposed durable data replication or erasure coding scheme is planned (a hypothetical plug-in sketch follows this slide)
    • The code developed will be free to use
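
  Slide 5 describes the scheme as a plug-in replacement for existing placement logic; one hypothetical way that could look in code (the interface and class names are my invention, not from Sheepdog, Ceph, or HDFS):

      from abc import ABC, abstractmethod
      import zlib

      class PlacementDriver(ABC):
          """Hypothetical plug-in point: the data store asks the driver
          where each chunk's replicas go, so policies are swappable."""
          @abstractmethod
          def replicas(self, chunk_id: str, r: int) -> list:
              ...

      class CopysetDriver(PlacementDriver):
          def __init__(self, copysets):
              self.copysets = copysets

          def replicas(self, chunk_id, r=3):
              # Deterministic chunk -> copyset mapping, as in the slide 5 sketch.
              return list(self.copysets[zlib.crc32(chunk_id.encode())
                                        % len(self.copysets)])

      driver = CopysetDriver([(0, 1, 2), (3, 4, 5), (6, 7, 8), (9, 10, 11)])
      print(driver.replicas("chunk-42"))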

  12. Task 4: Evaluation
  • Task 4: Test the durability, and conduct experiments to evaluate the load balance, scalability, and overhead of the new scheme (an example metric sketch follows this slide)
    • Small-scale tests will be conducted
    • Large-scale simulation will complement the tests
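
  As one example of what a load balance measurement could look like (my own sketch; the metric choice is an assumption, not stated in the slides), the coefficient of variation of per-node chunk counts is a common summary, with 0 meaning perfectly even:

      from collections import Counter
      from statistics import mean, pstdev

      def load_balance_cv(placements):
          """Coefficient of variation of per-node chunk counts
          (stddev / mean); lower is more balanced."""
          counts = Counter(node for replica_set in placements
                           for node in replica_set)
          vals = list(counts.values())
          return pstdev(vals) / mean(vals)

      # e.g., feed in the replica sets produced by any placement sketch above:
      placements = [(0, 1, 2), (3, 4, 5), (0, 1, 2), (6, 7, 8)]
      print(load_balance_cv(placements))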

  13. Activities and Outcomes
  • Activities
    • Preliminary study and prior R&D in this space
    • Modeling of mainstream data store systems
  • Outcomes
    • Knowledge and findings from a systematic study of correlated-failure durability
    • A mathematical model and a simulator for simulating data loss in data centers under correlated failures
    • A new data replication algorithm or erasure coding scheme for DuraStore, and prototype implementation source code

  14. Deliverables and Benefits
  • Deliverables
    • One or more correlated-failure-durability-aware data replication or erasure coding schemes for data center storage
    • Evaluation results from the simulation and experiments on the implementation of the proposed schemes
    • Report/paper presenting the design and evaluation results
  • Benefits
    • Access to the DuraStore design and prototype, and free use of the IP generated
    • Enhanced productivity and utilization of cloud storage systems
    • Lower operational cost for data centers
    • Reduced maintenance and troubleshooting manpower and resources for data centers
    • Simpler fault-tolerance design for data centers
    • Collaboration with faculty/staff and graduate students
    • Access to reports and papers
    • Recruitment and spin-off opportunities

  15. Preliminary Results and Publications
  • W. Xie and Y. Chen. Elastic Consistent Hashing for Distributed Storage Systems. IPDPS’17
  • J. Zhou, W. Xie, D. Dai and Y. Chen. Pattern-Directed Replication Scheme for Heterogeneous Object-based Storage. CCGrid’17
  • J. Zhou, W. Xie, J. Noble, K. Echo and Y. Chen. SUORA: A Scalable and Uniform Data Distribution Algorithm for Heterogeneous Storage Systems. NAS’16
  • W. Xie, J. Zhou, M. Reyes, J. Noble and Y. Chen. Two-Mode Data Distribution Scheme for Heterogeneous Storage in Data Centers. BigData’15
  • Sheepdog with Elastic Consistent Hashing (IPDPS'17 paper): http://discl.cs.ttu.edu/gitlab/xiewei/ElasticConsistentHashing
  • Copyset consistent hashing on lib-ch-placement: http://xiewei@discl.cs.ttu.edu/gitlab/xiewei/ch-placement-copyset.git

  16. LIFE Form Input
  • Please take a moment to fill out your L.I.F.E. forms: http://www.iucrc.com
  • Select “Cloud and Autonomic Computing Center”, then select the “IAB” role
  • What do you like about this project? What would you change? (Please include all relevant feedback.)
