Design a durable data center storage system that minimizes data loss under correlated failure events, improving durability without sacrificing load balance or scalability. The project involves analysis, modeling, design, implementation, and testing tasks.
DuraStore – Achieving Highly Durable Data Centers • Presenter: Wei Xie
Data-Intensive Scalable Computing Laboratory • Mission statement • Broad research interests in parallel and distributed computing, high-performance computing, cloud computing with a focus on building scalable computing systems for data-intensive applications • Projects (over $2M external grants in past 5 years, from NSF/DOE/DOD) • OpenSoC HPC: Open Source, Extensible High Performance Computing Platform (DOD/LBNL) • Compute on Data Path: Combating Data Movement in HPC (NSF) • Unistore: A Unified Storage Architecture (NSF-IUCRC/Nimboxx) • Development of a Data-Intensive Scalable Computing Instrument (NSF) • Decoupled Execution Paradigm for Data-Intensive High-End Computing (NSF) • Active Object Storage for Big Data Applications (DOE/ANL) • Faculty/staff members • Dr. Yong Chen • Dr. Yu Zhuang • Dr. Dong Dai • Dr. Jiang Zhou • Ph.D. student members • Mr. Ghazanfar Ali • Ms. Elham Hojati • Mr. John Leidel • Ms. Neda Tavakoli • Mr. Xi Wang • Mr. Wei Xie • Mr. Wei Zhang • Masters/UG student members • Ms. Priyanka Kumari, Mr. Ali Nosrati, Mr. Yuan Cui, Mr. Frank Conlon, Ms. Yan Mu, Mr. Zachary Hansen, Mr. Yong Wu, Mr. Alex Weaver • Website: http://discl.cs.ttu.edu/
Outline • Project Goals, Motivations, Challenges • Project Overview • Project team members • Background and Related Research • Overview of project tasks • Activities and outcomes • Deliverables and benefits • LIFE form input
Project Goals, Motivation, Challenges • Project Goals • Design a data center storage system that is significantly more durable under correlated failure events. • Motivations • Correlated failure events (e.g., site-wide power outages) cause simultaneous node failures • Random replication, used in mainstream storage systems, has a high data loss probability under such events • Recovering data lost to correlated failures is costly • Both industry and research groups have reported this problem • Challenges • Recently proposed durability-aware data replication schemes improve durability but sacrifice load balance, scalability, and other features of storage systems
Project Overview • Core component: durability-aware data replication/erasure coding • Uses copysets and combinatorial theory to minimize the probability of data loss under correlated failures (see the placement sketch below) • Works as a plug-in replacement compatible with existing data replication/erasure coding schemes • Data store software and storage infrastructure should be modified to be compatible with the new scheme • Architecture: durable distributed/parallel data store (HDFS, GFS, RAMCloud, Ceph) → durability-aware data replication/erasure coding → durable storage cluster
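To make the copyset idea concrete, below is a minimal, illustrative sketch of copyset-style replica placement. The single-permutation simplification, function names, and parameters are assumptions for illustration, not the DuraStore design: nodes are shuffled once and partitioned into groups of R, and every chunk is replicated onto exactly one group, so the number of distinct copysets stays small instead of growing with the number of chunks.

```python
# Illustrative sketch of copyset-style placement (assumed scheme, not the DuraStore code).
import random

def build_copysets(nodes, r):
    """Partition one random permutation of the nodes into groups of size r.

    Each group is a copyset; every chunk is replicated onto exactly one of
    these groups, so the system has roughly len(nodes) / r distinct copysets
    rather than one distinct replica set per chunk as in random replication.
    """
    perm = nodes[:]
    random.shuffle(perm)
    usable = len(perm) - len(perm) % r
    return [perm[i:i + r] for i in range(0, usable, r)]

def place_chunk(chunk_id, copysets):
    """Map a chunk to one copyset (here simply by hashing the chunk id)."""
    return copysets[hash(chunk_id) % len(copysets)]

if __name__ == "__main__":
    nodes = [f"node{i}" for i in range(12)]
    copysets = build_copysets(nodes, r=3)
    print(copysets)                         # 4 copysets of 3 nodes each
    print(place_chunk("chunk-42", copysets))
```

In practice, multiple permutations would be used to preserve recovery parallelism (scatter width); the single permutation above is only the simplest possible version.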
Project Team Members • Faculty • Dr. Yong Chen Assistant Professor of CS, Texas Tech University • Dr. Jiang Zhou Postdoctoral Researcher of CS, Texas Tech University • Students • Wei Xie CS Ph.D. Candidate, Texas Tech University • Expertise • Storage systems • Parallel/distributed file systems • Cloud software stack • Data models and data management
Background and Related Research • Correlated failures in data centers • When a cluster-wide failure event such as a power outage occurs, about 0.5-1% of the nodes fail to reboot (as reported by Yahoo! and LinkedIn) • Recovering lost data takes a long, fixed time (locating the lost data) • Copyset data replication scheme • A copyset is a set of servers that a data chunk is replicated to • The total number of unique copysets in the system determines the data loss probability under correlated failures (see the model below) • Minimizing the total number of unique copysets can significantly reduce the data loss probability (from nearly 100% to about 0.1%) • Geo-replication • Replication to a remote site to achieve better durability
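As a rough illustration of why the number of unique copysets drives durability, consider a simplified model (an assumption in the spirit of the copyset literature, not a formula stated on the slide): with N nodes, replication factor R, F nodes failing together, and S distinct copysets treated as independent,

```latex
% Simplified correlated-failure data loss model (illustrative assumption;
% copysets are treated as independent).
\[
  P(\text{loss}) \;\approx\; 1 - \left( 1 - \frac{\binom{F}{R}}{\binom{N}{R}} \right)^{S}
\]
% Random replication: S grows with the number of chunks, so P(loss) approaches 1.
% Copyset replication: S is kept on the order of N/R, so P(loss) stays small.
```

Under random replication S is roughly the number of chunks, which pushes the probability toward 100%; a copyset scheme keeps S small, which is how the reported drop to roughly 0.1% arises.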
Overview of Project Tasks • Task 1: Analysis and modeling of mainstream data stores in terms of data durability • Task 2: Design a new data replication or erasure coding scheme • Task 3: Apply the proposed data replication or erasure coding scheme to the targeted systems • Task 4: Test the durability and conduct experiments to evaluate the load balance, scalability, and overhead of the new scheme
Task 1: Analysis and Modeling • Task 1: Analysis and modeling of mainstream data stores in terms of data durability • Data redundancy scheme (replication or erasure coding) • Correlated-failure durability model (a sketch of evaluating such a model appears below)
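As a sketch of what such a durability model might compute (the function name and parameter values are hypothetical, chosen only to illustrate the effect), the simplified formula above can be evaluated directly:

```python
# Illustrative evaluation of the simplified durability model (assumed parameters).
from math import comb

def loss_probability(n_nodes, r, n_failed, n_copysets):
    """P(at least one copyset lies entirely inside the failed node set),
    treating copysets as independent -- a simplifying assumption."""
    p_one = comb(n_failed, r) / comb(n_nodes, r)
    return 1 - (1 - p_one) ** n_copysets

if __name__ == "__main__":
    N, R = 5000, 3
    F = 50  # about 1% of nodes failing to reboot after a cluster-wide outage
    # Random replication: roughly one distinct copyset per chunk.
    print(loss_probability(N, R, F, n_copysets=10_000_000))  # close to 1
    # Copyset-style replication: on the order of N / R distinct copysets.
    print(loss_probability(N, R, F, n_copysets=N // R))      # well below 1%
```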
Task 2: Design • Task 2: Design a new data replication or erasure coding scheme • Enhance data durability without sacrificing load balance and scalability • For load balance and scalability, consistent-hashing-based replication is a good candidate (see the sketch below) • It is challenging to improve the durability of consistent-hashing-based replication • Solutions will be developed based upon extensive prior R&D
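For context, here is a minimal sketch of successor-based replication on a consistent-hashing ring (class, method, and parameter names are illustrative, not the project's implementation). It shows why the approach balances load and scales well, and also why it is hard to make durable: each chunk's successor set is effectively its own copyset, so the number of distinct copysets grows with the cluster rather than being minimized.

```python
# Illustrative consistent-hashing sketch (assumed design, not the project's code).
import bisect
import hashlib

def _hash(key):
    """Stable hash used to place both virtual nodes and chunks on the ring."""
    return int(hashlib.md5(key.encode()).hexdigest(), 16) % (1 << 64)

class HashRing:
    """Consistent-hashing ring with successor-based replication."""

    def __init__(self, nodes, vnodes=64):
        # Each physical node owns several virtual points for load balance.
        self.ring = sorted((_hash(f"{n}#{v}"), n) for n in nodes for v in range(vnodes))
        self.keys = [h for h, _ in self.ring]

    def replicas(self, chunk_id, r=3):
        """Return the r distinct nodes found clockwise from the chunk's position."""
        idx = bisect.bisect(self.keys, _hash(chunk_id))
        chosen = []
        while len(chosen) < r:
            node = self.ring[idx % len(self.ring)][1]
            if node not in chosen:
                chosen.append(node)
            idx += 1
        return chosen

if __name__ == "__main__":
    ring = HashRing([f"node{i}" for i in range(10)])
    print(ring.replicas("chunk-42"))
```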
Task 3: Implementation • Task 3: Apply the proposed data replication or erasure coding scheme on the targeted system • Target systems: Sheepdog, Ceph, and HDFS • A prototype system with the newly proposed durable data replication or erasure coding scheme is planned • The code developed will be free to use
Task 4: Evaluation • Task 4: Test the durability and conduct experiments to evaluate the load balance, scalability, and overhead of the new scheme • Small-scale tests will be conducted • Large-scale simulations will complement the tests (see the simulation sketch below)
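A Monte Carlo simulation along the following lines (parameters and function names are hypothetical) is one way such a large-scale evaluation could complement the analytical model: it injects correlated failures and counts how often some chunk loses all of its replicas under a given placement.

```python
# Illustrative Monte Carlo failure simulation (assumed methodology and parameters).
import random

def simulate_loss_rate(n_nodes, fail_frac, copysets, trials=1000):
    """Estimate P(some copyset is fully contained in a correlated failure)
    by repeatedly failing a random fail_frac of the nodes at once."""
    n_failed = max(1, int(n_nodes * fail_frac))
    losses = 0
    for _ in range(trials):
        failed = set(random.sample(range(n_nodes), n_failed))
        if any(failed.issuperset(cs) for cs in copysets):
            losses += 1
    return losses / trials

if __name__ == "__main__":
    n, r = 1000, 3
    # Copyset-style placement: one permutation split into groups of r.
    perm = list(range(n))
    random.shuffle(perm)
    grouped = [frozenset(perm[i:i + r]) for i in range(0, n - n % r, r)]
    # Random replication: an independent random replica set per chunk.
    random_sets = [frozenset(random.sample(range(n), r)) for _ in range(20_000)]
    print("copyset:", simulate_loss_rate(n, 0.01, grouped, trials=500))
    print("random :", simulate_loss_rate(n, 0.01, random_sets, trials=500))
```

At realistic scales (thousands of nodes, millions of chunks) the gap between the two placements widens toward the roughly 100% versus 0.1% difference cited earlier; the small parameters above are chosen only to keep the sketch fast.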
Activities and outcomes • Activities • Preliminary study and prior R&D in this space • Modeling of mainstream data store systems • Outcomes • Knowledge and findings from a systematic study of correlated-failure durability • A mathematical model and a simulator for data loss in data centers under correlated failures • A new data replication algorithm or erasure coding scheme for DuraStore, with prototype implementation source code
Deliverables and benefits • Deliverables • One or more correlated-failure-aware data replication or erasure coding schemes for data center storage • Evaluation results from simulations and experiments with the implementation of the proposed schemes • Report/paper presenting the design and evaluation results • Benefits • Access to the DuraStore design and prototype, and free use of the IP generated • Enhanced productivity and utilization of cloud storage systems • Lower operational cost for data centers • Reduced maintenance and troubleshooting manpower and resources for data centers • Simpler fault-tolerance design for data centers • Collaboration with faculty/staff and graduate students • Access to reports and papers • Recruitment and spin-off opportunities
Preliminary Results and Publications • W. Xie and Y. Chen. Elastic Consistent Hashing for Distributed Storage Systems, IPDPS’17 • J. Zhou, W. Xie, D. Dai and Y. Chen. Pattern-Directed Replication Scheme for Heterogeneous Object-based Storage, CCGrid’17 • J. Zhou, W. Xie, J. Noble, K. Echo and Y. Chen. SUORA: A Scalable and Uniform Data Distribution Algorithm for Heterogeneous Storage Systems, NAS’16 • W. Xie, J. Zhou, M. Reyes, J. Noble and Y. Chen. Two-Mode Data Distribution Scheme for Heterogeneous Storage in Data Centers, BigData’15 • Sheepdog with Elastic Consistent Hashing (IPDPS'17 paper): http://discl.cs.ttu.edu/gitlab/xiewei/ElasticConsistentHashing • Copyset consistent hashing on lib-ch-placement: http://xiewei@discl.cs.ttu.edu/gitlab/xiewei/ch-placement-copyset.git
LIFE Form Input • Please take a moment to fill out your L.I.F.E. forms: http://www.iucrc.com • Select "Cloud and Autonomic Computing Center", then select the "IAB" role • What do you like about this project? What would you change? (Please include all relevant feedback.)