Fault Tolerant Stream Processing using Distributed Replicated File System

Introduction • Stream Processing Engines (SPEs) • An SPE is a tool for monitoring activity in continuous data streams • In an SPE, a query takes the form of a loop-free, directed graph of operators. Each operator processes data arriving on its input streams and produces data on its output stream
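
A loop-free operator graph of this kind can be sketched in a few lines of Python. This is only an illustration: the operator names (`filter`, `scale`), the `upstream` map, and `run_query` are hypothetical and do not come from any particular SPE.

```python
from graphlib import TopologicalSorter

# Hypothetical two-operator query diagram: source -> filter -> scale.
operators = {
    "filter": lambda xs: [x for x in xs if x >= 0],
    "scale": lambda xs: [10 * x for x in xs],
}
# Maps each operator to the set of operators whose output it consumes.
upstream = {"filter": set(), "scale": {"filter"}}

def run_query(source):
    """Push one batch of tuples through the diagram in dependency order."""
    outputs = {}
    # static_order() raises CycleError if the graph contains a loop.
    for name in TopologicalSorter(upstream).static_order():
        preds = upstream[name]
        # An operator with no upstream operator reads the external source.
        if preds:
            batch = [x for p in sorted(preds) for x in outputs[p]]
        else:
            batch = source
        outputs[name] = operators[name](batch)
    return outputs

result = run_query([3, -1, 4])
```

Here `result["filter"]` holds the non-negative inputs and `result["scale"]` holds those values multiplied by ten.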

These queries are called query diagrams • Every server runs an instance of the SPE and is called a processing node • When a node fails, the failure causes the SPE to produce wrong results or blocks it from processing • The challenge is fault tolerance

Fault-tolerance techniques • Replication: most servers hold replicated state in memory until a server fails, and this replicated state is frequently updated through a process called checkpointing to ensure that backups reflect the current state

In these techniques, the SPE not only processes streaming data in memory but also keeps its checkpoints in memory. • Because this processing is done in memory without going to disk, at least half of the cluster resources are dedicated to fault tolerance • So once replication is in place and kept up to date, the number of servers available for normal stream processing is cut at least in half, and so is the memory capacity. • Costly

Recovery: with this approach, only one primary replica of each node processes streams. The primary node periodically takes snapshots of its state and sends them to the other nodes • When a failure occurs, one of these backups takes over as primary

SGuard • Based on the rollback recovery technique • Each SPE node takes periodic checkpoints of its state and writes these checkpoints to stable storage. • To take a checkpoint, a node suspends its processing and makes a copy of its state. • Uses a distributed, replicated file system (DFS) as stable storage

Challenge in using disk • Saving checkpoints to disk saves memory but has high write latency. • To mitigate this, we use a DFS. • Some important properties of the DFS • Support for large data files • Fault tolerance through replication

SGuard System Architecture

Three main points • Uses disks to store checkpoints through the DFS • Uses the Peace scheduler • Uses Memory Management Middleware (MMM)

SGuard Software Architecture

Make checkpoints asynchronous so that operators can continue processing during checkpoints • SGuard extends the DFS coordinator with a scheduler called Peace that reduces the total time to write the state of a single HA unit while maintaining good overall resource utilization • An HA (high-availability) unit is a group of interconnected operators on a node

The SGuard technique introduces a new middleware layer that includes the MMM • It also includes ChkptMngr and IOService • ChkptMngr: manages checkpoint and recovery operations. It checkpoints an HA unit in 5 steps

Informs HAInput that a checkpoint is starting • Prepares the state of the HA unit • Writes the prepared state to the DFS • Informs the coordinator about the new checkpoint • Notifies the HAInput operator that the checkpoint is complete
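
The five steps can be sketched as a single function. This is a minimal sketch, not SGuard's implementation: the `HAInput` and `Coordinator` classes, the `checkpoint` and `recover` functions, the path layout, the use of `pickle`, and the dictionary standing in for the DFS are all assumptions made for illustration.

```python
import pickle

class HAInput:
    """Hypothetical stand-in for the HAInput operator; it only records notifications."""
    def __init__(self):
        self.events = []

    def notify(self, event):
        self.events.append(event)

class Coordinator:
    """Hypothetical stand-in for the coordinator; tracks the latest checkpoint per unit."""
    def __init__(self):
        self.latest = {}

    def register(self, unit_id, path):
        self.latest[unit_id] = path

def checkpoint(unit_id, state, epoch, ha_input, dfs, coordinator):
    ha_input.notify("checkpoint_start")        # 1. tell HAInput a checkpoint is starting
    prepared = pickle.dumps(state)             # 2. prepare (here: serialize) the unit's state
    path = f"/checkpoints/{unit_id}/{epoch}"   # assumed path layout
    dfs[path] = prepared                       # 3. write the prepared state to the DFS
    coordinator.register(unit_id, path)        # 4. inform the coordinator
    ha_input.notify("checkpoint_done")         # 5. tell HAInput the checkpoint is complete
    return path

def recover(unit_id, dfs, coordinator):
    """Reload the most recent registered checkpoint of a unit."""
    return pickle.loads(dfs[coordinator.latest[unit_id]])

ha_input, coordinator, dfs = HAInput(), Coordinator(), {}
checkpoint("unit-1", {"count": 7}, 1, ha_input, dfs, coordinator)
restored = recover("unit-1", dfs, coordinator)
```

Registering the checkpoint with the coordinator only after the write has finished means recovery never points at a partially written checkpoint, at least in this simplified model.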

MMM Memory Manager • To enable concurrent checkpoints, where the state of an operator is copied to disk while the operator continues executing, SGuard must control the memory of stream processing operators. • The MMM partitions SPE memory into a collection of pages in which operator state is stored

To checkpoint the state of an operator, its pages are copied to disk • When a checkpoint begins, the operator is briefly suspended and all its pages are marked read-only • The operator then resumes execution and the pages are written to disk in the background

The MMM has two layers: the Page Manager (PM) and the Data Structure (DS) layer • The PM allocates, frees, and checkpoints pages • The DS layer implements data structure abstractions on top of the PM’s page abstraction. The MMM also has a library of data structure wrappers on top of each page, called the Page Layout (PL) library

Page Manager • Maintains the lists of free and allocated pages • Handles all requests to allocate and free these pages • Maintains and exposes a page table that maps each PageId to a memory address
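
A minimal sketch of these responsibilities follows. The class and method names are hypothetical, and a `bytearray` stands in for a raw memory address, which Python does not expose directly.

```python
class PageManager:
    """Illustrative page manager: a free list plus a PageId -> page table."""

    def __init__(self, num_pages, page_size=4096):
        self.page_size = page_size
        self.free_ids = list(range(num_pages))  # pages not currently allocated
        self.page_table = {}                    # PageId -> buffer (stands in for an address)

    def allocate(self):
        if not self.free_ids:
            raise MemoryError("no free pages")
        page_id = self.free_ids.pop()
        self.page_table[page_id] = bytearray(self.page_size)
        return page_id

    def free(self, page_id):
        del self.page_table[page_id]
        self.free_ids.append(page_id)

    def lookup(self, page_id):
        """Resolve a PageId through the page table."""
        return self.page_table[page_id]

pm = PageManager(num_pages=2, page_size=8)
a = pm.allocate()
pm.lookup(a)[0] = 42      # write through the page table
b = pm.allocate()
pm.free(a)                # page a returns to the free list
```

Because callers hold a PageId rather than the buffer itself, the manager is free to change which buffer a PageId resolves to.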

Page Layout Library • A wrapper for a page with two main features • Provides a data structure abstraction on top of each page • Provides a level of indirection between the data structure and the underlying pages, enabling copy-on-write of pages during checkpoints
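
The copy-on-write idea can be illustrated with a small page store. This is a sketch under assumptions, not the PL library's actual interface: `CowPages` and its methods are hypothetical, and "read-only" is modelled by remembering which buffers belong to the running checkpoint.

```python
class CowPages:
    """Hypothetical page store with copy-on-write during a checkpoint."""

    def __init__(self, page_size=4):
        self.page_size = page_size
        self.table = {}       # PageId -> buffer currently used by the operator
        self.snapshot = None  # PageId -> frozen buffer while a checkpoint runs

    def write(self, page_id, offset, value):
        page = self.table.setdefault(page_id, bytearray(self.page_size))
        if self.snapshot is not None and self.snapshot.get(page_id) is page:
            # The page belongs to the running checkpoint: redirect the
            # operator to a private copy and leave the original untouched.
            page = bytearray(page)
            self.table[page_id] = page
        page[offset] = value

    def begin_checkpoint(self):
        # "Mark all pages read-only" by recording the current buffers.
        self.snapshot = dict(self.table)

    def finish_checkpoint(self):
        # Contents a background writer would send to disk.
        frozen = {pid: bytes(buf) for pid, buf in self.snapshot.items()}
        self.snapshot = None
        return frozen

pages = CowPages()
pages.write(0, 0, 1)
pages.begin_checkpoint()
pages.write(0, 0, 9)              # operator keeps running during the checkpoint
frozen = pages.finish_checkpoint()
```

The checkpoint sees the value written before it began, while the operator sees its own later write. Only pages modified during the checkpoint are copied.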

DS Layer • Creates meaningful relationships between pages

Peace Scheduler • Addresses the resource contention problem • Schedules writes in a manner that reduces the time to write each set of chunks while keeping the total time for completing all writes small • It does so by scheduling only as many concurrent writes as there are available resources, scheduling all writes from the same set close together, and selecting the destination for each write in a way that avoids resource contention

Nodes submit write requests to the coordinator in the form of triples (w, r, k) • The algorithm finds the best destinations by solving a min-cost max-flow problem
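
To show how a min-cost max-flow formulation can pick destinations, here is a generic solver (successive shortest paths with Bellman-Ford) applied to a toy instance. The graph layout, capacities, and costs are assumptions chosen for illustration; the slides do not specify how Peace builds its flow network from the (w, r, k) triples.

```python
def min_cost_max_flow(n, edges, s, t):
    """edges: list of (u, v, capacity, cost). Returns (flow, cost, used edges)."""
    graph = [[] for _ in range(n)]  # node -> indices of outgoing edges
    to, cap, cost = [], [], []
    for u, v, c, w in edges:
        # Forward edge at an even index, its reverse edge right after it.
        graph[u].append(len(to)); to.append(v); cap.append(c); cost.append(w)
        graph[v].append(len(to)); to.append(u); cap.append(0); cost.append(-w)
    inf = float("inf")
    flow = total = 0
    while True:
        dist, prev = [inf] * n, [-1] * n
        dist[s] = 0
        for _ in range(n - 1):  # Bellman-Ford handles negative reverse costs
            changed = False
            for u in range(n):
                if dist[u] == inf:
                    continue
                for e in graph[u]:
                    if cap[e] > 0 and dist[u] + cost[e] < dist[to[e]]:
                        dist[to[e]] = dist[u] + cost[e]
                        prev[to[e]] = e
                        changed = True
            if not changed:
                break
        if dist[t] == inf:  # no augmenting path left
            break
        push, v = inf, t
        while v != s:  # find the bottleneck capacity on the path
            push = min(push, cap[prev[v]])
            v = to[prev[v] ^ 1]
        v = t
        while v != s:  # apply the augmentation
            cap[prev[v]] -= push
            cap[prev[v] ^ 1] += push
            v = to[prev[v] ^ 1]
        flow += push
        total += push * dist[t]
    used = [(u, v) for i, (u, v, c, _) in enumerate(edges) if cap[2 * i] < c]
    return flow, total, used

# Toy instance (all numbers hypothetical): node 0 = source, 1-2 = pending
# writes, 3-4 = destination servers with one free write slot each, 5 = sink.
edges = [
    (0, 1, 1, 0), (0, 2, 1, 0),
    (1, 3, 1, 1), (1, 4, 1, 5),   # cost = assumed contention on that server
    (2, 3, 1, 2), (2, 4, 1, 1),
    (3, 5, 1, 0), (4, 5, 1, 0),
]
flow, total, used = min_cost_max_flow(6, edges, 0, 5)
assignment = [(u, v) for u, v in used if u in (1, 2)]
```

In this instance both writes are scheduled, each on a different server, at the lowest total cost. Limiting each server's capacity to its free slots is what keeps the solver from scheduling more concurrent writes than there are resources.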

Conclusion • SGuard improves the transparency of SPE checkpoints through the MMM, which enables efficient asynchronous checkpointing

Thank you
