Addressing the challenges of disk-full errors and scattered storage repositories, Sorrento provides a resilient, cost-effective cluster-based storage system. Its design objectives are programmability, manageability, and performance, and its data organization and multi-level consistency models support efficient storage and access. The system architecture comprises a Proxy Module, a Server Module, and a Namespace Server that together provide seamless data access. The project status covers a prototype implementation and plans for further development. For web resources and updates, visit http://www.cs.ucsb.edu/~gulbeden/sorrento/.
Sorrento: A Self-Organizing Distributed File System on Large-scale Clusters Hong Tang, Aziz Gulbeden and Tao Yang Department of Computer Science, University of California, Santa Barbara
Information Management Challenges • “Disk full (again)!” • Cause: increasing storage demand. • Options: adding more disks, reorganizing data, removing garbage. • “Where are the data?” • Cause 1: scattered storage repositories. • Cause 2: disk corruptions (crashes). • Options: exhaustive search; indexing; backup. • Management headaches! • Nightmares for data-intensive applications and online services.
A Better World • A single repository – virtual disk. • A uniform hierarchical namespace. • Expand storage capacity on-demand. • Resilient to disk failures through data redundancy. • Fast and ubiquitous access. • Inexpensive storage.
Cluster-based Storage Systems • Turn a generic cluster into a storage system.
Why? • Clusters provide: • Cost-effective computing platform. • Incremental scalability. • High availability.
Design Objectives • Programmability • Virtualization of distributed storage resources. • Uniform namespace for data addressing. • Manageability • Incremental expansion. • Self-adaptive to node additions and departures. • Almost-zero administration. • Performance • Performance monitoring. • Intelligent data placement and migration. • 24/7 Availability • Replication support.
Design Choices • Use commodity components as much as possible. • Share-nothing architecture. • Functionally symmetric servers (serverless). • User-level file system. • Daemons run as user processes. • Possible to make it mountable through kernel modules.
Data Organization Model • User-perceived files are split into variable-length segments (data objects). • Data objects are linked by index objects. • Data and index objects are stored in their entirety as files within native file systems. • Objects are addressed through location-transparent GUIDs.
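A minimal sketch of this object model in Python may help; the class names (DataObject, IndexObject), the random-UUID GUIDs, and the fixed segment cap are illustrative assumptions, since the slides do not specify Sorrento's actual segmentation policy or GUID scheme.

    import uuid

    def new_guid():
        # Location-transparent identifier; a random UUID stands in for Sorrento's GUIDs here.
        return uuid.uuid4().hex

    class DataObject:
        """A variable-length segment of a user-perceived file, stored whole as a native file."""
        def __init__(self, payload: bytes):
            self.guid = new_guid()
            self.payload = payload

    class IndexObject:
        """Links a file's segments, in order, by their GUIDs; acts as the file's root object."""
        def __init__(self, segment_guids):
            self.guid = new_guid()
            self.segment_guids = list(segment_guids)

    def split_file(data: bytes, max_segment: int = 64 * 1024):
        """Split file contents into segments and build the index object that links them."""
        segments = [DataObject(data[i:i + max_segment])
                    for i in range(0, len(data), max_segment)]
        return IndexObject(seg.guid for seg in segments), segments

In this sketch the index object's GUID plays the role of the file's root-object GUID, which is what the Namespace Server returns for a path (see below).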
Multi-level Data Consistency Model • Level 0: best-effort without any guarantee. I/O operations may be reordered. • Level 1: time-ordered I/O operations. Readers may still observe missed writes. • Level 2: open-to-close session consistency. The effects of multiple I/O operations within an open-to-close session are either ALL visible or NONE visible to others. May abort the session when there is a write/write conflict. • Level 3: adds file sharing and automatic conflict resolution on top of Level 2.
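The Level 2 behavior can be illustrated with a small sketch; the names (Consistency, ToyStore, Session) and the version-counter conflict check are illustration-only assumptions, not Sorrento's actual mechanism.

    from enum import IntEnum

    class Consistency(IntEnum):
        BEST_EFFORT = 0    # Level 0: no guarantee, I/O operations may be reordered
        TIME_ORDERED = 1   # Level 1: time-ordered, but writes may still be missed
        SESSION = 2        # Level 2: open-to-close, all-or-nothing visibility
        AUTO_RESOLVE = 3   # Level 3: Level 2 plus sharing and automatic conflict resolution

    class ToyStore:
        """In-memory stand-in for a file's backing store: blocks plus a version counter."""
        def __init__(self):
            self.blocks, self.version = {}, 0

    class Session:
        """Level 2 illustration: buffered writes become visible all at once at close()."""
        def __init__(self, store: ToyStore):
            self.store = store
            self.base_version = store.version
            self.buffer = {}

        def write(self, offset: int, data: bytes):
            self.buffer[offset] = data                 # private until the session closes

        def close(self):
            if self.store.version != self.base_version:
                raise RuntimeError("write/write conflict: session aborted")
            self.store.blocks.update(self.buffer)      # ALL writes become visible together
            self.store.version += 1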
System Architecture • Proxy Module • Data location and placement. • Monitor multicast channel. • Server Module • Export local storage. • Namespace Server • Maintain a global directory tree. • Translate filenames to root-object GUIDs.
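A rough sketch of the Namespace Server's role follows, assuming a flattened path-to-GUID dictionary in place of the real directory tree (the method names are hypothetical):

    class NamespaceServer:
        """Maintains the global directory tree and maps full paths to root-object GUIDs."""
        def __init__(self):
            self.tree = {}                    # path -> root-object GUID (flattened for brevity)

        def register(self, path: str, root_guid: str):
            self.tree[path] = root_guid       # called when a file is created

        def lookup(self, path: str) -> str:
            return self.tree[path]            # the proxy asks for a path's root-object GUID

        def list_dir(self, dir_path: str):
            prefix = dir_path.rstrip("/") + "/"
            return [p for p in self.tree if p.startswith(prefix)]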
Accessing a File 1. Wants to access /foo/bar 2. Asks for “/foo/bar”'s GUID 3. Gets “/foo/bar”'s GUID 4. Determines the server to contact 5. Asks for the root object 6. Retrieves the data 7. Contacts other servers if necessary 8. Closes the file
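These steps condense into the following client-side sketch; the helpers (namespace, servers, locate_server) and the plain SHA-1 placement are stand-ins for Sorrento's actual RPC interfaces and consistent-hashing protocol.

    import hashlib

    def locate_server(guid: str, live_hosts: list):
        # Map a GUID to a live host; a plain hash is shown, the real system uses consistent hashing.
        return live_hosts[int(hashlib.sha1(guid.encode()).hexdigest(), 16) % len(live_hosts)]

    def read_file(path: str, namespace, live_hosts: list, servers: dict) -> bytes:
        root_guid = namespace.lookup(path)                  # steps 2-3: ask the Namespace Server
        host = locate_server(root_guid, live_hosts)         # step 4: determine the server to contact
        index = servers[host].fetch(root_guid)              # step 5: ask for the root (index) object
        data = b""
        for seg_guid in index.segment_guids:                # steps 6-7: retrieve the data,
            seg_host = locate_server(seg_guid, live_hosts)  #   contacting other servers if necessary
            data += servers[seg_host].fetch(seg_guid).payload
        return data                                         # step 8: close the file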
Project Status • Distributed data placement and location protocol: to appear in Supercomputing 2003. • Prototype implementation expected by summer 2003. • Production usage by the end of 2003. • Project Web page: http://www.cs.ucsb.edu/~gulbeden/sorrento/
Evaluation • We are planning to use trace-driven evaluation. • Enables us to find problems without adding much to the system. • Performance of various applications can be measured without porting them. • Allows us to reproduce and identify any potential problems. • Applications that can benefit from the system include: • Web crawlers. • Protein sequence matching. • Parallel I/O applications.
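A trace replayer along these lines could look like the sketch below; the CSV trace format and the client methods (open, read, write, close) are assumptions rather than the project's actual tooling.

    import csv
    import time

    def replay_trace(trace_path: str, client):
        """Replay a CSV trace of (timestamp, op, path, size) records against a storage client."""
        ops = {"open":  lambda p, n: client.open(p),
               "read":  lambda p, n: client.read(p, n),
               "write": lambda p, n: client.write(p, b"\0" * n),
               "close": lambda p, n: client.close(p)}
        with open(trace_path) as f:
            rows = [(float(ts), op, path, int(size)) for ts, op, path, size in csv.reader(f)]
        t0, wall0 = rows[0][0], time.time()
        for ts, op, path, size in rows:
            time.sleep(max(0.0, (ts - t0) - (time.time() - wall0)))  # keep original inter-arrival times
            ops[op](path, size)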
Project Status and Development Plan • Most software modules have been implemented, including consistent hashing, UDP request/response management, a persistent hash table, a file block cache, a thread pool, and load statistics collection. • We are working on building a running prototype. • Milestones: • Bare-bones runtime system. • Add dynamic migration. • Add version-based data management and replication. • Add a kernel VFS switch.
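For the consistent-hashing module, a minimal ring sketch is shown below; the class name, SHA-1 hash, and virtual-replica count are illustrative, as the slides do not describe the implementation's parameters.

    import bisect
    import hashlib

    class HashRing:
        """Minimal consistent hashing: each GUID maps to the first host clockwise on the ring."""
        def __init__(self, hosts, replicas: int = 64):
            self.ring = sorted((self._hash(f"{host}#{i}"), host)
                               for host in hosts for i in range(replicas))
            self.keys = [k for k, _ in self.ring]

        @staticmethod
        def _hash(key: str) -> int:
            return int(hashlib.sha1(key.encode()).hexdigest(), 16)

        def locate(self, guid: str) -> str:
            i = bisect.bisect(self.keys, self._hash(guid)) % len(self.ring)
            return self.ring[i][1]

Adding or removing a host only remaps the GUIDs adjacent to it on the ring, which is what makes node additions and departures self-adaptive.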
Conclusion • Project website • http://www.cs.ucsb.edu/~gulbeden/sorrento
Proxy Module • Consists of: • Dispatcher: listens for incoming requests. • Thread pool: processes requests from local applications. • Subscriber: monitors the multicast channel. • Stores: • The set of live hosts. • The address of the Namespace Server. • The set of open file handles. • Accesses data by hashing the object's GUID.
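A structural sketch of the Proxy Module follows; the class layout, method names, and worker count are illustrative assumptions.

    from concurrent.futures import ThreadPoolExecutor

    class Proxy:
        """Client-side proxy: tracks live hosts, open handles, and the Namespace Server address."""
        def __init__(self, namespace_addr: str, workers: int = 8):
            self.namespace_addr = namespace_addr
            self.live_hosts = set()                    # kept current by the multicast subscriber
            self.open_handles = {}                     # handle id -> (path, root-object GUID)
            self.pool = ThreadPoolExecutor(max_workers=workers)

        def on_multicast(self, host: str, alive: bool):
            """Subscriber callback: update the set of live hosts."""
            (self.live_hosts.add if alive else self.live_hosts.discard)(host)

        def dispatch(self, request):
            """Dispatcher: hand each request from a local application to the thread pool."""
            return self.pool.submit(self._handle, request)

        def _handle(self, request):
            # Placeholder: hash the object's GUID to pick a server, then issue the I/O there.
            raise NotImplementedError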
Server Module • Consists of: • Dispatcher: Listens for requests (UDP or TCP). • Thread Pool: Handles requests for local operations. • Local Storage: Stores the local data. • Stores: • Global block table partition. • INode Map. • Physical local store.
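A corresponding sketch of the Server Module, storing each object in its entirety as a file within the native file system; the directory layout and the toy INode map are assumptions.

    import os

    class StorageServer:
        """Exports local storage: each data or index object is kept as one native file."""
        def __init__(self, root_dir: str):
            self.root_dir = root_dir
            self.inode_map = {}                        # GUID -> local file name (toy INode map)
            os.makedirs(root_dir, exist_ok=True)

        def _path(self, guid: str) -> str:
            return os.path.join(self.root_dir, self.inode_map.setdefault(guid, guid))

        def put(self, guid: str, payload: bytes):
            with open(self._path(guid), "wb") as f:
                f.write(payload)

        def fetch(self, guid: str) -> bytes:
            with open(self._path(guid), "rb") as f:
                return f.read()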
Choice I: SAN • Distributed and heterogeneous devices. • Dedicated fast network. • Storage virtualization. • Volume-based. • Each volume managed by a dedicated server. • Volume map.
Choice I: SAN (cont) • Disadvantages: cost, scalability, and manageability. • Expanding an existing volume: • Change the volume map. • Reorganize data on the old volume. • Handling disk failures: • Exclude failed disks from volume maps. • Restore data to spare disks. • Conclusions: • Hard to automate. • Prone to human errors (at large scale).