Sorrento: A Self-Organizing Distributed File System on Large-scale Clusters Hong Tang, Aziz Gulbeden and Tao Yang Department of Computer Science, University of California, Santa Barbara
Information Management Challenges • “Disk full (again)!” • Cause: increasing storage demand. • Options: adding more disks, reorganizing data, removing garbage. • “Where are the data?” • Cause 1: scattered storage repositories. • Cause 2: disk corruptions (crashes). • Options: exhaustive search; indexing; backup. • Management headaches! • Nightmares for data-intensive applications and online services.
A Better World • A single repository – virtual disk. • A uniform hierarchical namespace. • Expand storage capacity on-demand. • Resilient to disk failures through data redundancy. • Fast and ubiquitous access. • Inexpensive storage.
Cluster-based Storage Systems • Turn a generic cluster into a storage system.
Why? • Clusters provide: • Cost-effective computing platform. • Incremental scalability. • High availability.
Design Objectives • Programmability • Virtualization of distributed storage resources. • Uniform namespace for data addressing. • Manageability • Incremental expansion. • Self-adaptive to node additions and departures. • Almost-zero administration. • Performance • Performance monitoring. • Intelligent data placement and migration. • 24/7 Availability • Replication support.
Design Choices • Use commodity components as much as possible. • Shared-nothing architecture. • Functionally symmetric servers (serverless). • User-level file system. • Daemons run as user processes. • Possible to make it mountable through kernel modules.
Data Organization Model • User-perceived files are split into variable-length segments (data objects). • Data objects are linked by index objects. • Data and index objects are stored in their entirety as files within native file systems. • Objects are addressed through location-transparent GUIDs.
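The data organization model above can be illustrated with a minimal Python sketch. It is not Sorrento's implementation: the class names `DataObject` and `IndexObject`, the helpers `split_file` and `read_file`, the in-memory `store` dictionary, and the use of random hex strings as GUIDs are all assumptions made for illustration. It only shows a file split into variable-length segments that are linked by one index object and addressed by GUID.

```python
import uuid
from dataclasses import dataclass, field


@dataclass
class DataObject:
    guid: str          # location-transparent identifier (hypothetical format)
    payload: bytes     # one variable-length segment of the file


@dataclass
class IndexObject:
    guid: str
    segment_guids: list = field(default_factory=list)  # ordered links to data objects


def split_file(content: bytes, boundaries: list):
    """Split content at the given byte offsets and link the segments by an index object."""
    store = {}  # stands in for the native file systems that hold whole objects
    index = IndexObject(guid=uuid.uuid4().hex)
    start = 0
    for end in boundaries + [len(content)]:
        seg = DataObject(guid=uuid.uuid4().hex, payload=content[start:end])
        store[seg.guid] = seg
        index.segment_guids.append(seg.guid)
        start = end
    store[index.guid] = index
    return index.guid, store


def read_file(root_guid: str, store: dict) -> bytes:
    """Follow the index object and concatenate its segments in order."""
    index = store[root_guid]
    return b"".join(store[g].payload for g in index.segment_guids)
```

Calling `split_file(b"hello sorrento", [5, 6])` yields three segments plus one index object, and `read_file` on the returned root GUID reassembles the original bytes.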
Multi-level Data Consistency Model • Level 0: best effort without any guarantee. I/O operations may be reordered. • Level 1: time-ordered I/O operations. Missed writes may still be observed. • Level 2: open-to-close session consistency. The effects of multiple I/O operations within an open-to-close session are either ALL visible or NONE visible to others. A session may be aborted when there is a write/write conflict. • Level 3: adds file sharing and automatic conflict resolution on top of Level 2.
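One way to picture Level 2 (open-to-close session consistency) is the sketch below. The slides do not say how Sorrento detects conflicts, so the version counter, the names `VersionedFile`, `Session`, and `ConflictError`, and the abort-on-close policy are assumptions. Writes are buffered privately during the session and become visible all at once on close, or not at all if another session committed first.

```python
class ConflictError(Exception):
    """Raised when a session is aborted because of a write/write conflict."""


class VersionedFile:
    def __init__(self, data: bytes = b""):
        self.data = data
        self.version = 0  # bumped on every committed session (assumed mechanism)


class Session:
    def __init__(self, f: VersionedFile):
        self.file = f
        self.base_version = f.version     # version observed at open time
        self.buffer = bytearray(f.data)   # private copy; others cannot see it
        self.dirty = False

    def write(self, offset: int, chunk: bytes) -> None:
        end = offset + len(chunk)
        if end > len(self.buffer):
            self.buffer.extend(b"\0" * (end - len(self.buffer)))
        self.buffer[offset:end] = chunk
        self.dirty = True

    def read(self) -> bytes:
        return bytes(self.buffer)

    def close(self) -> None:
        if not self.dirty:
            return  # read-only sessions never conflict
        if self.file.version != self.base_version:
            # Someone else committed since we opened: NONE of our writes apply.
            raise ConflictError("write/write conflict; session aborted")
        # ALL buffered writes become visible together.
        self.file.data = bytes(self.buffer)
        self.file.version += 1
```

With two sessions opened on the same file, the first to close commits and the second raises `ConflictError`, leaving the file exactly as the first session wrote it.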
System Architecture • Proxy Module • Data location and placement. • Monitor multicast channel. • Server Module • Export local storage. • Namespace Server • Maintain a global directory tree. • Translate filenames to root-object GUIDs.
Accessing a File 1. Application wants to access /foo/bar 2. Ask for the GUID of “/foo/bar” 3. Get the GUID of “/foo/bar” 4. Determine the server to contact 5. Ask for the root object 6. Retrieve the data 7. Contact other servers if necessary 8. Close file
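The access steps above can be traced in a small in-memory sketch. Everything here is hypothetical: the `Cluster` class, the `locate` rule (hash of the GUID modulo the server count), and GUIDs derived from the path are stand-ins chosen to keep the example short, not the protocol Sorrento uses.

```python
import hashlib


def locate(guid: str, servers: list) -> str:
    """Hypothetical placement rule: hash the GUID onto the server list."""
    h = int(hashlib.sha1(guid.encode()).hexdigest(), 16)
    return servers[h % len(servers)]


class Cluster:
    def __init__(self, names):
        self.names = sorted(names)
        self.stores = {n: {} for n in self.names}  # per-server object store
        self.namespace = {}                        # namespace server: path -> root GUID

    def put(self, guid, obj):
        self.stores[locate(guid, self.names)][guid] = obj

    def get(self, guid):
        return self.stores[locate(guid, self.names)][guid]

    def create(self, path, segments):
        seg_guids = []
        for i, seg in enumerate(segments):
            guid = f"{path}#seg{i}"    # illustrative GUID, not a real format
            self.put(guid, seg)
            seg_guids.append(guid)
        root = f"{path}#root"
        self.put(root, seg_guids)      # the root object lists the segment GUIDs
        self.namespace[path] = root

    def read(self, path) -> bytes:
        root_guid = self.namespace[path]   # steps 2-3: filename -> root GUID
        seg_guids = self.get(root_guid)    # steps 4-5: pick server, fetch root object
        # steps 6-7: fetch each segment, possibly from other servers
        return b"".join(self.get(g) for g in seg_guids)
```

For example, after `c = Cluster(["s1", "s2", "s3"])` and `c.create("/foo/bar", [b"ab", b"cde"])`, the call `c.read("/foo/bar")` walks the same steps and returns the concatenated segments.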
Project Status • Distributed data placement and location protocol. To appear in SuperComputing 2003. • Prototype implementation done by summer 2003. • Production usage by end of 2003. • Project Web page: http://www.cs.ucsb.edu/~gulbeden/sorrento/
Evaluation • We are planning to use trace-driven evaluation. • Enables us to find problems without adding much to the system. • Performance of various applications can be measured without porting. • Allows us to reproduce and identify any potential problem. • Applications that can benefit from the system are: • Web crawler. • Protein sequence matching. • Parallel I/O applications.
Project Status and Development Plan • Most software modules are implemented, such as: consistent hashing, UDP request/response management, persistent hash table, file block cache, thread pool, and load statistics collection. • We are working on building a running prototype. • Milestones: • Barebone runtime system. • Add dynamic migration. • Add version-based data management and replication. • Add kernel VFS switch.
Conclusion • Project website • http://www.cs.ucsb.edu/~gulbeden/sorrento
Proxy Module • Consists of: • Dispatcher: listens for incoming requests. • Thread pool: processes requests from local applications. • Subscriber: monitors the multicast channel. • Stores: • A set of live hosts. • Address of the Namespace Server. • Set of opened file handles. • Accesses data by hashing GUID of the object.
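The proxy locates data by hashing an object's GUID over the set of live hosts, and the development plan lists consistent hashing among the implemented modules. The sketch below shows a generic consistent-hash ring of that kind. The `HashRing` class, the SHA-1 hash, and the `replicas` count of virtual points per host are assumptions for illustration; the slides do not describe Sorrento's actual scheme.

```python
import bisect
import hashlib


def _h(key: str) -> int:
    return int(hashlib.sha1(key.encode()).hexdigest(), 16)


class HashRing:
    def __init__(self, hosts, replicas: int = 64):
        self.replicas = replicas  # virtual points per host (assumed value)
        self._points = []         # sorted list of (hash, host)
        for host in hosts:
            self.add(host)

    def add(self, host: str) -> None:
        """A host joined: place its virtual points on the ring."""
        for i in range(self.replicas):
            bisect.insort(self._points, (_h(f"{host}:{i}"), host))

    def remove(self, host: str) -> None:
        """A host departed: drop its points; other hosts' points are untouched."""
        self._points = [p for p in self._points if p[1] != host]

    def lookup(self, guid: str) -> str:
        """Return the host owning the first point at or after the GUID's hash."""
        if not self._points:
            raise LookupError("no live hosts")
        i = bisect.bisect_left(self._points, (_h(guid), ""))
        return self._points[i % len(self._points)][1]
```

The property that matters for self-organization is that when a host leaves, only the GUIDs it owned move to other hosts; every other GUID keeps its owner, so a proxy that updates its set of live hosts continues to find most objects where they were.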
Server Module • Consists of: • Dispatcher: Listens for requests (UDP or TCP). • Thread Pool: Handles requests for local operations. • Local Storage: Stores the local data. • Stores: • Global block table partition. • INode Map. • Physical local store.
Choice I: SAN • Distributed and heterogeneous devices. • Dedicated fast network. • Storage virtualization. • Volume-based. • Each volume managed by a dedicated server. • Volume map.
Choice I: SAN (cont) • Cost Disadvantage • Scalability • Manageability • Change the volume map. • Reorganize data on the old volume. • Handling disk failures: • Exclude failed disks from volume maps. • Restore data to spare disks. • Conclusions: • Hard to automate. • Prone to human errors (at large scale).
SAN: Storage Area Networks • Distributed and heterogeneous devices. • Dedicated fast network. • Storage virtualization. • Volume-based. • Each volume managed by a dedicated server. • Volume map.
Management Challenges of SAN • Expanding an existing volume: • Change the volume map. • Reorganize data on the old volume. • Handling disk failures: • Exclude failed disks from volume maps. • Restore data to spare disks. • Conclusions: • Hard to automate. • Prone to human errors (at large scale).