490 likes | 601 Views
REDDnet : Enabling Data Intensive Science in the Wide Area. Alan Tackett Mat Binkley, Bobby Brown, Chris Brumgard , Santiago de ledesma , and Matt Heller Vanderbilt University. Overview. What is REDDnet ? and Why is it needed? Logistical Networking Distributed Hash Tables
E N D
REDDnet: Enabling Data Intensive Science in the Wide Area Alan Tackett Mat Binkley, Bobby Brown, Chris Brumgard, Santiago de ledesma, and Matt Heller Vanderbilt University
Overview • What is REDDnet? and Why is it needed? • Logistical Networking • Distributed Hash Tables • L-Store – Putting it all together This is a work in progress! Most of the metadata operations are in the prototype stage.
REDDnet: Distributed Storage Projecthttp://www.reddnet.org NSF funded project Currently have 500TB Currently 13 sites Multiple disciplines Sat imagery (AmericaView) HEP TerascaleSupernova Initative Bioinformatics Vanderbilt TV News Archive Large Synoptic Survey Telescope CMS-HI Goal: Providing working storage for distributed scientific collaborations.
“Data Explosion” • Focus on increasing bandwidth and raw storage • Assume metadata growth is minimal Metadata Raw storage
“Data Explosion”of both data and metadata • Focus on increasing bandwidth and raw storage • Assume metadata growth is minimal • Works great for large files • For collections of small files the metadata becomes the bottleneck • Need ability to scale metadata • ACCRE examples • Proteomics: 89,000,000 files totaling 300G • Genetics: 12,000,000 files totaling 50G in a single directory Metadata Raw storage
Design Requirements • Availability • Should survive a partial network outage • No downtimes, hot upgradable • Data and Metadata Integrity • End-to-end guarantees a must! • Performance • Metadata(transactions/s) • Data(MB/s) • Security • Fault Tolerance- Should survive multiple complete device failures • Metadata • Data Very different design requirements
Logistical Networking (LN) provides a “bits are bits” Infrastructure • Standardize on what we have an adequate common model for • Storage/buffer management • Coarse-grained data transfer • Leave everything else to higher layers • End-to-end services: checksums, encryption, error encoding, etc. • Enable autonomy in wide area service creation • Security, resource allocation, QoS guarantees • Gain the benefits of interoperability! One structure serves all
LoCITools*Logistical Computing and Internetworking Lab • IBP Internet Backplane Protocol • Middleware for managing and using remote storage • Allows advanced space and TIME reservation • Supports multiple connections/depot • User configurable block size • Designed to support large scale, distributed systems • Provides global “malloc()” and “free()” • End-to-end guarantees • Capabilities • Each allocation has separate Read/Write/Manage keys IBP is the “waist of the hourglass” for storage *http://loci.cs.utk.edu
What makes LN different?Based on a highly generic abstract block for storage(IBP)
What makes LN different?Based on a highly generic abstract block for storage(IBP) Storage Depot
IBP Server or Depot • Runs the ibp_server process • Resource • Unique ID • Separate data and metadata partitions (or directories) • Optionally can import metadata to SSD • Typically JBOD disk configuration • Heterogeneous disk sizes and types • Don’t have to use dedicated disks OS and optionally RID DBs
IBP functionality • Allocate • Reserve space for a limited time • Can create allocation Aliases to control access • Enable block level disk checksums to detect silent read errors • Manage allocations • INC/DEC allocation reference count (when 0 allocation is removed) • Extend duration • Change size • Query duration, size, and reference counts • Read/Write • Can use either append mode or use random access • Optionally use network checksums for end-to-end guarantees • Depot-Depot Transfers • Data can be either pushed or pulled to allow firewall traversal • Supports both append and random offsets • Supports network checksums for transfers • Pheobus support – I2 overlay routing to improve performance • Depot Status • Number of resources and their resource ID’s • Amount of free space • Software version
What is a “capability”? • Controls access to an IBP allocation • 3 separate caps/allocation • Read • Write • Manage - delete or modify an allocation • Alias or Proxy allocations supported • End user never has to see true capabilities • Can be revoked at any time
Alias Allocations • A temporary set of capabilities providing limited access to an existing allocation • Byte-range or full access is allowed • Multiple alias allocations can exist for the same primary allocation • Can be revoked at any time
Split/Merge Allocations • Split • Split an existing allocations space into two unique allocations • Data is not shared. The initial primary allocation is truncated. • Merge • Merge two allocations together with only the “primary” allocation’s capabilities surviving • Data is not merged. The primaries size is increased and the secondary allocation is removed. • Designed to prevent Denial of “Space” attacks • Allows implementation of user and group quotas • Split and Merge are atomic operations • Prevents another party from grabbing the space during the operation.
Depot-Depot TransfersCopy data from A to B Public IP Private IP Private IP Depot A Depot B Firewall Firewall
Depot-Depot TransfersCopy data from A to B Public IP Private IP Private IP Direct transfer gets blocked at the firewall! Depot A Depot B Firewall Firewall
Depot-Depot TransfersCopy data from A to B Public IP Private IP Private IP B issues a PULL A issues a PUSH Depot A Depot B Firewall Firewall
exNodeXML file containing metadata • Analogous to a disk I-node and contains • Allocations • How to assemble file • Fault tolerance encoding scheme • Encryption keys IBP Depots Network 0 100 200 300 A B C Replicated at different sites Replicated and striped Normal file
eXnode3 • Exnode • Collection of layouts • Different layouts can be used for versioning, replication, optimized access (row vs. column), etc. • Layout • Simple mappings for assembling the file • Layout offset, segment offset, length, segment • Segment • Collection of blocks with a predefined structure. • Type: Linear, striped, shifted stripe, RAID5, Generalized Reed-Solomon • Block • Simplest example is a full allocation • Can also be an allocation fragment
Segment types Linear Striped 1 1 2 3 4 Single logical device 1 2 3 4 1 1 2 3 4 Blocks can have unequal size 1 1 2 3 4 Multiple devices 1 Shifted Stripe Access is linear 1 2 3 4 4 1 2 3 3 4 1 2 2 3 4 1 Multiple devices using shifted stride between rows RAID5 and Reed-Solomon segments are variations of the Shifted Stripe
Optimize data layout for WAN performance Skip to a specific video frame • Have to read the header and each frame header • WAN latency and disk seeks kill performance 1 2 b 3 c 4 d 5 e header a Frame header and data header 1 2 3 4 5 a b c d e Optimized layout Header and Frame information • Separate header and data into separate segments • Layout contains mappings to present the traditional file view • All header data is contiguous minimizing disk seeks. • Easy to add additional frames Frame data
Latency Impact on Performance Synchronous Asynchronous Client Server Client Server Latency Latency Command Time
Asynchronous Internal architecture Operation • Each host has a dedicated queue • Multiple In-flight queues for each host (depends on load) • Send/Recv Queues are independent threads • Socket library must be thread safe! • Each operation has a weight • An In-flight queue has a max weight • If socket errors occur # of In-flight queues are adjusted for stability If socket error retry Global Host Queue In-flight Queue 1 Send Queue Recv Queue Operation completed
Asynchronous client interface • Similar to MPI_WaitAll() or MPI_WaitAny() Framework • All commands return handles • Execution can start immediately or be delayed • Failed ops can be tracked ibp_capset_t *create_allocs(intnallocs, intasize, ibp_depot_t *depot) { inti, err; ibp_attributes_tattr; oplist_t *oplist; ibp_op_t *op; //**Create caps list which is returned ** ibp_capset_t *caps = (ibp_capset_t *)malloc(sizeof(ibp_capset_t)*nallocs); //** Specify the allocations attributes ** set_ibp_attributes(&attr, time(NULL) + 900, IBP_HARD, IBP_BYTEARRAY); oplist = new_ibp_oplist(NULL); //**Create a new list of ops oplist_start_execution(oplist); //** Go on and start executing tasks. This could be done any time //*** Main loop for creating the allocation ops *** for (i=0; i<nallocs; i++) { op = new_ibp_alloc_op(&(caps[i]), asize, depot, &attr, ibp_timeout, NULL); //**This is the actual alloc op add_ibp_oplist(oplist, op); //** Now add it to the list and start execution } err = ibp_waitall(oplist); //** Now wait for them all to complete if (err != IBP_OK) { printf(“create_allocs: At least 1 error occured! * ibp_errno=%d * nfailed=%d\n”, err, ibp_oplist_nfailed(iolist)); } free_oplist(oplist); //** Free all the ops and oplist info return(caps); }
Distributed Hash Tables • Provides a mechanism to distribute a single hash table across multiple nodes • Used by many Peer-to-peer systems • Properties • Decentralized – No centralized control • Scalable – Most can support 100s to 1000s of nodes • Lots of implementations – Chord, Pastry, Tapestry, Kademlia, Apache Cassandra, etc. • Most differ by lookup() implementation • Key space is partitioned across all nodes • Each node keeps a local routing table, typically routing is O(log n).
N1 N56 N8 K54 K10 N14 N21 N42 K24 K30 N38 N32 Valid key range for N32 is 22-32 Chord Ring*Distributed Hash Table • Distributed Hash Table • Maps a Key (K##) - hash(name) • Nodes (N##) are distributed around the ring and are responsible for the keys “behind” them. • O(lg N) lookup *F. Dabek, et al., "Building Peer-to-Peer Systems With Chord, a Distributed Lookup Service." In Proceedings of the 8th Workshop on Hot Topics in Operating Systems (HotOS-VIII), May, 2001.
N1 N1 lookup(54) N56 N56 N8 N8 +1 K54 K10 +2 N51 N51 +4 N14 +32 +8 N48 N48 +16 N21 N21 N42 N42 K24 K30 N38 N38 N32 N32 Locating keys N14
Apache Cassandrahttp://cassandra.apache.org/ • Open source, actively developed, used by major companies • Created at and still used by Facebook, now an Apache project and also used by Digg, Twitter, Reddit, Rackspace, Cloudkick, Cisco • Decentralized • Every node is identical, no single points of failure • Scalable • Designed to scale across many machines to meet throughput or size needs (100+ node production systems) • Elastic Infrastructure • Nodes can be added and removed without downtime • Fault Tolerance through replication • Flexible replication policy, support for location aware policies (rack and data center) • Uses Apache Thrift for communication between clients and server nodes • Thift bindings are available for C++, Java, Python, PHP, Ruby, Erlang, Perl, Haskell, C#, Cocoa, Smalltalk, and Ocaml). Language specific client libraries also available.
Apache Cassandra http://cassandra.apache.org/ • High write speed • Use of RAM for write caching and sorting and append only disk writes (sequential writes are fast, seeking is slow, especially for mechanical hard drives) • Durable • Use of a commit log insures that data will not be lost if a node goes down (similar to filesystem journaling). • Elastic Data Organization • Schema-less, namespaces, tables, and columns can be added and removed on the fly • Operations support various consistency requirements • Reads/Writes of some data may favor high availability and performance and accept the risk of stale data. • Reads/Writes of other data may favor stronger consistency guarantees and sacrifice performance and availability when nodes fail • Does not support SQL and the wide range of query types of RDBMS • Limited atomicity and support for transactions (no commit, rollback, etc)
L-Storehttp://www.lstore.org • Generic object storage framework using tree based structure • Data and metadata scale independently • Large objects, typically file data, is stored in IBP depots • Metadata is stored in Apache Cassandra • Fault-tolerant in both data and metadata • No downtime for adding or removing storage or metadata nodes. • Heterogeneous disk sizes and type supported • Lifecycle management – can migrate data off old servers • Functionality can be easily extended via policies. • The L-Store core is just a routing framework. All commands, both system and user provided, are added in the same way.
L-Store Policy • A Policy is used to extend L-Store functionality and is comprised of 3 components – command, stored procedures, and services • Each can be broken into an ID, arguments, and execution handler • Arguments are stored in google protocol buffer messages • The ID is used for routing
Logistical Network Message Protocol • Provides a common communication framework • Message Format [4 byte command] [4 byte message size] [message] [data] • Command – Similar to an IPv4 octet (10.0.1.5) • Message size – 32-bit unsigned big endian value • Message – Google Protocol Buffer message • Designed to handle command options/variables • Entire message is always unpacked using internal buffer • Data – Optional field for large data blocks
Google Protocol Buffershttp://code.google.com/apis/protocolbuffers/ • Language-neutral mechanism for serializing data structures • Binary compressed format • Special “compiler” to generate code for packing/unpacking messages • Messages support most primitive data types, arrays, optional fields, and message nesting package activityLogger; message alog_file { required string name = 1; required uint32 startTime = 2; required uint32 stopTime = 3; required uint32 depotTime = 4; required sint64 messageSize = 5; }
L-Store Policies • Commands • Processes heavyweight operations, including multi-object locking coordination via locking service • Stored Procedures • Lightweight atomic operations on a single object. • Simple example is updating multiple key/value pairs on an object. • More sophisticated examples would test/operate on multiple key/value pairs and only perform the update on success • Services • Triggered via metadata changes made through stored procedures • Services register with a messaging queue or creates new queues listening for events. • Can trigger other L-Store commands • Can’t access metadata directly
L-Store ArchitectureClient • Client contacts L-Servers to issue commands • Data transfers are done directly via the depots with the L-Server providing the capabilities • Can also create locks if needed Arrow direction signifies which component initiates communication
L-Store ArchitectureL-Server • Essentially a routing layer for matching incoming commands to execution services • All persistent state information is stored in the metadata layer • Clients can contact multiple L-Servers to increase performance • Provides all Auth/AuthZ services • Controls access to metadata Arrow direction signifies which component initiates communication
L-Store ArchitectureStorCore* • From Nevoa Networks • Disk Resource Manager • Allows grouping of resources into Logical Volumes • A resource can be a member of multiple Dynamical Volumes • Tracks depot reliability and space Arrow direction signifies which component initiates communication *http://www.nevoanetworks.com/
L-Store ArchitectureMetadata • L-Store DB • Tree-based structure to organize objects • Arbitrary number of key/value pairs per object • All MD operations are atomic and only operate on a single object • Multiple key/value pairs can be updated simultaneously • Can be combined with testing (stored procedure) Arrow direction signifies which component initiates communication
L-Store ArchitectureDepot • Used to store large objects. Traditionally file contents. • IBP storage • Fault tolerance is done at the L-Store level. • Clients directly access the depots • Depot-to-depot transfers are supported. • Multiple depot implementations • ACCRE depot* is actively being developed Arrow direction signifies which component initiates communication *http://www.lstore.org/pwiki/pmwiki.php?n=Docs.IBPServer
L-Store ArchitectureMessaging Service • Event notification system • Uses the Advanced Messaging Queuing Protocol • L-Store provides a lightweight layer to abstract all queue interaction Arrow direction signifies which component initiates communication
L-Store ArchitectureServices • Mainly handles events triggered by metadata changes • L-Server commands and other services can also trigger events. • Typically services are designed to handle operations that are not time sensitive. Arrow direction signifies which component initiates communication
L-Store ArchitectureLock Manager • Variation of the Cassandra metadata nodes but focusing on providing a simple locking mechanism. • Each lock is of limited duration, just a few seconds, and must be continually renewed. • Guarantees that a dead client doesn’t lock up the entire system • To prevent deadlock a client must present all objects to be locked in a single call. Arrow direction signifies which component initiates communication
Current functionality • Basic functionality • Ls, cp, mkdir, rmdir, etc • Augment – Create an additional copy using a different LUN • Trim – Remove a LUN copy • Add/Remove arbitrary attributes • RAID5 and Linear stripe supported • FUSE mount for Linux and OSX • Used for CMS-HI • Basic messaging and services • L-Sync, similar to DropBox • Tape archive or HSM integration
3 GB/s 30 Mins L-Store Performance circa 2004 • Multiple simultaneous writes to 24 depots. • Each depot is a 3 TB disk server in a 1U case. • 30 clients on separate systems uploading files. • Rate has scaled linearly as depots added. • Latest generation depot is 72TB with 18Gb/s sustained transfer rates
Summary • REDDnet provides short term working storage • Scalable in both metadata (op/s) and storage (MB/s) • Block level abstraction using IBP • Apache Cassandra for metadata scalability • Flexible framework supporting the addition of user operations