
REDDnet : Enabling Data Intensive Science in the Wide Area


Presentation Transcript


  1. REDDnet: Enabling Data Intensive Science in the Wide Area Alan Tackett, Mat Binkley, Bobby Brown, Chris Brumgard, Santiago de Ledesma, and Matt Heller – Vanderbilt University

  2. Overview • What is REDDnet? and Why is it needed? • Logistical Networking • Distributed Hash Tables • L-Store – Putting it all together This is a work in progress! Most of the metadata operations are in the prototype stage.

  3. REDDnet: Distributed Storage Project http://www.reddnet.org • NSF-funded project • Currently 500 TB at 13 sites • Multiple disciplines: satellite imagery (AmericaView), HEP, Terascale Supernova Initiative, bioinformatics, Vanderbilt TV News Archive, Large Synoptic Survey Telescope, CMS-HI • Goal: Provide working storage for distributed scientific collaborations

  4. “Data Explosion” • Focus on increasing bandwidth and raw storage • Assume metadata growth is minimal (Chart: raw storage vs. metadata growth)

  5. “Data Explosion” of both data and metadata • Focus on increasing bandwidth and raw storage • Assume metadata growth is minimal • Works great for large files • For collections of small files the metadata becomes the bottleneck • Need the ability to scale metadata • ACCRE examples • Proteomics: 89,000,000 files totaling 300 GB • Genetics: 12,000,000 files totaling 50 GB in a single directory (Chart: raw storage vs. metadata growth)

  6. Design Requirements • Availability • Should survive a partial network outage • No downtime, hot upgradable • Data and metadata integrity • End-to-end guarantees are a must! • Performance • Metadata (transactions/s) • Data (MB/s) • Security • Fault tolerance – should survive multiple complete device failures • Metadata and data have very different design requirements

  7. Logistical Networking

  8. Logistical Networking (LN) provides a “bits are bits” infrastructure • Standardize only what we have an adequate common model for • Storage/buffer management • Coarse-grained data transfer • Leave everything else to higher layers • End-to-end services: checksums, encryption, error encoding, etc. • Enable autonomy in wide-area service creation • Security, resource allocation, QoS guarantees • Gain the benefits of interoperability! One structure serves all

  9. LoCI Tools* – Logistical Computing and Internetworking Lab • IBP – Internet Backplane Protocol • Middleware for managing and using remote storage • Allows advance reservation of space and time • Supports multiple connections per depot • User-configurable block size • Designed to support large-scale, distributed systems • Provides global “malloc()” and “free()” • End-to-end guarantees • Capabilities • Each allocation has separate Read/Write/Manage keys • IBP is the “waist of the hourglass” for storage *http://loci.cs.utk.edu

  10. What makes LN different? Based on a highly generic abstract block for storage (IBP)

  11. What makes LN different? Based on a highly generic abstract block for storage (IBP) (Figure: Storage Depot)

  12. IBP Server or Depot • Runs the ibp_server process • Resource • Unique ID • Separate data and metadata partitions (or directories) • Optionally can import metadata to SSD • Typically a JBOD disk configuration • Heterogeneous disk sizes and types • Don’t have to use dedicated disks (Figure label: OS disk, optionally holding the RID DBs)

  13. IBP functionality • Allocate • Reserve space for a limited time • Can create allocation aliases to control access • Enable block-level disk checksums to detect silent read errors • Manage allocations • INC/DEC allocation reference count (when it reaches 0 the allocation is removed) • Extend duration • Change size • Query duration, size, and reference counts • Read/Write • Can use either append mode or random access • Optionally use network checksums for end-to-end guarantees • Depot-Depot transfers • Data can be either pushed or pulled to allow firewall traversal • Supports both append and random offsets • Supports network checksums for transfers • Phoebus support – I2 overlay routing to improve performance • Depot status • Number of resources and their resource IDs • Amount of free space • Software version

  14. What is a “capability”? • Controls access to an IBP allocation • 3 separate caps per allocation • Read • Write • Manage – delete or modify an allocation • Alias or proxy allocations supported • End user never has to see the true capabilities • Can be revoked at any time
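To make the capability model concrete, here is a minimal illustrative sketch in C; the type and field names are assumptions, not the actual IBP client structures. Each allocation carries three separate capabilities, and an alias restricts access to a byte range of a primary allocation with its own revocable caps.

/* Illustrative sketch only -- not the actual IBP client structures. */
typedef struct {
    char *read;    /* grants read access to the allocation's bytes       */
    char *write;   /* grants append/random-access writes                 */
    char *manage;  /* grants delete, resize, refcount, duration changes  */
} cap_set_t;

/* An alias allocation hands the end user its own caps plus a byte range,
 * while referring back to the primary allocation it restricts. */
typedef struct {
    cap_set_t caps;     /* caps the end user actually sees            */
    long      offset;   /* start of the permitted byte range          */
    long      length;   /* length of the permitted byte range         */
} alias_alloc_t;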

  15. Alias Allocations • A temporary set of capabilities providing limited access to an existing allocation • Byte-range or full access is allowed • Multiple alias allocations can exist for the same primary allocation • Can be revoked at any time

  16. Split/Merge Allocations • Split • Splits an existing allocation’s space into two unique allocations • Data is not shared. The initial primary allocation is truncated. • Merge • Merges two allocations together, with only the “primary” allocation’s capabilities surviving • Data is not merged. The primary’s size is increased and the secondary allocation is removed. • Designed to prevent Denial of “Space” attacks • Allows implementation of user and group quotas • Split and Merge are atomic operations • Prevents another party from grabbing the space during the operation

  17. Depot-Depot Transfers – Copy data from A to B (Figure: Depot A and Depot B each sit on a private IP behind a firewall, with the public IP network between them)

  18. Depot-Depot Transfers – Copy data from A to B • A direct transfer gets blocked at the firewall!

  19. Depot-Depot Transfers – Copy data from A to B • A issues a PUSH, or B issues a PULL, so the connection originates behind the firewall

  20. exNode – XML file containing metadata • Analogous to a disk i-node and contains • Allocations • How to assemble the file • Fault-tolerance encoding scheme • Encryption keys (Figure: three exNodes A, B, and C spanning byte ranges 0–300 over IBP depots across the network – one a normal file, one replicated at different sites, one replicated and striped)

  21. eXnode3 • Exnode – Collection of layouts – Different layouts can be used for versioning, replication, optimized access (row vs. column), etc. • Layout – Simple mappings for assembling the file – Layout offset, segment offset, length, segment • Segment – Collection of blocks with a predefined structure – Type: Linear, striped, shifted stripe, RAID5, Generalized Reed-Solomon • Block – Simplest example is a full allocation – Can also be an allocation fragment
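A minimal sketch of the exNode3 hierarchy described above, assuming illustrative C type names (these are not the real L-Store headers): a block references all or part of an IBP allocation, a segment groups blocks under a structure type, a layout maps file offsets onto segments, and the exnode collects layouts.

/* Illustrative sketch of the exNode3 hierarchy; names are assumptions. */
typedef struct {                 /* Block: a full allocation or a fragment */
    char *read_cap;              /* IBP read capability for the data       */
    long  offset, length;        /* fragment within the allocation         */
} block_t;

typedef enum { SEG_LINEAR, SEG_STRIPED, SEG_SHIFTED_STRIPE,
               SEG_RAID5, SEG_REED_SOLOMON } seg_type_t;

typedef struct {                 /* Segment: blocks with a predefined structure */
    seg_type_t type;
    int        n_blocks;
    block_t   *blocks;
} segment_t;

typedef struct {                 /* One mapping inside a layout */
    long       layout_offset;    /* where this piece sits in the file view */
    long       segment_offset;   /* where it starts inside the segment     */
    long       length;
    segment_t *segment;
} mapping_t;

typedef struct {                 /* Layout: mappings that assemble the file */
    int        n_mappings;
    mapping_t *mappings;
} layout_t;

typedef struct {                 /* Exnode: a collection of layouts          */
    int       n_layouts;         /* versions, replicas, optimized views, ... */
    layout_t *layouts;
} exnode_t;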

  22. Segment types • Linear – Single logical device; access is linear; blocks can have unequal size • Striped – Blocks 1 2 3 4 distributed round-robin across multiple devices • Shifted Stripe – Multiple devices using a shifted stride between rows (1 2 3 4 / 4 1 2 3 / 3 4 1 2 / 2 3 4 1) • RAID5 and Reed-Solomon segments are variations of the Shifted Stripe
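A small sketch, under the assumption that the shifted stripe rotates each row by one device as in the 4-device pattern above; it maps a logical block index to a (device, row) pair. The function name is illustrative, not part of L-Store.

#include <stdio.h>

/* Sketch: place logical block b of a shifted-stripe segment over n_dev
 * devices. Row r is shifted by r, so consecutive rows start on
 * different devices, matching the 4-device pattern on the slide. */
static void shifted_stripe(long block, int n_dev, int *device, long *row)
{
    *row    = block / n_dev;
    *device = (int)((block + *row) % n_dev);   /* shift grows with the row */
}

int main(void)
{
    int  dev;
    long row;
    for (long b = 0; b < 8; b++) {
        shifted_stripe(b, 4, &dev, &row);
        printf("logical block %ld -> device %d, row %ld\n", b, dev, row);
    }
    return 0;
}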

  23. Optimize data layout for WAN performance • Example: skip to a specific video frame • Have to read the file header and each frame header • WAN latency and disk seeks kill performance • Traditional layout: the header and per-frame headers (1–5) are interleaved with the frame data (a–e) • Optimized layout: separate header and data into separate segments • Layout contains mappings to present the traditional file view • All header data is contiguous, minimizing disk seeks • Easy to add additional frames

  24. Latency Impact on Performance (Figure: client–server command timelines for synchronous vs. asynchronous operation – the synchronous client pays the round-trip latency for every command, while the asynchronous client keeps multiple commands in flight)

  25. Asynchronous internal architecture • Each host has a dedicated queue • Multiple in-flight queues for each host (depends on load) • Send/Recv queues are independent threads • Socket library must be thread safe! • Each operation has a weight • An in-flight queue has a maximum weight • If socket errors occur, the number of in-flight queues is adjusted for stability (Figure: an operation flows from the global host queue to an in-flight queue’s send and receive queues; on a socket error it is retried, otherwise it completes)

  26. Asynchronous client interface • Similar to an MPI_Waitall()/MPI_Waitany() framework • All commands return handles • Execution can start immediately or be delayed • Failed ops can be tracked

ibp_capset_t *create_allocs(int nallocs, int asize, ibp_depot_t *depot)
{
   int i, err;
   ibp_attributes_t attr;
   oplist_t *oplist;
   ibp_op_t *op;

   //** Create the caps list which is returned **
   ibp_capset_t *caps = (ibp_capset_t *)malloc(sizeof(ibp_capset_t)*nallocs);

   //** Specify the allocations' attributes **
   set_ibp_attributes(&attr, time(NULL) + 900, IBP_HARD, IBP_BYTEARRAY);

   oplist = new_ibp_oplist(NULL);    //** Create a new list of ops
   oplist_start_execution(oplist);   //** Go ahead and start executing tasks. This could be done at any time.

   //*** Main loop for creating the allocation ops ***
   for (i=0; i<nallocs; i++) {
      op = new_ibp_alloc_op(&(caps[i]), asize, depot, &attr, ibp_timeout, NULL);  //** This is the actual alloc op
      add_ibp_oplist(oplist, op);    //** Now add it to the list for execution
   }

   err = ibp_waitall(oplist);        //** Now wait for them all to complete
   if (err != IBP_OK) {
      printf("create_allocs: At least 1 error occurred! * ibp_errno=%d * nfailed=%d\n",
             err, ibp_oplist_nfailed(oplist));
   }

   free_oplist(oplist);              //** Free all the ops and oplist info
   return(caps);
}

  27. Distributed Hash Tables

  28. Distributed Hash Tables • Provide a mechanism to distribute a single hash table across multiple nodes • Used by many peer-to-peer systems • Properties • Decentralized – no centralized control • Scalable – most can support 100s to 1000s of nodes • Lots of implementations – Chord, Pastry, Tapestry, Kademlia, Apache Cassandra, etc. • Most differ by their lookup() implementation • Key space is partitioned across all nodes • Each node keeps a local routing table; routing is typically O(log n)

  29. Chord Ring* – Distributed Hash Table • Maps a key (K##) = hash(name) onto the ring • Nodes (N##) are distributed around the ring and are responsible for the keys “behind” them • Example: the valid key range for N32 is 22–32 • O(lg N) lookup (Figure: ring with nodes N1, N8, N14, N21, N32, N38, N42, N56 and keys K10, K24, K30, K54) *F. Dabek, et al., “Building Peer-to-Peer Systems With Chord, a Distributed Lookup Service.” In Proceedings of the 8th Workshop on Hot Topics in Operating Systems (HotOS-VIII), May 2001.

  30. Locating keys (Figure: lookup(54) on the same ring, now also showing N48 and N51 – each node’s finger table points to the successors of points +1, +2, +4, +8, +16, and +32 away, so the query for K54 reaches its successor N56 in O(lg N) hops)
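A toy sketch of Chord-style key location using the node IDs from the figure. A real node would follow its finger table (+1, +2, +4, ... entries) to reach the successor in O(lg N) hops; this in-memory simulation simply scans the sorted ring to show which node owns each key.

#include <stdio.h>

/* Node IDs from the example ring; a key belongs to its successor,
 * i.e. the first node clockwise from it. */
static const int nodes[] = {1, 8, 14, 21, 32, 38, 42, 48, 51, 56};
#define NNODES ((int)(sizeof(nodes) / sizeof(nodes[0])))
#define RING   64                      /* 6-bit identifier space */

static int successor(int key)
{
    key %= RING;
    for (int i = 0; i < NNODES; i++)
        if (nodes[i] >= key)
            return nodes[i];
    return nodes[0];                   /* wrap around the ring */
}

int main(void)
{
    printf("key 54 -> node N%d\n", successor(54));   /* N56 */
    printf("key 24 -> node N%d\n", successor(24));   /* N32 */
    printf("key 30 -> node N%d\n", successor(30));   /* N32 */
    return 0;
}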

  31. Apache Cassandra http://cassandra.apache.org/ • Open source, actively developed, used by major companies • Created at and still used by Facebook; now an Apache project also used by Digg, Twitter, Reddit, Rackspace, Cloudkick, Cisco • Decentralized • Every node is identical, no single points of failure • Scalable • Designed to scale across many machines to meet throughput or size needs (100+ node production systems) • Elastic infrastructure • Nodes can be added and removed without downtime • Fault tolerance through replication • Flexible replication policy, support for location-aware policies (rack and data center) • Uses Apache Thrift for communication between clients and server nodes • Thrift bindings are available for C++, Java, Python, PHP, Ruby, Erlang, Perl, Haskell, C#, Cocoa, Smalltalk, and OCaml; language-specific client libraries are also available

  32. Apache Cassandra http://cassandra.apache.org/ • High write speed • Uses RAM for write caching and sorting, plus append-only disk writes (sequential writes are fast, seeking is slow, especially for mechanical hard drives) • Durable • A commit log ensures that data will not be lost if a node goes down (similar to filesystem journaling) • Elastic data organization • Schema-less: namespaces, tables, and columns can be added and removed on the fly • Operations support various consistency requirements • Reads/writes of some data may favor high availability and performance and accept the risk of stale data • Reads/writes of other data may favor stronger consistency guarantees and sacrifice performance and availability when nodes fail • Does not support SQL or the wide range of query types of an RDBMS • Limited atomicity and support for transactions (no commit, rollback, etc.)

  33. L-Store – Putting it all together

  34. L-Store http://www.lstore.org • Generic object storage framework using a tree-based structure • Data and metadata scale independently • Large objects, typically file data, are stored in IBP depots • Metadata is stored in Apache Cassandra • Fault-tolerant in both data and metadata • No downtime for adding or removing storage or metadata nodes • Heterogeneous disk sizes and types supported • Lifecycle management – can migrate data off old servers • Functionality can be easily extended via policies • The L-Store core is just a routing framework. All commands, both system and user provided, are added in the same way.

  35. L-Store Policy • A Policy is used to extend L-Store functionality and is composed of 3 components – commands, stored procedures, and services • Each can be broken into an ID, arguments, and an execution handler • Arguments are stored in Google Protocol Buffer messages • The ID is used for routing (see the sketch below)
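A minimal sketch of that routing idea: the core keeps a table from IDs to registered execution handlers and passes the packed Protocol Buffer arguments through unchanged. All names and the sample ID are illustrative, not the L-Store API.

#include <stdint.h>
#include <stdio.h>

/* An execution handler receives the opaque, already-packed arguments. */
typedef int (*handler_fn)(const void *args, size_t args_len);

static int do_mkdir(const void *args, size_t args_len)
{
    (void)args; (void)args_len;        /* a real handler would unpack the protobuf */
    return 0;
}

typedef struct {
    uint32_t   id;                     /* command / stored-procedure / service ID */
    handler_fn handler;                /* handler registered by a policy          */
} route_t;

static route_t routes[] = {
    { 0x0A000105u, do_mkdir },         /* hypothetical ID, e.g. 10.0.1.5 */
};

static int dispatch(uint32_t id, const void *args, size_t args_len)
{
    for (size_t i = 0; i < sizeof(routes) / sizeof(routes[0]); i++)
        if (routes[i].id == id)
            return routes[i].handler(args, args_len);
    fprintf(stderr, "no handler registered for ID 0x%08x\n", (unsigned)id);
    return -1;
}

int main(void)
{
    return dispatch(0x0A000105u, NULL, 0);
}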

  36. Logistical Network Message Protocol • Provides a common communication framework • Message format: [4-byte command] [4-byte message size] [message] [data] • Command – similar to an IPv4 octet notation (e.g. 10.0.1.5) • Message size – 32-bit unsigned big-endian value • Message – Google Protocol Buffer message • Designed to handle command options/variables • The entire message is always unpacked using an internal buffer • Data – optional field for large data blocks
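A small sketch of framing a message in this format, with a hypothetical helper name: the 4-byte command is written as-is, the message length goes out as a 32-bit big-endian value, and the optional data block simply follows the frame on the wire.

#include <arpa/inet.h>   /* htonl */
#include <stdint.h>
#include <string.h>

/* Write [4-byte command][4-byte big-endian size][message] into out and
 * return the number of bytes produced. Any [data] block follows separately. */
size_t lnmp_frame(uint8_t *out, const uint8_t cmd[4],
                  const uint8_t *msg, uint32_t msg_len)
{
    uint32_t be_len = htonl(msg_len);      /* 32-bit unsigned, big-endian */
    memcpy(out,     cmd,     4);           /* [4 byte command]            */
    memcpy(out + 4, &be_len, 4);           /* [4 byte message size]       */
    memcpy(out + 8, msg,     msg_len);     /* [message] (packed protobuf) */
    return 8 + (size_t)msg_len;
}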

  37. Google Protocol Buffers http://code.google.com/apis/protocolbuffers/ • Language-neutral mechanism for serializing data structures • Binary compressed format • Special “compiler” to generate code for packing/unpacking messages • Messages support most primitive data types, arrays, optional fields, and message nesting

package activityLogger;

message alog_file {
  required string name        = 1;
  required uint32 startTime   = 2;
  required uint32 stopTime    = 3;
  required uint32 depotTime   = 4;
  required sint64 messageSize = 5;
}

  38. L-Store Policies • Commands • Process heavyweight operations, including multi-object locking coordinated via the locking service • Stored procedures • Lightweight atomic operations on a single object • A simple example is updating multiple key/value pairs on an object • More sophisticated examples would test/operate on multiple key/value pairs and only perform the update on success • Services • Triggered via metadata changes made through stored procedures • Services register with a messaging queue or create new queues listening for events • Can trigger other L-Store commands • Can’t access metadata directly

  39. L-Store Architecture – Client • Client contacts L-Servers to issue commands • Data transfers are done directly via the depots, with the L-Server providing the capabilities • Can also create locks if needed (Arrow direction signifies which component initiates communication)

  40. L-Store Architecture – L-Server • Essentially a routing layer for matching incoming commands to execution services • All persistent state information is stored in the metadata layer • Clients can contact multiple L-Servers to increase performance • Provides all Auth/AuthZ services • Controls access to metadata (Arrow direction signifies which component initiates communication)

  41. L-Store Architecture – StorCore* • From Nevoa Networks • Disk resource manager • Allows grouping of resources into Logical Volumes • A resource can be a member of multiple Dynamical Volumes • Tracks depot reliability and space (Arrow direction signifies which component initiates communication) *http://www.nevoanetworks.com/

  42. L-Store Architecture – Metadata • L-Store DB • Tree-based structure to organize objects • Arbitrary number of key/value pairs per object • All MD operations are atomic and only operate on a single object • Multiple key/value pairs can be updated simultaneously • Can be combined with testing (stored procedure) (Arrow direction signifies which component initiates communication)

  43. L-Store Architecture – Depot • Used to store large objects, traditionally file contents • IBP storage • Fault tolerance is done at the L-Store level • Clients directly access the depots • Depot-to-depot transfers are supported • Multiple depot implementations • The ACCRE depot* is actively being developed (Arrow direction signifies which component initiates communication) *http://www.lstore.org/pwiki/pmwiki.php?n=Docs.IBPServer

  44. L-Store Architecture – Messaging Service • Event notification system • Uses the Advanced Message Queuing Protocol (AMQP) • L-Store provides a lightweight layer to abstract all queue interaction (Arrow direction signifies which component initiates communication)

  45. L-Store Architecture – Services • Mainly handles events triggered by metadata changes • L-Server commands and other services can also trigger events • Typically services are designed to handle operations that are not time sensitive (Arrow direction signifies which component initiates communication)

  46. L-Store Architecture – Lock Manager • A variation of the Cassandra metadata nodes, but focused on providing a simple locking mechanism • Each lock is of limited duration, just a few seconds, and must be continually renewed • Guarantees that a dead client doesn’t lock up the entire system • To prevent deadlock, a client must present all objects to be locked in a single call (Arrow direction signifies which component initiates communication)
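A client-side sketch of the lease pattern described above, with placeholder lock_* functions standing in for calls to the lock manager (they are not the actual L-Store API, and the object names are made up): all objects are locked in one call, and the short lease is renewed while the work proceeds. A client that dies simply stops renewing, so the manager reclaims the locks.

#include <stdbool.h>
#include <stdio.h>
#include <unistd.h>

/* Hypothetical stand-ins for calls to the lock manager. */
static bool lock_acquire_all(const char **objs, int n, int lease_secs)
{
    (void)objs;
    printf("locking %d objects for %d s\n", n, lease_secs);
    return true;
}
static bool lock_renew(void)       { puts("lease renewed");  return true; }
static void lock_release_all(void) { puts("locks released"); }

int main(void)
{
    const char *objs[] = { "objectA", "objectB" };   /* hypothetical object names */
    const int lease = 5;                             /* seconds; leases are deliberately short */

    if (!lock_acquire_all(objs, 2, lease))           /* all objects in one call: avoids deadlock */
        return 1;
    for (int step = 0; step < 3; step++) {           /* do the multi-object work in pieces */
        sleep((unsigned)lease / 2);                  /* ...work on the locked objects...   */
        if (!lock_renew())                           /* renew well before the lease expires */
            return 1;                                /* lost the lease: the locks are gone  */
    }
    lock_release_all();
    return 0;
}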

  47. Current functionality • Basic functionality • ls, cp, mkdir, rmdir, etc. • Augment – create an additional copy using a different LUN • Trim – remove a LUN copy • Add/remove arbitrary attributes • RAID5 and linear stripe supported • FUSE mount for Linux and OS X • Used for CMS-HI • Basic messaging and services • L-Sync, similar to Dropbox • Tape archive or HSM integration

  48. L-Store Performance circa 2004 (Chart: roughly 3 GB/s aggregate sustained over 30 minutes) • Multiple simultaneous writes to 24 depots • Each depot is a 3 TB disk server in a 1U case • 30 clients on separate systems uploading files • Rate has scaled linearly as depots were added • The latest-generation depot is 72 TB with 18 Gb/s sustained transfer rates

  49. Summary • REDDnet provides short term working storage • Scalable in both metadata (op/s) and storage (MB/s) • Block level abstraction using IBP • Apache Cassandra for metadata scalability • Flexible framework supporting the addition of user operations
