myS3. Fabrizio Manfredi Furuholmen, Federico Mosca. Agenda: Introduction, Goals, Principles, myS3, Architecture, Internals, Sub project, Conclusion, Developments. Unsolved problem. Web Interface.
Agenda • Introduction • Goals • Principles • myS3 • Architecture • Internals • Sub project • Conclusion • Developments
Web Interface “Amazon S3 provides a simple web-services interface that can be used to store and retrieve any amount of data, at any time, from anywhere on the web…”
S3 • Every file you upload to Amazon S3 is stored in a container called a bucket. • Each bucket name should be unique. • Each bucket can contain an unlimited number of objects (key/value). • Buckets cannot be nested: you cannot create a bucket within a bucket. • Object: Id, Version, Metadata, Subresources, ACL • HTTP REST calls • Byte-range transfer • Parallel transfer
myS3 Translates S3 requests into operations on local disk
Mapping • An S3 bucket is a directory in the AFS space • An S3 object is a file or a directory, the directory • S3 ACL: fake object; AFS ACL permissions are returned as S3 metadata, and Unix permissions are returned as S3 metadata • All other S3 features are not implemented
S3 Request
Objects in the same bucket have no relation to each other: the namespace is not hierarchical.
GET /mybucket/puppy.jpg
GET /mybucket/yesterday/puppy.jpg (“yesterday” doesn’t exist)
GET /mybucket/puppy.jpg HTTP/1.1
User-Agent: dotnet
Host: s3.amazonaws.com
Date: Tue, 15 Jan 2008 21:20:27 +0000
x-amz-date: Tue, 15 Jan 2008 21:20:27 +0000
Authorization: AWS AKIAIOSFODNN7EXAMPLE:k3nL7gH3+PadhTEVn5EXAMPLE
S3 Request • To retrieve directory content: • Prefix: the parent directory • Delimiter: ‘/’ to end the name • To create a directory: • Object name with ‘/’ at the end
<ListBucketResult xmlns="http://s3.amazonaws.com/doc/2006-03-01/">
<Name>ExampleBucket</Name>
<Prefix>/mydir/</Prefix>
<Marker></Marker>
<MaxKeys>1000</MaxKeys>
<Delimiter>/</Delimiter>
<IsTruncated>false</IsTruncated>
<Contents>
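The prefix/delimiter rule above can be sketched in a few lines of Python. This is only an illustration of how a flat key space can be presented as directories; `list_keys` and the sample keys are hypothetical names, not part of myS3, and the keys omit the leading slash shown in the XML sample.

```python
def list_keys(keys, prefix="", delimiter="/"):
    """Emulate a bucket listing over a flat key set.

    Returns (contents, common_prefixes): keys directly "inside" the prefix,
    and the rolled-up "subdirectories" below it.
    """
    contents, common = [], set()
    for key in sorted(keys):
        if not key.startswith(prefix):
            continue
        rest = key[len(prefix):]
        if delimiter and delimiter in rest:
            # Everything below the first delimiter collapses into one entry.
            common.add(prefix + rest.split(delimiter, 1)[0] + delimiter)
        else:
            contents.append(key)
    return contents, sorted(common)


# Hypothetical bucket content: "mydir/" is a directory marker object.
keys = ["puppy.jpg", "mydir/", "mydir/a.txt", "mydir/sub/b.txt"]
contents, prefixes = list_keys(keys, prefix="mydir/")
print(contents)   # keys directly under mydir/
print(prefixes)   # rolled-up subdirectories
```

Listing with an empty prefix returns only the top-level objects plus one common prefix per top-level "directory".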
AWS Auth
Authorization = "AWS" + " " + AWSAccessKeyId + ":" + Signature;
Signature = Base64( HMAC-SHA1( YourSecretAccessKeyID, UTF-8-Encoding-Of( StringToSign ) ) );
StringToSign = HTTP-Verb + "\n" + Content-MD5 + "\n" + Content-Type + "\n" + Date + "\n" + CanonicalizedAmzHeaders + CanonicalizedResource;
CanonicalizedResource = [ "/" + Bucket ] + <HTTP-Request-URI, from the protocol name up to the query string> + [ subresource, if present. For example "?acl", "?location", "?logging", or "?torrent" ];
CanonicalizedAmzHeaders = <described below>
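A minimal Python sketch of the signing rules above, using only the standard library. The function names and the secret key are made up for illustration, and the canonicalization of the `x-amz-*` headers is left to the caller (passed in as a ready-made string).

```python
import base64
import hashlib
import hmac


def string_to_sign(verb, content_md5, content_type, date, amz_headers, resource):
    """Build StringToSign exactly as laid out in the grammar above."""
    return "\n".join([verb, content_md5, content_type, date]) + "\n" + amz_headers + resource


def sign(secret_key, to_sign):
    """Signature = Base64(HMAC-SHA1(secret, UTF-8(StringToSign)))."""
    digest = hmac.new(secret_key.encode("utf-8"), to_sign.encode("utf-8"), hashlib.sha1).digest()
    return base64.b64encode(digest).decode("ascii")


def authorization_header(access_key, secret_key, to_sign):
    return "AWS " + access_key + ":" + sign(secret_key, to_sign)


# Hypothetical request: a plain GET with no Content-MD5, Content-Type or x-amz-* headers.
sts = string_to_sign("GET", "", "", "Tue, 15 Jan 2008 21:20:27 +0000", "", "/mybucket/puppy.jpg")
print(authorization_header("AKIAIOSFODNN7EXAMPLE", "not-a-real-secret", sts))
```

A server checks a request by rebuilding the same string from the incoming headers, signing it with the stored secret key, and comparing the result with the signature in the `Authorization` header.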
Authentication • IP based: computer account; user authentication is handled by the internal DB • Impersonate: the ticket for the user is forged on the server side; authentication is handled by the internal DB • Token generation: web interface authentication (kbrauth), one-time AWS token generation
Server Architecture (diagram) • Interface: S3 Interface, Web Interface • Managers: Bucket Manager, Storage Manager, Token Manager, Auth Manager • Drivers (plugin): Storage Driver, Cache • /afs
InternalDB • Bucket DB: contains the map between the bucket name and the AFS path, e.g. Myhome -> /afs/beolink/home/manfred • Token DB: contains the access key and secret key for Amazon authentication; with web-based authentication the DB contains the Kerberos token
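As a rough sketch of how a Bucket DB lookup could turn an S3 request into an AFS path: the dictionary below holds the single entry from the slide, while `resolve` and the traversal check are assumptions made for illustration, not the actual myS3 code.

```python
import posixpath

# Bucket DB: bucket name -> AFS path (entry taken from the slide).
BUCKET_DB = {"Myhome": "/afs/beolink/home/manfred"}


def resolve(bucket, key, bucket_db=BUCKET_DB):
    """Map (bucket, key) to a path under the bucket's AFS root.

    Raises KeyError for an unknown bucket and ValueError when the key
    would escape the bucket root (for example via "..").
    """
    root = bucket_db[bucket]
    path = posixpath.normpath(posixpath.join(root, key.lstrip("/")))
    if path != root and not path.startswith(root + "/"):
        raise ValueError("key escapes the bucket root")
    return path


print(resolve("Myhome", "yesterday/puppy.jpg"))
```

A key ending in ‘/’ resolves to the directory itself, which matches the convention of creating directories through object names with a trailing slash.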
Storage Manager • NFS style: most operations are performed on a temporary file (.NFSXXX) • Caching: save the temporary file in non-AFS space • NoWait: return OK as soon as the file is on the S3 server • Mem: keep the transferred file in memory (max 100MB) • ACL: enable write operations on AFS ACLs • MD5: enable or disable MD5
TODO • Parallel transfer • Locking • Kerberos token based • Chunk transfer (HTTP 100) / byte-range transfer • Create an interface for CloudStack • Automatic volume release
GOAL Create a framework for testing new technologies and paradigms
Principle 1/3 “Moving Computation is Cheaper than Moving Data”
Principle 2/3 “There is always a failure waiting around the corner” *Werner Vogels
Principle 3/3 “Decompose into small loosely coupled, stateless building blocks” *‘Leaving a Legacy System Revisited’, Chad Fowler
Object (diagram) • Data: Block 1, Block 2, …, Block n, each with a Hash • Metadata: Properties, ACL, Ext Properties, Segments, each with a Serial • Attributes set by user
BucketDiscovery (diagram labels) • Client • DNS lookup (bucket name) • Cell RL (bucket name, IP list, N servers) • Server list + load info (N servers) • Server list, priority list • Cell 1, Cell 2
RestFS Cache client side (diagram labels) • Temporary: ServerList (DNS, Resource Locator), Tokens cache (Federated Auth), Pub/Sub list (Callbacks, Locks), Metadata cache (RestFS Metadata) • Persistent: Block cache (RestFS Block)
Server Architecture (diagram) • Interface: S3, RestFS RPC, Auth, Token, Resource Locator, Sub/Pub • Service: Auth Service, Token Service, RL Service, Callback Service, Meta Service, Block Service, Locks Service • Managers: Storage Mgr, Meta Mgr, Locks Mgr, Auth Manager, Token Manager, Resource Manager, Callbacks Manager • Distributed Cache • Drivers (plugin): Storage Driver, Meta Driver, Locks Driver, Resource Driver, Callbacks Driver, Auth Driver, Token Driver • Backends
Mounting Cell Bucket N Objects Cell Bucket N Objects
Object Versioning • The segment contains the diff to the upstream object • Each object knows the previous and the next • The current object knows the previous and the last (diagram: Cell, Bucket, N Objects)
Backend: Consistent Hashing • Number of keys to move when a node is added or removed: Keys/Nodes = keys to relocate • Blocks are collected in shards http://highlyscalable.wordpress.com/2012/09/18/distributed-algorithms-in-nosql-databases/
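The relocation property can be illustrated with a small hash ring. This is a generic consistent-hashing sketch, not the RestFS backend: the `HashRing` class, the use of virtual nodes, and the node and key names are all assumptions.

```python
import bisect
import hashlib


class HashRing:
    """Consistent-hash ring with virtual nodes (illustrative only)."""

    def __init__(self, nodes=(), vnodes=100):
        self.vnodes = vnodes
        self._ring = []  # sorted list of (point, node)
        for node in nodes:
            self.add(node)

    @staticmethod
    def _point(label):
        return int(hashlib.sha1(label.encode("utf-8")).hexdigest(), 16)

    def add(self, node):
        for i in range(self.vnodes):
            bisect.insort(self._ring, (self._point(f"{node}#{i}"), node))

    def remove(self, node):
        self._ring = [(p, n) for p, n in self._ring if n != node]

    def lookup(self, key):
        # First ring point clockwise from the key's hash, wrapping around.
        i = bisect.bisect(self._ring, (self._point(key), ""))
        return self._ring[i % len(self._ring)][1]


ring = HashRing(["node1", "node2", "node3", "node4"])
keys = [f"block-{i}" for i in range(1000)]
before = {k: ring.lookup(k) for k in keys}
ring.add("node5")
moved = [k for k in keys if ring.lookup(k) != before[k]]
print(len(moved), "of", len(keys), "keys relocated")  # roughly keys/nodes
```

Only keys claimed by the new node change owner; every other key stays where it was, which is what keeps the relocation cost near Keys/Nodes.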
Block Storage • AFS - A volume stores a range of hashes - A chunk is written to 3 volumes - Server • PISA - Cluster of nodes - Communication based on zmq - Consensus based on Raft • CEPH - Uses CEPH nodes directly
Backend: Storage • 3 copies • Configurable read and write consistency level and security: • 2W1R • 2W2R • 1W1R • … • Neighbor monitoring in small clusters of 3 nodes (gossip) • Mini-cluster election • Key space reclaim for replica coordination, leaving and joining the cluster
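One way to reason about the levels listed above is the usual quorum argument: with N copies, a read is guaranteed to touch at least one replica that holds the latest acknowledged write only when R + W > N. The brute-force check below illustrates that argument in Python; the slides do not say how RestFS itself interprets the levels, so the function name and the interpretation are assumptions.

```python
from itertools import combinations


def quorums_always_overlap(n, w, r):
    """True if every possible write set of size w shares a replica
    with every possible read set of size r, out of n replicas."""
    replicas = range(n)
    return all(
        set(write_set) & set(read_set)
        for write_set in combinations(replicas, w)
        for read_set in combinations(replicas, r)
    )


for w, r in [(2, 1), (2, 2), (1, 1)]:
    print(f"{w}W{r}R with 3 copies: overlap guaranteed = {quorums_always_overlap(3, w, r)}")
```

Under this reading, 2W2R trades latency for reads that always see the latest write, while 2W1R and 1W1R favor speed and may return stale data until the replicas converge.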
RestFS Protocol
WebSocket is a web technology for multiplexing bi-directional, full-duplex communication channels over a single TCP connection.
GET /mychat HTTP/1.1
Host: server.example.com
Upgrade: websocket
Connection: Upgrade
Sec-WebSocket-Key: x3JJHMbDL1EzLkh9GBhXDw==
Sec-WebSocket-Protocol: chat
Sec-WebSocket-Version: 13
Origin: http://example.com
Standard HTTP/HTTPS port
JSON-RPC is a lightweight remote procedure call protocol similar to XML-RPC. It's designed to be simple.
--> { "method": "readBlock", "params": ["…"], "id": 1}
<-- { "result": [..], "error": null, "id": 1}
Simple to convert to a Python dict
BSON, short for Binary JSON, is a binary-encoded serialization of JSON-like documents. BSON can be compared to binary interchange formats.
{"hello": "world"} → "\x16\x00\x00\x00\x02hello\x00\x06\x00\x00\x00world\x00\x00"
*Compression is a long story…
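To make the two encodings concrete, here is a small standard-library sketch. `jsonrpc_request` only builds the envelope shown above, and `bson_string_doc` is a hand-written encoder for the single case on the slide (a flat document of string values); neither is the RestFS client code, and a real client would use a BSON library instead.

```python
import json
import struct


def jsonrpc_request(method, params, req_id):
    """Serialize a JSON-RPC call; the request is just a Python dict."""
    return json.dumps({"method": method, "params": params, "id": req_id})


def bson_string_doc(doc):
    """Encode a flat dict of str -> str as a BSON document.

    Layout: int32 total size, then per element 0x02 + name + NUL +
    int32 string length (including NUL) + string + NUL, then a final NUL.
    """
    body = b""
    for name, value in doc.items():
        data = value.encode("utf-8") + b"\x00"
        body += b"\x02" + name.encode("utf-8") + b"\x00"
        body += struct.pack("<i", len(data)) + data
    body += b"\x00"
    return struct.pack("<i", len(body) + 4) + body


print(jsonrpc_request("readBlock", ["…"], 1))
print(bson_string_doc({"hello": "world"}))
```

The BSON output for `{"hello": "world"}` is 22 bytes long, which is where the leading `\x16` in the slide's example comes from.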
Protocols Metadata
Collecting per segment, parallel request per segment:
--> { "method": "readBlock", "params": ["bucket_name: test, segment:1, blocks:[1,2,3,4]"], "id": 1}
Check cached data:
--> { "method": "getSegmentVer", "params": ["bucket_name: test, segment:1"], "id": 1}
<-- { "result": [ ver: 1335519328.091779 ], "error": null, "id": 1}
Block hash list for a specific segment:
--> { "method": "getSegmentHash", "params": ["bucket_name: test, segment:1"], "id": 1}
<-- { "result": [ 1:16db0420c9cc29a9d89ff89cd191bd2045e47378 2:9bcf720b1d5aa9b78eb1bcdbf3d14c353517986c … ], "error": null, "id": 1}
Redis performance
$ ./redis-benchmark -r 1000000 -n 2000000 -t get,set,lpush,lpop -P 16 -q
SET: 552028.75 requests per second
GET: 707463.75 requests per second
LPUSH: 767459.75 requests per second
LPOP: 770119.38 requests per second
Pluggable Interface, dynamic load
Thank you http://restfs.beolink.org manfred.furuholmen@gmail.com fege85@gmail.com
Bucket The bucket has many properties. The property element is a collection of object information; with this element you can retrieve the default values for the bucket (logging level, security level, etc.). • Property objects: • Property • Property Ext • Property ACL • Property Stats
Bucket name: zebra
Property (Python dict): segment_size=512 block_size=16k max_read=1000 Bucket_size=0 Bucket_quota=10000 storage_class=STANDARD compression=none logging=enable bucket_type=fs …
Default parameters: - Filesystem: the bucket is used as a filesystem - Logging: logs the operations done on the specific bucket - Replica RO: bucket shadow replication … Custom definition
MetaDataProperties Object type: - Data: contains files - Folder: special object that contains other objects - Mount point: contains the name of the buckets - Link: contains the name of the objects - Immutable: gold image - Custom: defined by the users
Key: bucket name + object id (the special id bucket_name.ROOT is the starting point of the file system)
Object zebra.c1d2197420bd41ef24fc665f228e2c76e98da247
Property: Object_type=data segment_size=512 block_size=16k content_type= md5=ab86d732d11beb65ed0183d6a87b9b0 max_read=1000 storage_class=STANDARD compression=none Name="my first object" Object_size=10000 Object_prev=zebra.c1d2197420bd41ef24fc665f228e2c76e98dartg … vers:1335519328.091779
Annotations: md5 is the object hash (replaced by Merkle tree) • Object_prev is the pointer to the previous object • Object default • vers is the object version
MetadataSegment
Total segments = Data_size / (block_size * segment_size)
Segment element (Python dict): Segment-id
1:16db0420c9cc29a9d89ff89cd191bd2045e47378
2:9bcf720b1d5aa9b78eb1bcdbf3d14c353517986c
3:158aa47df63f79fd5bc227d32d52a97e1451828c
4:1ee794c0785c7991f986afc199a6eee1fa4
5:c3c662928ac93e206e025a1b08b14ad02e77b29d
…
vers:1335519328.091779
• Each entry is block position: integrity hash • Version based on timestamp + incremental, useful for vector clock conflict resolution
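A worked version of the segment formula in Python. It assumes that `segment_size` counts blocks per segment, that `16k` means 16 KiB, that a partially filled last segment still counts (so the division rounds up), and that segments and blocks are numbered from 1 as in the protocol examples; none of this is confirmed by the slides, and the function names are hypothetical.

```python
def total_segments(data_size, block_size, segment_size):
    """Number of segments needed for data_size bytes (rounded up)."""
    segment_bytes = block_size * segment_size
    return -(-data_size // segment_bytes)  # integer ceiling division


def locate(offset, block_size, segment_size):
    """Map a byte offset to (segment, block), both numbered from 1."""
    segment_bytes = block_size * segment_size
    segment = offset // segment_bytes + 1
    block = (offset % segment_bytes) // block_size + 1
    return segment, block


BLOCK = 16 * 1024   # block_size=16k, read as 16 KiB
PER_SEGMENT = 512   # segment_size=512, read as blocks per segment

print(total_segments(10000, BLOCK, PER_SEGMENT))  # Object_size=10000 from the example
print(locate(16384, BLOCK, PER_SEGMENT))
```

With these defaults a segment covers 8 MiB, so the 10000-byte example object fits in a single segment.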
RestFS ID • Bucket id: plain-text DNS name • Object id: UUID, random generation • Segment id and block id: based on the position of the content • Chunk data on the storage: SHA-1 hash of the concatenation Bucket.object.segment.block_id • The object id is unique inside the bucket; with the bucket name, the id is a UUID
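The identifier scheme can be sketched as follows. The dot-separated string and the helper names are assumptions based on the wording "Bucket.object.segment.block_id"; the exact concatenation used by RestFS is not shown in the slides.

```python
import hashlib
import uuid


def new_object_id():
    """Random object id (UUID4, hex form)."""
    return uuid.uuid4().hex


def chunk_id(bucket, object_id, segment, block):
    """SHA-1 of "bucket.object.segment.block": the key of a chunk on storage."""
    name = f"{bucket}.{object_id}.{segment}.{block}"
    return hashlib.sha1(name.encode("utf-8")).hexdigest()


obj = new_object_id()
print(chunk_id("zebra", obj, 1, 1))
print(chunk_id("zebra", obj, 1, 2))
```

Because the chunk id depends only on position, any client can compute where a block lives without asking the metadata service, and the resulting hash can be fed straight into the backend's key placement.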