Finding a needle in Haystack Facebook’s Photo Storage

Finding a needle in Haystack Facebook’s Photo Storage Shakthi Bachala

Outline • Scenario • Goal • Problem • Previous Approach • Current Approach • Evaluation • Advantages • Critique • Conclusion

Scenario :

Goal • High throughput and low latency • Fault-tolerant • Cost-effective • Simple

Previous Approach : Typical design for Photo Sharing

Previous Approach : NFS-based design for Photo Sharing at Facebook

Previous Approach – NFS-based design • Traditional file system architectures perform poorly under Facebook's workload • NFS-based design: the CDN effectively serves the hottest photos (profile pictures and recently uploaded photos), but Facebook also generates many requests for less popular images (the long tail), which the CDN does not handle well • A typical website had a 99% CDN hit rate, but Facebook had around 80%

Long Tail Issue

Previous Approach cont. Problems with that approach: • Storage capacity wasted on metadata • Large metadata per file • Each image stored as a file • Large number of disk operations per read • Caused by large directories (thousands of files each) • Restructuring from large directories to small ones brought the disk operations per read down from approximately 10 to 2.5-3.0

Current Approach – Haystack Architecture

Current Approach- Haystack Components The main components of Haystack architecture are: • Haystack Directory • Haystack Cache • Haystack Store

Current Approach- Haystack Directory The main functions of the Directory are: • Map logical volumes to physical volumes • 3 physical volumes (on 3 nodes) per logical volume • Load balance • Writes across logical volumes • Reads across physical volumes (any of the 3 Stores) • Caching strategy: decide whether a photo request should be handled by the CDN or by the Cache • URL generation: http://<CDN>/<Cache>/<Node>/<Logical volume id, Image id> • Identify the logical volumes that are read-only, either for operational reasons or because they have reached their storage capacity
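The Directory's responsibilities can be illustrated with a minimal Python sketch. The class names, the example host names, and the random choice used for load balancing are assumptions made for illustration; the slides only specify the mapping, the read-only marking, and the URL layout.

```python
import random
from dataclasses import dataclass


@dataclass
class LogicalVolume:
    volume_id: int
    nodes: list            # hosts holding the 3 physical replicas (hypothetical names)
    writable: bool = True


class Directory:
    """Toy Directory: logical-to-physical mapping, load balancing, URL generation."""

    def __init__(self, volumes, cdn="cdn.example.com", cache="cache.example.com"):
        self.volumes = {v.volume_id: v for v in volumes}
        self.cdn, self.cache = cdn, cache

    def mark_read_only(self, volume_id):
        # Used when a volume is full or taken out of rotation for operational reasons.
        self.volumes[volume_id].writable = False

    def pick_write_volume(self, rng=random):
        # Balance writes across the logical volumes that are still write-enabled.
        writable = [v for v in self.volumes.values() if v.writable]
        if not writable:
            raise RuntimeError("no write-enabled logical volume available")
        return rng.choice(writable)

    def photo_url(self, volume_id, photo_id, use_cdn=True, rng=random):
        # Balance reads across the physical replicas of the logical volume.
        node = rng.choice(self.volumes[volume_id].nodes)
        prefix = f"http://{self.cdn}/" if use_cdn else "http://"
        return f"{prefix}{self.cache}/{node}/{volume_id},{photo_id}"
```

Passing `use_cdn=False` drops the CDN component from the URL, which is one way to model the Directory's decision to send a request straight to the Cache.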

Current Approach- Haystack Cache • Approach: • The Cache receives HTTP requests for photos from browsers or CDNs • It is organized as a distributed hash table, with the photo id as the key for locating cached data • If the photo id is missing from the Cache, the Cache fetches the photo from the Store and returns it to the browser or CDN, depending on where the request came from

Current Approach- Haystack Cache The Cache stores a photo only if both of the following conditions hold: • The request comes directly from a user and not from the CDN • Facebook’s experience with the NFS-based design showed that post-CDN caching is ineffective, as a request that misses in the CDN is unlikely to hit in the internal cache • The photo is fetched from a write-enabled Store machine • Photos are most heavily accessed soon after they are uploaded • File systems generally perform better when doing either reads or writes, but not both
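The two admission conditions can be expressed as a small Python sketch. A plain dictionary stands in for the distributed hash table, and `fetch_from_store`, `from_cdn`, and `store_is_writable` are hypothetical names introduced here, not part of the Haystack interface.

```python
class HaystackCache:
    """Toy Cache that admits a photo only under the two conditions above."""

    def __init__(self, fetch_from_store):
        self._data = {}                 # stand-in for the distributed hash table
        self._fetch = fetch_from_store  # hypothetical callable: photo_id -> bytes

    def is_cached(self, photo_id):
        return photo_id in self._data

    def get(self, photo_id, from_cdn, store_is_writable):
        if photo_id in self._data:
            return self._data[photo_id]
        photo = self._fetch(photo_id)
        # Admit only if the request came directly from a user (not via the CDN)
        # and the photo lives on a write-enabled Store machine.
        if not from_cdn and store_is_writable:
            self._data[photo_id] = photo
        return photo
```

Every miss is still served from the Store; the conditions only decide whether the result is kept for later requests.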

Current Approach- Haystack Cache Hit Rate

Current Approach : Haystack Store • Replaces the storage and photo server layer in NFS based Design with this structure:

Current Approach : Haystack Store • Storage: • 12x 1TB SATA, RAID6 • Filesystem: • Single approx. 10 TB XFS filesystem • Haystack: • Log-structured, append-only object store containing needles as object abstractions • 100 haystacks per node, each 100 GB in size
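The capacity figures on this slide are consistent with each other, as a quick check shows. The sketch assumes decimal units (1 TB = 1000 GB) and ignores filesystem overhead, which is why the slide says "approx."; RAID6 reserves the equivalent of two disks for parity.

```python
DISKS = 12          # SATA drives per node
DISK_TB = 1         # capacity of each drive in TB
RAID6_PARITY = 2    # RAID6 uses two disks' worth of space for parity

usable_tb = (DISKS - RAID6_PARITY) * DISK_TB

HAYSTACKS_PER_NODE = 100
HAYSTACK_GB = 100
haystack_total_tb = HAYSTACKS_PER_NODE * HAYSTACK_GB / 1000

print(usable_tb, haystack_total_tb)
```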

Current Approach: Haystack Store File

Current Approach: Operations in Haystack • Photo Read • Look up the offset/size of the image in the in-core index • Read the data (approx. 1 iop) • Photo Write • Synchronously append images one by one to the haystack file • Move on to the next haystack file when the current one becomes full • Asynchronously append index records to the index file • Flush the index file if there are too many dirty index records • Update the in-core index

Current Approach: Operations in Haystack • Photo Delete • Look up the offset of the image in the in-core index • Mark the image needle flag as “DELETED” • Update the in-core index • Index File: • Provides the minimum metadata needed to locate a needle in the haystack store file • Subset of the needle header metadata
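The read, write, and delete paths can be sketched in Python as follows. The needle header layout (`key`, `flags`, `size`) is a simplified assumption rather than the real on-disk format, an in-memory buffer stands in for the haystack file, and the separate index file is omitted.

```python
import io
import struct

# Hypothetical needle header: key (uint64), flags (uint8), data size (uint32).
HEADER = struct.Struct("<QBI")
FLAG_DELETED = 1


class HaystackVolume:
    def __init__(self):
        self.store = io.BytesIO()  # stands in for the append-only haystack file
        self.index = {}            # in-core index: key -> (offset, size)

    def write(self, key, data):
        # Append the needle at the end of the file, then update the in-core index.
        offset = self.store.seek(0, io.SEEK_END)
        self.store.write(HEADER.pack(key, 0, len(data)) + data)
        self.index[key] = (offset, len(data))
        return offset

    def read(self, key):
        # One index lookup, then one contiguous read from the file.
        entry = self.index.get(key)
        if entry is None:
            return None
        offset, size = entry
        self.store.seek(offset)
        raw = self.store.read(HEADER.size + size)
        stored_key, flags, _ = HEADER.unpack_from(raw)
        if stored_key != key or flags & FLAG_DELETED:
            return None
        return raw[HEADER.size:]

    def delete(self, key):
        # Mark the needle as deleted in place and drop it from the in-core index.
        entry = self.index.pop(key, None)
        if entry is None:
            return False
        offset, _ = entry
        self.store.seek(offset + 8)  # the flags byte follows the 8-byte key
        self.store.write(bytes([FLAG_DELETED]))
        return True
```

Deleted data stays in the file until compaction; only the flag and the index change.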

Current Approach: Haystack Index File

Haystack Based Design - Photo Upload

Haystack Based Design - Photo Download

Current Approach: Operations in Haystack • Filesystem: • Haystack uses XFS, an extent-based file system • It has two main advantages: • The block maps for several contiguous large files can be small enough to be stored in main memory • XFS provides efficient file preallocation, mitigating fragmentation and reining in how large block maps can grow

Current Approach: Haystack Optimization • Compaction: • Infrequent online operation • Creates a copy of the haystack, skipping duplicate and deleted photos • Deletes follow a pattern similar to photo views: young photos are much more likely to be deleted • Over the past year, about 25% of the photos were deleted
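Compaction can be sketched as a single pass over the needles in append order. The `Needle` tuple and the rule that a later needle with the same key supersedes an earlier one are assumptions made for this illustration.

```python
from typing import NamedTuple


class Needle(NamedTuple):
    key: int
    deleted: bool
    data: bytes


def compact(needles):
    """Return a new needle list without deleted or superseded entries."""
    # Remember the position of the most recent needle for each key.
    latest = {}
    for pos, needle in enumerate(needles):
        latest[needle.key] = pos
    # Keep a needle only if it is the latest for its key and not deleted.
    return [
        needle
        for pos, needle in enumerate(needles)
        if latest[needle.key] == pos and not needle.deleted
    ]
```

The input is left untouched and a fresh copy is produced, matching the "create a copy" approach on the slide.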

Current Approach: Haystack Optimization • Saving More Memory: • The following two techniques reduced the main memory footprint of Store machines by 20% • Eliminate the need for an in-memory representation of flags by setting the offset to 0 for deleted photos • Store machines do not keep track of cookie values in main memory; instead they check the supplied cookie after reading the needle from disk
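Both memory-saving techniques can be shown in a short Python sketch. The 8-byte `SUPERBLOCK` placeholder (which guarantees that no live needle starts at offset 0) and the on-disk prefix of cookie plus size are assumptions for illustration, not the actual file format.

```python
import io
import struct

SUPERBLOCK = b"HAYSTACK"       # hypothetical file header; keeps offset 0 free
NEEDLE = struct.Struct("<QI")  # hypothetical on-disk prefix: cookie, data size


class Volume:
    def __init__(self):
        self.file = io.BytesIO(SUPERBLOCK)
        self.index = {}  # key -> offset only; no flags or cookie kept in memory

    def write(self, key, cookie, data):
        offset = self.file.seek(0, io.SEEK_END)
        self.file.write(NEEDLE.pack(cookie, len(data)) + data)
        self.index[key] = offset

    def delete(self, key):
        # Offset 0 doubles as the "deleted" marker, so no flag field is needed.
        if key in self.index:
            self.index[key] = 0

    def read(self, key, cookie):
        offset = self.index.get(key, 0)
        if offset == 0:
            return None
        self.file.seek(offset)
        stored_cookie, size = NEEDLE.unpack(self.file.read(NEEDLE.size))
        # The cookie is verified only after the needle has been read from disk.
        if stored_cookie != cookie:
            return None
        return self.file.read(size)
```

The trade-off is that a request with a wrong cookie still costs a disk read before it is rejected.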

Current Approach: Haystack Optimization • Batch Uploads: • Disks perform better with large sequential writes than with small random writes, so Facebook batches uploads whenever possible • Many users upload entire albums rather than individual pictures, which gives an opportunity to batch the uploads

Evaluation -Data

Evaluation – Production Workload

Advantages • Simple design • Fewer disk operations, achieved by reducing the average metadata per photo • Robust enough to handle a very large amount of data • Fault-tolerant

Critique • The approach seems very Facebook-specific • Any others?

Conclusion • Haystack is a simple but robust storage system for Facebook photos that accommodates the long tail of photo requests, which the previous approaches could not handle

References • http://www.usenix.org/event/osdi10/tech/ • http://www.cs.uiuc.edu/class/sp11/cs525/sched.htm
