BlueSSD: Distributed Flash Store for Big Data Analytics
Sang Woo Jun, Ming Liu, Kermin Fleming, Arvind
Computer Science and Artificial Intelligence Laboratory, MIT
Introduction – Flash Storage
• Low latency, high density
• Throughput per chip is fixed
• Many chips are organized into multiple busses that can work concurrently
• High throughput is achieved with more busses (a sketch of this striping idea follows)
• Read/write speed difference, limited write lifetime
• Not the main focus… yet
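Since per-chip throughput is fixed, aggregate bandwidth comes from keeping many busses busy at once. A minimal C sketch of round-robin page striping across busses; the mapping policy is an assumption for illustration, though the 4-bus, 8-chip geometry matches the custom flash boards described later:

```c
/* Illustrative sketch only: round-robin page striping across busses.
 * The mapping policy is assumed, not taken from the BlueSSD controller;
 * the 4-bus x 8-chip geometry matches the custom flash boards below. */
#include <stdint.h>

#define NUM_BUSSES    4   /* busses operate concurrently          */
#define CHIPS_PER_BUS 8   /* chips on one bus share its bandwidth */

typedef struct {
    uint32_t bus;   /* bus that services this page */
    uint32_t chip;  /* chip on that bus            */
    uint32_t page;  /* page index within the chip  */
} flash_loc_t;

/* Consecutive logical pages land on different busses, so a
 * sequential read keeps all busses transferring in parallel. */
static flash_loc_t map_page(uint64_t logical_page)
{
    flash_loc_t loc;
    loc.bus  = logical_page % NUM_BUSSES;
    loc.chip = (logical_page / NUM_BUSSES) % CHIPS_PER_BUS;
    loc.page = (uint32_t)(logical_page / (NUM_BUSSES * CHIPS_PER_BUS));
    return loc;
}
```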
Flash Deployment Goals
• High capacity / low unit cost
• CORFU – shares distributed storage over a commodity network
• TBs of storage at <1 ms latency, ~1 GB/s throughput when highly distributed
• High throughput / low latency
• FusionIO – maximum performance using many busses/chips and PCIe
• 100s of GB at 100s of µs latency, ~3 GB/s throughput
BlueSSD – Best of Both Worlds
• Shared distributed storage over a faster custom network to accelerate big data analytics
• PCIe: 8x PCIe 2.0 (~1 GB/s)
• Inter-FPGA SERDES: low-latency sideband network (<1 µs, ~1 GB/s)
• Automatic network/flow-control synthesis
The Physical System (Old)
• [Photo of the prototype: PCIe (~1 GB/s), sideband link (~1 GB/s), flash board (~80 MB/s)]
System Configuration
• 6 Xilinx ML605 development boards + host PCs
• 4 custom flash boards: 4 busses with 8 chips each, 16 GB per board
• 2 Xilinx XM104 connector expansion boards
• 5 SMA connections
• [Diagram: a host PC connects over PCIe to FPGA1–FPGA4, each carrying a custom flash board (the storage nodes); SMA links join the FPGAs through XM104 hub nodes]
• The ML605 only has one SMA port, requiring hubs
System Configuration
• A single software host can access all nodes
• All nodes have identical memory maps of the entire address space
• Requests are redirected to the node that holds the data (see the sketch below)
• [Diagram: same topology as above, with the host PC issuing requests through one FPGA]
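A minimal sketch, assuming page-granularity striping, of how a node holding the shared memory map could resolve a global address to its owner and redirect the request. NUM_NODES matches the four storage FPGAs, but the page size and the striping rule are assumptions:

```c
/* Illustrative sketch: every node holds the same memory map, so any
 * node can resolve a global address and forward the request itself.
 * The 4 KB page size and the striping rule are assumptions. */
#include <stdint.h>
#include <stdbool.h>

#define NUM_NODES 4          /* FPGA1..FPGA4 in the prototype */
#define PAGE_SIZE 4096       /* striping unit (assumed)       */

typedef struct {
    uint32_t node;           /* FPGA that owns the page  */
    uint64_t local_addr;     /* address within that node */
} route_t;

static route_t route_request(uint64_t global_addr)
{
    uint64_t page = global_addr / PAGE_SIZE;
    route_t r;
    r.node       = (uint32_t)(page % NUM_NODES);   /* striped ownership */
    r.local_addr = (page / NUM_NODES) * PAGE_SIZE + global_addr % PAGE_SIZE;
    return r;
}

/* Serve locally if this node owns the page; otherwise the request
 * is forwarded over the SMA sideband network to the owner. */
static bool serve_locally(uint32_t my_node, uint64_t global_addr)
{
    return route_request(global_addr).node == my_node;
}
```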
Network Flash Controller
• [Diagram of the datapath: a request enters from the host PC over PCIe, passes through the client interface and address mapping on the FPGA, and is either served by the local flash controller and flash board or forwarded over SMA (via the XM104) to the flash controller of a remote node]
Network Hub
• Programmatically define high-level connections (a sketch follows)
• An N-to-N, crossbar-like network is generated
• [Diagram: FPGA1–FPGA4 on ML605 boards connected through an XM104 hub over SMA]
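A hedged sketch of what "programmatically defined connections" could look like: a declared link table from which crossbar routing falls out. The table format and function names are purely illustrative; the real toolflow synthesizes the network and flow control automatically from its own high-level description:

```c
/* Illustrative sketch only: declare logical links between FPGAs and
 * derive crossbar routing from the table.  The real toolflow
 * synthesizes the network and flow control from its own high-level
 * description; none of these names come from it. */
#include <stddef.h>
#include <stdint.h>

typedef struct {
    uint8_t src;   /* sending FPGA   */
    uint8_t dst;   /* receiving FPGA */
} link_t;

/* Full N-to-N connectivity among the four FPGAs. */
static const link_t links[] = {
    {0, 1}, {0, 2}, {0, 3},
    {1, 0}, {1, 2}, {1, 3},
    {2, 0}, {2, 1}, {2, 3},
    {3, 0}, {3, 1}, {3, 2},
};

/* A flit is forwarded through the hub only if its (src, dst) pair
 * was declared; with a single hub, every route is one hop. */
static int route_exists(uint8_t src, uint8_t dst)
{
    for (size_t i = 0; i < sizeof links / sizeof links[0]; i++)
        if (links[i].src == src && links[i].dst == dst)
            return 1;
    return 0;
}
```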
Software
• FUSE provides a file system abstraction
• A custom FUSE module interfaces with the FPGA (a sketch follows)
• The entire store can be accessed as a single regular file
• Currently running SQLite off-the-shelf
• How to benchmark?
• [Diagram of the software stack: SQLite → stdio → file system → FUSE → PCIe driver → FPGA]
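A minimal sketch of such a FUSE module, assuming FUSE 2.x; fpga_read() stands in for whatever call the PCIe driver actually exposes, and the file name and total size are illustrative:

```c
/* Minimal FUSE module sketch: expose the distributed flash store as
 * one regular read-only file.  fpga_read() and FLASH_BYTES are
 * hypothetical stand-ins for the real PCIe driver interface. */
#define FUSE_USE_VERSION 26
#include <fuse.h>
#include <string.h>
#include <errno.h>
#include <stdint.h>
#include <sys/stat.h>

#define FLASH_BYTES (64ULL << 30)   /* 4 boards x 16 GB (illustrative) */

/* Hypothetical driver call: read `size` bytes at `offset` over PCIe. */
extern int fpga_read(char *buf, size_t size, off_t offset);

static int bssd_getattr(const char *path, struct stat *st)
{
    memset(st, 0, sizeof *st);
    if (strcmp(path, "/") == 0) {
        st->st_mode  = S_IFDIR | 0755;
        st->st_nlink = 2;
    } else if (strcmp(path, "/flash") == 0) {
        st->st_mode  = S_IFREG | 0444;   /* read-intensive: read-only */
        st->st_nlink = 1;
        st->st_size  = FLASH_BYTES;
    } else {
        return -ENOENT;
    }
    return 0;
}

static int bssd_read(const char *path, char *buf, size_t size,
                     off_t offset, struct fuse_file_info *fi)
{
    (void)fi;
    if (strcmp(path, "/flash") != 0)
        return -ENOENT;
    if ((uint64_t)offset >= FLASH_BYTES)
        return 0;                        /* EOF */
    if ((uint64_t)offset + size > FLASH_BYTES)
        size = FLASH_BYTES - offset;
    return fpga_read(buf, size, offset); /* forwarded to the FPGA */
}

static struct fuse_operations bssd_ops = {
    .getattr = bssd_getattr,
    .read    = bssd_read,
};

int main(int argc, char *argv[])
{
    return fuse_main(argc, argv, &bssd_ops, NULL);
}
```

With the module mounted, SQLite can open the exposed file like any other database file, which is what makes running it off-the-shelf possible.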
Storage Structure
• Focusing on read-intensive workloads
• Writes are done offline, so there are no coherence issues
• Addresses are striped across the FPGAs
• Concurrent writes will require more than coherence
• SQLite assumes exclusive access to storage
• If we are to have more than one file, file system metadata will need to be synchronized
Performance Measurement
• [Chart: throughput and latency results]
• Throughput is bottlenecked by the custom flash card
• *CORFU performance measured at 32 nodes
Scalability
• The latency increase is small enough to accommodate 16+ FPGAs
• A single SMA cable can carry the throughput of 10+ flash boards (~1 GB/s link vs. ~80 MB/s per board)
• More should be possible with a good topology
• A different story if the flash boards get faster (link compression?)
Future Work (1)
• Bring up the 4-node system
• Bring up the 8-node system
• 8 more ML605 boards have been requested from Xilinx
• More capacity + throughput
Future Work (2)
• Offload computation to the FPGA
• Do computation near storage
• Relational algebra processor
• Complex analytics?
• Looking for interesting applications
Future Work (3)
• Multiple concurrent writers
• Software-level transaction management
• A hardware-level pseudo-filesystem is probably required
The End
• Thank you!