Optimizing Storage for Web-Scale Search Application
Hemant Surale, Varun Chandramouli, Shashank Gupta, Soumen Chakrabarti, Ameya Usgaonkar, Atish Kathpal
Motivation
• Need for a DFS: local file systems do not scale, and handling node failures or adding new nodes requires a great deal of manual work.
• Problems with HDFS: it has a single point of failure, it cannot be mounted as a POSIX file system, and a FUSE mount performs poorly.
• Ceph FS distributes data pseudo-randomly, which generates heavy network traffic and sub-optimal performance for applications with known data-access patterns.

Application Architecture
• XFS setup: the corpus is divided into ten stripes, and each stripe is replicated among four "buddies". Each buddy generates the index for a quarter of its data, i.e. 1/40th of the corpus; the indexes generated by the buddies are then merged and written back to them.
• Issues to be addressed with Ceph: because data is placed pseudo-randomly, file locations are unknown, which makes it difficult to maximize local reads and writes and leads to heavy network traffic. Moreover, no two nodes/OSDs are exact replicas of each other, so the corpus would be indexed four times.

Approach
• Instrument application-specific CRUSH maps to minimize read-write interference: during the indexing phase, data is read from one OSD while the index is written back into the cluster, causing read-write interference; setting aside certain OSDs exclusively for writing index data eliminates this interference.
• Instrument a file-to-primary-OSD mapping to reduce network traffic: creating the map adds overhead, but once it exists each node can process only the data local to it.
• Obtain I/O and network traces to analyze performance further.
• Use machine learning to observe data-access patterns and optimize data placement.

Preliminary Results
• Vanilla Ceph vs. modified Ceph: with the CRUSH maps modified to minimize read-write interference, tests on a sample application showed a 33% improvement over vanilla Ceph.
• Reads only vs. reads and writes: performance decreases when reads and writes are mixed; the decrease is attributed to read-write interference.

Future Work
• Obtain performance results on an actual cluster using the modified CRUSH maps.
• Obtain I/O and network traces and analyze them for further optimization potential.
• Obtain performance results using the file-OSD mapping: compare the reduction in network traffic, and reduce the overhead of producing the map.
• Benchmark the system against native XFS and Hadoop MapReduce.
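The idea of setting aside certain OSDs for index writes would be expressed in the cluster's CRUSH map. A minimal sketch of what such a rule might look like in the decompiled CRUSH-map text format; the bucket name `index`, the OSD ids, and the ruleset number are all hypothetical, not taken from the poster:

```
# Hypothetical bucket grouping the OSDs reserved for index writes
root index {
        id -5                     # internal bucket id (example value)
        alg straw
        hash 0                    # rjenkins1
        item osd.8 weight 1.000
        item osd.9 weight 1.000
}

# Rule that places index objects only on those OSDs,
# keeping index writes off the OSDs serving corpus reads
rule index_write {
        ruleset 1
        type replicated
        min_size 1
        max_size 4
        step take index
        step choose firstn 0 type osd
        step emit
}
```

A pool holding the index data would then be pointed at this ruleset (in Ceph releases of that era, via `ceph osd pool set <pool> crush_ruleset 1`), so the read-write interference described above never occurs on the corpus OSDs.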
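The file-to-primary-OSD mapping can be sketched as follows. This is an illustration of the idea, not the authors' implementation: the helper names are invented, and the `fake_primary` function stands in for querying Ceph's real object-to-PG-to-OSD mapping.

```python
# Sketch of the file -> primary-OSD mapping idea (hypothetical helper names).
# Building the map costs an extra pass over the corpus, but afterwards each
# node can index only the files whose primary copy it already stores,
# turning remote reads into local ones and reducing network traffic.
from zlib import crc32


def build_file_osd_map(files, primary_osd_of):
    """Map each file to the OSD holding its primary replica."""
    return {f: primary_osd_of(f) for f in files}


def local_work(file_osd_map, this_osd):
    """Files this node should index: those whose primary copy is local."""
    return [f for f, osd in file_osd_map.items() if osd == this_osd]


# Deterministic stand-in for Ceph's placement lookup (crc32 rather than
# Python's hash(), which is randomized per process).
def fake_primary(f):
    return crc32(f.encode()) % 4  # pretend the cluster has 4 OSDs


files = [f"doc{i}" for i in range(12)]
fmap = build_file_osd_map(files, fake_primary)
mine = local_work(fmap, this_osd=0)  # what OSD 0 would index locally
```

Across all OSDs the per-node work lists partition the corpus, so every file is still indexed exactly once while each read stays local.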
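The buddy layout in the XFS setup can be checked with a small sketch: ten stripes, each replicated on four buddies, each buddy indexing a disjoint quarter of its stripe, i.e. 1/40th of the corpus. The partitioning scheme below (round-robin by position) is an assumption for illustration; the poster does not specify how a stripe is split among buddies.

```python
# Work division in the XFS setup: 10 stripes, 4 buddies per stripe,
# each buddy indexing a disjoint quarter of the stripe.
STRIPES = 10
BUDDIES = 4


def buddy_share(stripe_files, buddy, n_buddies=BUDDIES):
    """The quarter of the stripe a given buddy indexes (round-robin split)."""
    return stripe_files[buddy::n_buddies]


stripe = [f"file{i}" for i in range(20)]  # one stripe of the corpus
shares = [buddy_share(stripe, b) for b in range(BUDDIES)]

# The four quarters partition the stripe: every file indexed exactly once.
indexed = sorted(f for s in shares for f in s)
assert indexed == sorted(stripe)

# One buddy's fraction of the whole corpus: (1/10) * (1/4) = 1/40
assert (1 / STRIPES) * (1 / BUDDIES) == 1 / 40
```

The merge step then combines the four partial indexes of a stripe and writes the result back to the buddies, matching the description above.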