Optimizing Storage for Web-Scale Search Application
Hemant Surale, Varun Chandramouli, Shashank Gupta, Soumen Chakrabarti, Ameya Usgaonkar, Atish Kathpal
Motivation
• Need for a DFS: local file systems do not scale, and handling node failures or adding new nodes requires a great deal of manual work.
• Problems with HDFS: it has a single point of failure, it cannot be mounted as a POSIX file system, and a FUSE mount performs poorly.
• Ceph FS distributes data pseudo-randomly, which generates heavy network traffic and sub-optimal performance for applications with known data-access patterns.

Application Architecture
• XFS setup: the corpus is divided into ten stripes, and each stripe is replicated among four "buddies". Each buddy generates the index for a quarter of its data, i.e. 1/40th of the corpus; the indexes generated by the buddies are then merged and written back to them.
• Issues to be addressed with Ceph: because data is placed pseudo-randomly, file locations are unknown, which makes it difficult to maximize local reads and writes and leads to heavy network traffic. Moreover, no two nodes/OSDs are exact replicas of each other, so the corpus would be indexed four times.

Approach
• Instrument application-specific CRUSH maps to minimize read-write interference: during the indexing phase, data is read from one OSD while the index is written back into the cluster, causing read-write interference; setting aside certain OSDs exclusively for writing index data eliminates this interference.
• Instrument a file-to-primary-OSD mapping to reduce network traffic: creating the map adds overhead, but once it exists each node can process only the data local to it.
• Obtain I/O and network traces to analyze performance further.
• Use machine learning to observe data-access patterns and optimize data placement.

Preliminary Results
• Vanilla Ceph vs. modified Ceph: with the CRUSH maps modified to minimize read-write interference, tests on a sample application showed a 33% improvement over vanilla Ceph.
• Reads only vs. reads and writes: performance decreases when reads and writes are mixed; the decrease is attributed to read-write interference.

Future Work
• Obtain performance results on an actual cluster using the modified CRUSH maps.
• Obtain I/O and network traces and analyze them for further optimization potential.
• Obtain performance results using the file-OSD mapping: compare the reduction in network traffic, and reduce the overhead of producing the map.
• Benchmark the system against native XFS and Hadoop MapReduce.
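The idea of setting aside certain OSDs for index writes would be expressed in the cluster's CRUSH map. A minimal sketch of what such a rule might look like in the decompiled CRUSH-map text format; the bucket name `index`, the OSD ids, and the ruleset number are all hypothetical, not taken from the poster:

```
# Hypothetical bucket grouping the OSDs reserved for index writes
root index {
        id -5                     # internal bucket id (example value)
        alg straw
        hash 0                    # rjenkins1
        item osd.8 weight 1.000
        item osd.9 weight 1.000
}

# Rule that places index objects only on those OSDs,
# keeping index writes off the OSDs serving corpus reads
rule index_write {
        ruleset 1
        type replicated
        min_size 1
        max_size 4
        step take index
        step choose firstn 0 type osd
        step emit
}
```

A pool holding the index data would then be pointed at this ruleset (in Ceph releases of that era, via `ceph osd pool set <pool> crush_ruleset 1`), so the read-write interference described above never occurs on the corpus OSDs.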
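The file-to-primary-OSD mapping can be sketched as follows. This is an illustration of the idea, not the authors' implementation: the helper names are invented, and the `fake_primary` function stands in for querying Ceph's real object-to-PG-to-OSD mapping.

```python
# Sketch of the file -> primary-OSD mapping idea (hypothetical helper names).
# Building the map costs an extra pass over the corpus, but afterwards each
# node can index only the files whose primary copy it already stores,
# turning remote reads into local ones and reducing network traffic.
from zlib import crc32


def build_file_osd_map(files, primary_osd_of):
    """Map each file to the OSD holding its primary replica."""
    return {f: primary_osd_of(f) for f in files}


def local_work(file_osd_map, this_osd):
    """Files this node should index: those whose primary copy is local."""
    return [f for f, osd in file_osd_map.items() if osd == this_osd]


# Deterministic stand-in for Ceph's placement lookup (crc32 rather than
# Python's hash(), which is randomized per process).
def fake_primary(f):
    return crc32(f.encode()) % 4  # pretend the cluster has 4 OSDs


files = [f"doc{i}" for i in range(12)]
fmap = build_file_osd_map(files, fake_primary)
mine = local_work(fmap, this_osd=0)  # what OSD 0 would index locally
```

Across all OSDs the per-node work lists partition the corpus, so every file is still indexed exactly once while each read stays local.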
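The buddy layout in the XFS setup can be checked with a small sketch: ten stripes, each replicated on four buddies, each buddy indexing a disjoint quarter of its stripe, i.e. 1/40th of the corpus. The partitioning scheme below (round-robin by position) is an assumption for illustration; the poster does not specify how a stripe is split among buddies.

```python
# Work division in the XFS setup: 10 stripes, 4 buddies per stripe,
# each buddy indexing a disjoint quarter of the stripe.
STRIPES = 10
BUDDIES = 4


def buddy_share(stripe_files, buddy, n_buddies=BUDDIES):
    """The quarter of the stripe a given buddy indexes (round-robin split)."""
    return stripe_files[buddy::n_buddies]


stripe = [f"file{i}" for i in range(20)]  # one stripe of the corpus
shares = [buddy_share(stripe, b) for b in range(BUDDIES)]

# The four quarters partition the stripe: every file indexed exactly once.
indexed = sorted(f for s in shares for f in s)
assert indexed == sorted(stripe)

# One buddy's fraction of the whole corpus: (1/10) * (1/4) = 1/40
assert (1 / STRIPES) * (1 / BUDDIES) == 1 / 40
```

The merge step then combines the four partial indexes of a stripe and writes the result back to the buddies, matching the description above.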