1 / 1

Preliminary Results

Optimizing Storage for Web-Scale Search Application. Hemant Surale , Varun Chandramouli, Shashank Gupta, Soumen Chakrabarti , Ameya Usgaonkar, Atish Kathpal,. Motivation. Application Architecture. Need for a DFS Local File-Systems are not scalable

neil
Download Presentation

Preliminary Results

An Image/Link below is provided (as is) to download presentation Download Policy: Content on the Website is provided to you AS IS for your information and personal use and may not be sold / licensed / shared on other websites without getting consent from its author. Content is provided to you AS IS for your information and personal use only. Download presentation by click this link. While downloading, if for some reason you are not able to download a presentation, the publisher may have deleted the file from their server. During download, if you can't get a presentation, the file might be deleted by the publisher.

E N D

Presentation Transcript


  1. Optimizing Storage for Web-Scale Search Application • HemantSurale, Varun Chandramouli, Shashank Gupta, SoumenChakrabarti, Ameya Usgaonkar, Atish Kathpal, Motivation Application Architecture • Need for a DFS • Local File-Systems are not scalable • A lot of manual work is required in case of node failures, adding a new node etc. • Problems with HDFS • Single point of failure • Can not be mounted as a POSIX file system • Fuse-mount leads to poor performance • Ceph FS distributes data pseudo-randomly • Leads to a lot of network traffic • Sub-optimal performance for applications with known data patterns Approach • Instrument appropriate application specific CRUSH maps to minimize read-write interference • During Indexing phase, data is read from one OSD and the index is written back into the cluster, leading to a read-write interference • By setting aside certain OSDs only for writing of index data, we eliminate this interference • Instrument a file to primary OSD mapping to reduce network traffic • Creating the map adds an overhead • Using this map, we can have each node process only the data local to it, reducing network traffic • Obtain IO and Network traces to analyze performance further • Use machine-learning to observe data access patterns and optimize data placement • XFS • Corpus is divided into ten stripes • Each stripe is replicated among four buddies • Each buddy generates the index for 1/4th of its data i.e. 1/40th of the corpus • The indexes generated by each buddy are merged, and written back to them. • Issues to be addressed with regard to Ceph • Due to random placement of data, the location of the files is unknown • This makes it difficult to maximize local reads and writes, leading to a lot of network traffic • No two nodes/OSDs are exact replicas of each other. This leads to indexing of the corpus 4 times. Preliminary Results • Performance of Vanilla Ceph vs. Modified Ceph • The CRUSH Maps were modified to minimize the read-write interference • Tests on a sample application showed a 33% improvement over Vanilla Ceph. • Performance of Reads only vs. Reads and Writes • Performance decreases in case of reads and writes • This decrease is attributed to read-write interference Future Work • Obtain performance results on actual cluster using modified CRUSH Maps • Obtain IO and network traces and analyze them to see if there is potential to improve performance • Obtain performance results using file-OSD mapping • Compare reduction in network traffic • Reduce overhead of producing the file-OSD map • Benchmark the performance of the system against the native XFS and Hadoop-MR

More Related