A New Approach to File System Cache Writeback of Application Data Sorin Faibish – EMC Distinguished Engineer P. Bixby, J. Forecast, P. Armangau and S. Pawar EMC USD Advanced Development SYSTOR 2010, May 24-26, 2010, Haifa, Israel
Outline • Motivation: changes in server technology • Cache writeback problem statement • Monitoring the behavior of application data flush • Cache writeback as a closed-loop system • Current cache writeback methods are obsolete • The I/O “slow down” problem • New algorithms for cache writeback • Simulation results of the new algorithms • Experimental results on a real NFS server • Summary and conclusions • Future work and extension to Linux FS
Motivation: changes in server technology • Large numbers of cores per CPU – more computing power • Larger, cheaper memory caches – very large amounts of cached data • Very large disk drives – but only a modest increase in disk throughput • Application data I/O has grown much faster – but requires constant flushing to disk • Cache writeback is used to smooth bursty I/O traffic to disk • Conclusion: cache writeback of large amounts of application data becomes increasingly slow relative to the incoming I/O
Cache writeback problem statement • Increasing I/O speeds force servers to cache large amounts of dirty pages to hide disk latency • Large numbers of clients access the servers, increasing disk I/O burstiness and the need for caching • Large FS and server caches allow longer retention of dirty pages • The cache writeback flush is based on cache-fullness metrics • Flushing to disk runs at maximum speed when the cache is full, leaving no room for additional I/Os • As long as the cache is full, I/Os must wait for free cache pages – I/O “stoppage” • Result: application performance is lower than disk performance
Monitoring behavior of application data flush • Understanding the problem: • Instrument the kernel to measure the dynamics of Dirty Pages (DPs) in the cache (a Linux analogue is sketched below) • Monitor the behavior of DPs in the Buffer Cache • Run a multi-client benchmark application
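The paper's instrumentation lives inside the server kernel; as a purely illustrative user-space analogue, the Linux kernel already exposes its dirty-page count as nr_dirty in /proc/vmstat, so a similar time series could be sampled as below (the interval and sample count are assumptions, not the paper's settings).

```python
# Hypothetical user-space sampler of dirty-page dynamics on Linux.
# It polls the kernel's nr_dirty counter from /proc/vmstat at a fixed
# interval and records (timestamp, dirty page count) pairs.
import time

def sample_dirty_pages(interval_sec=0.004, samples=1000):
    """Return a list of (timestamp, nr_dirty) pairs sampled from /proc/vmstat."""
    series = []
    for _ in range(samples):
        with open("/proc/vmstat") as f:
            for line in f:
                if line.startswith("nr_dirty "):  # trailing space skips nr_dirty_threshold
                    series.append((time.time(), int(line.split()[1])))
                    break
        time.sleep(interval_sec)
    return series
```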
Cache writeback as a closed loop system • Application controls the flush: I/O commits are issued based on the application's cache state • DPs in the cache are the difference between incoming I/O and DPs flushed to disk • The goal is to keep this difference (the error) at zero • The error loop is closed because the application sends commits after each I/O • Cache writeback is controlled by the application • In contrast: flush to disk based on the fullness of the Buffer Cache • The cache control mechanism ensures cache availability for new I/Os • DPs in the cache behave like water in a tank (a simulation sketch follows) • The water level is controlled by the cache manager to prevent overflow • There is no relation between application I/O arrival and when the I/O is flushed to disk • The result is large delays between I/O creation and I/O reaching disk – an open loop • Cache writeback is controlled by the algorithm
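To make the water-tank view concrete, here is a minimal simulation sketch (my own illustration, not code from the paper): the dirty-page level integrates the difference between incoming I/O and whatever the flush policy drains each tick, so the choice of policy alone determines whether the loop on the DP level is effectively closed.

```python
# Minimal "water tank" model of the buffer cache: dirty pages (DP)
# accumulate at the incoming I/O rate and drain at whatever rate the
# writeback policy chooses for that tick.

def simulate(incoming_rates, flush_policy, cache_capacity=100_000):
    """Track the DP level per tick; flush_policy(dp) returns pages to flush."""
    dp, history = 0, []
    for arriving in incoming_rates:
        dp = min(dp + arriving, cache_capacity)  # new dirty pages, capped by the cache
        dp -= min(flush_policy(dp), dp)          # the policy decides the drain rate
        history.append(dp)
    return history

# Open-loop example: flush a fixed trickle regardless of the DP level.
trickle = lambda dp: 500

# Closed-loop example: drain in proportion to the DP level (the error signal).
proportional = lambda dp: dp // 4
```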
Current cache writeback methods • Trickle flush of DPs • Flushes in proportion to incoming application I/Os (rate based) • Runs at low priority to reduce CPU consumption • A background task with low efficiency • Used only to reduce memory pressure • Cannot absorb high bursts of I/O • Watermark-based flush of DPs (sketched below) • Inspired by database and transactional applications • Cache writeback is triggered by the number/proportion of DPs in the cache • There is no prediction of high I/O bursts – a disadvantage with many clients • The flush is done at maximum disk speed to reduce latency • Close to the incoming I/O rate for small caches – flushes often • Inefficient for very large caches • Interferes with metadata and read operations
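A hedged sketch of the watermark scheme described above; the threshold and rate values are illustrative assumptions, not the parameters of any real server. Once the DP count crosses the high watermark, flushing runs at full disk speed and only stops again below the low watermark – the saturation that the next slide identifies as the source of oscillation.

```python
# Illustrative watermark-based writeback policy with hysteresis.
HIGH_WM = 80_000        # start flushing above this many dirty pages (assumed)
LOW_WM = 40_000         # stop flushing below this many dirty pages (assumed)
MAX_DISK_RATE = 10_000  # pages per tick the disk can absorb (assumed)

def watermark_policy(dp, flushing):
    """Return (pages_to_flush, flushing) for one tick."""
    if dp >= HIGH_WM:
        flushing = True     # cache nearly full: flush at maximum disk speed
    elif dp <= LOW_WM:
        flushing = False    # enough room again: stop flushing entirely
    return (MAX_DISK_RATE if flushing else 0), flushing
```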
Current cache writeback deficiency • Watermark-based flushing of DPs acts like a non-linear saturation effect in the cache's closed loop • It introduces oscillations in the DP behavior due to the saturation • The oscillation adds I/O latency on top of the disk latency • It creates bursty disk I/O – reducing aggregate performance
I/O “slow down” problem • Flushing application data requires FS metadata (MD) updates to the same disks • The flush is triggered when the high watermark threshold is crossed • Watermark-based flushing cannot throttle the I/O rate, as it is a last resort before the kernel crashes on memory starvation • Additional I/Os are slowed down until the MD for the newly arriving I/Os is flushed • Even if NVRAM is used, DPs must be removed from the cache to make room for additional I/Os • Application I/O latency increases until the cache is freed – “slow down” • In the worst cases the latency is so high that it resembles an I/O stoppage • If additional bursts of I/O arrive from other new clients, there is no room for them, and new I/Os wait until the DP count drops under the low watermark – stoppage
New algorithms for cache writeback • Address the deficiencies of current cache writeback methods • Inspired by control systems and signal processing theory • Use adaptive control and machine learning methods • Make better use of modern HW characteristics • The goals of the solution are: • Reduce I/O slowdown, limited only by the maximum disk I/O throughput • Minimize disk I/O burstiness • Maximize aggregate I/O performance of the system (benchmark) • The same algorithms apply to network as well as local FSs • All the algorithms can be used to flush both application DPs and MD DPs
New algorithms for cache writeback (cont.) • We present and simulate only 5 algorithms (more were considered): • Modified Trickle Flush – an improved version of trickle flush that raises its priority and uses more CPU • Fixed Interval Algorithm – uses a target number of DPs, similar to watermark methods, but compensates better for I/O bursts (semi-throttling) by pacing the flush to disk • Variable Interval Algorithm – an adaptive control scheme that adapts the time interval based on the change in DPs during the previous interval, similar to trickle flush but with faster adaptation to I/O bursts • Quantum Flush – based on the idea of lowest DP retention in the cache, similar to watermark-based methods, but adapts the flush speed in proportion to the number of new I/Os in the previous sample period • Rate of Change Proportional Algorithm – flushes DPs in proportion to the first derivative of the DP count, using a fixed interval and a forgetting factor proportional to the difference between the I/O rate and the maximum disk throughput (a sketch follows this list): c = R * (t − t_i) + W * μ, where μ = α * (B − R) / B
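A minimal sketch of the Rate of Change Proportional flush built directly from the formula above; the reading of the symbols (R as the first derivative of the DP count, W as the current DP backlog, B as the maximum disk throughput, α as a tuning gain) is an interpretation of the slide, and the constant values are assumptions, not values from the paper.

```python
# Illustrative Rate of Change Proportional flush: c = R*(t - t_i) + W*mu,
# with forgetting factor mu = alpha*(B - R)/B. R is the first derivative of
# the dirty-page (DP) count, W the current DP backlog, B the maximum disk
# throughput; ALPHA and B below are assumed values.
ALPHA = 0.5   # tuning gain for the forgetting factor (assumed)
B = 10_000    # maximum disk throughput, pages per second (assumed)

def rate_proportional_flush(dp_now, dp_prev, interval_sec):
    """Return the number of dirty pages to flush over the next fixed interval."""
    r = (dp_now - dp_prev) / interval_sec  # first derivative of the DP count
    mu = ALPHA * (B - r) / B               # drain more backlog when the disk has headroom
    c = r * interval_sec + dp_now * mu     # newly arrived dirt plus a fraction of the backlog
    return max(0, min(int(c), dp_now))     # clamp to the pages that actually exist
```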
Simulation results of the new algorithms • The best algorithm was selected by: • Optimal behavior under unexpected I/O bursts • Flushing that best matches the rate of change of DPs in the cache (minimum DP level) • Minimal I/O slowdown for clients (reduced average I/O latency) • The rate-of-change algorithm with the forgetting factor was the best
Experimental results on a real NFS server • We implemented the Modified Trickle and Rate Proportional algorithms on the Celerra NAS server • Used the SPEC sfs2008 benchmark and measured the number of DPs in the cache at 4 msec resolution • Experimental results show some I/O slowdown with the Modified Trickle algorithm, resulting in 92K NFS IOPS (diagrams sampled at the same 55K NFS IOPS load level) • The Rate Proportional algorithm shows much shorter I/O slowdown times, resulting in 110.6K NFS IOPS
Summary and conclusions • Discussed new algorithms and a new paradigm to address cache writeback in modern FSs and servers • Discussed how the new algorithm can reduce the impact of application I/O bursts on aggregate I/O performance, which is otherwise bounded by the maximum disk speed • Showed how current cache writeback algorithms create I/O slowdown at I/O rates that are lower than the disk speed but change rapidly • Reviewed a selection of algorithms from the literature and explained their deficiencies • Discussed several new algorithms and showed simulation results that allowed us to select the best algorithm for experimentation • Presented experimental results for 2 algorithms and showed that Rate Proportional is the best by the given success criteria • Finally, discussed how these algorithms can be used for MD and DPs on any file system, network or local
Future work and extension to Linux FS • Investigate additional algorithms inspired by non-linear signal processing that address the oscillatory behavior • Address similar behavior for cache writeback in local file systems, including ext3, ReiserFS and ext4 on Linux (a discussion at the next Linux workshop) • Linux FS developers are aware of this behavior and are currently working to instrument the Linux kernel with the same measurement tools we used • We are also looking at machine learning to compensate for very fast I/O rate changes, which would allow application performance to be optimized for very large numbers of clients • Additional work is needed to find algorithms that let the maximum application performance equal the maximum aggregate disk performance • We are also looking to instrument NFS clients’ kernels to let us evaluate the I/O slowdown and tune the flush algorithm to reduce the slowdown effect to zero • More work is needed to extend this study to MD and to find new MD-specific flushing methods