A Semantic-based Cache Replacement Algorithm for Mobile File Access

Sharun Santhosh and Weisong Shi • Department of Computer Science, Wayne State University • weisong@wayne.edu • http://mist.cs.wayne.edu

Motivation • The Future • Staying connected anywhere, anytime will become a reality • How? • Cable modem or DSL connection at home • High-speed Ethernet network at work or school • Satellite network in the car • WiFi network at the airport or the neighborhood coffee shop • Challenges • Effectiveness - Adapt to the varying underlying connectivity • Convenience - Adaptation should be transparent to the user • Security - Secure access on resource-constrained devices

Heterogeneous Environment [Figure: a mobile device moving among a Bluetooth personal area network (PAN), an 802.11a/b/g wireless LAN (wLAN), a wired LAN with workgroup switches and a wireless bridge, GPS, and a GSM/CDMA wide area network (WAN). Labeled link speeds range from 9.6 Kbit/s to under 100 Mb/s; labeled uses include access, synchronization, "hot spots", voice, SMS, e-mail, web browsing, mCommerce, Internet access, document transfer, and low/high quality video.]

Our Solution: CEGOR (ClosE and Go, Open and Resume) • Adaptive Communication Optimization (Fractal) • Semantic-based Caching • Connection View based Secure and Transparent Reconnection

Roadmap • Motivation • Caching • Semantic-based Caching • Simulation Results • Conclusion

Caching • Accessing data anywhere and anytime involves three basic steps • Retrieve the files from the server • Work on them locally • Write the changes back to the server • A cache optimizes this process • Reduces the frequency of disk operations • Reduces the frequency of requests to the file servers • Reduces network load • Problem being addressed • Minimization of communication

Why Study Caching? • It has been studied extensively, yet LRU is still the most commonly used algorithm • Used in NFS, AFS, Sprite, Coda and most operating system buffer caches • Why? • It is simple to implement • Cache misses are acceptable in existing systems • The number of files replaced does not matter • High hit ratio vs. number of replacements • But in a heterogeneous environment • Each miss implies additional communication • Work must be stored locally (when in a weakly connected or disconnected state) • Cannot assume a reliable link exists with the server
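To make the "hit ratio vs. number of replacements" trade-off concrete, here is a minimal LRU sketch that counts both. It is an illustration only, not the implementation used in NFS, AFS, or the simulator described later; the class name `LRUCache` and its counters are hypothetical, and capacity is counted in files rather than bytes for simplicity.

```python
from collections import OrderedDict


class LRUCache:
    """Count-bounded LRU cache that tracks hits, misses, and replacements."""

    def __init__(self, capacity):
        self.capacity = capacity
        self.items = OrderedDict()  # least recently used entry comes first
        self.hits = 0
        self.misses = 0
        self.replacements = 0

    def access(self, name):
        """Record one access to `name`; return True on a hit."""
        if name in self.items:
            self.items.move_to_end(name)  # mark as most recently used
            self.hits += 1
            return True
        self.misses += 1
        if len(self.items) >= self.capacity:
            self.items.popitem(last=False)  # evict the least recently used
            self.replacements += 1
        self.items[name] = True
        return False

    def hit_ratio(self):
        total = self.hits + self.misses
        return self.hits / total if total else 0.0


# Toy access sequence (made up for illustration).
cache = LRUCache(capacity=2)
for name in ["a", "b", "a", "c", "b"]:
    cache.access(name)
```

On a well-connected network only `hit_ratio()` matters; over a weak link, every increment of `replacements` is a file that may have to be fetched again later.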

Usage Scenario “Imagine a field engineer is accessing layout diagrams for a faulty electricity sub-station, half way through communications go down. A cache MISS may cause several minutes delay, perhaps longer, e.g., Which was the 10,000 volt cable?”

Is simple caching (LRU) enough?

Goals • Caches for distributed file systems that operate across heterogeneous networks must • Provide the hit rates of conventional caches that operate over homogeneous networks • Minimize communication overhead, i.e., minimize replacements, which means increased file availability and

Our Approach to Caching • File access patterns aren’t random • A semantic relationship exists between two files in a file access sequence • User behavior • Program execution • We define and investigate two kinds of such relations • Inter-file relations • Intra-file relations • We introduce the notion of eviction index for each cached item
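As a rough illustration of how an eviction index drives replacement, the sketch below picks a victim by ranking cached files on a pluggable index function. The slides do not give the index formula or say whether a high or low value means "evict first", so both the direction (largest index is evicted) and the example index (time since last access, which reduces to LRU) are assumptions; `choose_victim` is a hypothetical name.

```python
def choose_victim(cached_files, eviction_index):
    """Return the cached file with the largest eviction index.

    Assumption: a larger index means "less worth keeping". Swap max for
    min if the opposite convention applies.
    """
    return max(cached_files, key=eviction_index)


# Hypothetical data: last access time of each cached file, in seconds.
last_access = {"a.c": 10.0, "a.h": 42.0, "notes.txt": 3.0}
now = 50.0

# Example index: time since last access (this particular choice is plain LRU).
victim = choose_victim(last_access, lambda name: now - last_access[name])
```

A semantic policy would keep the same selection step and replace only the index function with one built from inter-file or intra-file relations.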

Outline • Motivation • Caching • Semantic-based Caching • Simulation Results • Conclusion

Inter-file relations Analysis of DFS traces

Inter-file relations An inter-file relationship exists between two files i and j if i is the next file opened after j is closed. File j is called file i’s precursor. Xi - represents the number of times file i is accessed. Ti - represents the time since the last access to file i. Yj - represents the number of times file j precedes file i.
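The quantities Xi, Ti, and Yj can be gathered in one pass over an open/close trace. The sketch below is one way to do that, under stated assumptions: the trace is a time-sorted list of `(time, op, name)` tuples (a hypothetical format, not the DFS trace format), a file is not counted as its own precursor, and only the first open after a close forms a relation. It collects the statistics only; it does not compute the eviction index itself.

```python
from collections import Counter, defaultdict


def inter_file_stats(events):
    """Collect inter-file statistics from (time, op, name) events.

    Returns (access_count, precursor_count, last_access), where
    access_count[i] is X_i, precursor_count[i][j] is Y_j for file i,
    and T_i can be derived as now - last_access[i].
    """
    access_count = Counter()
    precursor_count = defaultdict(Counter)
    last_access = {}
    last_closed = None  # most recently closed file with no open since
    for time, op, name in events:
        if op == "open":
            access_count[name] += 1
            last_access[name] = time
            if last_closed is not None and last_closed != name:
                precursor_count[name][last_closed] += 1
            last_closed = None  # only the next open after a close counts
        elif op == "close":
            last_closed = name
    return access_count, precursor_count, last_access


# Toy trace (made up): make, a.c, a.h, then make and a.c again.
events = [
    (0, "open", "make"), (1, "close", "make"),
    (2, "open", "a.c"), (3, "close", "a.c"),
    (4, "open", "a.h"), (5, "close", "a.h"),
    (6, "open", "make"), (7, "close", "make"),
    (8, "open", "a.c"), (9, "close", "a.c"),
]
x, y, last = inter_file_stats(events)
```

Here `y["a.c"]["make"]` counts how often `make` was the precursor of `a.c`, which is the kind of evidence an inter-file policy would use to keep `a.c` cached while `make` is in use.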

Intra-file relations • An intra-file relationship is said to exist between two files i and j if both are opened before either is closed • Intra-file relations are based on the shared time Si,j • O(i) and C(i) are the times at which file i was opened and closed, respectively [Figure: timeline showing the open (O) and close (C) times of files i and j and their shared time Si,j.]

Intra-file relations Ti - represents the time since the last access to file i. Tj - represents the time since the last access to file j where j is open before i is closed. Si,j - represents the shared time of file i with respect to file j where i is closed before j. Stotal - represents the total shared time with all files that are open before i is closed
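The exact formula for Si,j is in a figure that is not reproduced in this transcript. As a sketch only, the code below assumes shared time is the length of the overlap between the two files' open intervals, and that Stotal is the sum of that overlap over all other files. The function names and the `intervals` dictionary are hypothetical.

```python
def shared_time(open_i, close_i, open_j, close_j):
    """Length of time during which both files are open (0.0 if none).

    Assumption: shared time is the overlap of the two open intervals.
    """
    return max(0.0, min(close_i, close_j) - max(open_i, open_j))


def total_shared_time(name, intervals):
    """Sum of shared time between `name` and every other file.

    `intervals` maps file name -> (open_time, close_time).
    """
    o_i, c_i = intervals[name]
    return sum(
        shared_time(o_i, c_i, o_j, c_j)
        for other, (o_j, c_j) in intervals.items()
        if other != name
    )


# Toy open/close times in seconds (made up for illustration).
intervals = {"a.c": (0.0, 5.0), "a.h": (2.0, 8.0), "b.c": (6.0, 9.0)}
s_ij = shared_time(*intervals["a.c"], *intervals["a.h"])
s_total = total_shared_time("a.c", intervals)
```

In this toy data `a.c` overlaps only with `a.h`, so the pairwise and total values coincide; `b.c` opens after `a.c` closes and contributes nothing.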

Inter + Intra

Workload DFS Traces from CMU were utilized during the simulation

Implementation • Seven replacement algorithms • RR – Round Robin • LRU – Least recently used • LFU – Least frequently used • GDS – Greedy dual size • INTER – based only on inter-file relations • INTRA – based only on intra-file relations • BOTH – based on both inter-file and intra-file relations • Varying cache sizes • 10KB, 25KB, 50KB, 100KB, 500KB … • Seven traces • The simulator maintains a cache (hash list), an open list (files currently open), and a close list (files that have been closed).
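A trace-driven comparison of this kind can be sketched as a replay loop with a pluggable victim policy. The code below is not the authors' simulator: it omits the open and close lists, uses a made-up trace format of `(name, size_bytes)` accesses, and shows only LRU and LFU. All names are hypothetical.

```python
def simulate(trace, capacity_bytes, pick_victim):
    """Replay `trace`, a list of (name, size_bytes) accesses.

    `pick_victim(cache, clock)` returns the name to evict, where `cache`
    maps name -> {"size", "last", "count"}. Returns (hits, replacements).
    """
    cache, used, hits, replacements = {}, 0, 0, 0
    for clock, (name, size) in enumerate(trace):
        if name in cache:
            hits += 1
            cache[name]["last"] = clock
            cache[name]["count"] += 1
            continue
        if size > capacity_bytes:
            continue  # file can never fit; treat as an uncached miss
        while used + size > capacity_bytes:
            victim = pick_victim(cache, clock)
            used -= cache.pop(victim)["size"]
            replacements += 1
        cache[name] = {"size": size, "last": clock, "count": 1}
        used += size
    return hits, replacements


def lru(cache, clock):
    """Evict the entry with the oldest last-access time."""
    return min(cache, key=lambda n: cache[n]["last"])


def lfu(cache, clock):
    """Evict the entry with the fewest accesses."""
    return min(cache, key=lambda n: cache[n]["count"])


# Toy trace (made up): file "a" is popular, "b" and "c" are one-offs.
trace = [("a", 4), ("a", 4), ("b", 4), ("c", 4), ("a", 4)]
results = {policy.__name__: simulate(trace, 8, policy) for policy in (lru, lfu)}
```

Running the same trace over several `capacity_bytes` values and policies yields the two curves the slides report: hit rate and number of replacements per policy.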

Structure of simulator

Simulator pseudocode

Outline • Motivation • Caching • Semantic-based Caching • Simulation Results • Conclusion

Hit rates of all algorithms [Figure: hit rate plot; a curve is labeled INTER/BOTH.]

Replace attempts of all algorithms [Figure: replacement plot; curves are labeled INTRA and INTER/BOTH.]

Performance – DFS Traces

Performance

The Need For File System Tracing • Traces have not been collected often enough to reflect present-day usage • Publicly available traces, such as those collected at the disk driver level or web proxy traces, do not give us relevant information on file system workload.

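System Call Interception [Figure: in user space, a standard library call such as int fopen(char *name,char *mode) leads to the open system call. In kernel space, the trace module replaces the open and close entries of the system call table with new versions that pass each call to a data logger and then invoke the original open and close.]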

Analysis Summary • Most files were open for less than a hundredth of a second • The majority of files are accessed only a few times; a small percentage of files are very popular • The majority of files are smaller than 100KB, but large files can be very large (heavy tail) • Almost half the accesses repeat within a short period of first occurring • File throughput has greatly increased due to the presence of large files • The majority of files accessed have a unique predecessor

MIST Traces – Hit Rates

MIST Traces – Files Replaced

MIST Traces – Byte Hit Rate

Summary • We have presented a semantic-based caching algorithm and shown that it performs better than conventional caching approaches in terms of hit ratio and byte hit ratio • We have also shown that it does so while performing far fewer replacements • Compared to prevalent replacement strategies that ignore file relations and communication overhead, this approach appears better suited to distributed file systems that operate across heterogeneous environments
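For readers unfamiliar with the two metrics, the sketch below shows how hit ratio and byte hit ratio differ. It uses the standard definitions (hits over accesses, bytes served from cache over bytes requested); the function name and the toy access log are hypothetical and are not results from the DFS or MIST traces.

```python
def hit_ratios(accesses):
    """Return (hit_ratio, byte_hit_ratio) for a non-empty access log.

    `accesses` is a list of (size_bytes, was_hit) pairs.
    """
    total = len(accesses)
    total_bytes = sum(size for size, _ in accesses)
    hits = sum(1 for _, was_hit in accesses if was_hit)
    hit_bytes = sum(size for size, was_hit in accesses if was_hit)
    return hits / total, hit_bytes / total_bytes


# Toy log (made up): two small hits and one large miss.
ratio, byte_ratio = hit_ratios([(100, True), (100, True), (800, False)])
```

The toy log hits two of three accesses but serves only a fifth of the bytes from cache, which is why both metrics are reported: over a slow link, a miss on one large file can cost more than many misses on small ones.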

Future Work • Collecting more up-to-date distributed file system traces • Applying the cache replacement algorithm to a real wireless file system in a computer-assisted surgery application • Extending the idea to more general applications, such as mobile database access

Questions & Comments? weisong@wayne.edu http://mist.cs.wayne.edu
