File Classification in self-* storage systems • Michael Mesnier, Eno Thereska, Gregory R. Ganger, Daniel Ellard, Margo Seltzer
Introduction • Self-* infrastructures need information about • Users • Applications • Policies • This information is not readily provided, and the system cannot depend on users or applications to supply it • So? It must be learned
Self-* storage systems • File classification is a sub-problem of the self-* infrastructure • Key idea: extract hints from what creators associate with their files • File size • File names • Lifetimes • Once intentions are determined, placement decisions can be made • Result: better file organization and performance
Classifying Files • Current practice: rule-of-thumb policy selection • Generic, not optimized • Better: distinguish classes of files • Enables finer-grained policies • Ideally assigned at file creation • So classes must be determined at creation time • The self-* system must learn this association • From 1) traces or 2) a running file system
So, how? • Build a model that classifies files based on creation-time attributes, e.g. • Name • Owner • Permissions • Irrelevant attributes must be filtered out • The classifier must learn rules to do so • Learned from a sample (training set) of files • Then inference can happen at file creation (a sketch of gathering such attributes follows below)
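For illustration only (not code from the paper): the kind of creation-time attribute record such a classifier might consume could be gathered roughly as below. The specific feature names and the `gather_attributes` helper are assumptions.

```python
import os
import stat

def gather_attributes(path):
    """Collect creation-time attributes for a file (hypothetical feature set).

    Only attributes known at (or shortly after) creation are used: the
    classifier cannot rely on access history it has not yet observed.
    """
    st = os.stat(path)
    name = os.path.basename(path)
    return {
        "extension": os.path.splitext(name)[1].lower(),  # e.g. ".log", ".o"
        "owner_uid": st.st_uid,                          # who created the file
        "mode": stat.S_IMODE(st.st_mode),                # permission bits
        "name_has_tmp": "tmp" in name.lower(),           # naming-convention hint
    }

# Example: features for this script itself
print(gather_attributes(__file__))
```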
The right model • The model must be • Scalable • Dynamic • Cost-sensitive (accounts for mis-prediction cost) • Interpretable (by humans) • Model selected: decision trees
ABLE • Attribute-Based Learning Environment • 1. Obtain traces • 2. Build a decision tree • 3. Make predictions • The tree is built top-down, until all attributes are used • The sample is split until each leaf contains files with similar attributes • Once the tree is built, it can be queried (a sketch of this induction follows below)
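A minimal sketch of step 2, assuming a generic ID3-style splitter rather than ABLE's exact induction algorithm: the tree is grown top-down, at each node choosing the attribute that best separates the classes, until the leaves contain files of (mostly) one class. The attribute names and training data below are invented for illustration.

```python
from collections import Counter
from math import log2

def entropy(labels):
    """Shannon entropy of a list of class labels."""
    n = len(labels)
    return -sum((c / n) * log2(c / n) for c in Counter(labels).values())

def best_attribute(samples, attributes):
    """Pick the attribute whose split gives the lowest weighted entropy."""
    def split_entropy(attr):
        groups = {}
        for features, label in samples:
            groups.setdefault(features[attr], []).append(label)
        n = len(samples)
        return sum(len(g) / n * entropy(g) for g in groups.values())
    return min(attributes, key=split_entropy)

def build_tree(samples, attributes):
    """Grow the tree top-down, splitting until leaves are pure or attributes run out."""
    labels = [label for _, label in samples]
    if len(set(labels)) == 1 or not attributes:
        return Counter(labels).most_common(1)[0][0]      # leaf: majority class
    attr = best_attribute(samples, attributes)
    remaining = [a for a in attributes if a != attr]
    branches = {}
    for value in {f[attr] for f, _ in samples}:
        subset = [(f, l) for f, l in samples if f[attr] == value]
        branches[value] = build_tree(subset, remaining)
    return (attr, branches)

def predict(tree, features, default="unknown"):
    """Answer a query by walking the tree with a new file's attributes."""
    while isinstance(tree, tuple):
        attr, branches = tree
        tree = branches.get(features[attr], default)
    return tree

# Toy, invented training sample: predict whether a new file will be short-lived
training = [
    ({"extension": ".tmp", "owner": "build"}, "short-lived"),
    ({"extension": ".o",   "owner": "build"}, "short-lived"),
    ({"extension": ".c",   "owner": "alice"}, "long-lived"),
    ({"extension": ".pdf", "owner": "alice"}, "long-lived"),
]
tree = build_tree(training, ["extension", "owner"])
print(predict(tree, {"extension": ".tmp", "owner": "alice"}))   # -> "short-lived"
```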
Tests • Run against traces from several systems to make sure the approach is workload-independent • DEAS03 • EECS03 • CAMPUS • LAB • The control: the MODE algorithm, which places all files in a single cluster (a sketch follows below)
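One reading of such a control, assuming MODE simply ignores the attributes and always returns a single prediction (the most common class in the training sample); the labels below are invented:

```python
from collections import Counter

class ModeBaseline:
    """Control model: ignore attributes and always predict the most common class."""
    def fit(self, labels):
        self.prediction = Counter(labels).most_common(1)[0][0]
        return self
    def predict(self, _features=None):
        return self.prediction

# Invented example: if most files in the sample are long-lived, MODE predicts
# "long-lived" for every new file; this is the accuracy floor the tree must beat.
baseline = ModeBaseline().fit(["long-lived", "long-lived", "short-lived"])
print(baseline.predict({"extension": ".tmp"}))   # -> "long-lived"
```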
Results • Prediction results are quite good • 90%–100% accuracy claimed • Attribute-based clustering of files is clearly evident • The authors predict that a model's ruleset will converge over time
Benefits of incremental learning • Dynamically refines the model as samples become available (a sketch of the idea follows below) • Generally better than one-shot learners • One-shot learners sometimes perform poorly • The rulesets of incremental learners are smaller
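The paper's incremental learners refine a decision tree; as a much simpler, invented stand-in, the sketch below only illustrates the general idea of folding each new observation into the model instead of training once on a fixed sample.

```python
from collections import Counter, defaultdict

class IncrementalExtensionModel:
    """Toy incremental learner (not the paper's algorithm): per-extension class
    counts are refined with every new labelled file that is observed."""
    def __init__(self):
        self.counts = defaultdict(Counter)   # extension -> class frequencies

    def observe(self, extension, label):
        """Fold one new sample into the model; no full retraining is needed."""
        self.counts[extension][label] += 1

    def classify(self, extension, default="unknown"):
        seen = self.counts.get(extension)
        return seen.most_common(1)[0][0] if seen else default

model = IncrementalExtensionModel()
model.observe(".tmp", "short-lived")
model.observe(".tmp", "short-lived")
model.observe(".c", "long-lived")
print(model.classify(".tmp"))   # -> "short-lived"
```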
On accuracy • More attributes mean a greater chance of over-fitting • More rules mean smaller compression ratios • The model loses its compression benefit • Predictive models can make false predictions • These can hurt performance • e.g., data that should be in RAM is placed on disk instead • Solution: cost functions (a sketch follows below) • Penalize costly errors • Produce a biased tree • System goals must be translated into these costs
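One possible reading of the cost-function idea (a sketch, not the paper's formulation): assign each kind of mis-prediction a cost and have the classifier choose the class with the lowest expected cost at a leaf, rather than simply the most likely class. The class names and cost values below are invented.

```python
def min_expected_cost_class(class_probs, cost):
    """Pick the class that minimizes expected mis-prediction cost.

    class_probs: {class: probability} at a decision-tree leaf.
    cost[predicted][actual]: penalty for predicting `predicted` when the
    true class is `actual` (0 on the diagonal).
    """
    def expected_cost(predicted):
        return sum(p * cost[predicted][actual] for actual, p in class_probs.items())
    return min(class_probs, key=expected_cost)

# Invented example: wrongly keeping a cold file in RAM wastes a little memory,
# but demoting a hot file to disk hurts performance, so the tree is biased "hot".
probs = {"hot": 0.4, "cold": 0.6}
cost = {
    "hot":  {"hot": 0.0, "cold": 1.0},   # predicted hot, actually cold: small waste
    "cold": {"hot": 5.0, "cold": 0.0},   # predicted cold, actually hot: expensive miss
}
print(min_expected_cost_class(probs, cost))  # -> "hot" despite its lower probability
```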
Conclusion • These trees provide prediction accuracies in the 90% range • Adaptable via incremental learning • Continued work: integration into self-* infrastructure