File Classification in self-* storage systems • Michael Mesnier, Eno Thereska, Gregory R. Ganger, Daniel Ellard, Margo Seltzer
Introduction • Self-* infrastructures need information about • Users • Applications • Policies • This information is not readily provided, and the system cannot depend on users or applications to supply it • So? It must be learned
Self-* storage systems • File classification is a sub-problem of the self-* infrastructure • Key idea: extract hints from what creators associate with their files • File size • File names • Lifetimes • Once intentions are determined, placement decisions can be made • Result: better file organization and performance
Classifying Files • Current practice: rule-of-thumb policy selection • Generic, not optimized • Better: distinguish classes of files • Enables finer-grained policies • Ideally assigned at file creation • So classes must be determined at creation time • The self-* system must learn this association • From 1) traces or 2) a running file system
So, how? • Build a model that classifies files based on creation-time attributes, e.g. • Name • Owner • Permissions • Irrelevant attributes must be filtered out • The classifier must learn rules to do so • Learned from a sample (training set) of files • Then inference can happen at file creation (a sketch of gathering such attributes follows below)
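For illustration only (not code from the paper): the kind of creation-time attribute record such a classifier might consume could be gathered roughly as below. The specific feature names and the `gather_attributes` helper are assumptions.

```python
import os
import stat

def gather_attributes(path):
    """Collect creation-time attributes for a file (hypothetical feature set).

    Only attributes known at (or shortly after) creation are used: the
    classifier cannot rely on access history it has not yet observed.
    """
    st = os.stat(path)
    name = os.path.basename(path)
    return {
        "extension": os.path.splitext(name)[1].lower(),  # e.g. ".log", ".o"
        "owner_uid": st.st_uid,                          # who created the file
        "mode": stat.S_IMODE(st.st_mode),                # permission bits
        "name_has_tmp": "tmp" in name.lower(),           # naming-convention hint
    }

# Example: features for this script itself
print(gather_attributes(__file__))
```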
The right model • The model must be • Scalable • Dynamic • Cost-sensitive (accounts for mis-prediction cost) • Interpretable (by humans) • Model selected: decision trees
ABLE • Attribute-Based Learning Environment • 1. Obtain traces • 2. Build a decision tree • 3. Make predictions • The tree is built top-down, until all attributes are used • The sample is split until each leaf contains files with similar attributes • Once the tree is built, it can be queried (a sketch of this induction follows below)
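A minimal sketch of step 2, assuming a generic ID3-style splitter rather than ABLE's exact induction algorithm: the tree is grown top-down, at each node choosing the attribute that best separates the classes, until the leaves contain files of (mostly) one class. The attribute names and training data below are invented for illustration.

```python
from collections import Counter
from math import log2

def entropy(labels):
    """Shannon entropy of a list of class labels."""
    n = len(labels)
    return -sum((c / n) * log2(c / n) for c in Counter(labels).values())

def best_attribute(samples, attributes):
    """Pick the attribute whose split gives the lowest weighted entropy."""
    def split_entropy(attr):
        groups = {}
        for features, label in samples:
            groups.setdefault(features[attr], []).append(label)
        n = len(samples)
        return sum(len(g) / n * entropy(g) for g in groups.values())
    return min(attributes, key=split_entropy)

def build_tree(samples, attributes):
    """Grow the tree top-down, splitting until leaves are pure or attributes run out."""
    labels = [label for _, label in samples]
    if len(set(labels)) == 1 or not attributes:
        return Counter(labels).most_common(1)[0][0]      # leaf: majority class
    attr = best_attribute(samples, attributes)
    remaining = [a for a in attributes if a != attr]
    branches = {}
    for value in {f[attr] for f, _ in samples}:
        subset = [(f, l) for f, l in samples if f[attr] == value]
        branches[value] = build_tree(subset, remaining)
    return (attr, branches)

def predict(tree, features, default="unknown"):
    """Answer a query by walking the tree with a new file's attributes."""
    while isinstance(tree, tuple):
        attr, branches = tree
        tree = branches.get(features[attr], default)
    return tree

# Toy, invented training sample: predict whether a new file will be short-lived
training = [
    ({"extension": ".tmp", "owner": "build"}, "short-lived"),
    ({"extension": ".o",   "owner": "build"}, "short-lived"),
    ({"extension": ".c",   "owner": "alice"}, "long-lived"),
    ({"extension": ".pdf", "owner": "alice"}, "long-lived"),
]
tree = build_tree(training, ["extension", "owner"])
print(predict(tree, {"extension": ".tmp", "owner": "alice"}))   # -> "short-lived"
```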
Tests • Run against traces from several systems to make sure the approach is workload-independent • DEAS03 • EECS03 • CAMPUS • LAB • The control: the MODE algorithm, which places all files in a single cluster (a sketch follows below)
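One reading of such a control, assuming MODE simply ignores the attributes and always returns a single prediction (the most common class in the training sample); the labels below are invented:

```python
from collections import Counter

class ModeBaseline:
    """Control model: ignore attributes and always predict the most common class."""
    def fit(self, labels):
        self.prediction = Counter(labels).most_common(1)[0][0]
        return self
    def predict(self, _features=None):
        return self.prediction

# Invented example: if most files in the sample are long-lived, MODE predicts
# "long-lived" for every new file; this is the accuracy floor the tree must beat.
baseline = ModeBaseline().fit(["long-lived", "long-lived", "short-lived"])
print(baseline.predict({"extension": ".tmp"}))   # -> "long-lived"
```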
Results • Prediction results are quite good • 90%–100% accuracy claimed • Attribute-based clustering of files is clearly evident • The authors predict that a model's ruleset will converge over time
Benefits of incremental learning • Dynamically refines the model as samples become available (a sketch of the idea follows below) • Generally better than one-shot learners • One-shot learners sometimes perform poorly • The rulesets of incremental learners are smaller
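The paper's incremental learners refine a decision tree; as a much simpler, invented stand-in, the sketch below only illustrates the general idea of folding each new observation into the model instead of training once on a fixed sample.

```python
from collections import Counter, defaultdict

class IncrementalExtensionModel:
    """Toy incremental learner (not the paper's algorithm): per-extension class
    counts are refined with every new labelled file that is observed."""
    def __init__(self):
        self.counts = defaultdict(Counter)   # extension -> class frequencies

    def observe(self, extension, label):
        """Fold one new sample into the model; no full retraining is needed."""
        self.counts[extension][label] += 1

    def classify(self, extension, default="unknown"):
        seen = self.counts.get(extension)
        return seen.most_common(1)[0][0] if seen else default

model = IncrementalExtensionModel()
model.observe(".tmp", "short-lived")
model.observe(".tmp", "short-lived")
model.observe(".c", "long-lived")
print(model.classify(".tmp"))   # -> "short-lived"
```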
On accuracy • More attributes mean a greater chance of over-fitting • More rules mean smaller compression ratios • The model loses its compression benefit • Predictive models can make false predictions • These can hurt performance • e.g., data that should be in RAM is placed on disk instead • Solution: cost functions (a sketch follows below) • Penalize costly errors • Produce a biased tree • System goals must be translated into these costs
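One possible reading of the cost-function idea (a sketch, not the paper's formulation): assign each kind of mis-prediction a cost and have the classifier choose the class with the lowest expected cost at a leaf, rather than simply the most likely class. The class names and cost values below are invented.

```python
def min_expected_cost_class(class_probs, cost):
    """Pick the class that minimizes expected mis-prediction cost.

    class_probs: {class: probability} at a decision-tree leaf.
    cost[predicted][actual]: penalty for predicting `predicted` when the
    true class is `actual` (0 on the diagonal).
    """
    def expected_cost(predicted):
        return sum(p * cost[predicted][actual] for actual, p in class_probs.items())
    return min(class_probs, key=expected_cost)

# Invented example: wrongly keeping a cold file in RAM wastes a little memory,
# but demoting a hot file to disk hurts performance, so the tree is biased "hot".
probs = {"hot": 0.4, "cold": 0.6}
cost = {
    "hot":  {"hot": 0.0, "cold": 1.0},   # predicted hot, actually cold: small waste
    "cold": {"hot": 5.0, "cold": 0.0},   # predicted cold, actually hot: expensive miss
}
print(min_expected_cost_class(probs, cost))  # -> "hot" despite its lower probability
```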
Conclusion • These trees provide prediction accuracies in the 90% range • Adaptable via incremental learning • Continued work: integration into self-* infrastructure