SCAN-Lite: Enterprise-wide analysis on the cheap. Craig Soules, Kimberly Keeton, Brad Morrey
Enterprise information management • Search • Clustering • Provenance • Classification • IT Trending • Virus scanning (Diagram: metadata server)
Enterprise information management • Data is duplicated across machines! • Duplicate analysis is wasted work (Diagram: metadata server)
Issues • Analysis programs conflict on clients • Contend for system resources (memory, disk) • Clients repeat work • Duplicate files on multiple clients • Client foreground workloads are impacted • Work exceeds available idle time on busy clients
Approaches • Reduce resource contention (Diagram: client)
Approaches • Avoid duplicate work (Diagram: clients)
Approaches • Leverage duplication to balance client load • Delay analysis to identify all duplicates (Diagram: clients, global scheduler)
Solutions • Local scheduler • Coordinates analyses to reduce resource contention • Up to 60% improvement • Global scheduler • Identifies duplicates to remove work • Balance load • 40% reduction in impact to foreground tasks
Local scheduling • Traditionally, analyses are separate programs • Scheduling left to the operating system • Potentially at different times • Each program identifies files to scan • Each program opens and reads file data (Diagram: analysis programs, disk)
Unified local scheduling • Each analysis routine is a separate thread • Control thread manages shared tasks • Identify files to scan, and open/read file data • Shared memory buffer distributes file data (Diagram: analysis plugins, control thread, shared memory, disk)
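The unified approach above can be sketched in a few lines of Python. This is a minimal illustration, not SCAN-Lite's implementation: the function name `run_unified_scan`, the plugin signature `callable(path, chunk)`, and the use of per-plugin queues in place of a real shared memory buffer are all assumptions made for the example.

```python
import queue
import threading


def run_unified_scan(paths, plugins, chunk_size=1 << 16):
    """Read each file once and hand every chunk to all analysis plugins.

    `plugins` maps a name to a callable(path, chunk). Both the mapping and
    the callable signature are hypothetical, chosen for illustration.
    """
    # One bounded queue per plugin stands in for the shared memory buffer.
    queues = {name: queue.Queue(maxsize=8) for name in plugins}

    def worker(name, fn):
        q = queues[name]
        while True:
            item = q.get()
            if item is None:  # sentinel: no more data
                break
            fn(*item)

    threads = [
        threading.Thread(target=worker, args=(name, fn))
        for name, fn in plugins.items()
    ]
    for t in threads:
        t.start()

    # Control-thread role: identify files, then open and read each one once.
    for path in paths:
        with open(path, "rb") as fh:
            while chunk := fh.read(chunk_size):
                for q in queues.values():
                    # Every plugin receives a reference to the same bytes
                    # object, so the data is read from disk only once.
                    q.put((path, chunk))

    for q in queues.values():
        q.put(None)
    for t in threads:
        t.join()
```

The bounded queues make the reader wait for the slowest plugin, which keeps memory use flat; a real system would also need error handling so a failing plugin cannot stall the reader.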
Local scheduling performance • Ran a fitness test using 7 analysis routines • 42 data sets, each containing files of a fixed size • Ran both approaches over each data set • Calculated per-file elapsed scan time • Dual-core 2.8GHz P4 Xeon, 4GB RAM, 70GB RAID 1 • Seven-at-once • Run each analysis routine separately at the same time • Unified • SCAN-Lite’s unified local scheduling approach
Elapsed time vs. CPU time • Original fitness test used CPU time • Gave less variable performance curves for modeling • Disk contention shows up in elapsed time • CPU time is multiplexed • Elapsed time is not (Diagram: App 1, App 2; sum of CPU times, sum of elapsed times, max of elapsed times. Callout: This is very bad)
Local scheduling results • 17% - 60% improvement • Seven-at-once benefits from deep disk queues, but this hurts foreground apps • Small random I/Os have worse interaction than larger ones
Global scheduler • Two goals: • Reduce additional work from duplicate files • Utilize duplication to schedule work to the “best” client • Two-phase scanning • Phase one: identify duplicate files using content hashing • Phase two: analyze one copy at the appropriate client • Delaying between phase one and two provides opportunity for additional duplication and deletion
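Phase one can be illustrated with a short Python sketch. SHA-1 is the hash named later in the slides; the function names and the in-memory `{client: {path: bytes}}` input are assumptions made to keep the example self-contained.

```python
import hashlib
from collections import defaultdict


def content_hash(data: bytes) -> str:
    """Return the SHA-1 digest of a file's contents as a hex string."""
    return hashlib.sha1(data).hexdigest()


def find_duplicates(client_files):
    """Group file copies by content hash.

    client_files: {client: {path: bytes}} (hypothetical input layout).
    Returns {hash: [(client, path), ...]}; any list longer than one entry
    is a set of duplicates that needs only a single analysis in phase two.
    """
    locations = defaultdict(list)
    for client, files in client_files.items():
        for path, data in files.items():
            locations[content_hash(data)].append((client, path))
    return dict(locations)
```

In a deployment, each client would compute hashes locally and upload only the digests, so the server never needs the file data to detect duplicates.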
Traditional scanning (Diagram: clients, server)
Phase one: Duplicate detection (Diagram: clients, server)
Phase two: Scheduling (Diagram: clients, server)
When to schedule • Clients upload hashes each scheduling period • The freshness specifies a deadline by which new data must be analyzed (Timeline diagram: freshness, scheduling period, time. Callouts: Scheduling here gives one option; Scheduling here gives three options; Schedule before this period)
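The interaction between freshness and the scheduling period can be made concrete with a small calculation. The sketch below rests on assumptions not stated in the slides: scheduling rounds fall on multiples of the period, a file's hash becomes visible at the first round at or after its creation, and work scheduled in a round finishes by the start of the next one. The function name is hypothetical.

```python
def latest_safe_round(created, freshness, period):
    """Return the start time of the last scheduling round that still meets
    the freshness deadline, or None if no round can meet it.

    All arguments share one integer time unit (for example, hours).
    """
    deadline = created + freshness
    # Assumption: work scheduled at round k finishes by (k + 1) * period.
    last = deadline // period - 1
    # Assumption: the hash is first visible at the next round boundary.
    first = -(-created // period)  # ceiling division on integers
    return last * period if last >= first else None
```

With a 72-hour freshness and a 24-hour period, a file created at hour 12 can wait until the round at hour 48, which leaves two rounds in which more duplicates of it may appear.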
How to schedule • Scheduling is a bin packing problem • Files are balls, clients are bins • Size of bins is available idle time • Color of balls/bins equates to location of duplicates • Size of balls is time required for analysis (Diagram labels: idle time; A, B, C, D; files; clients)
How to schedule • We use a greedy heuristic for scheduling • Consider idle time and machine priorities • See paper for details (Diagram labels: idle time; A, B, C, D; files; clients)
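The slides defer the heuristic's details to the paper, so the following is only one plausible greedy scheme, not the authors' algorithm. It assumes that files are placed largest first, that a file may run only on a client holding a copy, and that among clients with enough idle time the higher-priority one wins, with remaining idle time as the tie-breaker. All names are hypothetical.

```python
def greedy_schedule(files, idle, priority):
    """Assign each file to one client that holds a copy of it.

    files:    {file_id: (cost, [clients holding a copy])}
    idle:     {client: idle time available}
    priority: {client: int}, where a higher value is preferred
    Returns {file_id: client}.
    """
    remaining = dict(idle)
    assignment = {}
    # Place the most expensive files first, a common bin-packing heuristic.
    for fid, (cost, holders) in sorted(files.items(), key=lambda kv: -kv[1][0]):
        fits = [c for c in holders if remaining[c] >= cost]
        if fits:
            # Prefer high-priority clients, then the one with most slack.
            best = max(fits, key=lambda c: (priority[c], remaining[c]))
        else:
            # Nothing fits: overrun the client with the most idle time left.
            best = max(holders, key=lambda c: remaining[c])
        assignment[fid] = best
        remaining[best] -= cost
    return assignment
```

A negative value in `remaining` after the loop corresponds to analysis time that spills past a client's idle time.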
Work ahead • Start by scheduling all work that meets freshness • Schedule additional work on still idle machines • Any remaining idle time can be used for additional work • We refer to this as work ahead (Diagram labels: idle time; A, B, C, D; files; clients)
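Work ahead can be sketched as a second pass over the idle time left once the due work has been placed. The sketch assumes that early work is scheduled only where it fits entirely within the remaining idle time, and that pending files are considered in the order given. The function name and input layout are hypothetical.

```python
def work_ahead(remaining_idle, pending):
    """Fill leftover idle time with files that are not yet due.

    remaining_idle: {client: idle time left after the due work}
    pending:        [(file_id, cost, [clients holding a copy]), ...]
    Returns {file_id: client} for the files analyzed early.
    """
    left = dict(remaining_idle)
    early = {}
    for fid, cost, holders in pending:
        fits = [c for c in holders if left[c] >= cost]
        if not fits:
            continue  # never exceed idle time for work that is not yet due
        best = max(fits, key=lambda c: left[c])
        early[fid] = best
        left[best] -= cost
    return early
```

Files skipped here simply remain pending for a later scheduling period, where the freshness deadline will eventually force them to be placed.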
Two-phase scanning: Trade-offs (Diagram: one-phase cost, two-phase cost, clients)
Two-phase scanning: Trade-offs • If the cost of hashing exceeds the additional work from duplicates, then one-phase scanning is better • Analysis of hashing costs using SHA-1 indicates that 3% data duplication is the minimum for two-phase scanning to pay off • Do we see that in practice?
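The trade-off can be expressed as a simple per-byte cost model. This model is an assumption made for illustration, not the analysis from the paper: one-phase scanning pays the analysis cost on every byte, while two-phase scanning pays the hashing cost on every byte plus the analysis cost on the non-duplicate fraction only. Under that model the break-even duplication fraction is the ratio of hashing cost to analysis cost.

```python
def break_even_duplication(hash_cost, analysis_cost):
    """Duplication fraction at which the two approaches cost the same."""
    return hash_cost / analysis_cost


def two_phase_wins(dup_fraction, hash_cost, analysis_cost):
    """True if two-phase scanning is cheaper under the simple model."""
    one_phase = analysis_cost
    two_phase = hash_cost + (1.0 - dup_fraction) * analysis_cost
    return two_phase < one_phase
```

Under this model, a 3% break-even point corresponds to hashing that costs about 3% of the analysis work per byte. The cost values used to exercise the functions are placeholders, not measurements.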
Duplication in enterprise data • Examined two data sources: • 100 user home directories from a central server • 12 user productivity machines • In both datasets, saw ~10% duplication • Even more with system files, email servers, sharepoints, etc. • This is sufficient duplication for work reduction (Diagram labels: Data set; Hash; 1; 2+; = 4/7 duplication)
Global scheduling policies • Traditional • One-phase scanning, scan all copies • Rand • Two-phase scanning, random scheduling • BestPlace • Two-phase scanning, greedy scheduling • BestPlaceTime • Two-phase scanning, greedy scheduling + work ahead • Opt • Unreplicated data only, delayed + work ahead
Client Metrics • Total Work • Total elapsed time spent on analysis and hashing • Client Impact • Time spent that exceeded client idle time (Diagram labels: total work, idle time, client impact)
Client Metrics • Metrics calculated for each day • Summed over the entire simulation period (Diagram labels: total work, idle time, client impact)
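Both metrics follow directly from the definitions above and can be computed per client from daily totals. The function name and the list-based inputs are assumptions made for this sketch.

```python
def client_metrics(daily_work, daily_idle):
    """Compute (total work, client impact) for one client.

    daily_work: elapsed time spent on analysis and hashing, one entry per day
    daily_idle: idle time available on the client, one entry per day
    """
    total_work = sum(daily_work)
    # Impact counts only the part of each day's work that exceeded idle time.
    client_impact = sum(max(0.0, w - i) for w, i in zip(daily_work, daily_idle))
    return total_work, client_impact
```

Because impact is clipped at zero for each day, unused idle time on a quiet day does not offset an overrun on a busy one.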
Experimental setup • Implemented a simulator to test a variety of machine configurations and scheduling policies • Config: 50 high priority blades, 50 low priority laptops • Blades were modeled after: • Dual-core 2.8GHz P4 Xeon, 4GB RAM, 70GB RAID 1 • Laptops were modeled after: • 2GHz Pentium M, 1.5GB RAM, 60GB SATA • Simulated 30 days • Daily creation rates and layouts from traced workloads • Freshness of 3 days, scheduling period of 1 day
Total work • Prefers faster blade machines over laptops, increasing their total work to reduce client impact • Removes duplicate work, reducing the total work done • Doing work ahead of the freshness delay means analyzing files that would have been deleted
Client impact • 40% improvement • By doing work ahead of the freshness deadline, SCAN-Lite takes better advantage of idle time • Choosing the best place helps hit the idle time targets, reducing average client impact • Less work means less impact • Theoretical OPT only 8% better than BestPlaceTime
Summary • Reducing local scanning interference is critical • 17% - 60% improvement from reduced contention • Two-phase scanning reduces analysis overheads • Reduce total work to near single-copy costs • Reduced client impact by up to 40% on our workload
Future work • This is an initial system for reducing analysis costs • Many improvements remain! • Vary freshness delays • Different applications may have different requirements • Provide freshness and scan priorities to clients • Could prioritize scan order to not exceed client idle times • Try more workloads • May need better bin packing algorithms
Summary • Ever increasing number of analyses in the enterprise • Search, provenance, trending, clustering, classification, etc. • Local scheduling to reduce resource contention on clients • Up to 60% performance improvement • Two-phase scanning to reduce work and balance load • Delay analysis work to identify duplicate work • Global scheduling to balance load • Reduced client impact by up to 40% on our workload