SCAN-Lite: Enterprise-wide analysis on the cheap Craig Soules, Kimberly Keeton, Brad Morrey
Enterprise information management • Search • Clustering • Provenance • Classification • IT Trending • Virus scanning [Figure: Metadata Server]
Enterprise information management • Data is duplicated across machines! • Duplicate analysis is wasted work [Figure: Metadata Server]
Issues • Analysis programs conflict on clients • Contend for system resources (memory, disk) • Clients repeat work • Duplicate files on multiple clients • Client foreground workloads are impacted • Work exceeds available idle time on busy clients
Approaches • Reduce resource contention [Figure: Client]
Approaches • Avoid duplicate work [Figure: Clients]
Approaches • Leverage duplication to balance client load • Delay analysis to identify all duplicates [Figure: Clients and Global Scheduler]
Solutions • Local scheduler • Coordinates analyses to reduce resource contention • Up to 60% improvement • Global scheduler • Identifies duplicates to remove work • Balances load • 40% reduction in impact to foreground tasks
Local scheduling • Traditionally, analyses are separate programs • Scheduling left to the operating system • Potentially at different times • Each program identifies files to scan • Each program opens and reads file data [Figure: Analysis Programs reading from Disk]
Unified local scheduling • Each analysis routine is a separate thread • Control thread manages shared tasks • Identify files to scan, and open/read file data • Shared memory buffer distributes file data (sketched below) [Figure: Control Thread feeding Analysis Plugins from Disk via Shared Memory]
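To make the control-thread design concrete, here is a minimal Python sketch of unified local scheduling. It is not the SCAN-Lite implementation: `make_worker`, `unified_scan`, the analysis callables, and the queue sizes are all illustrative, and bounded per-plugin queues stand in for the shared memory buffer.

```python
import queue
import threading

def make_worker(analyze):
    """Return a thread body that applies one analysis routine to each file."""
    def run(work_queue):
        while True:
            item = work_queue.get()
            if item is None:              # sentinel: no more files
                break
            path, data = item
            analyze(path, data)           # e.g. index, classify, virus-scan
    return run

def unified_scan(paths, analyses):
    """Control thread: read each file once, fan the bytes out to every plugin."""
    queues = [queue.Queue(maxsize=8) for _ in analyses]   # bounded = backpressure
    threads = [threading.Thread(target=make_worker(a), args=(q,))
               for a, q in zip(analyses, queues)]
    for t in threads:
        t.start()
    for path in paths:
        with open(path, 'rb') as f:       # one open/read, shared by all plugins
            data = f.read()
        for q in queues:
            q.put((path, data))
    for q in queues:                      # shut the workers down cleanly
        q.put(None)
    for t in threads:
        t.join()
```

Reading each file once and distributing the same buffer is what removes the duplicated open/read work of running seven separate scanners.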
Local scheduling performance • Ran a fitness test using 7 analysis routines • 42 data sets, each containing files of a fixed size • Ran both approaches over each data set • Calculated per-file elapsed scan time • Dual-core 2.8GHz P4 Xeon, 4GB RAM, 70GB RAID 1 • Seven-at-once • Run each analysis routine separately at the same time • Unified • SCAN-Lite’s unified local scheduling approach
Elapsed time vs. CPU time • Original fitness test used CPU time • Gave less variable performance curves for modeling • Disk contention shows up in elapsed time • CPU time is multiplexed; elapsed time is not (illustrated below) [Figure: App 1/App 2 timelines contrasting sum of CPU times, sum of elapsed times, and max of elapsed times; one estimate is flagged "This is very bad"]
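A small illustration (ours, not the talk's) of why the metric choice matters, using Python's standard timers: `time.process_time()` counts only CPU cycles charged to the process, so time a thread spends blocked on disk is invisible to it, while `time.perf_counter()` measures wall-clock elapsed time and therefore captures I/O waits and contention.

```python
import time

start_cpu, start_wall = time.process_time(), time.perf_counter()
with open('/etc/hosts', 'rb') as f:   # any I/O-bound work; the path is illustrative
    f.read()
cpu = time.process_time() - start_cpu
wall = time.perf_counter() - start_wall
print(f"CPU time: {cpu:.6f}s  elapsed: {wall:.6f}s")  # elapsed >= CPU when the disk is busy
```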
Local scheduling results • 17%–60% improvement • Seven-at-once benefits from deep disk queues, but this hurts foreground apps • Small random I/Os have worse interaction than larger ones
Global scheduler • Two goals: • Reduce additional work from duplicate files • Utilize duplication to schedule work to the “best” client • Two-phase scanning (sketched below) • Phase one: identify duplicate files using content hashing • Phase two: analyze one copy at the appropriate client • Delaying between phases one and two gives time for additional duplicates to appear and for files to be deleted
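A minimal sketch of phase one, assuming clients report (client, path, digest) tuples to the metadata server; `sha1_of` and `group_duplicates` are illustrative names, not the paper's API.

```python
import hashlib
from collections import defaultdict

def sha1_of(path, chunk=1 << 20):
    """Content hash of a file, read in 1 MB chunks (computed on the client)."""
    h = hashlib.sha1()
    with open(path, 'rb') as f:
        while block := f.read(chunk):
            h.update(block)
    return h.hexdigest()

def group_duplicates(reports):
    """reports: iterable of (client, path, digest).
    Returns {digest: [(client, path), ...]}; each group needs one scan in phase two."""
    groups = defaultdict(list)
    for client, path, digest in reports:
        groups[digest].append((client, path))
    return groups
```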
Traditional scanning [Figure: Clients and Server]
Phase one: Duplicate detection [Figure: Clients and Server]
Phase two: Scheduling [Figure: Clients and Server]
When to schedule • Clients upload hashes each scheduling period • The freshness specifies a deadline by which new data must be analyzed (see the sketch below) [Figure: timeline of scheduling periods; scheduling early gives one placement option, scheduling later gives three options, and work must be scheduled before the period containing the deadline]
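A toy model (assumed integer time units, not the paper's notation) of how freshness and the scheduling period interact: a file whose hash arrives at time `seen` must be analyzed by `seen + freshness`, and the scheduler runs every `period`. Choosing a later run before the deadline lets more duplicates accumulate, giving the scheduler more placement options.

```python
def candidate_periods(seen, freshness, period):
    """Scheduler-run times at which this file may still be scheduled."""
    deadline = seen + freshness
    first = (seen // period + 1) * period     # next scheduler run after the upload
    return list(range(first, deadline + 1, period))

print(candidate_periods(seen=0, freshness=3, period=1))  # -> [1, 2, 3]
```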
How to schedule • Scheduling is a bin packing problem • Files are balls, clients are bins • Size of bins is available idle time • Color of balls/bins equates to location of duplicates • Size of balls is time required for analysis [Figure: files A–D packed into clients' bins, sized by idle time]
How to schedule • We use a greedy heuristic for scheduling (sketched below) • Consider idle time and machine priorities • See paper for details [Figure: files A–D assigned to clients by idle time]
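The paper has the authoritative heuristic; the sketch below only captures the core greedy idea under simplifying assumptions: a file may be placed only on a client that already holds a copy, files are packed largest-first, and we prefer the holder with the most idle time remaining. The machine priorities the real scheduler also weighs are omitted.

```python
def greedy_schedule(files, idle_time):
    """files: [(digest, cost_seconds, holders)]; idle_time: {client: seconds}.
    Returns {digest: client}; idle_time budgets are decremented in place."""
    placement = {}
    for digest, cost, holders in sorted(files, key=lambda f: -f[1]):  # largest first
        best = max(holders, key=lambda c: idle_time[c])  # only clients with a copy
        if idle_time[best] >= cost:
            placement[digest] = best
            idle_time[best] -= cost
        # else: the file exceeds every holder's idle time and will impact a client
    return placement
```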
Work ahead • Start by scheduling all work that meets freshness • Schedule additional work on still-idle machines • Any remaining idle time can be used for additional work • We refer to this as work ahead (sketched below) [Figure: files A–D filling clients' remaining idle time]
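Work ahead then falls out naturally: run the greedy pass first on files whose freshness deadline falls in this period, then again on not-yet-due files against whatever idle time remains. This reuses the hypothetical `greedy_schedule` above, which deliberately mutates the idle-time budgets in place.

```python
def schedule_with_work_ahead(due_files, future_files, idle_time):
    placement = greedy_schedule(due_files, idle_time)           # must-do work first
    placement.update(greedy_schedule(future_files, idle_time))  # fill leftover idle
    return placement
```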
Two-phase scanning: Trade-offs [Figure: clients compared by one-phase cost vs. two-phase cost]
Two-phase scanning: Trade-offs • If the cost of hashing exceeds the work saved on duplicate files, then one-phase scanning is better • Analysis of hashing costs using SHA-1 indicates that ~3% data duplication is the break-even point (see the arithmetic below) • Do we see that in practice?
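In simplified per-byte form (the symbols are ours, not the paper's): with hashing cost $C_h$, analysis cost $C_a$, and duplicate fraction $d$, two-phase scanning pays $C_h + (1-d)\,C_a$ per byte against one-phase's $C_a$, so it wins when

$$C_h + (1 - d)\,C_a \;<\; C_a \quad\Longleftrightarrow\quad d \;>\; \frac{C_h}{C_a}$$

and the SHA-1 measurements put $C_h / C_a$ at roughly 3%, the break-even duplication quoted above.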
Duplication in enterprise data • Examined two data sources: • 100 user home directories from a central server • 12 user productivity machines • In both datasets, saw ~10% duplication • Even more with system files, email servers, sharepoints, etc. • This is sufficient duplication for work reduction [Figure: per-data-set counts of hashes seen once vs. 2+ times; example showing 4/7 duplication]
Global scheduling policies • Traditional • One-phase scanning, scan all copies • Rand • Two-phase scanning, random scheduling • BestPlace • Two-phase scanning, greedy scheduling • BestPlaceTime • Two-phase scanning, greedy scheduling + work ahead • Opt • Unreplicated data only, delayed + work ahead
Client Metrics • Total Work • Total elapsed time spent on analysis and hashing • Client Impact • Time spent that exceeded client idle time [Figure: a client's day showing Idle Time, Total Work, and Client Impact]
Client Metrics • Metrics calculated for each day • Summed over the entire simulation period (sketched below)
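A small sketch of both metrics, assuming per-day (work, idle) samples for one client; the name `client_metrics` and the units are illustrative. Impact is clipped at zero per day, so spare idle time on one day cannot offset an overrun on another.

```python
def client_metrics(days):
    """days: [(work_seconds, idle_seconds)], one entry per simulated day."""
    total_work = sum(work for work, _ in days)                       # analysis + hashing
    client_impact = sum(max(0, work - idle) for work, idle in days)  # overrun only
    return total_work, client_impact

print(client_metrics([(100, 120), (200, 150)]))  # -> (300, 50)
```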
Experimental setup • Implemented a simulator to test a variety of machine configurations and scheduling policies • Config: 50 high-priority blades, 50 low-priority laptops • Blades were modeled after: • Dual-core 2.8GHz P4 Xeon, 4GB RAM, 70GB RAID 1 • Laptops were modeled after: • 2GHz Pentium M, 1.5GB RAM, 60GB SATA • Simulated 30 days • Daily creation rates and layouts from traced workloads • Freshness of 3 days, scheduling period of 1 day
Total work • Prefers faster blade machines over laptops, increasing their total work to reduce client impact • Removes duplicate work, reducing the total work done • Doing work ahead of the freshness delay means analyzing files that would have been deleted
Client impact • 40% improvement • By doing work ahead of the freshness deadline, SCAN-Lite takes better advantage of idle time • Choosing the best place helps hit the idle time targets, reducing average client impact • Less work means less impact • Theoretical OPT only 8% better than BestPlaceTime
Summary • Reducing local scanning interference is critical • 17%–60% improvement from reduced contention • Two-phase scanning reduces analysis overheads • Reduces total work to near single-copy costs • Reduces client impact by up to 40% on our workload
Future work • This is an initial system for reducing analysis costs • Many improvements remain! • Vary freshness delays • Different applications may have different requirements • Provide freshness and scan priorities to clients • Could prioritize scan order to not exceed client idle times • Try more workloads • May need better bin packing algorithms
Summary • Ever-increasing number of analyses in the enterprise • Search, provenance, trending, clustering, classification, etc. • Local scheduling to reduce resource contention on clients • Up to 60% performance improvement • Two-phase scanning to reduce work and balance load • Delay analysis to identify duplicates • Global scheduling to balance load • Reduced client impact by up to 40% on our workload