Towards Pre-Deployment Detection of Performance Failures in Cloud Systems
Riza Suminto, Agung Laksono*, Anang Satria*, Thanh Do†, Haryadi Gunawi

SPV @ HotCloud ’15 Cloud Systems
Demands
• Users demand high dependability, reliability, and performance stability
• Amazon found that every 100 ms of latency cost them 1% in sales
• Google found that an extra 0.5 seconds in search page generation time dropped traffic by 20%
Speed matters!

Performance failures happen
[Chart: 22%]
Source: What Bugs Live in the Cloud? A Study of 3000+ Issues in Cloud Systems, SOCC ’14

Outline
• Performance Bug
• System Performance Verifier

Performance Bug
• Jobs take several times longer than usual to finish
• Improper speculative execution
  JCH1 & TPL1 & FPL2 & FTY1
• Unnecessary repeated recovery
  TPL1 & TPL4 & FTY4 & TOP1

Untriggered SpecExec
• Map read locally (DLCA)
• Mappers and reducers in different nodes (TPLA)
• All-to-all (JCHA)
• Fault at map node (FPLA)
• Slow NIC (FTYA)
[Diagram: mappers M1, M2, M3 each feed every reducer; M2 is slow, so all reducers are slow]
No straggler = no SpecExec
DLCA & TPLA & JCHA & FPLA & FTYA

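The "no straggler = no SpecExec" point can be illustrated with a small Java sketch. It assumes a deliberately simplified rule (a task counts as a straggler when its progress rate falls below half the mean rate of its peers). The class name, the threshold, and the rates are hypothetical, and this is not Hadoop's actual speculation logic.

```java
import java.util.ArrayList;
import java.util.List;

// Simplified, hypothetical straggler check; not Hadoop's real speculation code.
final class StragglerSketch {
    // Assumed threshold: a task is a straggler if its progress rate is
    // below this fraction of the mean rate over all tasks.
    static final double SLOW_FRACTION = 0.5;

    // Returns the indices of tasks that look slow relative to their peers.
    static List<Integer> findStragglers(double[] rates) {
        double sum = 0.0;
        for (double r : rates) {
            sum += r;
        }
        double mean = rates.length == 0 ? 0.0 : sum / rates.length;
        List<Integer> stragglers = new ArrayList<>();
        for (int i = 0; i < rates.length; i++) {
            if (rates[i] < SLOW_FRACTION * mean) {
                stragglers.add(i);
            }
        }
        return stragglers;
    }

    public static void main(String[] args) {
        // One slow reducer among healthy peers: it stands out.
        System.out.println(findStragglers(new double[] {10.0, 10.0, 1.0})); // prints [2]
        // Every reducer slowed equally by the same slow map node: nobody stands out.
        System.out.println(findStragglers(new double[] {1.0, 1.0, 1.0}));   // prints []
    }
}
```

Because the check is relative, a fault that slows every reducer by the same amount leaves the list empty, so no speculative copy would be launched.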
Untriggered SpecExec, cont.
• DLCA & TPLA & JCHA & FPLA & FTYA
• DLCB = read remote
[Diagram: mappers M1, M2, M3 read from a remote DN; M2 is a straggler]

Untriggered SpecExec, cont.
• DLCA & TPLA & JCHA & FPLA & FTYA
• FPLB = slow reducer
[Diagram: mappers M1, M2, M3 and reducers; a straggler is present]

O(n) Recovery
• Mappers and reducers in different nodes (TPLA)
• Mappers and reducers in different racks (TPLB)
• Large number of nodes per rack (TOPA)
• Slow inter-rack switch (FTYB)
[Diagram: mappers (M) and a reducer (R) spread across Rack 1 and Rack 2; the link between the racks is slow]
TPLA & TPLB & TOPA & FTYB

Conditions that lead to performance bugs
• Untriggered Speculative Execution
  • MR-70001 = JCH1 & TPL1 & FPL2 & FTY1
  • MR-70002 = DSR1 & DLC1 & FPL1 & FTY1
  • MR-5533 = FTY2 & FPL3 & TPL3
  • …
• O(n) Recovery
  • MR-5251 = FTY3 & FPL3 & FTM1
  • MR-5060 = TPL1 & TPL3 & FTY1 & FPL2
  • MR-1800 = TPL1 & TPL4 & FTY4 & TOP1
  • …
• Long lock contention
  • MR-9191 = FTY3 & FPL3 & FTM1
  • MR-9292 = TPL1 & TPL3 & FTY1 & FPL2
  • MR-9393 = TPL1 & TPL4 & FTY4 & TOP1
  • …

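Each entry above reads as a conjunction: the bug shows up only when all of its conditions hold at once. A minimal Java sketch of that reading follows. The tags are treated as opaque labels copied from the slide, and the class and method names are hypothetical.

```java
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

// Hypothetical sketch: a bug signature is a set of condition tags that must all hold.
final class ConditionSketch {
    // Three signatures copied from the slide; the tags are opaque labels here.
    static final Map<String, Set<String>> SIGNATURES = Map.of(
        "MR-70001", Set.of("JCH1", "TPL1", "FPL2", "FTY1"),
        "MR-5533", Set.of("FTY2", "FPL3", "TPL3"),
        "MR-1800", Set.of("TPL1", "TPL4", "FTY4", "TOP1"));

    // Returns the bugs whose every condition is present in the scenario.
    static Set<String> triggered(Set<String> scenario) {
        Set<String> hits = new TreeSet<>();
        for (Map.Entry<String, Set<String>> entry : SIGNATURES.entrySet()) {
            if (scenario.containsAll(entry.getValue())) {
                hits.add(entry.getKey());
            }
        }
        return hits;
    }

    public static void main(String[] args) {
        // All four MR-1800 conditions hold, plus one unrelated condition.
        System.out.println(triggered(Set.of("TPL1", "TPL4", "FTY4", "TOP1", "FPL3"))); // prints [MR-1800]
        // One condition missing: the conjunction fails, nothing is triggered.
        System.out.println(triggered(Set.of("TPL1", "TPL4", "FTY4"))); // prints []
    }
}
```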
Outline
• Performance Bug
• System Performance Verifier

Current Approach
• Benchmarking
  • Hundreds of benchmarks for every scenario
  • Injecting slowdowns and failures
  • Takes days to weeks!

What we want…
• Four goals in performance verification
  • Fast
  • Covers many deployment scenarios
  • Runs in pre-deployment
  • Directly checks implementation code
• Formal modeling tools!

System Performance Verifier (SPV)
• Target system (e.g., Hadoop code), with annotations:
    @Data
    public class JobInProgress {
      JobID jobId;
      TaskInProgress maps[];
      ...
    }
    @IO
    public HeartbeatResponse heartbeat(HeartbeatData hd) {
      ...
    }
• SPV Compiler
• Auto-generated model (in Colored Petri Net), 20X larger than the hand model
• Performance Verification

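Colored Petri Nets (CPN)
[Diagram: place Tasks holds ("T1",map); place Node holds A @0; transition Schedule Task, with time delay @+10, reads arc variables task and node and emits assignment; place Task to Run receives (A,"T1",map) @10]
Code on the Schedule Task transition:
    input (node, task);
    output (assignment);
    action
      let val (id, type) = task in (node, id, type) end;
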
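The firing of Schedule Task can be mimicked in a few lines of Java. This is only a sketch of timed-token semantics under a stated assumption: the output token is stamped with the latest input timestamp plus the transition delay. The class names and the string encoding of token values are hypothetical and are not part of CPN Tools or SPV.

```java
// Hypothetical sketch of one timed CPN transition firing; not CPN Tools code.
final class ScheduleTaskSketch {
    // The "@+10" delay shown on the slide.
    static final long DELAY = 10;

    // A token is a value plus the model time at which it becomes available.
    static final class Token {
        final String value;
        final long time;

        Token(String value, long time) {
            this.value = value;
            this.time = time;
        }

        @Override
        public String toString() {
            return value + " @" + time;
        }
    }

    // Consumes a node token and a task token and produces an assignment token.
    // Assumption: firing happens once both inputs are available, and the
    // output is delayed by DELAY time units.
    static Token fire(Token node, Token task) {
        long firingTime = Math.max(node.time, task.time);
        return new Token("(" + node.value + "," + task.value + ")", firingTime + DELAY);
    }

    public static void main(String[] args) {
        Token node = new Token("A", 0);
        Token task = new Token("\"T1\",map", 0);
        System.out.println(fire(node, task)); // prints (A,"T1",map) @10
    }
}
```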
Challenges: Two Different Worlds
• CPN
• Java

Our Approach
• Java → SysJava
  • Data flattening
  • Code modularization
  • Annotation tagging
• SysJava → Model
  • Compiler

Data Flattening
• Java system states = ArrayList, Map, Tree, …
• CPN states = multisets
    List<JobInProgress> runningJobs;
    public class JobInProgress {
      JobID jobId;
      TaskInProgress maps[];
      ...
    }
    class TaskInProgress {
      TaskID id;
      double progress;
      ...
    }
• Flattened into multisets:
  • Job In Progress: [(1)]
  • Job Task Mapping: [(1,a),(1,b)]
  • Task In Progress: [(a,10%),(b,15%)]

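The flattening step can be sketched in plain Java as turning a nested object graph into three flat collections of tuples, one per CPN place. This is an illustration only: the simplified Job and Task classes, the string encoding of tuples, and the method names are hypothetical, and the SPV compiler's real output is a CPN model rather than Java lists.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch of data flattening; not the SPV compiler's actual output.
final class FlattenSketch {
    // Simplified stand-in for TaskInProgress.
    static final class Task {
        final String id;
        final int progressPct;

        Task(String id, int progressPct) {
            this.id = id;
            this.progressPct = progressPct;
        }
    }

    // Simplified stand-in for JobInProgress, which nests its tasks.
    static final class Job {
        final int id;
        final List<Task> maps;

        Job(int id, List<Task> maps) {
            this.id = id;
            this.maps = maps;
        }
    }

    // One flat collection of tuples per place on the slide.
    static final class Flat {
        final List<String> jobInProgress = new ArrayList<>();
        final List<String> jobTaskMapping = new ArrayList<>();
        final List<String> taskInProgress = new ArrayList<>();
    }

    // Walks the nested structure once and emits flat tuples linked by IDs.
    static Flat flatten(List<Job> runningJobs) {
        Flat flat = new Flat();
        for (Job job : runningJobs) {
            flat.jobInProgress.add("(" + job.id + ")");
            for (Task task : job.maps) {
                flat.jobTaskMapping.add("(" + job.id + "," + task.id + ")");
                flat.taskInProgress.add("(" + task.id + "," + task.progressPct + "%)");
            }
        }
        return flat;
    }

    public static void main(String[] args) {
        Job job = new Job(1, List.of(new Task("a", 10), new Task("b", 15)));
        Flat flat = flatten(List.of(job));
        System.out.println("Job In Progress:  " + flat.jobInProgress);
        System.out.println("Job Task Mapping: " + flat.jobTaskMapping);
        System.out.println("Task In Progress: " + flat.taskInProgress);
    }
}
```

The nesting is replaced by ID references: the job-to-task containment becomes the Job Task Mapping tuples, so no tuple holds another object.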
Code Modularization
Original function:
    private boolean processHeartbeat(TaskTrackerStatus trackerStats) {
      synchronized (taskTrackers) { ... }
      for (TaskStatus ts : trackerStats) {
        tasks.get(ts.id).updateStatus(ts);
      }
      ...
    }
Modular functions:
    @ProcessState
    private void initCheck() {
      synchronized (taskTrackers) { ... }
    }
Control flow logic:
    @ForEach
    private void updateStatuses(TaskTrackerStatus trackerStats) {
      for (TaskStatus ts : trackerStats) { ... }
    }
CRUD logic:
    @GetState
    private TaskInProgress getTask(TaskID id) {
      return tasks.get(id);
    }
    @UpdateState
    private void tipUpdate(TaskInProgress tip, TaskStatus ts) {
      tip.updateStatus(ts);
    }

Annotation Tagging
• Assists the compiler
• Annotation categories:
  • Data structure
  • I/O
  • CRUD & Process
  • Miscellaneous
    @Data
    public class JobInProgress {
      JobID jobId;
      TaskInProgress maps[];
      ...
    }
    @IO
    public HeartbeatResponse heartbeat(HeartbeatData hd) {
      ...
    }

Model Checking
• SPV Compiler → Executable XML
• Define configurations, assertions, and specifications
• Explore every non-deterministic choice
  • Task-to-node mapping
[Diagram: the Schedule Task net in two explored states, each with Tasks holding ("T1",map) and Node holding A and B. T1 on A: Task to Run receives (A,"T1",map). T1 on B: Task to Run receives (B,"T1",map).]

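Exploring every non-deterministic choice amounts to enumerating all outcomes of each choice point instead of sampling one. A minimal Java sketch for the task-to-node mapping follows. It is a plain recursive enumeration with hypothetical names, and it says nothing about how SPV or Access/CPN actually search the state space.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch: enumerate every task-to-node mapping a scheduler could pick.
final class ExploreSketch {
    // Returns one list per explored outcome, e.g. ["T1 on A"] and ["T1 on B"].
    static List<List<String>> allMappings(List<String> tasks, List<String> nodes) {
        List<List<String>> outcomes = new ArrayList<>();
        explore(tasks, nodes, new ArrayList<>(), outcomes);
        return outcomes;
    }

    // Depth-first walk: each task is a choice point with one branch per node.
    private static void explore(List<String> tasks, List<String> nodes,
                                List<String> partial, List<List<String>> outcomes) {
        if (partial.size() == tasks.size()) {
            outcomes.add(new ArrayList<>(partial));
            return;
        }
        String task = tasks.get(partial.size());
        for (String node : nodes) {
            partial.add(task + " on " + node);
            explore(tasks, nodes, partial, outcomes);
            partial.remove(partial.size() - 1); // backtrack to try the next node
        }
    }

    public static void main(String[] args) {
        // The slide's example: one task, two candidate nodes.
        System.out.println(allMappings(List.of("T1"), List.of("A", "B"))); // prints [[T1 on A], [T1 on B]]
    }
}
```

The number of outcomes grows as nodes raised to the power of tasks, which is why an assertion has to be checked on every branch rather than on a single run.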
Preliminary Results
• 5305 lines of code on top of WALA & Access/CPN
• Hadoop MapReduce 1.2.1, with 1067 lines of code changed
• 20x larger than the hand-made model
• 34 scenarios, 30 assertion violations, 4 performance bugs
• 1.5 hours of model checking

Thank you! Questions?
http://ucare.cs.uchicago.edu

Discussion
• Is it time for pre-deployment detection of performance bugs?
• Bridging system code and formal methods
• Future of data-centric languages
• Beyond Hadoop
• Root cause anatomy of performance bugs
• Beyond performance bugs