
Control Theory in Log Processing Systems

This presentation discusses the application of feedback control theory to log processing systems. It covers controlling queue length, load balancing, system reliability, and lessons learned, with a focus on keeping system performance stable in the face of disturbances. It also covers the challenges of applying control theory to software systems and suggests potential solutions and future directions.



Presentation Transcript


  1. Control Theory in Log Processing Systems • Wei Xu (xuw@cs.berkeley.edu), UC Berkeley • Joseph L. Hellerstein, IBM T.J. Watson Research Center

  2. Outline • Data streams and log processing • Applying control theory • Controlling queue length • Load balancing • Lessons learned

  3. Introduction • Goal of our project • A tool • A testbed • Problem: data rate up to 1 TB a day • Distributed infrastructure • How to make the infrastructure itself reliable?

  4. Examples of system log data • Request data • Apache logs, etc. • Performance data • CPU, memory, etc. • Failure data • Detected problems / error messages • Reports from operators

  5. The big picture • [Figure: pipeline diagram; labels: Production System, Data Collection, raw log data, Preprocessing, Sanitized Data, Repository, Automatic analysis, Failure Detection]

  6. Preprocessing • Sanitize the data • Put logs into common format • Merge information from various sources • Sampling • Needs to be fast

  7. Stream processing • Log data are data streams • Preprocessing tasks are continuous queries • Telegraph Continuous Query (TCQ) • SQL queries • Adaptive: execution optimized on the fly • Performance does not depend on the number of queries • [Figure labels: SLT, query Q]

  8. Data preprocessing architecture • [Figure: architecture diagram showing numbered tuples flowing through the system; labels: load splitter, TCQ nodes running query Q, TCQ node running query R, combiner, SLT 1, SLT 2, Intra-Event Processing, Inter-Event Processing]

  9. Problem: performance disturbance • CPU contention • Maintenance tasks • Packet drops • Other failures • Selectivity changes

  10. The result of disturbance • [Plot: end-to-end response time (ms) vs. time (seconds)]

  11. Solution – Control Theory • Treat this as a failure? • Not necessary and too expensive • Feedback control theory as a first-tier defense mechanism • Try to keep the system stable at least for some time • If that does not work, try recovery

  12. Outline • Data streams and log processing • Applying control theory • Controlling queue length • Load balancing • Lessons learned

  13. The problem • [Figure labels: Source, Buffer, TCQ, Result Q]

  14. Why does this happen? • TCQ has a complex internal structure • TCQ drops tuples silently if the result queue is full • Back pressure is not possible • [Figure labels: Controlled Data Source, Input Buffer, TCQ]

  15. Control Problems • Goal? • No dropped tuples • What to control? • The result queue length • The knob? • Input data rate to the TCQ node

  16. Control block diagram • Target system (system identification) • Control law: $u(k) = u(k-1) + (K_P + K_I)\,e(k) - K_P\,e(k-1)$ • $u(k)$: data rate in next interval • $u(k-1)$: data rate in last interval • $e(k)$: error • $e(k-1)$: last error
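The control law on slide 16 is a PI controller written in incremental form. The sketch below applies it to a toy queue to show how the input data rate could be adjusted to hold the result queue near a target length. Everything except the control law itself is an assumption made for illustration: the queue model, the gains `kp` and `ki`, the target, the capacity, and the simulated drop in service rate (standing in for CPU contention) are hypothetical and are not taken from the presentation.

```python
from dataclasses import dataclass


@dataclass
class PIController:
    """Incremental PI law: u(k) = u(k-1) + (Kp + Ki) e(k) - Kp e(k-1)."""
    kp: float
    ki: float
    u_prev: float = 0.0  # data rate chosen in the last interval
    e_prev: float = 0.0  # error observed in the last interval

    def step(self, error: float) -> float:
        u = self.u_prev + (self.kp + self.ki) * error - self.kp * self.e_prev
        u = max(0.0, u)  # a data rate cannot be negative
        self.u_prev, self.e_prev = u, error
        return u


def simulate(steps: int = 200, target: float = 50.0, capacity: float = 100.0):
    """Toy plant (assumed): queue grows by the input rate, shrinks by the service rate."""
    ctrl = PIController(kp=0.5, ki=0.1)  # hypothetical gains
    queue, dropped, history = 0.0, 0.0, []
    for k in range(steps):
        # Hypothetical disturbance: service rate halves at step 100.
        service = 20.0 if k < 100 else 10.0
        rate = ctrl.step(target - queue)
        queue = max(0.0, queue + rate - service)
        if queue > capacity:  # tuples beyond capacity would be dropped
            dropped += queue - capacity
            queue = capacity
        history.append(queue)
    return history, dropped


if __name__ == "__main__":
    history, dropped = simulate()
    print(f"queue before disturbance: {history[99]:.2f}")
    print(f"queue at end: {history[-1]:.2f}, dropped: {dropped}")
```

In this toy setting the queue returns to the target after the service rate changes, because the integral term keeps adjusting the input rate until the error is zero. A real TCQ node would need its own model from system identification before gains could be chosen.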

  17. Result – Under CPU Contention • [Figure labels: Source, Buffer, TCQ, Result Q]

  18. Why useful? • Original system • Input data rate => tuples dropped vs. not dropped • New system • Input data rate => response time • Makes the system ready for load balancing

  19. Outline • System log as data streams • Applying control theory • Controlling queue length • Load balancing • Lessons learned

  20. The problem • Barrier in system • Different response times • End to end response time matches the slower node

  21. The control problem • Goal? • Make the response times equal • What to control? • Response time on each node • The knob? • Tuples assigned to each node • What to monitor? • Queue length vs. response time
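A minimal sketch of the load-balancing problem on slide 21, assuming two nodes and a simple integral controller that shifts the share of tuples toward the faster node until the response times match. The response-time model (load divided by node capacity), the capacities, the gain `ki`, and all function names are hypothetical; the presentation's actual controller is in the block diagram on slide 23, which did not survive in this transcript.

```python
def response_times(share_a: float, total_rate: float,
                   cap_a: float, cap_b: float) -> tuple[float, float]:
    """Toy model (assumed): response time grows linearly with assigned load."""
    r_a = share_a * total_rate / cap_a
    r_b = (1.0 - share_a) * total_rate / cap_b
    return r_a, r_b


def balance(steps: int = 50, total_rate: float = 100.0,
            cap_a: float = 100.0, cap_b: float = 50.0,
            ki: float = 0.1) -> tuple[float, float, float]:
    """Integral control on the split: move tuples toward the faster node."""
    share_a = 0.5  # start with an even split
    for _ in range(steps):
        r_a, r_b = response_times(share_a, total_rate, cap_a, cap_b)
        # Error is the response-time gap; a positive gap means node B is slower,
        # so node A should receive a larger share.
        share_a += ki * (r_b - r_a)
        share_a = min(1.0, max(0.0, share_a))  # the share is a fraction
    r_a, r_b = response_times(share_a, total_rate, cap_a, cap_b)
    return share_a, r_a, r_b


if __name__ == "__main__":
    share_a, r_a, r_b = balance()
    print(f"share to node A: {share_a:.3f}, r_a={r_a:.3f}, r_b={r_b:.3f}")
```

With these assumed capacities, node A is twice as fast as node B, so the split settles near two thirds of the tuples going to node A. Because the end-to-end response time follows the slower node, equalizing the two response times is what removes the barrier effect described on slide 20.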

  22. System with control • [Figure label: Response time]

  23. Control block diagram

  24. Result • [Plot: end-to-end response time (ms) vs. time (seconds)]

  25. Outline • System log as data streams • Applying control theory • Controlling queue length • Load balancing • Lessons learned

  26. Advantages of control theory • Performance can be analyzed • Stability • Accuracy • Settling time • Overshoot
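As one illustration of how stability and settling time can be analyzed, the sketch below computes closed-loop poles for an assumed toy model: a queue modeled as a pure integrator, q(k+1) = q(k) + u(k), driven by the incremental PI law u(k) = u(k-1) + (Kp + Ki) e(k) - Kp e(k-1). Under that assumption the characteristic polynomial is z^2 + (Kp + Ki - 2) z + (1 - Kp). The plant model, the gains, and the function names are hypothetical. The settling-time figure uses the common discrete-time approximation of about -4 / ln(r) steps, where r is the largest pole magnitude.

```python
import cmath
import math


def closed_loop_poles(kp: float, ki: float) -> tuple[complex, complex]:
    """Roots of z^2 + (kp + ki - 2) z + (1 - kp) for the assumed integrator plant."""
    b = kp + ki - 2.0
    c = 1.0 - kp
    disc = cmath.sqrt(b * b - 4.0 * c)
    return (-b + disc) / 2.0, (-b - disc) / 2.0


def analyze(kp: float, ki: float) -> tuple[bool, float, float]:
    """Return (stable, largest pole magnitude, approximate settling time in steps)."""
    radius = max(abs(p) for p in closed_loop_poles(kp, ki))
    stable = radius < 1.0  # all poles strictly inside the unit circle
    if not stable:
        settling = math.inf
    elif radius == 0.0:
        settling = 0.0
    else:
        settling = -4.0 / math.log(radius)
    return stable, radius, settling


if __name__ == "__main__":
    for kp, ki in [(0.5, 0.1), (2.5, 0.1)]:
        stable, radius, settling = analyze(kp, ki)
        print(f"kp={kp}, ki={ki}: stable={stable}, "
              f"radius={radius:.3f}, settling~{settling:.1f} steps")
```

The point of the example is that these properties follow from the model and the gains before the controller is ever run, which is the kind of analysis the slide refers to.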

  27. Other advantages • Simple implementation • Encourages good system design • Modeling the system • Treat system as black box • First defense mechanism against disturbances in the system

  28. Limitations • Not all software systems are designed to be controlled • Finite input produces unbounded output • E.g. join in TCQ • Useful state not measurable • Queuing theory helps, but other good theory is lacking • Many binary variables • Failed vs. working correctly

  29. Other Limitations • The model for the target system is complex • Lack of a reliable knob • E.g. changing the result queue length of TCQ sometimes crashes it • Over what range can you turn the knob? • How often can you turn it? • How long does the system take to respond? • Cannot find the cause of a problem

  30. Solution? • More advanced modeling and controller? • Adaptive control • Design controller-friendly systems? • A simple model • User configurable parameter -> knobs?

  31. Future Work • As a tool, real users? • Scheduling multiple streams • Dynamically scale up/down • Other control theory applications

  32. Backup Slides

  33. Future Work • Load balancer • Load control across multiple tiers • Scheduling of multiple streams

  34. System with control • [Figure labels: Controlled Data Source, Output Rate Controller, Queue Length Monitor]

  35. Result • [Figure labels: Source, Buffer, TCQ, Result Q]

  36. Conclusion • Advantages of feedback control • Makes the system more robust under disturbance • Allows more time for failure detection • Treats complex systems as black boxes • Copes with the system's characteristics instead of having to change them • Theoretical analysis • Implementation is easy • System statistics can also be used for SLT

  37. What is going on? • [Figure labels: Controlled Output Thread (code reuse), Desired Queue Length, Queue Length Controller, Data Rate to TCQ, Actual Queue Length]

  38. Theory meets reality • [Plot: queue length vs. time; label: Output Y from simulation]

  39. Tricky part of parameter estimation • Model evaluation: making the system operate in the desired range • Easy for data source, but queue length .. • [Plot: data rate vs. free space; labels: Free Space, Non-Linear range]

  40. Why do we need control? • The data source does not provide an accurate data rate

  41. Control Problems • Not accurate for various reasons • Scheduling • Time spent on I/O • Etc. • Providing an accurate data source using feedback control • By controlling the input of “desired rate”

  42. The Control Architecture • P controller (with precompensation): $u(k) = K_P\,e(k)$ • PI controller: $u(k) = u(k-1) + (K_P + K_I)\,e(k) - K_P\,e(k-1)$ • [Figure values: 1500, 1900, 1600]
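The two control laws on slide 42 can be compared on a toy data source. The sketch assumes a source whose actual output rate is a fixed fraction of the commanded rate, applied one interval later. The P controller scales its reference by a precompensation factor computed from a nominal gain, so it is accurate only while the real gain matches that nominal value. The PI controller needs no such factor. The plant model, the gains, the target rate, and the function names are hypothetical and are not taken from the presentation.

```python
def run_p_with_precompensation(target: float, plant_gain: float,
                               nominal_gain: float = 0.8, kp: float = 0.5,
                               steps: int = 100) -> float:
    """P law u(k) = Kp e(k), with the reference scaled by a precompensator."""
    # Precompensation factor chosen so the steady-state output equals the
    # target when the plant gain equals nominal_gain.
    n = (1.0 + nominal_gain * kp) / (nominal_gain * kp)
    y = 0.0
    for _ in range(steps):
        u = kp * (n * target - y)
        y = plant_gain * u  # toy plant (assumed): output is a fraction of the command
    return y


def run_pi(target: float, plant_gain: float,
           kp: float = 0.5, ki: float = 0.5, steps: int = 100) -> float:
    """PI law u(k) = u(k-1) + (Kp + Ki) e(k) - Kp e(k-1)."""
    y = u_prev = e_prev = 0.0
    for _ in range(steps):
        e = target - y
        u = u_prev + (kp + ki) * e - kp * e_prev
        u_prev, e_prev = u, e
        y = plant_gain * u
    return y


if __name__ == "__main__":
    target = 1000.0
    for gain in (0.8, 0.6):  # nominal gain, then a degraded source
        p_rate = run_p_with_precompensation(target, gain)
        pi_rate = run_pi(target, gain)
        print(f"plant gain {gain}: P+precomp -> {p_rate:.1f}, PI -> {pi_rate:.1f}")
```

In this sketch both controllers reach the target at the nominal gain, but only the PI controller still reaches it once the gain changes, since the precompensation factor is fixed in advance while the integral term adapts.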

  43. Result – An accurate data source • [Plot panels: P Controller with Pre-compensation, PI Controller]

  44. Zoom In • Many small disturbances in a Java program • Incremental garbage collection • [Plot panels: P Controller, PI Controller]
