“Good Enough” Service Model and Description
Randy Katz, Jeff Mogul, Giovanni Pacifici, John Sontag, Ion Stoica, Mark Verber
Common Tasks for Practitioners
• Capacity planning and deployment
• Detect failure and repair (monitor)
• Anticipate problems and prevent them (trend)
These tasks are done while:
• system components change
• the overall service architecture changes
Common (not best) Practices
• A “production service” is composed of a series of poorly understood components which provide one or more network services.
• Components often don’t scale well.
• There is little insight into the performance characteristics of these components.
• Architectural descriptions are incomplete. There is only a vague understanding of how components interact with other components.
  • Typically there is only an out-of-date Visio diagram of the relationships between service components running in-house-generated software.
  • There is often no understanding of other dependencies, such as name resolution.
  • The more time that passes, the less people understand the service.
• Insufficient investment in scaling / availability
  • Simple load balancing using a network device, without a full plan
  • Rendezvous and state not thought out: see the options on a NetScaler
• Staff often lack a systems background and understanding.
Common Problems
• Little insight into how the service scales
  • As a result the service is typically driven off the cliff repeatedly.
  • Hardware is added based on which machine seemed to fail first, which may or may not fix the problem.
• Little insight into which components depend on others
  • Often, changing one component has adverse effects on other parts of the service.
• Diagnosis is slow and hard.
• People are afraid to touch pieces.
What’s Good Enough
• A guide which would get within ±30% of the number and ratio of servers needed to handle a given load
• A definitive and accurate description of how components interact, which would aid monitoring and debugging
Greybox Model of a Service
The following would be needed for a model:
• Description of components
• Instrumentation of basic machine resources
• Instrumentation for component interfaces
• Description of how components interact in the overall service
• Traffic capture and replay facility
• Performance thumbnails (graphs) describing behavior under various loads
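As a concrete illustration, here is a minimal sketch of what a machine-readable component description might look like, in Python; the `Component` and `Interface` classes and their field names are assumptions, not part of the original model:

```python
from dataclasses import dataclass, field

@dataclass
class Interface:
    """One protocol interface a component exposes or consumes.
    Hypothetical structure; names are illustrative only."""
    protocol: str   # e.g. "http", "sqlnet"
    direction: str  # "serves" or "calls"

@dataclass
class Component:
    """Greybox description of one service component."""
    name: str
    interfaces: list[Interface] = field(default_factory=list)
    # Basic machine resources to instrument on each instance.
    resources: tuple[str, ...] = ("cpu", "memory", "disk", "network")

# Example: an app server that serves HTTP and calls the core database.
app1 = Component(
    name="app1",
    interfaces=[Interface("http", "serves"), Interface("sqlnet", "calls")],
)
```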
General Component Description
[Diagram: a generic component with protocol interfaces (request / receive) on its boundary and internal resources: CPU, memory, disk, network.]
Service Description Diagram
[Diagram: HTTP requests flow through App1 and a cache to a database over SQLNET/SQL; each component box (App1, Cache, Database) shows its internal CPU, memory, disk, and network resources.]
Instrumentation for Interfaces
• Each interface will minimally capture:
  • Simple logging of requests and responses with timestamps is sufficient.
• Recommended additions:
  • A global transaction ID which is passed through each interface, enabling path-based analysis
  • Integrated capture / replay à la RADlab’s liblog
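A minimal sketch of the kind of interface logging meant here, assuming an HTTP-style service; the header name `X-Txn-Id` and the `handle_request` wrapper are illustrative assumptions:

```python
import logging
import time
import uuid

log = logging.getLogger("interface")

def handle_request(headers: dict, body: bytes, downstream_call) -> bytes:
    # Reuse the caller's transaction ID, or mint one at the edge.
    txn_id = headers.get("X-Txn-Id") or uuid.uuid4().hex
    start = time.time()
    log.info("txn=%s event=request bytes=%d ts=%f", txn_id, len(body), start)

    # Propagate the same ID to every downstream interface so the
    # request's full path can be reassembled later.
    response = downstream_call(body, {"X-Txn-Id": txn_id})

    log.info("txn=%s event=response bytes=%d latency=%f",
             txn_id, len(response), time.time() - start)
    return response
```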
Traffic Capture and Replay Facility
• Ability to capture real-world traffic (traces)
• Ability to replay real-world traffic at specified rates
• Alternatively, a synthetic load generator
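A sketch of the replay side, assuming a trace stored as (timestamp, request) pairs; the trace format and `send` callback are assumptions:

```python
import time

def replay(trace, send, speedup: float = 1.0):
    """Replay (timestamp, request) pairs, preserving inter-arrival
    times scaled by `speedup` (2.0 = twice the original rate)."""
    if not trace:
        return
    t0 = trace[0][0]
    start = time.time()
    for ts, request in trace:
        # Sleep until this request's scheduled (scaled) offset.
        target = start + (ts - t0) / speedup
        delay = target - time.time()
        if delay > 0:
            time.sleep(delay)
        send(request)
```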
Basic Scaling Methodology
• Apply increasing load to an individual component, with the components it depends on being sufficiently responsive that internal resources are the bottleneck.
  • If internal resources aren’t consumed before the machine hits a bottleneck, investigate external components.
• Find where the component goes non-linear. Back that value off by 15% and call the component “rated” for that workload.
• Based on the number of requests components need to issue to fulfill a request, it is then straightforward to figure out how many of each sort of component will be required to service a specified number of requests.
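A back-of-the-envelope sketch of that last step, assuming each component has a rated throughput and a per-request fan-out to its dependencies; the component names and numbers are hypothetical:

```python
import math

# Rated requests/sec per machine from the load tests above
# (hypothetical numbers, 15% below the observed knee).
rated = {"app": 850, "cache": 4000, "db": 1200}

# Fan-out: requests each front-end request induces on each component.
fanout = {"app": 1.0, "cache": 3.0, "db": 0.5}

def machines_needed(load_rps: float) -> dict:
    """Machines per component for a given front-end load."""
    return {c: math.ceil(load_rps * fanout[c] / rated[c]) for c in rated}

print(machines_needed(10_000))
# {'app': 12, 'cache': 8, 'db': 5}
```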
Performance Thumbnails (1)
• Components as a transfer function
• FIXME: Insert graphs of resource starvation, deadlock, livelock
Performance Thumbnails (2)
• Create an “expected” overlay graph
  • X axis is number of requests
  • Y axis is:
    • work performed
    • resources used
    • number of requests to each downstream component
• Generate the same graphs in the production service
  • X axis is time
  • Y axis is <see above>
• Purpose
  • Gives the operations team a good feel for what a component should look like
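One way such an overlay could be checked automatically, assuming the rated thumbnail is stored as a request-rate → resource-usage table; the linear interpolation and the 30% flagging threshold are assumptions:

```python
def expected_usage(thumbnail, rps):
    """Linearly interpolate expected resource usage at a request rate.
    `thumbnail` is a sorted list of (rps, usage) points from load tests."""
    for (x0, y0), (x1, y1) in zip(thumbnail, thumbnail[1:]):
        if x0 <= rps <= x1:
            return y0 + (y1 - y0) * (rps - x0) / (x1 - x0)
    return thumbnail[-1][1]  # beyond the tested range: last known value

def flag_anomaly(thumbnail, rps, observed, tolerance=0.3):
    """Flag when production usage deviates >30% from the thumbnail."""
    expected = expected_usage(thumbnail, rps)
    return abs(observed - expected) > tolerance * expected

cpu_thumbnail = [(0, 0.0), (500, 0.35), (1000, 0.80)]  # hypothetical
print(flag_anomaly(cpu_thumbnail, 750, observed=0.95))  # True
```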
Service Description Markup Details
• Specify component relationships once
  • This should only need to be changed when the relationships between components change
  • But permit fault-isolated service units
• Building blocks
  • component(*) = wildcard, any component of this type
  • component(x) = variable substitution, within the same service unit
  • component(specific) = only the specified service unit
• Example (see the expansion sketch below):
  cache(*) appserverA(*) http
  appserverA(x) coredb(x) sqlnet
  appserverA(pod1) blobserver(pod1) http
  dbmonitor(ops) coredb(*) http
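A sketch of how the wildcard and variable forms might expand into concrete allowed flows, given a set of deployed service units; the rule format and unit names are assumptions drawn from the example above:

```python
import itertools

UNITS = ["pod1", "pod2"]  # hypothetical deployed service units

def expand(src, src_unit, dst, dst_unit, proto):
    """Yield concrete (src, dst, proto) flows for one markup rule."""
    for u1, u2 in itertools.product(UNITS, UNITS):
        # (*) matches any unit; (x) ties source and destination to
        # the same unit; a literal name matches only itself.
        def ok(pattern, unit):
            return pattern in ("*", "x") or pattern == unit
        if not (ok(src_unit, u1) and ok(dst_unit, u2)):
            continue
        if (src_unit == "x" or dst_unit == "x") and u1 != u2:
            continue
        yield (f"{src}({u1})", f"{dst}({u2})", proto)

for flow in expand("appserverA", "x", "coredb", "x", "sqlnet"):
    print(flow)
# ('appserverA(pod1)', 'coredb(pod1)', 'sqlnet')
# ('appserverA(pod2)', 'coredb(pod2)', 'sqlnet')
```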
Generate Configuration and Enforce the Service Model!
• Models typically don’t stay accurate
  • The model provides little benefit to the component developers, so updates often lag, just like most “documentation”
  • People who need the models typically have to infer them through path-based analysis or network-flow analysis
  • Inferred models typically miss things
  • It’s already broken
• Enforcement prevents drift!
  • On machines, via kernel facilities like ipfilter
  • In the network
  • Changes won’t work unless the model is updated
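A sketch of what enforcement could look like: turning the expanded flows into host packet-filter rules with a default deny. The rule syntax below is ipfilter-like but simplified, and the address map is a hypothetical input:

```python
# Hypothetical map from component instances to (address, port).
ADDRS = {
    "appserverA(pod1)": ("10.0.1.10", None),
    "coredb(pod1)": ("10.0.2.10", 1521),
}

def filter_rules(flows):
    """Emit pass rules for modeled flows; everything else is
    caught by a trailing default-deny rule."""
    rules = []
    for src, dst, _proto in flows:
        s_ip, _ = ADDRS[src]
        d_ip, d_port = ADDRS[dst]
        rules.append(
            f"pass in proto tcp from {s_ip} to {d_ip} port = {d_port}"
        )
    rules.append("block in all")  # anything not in the model is drift
    return rules

flows = [("appserverA(pod1)", "coredb(pod1)", "sqlnet")]
print("\n".join(filter_rules(flows)))
```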
Advanced Scaling
• As individual components are stressed, see how internal resources are consumed.
  • You might want to change hardware/software platforms if one resource is consumed before all others.
• If you increase the number of machines and don’t get a corresponding performance increase, then something other than internal resources is the bottleneck.
• Testing with all the components together might reveal emergent behavior.
Suggested Benefits
• “Hints” which guide machine deployment
  • Ratio of machines for various functions
  • Number needed for an anticipated load
• Provides insight to operators (a mental model)
  • Filters for alert management
  • May permit auto-tuned monitors
• Rough insight into bottlenecks
  • A machine addition that fails to improve performance points to a more serious problem
• Lowers the risk of breaking the service when releasing an updated component
• Can be used for a first pass at IDS and protection
Experiment Manager
• A system which would run each of the components through a load test to get initial scaling numbers
• Use the initial scaling numbers to prioritize which combinations of components should be tested together first
• Map the space generated by each of the component transfer functions to find an optimal configuration
  • Run for as long as time permits
• The ratio of component input to output requests is particularly interesting
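One possible prioritization sketch: test together first the pairs whose effective capacities are closest, on the (assumed, not from the slides) heuristic that the joint bottleneck is least predictable there. The numbers are hypothetical:

```python
import itertools

# Hypothetical per-component rated capacity normalized by fan-out,
# i.e. front-end requests/sec each component can sustain alone.
effective = {"app": 850, "cache": 1333, "db": 2400}

def test_order(components):
    """Order pairs for combined load tests: closest effective
    capacities first (a heuristic assumption)."""
    pairs = itertools.combinations(components, 2)
    return sorted(pairs, key=lambda p: abs(effective[p[0]] - effective[p[1]]))

print(test_order(effective))
# [('app', 'cache'), ('cache', 'db'), ('app', 'db')]
```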
Other Opportunities for Research
• Measurement methodology
  • Look at Margo’s work on micro-benchmarks
  • Peak-to-average workload
  • Logging / capture
• Crisp description of inter-dependence
  • Map of cliffs
• Exploration of making critical resources first-class items (locks, etc.)
• More natural good/bad for SLT
• How to capture state-oriented bindings