
Fault Detection

Sathish S. Vadhiyar. Source/Credits: From Referenced Papers.



  1. Fault Detection Sathish S. Vadhiyar Source/Credits: From Referenced Papers

  2. Introduction • Fine Grain Cycle Sharing (FGCS) • Host computers allow guest jobs to utilize CPU cycles • Availability of host computers varies • Guest jobs may incur resource failures • Need to predict availability of host computers • A scheduling system can then allocate guest jobs based on the predicted availability of host computers

  3. Kinds of Non Availabilities • FRC (Failures Caused by Resource Contention) • A guest job may significantly impact host processes • Hence a guest job can be removed • FRR (Failures Caused by Resource Revocation) • A machine owner suspends resource contribution without notice • Hardware-software failures occur

  4. Resource Failure Prediction • A multi-state failure model and application of a semi-Markov Process (SMP) to predict the temporal reliability • Predicting probability that no resource failure will occur on a machine in a future time window • Observing host resource usage values in a time window; calculating parameters of SMP based on host resource usage values

  5. Multi-state resource failure model • FRR – 2 states • A machine is either available or unavailable • FRC • Failures occur when host processes incur noticeable slowdown due to contention from guest processes • A host processor can first decrease the priority of guest processes; if this does not help, the guest process is terminated • Measured host resource usage is used as the indicator of noticeable slowdown

  6. Initial Experiments • To study relations between host resource usage and FRC - Experiments conducted to simulate resource contentions between a guest process and host processes • Host-group – an aggregated set of host processes with various resource usages • Slowdown of host group – reduction of its CPU utilization due to contending guest process • Host programs are run with their isolated CPU usage between 10% and 100% • Guest process – a CPU bound program

  7. Experiments on CPU contention • Also measured reduction rate of host CPU usage for a host-group • Experiments repeated with different host groups with host priority 0, and guest priority 0 and 19 (renice) • Measured reduction rate plotted as a function of isolated host CPU usage, LH • Found 2 thresholds for LH • Th1 – value of LH above which the guest process needs to be reniced to keep the reduction rate below 5% • Th2 – value of LH above which renicing is no longer enough and the guest process needs to be suspended to keep the reduction rate below 5%
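The reduction rate used on this slide can be sketched as a small helper. This is a minimal illustration, assuming the reduction rate is the relative drop of host CPU usage from its isolated value; the function names and the sample numbers are hypothetical and are not measurements from the referenced paper.

```python
SLOWDOWN_LIMIT = 0.05  # the 5% bound quoted on the slide


def reduction_rate(isolated_usage: float, contended_usage: float) -> float:
    """Relative drop in host CPU usage caused by a contending guest process.

    Assumption: both arguments are CPU usage fractions in (0, 1], measured
    for the same host-group with and without the guest process running.
    """
    if isolated_usage <= 0:
        raise ValueError("isolated_usage must be positive")
    return (isolated_usage - contended_usage) / isolated_usage


def is_noticeable(isolated_usage: float, contended_usage: float,
                  limit: float = SLOWDOWN_LIMIT) -> bool:
    """True when the slowdown exceeds the tolerated limit."""
    return reduction_rate(isolated_usage, contended_usage) > limit


if __name__ == "__main__":
    # Hypothetical sample: host-group drops from 50% to 45% CPU usage.
    print(reduction_rate(0.50, 0.45), is_noticeable(0.50, 0.45))
```

Sweeping such a measurement over host groups with different LH values, once with guest priority 0 and once with priority 19, is what would yield the two thresholds described above.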

  8. State model for FRC • 3 states • S1 - When LH < Th1; ignore resource contention due to guest processes; slowdown already less than 5% • S2 - When Th1 < LH < Th2; renice guest processes for slowdown to be < 5% • S3 - When LH > Th2; terminate guest process
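The three CPU-contention states can be sketched as a simple threshold classifier. This is only an illustration: the function name is hypothetical, the threshold values in the usage lines are placeholders rather than figures from the slides, and the treatment of the exact boundary values (LH equal to Th1 or Th2) is an assumption, since the slide only gives strict inequalities.

```python
def classify_cpu_state(lh: float, th1: float, th2: float) -> str:
    """Map isolated host CPU usage LH to S1, S2 or S3.

    Assumption: LH == Th1 and LH == Th2 are both assigned to S2; the slide
    does not say how the boundaries are handled.
    """
    if th1 > th2:
        raise ValueError("expected th1 <= th2")
    if lh < th1:
        return "S1"  # contention negligible, leave the guest alone
    if lh <= th2:
        return "S2"  # renice the guest to keep slowdown under 5%
    return "S3"      # renicing is not enough, terminate the guest


if __name__ == "__main__":
    TH1, TH2 = 0.2, 0.6  # hypothetical thresholds for illustration only
    for lh in (0.1, 0.4, 0.9):
        print(lh, classify_cpu_state(lh, TH1, TH2))
```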

  9. Experiments on CPU and Memory Contention • Memory thrashing occurs when the total memory of guest and host processes exceeds the available memory size • Experiments were conducted to verify that memory thrashing does not depend on guest priority • S4 – state for failure due to memory thrashing

  10. Multi-State Failure Model • The proposed algorithm predicts the probability that a machine will never transfer to S3, S4, or S5 within a future time window • Transitions • Between S1, S2, S3 – decided by measured host CPU usage • To S4 – when memory is limited • To S5 – when the machine becomes unavailable (FRR)

  11. Semi-Markov Process Model (SMP) • Applicable when the next transition depends only on • Current state • How long the system has been in the current state • Transition probabilities depend on the amount of time elapsed since the last change in state • SMP is defined by a 3-tuple • S – finite set of states • Q – state transition matrix • H – holding time mass function matrix

  12. SMP (Contd…) • The most important statistics of SMP - Interval transition probabilities, P • To calculate P • Continuous time SMP is expensive • Hence the work develops a discrete time SMP model

  13. SMP for Resource Availability • TR – probability of never transferring to S3, S4 or S5 within an arbitrary time window, W • Sinit – initial system state • W – the window from Winit to Winit + T • Q and H are calculated from statistics in history logs produced by monitoring host resource usage
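How Q and H might be derived from a history log can be sketched as follows. This is a minimal sketch under stated assumptions, not the referenced paper's procedure: the log is assumed to be already reduced to an ordered list of (state, holding time in discretization units) pairs, and `estimate_smp` is a hypothetical name. Q is estimated as the relative frequency of each observed transition, and H as the empirical distribution of holding times for each transition.

```python
from collections import Counter, defaultdict


def estimate_smp(sojourns):
    """Estimate Q and H from an ordered list of (state, holding_time) pairs.

    Q[i][k]      : fraction of departures from state i that went to state k
    H[(i, k)][l] : fraction of i -> k transitions whose holding time was l
    The last entry has no observed successor, so its holding time is unused.
    """
    trans = defaultdict(Counter)  # trans[i][k] = number of i -> k transitions
    hold = defaultdict(Counter)   # hold[(i, k)][l] = count with holding time l
    for (state, duration), (next_state, _) in zip(sojourns, sojourns[1:]):
        trans[state][next_state] += 1
        hold[(state, next_state)][duration] += 1

    Q = {i: {k: n / sum(c.values()) for k, n in c.items()}
         for i, c in trans.items()}
    H = {pair: {l: n / sum(c.values()) for l, n in c.items()}
         for pair, c in hold.items()}
    return Q, H


if __name__ == "__main__":
    # Hypothetical log: state names follow the slides, durations are made up.
    log = [("S1", 3), ("S2", 2), ("S1", 3), ("S2", 1), ("S1", 4), ("S3", 1)]
    Q, H = estimate_smp(log)
    print(Q)
    print(H)
```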

  14. SMP for Resource Availability • $P_{i,j}(m) = P_{i,j}(W_{init}, W_{init} + m)$ • $P^1_{i,k}(l)$ – interval transition probabilities for a one-step transition • d – time unit of a discretization interval • Q and H are calculated from statistics in history logs produced by monitoring host resource usage
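The interval transition probabilities of a discrete-time SMP can be computed with the standard textbook recursion: the system either makes no transition within m steps, or makes its first transition to some state k after l steps and then moves from k to j in the remaining m - l steps. The sketch below implements that recursion in plain Python. It is an illustration of the general technique, not necessarily the exact formulation used in the referenced paper; the function names and the list-based layout of Q and H are assumptions.

```python
def interval_transition_probs(Q, H, steps):
    """Return P where P[m][i][j] is the probability of being in state j
    after m time units, given that state i was just entered at time 0.

    Q[i][k]    : probability that the next state after i is k
    H[i][k][l] : probability that the holding time in i before moving to k
                 is exactly l units (index 0 is unused and must be 0.0)
    """
    n = len(Q)
    P = [[[1.0 if i == j else 0.0 for j in range(n)] for i in range(n)]]
    for m in range(1, steps + 1):
        Pm = [[0.0] * n for _ in range(n)]
        for i in range(n):
            left = 0.0  # probability of having left state i within m units
            for k in range(n):
                if Q[i][k] == 0.0:
                    continue  # skip impossible transitions
                for l in range(1, min(m, len(H[i][k]) - 1) + 1):
                    w = Q[i][k] * H[i][k][l]
                    left += w
                    for j in range(n):
                        Pm[i][j] += w * P[m - l][k][j]
            Pm[i][i] += 1.0 - left  # still in i, no transition yet
        P.append(Pm)
    return P


def temporal_reliability(Q, H, init, failure_states, steps):
    """TR = 1 - probability of being in a failure state after `steps` units.

    Assumption: failure states are absorbing (all-zero rows in Q), so being
    in one at the end of the window means it was entered within the window.
    """
    P = interval_transition_probs(Q, H, steps)
    return 1.0 - sum(P[steps][init][j] for j in failure_states)


if __name__ == "__main__":
    # Toy model: state 0 is available, state 1 is an absorbing failure state.
    Q = [[0.0, 1.0], [0.0, 0.0]]
    H = [[[0.0], [0.0, 0.5, 0.5]], [[0.0], [0.0]]]
    print(temporal_reliability(Q, H, init=0, failure_states=[1], steps=1))
    print(temporal_reliability(Q, H, init=0, failure_states=[1], steps=2))
```

With a window of T time units and discretization interval d, `steps` would be T/d. Skipping zero entries of Q is a simple way to benefit from the sparsity of the transition matrix.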

  15. System Design and Implementation • Client requests job submission • Client’s job scheduler queries the gateways on available machines for their temporal availability • Chooses a machine and spawns a guest job • During job execution, the monitor detects state transitions and notifies the gateway • Gateway renices or kills the guest processes accordingly • Resource monitor uses simple commands like `top` to calculate CPU usage
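The slide says the resource monitor derives CPU usage from commands like `top`. As a hedged alternative illustration (not the design described in the slides), the sketch below computes the same quantity from two samples of the aggregate `cpu` line of Linux `/proc/stat`. The function name is hypothetical, and the sample lines in the usage example are made up.

```python
def cpu_usage(prev_line: str, curr_line: str) -> float:
    """Fraction of non-idle CPU time between two aggregate 'cpu' lines.

    Each line is the first line of /proc/stat on Linux, with fields
    user, nice, system, idle, iowait, irq, softirq, steal, guest, guest_nice.
    """
    def parse(line):
        fields = line.split()
        if not fields or fields[0] != "cpu":
            raise ValueError("expected the aggregate 'cpu' line")
        values = [int(v) for v in fields[1:]]
        # guest and guest_nice are already counted in user and nice,
        # so only the first eight fields contribute to the total.
        total = sum(values[:8])
        idle = values[3] + (values[4] if len(values) > 4 else 0)  # idle + iowait
        return total, idle

    total0, idle0 = parse(prev_line)
    total1, idle1 = parse(curr_line)
    delta_total = total1 - total0
    if delta_total <= 0:
        raise ValueError("samples must be taken at increasing times")
    return 1.0 - (idle1 - idle0) / delta_total


if __name__ == "__main__":
    # Made-up samples; a real monitor would read /proc/stat twice.
    before = "cpu 100 0 50 800 50 0 0 0 0 0"
    after = "cpu 200 0 100 1600 100 0 0 0 0 0"
    print(cpu_usage(before, after))
```

A monitor along these lines would sample periodically, feed the usage into the state classification, and notify the gateway only when the state changes.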

  16. Computation in Solving SMP • Matrix sparsity in SMP is exploited to reduce computations • The sparse matrix is constructed based on 2 facts: • It takes a finite amount of time to transition from one state to another • S3, S4, S5 are unrecoverable failure states

  17. Prediction Accuracy • TR gets close to 0 for large time windows

  18. Appropriate Training Size

  19. Comparison with Linear Regression Techniques

  20. Injecting Noises

  21. References • Resource Failure Prediction in Fine-Grained Cycle Sharing Systems. X. Ren, S. Lee, R. Eigenmann, S. Bagchi. HPDC 2006.
