Automated Cellular Root Cause Analysis. Sayandeep Sen, Bell Labs India. Joint work with Sourjya Bhaumik & Rijin John.
Automated Cellular Root Cause Analysis Sayandeep Sen, Bell Labs India Joint work with Sourjya Bhaumik & Rijin John
Cellular Base Station Monitoring Cell sites report to the Monitoring Centre every 15 minutes • Performance counters. Example: connected users, average signal strength, cell radius etc. • KPI: Key Performance Indicator. Example: call drop rate, successful connection setup rate, throughput
Root cause analysis Why did the KPI go below threshold? Answered manually at the Monitoring Centre from the performance counters of the cell sites
Root Cause Analysis – Issues Manual debugging is inefficient • Too many variables • ~300 parameters • 1 engineer per O(100) cell sites • Sporadic parameter dips • Multiple parameter interaction [Figure: KPI and Parameter 1 to Parameter N, each plotted against time]
Problem Statement Carry out automated (fast) root cause analysis which accounts for sporadic dips and multiple parameter interactions while ensuring human-readable output.
Outline • Motivation • Problem statement • Approach • Insight, Mechanism, Customizations • Results • Ongoing work • Other work
Key Intuition KPI-parameter relationship is dependent on other parameter values. Example region below the Call Success threshold: Conn. Req. > X & Handoff rate = y. Determine the rules for the various parameter value combinations using regression trees. [Figure: Call Success plotted against Conn. Req. and Handoff rate]
Outline • Motivation • Problem statement • Approach • Insight, Mechanism, Customizations • Results • Ongoing work • Other work
Regression trees Form clusters of points so as to minimize the sum of the distance metric over the sub-clusters (Δ’ + Δ” instead of Δ). Distance metric: sum of Euclidean distances of the points in a sub-cluster. Provide a human-readable rule for each cluster.
Regression trees (axes: Conn. Req., Handoff rate; KPI: Call Success)
1) Pick an axis
2) Calculate Δ
3) Pick a pivot to divide the points into two clusters
4) Calculate Δ’+Δ”
Repeat for all pivots; repeat for all axes
5) Pick the pivot with minimum Δ’+Δ” (e.g. Conn.Req < X vs. Conn.Req >= X)
Repeat for sub-clusters (e.g. Handoff Rate < Y vs. Handoff Rate >= Y)
Select rules corresponding to low KPI values
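The split search in steps 1) to 5) can be sketched in a few lines of Python. This is a minimal illustration, not the authors' implementation. It assumes Δ is the sum of squared distances of the KPI values from the cluster mean (the usual regression-tree criterion; the slides only say "sum of Euclidean distance"). The names `spread`, `best_split`, `leaf_rules` and the sample rows are hypothetical.

```python
from statistics import fmean

def spread(values):
    """Assumed Δ for one cluster: sum of squared distances from the cluster mean."""
    if not values:
        return 0.0
    mean = fmean(values)
    return sum((v - mean) ** 2 for v in values)

def best_split(rows, kpi, axes):
    """Try every pivot on every axis; return (axis, pivot, Δ' + Δ'') with the smallest sum."""
    best = None
    for axis in axes:
        for pivot in sorted({row[axis] for row in rows}):
            left = [row[kpi] for row in rows if row[axis] < pivot]
            right = [row[kpi] for row in rows if row[axis] >= pivot]
            if not left or not right:
                continue  # a pivot must produce two non-empty clusters
            total = spread(left) + spread(right)
            if best is None or total < best[2]:
                best = (axis, pivot, total)
    return best

def leaf_rules(rows, kpi, axes, path=()):
    """Split recursively; return one (rule, mean KPI) pair per leaf cluster."""
    split = best_split(rows, kpi, axes)
    if split is None or split[2] >= spread([row[kpi] for row in rows]):
        return [(" AND ".join(path) or "ALL", fmean([row[kpi] for row in rows]))]
    axis, pivot, _ = split
    left = [row for row in rows if row[axis] < pivot]
    right = [row for row in rows if row[axis] >= pivot]
    return (leaf_rules(left, kpi, axes, path + (f"{axis} < {pivot}",))
            + leaf_rules(right, kpi, axes, path + (f"{axis} >= {pivot}",)))

# Hypothetical samples: call success only drops when both counters are high.
samples = [(50, 2, 99.0), (60, 8, 99.0), (70, 3, 99.0), (80, 9, 99.0),
           (120, 2, 99.0), (135, 3, 99.0), (130, 8, 95.0), (140, 9, 95.0)]
rows = [{"conn_req": c, "handoff": h, "call_success": s} for c, h, s in samples]

rules = leaf_rules(rows, "call_success", ["conn_req", "handoff"])
low_kpi_rules = [rule for rule, mean in rules if mean < 98.5]
print(low_kpi_rules)
```

Because each leaf carries the chain of threshold tests that leads to it, the selected low-KPI leaf reads directly as a rule over two counters.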
Regression trees • Human-readable rules (Conn.Req < X / Conn.Req >= X, Handoff Rate < Y / Handoff Rate >= Y) • Capture sporadic events due to time-agnostic clustering • Capture multiple-variable interaction
Outline • Motivation • Problem statement • Approach • Insight, Mechanism, Customizations • Results • Ongoing work • Other work
Regression trees – Issues • Distance metric oblivious of significance of KPI values • Curse of dimensionality
Metric oblivious of KPI value significance Need big separation between good and bad values, but the distinction between good and bad is small: Call Success values such as 98.7%, 98.6% and 98.5% lie right at the bad threshold of 98.5%. Stratification of data: stratify KPI values by multiplying the KPI value with a custom step function. [Figure: Call Success plotted against Conn. Req. and Handoff rate]
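A minimal sketch of what multiplying the KPI by a step function could look like. The slides do not give the actual function, so the weights (1.0 and 0.5), the choice of `<=` for the bad side, and the names `step` and `stratify` are assumptions for illustration only.

```python
def step(kpi, threshold=98.5, good_weight=1.0, bad_weight=0.5):
    """Hypothetical step function: one weight above the threshold, another at or below it."""
    return good_weight if kpi > threshold else bad_weight

def stratify(kpi_values, **step_args):
    """Multiply each KPI value by the step function to pull good and bad values apart."""
    return [value * step(value, **step_args) for value in kpi_values]

raw = [98.7, 98.6, 98.5]        # good, good, bad (threshold 98.5%)
stratified = stratify(raw)      # good values unchanged, bad value scaled down
gap_before = min(raw[:2]) - raw[2]
gap_after = min(stratified[:2]) - stratified[2]
print(stratified, gap_before, gap_after)
```

After stratification the bad value sits far from the good ones, so a distance-based split criterion has a large separation to work with.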
Regression trees – Issues • Distance metric oblivious of significance of KPI values • Fix: stratify KPI values • Curse of dimensionality • Fix: dimensionality reduction
Curse of Dimensionality ~300 variables lead to 2^300 combinations; the regression tree can be misled. Example candidate rules for Call Success: Handoff rate < X & Conn. Req. < Y; Cell Radius > X & Allotted Power < Y; Traffic Load > X & Interference > Y
Dimensionality reduction • Preprocessing: remove correlated, barely changing parameters etc. • Domain knowledge based filtering: remove unrelated parameters, apply weights • Heuristics: Spike, Correlation, 3 more …
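The preprocessing step (dropping barely changing and mutually correlated parameters) might look like the sketch below. The thresholds `min_std` and `max_corr`, the function names and the sample counters are assumptions. The slides do not say how "barely changing" or "correlated" are measured, so a plain standard deviation and Pearson correlation are used here.

```python
from statistics import fmean, pstdev

def pearson(xs, ys):
    """Pearson correlation; returns 0.0 when either series is constant."""
    mx, my = fmean(xs), fmean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sum((x - mx) ** 2 for x in xs) ** 0.5
    sy = sum((y - my) ** 2 for y in ys) ** 0.5
    return cov / (sx * sy) if sx and sy else 0.0

def prefilter(params, min_std=1e-3, max_corr=0.95):
    """Keep a parameter only if it varies and is not a near-copy of one already kept."""
    kept = {}
    for name, series in params.items():
        if pstdev(series) < min_std:
            continue  # barely changing
        if any(abs(pearson(series, other)) > max_corr for other in kept.values()):
            continue  # strongly correlated with a kept parameter
        kept[name] = series
    return kept

# Hypothetical counters sampled at four intervals.
params = {
    "conn_req": [10, 20, 30, 40],
    "conn_req_scaled": [20, 40, 60, 80],   # perfectly correlated with conn_req
    "cell_radius": [5, 5, 5, 5],           # never changes
    "handoff": [3, 1, 4, 1],
}
kept = prefilter(params)
print(list(kept))
```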
Spike heuristic Values spike around the same time [Figure: Call Success and a parameter, each plotted against time]
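One way to read the spike heuristic is to keep a parameter when its spikes fall close in time to the KPI's spikes. The sketch below is an assumption about how that could be scored, not the method from the talk. A spike is taken to be a sample more than `k` standard deviations from the series mean, and `window` is a hypothetical tolerance in 15-minute intervals.

```python
from statistics import fmean, pstdev

def spikes(series, k=2.0):
    """Indices whose value lies more than k standard deviations from the mean."""
    mean, std = fmean(series), pstdev(series)
    if std == 0:
        return set()
    return {i for i, value in enumerate(series) if abs(value - mean) > k * std}

def spike_overlap(kpi, param, k=2.0, window=1):
    """Fraction of KPI spikes that have a parameter spike within `window` intervals."""
    kpi_spikes = spikes(kpi, k)
    if not kpi_spikes:
        return 0.0
    param_spikes = spikes(param, k)
    hits = sum(1 for t in kpi_spikes
               if any(abs(t - u) <= window for u in param_spikes))
    return hits / len(kpi_spikes)

# Hypothetical series: the load spike at index 4 precedes the KPI dip at index 5.
call_success = [99, 99, 99, 99, 99, 90, 99, 99, 99, 99]
uplink_load = [10, 10, 10, 10, 50, 10, 10, 10, 10, 10]
cell_radius = [5] * 10

print(spike_overlap(call_success, uplink_load))  # high overlap, keep the parameter
print(spike_overlap(call_success, cell_radius))  # no spikes, drop the parameter
```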
Correlation heuristic Correlation changes significantly between good and bad KPI values [Figure: Call Success vs. Conn. Req., plotted separately for Call Success > 98.5% and Call Success <= 98.5%]
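A sketch of the correlation heuristic under stated assumptions: the samples are split at the 98.5% threshold shown on the slide, Pearson correlation between the parameter and the KPI is computed in each regime, and the absolute difference is reported. What counts as a "significant" change is not given in the slides, and the function names and data are illustrative.

```python
from statistics import fmean

def pearson(xs, ys):
    """Pearson correlation; returns 0.0 when either series is constant."""
    mx, my = fmean(xs), fmean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sum((x - mx) ** 2 for x in xs) ** 0.5
    sy = sum((y - my) ** 2 for y in ys) ** 0.5
    return cov / (sx * sy) if sx and sy else 0.0

def correlation_shift(kpi, param, threshold=98.5):
    """Return (r_good, r_bad, |r_good - r_bad|), or None if a regime has under 2 samples."""
    good = [(p, k) for p, k in zip(param, kpi) if k > threshold]
    bad = [(p, k) for p, k in zip(param, kpi) if k <= threshold]
    if len(good) < 2 or len(bad) < 2:
        return None
    r_good = pearson(*zip(*good))
    r_bad = pearson(*zip(*bad))
    return r_good, r_bad, abs(r_good - r_bad)

# Hypothetical data: call success ignores conn_req while good, falls with it while bad.
conn_req = [10, 20, 30, 40, 50, 60, 70, 80]
call_success = [99.0, 99.2, 99.2, 99.0, 98.4, 98.0, 97.6, 97.2]

r_good, r_bad, shift = correlation_shift(call_success, conn_req)
print(round(r_good, 3), round(r_bad, 3), round(shift, 3))
```

A parameter whose correlation with the KPI is near zero in the good regime but strongly negative in the bad regime is a natural candidate to keep for the regression tree.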
Rule generation Data store → Stratify KPI data → Apply filters → Regression tree → Select rules → Rule store
Rule application Matching rules Rule store
Outline • Motivation • Problem statement • Approach • Insight, Mechanism, Customizations • Results • Ongoing work • Other work
Training & Verification Data • Analyzed 28 days of data from 217 cell sites • 2 countries, 2 OEMs • 317 parameters at 15-minute intervals • 80% of the data to train and 20% to validate
Find rules for all KPI dips Cell sites with at least 4 KPIs with more than 100 bad instances selected [Figure: number of bad instances per KPI, for Country #1 (18 cell sites) and Country #2 (60 cell sites)]
Rule Verification • Picked rules for 50 randomly selected KPI dips • Showed the rules to 15 RF engineers (ongoing) • 80% of the rules were actionable • For every KPI dip, at least one actionable rule in the rule set
Example rule set
KPI dip: Call success rate < 98.5%
1) Total users in 5 to 10 km from base station > 63%
2) Total users in bad RSS region > 21% AND Total uplink load > 831 MB
3) Download traffic < 500 Kbytes AND Total active users < 200
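To illustrate how a rule set like this could be matched against a fresh 15-minute sample, the sketch below encodes the three rules as conjunctions of threshold tests. The counter names (`pct_users_5_to_10_km` and so on) and the sample values are hypothetical; only the thresholds come from the slide.

```python
import operator

OPS = {">": operator.gt, "<": operator.lt}

# Each rule is a conjunction (AND) of (counter, comparison, threshold) tests.
RULES = [
    [("pct_users_5_to_10_km", ">", 63)],
    [("pct_users_bad_rss", ">", 21), ("uplink_load_mb", ">", 831)],
    [("download_traffic_kbytes", "<", 500), ("active_users", "<", 200)],
]

def matching_rules(sample, rules=RULES):
    """Return the 1-based numbers of the rules whose tests all hold for the sample."""
    return [number for number, rule in enumerate(rules, start=1)
            if all(OPS[op](sample[name], value) for name, op, value in rule)]

# Hypothetical counters for one cell site during a call success dip.
sample = {
    "pct_users_5_to_10_km": 40,
    "pct_users_bad_rss": 30,
    "uplink_load_mb": 900,
    "download_traffic_kbytes": 2000,
    "active_users": 350,
}
matched = matching_rules(sample)
print(matched)
```

Keeping rules as plain data rather than code makes them easy to store, display to an engineer, and re-evaluate as new samples arrive.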