SVS Alert Performance Evaluation at NYPD Rick Kjeldsen
Goals of Alert Performance Evaluation • Determine system-wide performance • Hit Rate (HR) / False Positive Rate (FPR) • Identify significantly under-performing cameras • Identify areas for analytic improvement
Assumptions and Implications • Identifying accurate HR of every camera requires unrealistic effort • Eventually only a sample of cameras should be tested with staged events • Only the worst performing cameras can be easily identified • Tuning every camera for maximum HR requires unrealistic effort • Most cameras must operate with little or no tuning • If a camera has a low FPR, it is better to have the alert running, even if the HR is somewhat low, than not running at all • We should only eliminate cameras which have a high FPR • Data (e.g. video) from testing will be saved by IBM and used for release certification • Cameras need not be re-tested unless more data is required for tuning or development • SVS is evolving rapidly as lessons from tuning are incorporated • Evaluation effort must decrease over time as the system matures and IBM/NYPD gain confidence in it
Test plan overview • Monitor all cameras for FPR • Tune worst performing cameras to reduce FPR • Eliminate cameras where high FPR cannot be reduced • Test only a sample of cameras for HR • Use wild events where practical • Stage tests on a larger number of cameras with fewer drops each • Sampling frequency and method TBD • Sample size decreases over time • Identify and tune the worst performing cameras • Use results to inform algorithm development and proactive camera configuration
False Positive Rate Evaluation • NYPD labels TP vs FP • IBM • Assigns causes to all FPs • Tunes to reduce FPR where possible • Identifies specific problems, trends and patterns to guide analytic development • The current FP testing methodology gives good results but detailed adjudication is not scalable • Evolve toward doing detailed adjudication only on cameras with high FPR
Current Hit Rate Evaluation • NYPD • stages 10-20 events per camera • IBM • Monitors tests to obtain “live” Hit Rate • Captures test video for tuning, algorithm development and release certification • Evaluates / tunes every under-performing camera • Extensive lab testing • Cameras not meeting performance criteria are removed • The current method produces poor results and is labor intensive • Poor accuracy of performance metrics given low numbers of staged events • NYPD effort to stage events • IBM effort to repeatedly evaluate / tune every underperforming camera
Proposed HR Evaluation • Phase 1 (initial 25 alerting cameras): • Goal: Identify major classes of problems and remediations • Stage events on every alerting camera • 10-20 events per camera • Distributed across time of day, weather, etc. • For under-performing cameras: • Tune where possible using previously staged events • Record remediations to apply to future cameras with the same attributes • Modify algorithms to eliminate the underlying cause of poor performance where possible • Deploy inter-release algorithm patches
Proposed HR Evaluation • Phase 2 (next 25 cameras): • Goal: Verify remediations, continued algorithm development, reduced effort • Stage events on every alerting camera • 5 events / camera • For under-performing cameras (<3 hits): • Stage additional events (10-20 as needed) • Tune where possible • Record remediations to apply to future cameras with similar attributes • Modify algorithms to eliminate the underlying cause of poor performance where possible
Proposed HR Evaluation • Phase 3 (all remaining cameras): • Goal: Monitor overall system performance, identify new problem camera classes • Stage events on a subset of alerting cameras • 5 events / camera • For under-performing cameras (<3 hits): • Stage additional drops (10-20 as needed) • Tune if possible • Record remediations to apply to future cameras with similar attributes • Identify similar cameras in current set • Apply remediation • Adjust algorithms to eliminate the underlying cause of poor performance where possible
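The "<3 hits in 5 events" screen used in Phases 2 and 3 can be explored with a small binomial calculation. The sketch below is illustrative only: it assumes staged events are independent, and the true hit rates it loops over (0.85, 0.60, 0.30) are hypothetical values chosen to show how often a camera would be flagged, not measured NYPD figures. The function name `flag_probability` is likewise made up for this example.

```python
from math import comb


def flag_probability(true_rate, events=5, min_hits=3):
    """Chance that a camera records fewer than `min_hits` hits in `events`
    independent staged events, given its (unknown) true hit rate."""
    return sum(
        comb(events, k) * true_rate**k * (1 - true_rate) ** (events - k)
        for k in range(min_hits)
    )


# Hypothetical true hit rates: a healthy, a marginal and a poor camera.
for rate in (0.85, 0.60, 0.30):
    print(f"true HR {rate:.2f}: flagged with probability {flag_probability(rate):.3f}")
```

Under these assumptions the screen rarely flags a camera that is performing well, but it also lets a fair share of marginal cameras through, which is why flagged cameras receive the additional 10-20 drops.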
Estimation of Detection Rate • System evaluation on all staged data on all cameras is our best estimate of detection rate. The margin of error for a small number of drops is too large to generalize. • Estimate may be off due to 1) Sampling errors - favorable or unfavorable conditions are over or under-represented. 2) Staging of event may not be “true to life”. 3) Evaluation system may not be configured optimally (tuning efforts on particular cameras may bring about better performance). • But we can use our best estimate to: • Track performance across system releases (find issues that can be addressed prior to deployment). • Find failure modes that occur on particular cameras during deployment without the need to stage a large number of events (see next slide).
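To see why a small number of drops cannot be generalized, the margin of error can be computed directly. The sketch below assumes a normal-approximation (Wald) interval; the slides do not say which method was used, but this one reproduces the 0.809 to 0.891 range quoted for 425 hits in 500 drops on the next slide. The 17-of-20 case is a hypothetical single-camera result with the same 85% observed rate, and `wald_interval` is a name invented for this example.

```python
import math

Z_99 = 2.576  # two-sided 99% quantile of the standard normal distribution


def wald_interval(hits, drops, z=Z_99):
    """Normal-approximation (Wald) confidence interval for a hit rate,
    clipped to the valid range [0, 1]."""
    p = hits / drops
    half_width = z * math.sqrt(p * (1 - p) / drops)
    return max(0.0, p - half_width), min(1.0, p + half_width)


# Same observed rate (85%), very different certainty.
for hits, drops in [(17, 20), (425, 500)]:
    low, high = wald_interval(hits, drops)
    print(f"{hits}/{drops}: 99% interval {low:.3f} to {high:.3f}")
```

The Wald interval is known to behave poorly for small samples or rates near 0 or 1, so it is used here only to illustrate how the interval widens as the number of drops shrinks.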
Detection of Failure Modes • 1. Run a large-scale quantitative evaluation to form a best estimate of the system detection rate. • Example detection rate estimate: 0.85 (425 hits in 500 drops), with a 99% confidence interval of 0.809 to 0.891 (in 99 out of 100 experiments the true detection rate will be within this range). • 2. Assume an arbitrary camera is operating at least at the low value of the interval (in this example, 0.809). • 3. Given N drops, find the number of hits Hitmin such that if the system detects fewer than Hitmin bags, it is almost certain that the system is performing below that low value (i.e., there may be a significant failure mode at work). • The value of Hitmin in #3 can be found by summing the binomial probabilities of 0, 1, ..., Hitmin - 1 hits and taking the largest Hitmin for which the sum is less than 0.01.
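The Hitmin search can be sketched in a few lines. This is one possible reading of the procedure, not the evaluation team's actual code: it assumes drops are independent, uses 0.809 as the assumed per-camera detection rate and 0.01 as the cutoff, and the function names are invented for illustration.

```python
from math import comb


def prob_fewer_than(k, n_drops, rate):
    """P(X < k) for X ~ Binomial(n_drops, rate)."""
    return sum(
        comb(n_drops, i) * rate**i * (1 - rate) ** (n_drops - i)
        for i in range(k)
    )


def hit_min(n_drops, rate, alpha=0.01):
    """Largest k such that P(X < k) < alpha. A camera scoring fewer than
    k hits is then very unlikely to be operating at `rate` or better.
    Returns 0 when n_drops is too small to flag anything at this alpha."""
    k = 0
    while k < n_drops and prob_fewer_than(k + 1, n_drops, rate) < alpha:
        k += 1
    return k


P_LOW = 0.809  # low end of the 99% interval from the 425/500 example
for n in (5, 10, 20):
    print(f"N = {n:2d} drops: Hitmin = {hit_min(n, P_LOW)}")
```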
Define a target detection rate • Given an experiment with n bag drops, determine the minimum number of detections Dmin such that if fewer than Dmin bags are detected, then the system is most likely operating below the target detection rate. For example, if we decide on a target detection rate of 75% and we drop 20 bags and fewer than 10 bags are detected, then we are 99.6% confident that the system is actually performing below a 75% detection rate. It is important to note that if the experiment result is greater than or equal to Dmin bags detected, the target detection rate is not validated.
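The worked example above can be checked numerically, and the same calculation shows why passing the test does not validate the target. The sketch assumes independent drops and reads "99.6% confident" as one minus the binomial probability of fewer than 10 detections at exactly the 75% target. The 60% "weak camera" rate is a hypothetical value added for illustration.

```python
from math import comb


def prob_fewer_than(d, n_drops, rate):
    """P(detections < d) in n_drops independent drops at the given true rate."""
    return sum(
        comb(n_drops, k) * rate**k * (1 - rate) ** (n_drops - k)
        for k in range(d)
    )


n_drops, d_min, target = 20, 10, 0.75

# Chance of seeing fewer than 10 detections if the camera really is at 75%.
tail = prob_fewer_than(d_min, n_drops, target)
print(f"P(fewer than {d_min} of {n_drops} | rate {target}) = {tail:.4f}")
print(f"confidence that the rate is below target: {1 - tail:.1%}")

# Caveat: a camera well below target can still reach Dmin (hypothetical 60% rate).
weak_rate = 0.60
pass_prob = 1 - prob_fewer_than(d_min, n_drops, weak_rate)
print(f"P(at least {d_min} of {n_drops} | rate {weak_rate}) = {pass_prob:.3f}")
```

A camera below the target rate can still reach Dmin much of the time, so the test is useful for detecting failures, not for certifying that the target is met.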