Automated Procedures for Improving the Accuracy of Sensor-Based Monitoring Data Rebecca Buchheit AIS Lab
Background • sporadic use of KDD techniques in civil infrastructure • relative youth of data mining research • difficult to systematically apply KDD process • KDD process tools (CRISP-DM) still under development • KDD process highly domain dependent • time consuming to teach data mining analysts domain knowledge
Research Objectives • develop a framework for systematically applying KDD process to civil infrastructure data analysis needs • set of guidelines for inexperienced analysts • checklist for more experienced analysts • describe intersection of KDD process characteristics and civil infrastructure • what problems are well-suited to KDD? • what characteristics are unique to infrastructure?
Summary • increased data collection => increased need to intelligently analyze data • KDD process as a “power tool” for analyzing data for high-level knowledge • civil infrastructure problems are well-suited to data mining but will need to apply entire KDD process to get good results • proposed framework will help researchers to systematically apply KDD process to their data analysis problems
Data Quality • What is it? • in this talk, “accuracy” • how close is the observed value to the true value? • “ground truth” is rare • look for anomalous patterns • Why is it important? • poor quality data may taint analyses • patterns of poor quality data may overwhelm data mining/machine learning algorithms
Mn/ROAD Data • weigh-in-motion data • axle spacings and weights, speed, lane, error codes • derived quantities • equivalent standard axle loads (ESALs) • FHWA vehicle type • gross vehicle weight • total vehicle length • trucks only (type >= 4) • Jan 1 ’98 to Dec 31 ’00 • about 3 million vehicles • data courtesy of Mn/ROAD
Overview of Approach • use statistical analysis and data mining algorithms to separate anomalies from normal data • clustering • regression • physical constraints • statistical properties • focus on differences between anomalies and normal data to help discover causation
Clustering • group data into “natural classes” • anomalies separated from normal data • used Autoclass clustering algorithm
Regression • confidence interval of 95% • R-square (fit) = 0.923 • if error > 15% then identify as anomaly • ∑ESAL = (3.531 ± 0.176) ∑vehicles − (1.252 ± 0.099) ∑axles + (0.066 ± 0.003) ∑GVW − (139.000 ± 79.813)
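As a rough illustration of the 15% rule, the sketch below plugs the central coefficient values from the slide into the regression and flags a day whose observed ESAL total is too far from the prediction. It makes several assumptions that the slide does not state: the sums are taken per day, "error" means relative error against the predicted value, and the ± ranges are ignored. The function names and the example inputs are hypothetical.

```python
# Central coefficient values from the slide; the ± ranges are ignored here.
COEF_VEHICLES = 3.531
COEF_AXLES = -1.252
COEF_GVW = 0.066
INTERCEPT = -139.000
ERROR_THRESHOLD = 0.15  # "if error > 15% then identify as anomaly"


def predicted_esal(vehicles, axles, gvw):
    """Predicted ESAL total from vehicle, axle and GVW totals (assumed daily)."""
    return (COEF_VEHICLES * vehicles
            + COEF_AXLES * axles
            + COEF_GVW * gvw
            + INTERCEPT)


def is_anomalous_day(vehicles, axles, gvw, observed_esal):
    """Flag a day whose relative prediction error exceeds the threshold.

    Assumption: error = |observed - predicted| / |predicted|.
    """
    predicted = predicted_esal(vehicles, axles, gvw)
    if predicted == 0:
        return True  # relative error is undefined; treat the day as suspect
    return abs(observed_esal - predicted) / abs(predicted) > ERROR_THRESHOLD


# Hypothetical daily totals, in whatever units the original fit used.
print(is_anomalous_day(1000, 5000, 80000, observed_esal=3000))
```

The units of GVW and the exact error definition would need to match the original fit before the threshold means the same thing.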
Distribution Constraints • use a goodness-of-fit test to compare distributions from the same day of week • length • gross weight • ESALs • lane
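The slide does not say which goodness-of-fit test was used. As one possible sketch, a Pearson chi-square statistic can compare one day's lane counts against baseline lane proportions for the same day of week. The choice of chi-square, the four-lane layout, the baseline proportions and the function names are all assumptions made for illustration.

```python
def chi_square_statistic(observed_counts, baseline_proportions):
    """Pearson chi-square statistic of observed counts against a baseline.

    baseline_proportions are assumed to sum to 1 and to be non-zero.
    """
    if len(observed_counts) != len(baseline_proportions):
        raise ValueError("counts and proportions must have the same length")
    total = sum(observed_counts)
    statistic = 0.0
    for observed, proportion in zip(observed_counts, baseline_proportions):
        if proportion <= 0:
            raise ValueError("baseline proportions must be positive")
        expected = total * proportion
        statistic += (observed - expected) ** 2 / expected
    return statistic


def departs_from_baseline(observed_counts, baseline_proportions, critical_value):
    """True if the day is unlikely to come from the baseline distribution."""
    return chi_square_statistic(observed_counts, baseline_proportions) > critical_value


# Hypothetical example: four lanes, so 3 degrees of freedom.
# 7.815 is the chi-square critical value for 3 degrees of freedom at the 5% level.
baseline = [0.4, 0.3, 0.2, 0.1]
print(departs_from_baseline([100, 300, 200, 400], baseline, critical_value=7.815))
```

With thousands of vehicles per day, even small shifts become statistically significant, so the critical value would likely need tuning in practice. Continuous quantities such as length, gross weight and ESALs would first have to be binned, or compared with a different test.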
Anomaly Identification • identify days with higher than normal concentrations of binary constraint violations • identify days that are not likely to have come from the baseline distributions for length, ESALs, gross weight and lane
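One way to read "higher than normal concentrations" is to compute, for each day, the fraction of records that break at least one physical constraint, and then flag days whose fraction sits well above the typical day. The sketch below does that with a median plus a multiple of the median absolute deviation (MAD). The flagging rule, the multiplier `k`, the record fields and the two example constraints are assumptions; the slides do not list the actual constraints or threshold.

```python
from statistics import median


def violation_rate(records, constraints):
    """Fraction of records that violate at least one binary constraint."""
    if not records:
        return 0.0
    violations = sum(
        1 for record in records
        if not all(check(record) for check in constraints)
    )
    return violations / len(records)


def flag_days(daily_rates, k=5.0):
    """Return the days whose violation rate exceeds median + k * MAD.

    daily_rates maps a day label to its violation rate. If the MAD is zero,
    any day above the median is flagged.
    """
    rates = list(daily_rates.values())
    center = median(rates)
    mad = median([abs(rate - center) for rate in rates])
    threshold = center + k * mad
    return sorted(day for day, rate in daily_rates.items() if rate > threshold)


# Hypothetical constraints on hypothetical record fields.
constraints = [
    lambda r: r["speed"] > 0,  # a moving vehicle has positive speed
    lambda r: r["gvw"] > 0,    # gross vehicle weight must be positive
]
records = [{"speed": 60, "gvw": 30}, {"speed": -1, "gvw": 30}]
print(violation_rate(records, constraints))
```

A median-based threshold is used here because the anomalous days themselves would inflate a mean and standard deviation.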
A Quick Refresher • used four different procedures to detect anomalies • clustering • regression • binary (physical) constraints • distribution constraints • next up • what is causing the anomalies? • can we fix them?
What Happened? • two vehicles traveling slowly and close together (tailgating) may be recorded as a single vehicle • lightweight vehicles are tailgating cars • cars not supposed to be in database • mis-classified because of tailgating • this causes the “high” vehicle counts • very heavy vehicles are tailgating trucks • lane 1 (right-hand side) data is missing for all “low” vehicle count days
Can It Be Fixed? (1) • removed all tailgating cars • lightweight • short • 2 or 3 axles • error code • “halved” all tailgating trucks • very long • very heavy • more than 9 axles • error code
Can It Be Fixed? (2) • inserted lane 1 vehicles from same time period in 2000 • “shifted” days to make sure day of week was constant • Tuesday Sept 8 1998 => Tuesday Sept 5 2000
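The day "shift" can be sketched as finding the date in the donor year that is closest to the same calendar date while keeping the day of week. The nearest-date rule and the function name are assumptions; the slide gives only the single example, which this rule happens to reproduce.

```python
from datetime import date, timedelta


def same_weekday_in_year(day, target_year):
    """Nearest date to the same month/day in target_year with the same weekday."""
    try:
        candidate = day.replace(year=target_year)
    except ValueError:
        # Feb 29 has no counterpart in a non-leap target year.
        candidate = date(target_year, 2, 28)
    # Signed offset in the range -3..+3 days that restores the original weekday.
    offset = (day.weekday() - candidate.weekday()) % 7
    if offset > 3:
        offset -= 7
    # Note: dates in the first or last days of a year may land in an adjacent year.
    return candidate + timedelta(days=offset)


print(same_weekday_in_year(date(1998, 9, 8), 2000))
```

For Tuesday Sept 8 1998 this returns Tuesday Sept 5 2000, matching the example on the slide.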
Summary • statistical analysis and data mining algorithms can be used to detect systematic anomalies in data • focus on differences between anomalies and normal data to help discover causation • need domain knowledge to understand causation
Current Progress/Future Work • integrate algorithms into data quality assessment program == automation • physical constraints • distribution constraints • other statistical characteristics of data • clustering • regression, neural networks • will support infrastructure-related data collection activities • use algorithms to identify and “clean” anomalies
Acknowledgements • Minnesota Department of Transportation, especially Maggi Chalkline • based upon work supported by the National Science Foundation, under Grant Numbers 9987871 and DGE 9553380