Dealing with Data Quality. Google Workshop, July 24, 2009.
[Figure: example images labeled "?", "Low light", "Blurry", "Missing", "Blurry"]
Faults can reduce the quantity and quality of the collected information.
When ignored, faults in a dataset can lead to ambiguous, or worse, incorrect conclusions. [Figure: shapes labeled “Circle” or “Square”]
Unfortunately, faults in networked sensing systems are common.
[Figure: proportions of good data, network faults, and data faults, from sources 1-4 below. Numbers are approximations based on publications and personal communications.]
1. R. Szewczyk et al. An analysis of a large scale habitat monitoring application. In Proc. SenSys, 2004.
2. G. Tolle et al. A macroscope in the redwoods. In Proc. SenSys, 2005.
3. G. Werner-Allen et al. Fidelity and Yield in a Volcano Monitoring Sensor Network. In Proc. OSDI, 2006.
4. Cms database. http://cens.jamesreserve.edu/phpmyadmin
Our experience is similar: almost 60% of data was faulty in this soil deployment (Bangladesh, 2006). [Figure labels: Ammonium Calcium Carbonate Chloride Nitrate pH]
Many methods to find faults. Examples include:
• Visual inspection
• Manual validation
• Analytical validation: statistical, scientific models
  Statistical, e.g. outlier detection
  Scientific, e.g. “temperature decreases with depth”
[Figure: temperature vs. depth]
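The two kinds of analytical validation named above can be sketched in a few lines. This is only an illustration, not the method used in the deployments described here: the function names, the median-absolute-deviation rule, the 3.5 threshold, and the sample numbers are all assumptions chosen for the example.

```python
from statistics import median

def mad_outliers(values, threshold=3.5):
    """Statistical check: return indices of readings whose modified
    z-score (based on the median absolute deviation) exceeds threshold.
    The 3.5 cutoff is a common rule of thumb, assumed here."""
    med = median(values)
    mad = median([abs(v - med) for v in values])
    if mad == 0:
        return []  # no spread estimate, so nothing can be scored
    return [i for i, v in enumerate(values)
            if 0.6745 * abs(v - med) / mad > threshold]

def depth_trend_violations(profile, tolerance=0.0):
    """Scientific check: "temperature decreases with depth".
    profile is a list of (depth, temperature) pairs. Returns the pairs
    that are warmer than the next shallower reading by more than
    tolerance. Either reading of such a pair could be the faulty one."""
    ordered = sorted(profile)  # shallowest first
    return [deeper for shallower, deeper in zip(ordered, ordered[1:])
            if deeper[1] > shallower[1] + tolerance]

# Hypothetical readings, for illustration only.
temps = [20.1, 20.3, 19.9, 20.2, 35.0, 20.0]
print(mad_outliers(temps))
print(depth_trend_violations([(0, 25.0), (5, 22.0), (10, 23.5), (20, 18.0)]))
```

A flagged reading is a candidate fault, not a confirmed one: both checks only say that a value disagrees with its neighbors or with the assumed model.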
Several methods to fix faults:
• Go into the field and replace or fix the problem.
• Remove the faulty data (“clean” the dataset) after the deployment is over.
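The second option, cleaning the dataset after the deployment, might look like the sketch below. The helper name `clean_dataset`, the `-999.0` sentinel, and the pH values are hypothetical; any fault test could be passed in instead.

```python
def clean_dataset(readings, is_faulty):
    """Drop readings that the supplied predicate marks as faulty.
    Returns the kept readings and the fraction that was removed."""
    kept = [r for r in readings if not is_faulty(r)]
    removed = len(readings) - len(kept)
    fraction = removed / len(readings) if readings else 0.0
    return kept, fraction

# Hypothetical pH readings where -999.0 marks a failed sample.
ph = [7.1, 7.0, -999.0, 7.2, -999.0]
kept, fraction_removed = clean_dataset(ph, lambda r: r == -999.0)
print(kept, fraction_removed)
```

Reporting the removed fraction alongside the cleaned data makes the loss in data quantity visible to whoever runs the analysis.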
Faults persist for a number of reasons, including: First, faults can be difficult to define and identify
Faults persist partly because they are difficult to define. [Figure: nitrate data taken from nearby locations in a nitrate deployment in the riverbed of the Merced River] Which one is correct? Are they both correct? Are they both faulty?
Faults persist for a number of reasons, including: first, faults can be difficult to define and identify; second, faults are not always worth fixing.
Not all faults need to be fixed [Schoellhammer ‘08]. Maintenance can be expensive. And if the analysis can happen without the faulty data, then what’s the point of fixing it? [Figure: two temperature vs. depth plots]
Faults persist for a number of reasons, including: first, faults can be difficult to define and identify; second, faults are not always worth fixing. Answering these questions is hard.
Incomplete, ad-hoc, or last-minute solutions for addressing faults only exacerbate the problem. Whatever the solution for addressing faults (and there are many), it should be incorporated into the design and implementation of the system right from the beginning.
Nithya Ramanathan Thank You
Collecting usable sensor data from a networked system is never easy. Whether the data consists of images or nitrate levels from a chemistry sensor, faults can reduce the quantity and quality of the collected information. When ignored, faults in a dataset can lead to ambiguous, or worse, incorrect conclusions. Unfortunately, faults in networked sensing systems are painfully common. Faults persist partly because they are difficult to define, and even once identified, they are not always worth fixing. Incomplete, ad-hoc, or last-minute solutions for addressing faults only exacerbate the problem. Whatever the solution for addressing faults (and there are many), it should be incorporated into the design and implementation of the system right from the beginning.