400 likes | 538 Views
PORTAL Data Quality & Aggregation . Sue Ahn, Kristin Tufte. Road Map. Initial Data Cleaning (more cleaning later) Set null values as appropriate Set Data Quality Flags based on row-level data integrity violations Aggregate Calculations Calculate 5-minute aggregates
E N D
PORTAL Data Quality & Aggregation Sue Ahn, Kristin Tufte
Road Map • Initial Data Cleaning (more cleaning later) • Set null values as appropriate • Set Data Quality Flags based on row-level data integrity violations • Aggregate Calculations • Calculate 5-minute aggregates • Calculate 15-min, 1-hour aggregates • Portal Output Calculations • Timeseries • Grouped Data • Performance
Road Map: Initial Data Cleaning • Set NULL values • Data Quality Flags
Set Null Values as Appropriate • Case 1: No Data Received from Detector • ODOT indicates vol = occ = speed = -1, status = 0 means no data received from detector ** (status 0 is inhibited) • We set volume, occupancy and speed to NULL in this case • Case 2: No Activity at Detector • ODOT indicates vol = occ = speed = 0, status = 0 means no activity at detector ** • We set speed to NULL in this case • Case 3: Disabled (?) • Also get readings: vol = occ = speed = -1, status = 1 (status 1 is disabled) • Similar to Case 1, except status = 1, so assume this does mean disabled • We set volume, occupancy and speed to NULL in this case • Should verify the meaning of vol = occ = speed = -1, status = 1 with ODOT • ** as of 2-14-05; fields were different before • No -999’s in data received live; there are -999s in 2001, 2002 data
Data Quality Flags • Detect data inconsistencies and erroneous values • Row-level/Reading-level flags • Flags are set for each (vol, occ, speed) reading we get • For each reading (row), a flag is set to indicate if that row fails a particular criteria • Examples • Max Speed, set if speed > 100 mph • Max Occ, set if occ > 95% • Adapted from TTI (see wiki for link)
Data Quality Flags – Single Value Flags • Maximum Volume (DQ_MAXVOL) • Condition: volume > 17 (20-sec data) • Comments: Equivalent to 3060 vph; may be caused by motorcycles or lane changes for example • Maximum occupancy (DQ_MAXOCC) • Condition: occupancy > 95% • Comments: detectors may be “stuck-on” • Maximum speed (DQ_MAXSPD) • Condition: speed > 100 mph • Comments: based on empirical evidence • Minimum speed (DQ_MINSPD) • Condition: speed < 5 mph • Comments: based on empirical evidence
Data Quality Flags – Multi-variate Consistency • Multi-variate consistency #1 (DQ_MVC_S0VGT0) • Speed = 0 and Volume > 0 • Indicates potential problem with: speed • Is this necessary? We already eliminate speed < 5 mph • Multi-variate consistency #2 (DQ_MVC_SGT0V0) • Speed > 0 and Volume = 0 • Indicates potential problem with: speed • Multi-variate consistency #3 (DQ_MVC_OGT0V0) • Occupancy > 0 and Volume = 0 • Note: alternative: set occupancy = 0 • Indicates potential problem with: occupancy
Road Map: Aggregate Calculations • 5-minute aggregates • 15-min, 1-hr aggregates
Missing Values in 5-min Aggregates of Vol, Speed, Occ • Raw Aggregates • Sum/avg data with no imputation, rescaling or cleaning • Processed Aggregates • Two issues: • #1 Is there enough data to calculate an aggregate? • #2 If there is enough data, how to handle the fact that we are missing values?
Missing Values in 5-min Aggregates of Vol, Speed, Occ, Cont’d • Issue #1: Determining if a time period has enough valid data to calculate aggregates for the time period • Aggregate vol, speed, occ set to null if less than X% of the readings are valid • Do sensitivity analysis to pick X (start with 75% - set 5-min aggregate to null if we have less than 75% of the data)
Bad Data Threshold • Rob - In general, throughout the development of the aggregation process, we consistently encountered the issue of how to determine when there is enough data to create an aggregate vs. when an aggregate value should be declared null because of lack of data. • For starters, we’ve chosen a threshold of 75%. If 75% or more of the 20-second readings in a 5-minute interval are valid, we calculate an aggregate for that 5-minute interval. • The same issue arises in the following situations: • Calculating 5-min aggregates from 20-second data • Calculating 15-min aggregates from 5-min data • Calculating 1-hr aggregates from 15-min data • Grouped Data: grouping over multiple days (group by hour) • Grouped Data: aggregating 1-hr data to daily data for group by day of week and week of year • Related issues: • Aggregating over lanes (how to handle the situation when data is missing from one lane, note that data may be missing from a lane for only a portion of a day) • We wonder if studying this percentage (data sensitivity) would make a good paper topic.
#2 Handling Missing Values • Two basic methods for dealing with missing values • 1) Use preceding/following values (replace missing volume with preceding volume reading for that detector) • 2) Aggregate values in a 5 minute range and scale based on # of readings • Since we are aggregating, suggest method 2 • We are aggregating, so interpolation may not be necessary • Think 2 will perform similarly to 1 in cases when only a few readings are missing, when lots of readings missing, should report that data is not available (null) • Method 2 is well-defined (simple to define and express/explain) • Method 2 is easier to implement than 1
5-min Aggregates (vol, spd, occ) • Volume • For each detector, sum all valid volume readings in 5-minute period (eliminate any readings with non-zero flags) and rescale by # of missing readings • Speed, Occupancy • For each detector, calculate the volume-weighted average of all valid readings in a 5-minute period • Assume 20-second readings are space-mean speed • Weight by volume (to be technically correct, we need to weight by density, but that requires estimating density)
5-Min Aggregate – Other Measures • Vehicle Miles Traveled (VMT) • 5-min VMT = 5-min volume * length of midpoint influence area of associated detector • Vehicle Hours Traveled (VHT) • 5-min VHT = 5-min volume * (length of midpoint influence area / 5-min space-mean speed) • Travel Time • 5-min travel time = length of midpoint influence area / (5-min space-mean speed)
5-min Aggregates – Other – Cont’d • Total Delay (in veh-mins) • 5-min delay = 5-min volume * (travel time – (influence area / free-flow speed)) • Where: free-flow speed = min {assumed free-flow speed, actual 5-min speed} • Note: Free-flow speed is computed in this way in order to avoid negative delay • Average Delay/Vehicle (in minutes) • 5-min delay = Total Delay/5-min volume • Should we report Average or Total Delay or both?
Summary: Which values/measures affect which calculations? 20-sec volume 5-min volume 5-min VMT * 20-sec speed 5-min speed 5-min VHT 5-min occupancy 5-min Travel Time 20-sec occupancy ** 5-min Delay * May be needed, depends on method used ** Need 5-min speed to calculate free-flow speed
Road Map: 15-min, 1-hr Aggregates • Issues: • Calculated from 5-min aggregates or from 20-second data? • We chose to use 5-min data • What should the threshold be for % of bad data? • Chose to toss 15-min reading if we are missing more than one 5-min reading • For now, toss 1 hr reading if we have less than 75% of 5-min readings (or less than 9 readings) Note the “bad data” threshold issue arises here, also see next page.
Aggregation over Larger Intervals • 15-min aggregates are calculated from 5-min aggregates, set aggregate value to null if missing more than one 5-minute reading • 1-hr aggregates are calculated from 5-min aggregates, set aggregate value to null if missing more than three 5-minute readings
Aggregation over Time, Cont’d • Volume • Sum 5-min volumes • Occupancy • Volume-weighted average of 5-min occupancies • Speed • Volume-weighted average of 5-min speeds • Vehicle Miles Traveled • sum of 5-min VMTs • Vehicle Hours Traveled • sum of 5-min VHTs • Travel Time • Volume-weighted average of 5-min travel times • Delay • Total Delay: Sum of 5-min delays • Average Delay: Volume-weighted average of 5-min delays
Road Map: PORTAL Output • We have described how to do the 5-min, 15-min and 1-hr aggregations • For each page in PORTAL, analyze what calculations are needed and how those calculations should be done
Timeseries – Line Graphs • Timeseries • Options: graph by station or lane, 7 basic measures, 5-min, 15-min, 1-hr granularity of input data • For Lane Graph • These values are in aggregate table (calculation described in previous slides) • Requires 5-min (15-min, 1-hr) aggregates of each value for each detector • If data is missing, put blank space in the graph • For Station Graph: Requires aggregation over lanes • Calculation described in subsequent slides • If data is missing from one or more lanes, show blank space on graph
Timeseries – Contour Plots • Contour Plot (Surface Plot?) for a Highway • Requires same calculations as for station plot on previous slide
Aggregation over Lanes • Volume • Sum • Occupancy • Volume-weighted average of by-lane occupancies • Speed • Volume-weighted average of by-lane speeds • Vehicle Miles Traveled • Sum over lanes • Vehicle Hours Traveled • Sum over lanes • Travel Time • Volume-weighted average of by-lane travel times • Average Delay • Volume-weighted average of by-lane delays • Total Delay • Sum of by-lane delays
Grouped Data – Line Plots • Line Plots by Station/Detector • Group by: 5 minute (minute of day) • Input is 5-minute aggregations • Simple arithmetic average, min, max, and std dev of data from appropriate 5-min period • Result shows variation between days • Group by: Hour of day • Input is 1-hr aggregations • As above, simple arithmetic average, min, max and std dev of data from appropriate 1-hr periods • Result shows variation between days What to do if we are missing data? What if we are missing 50% of the values for 8:00-9:00, for example? We may be missing different %s for different hour intervals??
Grouped Data – Line Plots, Cont’d • Line Plots by Station/Detector • Group by: Day of Week • Input must be 1-hr aggregations • Aggregate 1-hr data to daily data • What if one or more hours in a day is/are missing?? (What if we are missing a peak hour?) • Then simple arithmetic avg, min, max, std dev of aggregates • Result shows variation within day of week • Group by: Week of Year • Similar to day of week Note: “bad data threshold” issue again. Seems reasonable to rescale when going from 20-sec data to 5-min data, and again 5-min to 1-hr seems OK. But neither interpolating or rescaling seem really acceptable in this situation.
Aggregation to Day of Week • Volume • Sum 1-hr volumes • Occupancy • ?? Volume-weighted average of 5-min occupancies • Speed • ( 1-hr VMTs)/( 1-hr VHTs) • ensure for VMT and VHT - either are both null or neither are null • Vehicle Miles Traveled • 1-hr VMTs • Rescale??? • Vehicle Hours Traveled • 1-hr VHTs • Travel Time • Volume-weighted average of 5-min travel times • Delay • Total Delay: Sum of 5-min delays • Average Delay: Volume-weighted average of 5-min delays
WARNING: STOP HERE • Following slides are under construction (hard hats required)
Grouped Data – Highway Plots • Highway Plots • What does this do anyway??? • Group by: 5 minute (minute of day) • Requires input to be 5-minute aggregate data • Need to group 5-minute periods from different days together • Use appropriate aggregation-by-time method • Group by: Hour of day • Input can be 5-min, 15-min or 1-hr • For 1hr input need to group 1-hr periods from different days together • Use appropriate aggregation-by-time method • 5-min, 15-min input – Should we restrict this to 1-hr input?? • Just use appropriate aggregation-by-time method??
Grouped Data – Highway Plots, Cont’d • Highway Plots • Group by: Day of Week • Input is 5-min, 15-min, 1-hr • Need to group 5-min data into 5-min groups, 15-min data into 15-min groups, etc. • Use appropriate aggregation by time method?? • Group by: Week of Year • Same as Day of Week, I think
Grouped Data – All Highways • All Highways Plots
Performance Reports • MONTHLY FIGURESAvg TT Avg TS Avg VS FFTT FFTS 95th% TT • 95th% TS • 166.11 Mins 52.05 MPH 53.79 MPH 145.81 Mins 59.29 MPH 232.16 Mins • 37.24 MPH • VMTVHT% Cngst Hwy Len Days Used • 211,298,304 4,118,289 15.09% 144.09 Miles 31 • WEATHER NUMBER OF DAYS Rain ≥ 0.5" Rain ≥ 1.0" Frzn Precip Fog • 8 2 0 1 • TIME WINDOW FIGURES (primarily for daily report – BI, TTI are new) • Avg TT Avg TS 95th% TT 95th% TS • 166.11 Mins 52.05 MPH 232.16 Mins 37.24 MPH • VMT VHT % Cngst TTI BI • 211,298,304 4,118,289 15.09% 1.59 1.97
Future Flags • Calibration Error (not in this presentation, see paper) • Multi-variate consistency • Volume > nmax when Occupancy = 0 • This should be tested after correcting for calibration error • Maximum reduction in speed • Speedn+1 > 0.45*Speedn • Identical records • If more than 8 consecutive records (volume, occupancy, and speed) are identical
Questions • No vehicles present Speed = 0 when Volume = 0Set Speed = null • Am not doing this because of the -1, -1, -1, and 0, 0, 0 tests • At what point do we call a 5-minute interval invalid?? 5 missing readings? 8? 9? 10? • Aggregation over space – we need something for vol and • Aggregation over space – need to deal with missing data • Need aggregation over time (i.e. combining data from different days together)
ODOT SWARM Flags • 0 - inhibited • 1 - disabled • 2 - ok • 3 - suspect • 4 - softfail • 5 - hardfail
Data Notes • Status 0 • Received Data As of Feb 14, 2005, ODOT reports vol -1, occ -1, speed -1, status 0, for detectors with no controller communications (no data received from detector), and vol 0, occ 0, speed 0, status 0 for detectors reporting no traffic data (no activity) (info from David Gaarsoe). See Data Fidelity and minus 1 and minus 999 for more info. • Data Cleaning The nightly scripts "clean" the -1 and -999 data as follows: • Sets volume = 0, speed = NULL, occupancy = 0 where volume = -999 • Sets volume = NULL, speed = NULL, occupancy = NULL, where volume = -1 • I don’t belive we receive -999 data anymore (KT) • Unknown • We seem to also get vol = 0, speed = 0, status = 2 (or other status), with occ > 0. What does this mean? • What was that weird thing I noted in the file on badger?
Steve H’s notes from emails from ODOT • -999When we received blocks of data from 2001 and 2002, the 20-second data included records where volume=-999. This indicated that no cars passed. As of July 14, 2004 when PORTAL went live, we no longer received volume= -999. -999 are no longer an issue. • -1 From July 14, 2004 to November 17, 2004, we were receiving records with volume=-1. In all cases, when volume=-1, the SWARM algorithm variable status=2, which is the code for "OK." As I understand it, this is supposed to represent a communication error. • From November 18, 2004 to February 13, 2005, we did not receive any records where volume = -1. This is due to "Transport ATMS build 2.1, distributed on Nov 17, included many internal changes to traffic sensor and detector station data, (including station level data smoothing). Among other changes it started sending a 0 detector volume during Communications Failure with the Wapiti controllers, as noted by Robert Bertini." -David Gaarsoe • During this time period, it is impossible to distinguish between communication errors and periods where no cars pass. • Beginning on February 14, 2005, it appears that Transport ATMS build 2.3 was released. "Transport ATMS build 2.3 will ... once again report a -1 vol, -1 occupancy, and -1 speed, Inhibited status (0) for Traffic Detectors with no controller communications, and a 0 vol, 0 occupancy, 0 speed, Inhibited status for detectors of reporting controllers with no traffic data (no activity)." -David Gaarsoe • It appears that, beginning on February 14, 2005 we are now able to accurately distinguish between communication errors and periods where no vehicles pass a detector. • http://wiki.its.pdx.edu/wiki/portal/moin.cgi/ODOT_-1_Email
Data Quality Flags - Implementation • Queries or scripts analyze the data and set flags when criteria are met • We will be able to tell which data values failed which criteria
Aggregation over Lanes/Corridors • Volume • Over lanes: Sum • (average volume over detector stations doesn’t seem useful) • Occupancy • Over lanes: Weighted average of by-lane occupancies by volumes • (average occupancy over detector stations doesn’t seem useful) • Speed • Over lanes: Weighted average of by-lane speeds by volumes • Over detector stations: Weighted average of speeds by volumes (?)
Aggregation over Space, Cont’d • Vehicle Miles Traveled • Sum over lanes or loop detector stations • Vehicle Hours Traveled • Sum over lanes or loop detector stations • Travel Time • Over lanes: Weighted average of travel time by volumes • Over loop detector stations: Sum of segment-wide travel times • Delay • Over lanes: Weighted average of travel time by volumes • Over loop detector stations: Sum of segment-wide travel times