Smarter Outlier Detection and Deeper Understanding of Large-Scale Taxi Trip Records: A Case Study of NYC. 游晟佑, 2.6.2013.
About Author • Introduction • Backgrounds and Related Works • Method and Discussions • Experiments and Results • Conclusion • Critique • Appendix
About Author • Jianting Zhang • Assistant professor in Geographical Information Systems (GIS) • CS@CUNY City College, CS@CUNY Graduate Center • A member of the Geospatial Technologies and Environmental Cyberinfrastructure (GeoTECI) Lab
Backgrounds and Related Works • Detecting outliers (whole-trip detection / point detection): few related works • Easiest way: thresholds (for example: trip >= 30 km, trip <= 200 meters) (Antrip: point detection) • A more general way: compute the distribution of a measurement (location, distance, duration) • Special aspects: if the pickup / drop-off location (point detection) falls in certain land use types (e.g. lake / river), it is an outlier
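The two general approaches on this slide can be sketched in a few lines. This is a minimal illustration only: the function names are hypothetical, the 0.2 km / 30 km limits are the slide's example thresholds, and the z-score rule is just one assumed way of using "the distribution of a measurement" (the slide does not say which statistic is used).

```python
from statistics import mean, stdev

def threshold_outliers(distances_km, low_km=0.2, high_km=30.0):
    """Flag trips that are too short (<= low_km) or too long (>= high_km)."""
    return [d <= low_km or d >= high_km for d in distances_km]

def zscore_outliers(values, z_max=3.0):
    """Flag values more than z_max sample standard deviations from the mean.

    Assumption: a z-score cut-off stands in for the distribution-based approach.
    """
    mu = mean(values)
    sigma = stdev(values)
    if sigma == 0:
        # All values are identical, so nothing stands out.
        return [False] * len(values)
    return [abs(v - mu) / sigma > z_max for v in values]
```

A fixed threshold is simple but has to be chosen by hand; a distribution-based rule adapts to the data but is sensitive to how skewed the measurement is.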
Method and Discussions • Shortest path (A* algorithm; Dijkstra) • Why shortest path? The goal is only to detect outliers, and distance is by nature an important factor. • Instead of the generic algorithms, they use CH (Contraction Hierarchies), developed by the Karlsruhe Institute of Technology (KIT, in Germany), as implemented in MoNav (an open source package) • Why CH? CH is designed specifically for road networks and achieves significantly higher efficiency than the generic algorithms.
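For reference, a minimal sketch of the generic baseline mentioned on this slide, plain Dijkstra on a small weighted graph. It is not Contraction Hierarchies and does not use MoNav; the graph representation (a dict mapping each node to a list of `(neighbor, length)` pairs) and the function name are assumptions made for illustration.

```python
import heapq

def dijkstra(graph, source, target):
    """Return the shortest distance from source to target, or None if unreachable.

    graph: dict mapping node -> list of (neighbor, edge_length) pairs.
    """
    dist = {source: 0.0}
    heap = [(0.0, source)]
    while heap:
        d, u = heapq.heappop(heap)
        if u == target:
            return d
        if d > dist.get(u, float("inf")):
            continue  # stale heap entry; a shorter route to u was already found
        for v, length in graph.get(u, []):
            nd = d + length
            if nd < dist.get(v, float("inf")):
                dist[v] = nd
                heapq.heappush(heap, (nd, v))
    return None

# Toy street network: going A -> B -> C (1 + 2) beats the direct edge A -> C (5).
toy_graph = {"A": [("B", 1.0), ("C", 5.0)], "B": [("C", 2.0)], "C": []}
```

Running one such search per trip is what becomes expensive at the scale of millions of trips, which is the motivation the slide gives for a road-network-specific method such as CH.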
Method and Discussions (Cont.) • Inputs: • Each taxi trip has the following attributes: pickup location, pickup time, drop-off location, drop-off time, recorded distance • A street network with N nodes and M edges
Method and Discussions (Cont.) • Four steps to detect outliers: • 1. If a pickup or drop-off location cannot be snapped to its nearest street segment within a reasonable distance (D0), the trip is considered a Type I outlier • 2. Compute the unique combinations of pickup and drop-off nodes over all trips • 3. Generate the shortest paths for these combinations (using the MoNav-CH module) • 4. Compare the computed shortest-path distances with the recorded distances: if a computed distance is greater than a threshold D1 and is W times longer than the recorded distance, the record is marked as a Type II outlier
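The four steps can be sketched as a single function. This is an illustration under stated assumptions, not the authors' implementation: `snap` and `shortest_distance` are hypothetical callables supplied by the caller (standing in for the street-snapping step and the MoNav-CH module), trips are plain dicts with hypothetical keys, and all distances are assumed to be in consistent units.

```python
def classify_trips(trips, snap, shortest_distance, d0, d1, w):
    """Label each trip as 'type I', 'type II', or 'ok'.

    trips: list of dicts with keys 'pickup', 'dropoff', 'recorded' (hypothetical layout).
    snap(location) -> (street_node, snapping_distance)   # assumed helper
    shortest_distance(node_a, node_b) -> network distance # assumed helper
    """
    labels = [None] * len(trips)
    od_of_trip = {}

    # Step 1: trips that cannot be snapped within D0 are Type I outliers.
    for i, trip in enumerate(trips):
        p_node, p_gap = snap(trip["pickup"])
        d_node, d_gap = snap(trip["dropoff"])
        if p_gap > d0 or d_gap > d0:
            labels[i] = "type I"
        else:
            od_of_trip[i] = (p_node, d_node)

    # Step 2: unique pickup/drop-off node combinations.
    unique_pairs = set(od_of_trip.values())

    # Step 3: one shortest-path computation per unique combination.
    computed = {pair: shortest_distance(*pair) for pair in unique_pairs}

    # Step 4: compare computed and recorded distances.
    for i, pair in od_of_trip.items():
        c = computed[pair]
        if c > d1 and c > w * trips[i]["recorded"]:
            labels[i] = "type II"
        else:
            labels[i] = "ok"
    return labels
```

Deduplicating node pairs in step 2 means the routing step runs once per distinct pair rather than once per trip, which matters when many trips share the same endpoints.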
Experiments and Results • Data and Experimental Setting • Data from NAVTEQ; 166 million taxi trips in NYC in 2009 • Each trip has more than 20 attributes, but only some of them are used
Experiments and Results (Cont.) • Distributions of Trip Distances, Time, Speed and Fare
Experiments and Results (Cont.) • Results on taxi trip outlier detection • Parameters: D0 = 200 feet, D1 = 3, W = 2 • Of the 166 million trips, 1.5% fall into Type I outliers
Experiments and Results (Cont.) • Most trips occur in Midtown and Downtown
Experiments and Results (Cont.) • About 18,000 trips fall into Type II outliers • Recorded vs. calculated distances (changing D1 to catch more outliers would also yield more false positives)
Experiments and Results (Cont.) • Results on Betweenness Centrality (generated by MoNav-CH) • Just a by-product • Betweenness centrality: for an edge or a node, the total number of shortest paths passing through it • Used to gauge how important an edge or node is in the network
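The definition on this slide (count the shortest paths that pass over each edge) can be sketched as follows. This is a simplified illustration, not MoNav-CH output: it uses plain Dijkstra, keeps only one shortest path per origin-destination pair (standard betweenness centrality splits credit among tied paths), and the graph layout and function names are assumptions.

```python
import heapq
from collections import Counter

def shortest_path(graph, source, target):
    """Return one shortest path as a list of nodes, or None if unreachable.

    graph: dict mapping node -> list of (neighbor, edge_length) pairs.
    """
    dist = {source: 0.0}
    prev = {}
    heap = [(0.0, source)]
    while heap:
        d, u = heapq.heappop(heap)
        if u == target:
            break
        if d > dist[u]:
            continue  # stale heap entry
        for v, length in graph.get(u, []):
            nd = d + length
            if nd < dist.get(v, float("inf")):
                dist[v] = nd
                prev[v] = u
                heapq.heappush(heap, (nd, v))
    if target not in dist:
        return None
    # Walk the predecessor links back from target to source.
    path = [target]
    while path[-1] != source:
        path.append(prev[path[-1]])
    path.reverse()
    return path

def edge_path_counts(graph, od_pairs):
    """Count, for each directed edge, how many shortest paths traverse it."""
    counts = Counter()
    for s, t in od_pairs:
        path = shortest_path(graph, s, t)
        if path is None:
            continue
        for u, v in zip(path, path[1:]):
            counts[(u, v)] += 1
    return counts
```

Because the shortest paths are already computed for outlier detection, tallying them per edge costs little extra, which is why the slide calls the result a by-product.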
Conclusion • The method detects outliers effectively
Critique • It would be better to classify errors by source (device / human / etc.), not just delete all "suspects" • The following is a side note, unrelated to outlier detection but related to trip planning: • Geospatial data alone is not enough for trip planning • It would be better to take real-time data into account (for example: traffic congestion, or peak vs. off-peak hours, which may differ) • The shortest path cannot be the only consideration: what if, for the same OD (origin-destination) pair, a 4-mile route takes 20 mins while a 5-mile route takes 15 mins?
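The closing question can be made concrete with the slide's own numbers. The snippet below is only an illustration; the route names and dict layout are hypothetical.

```python
# Two candidate routes for the same origin-destination pair (numbers from the slide).
routes = [
    {"name": "4-mile route", "miles": 4, "minutes": 20},
    {"name": "5-mile route", "miles": 5, "minutes": 15},
]

def average_mph(route):
    """Average speed in miles per hour."""
    return route["miles"] * 60 / route["minutes"]

shortest = min(routes, key=lambda r: r["miles"])
fastest = min(routes, key=lambda r: r["minutes"])
```

Here the shortest route averages 12 mph while the longer one averages 20 mph, so ranking by distance and ranking by travel time pick different routes.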
Appendix • MoNav is a desktop / mobile application that offers state-of-the-art fast and exact routing with OpenStreetMap data. • http://wiki.openstreetmap.org/wiki/MoNav • http://code.google.com/p/monav/