Yongping Zhang Kouros Mohammadian, PhD Department of Civil and Materials Engineering

This project explores utilizing Bayesian MCMC with Gibbs Sampling to enhance data transferability, focusing on variables like land-use and transportation. It aims to predict household travel patterns accurately using existing datasets and advanced techniques.


Presentation Transcript


  1. Enhancing the Quality of Transferred Household Travel Survey Data: A Bayesian Updating Approach Using MCMC with Gibbs Sampling • Yongping Zhang • Kouros Mohammadian, PhD • Department of Civil and Materials Engineering, University of Illinois at Chicago • The 11th TRB National Transportation Planning Applications Conference • May 7, 2007

  2. Data Transferability • The idea is to use data collected in one context in a new context. This can reduce or eliminate the need for large-scale data collection in the application context. • Previous Studies • ITE trip generation tables • NCHRP 365 (Nancy McGuckin et al.) • Highly aggregate • ORNL’s NPTS/NHTS transferability study (Pat Hu et al.) • Aggregate (CT level) • Data simulation (Stopher and Greaves) • Disaggregate (HH level), C&RT classification method, limited number of independent variables

  3. Project Approach • Consider a larger set of variables • NHTS and CTPP datasets • Use quantifiable variables that can be easily predicted or are available from other sources (e.g., PUMS) • Consider variables representing land use, urban form, and transportation system characteristics • Advanced clustering, updating, and simulation approaches

  4. Data • Data Sources • 2001 NHTS, 2000 CTPP, PUMS, 2003 TTI, Tiger/Line GIS data files • Data Cleaning • 33 variables of demographics, socio-economics and land use • Individual level: Age group, Race/Ethnicity, Education, Occupation • Household level: HH size, Income, Adults, Vehicles, Drivers, Workers • Census tract level: Housing, Employment, and Population densities • New Variables

  5. New Variables • Intersection density (Tiger/Line) • No. of intersections / Area • Road density (Tiger/Line) • Road length / Area • Pedestrian environment (Tiger/Line) • Block size: Road length / No. of intersections • Transit friendly environment (CTPP) • Transit users / Total no. of workers • Transit trips / Total no. of trips • Congestion factor • Travel time index (TTI report for 85 MSAs) • Avg. travel time / Free flow TT in that region
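
The ratios on this slide can be sketched in a few lines of Python. This is only an illustration: the `TractGeometry` container, its field names, the units, and the example numbers are assumptions, not taken from the study's data files.

```python
from dataclasses import dataclass


@dataclass
class TractGeometry:
    """Hypothetical per-tract inputs (names and units are illustrative)."""
    area_sq_mi: float
    road_length_mi: float
    intersections: int


def built_environment_indicators(t: TractGeometry) -> dict:
    """Compute the three Tiger/Line-based ratios listed on the slide."""
    return {
        # No. of intersections / Area
        "intersection_density": t.intersections / t.area_sq_mi,
        # Road length / Area
        "road_density": t.road_length_mi / t.area_sq_mi,
        # Block size: Road length / No. of intersections
        "block_size": (t.road_length_mi / t.intersections
                       if t.intersections else float("nan")),
    }


def transit_share(transit_users: float, total_workers: float) -> float:
    """Transit users / Total no. of workers."""
    return transit_users / total_workers


def travel_time_index(avg_travel_time: float, free_flow_time: float) -> float:
    """Avg. travel time / Free flow travel time."""
    return avg_travel_time / free_flow_time


# Made-up tract: 2 sq mi, 30 road miles, 120 intersections.
example = built_environment_indicators(TractGeometry(2.0, 30.0, 120))
```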

  6. Dependent Variables • Travel Characteristics (from NHTS trip file aggregated to HH level) • VMT for each household • No. of trips • No. of mandatory trips • No. of maintenance trips • No. of discretionary trips • No. of transit trips in the HH • No. of private vehicle trips • No. of non-motorized (bicycles and walk) trips • No. of tours • Average trips per tour • Average trip distance in miles for all HH members • No. of transit users in the HH • No. of carpool users in the HH • Percentage of public transit usage in the HH • Percentage of carpool usage among workers in the HH • Total commute distance in the HH • Average commute distance in the HH
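
A minimal sketch of rolling trip records up to the household level, as the slide describes for the NHTS trip file. The record layout, category labels, and the definition of the transit percentage (transit trips over all trips) are assumptions made for illustration.

```python
from collections import defaultdict

# Hypothetical trip records: (household_id, purpose, mode, miles)
trips = [
    ("H1", "mandatory", "auto", 12.0),
    ("H1", "maintenance", "walk", 0.5),
    ("H1", "discretionary", "transit", 4.0),
    ("H2", "mandatory", "transit", 8.0),
    ("H2", "mandatory", "auto", 10.0),
]


def aggregate_to_household(trip_records):
    """Aggregate trip-level rows into household-level travel characteristics."""
    hh = defaultdict(lambda: {"trips": 0, "mandatory": 0, "transit": 0,
                              "nonmotorized": 0, "miles": 0.0})
    for hh_id, purpose, mode, miles in trip_records:
        row = hh[hh_id]
        row["trips"] += 1
        row["miles"] += miles
        if purpose == "mandatory":
            row["mandatory"] += 1
        if mode == "transit":
            row["transit"] += 1
        if mode in ("walk", "bicycle"):
            row["nonmotorized"] += 1
    for row in hh.values():
        row["avg_trip_miles"] = row["miles"] / row["trips"]
        # Assumed definition: share of the household's trips made by transit.
        row["pct_transit"] = 100.0 * row["transit"] / row["trips"]
    return dict(hh)


household_table = aggregate_to_household(trips)
```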

  7. Clustering • Classification schema is a critical issue • Clustering methods tested include: K-Means, hierarchical, C&RT, TwoStep, ANN • 11 clusters were generated using TwoStep clustering method • ONLY national data is used

  8. Clusters • Rich and Smart: • middle age families • professional or managerial white collar jobs • graduate degrees • high incomes • majority live in suburbs • greater part are White but also some Asian • Young Achievers: • young couples without children or mainly with pre-school children • college degrees • white collar jobs in sales, service, technical, and professional • mid-range income • higher percentages live in suburb or rural areas • Kids-centered Families: • middle aged and working class families • pre-school and school age children • usually have college education • mid-range to high level income • primarily White and live in suburb or town

  9. Clusters, cont. • Rural Blues: • working class, middle aged families • pre-school and school age children • mainly high school graduates • blue collar jobs (farming, manufacturing, etc.) • low to mid-range income • greater part are White and mainly live in rural areas or small towns • Working Mixing Pot: • working class White, Black, Asian, or Hispanic • single adults or couples • college or high school education • low to mid-range income • Mainstream Families: • mid-scale, upper mid age, White • large working class couples or families with older children • college or high school education • mid-range to high level income • suburb or rural areas

  10. Clusters, cont. • Senior Couples: • senior couples • majority working and some are retired • greater part are White but include some Black, Asian, or American-Indians • suburb or rural areas • Sustaining Minority Families: • low income • middle aged, working class families • mainly Hispanic or Black but also some Asian and White • majority have not finished high school • service, sales, manufacturing, farming, or construction jobs • Forever Youngs: • White senior couples, empty nesters • mostly retired but some have sales, service, or managerial jobs • low to mid-range income

  11. Clusters, cont. • Traditional Seniors: • mainly retired single individuals and some retired couples • low income. • majority are White but some Black, Asian, or American-Indians • Neo Urbans: • Small families/couples or single individuals • dense urban areas • college education • low to mid-range income • sales, service, or professional jobs • dominant race is White but a significant number are Black, Asian, and Hispanic

  12. Cluster-Based Travel Characteristics

  13. Transferability • An ANN model (with genetic algorithm) is used to simulate cluster membership as a function of 11 factors for each HH in the add-on datasets • The model has 92.4% prediction potential • Travel characteristics are transferred from national clusters to add-on data according to their cluster membership • Weighted observed and predicted travel characteristics are then compared
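
The transfer-and-compare step can be sketched as follows. Cluster membership is taken as given here (the ANN model is not reproduced), and the cluster means, survey weights, and observed values are made-up numbers used only to show the mechanics.

```python
# Hypothetical national cluster means of trips per person (illustrative only).
national_cluster_mean = {1: 4.2, 2: 3.8, 3: 3.1}

# Hypothetical add-on households:
# (predicted_cluster, survey_weight, observed_trips_per_person)
addon = [
    (1, 150.0, 4.0),
    (1, 100.0, 4.6),
    (2, 200.0, 3.5),
    (3, 50.0, 3.0),
]


def weighted_mean(values, weights):
    """Survey-weighted mean."""
    return sum(v * w for v, w in zip(values, weights)) / sum(weights)


def compare_by_cluster(households, cluster_means):
    """Return {cluster: (weighted observed, transferred value, difference)}."""
    out = {}
    for c in sorted({h[0] for h in households}):
        members = [h for h in households if h[0] == c]
        observed = weighted_mean([m[2] for m in members],
                                 [m[1] for m in members])
        predicted = cluster_means[c]  # value transferred from the national cluster
        out[c] = (observed, predicted, predicted - observed)
    return out


comparison = compare_by_cluster(addon, national_cluster_mean)
```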

  14. Comparison of Weighted Trip Count per Person

  15. Comparison of Weighted Mandatory Trips per Person

  16. Original Comparison of Transit Usage • Not so good: some clusters need improvement. • Compared with the number of trips, transit usage is predicted less accurately. • Clusters 5, 8, 10, and 11 show significant differences and need improvement.

  17. Improvement to Clusters Using C&RT • 1. The first level of the tree is grown on the number of vehicles in the household (owns a vehicle or not). • 2. The model improvement due to each level is defined as improvement / (variance of Node 0). • For example, here 0.0017 corresponds to 13.3%, 0.0009 to 7.05%, and 0.0002 to 1.57%. • Total model improvement is about 22%.
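
As a rough sketch of the improvement measure, the snippet below computes the variance reduction from a single split on vehicle ownership and divides it by the root-node variance. The household values are invented, and the variance-reduction formula is the usual regression-tree definition, assumed here rather than confirmed by the slides. The last lines simply add up the per-level percentages quoted on the slide.

```python
from statistics import pvariance

# Hypothetical households: (owns_vehicle, transit share of trips)
households = [
    (False, 0.60), (False, 0.40),
    (True, 0.05), (True, 0.10), (True, 0.00), (True, 0.05),
]


def split_improvement(rows):
    """Return (absolute improvement, improvement / variance of Node 0)."""
    y = [value for _, value in rows]
    left = [value for owns, value in rows if not owns]
    right = [value for owns, value in rows if owns]
    root_var = pvariance(y)  # variance of Node 0
    # Size-weighted variance remaining in the two child nodes.
    within = (len(left) * pvariance(left) + len(right) * pvariance(right)) / len(y)
    improvement = root_var - within
    return improvement, improvement / root_var


improvement, relative = split_improvement(households)

# Per-level percentages as reported on the slide.
slide_levels_pct = [13.3, 7.05, 1.57]
total_pct = sum(slide_levels_pct)  # about 22
```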

  18. Considering Distributions: Trip Rate • A close match is shown here; however, this is not always the case. • How can the transferability be improved?

  19. Considering Distributions: Trip Distance • Not so good; needs to be improved.

  20. Considering Distributions • Various distributions were fitted to the dataset, including: • Normal, Gamma, Weibull, Exponential, Max Extreme, Lognormal, Logistic, Student’s t, Min Extreme, Triangular, General Beta, Pareto, Uniform, Binomial, Geometric, Hypergeometric, and Poisson. • The fitting results are interpreted by • examining the rankings of the three fit statistics • A-D, K-S, and Chi-squared statistics • visually inspecting the plots (density and cumulative curves) • p-values and critical values at different significance levels. • Non-normal distributions are dominant (e.g., Gamma)
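
One way to picture the ranking by fit statistic is the sketch below, which computes the K-S distance for three candidate distributions using only the Python standard library. It is a simplified stand-in for the authors' procedure: only the K-S statistic is used, only three candidates are tried, parameters are estimated by simple moment matching, and the data are synthetic. In practice a statistics package would normally handle the fitting and testing.

```python
import math
import random
from statistics import NormalDist, mean, stdev


def ks_statistic(sample, cdf):
    """One-sample Kolmogorov-Smirnov distance between the data and a CDF."""
    xs = sorted(sample)
    n = len(xs)
    d = 0.0
    for i, x in enumerate(xs, start=1):
        f = cdf(x)
        # Compare the fitted CDF with the empirical CDF just after and before x.
        d = max(d, i / n - f, f - (i - 1) / n)
    return d


def rank_candidates(sample):
    """Fit each candidate to positive data, then sort by K-S distance."""
    m, s = mean(sample), stdev(sample)
    logs = [math.log(x) for x in sample]
    normal = NormalDist(m, s)
    log_normal = NormalDist(mean(logs), stdev(logs))
    candidates = {
        "normal": normal.cdf,
        "exponential": lambda x: 1.0 - math.exp(-x / m),
        "lognormal": lambda x: log_normal.cdf(math.log(x)),
    }
    scores = {name: ks_statistic(sample, cdf) for name, cdf in candidates.items()}
    return sorted(scores.items(), key=lambda item: item[1])


# Synthetic stand-in for a trip-distance variable (not survey data).
rng = random.Random(1)
trip_distance = [rng.gammavariate(2.0, 3.0) for _ in range(500)]
ranking = rank_candidates(trip_distance)  # best-fitting candidate first
```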

  21. Gamma Distribution • Gamma function Γ(k) • k > 0 is the shape parameter • θ > 0 is the scale parameter • The location parameter determines where the origin is located • PDF and CDF
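
For reference, the standard three-parameter Gamma distribution can be written as below. The slide's own formulas are not preserved in this transcript, so this is the textbook form rather than a copy of the slide; the symbol $L$ for the location parameter is an assumption.

```latex
% Gamma function
\Gamma(k) = \int_0^{\infty} t^{k-1} e^{-t}\, dt

% PDF with shape k > 0, scale \theta > 0, location L (symbol L is assumed)
f(x; k, \theta, L) = \frac{(x - L)^{k-1}\, e^{-(x - L)/\theta}}{\Gamma(k)\, \theta^{k}},
\qquad x > L

% CDF, with \gamma(\cdot,\cdot) the lower incomplete gamma function
F(x; k, \theta, L) = \frac{\gamma\!\left(k, \frac{x - L}{\theta}\right)}{\Gamma(k)}

% Mean and standard deviation
\mathrm{E}[X] = L + k\theta, \qquad \mathrm{SD}[X] = \sqrt{k}\,\theta
```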

  22. Fitted Distribution with Parameters for each Variable by Cluster

  23. Bayesian Updating • Local updating can significantly improve the quality of the transferred data • Used Bayesian updating • Traditionally in transferability literature only variables with normal distributions have been studied due to the simplicity in calculation of posterior from normal prior and likelihood. • In practice, the variables of interest (i.e., the likelihood) can take various distributional forms.

  24. Bayesian Updating • f(x|θ) is the probability function for the observed data x (i.e., the local sample), given the unknown parameter θ • g(θ) is the prior distribution for θ • k(θ|x) is the posterior distribution for θ given the observed data x • The technique can be extended to situations in which no prior data are available. • The analyst can update successively, using new information without losing the gains from the old.
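
In the slide's notation, Bayes' rule and the successive-updating property can be written as follows. The second line assumes that the two samples $x_1$ and $x_2$ are conditionally independent given $\theta$; those sample labels are introduced here for illustration.

```latex
% Posterior from prior and likelihood
k(\theta \mid x) = \frac{f(x \mid \theta)\, g(\theta)}
                        {\int f(x \mid \theta')\, g(\theta')\, d\theta'}
\;\propto\; f(x \mid \theta)\, g(\theta)

% Successive updating: the posterior after x_1 serves as the prior for x_2
k(\theta \mid x_1, x_2) \;\propto\; f(x_2 \mid \theta)\, k(\theta \mid x_1)
```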

  25. Bayesian Updating (2) • The national sample of NHTS 2001 is used as the source for the prior information • A small local sample is randomly selected from the NY add-on, leaving the rest for validation • The bootstrap method is used to resample the data and justify the prior distribution assumptions for the parameters of interest (i.e., scale and shape for Normal distribution) • Normal distribution is fitted to each of the resampled datasets.
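
The bootstrap step might look roughly like the sketch below. It assumes the variable of interest is Gamma distributed (as the following slide states for the updated variables) and uses a method-of-moments fit as a stand-in for whatever estimator the authors applied; the data are synthetic and the function names are hypothetical.

```python
import random
from statistics import mean, pvariance, stdev


def gamma_moments_fit(sample):
    """Method-of-moments estimates (shape k, scale theta) for a Gamma variable."""
    m = mean(sample)
    v = pvariance(sample)
    return m * m / v, v / m


def bootstrap_gamma_params(sample, n_resamples=500, seed=0):
    """Refit the distribution on resamples drawn with replacement."""
    rng = random.Random(seed)
    fits = []
    for _ in range(n_resamples):
        resample = rng.choices(sample, k=len(sample))
        fits.append(gamma_moments_fit(resample))
    return fits


# Synthetic stand-in for one travel variable in the national sample.
rng = random.Random(42)
national = [rng.gammavariate(2.0, 3.0) for _ in range(400)]

fits = bootstrap_gamma_params(national)
shapes = [k for k, _ in fits]
scales = [theta for _, theta in fits]

# The spread of the bootstrap estimates can inform the prior on each parameter.
prior_summary = {
    "shape_mean": mean(shapes), "shape_sd": stdev(shapes),
    "scale_mean": mean(scales), "scale_sd": stdev(scales),
}
```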

  26. Bayesian Updating (3) • Then, Markov Chain Monte Carlo (MCMC) simulation with Gibbs Sampling is utilized to update the prior with the small local sample. • Assuming the updated variables of interest are still Gamma distributed, the posteriors of the parameters are used to derive the updated means and SDs of the variables. • Updated parameters are then compared with the validation data and national data to test the effectiveness of the updating procedure. • The comparisons show that significant improvement is achieved. • The improvement increases with the local sample size • a relatively cost-effective sample size is suggested
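
A minimal sketch of this kind of update for Gamma-distributed data is shown below. It is not the authors' implementation: the slides do not give the priors or the sampler details, so this version assumes Gamma priors on both parameters with made-up hyperparameters, and because the shape parameter has no standard conditional distribution it uses a Metropolis step inside the Gibbs loop (Metropolis-within-Gibbs). The function name, chain length, and step size are illustrative.

```python
import math
import random


def gibbs_gamma(data, prior_shape=(2.0, 1.0), prior_rate=(1.0, 1.0),
                n_iter=4000, burn_in=1000, step=0.15, seed=0):
    """Draw (shape k, scale theta) from the posterior of Gamma-distributed data.

    prior_shape = (c0, d0): assumed Gamma(shape, rate) prior on k.
    prior_rate  = (a0, b0): assumed Gamma(shape, rate) prior on the rate 1/theta.
    """
    rng = random.Random(seed)
    n = len(data)
    sum_x = sum(data)
    sum_log_x = sum(math.log(x) for x in data)
    c0, d0 = prior_shape
    a0, b0 = prior_rate

    def log_cond_k(k, rate):
        # log p(k | rate, data), up to an additive constant
        return (n * k * math.log(rate) - n * math.lgamma(k)
                + (k - 1.0) * sum_log_x + (c0 - 1.0) * math.log(k) - d0 * k)

    k = 1.0  # starting value
    draws = []
    for it in range(n_iter):
        # Gibbs step: rate | k, data ~ Gamma(a0 + n*k, rate = b0 + sum_x)
        rate = rng.gammavariate(a0 + n * k, 1.0 / (b0 + sum_x))
        # Metropolis step for k with a random walk on log k
        proposal = k * math.exp(rng.gauss(0.0, step))
        log_accept = (log_cond_k(proposal, rate) - log_cond_k(k, rate)
                      + math.log(proposal) - math.log(k))  # log-scale Jacobian
        if log_accept >= 0.0 or rng.random() < math.exp(log_accept):
            k = proposal
        if it >= burn_in:
            draws.append((k, 1.0 / rate))
    return draws


# Synthetic local sample of 75 observations (stand-in for the add-on data).
rng = random.Random(7)
local_sample = [rng.gammavariate(2.5, 2.0) for _ in range(75)]

draws = gibbs_gamma(local_sample)
# Updated mean and SD of the variable, averaged over posterior draws.
updated_mean = sum(k * theta for k, theta in draws) / len(draws)
updated_sd = sum(math.sqrt(k) * theta for k, theta in draws) / len(draws)
```

In a real application the prior hyperparameters would come from the national sample rather than the placeholder values used here.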

  27. Root Mean Square Error (RMSE) decreases with the increase of sample size. • There is instability when the sample size within each cluster is smaller than 45 observations. • A sample size of 75 per cluster seems to be the most cost-effective plan.
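
The relationship between RMSE and sample size can be illustrated with a small simulation. This sketch uses the plain sample mean of synthetic Gamma data rather than the Bayesian-updated estimates, so it shows only the general pattern, not the study's actual error curve; the sample sizes and distribution parameters are assumptions.

```python
import math
import random


def rmse(predicted, observed):
    """Root mean square error between two equal-length sequences."""
    if len(predicted) != len(observed) or not predicted:
        raise ValueError("inputs must be non-empty and of equal length")
    squared = sum((p - o) ** 2 for p, o in zip(predicted, observed))
    return math.sqrt(squared / len(predicted))


def rmse_by_sample_size(sizes, true_shape=2.0, true_scale=3.0,
                        replications=200, seed=0):
    """RMSE of the sample mean against the true mean, for each sample size."""
    rng = random.Random(seed)
    true_mean = true_shape * true_scale
    result = {}
    for n in sizes:
        estimates = []
        for _ in range(replications):
            sample = [rng.gammavariate(true_shape, true_scale) for _ in range(n)]
            estimates.append(sum(sample) / n)
        result[n] = rmse(estimates, [true_mean] * replications)
    return result


curve = rmse_by_sample_size([15, 45, 75, 150])
```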

  28. Updating Results • Updated mean values move significantly closer to the validation data.

  29. Summary of Updating Results

  30. Population Synthesizing and Travel Data Simulation • Using PUMS Data, NYC population is synthesized. • All of the contextual factors were calculated for each HH. • Synthetic population with all required 33 variables was generated. • Using the ANN model, cluster memberships are obtained. • Travel data are simulated for each HH using Monte Carlo simulation of each travel attribute with updated parameters of the fitted distributions.
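
The final simulation step can be sketched as a Monte Carlo draw per household from the distribution fitted to its cluster. The parameter values, household identifiers, and the choice of a single Gamma-distributed attribute are all hypothetical; cluster membership is taken as given.

```python
import random

# Hypothetical updated Gamma parameters (shape k, scale theta) per cluster
# for one travel attribute, e.g. trip distance per person.
updated_params = {1: (2.4, 3.1), 2: (1.8, 2.5)}

# Hypothetical synthetic households: (household_id, cluster membership)
synthetic_households = [("HH001", 1), ("HH002", 2), ("HH003", 1)]


def simulate_travel(households, params, seed=0):
    """Draw one simulated value per household from its cluster's distribution."""
    rng = random.Random(seed)
    simulated = {}
    for hh_id, cluster in households:
        k, theta = params[cluster]
        simulated[hh_id] = rng.gammavariate(k, theta)
    return simulated


simulated = simulate_travel(synthetic_households, updated_params)
```

Repeating the draw for each travel attribute, with that attribute's own fitted parameters, would give a full simulated record per household.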

  31. Comparison of Simulated and Add-on NYC Samples (Trips per Person)

  32. Comparison of Simulated and Add-on NYC Samples (Trip Distance per Person)
