Figure from da_b_da on Unsplash.
The code applies GBDT (Gradient Boosting Decision Trees), via the gbm
package in R, to the Onboard Travel Survey data to explore the relationship between people's walking distance to access transit stops and the built environment while controlling for trip and demographic variables. If you find it useful for your work, please consider citing:
Tao, T., Wang, J., & Cao, X. (2020). Exploring the non-linear associations between spatial attributes and walking distance to transit. Journal of Transport Geography, 82, 102560. https://doi.org/10.1016/j.jtrangeo.2019.102560
## load the package
library(gbm)
## Loaded gbm 2.1.8
## load the dataset
urbanloc <- read.csv('DisToTransit_final_data.csv')
Tune the tree depth via grid search. Please note that this step is computationally expensive and time-consuming.
## create a grid to store the parameters and results
hyper_grid <- expand.grid(
  interaction.depth = seq(1, 49), ## tree depth from 1 to 49
  optimal_trees = 0, ## to store the number of iterations with the best model prediction performance
  min_RMSE = 0 ## to store the corresponding RMSE
)
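As a side note, expand.grid() enumerates every combination of the values it is given, so the same pattern extends to tuning several hyperparameters at once. A minimal standalone sketch (the values below are illustrative, not from the tutorial):

```r
## Small illustration of expand.grid(): one row per combination of the
## supplied hyperparameter values (values here are only for demonstration).
grid_demo <- expand.grid(
  interaction.depth = c(1, 5, 10),
  shrinkage = c(0.001, 0.01)
)
nrow(grid_demo) # 3 * 2 = 6 rows, one per combination
```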
## grid search over tree depth
for (i in 1:nrow(hyper_grid)) {
  set.seed(2017) ## set seed for reproducibility
  gbm.tune <- gbm(
    WalkDis ~ PeakHou + TripDis + NumTrans + WorkDes + ## formula
      PerPopWorAgeS + Per0CarHouS + PerLowWagWorHomS + PopDenS +
      JobDenS + EmHouMixS + NetDenS + IntDenS +
      Mal + Whi + Blk + You + Sen + HouInc + Lab + Veh + DriLic + TraPas,
    data = urbanloc,
    distribution = 'gaussian', ## distribution of the dependent variable
    n.trees = 50000, ## number of trees
    shrinkage = 0.001, ## shrinkage (learning rate)
    interaction.depth = hyper_grid$interaction.depth[i], ## tree depth, i.e. complexity of each tree
    cv.folds = 5 ## five-fold cross-validation
  )
  ## record the best iteration count and its RMSE in the grid
  hyper_grid$optimal_trees[i] <- which.min(gbm.tune$cv.error)
  hyper_grid$min_RMSE[i] <- sqrt(gbm.tune$cv.error[hyper_grid$optimal_trees[i]])
}
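Once the loop finishes, the row of hyper_grid with the smallest cross-validated RMSE identifies the preferred depth. A minimal sketch of that lookup, using a small synthetic grid in place of the real one (in the tutorial, hyper_grid is filled by the loop above):

```r
## Selecting the tree depth with the lowest CV RMSE from a tuning grid.
## The grid here is synthetic, standing in for the filled hyper_grid.
hyper_grid_demo <- data.frame(
  interaction.depth = 1:5,
  min_RMSE = c(310, 301, 298, 296.5, 297)
)
best_row <- hyper_grid_demo[which.min(hyper_grid_demo$min_RMSE), ]
best_row$interaction.depth # depth with the smallest cross-validated RMSE
```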
Plot the result.
par(mar = c(5, 5, 2, 5))
with(hyper_grid, plot(min_RMSE, type = 'l', col = 'red3', xlab = 'Tree depth', ylab = 'RMSE'))
par(new = TRUE)
with(hyper_grid, plot(optimal_trees, type = 'l', axes = FALSE, xlab = NA, ylab = NA, cex = 1.2))
axis(side = 4)
mtext(side = 4, line = 3, 'Number of iterations')
Based on the plot, the RMSE is relatively stable beyond a tree depth of 35 and does not change very much. We therefore select 35 as the tree depth, plug it into the function, and fit the model again.
set.seed(2017) ## set seed for reproducibility
gbm_fit <- gbm(
  WalkDis ~ PeakHou + TripDis + NumTrans + WorkDes +
    PerPopWorAgeS + Per0CarHouS + PerLowWagWorHomS + PopDenS +
    JobDenS + EmHouMixS + NetDenS + IntDenS +
    Mal + Whi + Blk + You + Sen + HouInc + Lab + Veh + DriLic + TraPas,
  data = urbanloc,
  distribution = "gaussian",
  n.trees = 5000, ## smaller than before: we already know the optimal number of iterations is below 5000, which saves computation time
  shrinkage = 0.001,
  interaction.depth = 35, ## plug in the selected value
  cv.folds = 5
)
Plot the performance. In the figure below, the green line is the squared-error loss on the cross-validation hold-out data and the black line is the loss on the training data.
## check tree number of the best performance under 5-fold cross-validation
par(mfrow = c(1, 1))
best.iter <- gbm.perf(gbm_fit, method = 'cv')
print(best.iter) ## number of iterations with best performance under CV
## [1] 2559
print(sqrt(gbm_fit$cv.error[best.iter])) ## RMSE
## [1] 295.1982
Plot the relative importance.
gbm_sum <- summary(gbm_fit, n.trees = best.iter) ## based on the estimated best number of trees
print(gbm_sum)
## var rel.inf
## TripDis TripDis 16.0034580
## EmHouMixS EmHouMixS 10.4903397
## PopDenS PopDenS 8.8773591
## PerPopWorAgeS PerPopWorAgeS 8.5209898
## NetDenS NetDenS 8.4809508
## JobDenS JobDenS 7.9549339
## PerLowWagWorHomS PerLowWagWorHomS 7.7437103
## Per0CarHouS Per0CarHouS 7.2182527
## HouInc HouInc 6.2177585
## IntDenS IntDenS 5.8218621
## NumTrans NumTrans 1.8508780
## Mal Mal 1.4821970
## DriLic DriLic 1.3968895
## PeakHou PeakHou 1.1596758
## WorkDes WorkDes 1.1361091
## Veh Veh 1.0545709
## Whi Whi 1.0313629
## Lab Lab 0.9835090
## Blk Blk 0.8096615
## TraPas TraPas 0.7768748
## You You 0.7070546
## Sen Sen 0.2816023
Plot the partial dependence plots (using population density as an example).
Population Density
a = 0.8 # degree of smoothing for the partial dependence plot, from 0 to 1
z = 'PopDenS' # variable name
y = 'Population Density (people/acre)' # description of the variable
par(mfrow = c(1, 1))
c <- plot(gbm_fit, which(gbm_fit$var.names == z)[[1]], best.iter, return.grid = TRUE)
plot(c, type = "l",
     ylab = "Predicted walking distance (meter)",
     xlab = paste0(y, ", ", round(gbm_sum$rel.inf[gbm_sum$var == z], 1), "%")) # label with relative importance
s <- smooth.spline(c, spar = a) # smooth the curve
lines(s, col = "blue")
grid(nx = NULL, ny = NULL, col = "lightgray", lty = "dotted") # add grid lines
box(lty = 'solid')
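The spar argument deserves a word: it sets how aggressively smooth.spline() flattens the raw partial dependence curve. A standalone illustration on synthetic data (not the survey data) of how two spar values behave:

```r
## Illustration of the spar argument of smooth.spline() on synthetic data:
## a low spar follows the noise, a high spar gives a much smoother curve.
set.seed(1)
x <- seq(0, 10, length.out = 200)
y <- sin(x) + rnorm(200, sd = 0.3)
s_rough  <- smooth.spline(x, y, spar = 0.2) # follows the noise closely
s_smooth <- smooth.spline(x, y, spar = 0.8) # much smoother curve
plot(x, y, col = "grey", ylab = "y")
lines(predict(s_rough, x), col = "red")
lines(predict(s_smooth, x), col = "blue")
```

The smoother fit necessarily tracks the data points less closely, which is the trade-off being made when spar = 0.8 is used for the partial dependence plots above.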
Pseudo R-squared
The pseudo \(R^2\) is calculated with the equation below, where \(\hat{y}_i\) are the model predictions and \(\bar{y}\) is the mean of the observed values.
\[ R^2 = 1-\frac{\sum_i (y_i-\hat{y}_i)^2}{\sum_i (y_i-\bar{y})^2} \]
y <- urbanloc$WalkDis
yhat <- predict(gbm_fit, newdata = urbanloc, n.trees = best.iter)
ybar <- mean(y)

rss <- sum((y - yhat)^2)
tss <- sum((y - ybar)^2)

(rsq <- 1 - rss/tss)
## [1] 0.1922996
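As a sanity check on the formula, the same calculation applied to an ordinary least-squares fit (on synthetic data here, not the survey data) reproduces the Multiple R-squared that summary.lm() reports:

```r
## Sanity check of the pseudo R-squared formula on synthetic data:
## for a linear model with an intercept, 1 - RSS/TSS equals the
## Multiple R-squared reported by summary().
set.seed(42)
x <- rnorm(100)
y <- 2 * x + rnorm(100)
fit <- lm(y ~ x)
yhat <- predict(fit)
rss <- sum((y - yhat)^2)
tss <- sum((y - mean(y))^2)
rsq <- 1 - rss / tss
all.equal(rsq, summary(fit)$r.squared) # TRUE
```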
We used the code below to estimate a linear regression model and to calculate relative importance based on the decomposition of R-squared (the lmg metric) with the relaimpo
package.
## fit a linear regression
lm_fit <- lm(WalkDis ~ PeakHou + TripDis + NumTrans + WorkDes +
               PerPopWorAgeS + Per0CarHouS + PerLowWagWorHomS + PopDenS +
               JobDenS + EmHouMixS + NetDenS + IntDenS +
               Mal + Whi + Blk + You + Sen + HouInc + Lab + Veh + DriLic +
               TraPas, data = urbanloc)
## present the result of linear regression
summary(lm_fit)
##
## Call:
## lm(formula = WalkDis ~ PeakHou + TripDis + NumTrans + WorkDes +
## PerPopWorAgeS + Per0CarHouS + PerLowWagWorHomS + PopDenS +
## JobDenS + EmHouMixS + NetDenS + IntDenS + Mal + Whi + Blk +
## You + Sen + HouInc + Lab + Veh + DriLic + TraPas, data = urbanloc)
##
## Residuals:
## Min 1Q Median 3Q Max
## -523.44 -210.63 -77.21 127.46 1330.45
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 45.5696 53.7256 0.848 0.3964
## PeakHou -15.8585 7.5781 -2.093 0.0364 *
## TripDis 11.1908 1.3214 8.469 < 2e-16 ***
## NumTrans -36.6695 7.9113 -4.635 3.64e-06 ***
## WorkDes 19.6218 8.7062 2.254 0.0242 *
## PerPopWorAgeS 99.9722 49.5179 2.019 0.0435 *
## Per0CarHouS 4.9132 31.3249 0.157 0.8754
## PerLowWagWorHomS 150.8263 81.3212 1.855 0.0637 .
## PopDenS -1.5719 0.6163 -2.550 0.0108 *
## JobDenS 0.8760 0.7425 1.180 0.2381
## EmHouMixS 91.3099 21.6367 4.220 2.47e-05 ***
## NetDenS 1.5807 0.7252 2.180 0.0293 *
## IntDenS 0.5503 0.2148 2.562 0.0104 *
## Mal 13.5543 7.2514 1.869 0.0616 .
## Whi 18.2739 10.3289 1.769 0.0769 .
## Blk 20.3240 11.0195 1.844 0.0652 .
## You 29.0074 19.4217 1.494 0.1353
## Sen -30.1860 18.1199 -1.666 0.0958 .
## HouInc 0.9506 2.5073 0.379 0.7046
## Lab 20.6445 10.8316 1.906 0.0567 .
## Veh -21.9839 8.9819 -2.448 0.0144 *
## DriLic -12.0865 9.0248 -1.339 0.1805
## TraPas 22.1232 12.0683 1.833 0.0668 .
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 294.6 on 6629 degrees of freedom
## (1235 observations deleted due to missingness)
## Multiple R-squared: 0.02713, Adjusted R-squared: 0.0239
## F-statistic: 8.401 on 22 and 6629 DF, p-value: < 2.2e-16
library(relaimpo)
RI_R2 <- calc.relimp(lm_fit, type = c("lmg"), rela = TRUE)
plot(RI_R2)
1. You can save the fitted gbm
model to an RDS file with the code below.
saveRDS(gbm_fit, 'gbm_fit.rds')
The next time you want to use the model result, you can read the fitted gbm
object back with the code below (note that readRDS returns the object, so assign it).
gbm_fit <- readRDS('gbm_fit.rds')
This is useful for large, complex models that are expensive and slow to refit.
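The save-and-reload round trip works the same way for any R object. A self-contained sketch using a small lm fit on the built-in cars dataset in place of the large gbm fit:

```r
## Round-trip sketch: serialize a fitted model to .rds and read it back.
## A small lm object on the built-in cars data stands in for the gbm fit.
fit <- lm(dist ~ speed, data = cars)
path <- tempfile(fileext = ".rds") # temporary file for this demo
saveRDS(fit, path)
fit2 <- readRDS(path) # assign the returned object
identical(coef(fit), coef(fit2)) # TRUE: the coefficients survive unchanged
```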