Figure from da_b_da on Unsplash.

This code applies GBDT (Gradient Boosting Decision Tree), via the gbm package in R, to the Onboard Travel Survey data to explore the relationship between people’s walking distance to transit stops and the built environment, while controlling for trip and demographic variables. If you find it useful for your work, please consider citing our paper:

Tao, T., Wang, J., & Cao, X. (2020). Exploring the non-linear associations between spatial attributes and walking distance to transit. Journal of Transport Geography, 82, 102560. https://doi.org/10.1016/j.jtrangeo.2019.102560

1 Package and data

## load the package
library(gbm)
## Loaded gbm 2.1.8
## load the dataset
urbanloc <- read.csv('DisToTransit_final_data.csv')

2 Modelling

Tune the hyperparameters with a grid search over tree depth. Please note that this step is computationally expensive and can take a long time to run.

## create a grid to store the candidate parameters and results
hyper_grid <- expand.grid(
  interaction.depth = seq(1, 49),  ## candidate tree depths from 1 to 49
  optimal_trees = 0, ## to store the number of iterations with the best predictive performance
  min_RMSE = 0 ## to store the minimum cross-validated RMSE
)

## grid search over tree depth
for (i in 1:nrow(hyper_grid)) {
  
  set.seed(2017) ## set seed for reproducibility
  gbm.tune <- gbm(
    WalkDis ~ PeakHou + TripDis + NumTrans + WorkDes ## formula
    + PerPopWorAgeS + Per0CarHouS + PerLowWagWorHomS + PopDenS +
      JobDenS + EmHouMixS + NetDenS + IntDenS
    + Mal + Whi + Blk + You + Sen + HouInc + Lab + Veh + DriLic +
      TraPas,
    data = urbanloc, 
    distribution = 'gaussian', ## distribution of the dependent variable
    n.trees = 50000, ## number of trees
    shrinkage = 0.001, ## shrinkage (learning rate)
    interaction.depth = hyper_grid$interaction.depth[i], ## tree depth or complexity of tree
    cv.folds = 5 ## five-fold cross validation
  )
  
  # store the optimal number of trees and the minimum cross-validated RMSE
  hyper_grid$optimal_trees[i] <- which.min(gbm.tune$cv.error)
  hyper_grid$min_RMSE[i] <- sqrt(gbm.tune$cv.error[hyper_grid$optimal_trees[i]])
}
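
After the loop finishes, you can also inspect the grid numerically and pull out the depth with the lowest cross-validated RMSE (below we ultimately pick the depth where the RMSE curve flattens, not necessarily the strict minimum):

## show the row of the grid with the lowest cross-validated RMSE
hyper_grid[which.min(hyper_grid$min_RMSE), ]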

3 Result visualization

Plot the tuning results: the cross-validated RMSE (red) and the optimal number of iterations at each tree depth.

par(mar = c(5, 5, 2, 5))
with(hyper_grid, plot(min_RMSE, type = 'l', col = 'red3',
                      xlab = 'Tree depth', ylab = 'RMSE'))
par(new = TRUE)
with(hyper_grid, plot(optimal_trees, type = 'l', axes = FALSE,
                      xlab = NA, ylab = NA, cex = 1.2))
axis(side = 4)
mtext(side = 4, line = 3, 'Number of iterations')

Based on the plot, the RMSE becomes relatively stable once the tree depth exceeds 35 and does not change much afterwards. We therefore select 35 as the tree depth, plug it into the function, and run the model again.

set.seed(2017) ## set seed for reproducibility

gbm_fit <- gbm(
  WalkDis ~ PeakHou + TripDis + NumTrans + WorkDes
  + PerPopWorAgeS + Per0CarHouS + PerLowWagWorHomS + PopDenS +
    JobDenS + EmHouMixS + NetDenS + IntDenS
  + Mal + Whi + Blk + You + Sen + HouInc + Lab + Veh + DriLic +
    TraPas,
  data = urbanloc,
  distribution = "gaussian",
  n.trees = 5000, ## a smaller value suffices here: we already know the optimal number of iterations is below 5,000, which saves computation time
  shrinkage = 0.001,
  interaction.depth = 35, ## plug in the selected value here
  cv.folds = 5
)

Plot the performance. In the figure below, the green line is the squared error loss on the cross-validation test folds, and the black line is the squared error loss on the training dataset.

## find the number of trees with the best performance under 5-fold cross-validation
par(mfrow = c(1, 1))
best.iter <- gbm.perf(gbm_fit, method = 'cv')

print(best.iter)   ## number of iterations with best performance under CV
## [1] 2559
print(sqrt(gbm_fit$cv.error[best.iter])) ## RMSE
## [1] 295.1982

Plot the relative importance of the predictors.

gbm_sum <- summary(gbm_fit, n.trees = best.iter) ## based on the estimated best number of trees

print(gbm_sum)
##                               var    rel.inf
## TripDis                   TripDis 16.0034580
## EmHouMixS               EmHouMixS 10.4903397
## PopDenS                   PopDenS  8.8773591
## PerPopWorAgeS       PerPopWorAgeS  8.5209898
## NetDenS                   NetDenS  8.4809508
## JobDenS                   JobDenS  7.9549339
## PerLowWagWorHomS PerLowWagWorHomS  7.7437103
## Per0CarHouS           Per0CarHouS  7.2182527
## HouInc                     HouInc  6.2177585
## IntDenS                   IntDenS  5.8218621
## NumTrans                 NumTrans  1.8508780
## Mal                           Mal  1.4821970
## DriLic                     DriLic  1.3968895
## PeakHou                   PeakHou  1.1596758
## WorkDes                   WorkDes  1.1361091
## Veh                           Veh  1.0545709
## Whi                           Whi  1.0313629
## Lab                           Lab  0.9835090
## Blk                           Blk  0.8096615
## TraPas                     TraPas  0.7768748
## You                           You  0.7070546
## Sen                           Sen  0.2816023

Plot the partial dependence plots (using population density as an example).

Population Density

a <- 0.8 # degree of smoothing for the partial dependence plot, from 0 to 1
z <- 'PopDenS' # variable name
y <- 'Population Density (people/acre)' # description of the variable
par(mfrow = c(1, 1))
pd <- plot(gbm_fit, which(gbm_fit$var.names == z)[[1]], best.iter, return.grid = TRUE) # extract the partial dependence grid
plot(pd, type = "l",
     ylab = "Predicted walking distance (meter)",
     xlab = paste0(y, ", ", round(gbm_sum$rel.inf[gbm_sum$var == z], 1), "%")) # label the x-axis with the relative importance
s <- smooth.spline(pd, spar = a) # smooth the curve
lines(s, col = "blue") # overlay the smoothed curve
grid(nx = NULL, ny = NULL, col = "lightgray", lty = "dotted") # add a grid
box(lty = 'solid')
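
The same steps can be wrapped in a loop to draw partial dependence plots for several predictors at once. A minimal sketch, using three built-environment variables from the model (the axis descriptions for JobDenS and NetDenS are our illustrative placeholders):

## loop over several predictors; descriptions are illustrative
vars <- c('PopDenS', 'JobDenS', 'NetDenS')
labs <- c('Population Density (people/acre)', 'Job Density', 'Network Density')
par(mfrow = c(1, 3))
for (k in seq_along(vars)) {
  pd <- plot(gbm_fit, which(gbm_fit$var.names == vars[k])[[1]], best.iter, return.grid = TRUE)
  plot(pd, type = 'l',
       ylab = 'Predicted walking distance (meter)',
       xlab = paste0(labs[k], ', ', round(gbm_sum$rel.inf[gbm_sum$var == vars[k]], 1), '%'))
  lines(smooth.spline(pd, spar = a), col = 'blue') # reuse the smoothing parameter from above
  grid(nx = NULL, ny = NULL, col = 'lightgray', lty = 'dotted')
  box(lty = 'solid')
}
par(mfrow = c(1, 1))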

Pseudo \(R^2\)

The pseudo \(R^2\) is calculated with the equation below.

\[ R^2 = 1-\frac{\sum_{i}(y_i-\hat{y}_i)^2}{\sum_{i}(y_i-\bar{y})^2} \]

y <- urbanloc$WalkDis
yhat <- predict(gbm_fit, newdata = urbanloc, n.trees = best.iter)
ybar <- mean(y)

rss <- sum((y - yhat)^2)
tss <- sum((y - ybar)^2)

(rsq <- 1 - rss/tss)
## [1] 0.1922996

4 Comparison with linear regression

We use the code below to estimate a linear regression model and calculate relative importance with the relaimpo package, based on each predictor's averaged contribution to \(R^2\) (the lmg metric).

## fit a linear regression
lm_fit <- lm(WalkDis ~ PeakHou + TripDis + NumTrans + WorkDes
             + PerPopWorAgeS + Per0CarHouS + PerLowWagWorHomS + PopDenS +
               JobDenS + EmHouMixS + NetDenS + IntDenS
             + Mal + Whi + Blk + You + Sen + HouInc + Lab + Veh + DriLic +
               TraPas,
             data = urbanloc)

## present the result of linear regression
summary(lm_fit)
## 
## Call:
## lm(formula = WalkDis ~ PeakHou + TripDis + NumTrans + WorkDes + 
##     PerPopWorAgeS + Per0CarHouS + PerLowWagWorHomS + PopDenS + 
##     JobDenS + EmHouMixS + NetDenS + IntDenS + Mal + Whi + Blk + 
##     You + Sen + HouInc + Lab + Veh + DriLic + TraPas, data = urbanloc)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -523.44 -210.63  -77.21  127.46 1330.45 
## 
## Coefficients:
##                  Estimate Std. Error t value Pr(>|t|)    
## (Intercept)       45.5696    53.7256   0.848   0.3964    
## PeakHou          -15.8585     7.5781  -2.093   0.0364 *  
## TripDis           11.1908     1.3214   8.469  < 2e-16 ***
## NumTrans         -36.6695     7.9113  -4.635 3.64e-06 ***
## WorkDes           19.6218     8.7062   2.254   0.0242 *  
## PerPopWorAgeS     99.9722    49.5179   2.019   0.0435 *  
## Per0CarHouS        4.9132    31.3249   0.157   0.8754    
## PerLowWagWorHomS 150.8263    81.3212   1.855   0.0637 .  
## PopDenS           -1.5719     0.6163  -2.550   0.0108 *  
## JobDenS            0.8760     0.7425   1.180   0.2381    
## EmHouMixS         91.3099    21.6367   4.220 2.47e-05 ***
## NetDenS            1.5807     0.7252   2.180   0.0293 *  
## IntDenS            0.5503     0.2148   2.562   0.0104 *  
## Mal               13.5543     7.2514   1.869   0.0616 .  
## Whi               18.2739    10.3289   1.769   0.0769 .  
## Blk               20.3240    11.0195   1.844   0.0652 .  
## You               29.0074    19.4217   1.494   0.1353    
## Sen              -30.1860    18.1199  -1.666   0.0958 .  
## HouInc             0.9506     2.5073   0.379   0.7046    
## Lab               20.6445    10.8316   1.906   0.0567 .  
## Veh              -21.9839     8.9819  -2.448   0.0144 *  
## DriLic           -12.0865     9.0248  -1.339   0.1805    
## TraPas            22.1232    12.0683   1.833   0.0668 .  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 294.6 on 6629 degrees of freedom
##   (1235 observations deleted due to missingness)
## Multiple R-squared:  0.02713,    Adjusted R-squared:  0.0239 
## F-statistic: 8.401 on 22 and 6629 DF,  p-value: < 2.2e-16
library(relaimpo)
RI_R2 <- calc.relimp(lm_fit, type = c("lmg"), rela = TRUE)
plot(RI_R2)
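
To compare the two rankings directly, the lmg shares can be extracted from the relaimpo result and printed as percentages alongside the rel.inf column of gbm_sum above; a minimal sketch:

## lmg shares as percentages, sorted from largest to smallest
lm_imp <- sort(RI_R2@lmg, decreasing = TRUE) * 100
print(round(lm_imp, 1))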

5 Tips

1. You can save the fitted gbm model to an RDS file with the code below.

saveRDS(gbm_fit, 'gbm_fit.rds')

The next time you want to use the model results, you can read the fitted gbm object back with the code below.

gbm_fit <- readRDS('gbm_fit.rds')

This is useful for large, complex models that are expensive to fit.

6 References

1. Gradient Boosting Machines.
2. Elith, J., Leathwick, J. R., & Hastie, T. (2008). A working guide to boosted regression trees. Journal of Animal Ecology, 77, 802–813.

Thanks for reading! We appreciate your time!