Beitbridge Time Series Modelling

Author

Sparkgeo

The following notebook is not complete.

The intention of this notebook is to attempt to model border wait time based on the border wait time data provided by World Bank. In the future, this was intended to be correlated and/or modelling with remote sensing derived traffic density index (TDI).

To approach this problem, we can use a time-series analysis technique. Since we are given data over three years, we can split it into training and testing datasets to evaluate the model’s performance. Here are the steps I would take:

1. Data exploration and cleaning: Explore the dataset to check for any missing or inconsistent data. Also, visualize the data to understand its patterns and trends. This was done in our gps_eda.qmd notebook.

2. Feature engineering: Create new features that could potentially help in predicting the wait time, such as day of the week, month, holidays, and any significant events that could impact border crossing.

3. Time-series modeling: Since our data is time-series, we can use a time-series model to predict wait times. We can use various models such as Autoregressive Integrated Moving Average (ARIMA), Seasonal ARIMA (SARIMA), or Prophet.

4. Model validation: After training the model, we can use the test dataset to evaluate its performance. We can use metrics such as Mean Absolute Error (MAE) and Mean Squared Error (MSE) to evaluate the model’s accuracy.

5. Visualization: Finally, we can visualize the predicted wait times against the actual wait times to understand the model’s performance visually.

Caveats and uncertainties:

1. Corrupt area: Since there has been reports of corruption, we can not guarantee the authenticity of the data. There could be bribes to cross the border quicker than others.

2. External factors: There could be external factors such as political instability, social unrest, or natural disasters, which could impact the border wait times. Since our dataset does not include these factors, our model might not perform well in predicting such events.

3. Limited dataset: We only have three years of data, which might not be enough to capture long-term trends and patterns accurately. If there are significant changes in the border crossing policies or infrastructure, our model might not be able to capture such changes.

Overall, while a time-series model could help predict border wait times, we need to be aware of the caveats and uncertainties that could impact the model’s accuracy.

Read in the libraries and data

library(readr)
library(dplyr)
library(ggplot2)
library(skimr)
library(lubridate)
library(tidymodels)
library(forecast)
library(tidyverse)
library(tsibble)
library(caret)
library(mlbench)

border_data = read_csv("../data/processed/Beitbridge_Counts_Wait_Time_2018_2022.csv")

Subsets

SA - Zimbabwe

sa_zimbabwe = border_data %>% filter(Direction == "SA-Zimbabwe")

Zimbabwe - SA

zimbabwe_sa = border_data %>% filter(Direction == "Zimbabwe-SA")

Plotting

Convert to time series

border_data_ts <- sa_zimbabwe 
border_data_ts$datetime <- ymd_h(str_c(border_data_ts $StartDate, border_data_ts $StartHour))
border_data_ts = border_data_ts%>% select(datetime, Median_Minutes, Count_Events) %>% 
  as_tsibble(index = datetime)

daily_counts <- sa_zimbabwe%>%
  group_by(date=date(StartDate))%>%
  summarize(Daily_Counts = sum(Count_Events,na.rm=TRUE))%>%
  ungroup()%>%
  as_tsibble(index=date)

Moving averages

Creating moving averages for 5 days, 7 days and 14 days.

daily_counts = daily_counts %>% mutate(ma_5 = ma(Daily_Counts, 5), ma_7=ma(Daily_Counts,7), ma_14 = ma(Daily_Counts, 14))

daily_counts %>%
  #filter(year(date)==2021)%>%
  ggplot()+
  geom_line(aes(x=date, y=Daily_Counts, col="Daily Counts"))+
  geom_line(aes(x=date, y=ma_14, col="14-ma"))+
  scale_colour_manual("Legend",values = c("14-ma" = "blue", "Daily Counts"="grey")) +
  labs(y = "Daily Counts", x = "Date")

Zoom in on year 2021

daily_counts %>%
  filter(year(date)==2021)%>%
  ggplot()+
  geom_line(aes(x=date, y=Daily_Counts, col="Daily Counts"))+
  geom_line(aes(x=date, y=ma_14, col="14-ma"))+
  scale_colour_manual("Legend",values = c("14-ma" = "blue", "Daily Counts"="grey")) +
  labs(y = "Daily Counts", x = "Date")

Seasonal Decomposition

did not get to implement or test further due to budgetary constraints

ggAcf(daily_counts)

Caret - Linear Model

We can create additional features from the GPS data like:
- Day of week
- Month
- Year
- Hour

# Load the necessary libraries


# Read in the data
border_data <- zimbabwe_sa
#border_data<-read_csv('data.csv')
# Convert date and hour columns to datetime format
border_data$datetime <- ymd_h(paste(border_data$StartDate, border_data$StartHour))

# Create new features
border_data$day_of_week <- weekdays(border_data$datetime)
border_data$hour_of_day <- hour(border_data$datetime)
border_data$month <- month(border_data$datetime)
border_data$year <- year(border_data$datetime)
# Split the data into training and testing sets
set.seed(123)
border_data = border_data %>% filter(!is.na(Median_Minutes))
training_indices <- createDataPartition(border_data$Median_Minutes, p = 0.8, list = FALSE)
border_data_train <- border_data[training_indices,]
border_data_test <- border_data[-training_indices,]

# Define the model formula
model_formula <- Median_Minutes~ day_of_week + hour_of_day + month + year + Count_Events

# Train the model using caret
border_model <- train(model_formula, data = border_data_train, method = "lm")

# Make predictions on the testing set
border_predictions <- predict(border_model, newdata = border_data_test)

# Evaluate the model's performance using Mean Absolute Error
mae <- mean(abs(border_predictions - border_data_test$Median_Minutes))

# Print the MAE
print(mae)

[1] 678.9349

summary(border_model)


Call:
lm(formula = .outcome ~ ., data = dat)

Residuals:
   Min     1Q Median     3Q    Max 
 -1494   -605   -294     63  41854 

Coefficients:
                       Estimate Std. Error t value Pr(>|t|)    
(Intercept)          -5.088e+05  1.504e+04 -33.828  < 2e-16 ***
day_of_weekMonday    -2.519e+01  3.976e+01  -0.633   0.5264    
day_of_weekSaturday   4.554e+01  3.920e+01   1.162   0.2454    
day_of_weekSunday    -2.740e+01  3.919e+01  -0.699   0.4844    
day_of_weekThursday   1.736e+01  3.905e+01   0.445   0.6566    
day_of_weekTuesday   -6.136e+01  3.914e+01  -1.568   0.1170    
day_of_weekWednesday -7.014e+00  3.913e+01  -0.179   0.8577    
hour_of_day           1.600e+01  1.846e+00   8.670  < 2e-16 ***
month                -7.094e+00  3.093e+00  -2.294   0.0218 *  
year                  2.523e+02  7.447e+00  33.874  < 2e-16 ***
Count_Events          2.305e+01  3.662e+00   6.294 3.16e-10 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 1528 on 21174 degrees of freedom
Multiple R-squared:  0.05729,   Adjusted R-squared:  0.05684 
F-statistic: 128.7 on 10 and 21174 DF,  p-value: < 2.2e-16

Random Forests

# Load the dataset
df = border_data
df = df %>% select(Count_Events, Median_Minutes,datetime,day_of_week,year)

Important

This will take awhile to run on a laptop! Beware of running below:

# Create a train/test split
set.seed(123)
trainIndex <- createDataPartition(df$Median_Minutes, p = .8, list = FALSE)
train <- df[ trainIndex,]
test <- df[-trainIndex,]

# Fit a random forest model
set.seed(123)
rf_model <- train(Median_Minutes ~ ., data = train, method = "rf")

# Print the model results
rf_model

Random Forest 

21185 samples
    4 predictor

No pre-processing
Resampling: Bootstrapped (25 reps) 
Summary of sample sizes: 21185, 21185, 21185, 21185, 21185, 21185, ... 
Resampling results across tuning parameters:

  mtry  RMSE      Rsquared    MAE     
  2     1487.270  0.09307127  697.2415
  5     1546.652  0.06958624  713.6039
  9     1677.811  0.05062974  758.2495

RMSE was used to select the optimal model using the smallest value.
The final value used for the model was mtry = 2.

It is important to note, however, that the performance of a random forest model for predicting median border wait time would depend on the quality and relevance of the data and features used to train the model. Additionally, it may be necessary to regularly retrain the model with new data to ensure that it remains accurate over time.

TODO: - subset by years, model individual years to account for covid - use moving averages - interpolate missing median wait time values - try different ARIMA models, GAMs and Prophet library