library(readr)
library(dplyr)
library(ggplot2)
library(skimr)
library(lubridate)
library(tidymodels)
library(forecast)
library(tidyverse)
library(tsibble)
library(caret)
library(mlbench)
Beitbridge Time Series Modelling
The following notebook is not complete.
The intention of this notebook is to attempt to model border wait time based on the border wait time data provided by World Bank. In the future, this was intended to be correlated and/or modelling with remote sensing derived traffic density index (TDI).
To approach this problem, we can use a time-series analysis technique. Since we are given data over three years, we can split it into training and testing datasets to evaluate the model’s performance. Here are the steps I would take:
1. Data exploration and cleaning: Explore the dataset to check for any missing or inconsistent data. Also, visualize the data to understand its patterns and trends. This was done in our gps_eda.qmd
notebook.
2. Feature engineering: Create new features that could potentially help in predicting the wait time, such as day of the week, month, holidays, and any significant events that could impact border crossing.
3. Time-series modeling: Since our data is time-series, we can use a time-series model to predict wait times. We can use various models such as Autoregressive Integrated Moving Average (ARIMA), Seasonal ARIMA (SARIMA), or Prophet.
4. Model validation: After training the model, we can use the test dataset to evaluate its performance. We can use metrics such as Mean Absolute Error (MAE) and Mean Squared Error (MSE) to evaluate the model’s accuracy.
5. Visualization: Finally, we can visualize the predicted wait times against the actual wait times to understand the model’s performance visually.
Caveats and uncertainties:
1. Corrupt area: Since there has been reports of corruption, we can not guarantee the authenticity of the data. There could be bribes to cross the border quicker than others.
2. External factors: There could be external factors such as political instability, social unrest, or natural disasters, which could impact the border wait times. Since our dataset does not include these factors, our model might not perform well in predicting such events.
3. Limited dataset: We only have three years of data, which might not be enough to capture long-term trends and patterns accurately. If there are significant changes in the border crossing policies or infrastructure, our model might not be able to capture such changes.
Overall, while a time-series model could help predict border wait times, we need to be aware of the caveats and uncertainties that could impact the model’s accuracy.
Read in the libraries and data
= read_csv("../data/processed/Beitbridge_Counts_Wait_Time_2018_2022.csv") border_data
Subsets
SA - Zimbabwe
= border_data %>% filter(Direction == "SA-Zimbabwe") sa_zimbabwe
Zimbabwe - SA
= border_data %>% filter(Direction == "Zimbabwe-SA") zimbabwe_sa
Plotting
Convert to time series
<- sa_zimbabwe
border_data_ts $datetime <- ymd_h(str_c(border_data_ts $StartDate, border_data_ts $StartHour))
border_data_ts= border_data_ts%>% select(datetime, Median_Minutes, Count_Events) %>%
border_data_ts as_tsibble(index = datetime)
<- sa_zimbabwe%>%
daily_counts group_by(date=date(StartDate))%>%
summarize(Daily_Counts = sum(Count_Events,na.rm=TRUE))%>%
ungroup()%>%
as_tsibble(index=date)
Moving averages
Creating moving averages for 5 days, 7 days and 14 days.
= daily_counts %>% mutate(ma_5 = ma(Daily_Counts, 5), ma_7=ma(Daily_Counts,7), ma_14 = ma(Daily_Counts, 14)) daily_counts
%>%
daily_counts #filter(year(date)==2021)%>%
ggplot()+
geom_line(aes(x=date, y=Daily_Counts, col="Daily Counts"))+
geom_line(aes(x=date, y=ma_14, col="14-ma"))+
scale_colour_manual("Legend",values = c("14-ma" = "blue", "Daily Counts"="grey")) +
labs(y = "Daily Counts", x = "Date")
Zoom in on year 2021
%>%
daily_counts filter(year(date)==2021)%>%
ggplot()+
geom_line(aes(x=date, y=Daily_Counts, col="Daily Counts"))+
geom_line(aes(x=date, y=ma_14, col="14-ma"))+
scale_colour_manual("Legend",values = c("14-ma" = "blue", "Daily Counts"="grey")) +
labs(y = "Daily Counts", x = "Date")
Seasonal Decomposition
- did not get to implement or test further due to budgetary constraints
ggAcf(daily_counts)
Caret - Linear Model
We can create additional features from the GPS data like:
Day of week
Month
Year
Hour
# Load the necessary libraries
# Read in the data
<- zimbabwe_sa
border_data #border_data<-read_csv('data.csv')
# Convert date and hour columns to datetime format
$datetime <- ymd_h(paste(border_data$StartDate, border_data$StartHour))
border_data
# Create new features
$day_of_week <- weekdays(border_data$datetime)
border_data$hour_of_day <- hour(border_data$datetime)
border_data$month <- month(border_data$datetime)
border_data$year <- year(border_data$datetime)
border_data# Split the data into training and testing sets
set.seed(123)
= border_data %>% filter(!is.na(Median_Minutes))
border_data <- createDataPartition(border_data$Median_Minutes, p = 0.8, list = FALSE)
training_indices <- border_data[training_indices,]
border_data_train <- border_data[-training_indices,]
border_data_test
# Define the model formula
<- Median_Minutes~ day_of_week + hour_of_day + month + year + Count_Events
model_formula
# Train the model using caret
<- train(model_formula, data = border_data_train, method = "lm")
border_model
# Make predictions on the testing set
<- predict(border_model, newdata = border_data_test)
border_predictions
# Evaluate the model's performance using Mean Absolute Error
<- mean(abs(border_predictions - border_data_test$Median_Minutes))
mae
# Print the MAE
print(mae)
[1] 678.9349
summary(border_model)
Call:
lm(formula = .outcome ~ ., data = dat)
Residuals:
Min 1Q Median 3Q Max
-1494 -605 -294 63 41854
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) -5.088e+05 1.504e+04 -33.828 < 2e-16 ***
day_of_weekMonday -2.519e+01 3.976e+01 -0.633 0.5264
day_of_weekSaturday 4.554e+01 3.920e+01 1.162 0.2454
day_of_weekSunday -2.740e+01 3.919e+01 -0.699 0.4844
day_of_weekThursday 1.736e+01 3.905e+01 0.445 0.6566
day_of_weekTuesday -6.136e+01 3.914e+01 -1.568 0.1170
day_of_weekWednesday -7.014e+00 3.913e+01 -0.179 0.8577
hour_of_day 1.600e+01 1.846e+00 8.670 < 2e-16 ***
month -7.094e+00 3.093e+00 -2.294 0.0218 *
year 2.523e+02 7.447e+00 33.874 < 2e-16 ***
Count_Events 2.305e+01 3.662e+00 6.294 3.16e-10 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 1528 on 21174 degrees of freedom
Multiple R-squared: 0.05729, Adjusted R-squared: 0.05684
F-statistic: 128.7 on 10 and 21174 DF, p-value: < 2.2e-16
Random Forests
# Load the dataset
= border_data
df = df %>% select(Count_Events, Median_Minutes,datetime,day_of_week,year) df
This will take awhile to run on a laptop! Beware of running below:
# Create a train/test split
set.seed(123)
<- createDataPartition(df$Median_Minutes, p = .8, list = FALSE)
trainIndex <- df[ trainIndex,]
train <- df[-trainIndex,]
test
# Fit a random forest model
set.seed(123)
<- train(Median_Minutes ~ ., data = train, method = "rf")
rf_model
# Print the model results
rf_model
Random Forest
21185 samples
4 predictor
No pre-processing
Resampling: Bootstrapped (25 reps)
Summary of sample sizes: 21185, 21185, 21185, 21185, 21185, 21185, ...
Resampling results across tuning parameters:
mtry RMSE Rsquared MAE
2 1487.270 0.09307127 697.2415
5 1546.652 0.06958624 713.6039
9 1677.811 0.05062974 758.2495
RMSE was used to select the optimal model using the smallest value.
The final value used for the model was mtry = 2.
It is important to note, however, that the performance of a random forest model for predicting median border wait time would depend on the quality and relevance of the data and features used to train the model. Additionally, it may be necessary to regularly retrain the model with new data to ensure that it remains accurate over time.
TODO: - subset by years, model individual years to account for covid - use moving averages - interpolate missing median wait time values - try different ARIMA models, GAMs and Prophet library