Beitbridge Truck GPS Data

Author

Sparkgeo

This notebook cleans the GPS data obtained by the World Bank, preparing it for future modelling and analysis.

Data Cleaning

  • Checking data quality on the GPS data

  • Subsetting the data to our study period (2018-2022)

  • Adding zeros and NA values for completeness, since not every hour of every day is represented consistently in this data

Load Libraries

Note

We’ll be using tidyverse packages to read in and manipulate our data.

  • readr is for reading in tabular data

  • skimr provides a quick summary of tabular data

  • lubridate is for working with date formats and time series

library(readr)
library(tidyverse)
library(skimr)
library(lubridate)

Read in the raw data:

gps_data = read_csv("../data/raw/Beitbridge_Border_2017_02_01_to_2023_02_28.csv")
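
Before touching the table, a quick parse check can save grief later. This is an optional sketch: problems() lists any cells readr failed to parse, and names() shows the raw, prefixed column names we clean up next.

# Optional sanity check: any parsing problems reported by readr?
problems(gps_data)

# Peek at the raw column names (they carry a redundant prefix)
names(gps_data)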

Remove the redundant Border_Crossing_ prefix from the field names, and separate the grand-total row from the rest of the table.

# fix column names: drop the Border_Crossing_ prefix
gps_data = gps_data %>%
  rename_with(~ str_remove(., "Border_Crossing_"), everything())

# set aside the grand-total row, then drop it from the table
grand_total = gps_data[1, ]
gps_data = gps_data[-1, ]
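
As a light safeguard, we can assert that the summary row is really gone. This is a sketch that assumes the IsGrandTotalRowTotal flag is TRUE only on that grand-total row:

# Every remaining row should have IsGrandTotalRowTotal == FALSE
stopifnot(!any(gps_data$IsGrandTotalRowTotal))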

We can use skim to get a quick overview of the data:

skim(gps_data)
Data summary

Name                     gps_data
Number of rows           68318
Number of columns        10
Column type frequency:
  character              2
  logical                1
  numeric                6
  POSIXct                1
Group variables          None

Variable type: character

skim_variable  n_missing  complete_rate  min  max  empty  n_unique  whitespace
Direction              0              1   11   12      0         4           0
GeozoneName            0              1   17   17      0         1           0

Variable type: logical

skim_variable         n_missing  complete_rate  mean  count
IsGrandTotalRowTotal          0              1     0  FAL: 68318

Variable type: numeric

skim_variable       n_missing  complete_rate     mean       sd  p0     p25      p50      p75     p100  hist
ProjectID                   0              1     1.30     0.71   1    1.00     1.00     1.00      3.0  ▇▁▁▁▂
StartHour                   0              1    11.41     5.96   0    7.00    12.00    16.00     23.0  ▅▇▆▇▅
Count_Events                0              1     3.17     2.77   1    1.00     2.00     4.00     52.0  ▇▁▁▁▁
Median_Minutes              0              1  1523.23  1773.96  10  335.00  1029.37  2096.00  42633.0  ▇▁▁▁▁
Bottom_10__Minutes          0              1  1114.86  1510.78  10  235.60   634.70  1394.60  42633.0  ▇▁▁▁▁
Top_10__Minutes             0              1  2227.06  2460.80  10  497.02  1416.00  3074.45  53702.6  ▇▁▁▁▁

Variable type: POSIXct

skim_variable  n_missing  complete_rate  min         max         median      n_unique
StartDate              0              1  2017-02-01  2023-02-28  2019-07-28  2210

Cut the data for our Study Period

  • Remove extra fields

  • Format StartDate as a date

  • Remove the spaces from the Direction field for consistency

  • Split our data set into two separate sets:

    1. sa_zimbabwe for trucks travelling in the SA -> Zimbabwe direction
    2. zimbabwe_sa for trucks travelling in the Zimbabwe -> SA direction

gps_data = gps_data %>%
  filter(between(year(StartDate), 2018, 2022)) %>%
  select(-c(IsGrandTotalRowTotal, GeozoneName, ProjectID)) %>%
  mutate(StartDate = as_date(StartDate),
         Direction = str_replace_all(Direction, " ", ""))

sa_zimbabwe = gps_data %>%
  filter(Direction == "SA-Zimbabwe")

zimbabwe_sa = gps_data %>%
  filter(Direction == "Zimbabwe-SA")
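
A quick consistency check, assuming the space removal collapses Direction to exactly these two labels: the two subsets should partition the study-period data with no rows left over.

# The two direction subsets should account for every row
table(gps_data$Direction)
stopifnot(nrow(sa_zimbabwe) + nrow(zimbabwe_sa) == nrow(gps_data))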

Data Quality

  • StartHour is not consistent: hours with no truck events are missing entirely, so we need to add zeros to Count_Events and NA values to the other fields.

  • Create a new date_table containing all of the dates and hours for our study period.

# build a complete hourly scaffold: YMD dates, hours 0 to 23
start_date = ymd_hm("2018-01-01 00:00")
end_date = ymd_hm("2022-12-31 23:00")

date_table = data.frame(StartDate = seq(start_date, end_date, by = "hour"))

date_table = date_table %>%
  mutate(StartHour = hour(StartDate)) %>%
  mutate(StartDate = as_date(StartDate))
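
A quick check on the scaffold: the study period covers 1,826 days (2018-2022, including the 2020 leap year), so the table should hold exactly 1,826 x 24 hourly slots.

# 1,826 days x 24 hours = 43,824 rows expected
stopifnot(nrow(date_table) == 1826 * 24)
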
  • Join each subset of our GPS data to the date table to fill in the missing start hours.

  • Replace NA values with zero only for Count_Events; the median and percentile wait-time fields stay as NA.

  • We’ll do this for both sa_zimbabwe and zimbabwe_sa

sa_zimbabwe = sa_zimbabwe %>%
  full_join(date_table, by = c("StartDate", "StartHour")) %>%
  mutate(Direction = "SA-Zimbabwe") %>%
  mutate(Count_Events = replace_na(Count_Events, 0)) %>%
  arrange(StartDate, StartHour)

zimbabwe_sa = zimbabwe_sa %>%
  full_join(date_table, by = c("StartDate", "StartHour")) %>%
  mutate(Direction = "Zimbabwe-SA") %>%
  mutate(Count_Events = replace_na(Count_Events, 0)) %>%
  arrange(StartDate, StartHour)
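
To confirm the fill worked, we can look for dates that still lack a full 24 hours. Assuming the raw feed has at most one record per direction and hour, both queries below should return zero rows:

# Dates with more or fewer than 24 hourly rows indicate gaps or duplicates
sa_zimbabwe %>% count(StartDate) %>% filter(n != 24)
zimbabwe_sa %>% count(StartDate) %>% filter(n != 24)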

Put it back together into one data set.

beitbridge_border = bind_rows(zimbabwe_sa, sa_zimbabwe)
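
With a complete hourly grid on each side, the combined set should contain exactly two rows (one per direction) for every hour of the study period; a quick assertion under that assumption:

# 2 directions x 43,824 hourly slots
stopifnot(nrow(beitbridge_border) == 2 * nrow(date_table))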

Save the cleaned data to a new CSV in our processed folder.

write_csv(beitbridge_border, "../data/processed/Beitbridge_Counts_Wait_Time_2018_2022.csv")
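
As a final optional step, we can round-trip the file to make sure it reads back cleanly:

# Re-read the processed file and summarise it
processed = read_csv("../data/processed/Beitbridge_Counts_Wait_Time_2018_2022.csv")
skim(processed)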