# Beitbridge Truck GPS Data

This notebook cleans the GPS data obtained from the World Bank in preparation for modelling and analysis.

## Data Cleaning

In this notebook we:

- Check data quality on the GPS data
- Subset the data to our study period (2018-2022)
- Add zeros and `NA` values for completeness, since not every hour of every day is represented consistently in this data

## Load Libraries

We'll be using tidyverse packages to read in and manipulate our data:

- `readr` is for reading in tabular data
- `skimr` provides a quick summary of tabular data
- `lubridate` is for working with date formats and time series

```r
library(readr)
library(tidyverse)
library(skimr)
library(lubridate)
```

Read in the raw data:

```r
gps_data = read_csv("../data/raw/Beitbridge_Border_2017_02_01_to_2023_02_28.csv")
```

Remove the redundant field-name prefix and the grand total row from the rest of the table.
```r
# fix column names
gps_data = gps_data %>%
  rename_with(~ str_remove(., "Border_Crossing_"), everything())

# remove grand total row
grand_total = gps_data[1,]
gps_data = gps_data[-1,]
```

We can use `skim()` to get a quick overview of the data:
```r
skim(gps_data)
```

| Name | gps_data |
|---|---|
| Number of rows | 68318 |
| Number of columns | 10 |
| Column type frequency: character | 2 |
| Column type frequency: logical | 1 |
| Column type frequency: numeric | 6 |
| Column type frequency: POSIXct | 1 |
| Group variables | None |
Variable type: character
| skim_variable | n_missing | complete_rate | min | max | empty | n_unique | whitespace |
|---|---|---|---|---|---|---|---|
| Direction | 0 | 1 | 11 | 12 | 0 | 4 | 0 |
| GeozoneName | 0 | 1 | 17 | 17 | 0 | 1 | 0 |
Variable type: logical
| skim_variable | n_missing | complete_rate | mean | count |
|---|---|---|---|---|
| IsGrandTotalRowTotal | 0 | 1 | 0 | FAL: 68318 |
Variable type: numeric
| skim_variable | n_missing | complete_rate | mean | sd | p0 | p25 | p50 | p75 | p100 | hist |
|---|---|---|---|---|---|---|---|---|---|---|
| ProjectID | 0 | 1 | 1.30 | 0.71 | 1 | 1.00 | 1.00 | 1.00 | 3.0 | ▇▁▁▁▂ |
| StartHour | 0 | 1 | 11.41 | 5.96 | 0 | 7.00 | 12.00 | 16.00 | 23.0 | ▅▇▆▇▅ |
| Count_Events | 0 | 1 | 3.17 | 2.77 | 1 | 1.00 | 2.00 | 4.00 | 52.0 | ▇▁▁▁▁ |
| Median_Minutes | 0 | 1 | 1523.23 | 1773.96 | 10 | 335.00 | 1029.37 | 2096.00 | 42633.0 | ▇▁▁▁▁ |
| Bottom_10__Minutes | 0 | 1 | 1114.86 | 1510.78 | 10 | 235.60 | 634.70 | 1394.60 | 42633.0 | ▇▁▁▁▁ |
| Top_10__Minutes | 0 | 1 | 2227.06 | 2460.80 | 10 | 497.02 | 1416.00 | 3074.45 | 53702.6 | ▇▁▁▁▁ |
Variable type: POSIXct
| skim_variable | n_missing | complete_rate | min | max | median | n_unique |
|---|---|---|---|---|---|---|
| StartDate | 0 | 1 | 2017-02-01 | 2023-02-28 | 2019-07-28 | 2210 |
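The `StartDate` summary above already hints at gaps: a quick back-of-the-envelope check (a sketch using base R date arithmetic only, independent of the data file) shows the raw range spans more calendar days than the 2210 unique dates skim reports.

```r
# How many calendar days fall between the min and max StartDate (inclusive)?
n_days = as.integer(as.Date("2023-02-28") - as.Date("2017-02-01")) + 1
n_days         # 2219
n_days - 2210  # 9 dates in the range have no records at all
```

This is why we pad the data with explicit zero/`NA` rows below.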
Next we:

- Cut the data to our study period
- Remove extra fields
- Format `StartDate` as a date
- Remove the spaces in the `Direction` field for consistency
- Split our data set into two separate sets:
  - `sa_zimbabwe` for trucks travelling in the SA -> Zimbabwe direction
  - `zimbabwe_sa` for trucks travelling in the Zimbabwe -> SA direction
```r
gps_data = gps_data %>%
  filter(between(year(StartDate), 2018, 2022)) %>%
  select(-c(IsGrandTotalRowTotal, GeozoneName, ProjectID)) %>%
  mutate(StartDate = as_date(StartDate),
         Direction = str_replace_all(Direction, " ", ""))

sa_zimbabwe = gps_data %>%
  filter(Direction == "SA-Zimbabwe")

zimbabwe_sa = gps_data %>%
  filter(Direction == "Zimbabwe-SA")
```

## Data Quality

`StartHour` is not represented consistently, so we need to add zeros to `Count_Events` and `NA`s to the other fields. Create a new `date_table` containing every date and hour in our study period.
```r
# every hour (0-23) of every day in the study period
start_date = ymd_hm("2018-01-01 00:00")
end_date = ymd_hm("2022-12-31 23:00")
date_table = data.frame(StartDate = seq(start_date, end_date, by = "hour"))

date_table = date_table %>%
  mutate(StartHour = hour(StartDate)) %>%
  mutate(StartDate = as_date(StartDate))
```

Join the GPS data from each subset to our date table to fill in the missing start hours: only `Count_Events` is replaced with zero, while the median fields are left as `NA`. We'll do this for both `sa_zimbabwe` and `zimbabwe_sa`.
```r
sa_zimbabwe = sa_zimbabwe %>%
  full_join(date_table) %>%
  mutate(Direction = "SA-Zimbabwe") %>%
  mutate(across(c(Count_Events), ~replace_na(.x, 0))) %>%
  arrange(StartDate, StartHour)

zimbabwe_sa = zimbabwe_sa %>%
  full_join(date_table) %>%
  mutate(Direction = "Zimbabwe-SA") %>%
  mutate(across(c(Count_Events), ~replace_na(.x, 0))) %>%
  arrange(StartDate, StartHour)
```

Put it back together into one data set:

```r
beitbridge_border = bind_rows(zimbabwe_sa, sa_zimbabwe)
```

Save the data to a new CSV in our processed folder:

```r
write_csv(beitbridge_border, "../data/processed/Beitbridge_Counts_Wait_Time_2018_2022.csv")
```
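As a final sanity check (a sketch; the `nrow()` comparisons assume the objects above are still in memory), each direction should now contain at least one row for every hour of the study period:

```r
# 2018-2022 is four 365-day years plus the 366-day leap year 2020
hours_expected = (4 * 365 + 366) * 24
hours_expected  # 43824
# Both subsets were joined to the full hourly date_table, so:
# nrow(sa_zimbabwe) >= hours_expected
# nrow(zimbabwe_sa) >= hours_expected
```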