library(readr)
library(tidyverse)
library(skimr)
library(lubridate)
Beitbridge Truck GPS Data

This notebook cleans the GPS data obtained from the World Bank for future modelling and analysis.

Data Cleaning

In this notebook we will be:

- Checking data quality on the GPS data
- Subsetting the data to our study period (2018-2022)
- Adding zeros and `NA` values for completeness, since not every hour of every day is represented consistently in this data
Load Libraries
We'll be using `tidyverse` packages to read in and manipulate our data:

- `readr` is for reading in tabular data
- `skimr` provides a quick summary of tabular data
- `lubridate` is for working with date formats and time series
Read in the raw data:
gps_data = read_csv("../data/raw/Beitbridge_Border_2017_02_01_to_2023_02_28.csv")
Remove the redundant prefix from the field names, and separate the grand total row from the rest of the table.
# fix column names by stripping the shared prefix
gps_data = gps_data %>%
  rename_with(~ str_remove(., "Border_Crossing_"), everything())

# remove grand total row (the first row)
grand_total = gps_data[1, ]
gps_data = gps_data[-1, ]
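Before discarding it, it's worth confirming that the row we pulled out really is the grand total. A minimal guard, assuming the source flags that row via IsGrandTotalRowTotal:

# the removed row should be the one flagged as the grand total
stopifnot(isTRUE(grand_total$IsGrandTotalRowTotal))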
We can use skim to get a quick overview of the data:
skim(gps_data)
Name | gps_data |
Number of rows | 68318 |
Number of columns | 10 |
_______________________ | |
Column type frequency: | |
character | 2 |
logical | 1 |
numeric | 6 |
POSIXct | 1 |
________________________ | |
Group variables | None |
Variable type: character
skim_variable | n_missing | complete_rate | min | max | empty | n_unique | whitespace |
---|---|---|---|---|---|---|---|
Direction | 0 | 1 | 11 | 12 | 0 | 4 | 0 |
GeozoneName | 0 | 1 | 17 | 17 | 0 | 1 | 0 |
Variable type: logical
skim_variable | n_missing | complete_rate | mean | count |
---|---|---|---|---|
IsGrandTotalRowTotal | 0 | 1 | 0 | FAL: 68318 |
Variable type: numeric
skim_variable | n_missing | complete_rate | mean | sd | p0 | p25 | p50 | p75 | p100 | hist |
---|---|---|---|---|---|---|---|---|---|---|
ProjectID | 0 | 1 | 1.30 | 0.71 | 1 | 1.00 | 1.00 | 1.00 | 3.0 | ▇▁▁▁▂ |
StartHour | 0 | 1 | 11.41 | 5.96 | 0 | 7.00 | 12.00 | 16.00 | 23.0 | ▅▇▆▇▅ |
Count_Events | 0 | 1 | 3.17 | 2.77 | 1 | 1.00 | 2.00 | 4.00 | 52.0 | ▇▁▁▁▁ |
Median_Minutes | 0 | 1 | 1523.23 | 1773.96 | 10 | 335.00 | 1029.37 | 2096.00 | 42633.0 | ▇▁▁▁▁ |
Bottom_10__Minutes | 0 | 1 | 1114.86 | 1510.78 | 10 | 235.60 | 634.70 | 1394.60 | 42633.0 | ▇▁▁▁▁ |
Top_10__Minutes | 0 | 1 | 2227.06 | 2460.80 | 10 | 497.02 | 1416.00 | 3074.45 | 53702.6 | ▇▁▁▁▁ |
Variable type: POSIXct
skim_variable | n_missing | complete_rate | min | max | median | n_unique |
---|---|---|---|---|---|---|
StartDate | 0 | 1 | 2017-02-01 | 2023-02-28 | 2019-07-28 | 2210 |
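The skim output shows four unique Direction values even though there are only two routes, which suggests inconsistent spacing in the labels (and is why we strip spaces below). A quick tabulation makes this visible; the exact labels depend on the raw file:

# inspect the raw direction labels before normalizing them
gps_data %>%
  count(Direction)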
Cut the data for our Study Period

- Remove extra fields
- Format `StartDate` as a date
- Remove the spaces in the `Direction` field for consistency
- Split our data set into two separate sets:
  - `sa_zimbabwe` for trucks travelling in the SA -> Zimbabwe direction
  - `zimbabwe_sa` for trucks travelling in the Zimbabwe -> SA direction
gps_data = gps_data %>%
  # keep only the 2018-2022 study period
  filter(between(year(StartDate), 2018, 2022)) %>%
  # drop fields we no longer need
  select(-c(IsGrandTotalRowTotal, GeozoneName, ProjectID)) %>%
  # store StartDate as a date and strip spaces from the direction labels
  mutate(StartDate = as_date(StartDate),
         Direction = str_replace_all(Direction, " ", ""))
sa_zimbabwe = gps_data %>%
  filter(Direction == "SA-Zimbabwe")

zimbabwe_sa = gps_data %>%
  filter(Direction == "Zimbabwe-SA")
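A couple of quick sanity checks on the subsetting (a sketch; the second check assumes every cleaned row carries one of the two direction labels):

# the filter should leave only 2018-2022
range(year(gps_data$StartDate))

# the two directional subsets should partition the cleaned data
stopifnot(nrow(sa_zimbabwe) + nrow(zimbabwe_sa) == nrow(gps_data))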
Data Quality

`StartHour` is not consistent: not every hour of every day appears in the data, so we need to add zeros to `Count_Events` (and `NA` values to the other fields). Create a new `date_table` containing all of the dates and hours in our study period.
# build every YMD date and hour (0 to 23) in the study period
start_date = ymd_hm("2018-01-01 00:00")
end_date = ymd_hm("2022-12-31 23:00")

# one row per hour across the whole study period
date_table = data.frame(StartDate = seq(start_date, end_date, by = "hour"))

# split each timestamp into a date column and an hour column
date_table = date_table %>%
  mutate(StartHour = hour(StartDate)) %>%
  mutate(StartDate = as_date(StartDate))
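The study period spans 1,826 days (2020 is a leap year), so `date_table` should contain 1,826 x 24 = 43,824 hourly rows:

# one row for every hour of every day in 2018-2022
stopifnot(nrow(date_table) == 1826 * 24)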
Join each subset of our GPS data to the date table to fill in the missing start hours: replace missing `Count_Events` with zero, and leave the median wait-time fields as `NA`. We'll do this for both `sa_zimbabwe` and `zimbabwe_sa`.
sa_zimbabwe = sa_zimbabwe %>%
  full_join(date_table) %>%  # joins on StartDate and StartHour
  mutate(Direction = "SA-Zimbabwe") %>%
  mutate(across(c(Count_Events), ~ replace_na(.x, 0))) %>%
  arrange(StartDate, StartHour)

zimbabwe_sa = zimbabwe_sa %>%
  full_join(date_table) %>%
  mutate(Direction = "Zimbabwe-SA") %>%
  mutate(across(c(Count_Events), ~ replace_na(.x, 0))) %>%
  arrange(StartDate, StartHour)
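After the joins, every date in each subset should appear with all 24 hours. This check (a sketch, assuming at most one source row per direction and hour; duplicates would also surface here) should return zero rows:

# any date without exactly 24 hourly rows indicates a gap or a duplicate
sa_zimbabwe %>%
  count(StartDate) %>%
  filter(n != 24)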
Put it back together into one data set.
beitbridge_border = bind_rows(zimbabwe_sa, sa_zimbabwe)
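Under the same one-row-per-direction-hour assumption, the combined set should have twice as many rows as `date_table`:

# two directions, one row per hour of the study period
stopifnot(nrow(beitbridge_border) == 2 * nrow(date_table))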
Save the data to a new CSV in our `processed` folder.
write_csv(beitbridge_border, "../data/processed/Beitbridge_Counts_Wait_Time_2018_2022.csv")
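As a final check, we can read the processed file back in and skim it to confirm the completed hourly structure survived the round trip (`beitbridge_check` is just a throwaway name for this sketch):

# re-read the processed file and summarise it
beitbridge_check = read_csv("../data/processed/Beitbridge_Counts_Wait_Time_2018_2022.csv")
skim(beitbridge_check)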