# Beitbridge Truck GPS Data

This notebook cleans the GPS data obtained from the World Bank in preparation for modelling and analysis.

## Data Cleaning

In this notebook we:

- Check data quality on the GPS data
- Subset the data to our study period (2018-2022)
- Add zeros and `NA` values for completeness, since not every hour of every day is represented consistently in this data

## Load Libraries

We'll be using tidyverse packages to read in and manipulate our data:

- `readr` is for reading in tabular data
- `skimr` provides a quick summary of tabular data
- `lubridate` is for working with date formats and time series

```r
library(readr)
library(tidyverse)
library(skimr)
library(lubridate)
```

Read in the raw data:

```r
gps_data = read_csv("../data/raw/Beitbridge_Border_2017_02_01_to_2023_02_28.csv")
```

Remove the redundant field-name prefix and the grand total row from the rest of the table.
```r
# fix column names
gps_data = gps_data %>%
  rename_with(~ str_remove(., "Border_Crossing_"), everything())

# remove grand total row
grand_total = gps_data[1,]
gps_data = gps_data[-1,]
```

We can use `skim()` to get a quick overview of the data:
```r
skim(gps_data)
```

| Name | gps_data |
|---|---|
| Number of rows | 68318 |
| Number of columns | 10 |
| Column type frequency: character | 2 |
| Column type frequency: logical | 1 |
| Column type frequency: numeric | 6 |
| Column type frequency: POSIXct | 1 |
| Group variables | None |
Variable type: character
| skim_variable | n_missing | complete_rate | min | max | empty | n_unique | whitespace |
|---|---|---|---|---|---|---|---|
| Direction | 0 | 1 | 11 | 12 | 0 | 4 | 0 |
| GeozoneName | 0 | 1 | 17 | 17 | 0 | 1 | 0 |
Variable type: logical
| skim_variable | n_missing | complete_rate | mean | count |
|---|---|---|---|---|
| IsGrandTotalRowTotal | 0 | 1 | 0 | FAL: 68318 |
Variable type: numeric
| skim_variable | n_missing | complete_rate | mean | sd | p0 | p25 | p50 | p75 | p100 | hist |
|---|---|---|---|---|---|---|---|---|---|---|
| ProjectID | 0 | 1 | 1.30 | 0.71 | 1 | 1.00 | 1.00 | 1.00 | 3.0 | ▇▁▁▁▂ |
| StartHour | 0 | 1 | 11.41 | 5.96 | 0 | 7.00 | 12.00 | 16.00 | 23.0 | ▅▇▆▇▅ |
| Count_Events | 0 | 1 | 3.17 | 2.77 | 1 | 1.00 | 2.00 | 4.00 | 52.0 | ▇▁▁▁▁ |
| Median_Minutes | 0 | 1 | 1523.23 | 1773.96 | 10 | 335.00 | 1029.37 | 2096.00 | 42633.0 | ▇▁▁▁▁ |
| Bottom_10__Minutes | 0 | 1 | 1114.86 | 1510.78 | 10 | 235.60 | 634.70 | 1394.60 | 42633.0 | ▇▁▁▁▁ |
| Top_10__Minutes | 0 | 1 | 2227.06 | 2460.80 | 10 | 497.02 | 1416.00 | 3074.45 | 53702.6 | ▇▁▁▁▁ |
Variable type: POSIXct
| skim_variable | n_missing | complete_rate | min | max | median | n_unique |
|---|---|---|---|---|---|---|
| StartDate | 0 | 1 | 2017-02-01 | 2023-02-28 | 2019-07-28 | 2210 |
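The `StartDate` summary above already hints at gaps: a quick back-of-the-envelope check (a sketch using base R date arithmetic only, independent of the data file) shows the raw range spans more calendar days than the 2210 unique dates skim reports.

```r
# How many calendar days fall between the min and max StartDate (inclusive)?
n_days = as.integer(as.Date("2023-02-28") - as.Date("2017-02-01")) + 1
n_days         # 2219
n_days - 2210  # 9 dates in the range have no records at all
```

This is why we pad the data with explicit zero/`NA` rows below.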
Next we:

- Cut the data to our study period
- Remove extra fields
- Format `StartDate` as a date
- Remove the spaces in the `Direction` field for consistency
- Split our data set into two separate sets:
  - `sa_zimbabwe` for trucks travelling in the SA -> Zimbabwe direction
  - `zimbabwe_sa` for trucks travelling in the Zimbabwe -> SA direction
```r
gps_data = gps_data %>%
  filter(between(year(StartDate), 2018, 2022)) %>%
  select(-c(IsGrandTotalRowTotal, GeozoneName, ProjectID)) %>%
  mutate(StartDate = as_date(StartDate),
         Direction = str_replace_all(Direction, " ", ""))

sa_zimbabwe = gps_data %>%
  filter(Direction == "SA-Zimbabwe")

zimbabwe_sa = gps_data %>%
  filter(Direction == "Zimbabwe-SA")
```

## Data Quality

`StartHour` is not represented consistently, so we need to add zeros to `Count_Events` and `NA`s to the other fields. Create a new `date_table` containing every date and hour in our study period.
```r
# every hour (0-23) of every day in the study period
start_date = ymd_hm("2018-01-01 00:00")
end_date = ymd_hm("2022-12-31 23:00")
date_table = data.frame(StartDate = seq(start_date, end_date, by = "hour"))

date_table = date_table %>%
  mutate(StartHour = hour(StartDate)) %>%
  mutate(StartDate = as_date(StartDate))
```

Join the GPS data from each subset to our date table to fill in the missing start hours: only `Count_Events` is replaced with zero, while the median fields are left as `NA`. We'll do this for both `sa_zimbabwe` and `zimbabwe_sa`.
```r
sa_zimbabwe = sa_zimbabwe %>%
  full_join(date_table) %>%
  mutate(Direction = "SA-Zimbabwe") %>%
  mutate(across(c(Count_Events), ~replace_na(.x, 0))) %>%
  arrange(StartDate, StartHour)

zimbabwe_sa = zimbabwe_sa %>%
  full_join(date_table) %>%
  mutate(Direction = "Zimbabwe-SA") %>%
  mutate(across(c(Count_Events), ~replace_na(.x, 0))) %>%
  arrange(StartDate, StartHour)
```

Put it back together into one data set:

```r
beitbridge_border = bind_rows(zimbabwe_sa, sa_zimbabwe)
```

Save the data to a new CSV in our processed folder:

```r
write_csv(beitbridge_border, "../data/processed/Beitbridge_Counts_Wait_Time_2018_2022.csv")
```
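As a final sanity check (a sketch; the `nrow()` comparisons assume the objects above are still in memory), each direction should now contain at least one row for every hour of the study period:

```r
# 2018-2022 is four 365-day years plus the 366-day leap year 2020
hours_expected = (4 * 365 + 366) * 24
hours_expected  # 43824
# Both subsets were joined to the full hourly date_table, so:
# nrow(sa_zimbabwe) >= hours_expected
# nrow(zimbabwe_sa) >= hours_expected
```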