This vignette provides an overview to the covid19_vaccine
dataset.
Before getting started, please note the following:
Data sources:
Let’s start by loading the packages we will use in this vignette:
We load the dataset from the coronavirus package, and use the dplyr and plotly packages to manipulate and plot the data. Let’s load the data:
data("covid19_vaccine")
head(covid19_vaccine)
#> date country_region continent_name continent_code combined_key
#> 1 2020-12-29 Austria Europe EU Austria
#> 2 2020-12-29 Bahrain Asia AS Bahrain
#> 3 2020-12-29 Belarus Europe EU Belarus
#> 4 2020-12-29 Belgium Europe EU Belgium
#> 5 2020-12-29 Canada North America NA Canada
#> 6 2020-12-29 Chile South America SA Chile
#> doses_admin people_at_least_one_dose population uid iso2 iso3 code3 fips
#> 1 2123 2123 9006400 40 AT AUT 40 <NA>
#> 2 55014 55014 1701583 48 BH BHR 48 <NA>
#> 3 0 0 9449321 112 BY BLR 112 <NA>
#> 4 340 340 11589616 56 BE BEL 56 <NA>
#> 5 59079 59078 37855702 124 CA CAN 124 <NA>
#> 6 NA NA 19116209 152 CL CHL 152 <NA>
#> lat long
#> 1 47.5162 14.550100
#> 2 26.0275 50.550000
#> 3 53.7098 27.953400
#> 4 50.8333 4.469936
#> 5 60.0000 -95.000000
#> 6 -35.6751 -71.543000
The covid19_vaccine
dataset provides time-series data on the vaccination progress by country or province (if applicable). Likewise, the coronavirus
dataset, the COVID19 vaccine raw data is collected by Johns Hopkins University Center for Systems Science and Engineering (JHU CCSE). The covid19_vaccine
data includes the following fields:
date
- Data collection date in YYYY-MM-DD formatcountry_region
- Country or region namecontinent_name
- Continent namecontinent_code
- Continent codecombined_key
- Country and province (if applicable)doses_admin
- Cumulative number of doses administered. When a vaccine requires multiple doses, each one is counted independentlypeople_at_least_one_dose
- Cumulative number of people who received at least one vaccine dose. When the person receives a prescribed second dose, it is not counted twicepopulation
- Country or province populationuid
- Country codeiso2
- Officially assigned country code identifiers with two-letteriso3
- Officially assigned country code identifiers with three-lettercode3
- UN country codefips
- Federal Information Processing Standards code that uniquely identifies counties within the USAlat
- Latitudelong
- LongitudeNote: The country / province code fields (e.g., ios2
, ios3
, etc.) and population were merged with the raw data
The data in the package is up-to-date with the last date the package was updated on CRAN. To get the most recent data available on the source (data get updated daily), you can use one of the following methods:
update_dataset
which will re-install the package from the package repository. Please see the vignette for more details
library(readr)
vaccine_df <- read_csv(file = "https://raw.githubusercontent.com/RamiKrispin/coronavirus/main/csv/covid19_vaccine.csv",
col_types = cols(
date = col_date(format = ""),
country_region = col_character(),
continent_name = col_character(),
continent_code = col_character(),
combined_key = col_character(),
doses_admin = col_number(),
people_at_least_one_dose = col_number(),
population = col_number(),
uid = col_number(),
iso2 = col_character(),
iso3 = col_character(),
code3 = col_number(),
fips = col_logical(),
lat = col_number(),
long = col_number()
))
# Fixing the continent code field - changing NA to "NA" for North America
vaccine_df$continent_code <- ifelse(is.na(vaccine_df$continent_code) & vaccine_df$continent_name == "North America", "NA", vaccine_df$continent_code)
head(vaccine_df)
#> # A tibble: 6 × 15
#> date country_region continent_name continent_code combined_key
#> <date> <chr> <chr> <chr> <chr>
#> 1 2020-12-29 Austria Europe EU Austria
#> 2 2020-12-29 Bahrain Asia AS Bahrain
#> 3 2020-12-29 Belarus Europe EU Belarus
#> 4 2020-12-29 Belgium Europe EU Belgium
#> 5 2020-12-29 Canada North America NA Canada
#> 6 2020-12-29 Chile South America SA Chile
#> # … with 10 more variables: doses_admin <dbl>, people_at_least_one_dose <dbl>,
#> # population <dbl>, uid <dbl>, iso2 <chr>, iso3 <chr>, code3 <dbl>,
#> # fips <lgl>, lat <dbl>, long <dbl>
The plan, in the next version, is to add a similar function to the refresh_coronavirus_jhu
function to pull the vaccine dataset.
The dataset includes two cumulative counts of vaccine doses received:
people_at_least_one_dose
- counts both single and multiple doses of vaccine as one dosedoses_admin
- counts each dose received separatelySo, for example, if a person received a multi doses vaccine such as the Pfizer or Moderna vaccine, here is the count by each method:
people_at_least_one_dose
will count it as one dose, and the doses_admin
will count it as two
We can assume that the last case is rare; therefore, the people_at_least_one_dose
could be a good proxy for fully vaccinated people. Let’s see the differences between the two by using the US vaccine data:
covid19_vaccine |>
filter(country_region == "US") |>
select(date, doses_admin, total = people_at_least_one_dose) |>
arrange(date) |>
plot_ly() |>
add_lines(x = ~ date,
y = ~ doses_admin,
name = "Total Doses") |>
add_lines(x = ~ date,
y = ~ total,
name = "Recevied Vaccine with One/Two Doses") |>
layout(title = "US Cumulative Number of Doses by Type",
legend = list(x = 0.05, y = 0.95),
margin = list(l = 50, r = 50, b = 50, t = 90),
yaxis = list(title = "Number of Doses"),
xaxis = list(title = ""))
Here are some thoughts about how to leverage this dataset:
people_at_least_one_dose
could provide a good proxy for the total number of fully vaccinated peopledoses_admin
column to calculate some stats about the total number of doses provided in a given period or overall.Before going back and diving into the US data, let’s summarize the data by country. Since the data is cumulative, let’s take a snapshot of the most recent date in the data and calculate the ratio between the total number of doses and the size of the population:
max(covid19_vaccine$date)
#> [1] "2023-03-09"
df_summary <- covid19_vaccine |>
filter(date == max(date)) |>
select(date, country_region, doses_admin, total = people_at_least_one_dose, population, continent_name) |>
mutate(doses_pop_ratio = doses_admin / population,
total_pop_ratio = total / population) |>
arrange(- total)
head(df_summary, 10)
#> date country_region doses_admin total population continent_name
#> 1 2023-03-09 World NA 5549369464 NA <NA>
#> 2 2023-03-09 China NA 1310292000 1404676330 Asia
#> 3 2023-03-09 India NA 1027379945 1380004385 Asia
#> 4 2023-03-09 US 672076105 269554116 329466283 North America
#> 5 2023-03-09 Indonesia 444303130 203657535 273523621 Asia
#> 6 2023-03-09 Brazil 502262440 189395212 212559409 South America
#> 7 2023-03-09 Pakistan 333759565 162219717 220892331 Asia
#> 8 2023-03-09 Bangladesh 355143411 151190373 164689383 Asia
#> 9 2023-03-09 Japan 382415648 104675948 126476458 Asia
#> 10 2023-03-09 Mexico 225063079 99071001 127792286 North America
#> doses_pop_ratio total_pop_ratio
#> 1 NA NA
#> 2 NA 0.9328071
#> 3 NA 0.7444759
#> 4 2.039893 0.8181539
#> 5 1.624368 0.7445702
#> 6 2.362927 0.8910225
#> 7 1.510960 0.7343837
#> 8 2.156444 0.9180335
#> 9 3.023611 0.8276319
#> 10 1.761163 0.7752502
It would be interesting to plot the ratio between the number of overall doses received and the size of the population. Let’s filter missing values and plot the ratio:
df_summary |>
filter(country_region != "World",
!is.na(population),
!is.na(total)) |>
plot_ly(x = ~ population,
y = ~ total,
text = ~ paste("Country: ", country_region, "<br>",
"Population: ", population, "<br>",
"Total Doses: ", total, "<br>",
"Ratio: ", round(total_pop_ratio, 2),
sep = ""),
type = "scatter",
mode = "markers") |>
layout(title = "Total Doses vs. Population",
margin = list(l = 50, r = 50, b = 60, t = 70),
yaxis = list(title = "Number of Doses"),
xaxis = list(title = "Population Size"))
As can see, China and India are skewing the plot. We can re-scale the plot with log transformation of the number of doses and population size using Plotly’s built-in log transformation. Let’s also color by continent and add diagonal lines to represent the 1 to 1 relationship between the median and first quartile:
# Setting the diagonal lines range
line_start <- 10000
line_end <- 1500 * 10 ^ 6
# Filter the data
d <- df_summary |>
filter(country_region != "World",
!is.na(population),
!is.na(total))
# Replot it
plot_ly() |>
add_markers(x = d$population,
y = d$total,
text = ~ paste("Country: ", d$country_region, "<br>",
"Population: ", d$population, "<br>",
"Total Doses: ", d$total, "<br>",
"Ratio: ", round(d$total_pop_ratio, 2),
sep = ""),
color = d$continent_name,
type = "scatter",
mode = "markers") |>
add_lines(x = c(line_start, line_end),
y = c(line_start, line_end),
showlegend = FALSE,
line = list(color = "gray", width = 0.5)) |>
add_lines(x = c(line_start, line_end),
y = c(0.5 * line_start, 0.5 * line_end),
showlegend = FALSE,
line = list(color = "gray", width = 0.5)) |>
add_lines(x = c(line_start, line_end),
y = c(0.25 * line_start, 0.25 * line_end),
showlegend = FALSE,
line = list(color = "gray", width = 0.5)) |>
add_annotations(text = "1:1",
x = log10(line_end * 1.25),
y = log10(line_end * 1.25),
showarrow = FALSE,
textangle = -25,
font = list(size = 8),
xref = "x",
yref = "y") |>
add_annotations(text = "1:2",
x = log10(line_end * 1.25),
y = log10(0.5 * line_end * 1.25),
showarrow = FALSE,
textangle = -25,
font = list(size = 8),
xref = "x",
yref = "y") |>
add_annotations(text = "1:4",
x = log10(line_end * 1.25),
y = log10(0.25 * line_end * 1.25),
showarrow = FALSE,
textangle = -25,
font = list(size = 8),
xref = "x",
yref = "y") |>
add_annotations(text = "Source: Johns Hopkins University - Centers for Civic Impact",
showarrow = FALSE,
xref = "paper",
yref = "paper",
x = -0.05, y = - 0.33) |>
layout(title = "Covid19 Vaccine - Total Doses vs. Population Ratio (Log Scale)",
margin = list(l = 50, r = 50, b = 90, t = 70),
yaxis = list(title = "Number of Doses",
type = "log"),
xaxis = list(title = "Population Size",
type = "log"),
legend = list(x = 0.75, y = 0.05))
Another way to represent the data is by using a box plot to plot the total doses and population ration distribution by continent:
plot_ly(d,
y = ~ total_pop_ratio,
color = ~ continent_name,
type = "box", boxpoints = "all", jitter = 0.3,
pointpos = -1.8
) |>
layout(title = "Distribution of Total Doses Admited vs Population Size Ratio",
margin = list(l = 50, r = 50, b = 60, t = 60),
yaxis = list(title = "Total Doses/Population Ratio"),
xaxis = list(title = "Continent")
)
Let’s now return to the number of doses in the US and plot the daily number of new cases with the daily number of vaccine doses received. We will use the people_at_least_one_dose
column as a proxy for the number of people vaccinated. First, let’s reformat the data from cumulative to daily by subtracting from the cumulative count its first lag:
us_vaccine <- covid19_vaccine |>
filter(country_region == "US") |>
select(date, total_doses = people_at_least_one_dose) |>
mutate(total_doses_lag1 = lag(total_doses, 1),
daily_doses = total_doses - total_doses_lag1)|>
select(date, daily_doses) |>
arrange(date)
head(us_vaccine)
#> date daily_doses
#> 1 2020-12-29 NA
#> 2 2020-12-30 708712
#> 3 2020-12-31 827734
#> 4 2021-01-01 415260
#> 5 2021-01-02 119786
#> 6 2021-01-03 259292
Let’s now plot the daily doses:
plot_ly(us_vaccine) |>
add_lines(x = ~ date,
y = ~ daily_doses) |>
layout(title = "US Daily Vaccine Doses Received",
margin = list(l = 50, r = 50, b = 60, t = 60),
yaxis = list(title = "Total Doses"),
xaxis = list(title = ""))
We can now add the Covid19 cases by loading the dataset using the refresh_coronavirus_jhu
function, filter it (to US), and merge it with the us_vaccine
dataset we created above:
covid19_cases <- refresh_coronavirus_jhu()
#> Loading 2020 data
#> Loading 2021 data
#> Loading 2022 data
#> Loading 2023 data
head(covid19_cases)
#> date location location_type location_code location_code_type
#> 1 2021-12-31 Afghanistan country AF iso_3166_2
#> 2 2020-03-24 Afghanistan country AF iso_3166_2
#> 3 2022-11-02 Afghanistan country AF iso_3166_2
#> 4 2020-03-23 Afghanistan country AF iso_3166_2
#> 5 2021-08-09 Afghanistan country AF iso_3166_2
#> 6 2023-03-02 Afghanistan country AF iso_3166_2
#> data_type value lat long
#> 1 cases_new 28 33.93911 67.70995
#> 2 recovered_new 0 33.93911 67.70995
#> 3 cases_new 98 33.93911 67.70995
#> 4 recovered_new 0 33.93911 67.70995
#> 5 deaths_new 28 33.93911 67.70995
#> 6 cases_new 18 33.93911 67.70995
us_cases <- covid19_cases |>
filter(location == "US",
data_type == "cases_new") |>
select(date, cases = value)
head(us_cases)
#> date cases
#> 1 2021-08-07 63881
#> 2 2021-06-20 4925
#> 3 2021-08-09 158431
#> 4 2021-05-18 29008
#> 5 2021-08-08 35686
#> 6 2021-08-04 102507
Let’s now merge the two datasets and plot them:
df <- us_vaccine |>
left_join(us_cases, by = "date") |>
select(date, daily_doses, daily_cases = cases) |>
arrange(date)
head(df)
#> date daily_doses daily_cases
#> 1 2020-12-29 NA 206657
#> 2 2020-12-30 708712 218774
#> 3 2020-12-31 827734 313782
#> 4 2021-01-01 415260 177570
#> 5 2021-01-02 119786 272445
#> 6 2021-01-03 259292 203162
plot_ly(df) |>
add_lines(x = ~ date,
y = ~ daily_cases,
name = "Daily Cases") |>
add_lines(x = ~ date,
y = ~ daily_doses,
name = "Daily Doses") |>
layout(title = "US Daily New Cases vs. Doses of Vaccine",
yaxis = list(title = "New Cases/Doses"),
xaxis = list(title = ""),
legend = list(x = 0.75, y = 0.95),
margin = list(l = 50, r = 50, b = 50, t = 90))