This vignette provides an overview to the covid19_vaccine dataset.

Data

Before getting started, please note the following:

  • It may be challenging to compare countries as each may use different count methods for the given doses.
  • In addition, the dose summary includes multiple types of vaccines, and there is no information if the doses are a single or multiple doses series.

Data sources:

Load the data

Let’s start by loading the packages we will use in this vignette:

We load the dataset from the coronavirus package, and use the dplyr and plotly packages to manipulate and plot the data. Let’s load the data:

data("covid19_vaccine")

head(covid19_vaccine)
#>         date country_region continent_name continent_code combined_key
#> 1 2020-12-29        Austria         Europe             EU      Austria
#> 2 2020-12-29        Bahrain           Asia             AS      Bahrain
#> 3 2020-12-29        Belarus         Europe             EU      Belarus
#> 4 2020-12-29        Belgium         Europe             EU      Belgium
#> 5 2020-12-29         Canada  North America             NA       Canada
#> 6 2020-12-29          Chile  South America             SA        Chile
#>   doses_admin people_at_least_one_dose population uid iso2 iso3 code3 fips
#> 1        2123                     2123    9006400  40   AT  AUT    40 <NA>
#> 2       55014                    55014    1701583  48   BH  BHR    48 <NA>
#> 3           0                        0    9449321 112   BY  BLR   112 <NA>
#> 4         340                      340   11589616  56   BE  BEL    56 <NA>
#> 5       59079                    59078   37855702 124   CA  CAN   124 <NA>
#> 6          NA                       NA   19116209 152   CL  CHL   152 <NA>
#>        lat       long
#> 1  47.5162  14.550100
#> 2  26.0275  50.550000
#> 3  53.7098  27.953400
#> 4  50.8333   4.469936
#> 5  60.0000 -95.000000
#> 6 -35.6751 -71.543000

The covid19_vaccine dataset provides time-series data on the vaccination progress by country or province (if applicable). Likewise, the coronavirus dataset, the COVID19 vaccine raw data is collected by Johns Hopkins University Center for Systems Science and Engineering (JHU CCSE). The covid19_vaccine data includes the following fields:

  • date - Data collection date in YYYY-MM-DD format
  • country_region - Country or region name
  • continent_name - Continent name
  • continent_code - Continent code
  • combined_key - Country and province (if applicable)
  • doses_admin - Cumulative number of doses administered. When a vaccine requires multiple doses, each one is counted independently
  • people_at_least_one_dose - Cumulative number of people who received at least one vaccine dose. When the person receives a prescribed second dose, it is not counted twice
  • population - Country or province population
  • uid - Country code
  • iso2 - Officially assigned country code identifiers with two-letter
  • iso3 - Officially assigned country code identifiers with three-letter
  • code3 - UN country code
  • fips - Federal Information Processing Standards code that uniquely identifies counties within the USA
  • lat - Latitude
  • long - Longitude

Note: The country / province code fields (e.g., ios2, ios3, etc.) and population were merged with the raw data

Refresh the data

The data in the package is up-to-date with the last date the package was updated on CRAN. To get the most recent data available on the source (data get updated daily), you can use one of the following methods:

  • Use the update_dataset which will re-install the package from the package repository. Please see the vignette for more details
  • Pull the data directly from the CSV file on the repository:
library(readr)

vaccine_df <- read_csv(file = "https://raw.githubusercontent.com/RamiKrispin/coronavirus/main/csv/covid19_vaccine.csv",
                       col_types = cols(
                         date = col_date(format = ""),
                         country_region = col_character(),
                         continent_name = col_character(),
                         continent_code = col_character(),
                         combined_key = col_character(),
                         doses_admin = col_number(),
                         people_at_least_one_dose = col_number(),
                         population = col_number(),
                         uid = col_number(),
                         iso2 = col_character(),
                         iso3 = col_character(),
                         code3 = col_number(),
                         fips = col_logical(),
                         lat = col_number(),
                         long = col_number()
                       )) 

# Fixing the continent code field - changing NA to "NA" for North America
vaccine_df$continent_code <- ifelse(is.na(vaccine_df$continent_code) & vaccine_df$continent_name == "North America", "NA", vaccine_df$continent_code)

head(vaccine_df)
#> # A tibble: 6 × 15
#>   date       country_region continent_name continent_code combined_key
#>   <date>     <chr>          <chr>          <chr>          <chr>       
#> 1 2020-12-29 Austria        Europe         EU             Austria     
#> 2 2020-12-29 Bahrain        Asia           AS             Bahrain     
#> 3 2020-12-29 Belarus        Europe         EU             Belarus     
#> 4 2020-12-29 Belgium        Europe         EU             Belgium     
#> 5 2020-12-29 Canada         North America  NA             Canada      
#> 6 2020-12-29 Chile          South America  SA             Chile       
#> # … with 10 more variables: doses_admin <dbl>, people_at_least_one_dose <dbl>,
#> #   population <dbl>, uid <dbl>, iso2 <chr>, iso3 <chr>, code3 <dbl>,
#> #   fips <lgl>, lat <dbl>, long <dbl>

The plan, in the next version, is to add a similar function to the refresh_coronavirus_jhu function to pull the vaccine dataset.

Analysing the vaccine dataset

The dataset includes two cumulative counts of vaccine doses received:

  • people_at_least_one_dose - counts both single and multiple doses of vaccine as one dose
  • doses_admin - counts each dose received separately

So, for example, if a person received a multi doses vaccine such as the Pfizer or Moderna vaccine, here is the count by each method:

  • If the person completed the two doses, the people_at_least_one_dose will count it as one dose, and the doses_admin will count it as two
  • Similarly, if the person received only the first out of two doses and skipped the second, it will be counted in both variables as one

We can assume that the last case is rare; therefore, the people_at_least_one_dose could be a good proxy for fully vaccinated people. Let’s see the differences between the two by using the US vaccine data:

covid19_vaccine |> 
  filter(country_region == "US") |>
  select(date, doses_admin, total = people_at_least_one_dose) |>
  arrange(date) |>
  plot_ly() |>
  add_lines(x = ~ date,
            y = ~ doses_admin,
            name = "Total Doses") |>
  add_lines(x = ~ date,
            y = ~ total,
            name = "Recevied Vaccine with One/Two Doses") |>
  layout(title = "US Cumulative Number of Doses by Type",
         legend = list(x = 0.05, y = 0.95),
         margin = list(l = 50, r = 50, b = 50, t = 90),
         yaxis = list(title = "Number of Doses"),
         xaxis = list(title = ""))

Here are some thoughts about how to leverage this dataset:

  • While it is not entirely accurate, as mentioned above, the people_at_least_one_dose could provide a good proxy for the total number of fully vaccinated people
  • We can leverage the doses_admin column to calculate some stats about the total number of doses provided in a given period or overall.
  • Combining the vaccine and new cases datasets could provide insights into vaccine adoption and effectiveness.

Country level summary

Before going back and diving into the US data, let’s summarize the data by country. Since the data is cumulative, let’s take a snapshot of the most recent date in the data and calculate the ratio between the total number of doses and the size of the population:

max(covid19_vaccine$date)
#> [1] "2023-03-09"

df_summary <- covid19_vaccine |>
  filter(date == max(date)) |>
  select(date, country_region, doses_admin, total = people_at_least_one_dose, population, continent_name) |>
  mutate(doses_pop_ratio = doses_admin / population,
         total_pop_ratio = total / population) |>
  arrange(- total)

head(df_summary, 10)
#>          date country_region doses_admin      total population continent_name
#> 1  2023-03-09          World          NA 5549369464         NA           <NA>
#> 2  2023-03-09          China          NA 1310292000 1404676330           Asia
#> 3  2023-03-09          India          NA 1027379945 1380004385           Asia
#> 4  2023-03-09             US   672076105  269554116  329466283  North America
#> 5  2023-03-09      Indonesia   444303130  203657535  273523621           Asia
#> 6  2023-03-09         Brazil   502262440  189395212  212559409  South America
#> 7  2023-03-09       Pakistan   333759565  162219717  220892331           Asia
#> 8  2023-03-09     Bangladesh   355143411  151190373  164689383           Asia
#> 9  2023-03-09          Japan   382415648  104675948  126476458           Asia
#> 10 2023-03-09         Mexico   225063079   99071001  127792286  North America
#>    doses_pop_ratio total_pop_ratio
#> 1               NA              NA
#> 2               NA       0.9328071
#> 3               NA       0.7444759
#> 4         2.039893       0.8181539
#> 5         1.624368       0.7445702
#> 6         2.362927       0.8910225
#> 7         1.510960       0.7343837
#> 8         2.156444       0.9180335
#> 9         3.023611       0.8276319
#> 10        1.761163       0.7752502

It would be interesting to plot the ratio between the number of overall doses received and the size of the population. Let’s filter missing values and plot the ratio:

df_summary |> 
  filter(country_region != "World", 
         !is.na(population),
         !is.na(total)) |>
  plot_ly(x = ~ population,
          y = ~ total,
          text = ~ paste("Country: ", country_region, "<br>",
                         "Population: ", population, "<br>",
                         "Total Doses: ", total, "<br>",
                         "Ratio: ", round(total_pop_ratio, 2), 
                         sep = ""),
          type = "scatter",
          mode = "markers") |>
  layout(title = "Total Doses vs. Population",
         margin = list(l = 50, r = 50, b = 60, t = 70),
         yaxis = list(title = "Number of Doses"),
         xaxis = list(title = "Population Size"))

As can see, China and India are skewing the plot. We can re-scale the plot with log transformation of the number of doses and population size using Plotly’s built-in log transformation. Let’s also color by continent and add diagonal lines to represent the 1 to 1 relationship between the median and first quartile:


# Setting the diagonal lines range
line_start <- 10000
line_end <- 1500 * 10 ^ 6

# Filter the data
d <- df_summary |> 
  filter(country_region != "World", 
         !is.na(population),
         !is.na(total)) 


# Replot it
plot_ly() |>
  add_markers(x = d$population,
              y = d$total,
              text = ~ paste("Country: ", d$country_region, "<br>",
                             "Population: ", d$population, "<br>",
                             "Total Doses: ", d$total, "<br>",
                             "Ratio: ", round(d$total_pop_ratio, 2), 
                             sep = ""),
              color = d$continent_name,
              type = "scatter",
              mode = "markers") |>
  add_lines(x = c(line_start, line_end),
            y = c(line_start, line_end),
            showlegend = FALSE,
            line = list(color = "gray", width = 0.5)) |>
  add_lines(x = c(line_start, line_end),
            y = c(0.5 * line_start, 0.5 * line_end),
            showlegend = FALSE,
            line = list(color = "gray", width = 0.5)) |>
  
  add_lines(x = c(line_start, line_end),
            y = c(0.25 * line_start, 0.25 * line_end),
            showlegend = FALSE,
            line = list(color = "gray", width = 0.5)) |>
  add_annotations(text = "1:1",
                  x = log10(line_end * 1.25),
                  y = log10(line_end * 1.25),
                  showarrow = FALSE,
                  textangle = -25,
                  font = list(size = 8),
                  xref = "x",
                  yref = "y") |>
  add_annotations(text = "1:2",
                  x = log10(line_end * 1.25),
                  y = log10(0.5 * line_end * 1.25),
                  showarrow = FALSE,
                  textangle = -25,
                  font = list(size = 8),
                  xref = "x",
                  yref = "y") |>
  add_annotations(text = "1:4",
                  x = log10(line_end * 1.25),
                  y = log10(0.25 * line_end * 1.25),
                  showarrow = FALSE,
                  textangle = -25,
                  font = list(size = 8),
                  xref = "x",
                  yref = "y") |>
  add_annotations(text = "Source: Johns Hopkins University - Centers for Civic Impact",
                  showarrow = FALSE,
                  xref = "paper",
                  yref = "paper",
                  x = -0.05, y = - 0.33) |>
  layout(title = "Covid19 Vaccine - Total Doses vs. Population Ratio (Log Scale)",
         margin = list(l = 50, r = 50, b = 90, t = 70),
         yaxis = list(title = "Number of Doses",
                      type = "log"),
         xaxis = list(title = "Population Size",
                      type = "log"),
         legend = list(x = 0.75, y = 0.05))

Another way to represent the data is by using a box plot to plot the total doses and population ration distribution by continent:

plot_ly(d,
        y = ~ total_pop_ratio,
        color = ~ continent_name,
        type = "box", boxpoints = "all", jitter = 0.3,
        pointpos = -1.8
        ) |>
  layout(title = "Distribution of Total Doses Admited vs Population Size Ratio",
         margin = list(l = 50, r = 50, b = 60, t = 60),
         yaxis = list(title = "Total Doses/Population Ratio"),
         xaxis = list(title = "Continent")
    
  )

Plot US cases

Let’s now return to the number of doses in the US and plot the daily number of new cases with the daily number of vaccine doses received. We will use the people_at_least_one_dose column as a proxy for the number of people vaccinated. First, let’s reformat the data from cumulative to daily by subtracting from the cumulative count its first lag:

us_vaccine <- covid19_vaccine |> 
  filter(country_region == "US") |>
  select(date, total_doses = people_at_least_one_dose) |>
  mutate(total_doses_lag1 = lag(total_doses, 1),
         daily_doses = total_doses - total_doses_lag1)|>
  select(date, daily_doses) |>
  arrange(date) 


head(us_vaccine)
#>         date daily_doses
#> 1 2020-12-29          NA
#> 2 2020-12-30      708712
#> 3 2020-12-31      827734
#> 4 2021-01-01      415260
#> 5 2021-01-02      119786
#> 6 2021-01-03      259292

Let’s now plot the daily doses:

plot_ly(us_vaccine) |>
  add_lines(x = ~ date,
            y = ~ daily_doses)  |>
  layout(title = "US Daily Vaccine Doses Received",
         margin = list(l = 50, r = 50, b = 60, t = 60),
         yaxis = list(title = "Total Doses"),
         xaxis = list(title = ""))

We can now add the Covid19 cases by loading the dataset using the refresh_coronavirus_jhu function, filter it (to US), and merge it with the us_vaccine dataset we created above:


covid19_cases <- refresh_coronavirus_jhu()
#> Loading 2020 data
#> Loading 2021 data
#> Loading 2022 data
#> Loading 2023 data

head(covid19_cases)
#>         date    location location_type location_code location_code_type
#> 1 2021-12-31 Afghanistan       country            AF         iso_3166_2
#> 2 2020-03-24 Afghanistan       country            AF         iso_3166_2
#> 3 2022-11-02 Afghanistan       country            AF         iso_3166_2
#> 4 2020-03-23 Afghanistan       country            AF         iso_3166_2
#> 5 2021-08-09 Afghanistan       country            AF         iso_3166_2
#> 6 2023-03-02 Afghanistan       country            AF         iso_3166_2
#>       data_type value      lat     long
#> 1     cases_new    28 33.93911 67.70995
#> 2 recovered_new     0 33.93911 67.70995
#> 3     cases_new    98 33.93911 67.70995
#> 4 recovered_new     0 33.93911 67.70995
#> 5    deaths_new    28 33.93911 67.70995
#> 6     cases_new    18 33.93911 67.70995

us_cases <- covid19_cases |>
  filter(location == "US",
         data_type == "cases_new") |>
  select(date, cases = value)


head(us_cases)
#>         date  cases
#> 1 2021-08-07  63881
#> 2 2021-06-20   4925
#> 3 2021-08-09 158431
#> 4 2021-05-18  29008
#> 5 2021-08-08  35686
#> 6 2021-08-04 102507

Let’s now merge the two datasets and plot them:

df <- us_vaccine |> 
  left_join(us_cases, by = "date") |>
  select(date, daily_doses, daily_cases = cases) |>
  arrange(date)

head(df)
#>         date daily_doses daily_cases
#> 1 2020-12-29          NA      206657
#> 2 2020-12-30      708712      218774
#> 3 2020-12-31      827734      313782
#> 4 2021-01-01      415260      177570
#> 5 2021-01-02      119786      272445
#> 6 2021-01-03      259292      203162


plot_ly(df) |>
  add_lines(x = ~ date,
            y = ~ daily_cases,
            name = "Daily Cases") |>
  add_lines(x = ~ date,
            y = ~ daily_doses,
            name = "Daily Doses") |>
  layout(title = "US Daily New Cases vs. Doses of Vaccine",
         yaxis = list(title = "New Cases/Doses"),
         xaxis = list(title = ""),
         legend = list(x = 0.75, y = 0.95),
         margin = list(l = 50, r = 50, b = 50, t = 90))