Version 0.2.0 of the coronavirus R data package was pushed to CRAN today. The coronavirus package provides a tidy format for the Johns Hopkins University Center for Systems Science and Engineering (JHU CSSE) Coronavirus dataset. Version 0.2.0 catches up with the significant changes in the raw data that have taken place since the initial release on February 24, and changes the package status from experimental to maturing.

Key changes and new features

Below are the main updates for version 0.2.0:

  • Column names - the geographic location fields were updated to follow the changes in the raw data:
    • Province.State changed to province (US states were removed from the raw data)
    • Country.Region changed to country
  • The data on the Github version is automatically updated on a daily basis with Github Actions
  • update_dataset function - updates the installed version with new data that is available on the Github version (more details below)
  • The covid_south_korea and covid_iran datasets that were available on the dev version were removed from the package and moved to a new package, covid19wiki, which for now is available only on Github
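
Code written against the earlier field names can be adapted with a rename step. The following is a minimal sketch, assuming a small made-up data frame (old_format is a hypothetical stand-in, not package data) that still carries the old names:

```r
library(dplyr)

# Hypothetical stand-in for data in the pre-0.2.0 layout (values are made up)
old_format <- data.frame(
  Province.State = c("", "P1"),
  Country.Region = c("A", "B"),
  cases = c(0L, 5L),
  stringsAsFactors = FALSE
)

# Map the old field names to the names used since version 0.2.0
new_format <- old_format %>%
  rename(province = Province.State,
         country = Country.Region)

names(new_format)
```

Only the two geographic fields are renamed here; the remaining columns pass through unchanged.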

Data structure

The coronavirus dataset uses a long format and has the following fields:

  • date - The date of the summary
  • province - The province or state, when applicable
  • country - The country or region name
  • lat - Latitude point
  • long - Longitude point
  • type - The type of case (confirmed, death, or recovered)
  • cases - The number of daily cases (corresponding to the case type)

library(coronavirus)

head(coronavirus)
##         date province     country lat long      type cases
## 1 2020-01-22          Afghanistan  33   65 confirmed     0
## 2 2020-01-23          Afghanistan  33   65 confirmed     0
## 3 2020-01-24          Afghanistan  33   65 confirmed     0
## 4 2020-01-25          Afghanistan  33   65 confirmed     0
## 5 2020-01-26          Afghanistan  33   65 confirmed     0
## 6 2020-01-27          Afghanistan  33   65 confirmed     0
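
Because each row holds one date, one location, and one case type, extracting a daily series is a matter of filtering on country and type. A minimal sketch on a toy data frame with a subset of the same columns (toy and its values are made up for illustration, not real counts):

```r
library(dplyr)

# Toy data in the same long layout as the coronavirus dataset (made-up values)
toy <- data.frame(
  date = as.Date(c("2020-03-01", "2020-03-01", "2020-03-02",
                   "2020-03-02", "2020-03-01", "2020-03-02")),
  country = c("A", "A", "A", "A", "B", "B"),
  type = c("confirmed", "death", "confirmed",
           "death", "confirmed", "confirmed"),
  cases = c(10, 1, 20, 2, 5, 7),
  stringsAsFactors = FALSE
)

# Daily confirmed cases for a single country
daily_confirmed_a <- toy %>%
  filter(country == "A", type == "confirmed") %>%
  select(date, cases) %>%
  arrange(date)

daily_confirmed_a
```

The same pattern should carry over to the package data by replacing toy with coronavirus and "A" with a country name that appears in the country field.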

Keep the data updated

The coronavirus package provides data for an ongoing event that is updated on a daily basis. To let users update the CRAN installed version with the most recent data available on the Github version, I created the update_dataset function. This function checks whether new data is available on the Github version and updates the package if needed:

update_dataset(silence = TRUE)

If the silence argument is set to TRUE, the function installs updates automatically, without a prompt (the default is FALSE). More details are available in the following vignette.

To make the new data available, you will have to restart your R session.

Summarising and visualizing

Here are some examples of summarising and visualizing the data with the dplyr, tidyr, and plotly packages.

library(dplyr)
library(tidyr)
library(plotly)

Cases summary

We will start by grouping the dataset by case type and calculating the worldwide totals, which we then use to derive the current number of active cases and the recovery and death rates:

total_cases <- coronavirus %>% 
  group_by(type) %>%
  summarise(cases = sum(cases)) %>%
  mutate(type = factor(type, levels = c("confirmed", "death", "recovered")))

total_cases
## # A tibble: 3 x 2
##   type        cases
##   <fct>       <int>
## 1 confirmed 4261747
## 2 death      291942
## 3 recovered 1493414

The total active cases are the difference between the total confirmed cases and the total recovered and death cases:

total_cases$cases[1] - total_cases$cases[2] - total_cases$cases[3]
## [1] 2476391

The worldwide recovery rate is:

round(100 * total_cases$cases[3] / total_cases$cases[1], 2)
## [1] 35.04

And the worldwide death rate is:

round(100 * total_cases$cases[2] / total_cases$cases[1], 2)
## [1] 6.85
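
The calculations above index total_cases by row position, which relies on the rows being ordered confirmed, death, recovered. A sketch of an alternative that looks the totals up by name, using the totals printed in the summary table above (the names totals, active, recovery_rate, and death_rate are illustrative):

```r
# Totals copied from the summary table above.
# With the package data loaded, the same named vector could be built with:
# totals <- setNames(total_cases$cases, as.character(total_cases$type))
totals <- c(confirmed = 4261747, death = 291942, recovered = 1493414)

# Active cases: confirmed minus death and recovered
active <- totals[["confirmed"]] - totals[["death"]] - totals[["recovered"]]

# Rates as a percentage of confirmed cases
recovery_rate <- round(100 * totals[["recovered"]] / totals[["confirmed"]], 2)
death_rate <- round(100 * totals[["death"]] / totals[["confirmed"]], 2)

c(active = active, recovery_rate = recovery_rate, death_rate = death_rate)
```

Looking values up by name keeps the arithmetic correct even if the row order of the summary changes.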

The following plot presents the distribution of cases (active, recovered, and death) over time:

coronavirus %>% 
  group_by(type, date) %>%
  summarise(total_cases = sum(cases)) %>%
  # Drop the remaining grouping before reshaping
  ungroup() %>%
  # One row per date, one column per case type
  pivot_wider(names_from = type, values_from = total_cases) %>%
  arrange(date) %>%
  # Daily net change in active cases
  mutate(active = confirmed - death - recovered) %>%
  # Cumulative totals over time
  mutate(active_total = cumsum(active),
         recovered_total = cumsum(recovered),
         death_total = cumsum(death)) %>%
  plot_ly(x = ~ date,
          y = ~ active_total,
          name = "Active",
          fillcolor = "#1f77b4",
          type = "scatter",
          mode = "none",
          stackgroup = "one") %>%
  add_trace(y = ~ death_total,
            name = "Death",
            fillcolor = "#E41317") %>%
  add_trace(y = ~ recovered_total,
            name = "Recovered",
            fillcolor = "forestgreen") %>%
  layout(title = "Distribution of Covid19 Cases Worldwide",
         legend = list(x = 0.1, y = 0.9),
         yaxis = list(title = "Number of Cases"),
         xaxis = list(title = "Source: Johns Hopkins University Center for Systems Science and Engineering"))
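
To make the reshaping and accumulation steps easier to follow without the plot, here is a small worked sketch on made-up daily counts (the daily data frame is hypothetical and holds one row per date and case type):

```r
library(dplyr)
library(tidyr)

# Made-up daily counts for three dates and three case types
daily <- data.frame(
  date = as.Date(rep(c("2020-03-01", "2020-03-02", "2020-03-03"), times = 3)),
  type = rep(c("confirmed", "death", "recovered"), each = 3),
  cases = c(10, 20, 30,  # confirmed
            0, 1, 2,     # death
            0, 4, 6),    # recovered
  stringsAsFactors = FALSE
)

cumulative <- daily %>%
  # Long to wide: one column per case type
  pivot_wider(names_from = type, values_from = cases) %>%
  arrange(date) %>%
  # Daily net change in active cases, then its running total
  mutate(active = confirmed - death - recovered,
         active_total = cumsum(active))

cumulative$active_total
```

The daily net changes here are 10, 15, and 22, so the running total of active cases is 10, 25, and 47.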

Distribution of confirmed cases by country

The next plot summarizes the distribution of confirmed cases by country using a treemap plot:

conf_df <- coronavirus %>% 
  filter(type == "confirmed") %>%
  group_by(country) %>%
  summarise(total_cases = sum(cases)) %>%
  arrange(-total_cases) %>%
  mutate(parents = "Confirmed") %>%
  ungroup()

plot_ly(data = conf_df,
        type = "treemap",
        values = ~ total_cases,
        labels = ~ country,
        parents = ~ parents,
        domain = list(column = 0),
        name = "Confirmed",
        textinfo = "label+value+percent parent")

Package contributors

I would like to thank everyone who contributed to the package development, asked questions, and reported or filed issues about the data.

Special thanks to Amanda Dobbyn (@dobbleobble) and Jarrett Byrnes (@jebyrnes) for their pull request and suggestion that led to the kickoff of the covid19R project, and to Mine Cetinkaya-Rundel (@minebocek) for providing a better format for the dataset documentation!