vignettes/intro_coronavirus_dataset.Rmd
The coronavirus dataset provides a daily summary of COVID-19 cases by geographic location (i.e., country/province). It includes total daily confirmed, death, and recovery [1] cases. Let's load the dataset from the coronavirus package:
library(coronavirus)
data(coronavirus)
The dataset has the following fields:
- date: The date of the summary
- province: The province or state, when applicable
- country: The country or region name
- lat: Latitude point
- long: Longitude point
- type: The type of case (i.e., confirmed, death)
- cases: The number of daily cases (corresponding to the case type)
- uid: Country code
- iso2: Officially assigned two-letter country code identifier
- iso3: Officially assigned three-letter country code identifier
- code3: UN country code
- combined_key: Country and province (if applicable)
- population: Country or province population
- continent_name: Continent name
- continent_code: Continent code

We can use the head and str functions to see the structure of the dataset:
head(coronavirus)
#> date province country lat long type cases uid iso2 iso3
#> 1 2020-01-22 Alberta Canada 53.9333 -116.5765 confirmed 0 12401 CA CAN
#> 2 2020-01-23 Alberta Canada 53.9333 -116.5765 confirmed 0 12401 CA CAN
#> 3 2020-01-24 Alberta Canada 53.9333 -116.5765 confirmed 0 12401 CA CAN
#> 4 2020-01-25 Alberta Canada 53.9333 -116.5765 confirmed 0 12401 CA CAN
#> 5 2020-01-26 Alberta Canada 53.9333 -116.5765 confirmed 0 12401 CA CAN
#> 6 2020-01-27 Alberta Canada 53.9333 -116.5765 confirmed 0 12401 CA CAN
#> code3 combined_key population continent_name continent_code
#> 1 124 Alberta, Canada 4413146 North America NA
#> 2 124 Alberta, Canada 4413146 North America NA
#> 3 124 Alberta, Canada 4413146 North America NA
#> 4 124 Alberta, Canada 4413146 North America NA
#> 5 124 Alberta, Canada 4413146 North America NA
#> 6 124 Alberta, Canada 4413146 North America NA
str(coronavirus)
#> 'data.frame': 973836 obs. of 15 variables:
#> $ date : Date, format: "2020-01-22" "2020-01-23" ...
#> $ province : chr "Alberta" "Alberta" "Alberta" "Alberta" ...
#> $ country : chr "Canada" "Canada" "Canada" "Canada" ...
#> $ lat : num 53.9 53.9 53.9 53.9 53.9 ...
#> $ long : num -117 -117 -117 -117 -117 ...
#> $ type : chr "confirmed" "confirmed" "confirmed" "confirmed" ...
#> $ cases : num 0 0 0 0 0 0 0 0 0 0 ...
#> $ uid : num 12401 12401 12401 12401 12401 ...
#> $ iso2 : chr "CA" "CA" "CA" "CA" ...
#> $ iso3 : chr "CAN" "CAN" "CAN" "CAN" ...
#> $ code3 : num 124 124 124 124 124 124 124 124 124 124 ...
#> $ combined_key : chr "Alberta, Canada" "Alberta, Canada" "Alberta, Canada" "Alberta, Canada" ...
#> $ population : num 4413146 4413146 4413146 4413146 4413146 ...
#> $ continent_name: chr "North America" "North America" "North America" "North America" ...
#> $ continent_code: chr "NA" "NA" "NA" "NA" ...
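The structure above is a long format: one row per date, location, and case type. As a minimal sketch of how that layout can be sanity-checked, the snippet below builds a small toy data frame (the name `toy` and its values are hypothetical, and only a few of the columns are kept) and inspects the case types, the date range, and whether any date/country/type combination is repeated. The same calls should apply to the coronavirus data frame, where province would presumably also be part of the key.

```r
# Toy stand-in for the coronavirus data frame (hypothetical values),
# keeping only a few of the columns listed above.
toy <- data.frame(
  date    = as.Date(c("2020-01-22", "2020-01-22", "2020-01-23", "2020-01-23")),
  country = "Canada",
  type    = c("confirmed", "death", "confirmed", "death"),
  cases   = c(3, 0, 5, 1),
  stringsAsFactors = FALSE
)

# Which case types are present, and what period do they cover?
case_types <- sort(unique(toy$type))
date_span  <- range(toy$date)

# In a long-format table, each date/country/type combination
# is expected to appear exactly once.
n_duplicates <- sum(duplicated(toy[, c("date", "country", "type")]))

print(case_types)
print(date_span)
print(n_duplicates)
```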
[1] Recovery data has been discontinued as of Aug 5th; please see the following issue for more details.
In the example below, we will use the dplyr and tidyr packages to query, transform, reshape, and keep the data tidy, the plotly package to plot the data, and the DT package to view it:
library(dplyr)
library(tidyr)
library(plotly)
library(DT)
Let's start by summarizing the total number of cases by type as of 2023-03-09:
total_cases <- coronavirus %>%
  filter(type != "recovery") %>%
  group_by(type) %>%
  summarise(cases = sum(cases)) %>%
  mutate(type = factor(type, levels = c("confirmed", "death")))
total_cases
#> # A tibble: 2 × 2
#> type cases
#> <fct> <dbl>
#> 1 confirmed 676570149
#> 2 death 6881802
Likewise, we can summarise the data by continent using the continent_name field:
coronavirus %>%
  filter(type != "recovery") %>%
  group_by(type, continent_name) %>%
  summarise(cases = sum(cases), .groups = "drop") %>%
  mutate(type = factor(type, levels = c("confirmed", "death"))) %>%
  pivot_wider(names_from = type, values_from = cases) %>%
  mutate(death_rate = death / confirmed) %>%
  filter(!is.na(continent_name)) %>%
  arrange(-death_rate) %>%
  datatable(rownames = FALSE,
            colnames = c("Continent", "Confirmed Cases", "Death Cases", "Death Rate %")) %>%
  formatPercentage("death_rate", 2)
Using the totals in total_cases, we can derive the worldwide death rate (as a percentage) as of the same date:
round(100 * total_cases$cases[2] / total_cases$cases[1], 2)
#> [1] 1.02
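The calculation above picks the totals by row position (cases[2] and cases[1]), which depends on the row order of total_cases. As a hedged alternative, the sketch below looks the values up by case type instead. The data frame is rebuilt here from the totals printed earlier so that the snippet runs on its own, and the names `totals` and `death_rate_pct` are illustrative only.

```r
# Totals copied from the total_cases output shown earlier.
total_cases <- data.frame(
  type  = c("confirmed", "death"),
  cases = c(676570149, 6881802)
)

# Index the totals by case type so the calculation does not depend on row order.
# as.character() also covers the case where type is stored as a factor.
totals <- setNames(total_cases$cases, as.character(total_cases$type))

death_rate_pct <- round(100 * totals[["death"]] / totals[["confirmed"]], 2)
print(death_rate_pct)
#> [1] 1.02
```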
Let's group the data by the date and type fields, aggregate the daily cases (confirmed and death) to the worldwide level, and plot the two series in stacked panels:
df <- coronavirus %>%
  filter(type != "recovery") %>%
  group_by(date, type) %>%
  summarise(total = sum(cases), .groups = "drop")

# Daily confirmed cases (top panel)
p_1 <- plot_ly(data = df %>% filter(type == "confirmed"),
               x = ~date,
               y = ~total,
               name = "Confirmed",
               type = "scatter",
               mode = "lines") %>%
  layout(yaxis = list(title = "Cases"),
         xaxis = list(title = ""))

# Daily death cases (bottom panel)
p_2 <- plot_ly(data = df %>% filter(type == "death"),
               x = ~date,
               y = ~total,
               name = "Death",
               line = list(color = "red"),
               type = "scatter",
               mode = "lines") %>%
  layout(yaxis = list(title = "Cases"),
         xaxis = list(title = "Source: Johns Hopkins University Center for Systems Science and Engineering"))

subplot(p_1, p_2, nrows = 2,
        titleX = TRUE,
        titleY = TRUE) %>%
  layout(title = "Worldwide - Daily Confirmed and Death Cases",
         margin = list(t = 60, b = 60, l = 40, r = 40),
         legend = list(x = 0.05, y = 1))
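Because the cases field holds daily counts, a cumulative view has to be derived explicitly. The following is a minimal sketch of one way to do that with dplyr, assuming a data frame shaped like df above (date, type, total). The toy values and the names `daily` and `cumulative_total` are hypothetical.

```r
library(dplyr)

# Toy daily totals in the same shape as df above (hypothetical values).
daily <- data.frame(
  date  = as.Date(rep(c("2020-01-22", "2020-01-23", "2020-01-24"), each = 2)),
  type  = rep(c("confirmed", "death"), times = 3),
  total = c(10, 0, 15, 1, 20, 2)
)

# Running total per case type, ordered by date within each type.
cumulative <- daily %>%
  arrange(type, date) %>%
  group_by(type) %>%
  mutate(cumulative_total = cumsum(total)) %>%
  ungroup()

print(cumulative)
```

The resulting cumulative_total column could be passed to plot_ly in place of total to draw cumulative curves.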
The next table provides an overview of the ten countries with the highest number of confirmed cases. We will use the datatable function from the DT package to view the table:
confirmed_country <- coronavirus %>%
  filter(type == "confirmed") %>%
  group_by(country) %>%
  summarise(total_cases = sum(cases)) %>%
  mutate(perc = total_cases / sum(total_cases)) %>%
  arrange(-total_cases)

confirmed_country %>%
  head(10) %>%
  datatable(rownames = FALSE,
            colnames = c("Country", "Cases", "Perc of Total")) %>%
  formatPercentage("perc", 2)
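Note that perc is computed before the table is cut to ten rows, so each share refers to the worldwide total and not to the top ten alone. The sketch below illustrates the difference on toy data. The country names and counts are hypothetical, and slice_max (available in dplyr 1.0.0 and later) is used here in place of arrange plus head.

```r
library(dplyr)

# Toy country totals (hypothetical values).
country_totals <- data.frame(
  country     = c("A", "B", "C", "D"),
  total_cases = c(50, 30, 15, 5)
)

# Share computed on all countries, then truncated:
# the shares refer to the overall total.
share_then_top <- country_totals %>%
  mutate(perc = total_cases / sum(total_cases)) %>%
  slice_max(total_cases, n = 2)

# Truncated first, then share computed:
# the shares refer only to the two selected countries.
top_then_share <- country_totals %>%
  slice_max(total_cases, n = 2) %>%
  mutate(perc = total_cases / sum(total_cases))

print(share_then_top)
print(top_then_share)
```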
The next plot summarizes the distribution of confirmed cases by country:
conf_df <- coronavirus %>%
  filter(type == "confirmed") %>%
  group_by(country) %>%
  summarise(total_cases = sum(cases)) %>%
  arrange(-total_cases) %>%
  mutate(parents = "Confirmed") %>%
  ungroup()

plot_ly(data = conf_df,
        type = "treemap",
        values = ~total_cases,
        labels = ~country,
        parents = ~parents,
        domain = list(column = 0),
        name = "Confirmed",
        textinfo = "label+value+percent parent")
Similarly, we can use the pivot_wider function from the tidyr package (in addition to the dplyr functions we used above) to get an overview of the confirmed and death cases and calculate the death rates:
coronavirus %>%
  # Drop the discontinued recovery series so the reshaped table has exactly
  # the four columns named in colnames below
  filter(country != "Others", type != "recovery") %>%
  group_by(country, type) %>%
  summarise(total_cases = sum(cases), .groups = "drop") %>%
  pivot_wider(names_from = type, values_from = total_cases) %>%
  arrange(-confirmed) %>%
  mutate(death_rate = death / confirmed) %>%
  datatable(rownames = FALSE,
            colnames = c("Country", "Confirmed", "Death", "Death Rate")) %>%
  formatPercentage("death_rate", 2)
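To make the reshaping step easier to follow, here is a small, self-contained sketch of the long-to-wide transformation that pivot_wider performs, using toy data in place of the coronavirus dataset. The country names, counts, and the names `long_totals` and `wide_totals` are hypothetical.

```r
library(dplyr)
library(tidyr)

# Toy long-format totals (hypothetical values):
# one row per country and case type.
long_totals <- data.frame(
  country     = c("A", "A", "B", "B"),
  type        = c("confirmed", "death", "confirmed", "death"),
  total_cases = c(1000, 20, 500, 5)
)

# Spread the case types into columns (one row per country),
# then derive the death rate from the two new columns.
wide_totals <- long_totals %>%
  pivot_wider(names_from = type, values_from = total_cases) %>%
  mutate(death_rate = death / confirmed) %>%
  arrange(desc(confirmed))

print(wide_totals)
```

Each distinct value of type becomes its own column, which is why any extra case type left in the long data would show up as an additional column in the wide table.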