The covid19sf_population dataset provides a daily summary of COVID-19 cases by demographic type. The examples below use the following groups: Age Group, Gender, Race/Ethnicity, and Homelessness.
Note: The groups are reported separately, so cross-group analysis (e.g., the distribution of cases by age and gender) is not possible.
Below are some examples of exploring the dataset by a specific demographic group, using the characteristic_type variable to filter the dataset.
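Before filtering, it can help to check which values characteristic_type takes and how many groups sit under each one. The following is a minimal sketch on a small made-up data frame (toy is hypothetical and only mirrors the column names of covid19sf_population); swapping in the real dataset should give the corresponding summary for the actual data.

```r
library(dplyr)

# Hypothetical toy data with the same column names as covid19sf_population
toy <- data.frame(
  characteristic_type  = c("Age Group", "Age Group", "Gender", "Gender"),
  characteristic_group = c("0-4", "5-11", "Female", "Male"),
  new_cases            = c(1L, 2L, 3L, 4L)
)

# Number of rows and distinct groups available under each demographic type
type_summary <- toy %>%
  group_by(characteristic_type) %>%
  summarise(rows = n(), groups = n_distinct(characteristic_group))

type_summary
```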
To get cases by age group, we will filter the dataset to Age Group:
library(dplyr)

df_age <- covid19sf_population %>%
  filter(characteristic_type == "Age Group")

head(df_age)
#>   specimen_collection_date characteristic_type characteristic_group
#> 1               2020-03-03           Age Group                  0-4
#> 2               2020-03-03           Age Group                 5-11
#> 3               2020-03-03           Age Group                12-17
#> 4               2020-03-03           Age Group                18-20
#> 5               2020-03-03           Age Group                21-24
#> 6               2020-03-03           Age Group                25-29
#>   characteristic_group_sort_order new_cases cumulative_cases
#> 1                               1        NA               NA
#> 2                               2        NA               NA
#> 3                               3        NA               NA
#> 4                               4        NA               NA
#> 5                               5        NA               NA
#> 6                               6        NA               NA
#>   population_estimate
#> 1               39353
#> 2               44153
#> 3               34664
#> 4               20407
#> 5               39944
#> 6              100792
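The first rows above show NA in both new_cases and cumulative_cases, so summary functions return NA unless missing values are dropped explicitly. The sketch below uses a made-up data frame (toy_age, hypothetical counts) shaped like df_age to count the missing days and total the reported cases per group.

```r
library(dplyr)

# Hypothetical rows shaped like df_age; the counts are made up
toy_age <- data.frame(
  characteristic_group = c("0-4", "0-4", "5-11", "5-11"),
  new_cases            = c(NA, 3L, NA, 5L)
)

# Count days with no reported value and sum the remaining cases per group
na_summary <- toy_age %>%
  group_by(characteristic_group) %>%
  summarise(
    missing_days = sum(is.na(new_cases)),
    total_cases  = sum(new_cases, na.rm = TRUE)
  )

na_summary
```

Without na.rm = TRUE, both totals in this toy example would be NA.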
The age groups in this dataset can be listed with unique(df_age$characteristic_group); the first six appear in the output above.
In the following example, we will use the plotly package to visualize the distribution of daily new cases by age group. First, let's sort the age groups and convert the characteristic_group variable to a factor whose levels follow characteristic_group_sort_order:
library(plotly)

age_order <- df_age %>%
  select(characteristic_group, characteristic_group_sort_order) %>%
  distinct() %>%
  arrange(characteristic_group_sort_order)

df_age$characteristic_group <- factor(df_age$characteristic_group,
                                      levels = age_order$characteristic_group)
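To see why the explicit levels matter, here is a small base R sketch with three made-up entries (groups and sort_order are hypothetical stand-ins for the dataset columns). Sorted as plain text, "12-17" comes before "5-11"; levels taken from the sort order keep the groups in age order, which is the order a plot should follow.

```r
# Made-up subset of age groups with their sort order (hypothetical values)
groups     <- c("5-11", "0-4", "12-17")
sort_order <- c(2, 1, 3)

# Plain character sorting compares text, so "12-17" lands before "5-11"
text_sorted <- sort(groups, method = "radix")
text_sorted

# Levels taken from the sort order keep the groups in age order
age_factor <- factor(groups, levels = groups[order(sort_order)])
levels(age_factor)
```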
Next, we will use the box plot type to create a box plot by age group:
plot_ly(df_age,
        color = ~ characteristic_group,
        y = ~ new_cases,
        boxpoints = "all",
        jitter = 0.3,
        pointpos = -1.8,
        type = "box") %>%
  layout(title = "Distribution of Daily New COVID-19 Cases in San Francisco by Age Group",
         yaxis = list(title = "Number of Cases"),
         xaxis = list(title = "Source: San Francisco Department of Public Health"),
         legend = list(x = 0.9, y = 0.9),
         margin = list(t = 60, b = 60, l = 60, r = 60))
As shown in the box plot above, the distribution of cases for age groups 20 to 29, 30 to 39, and 40 to 49 is relatively wider than for the other age groups. It is hard to draw conclusions about the distribution across age groups without knowing each group's share of the overall population. This information can be obtained from the population_estimate variable.
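As a rough illustration of how population_estimate can be used, the sketch below compares each group's share of all cases with its case rate per 1,000 residents. The population_estimate values are taken from the head(df_age) output above, while the cumulative_cases values and the toy_latest name are made up for the example.

```r
library(dplyr)

# population_estimate values come from the head(df_age) output above;
# the cumulative_cases values are hypothetical
toy_latest <- data.frame(
  characteristic_group = c("0-4", "5-11", "12-17"),
  cumulative_cases     = c(1000L, 2000L, 1500L),
  population_estimate  = c(39353L, 44153L, 34664L)
)

rates <- toy_latest %>%
  mutate(
    # Share of all cases that falls in each group
    pct_of_cases   = 100 * cumulative_cases / sum(cumulative_cases),
    # Cases relative to the size of the group
    cases_per_1000 = 1000 * cumulative_cases / population_estimate
  )

rates
```

A group can hold a large share of cases simply because it is large, so the per-population rate is usually the fairer basis for comparing groups.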
The next plot describes the distribution of the cumulative cases by age group as of the most recent date in the data:
df_age %>%
  filter(specimen_collection_date == max(specimen_collection_date)) %>%
  plot_ly(values = ~ cumulative_cases,
          labels = ~ characteristic_group,
          type = "pie",
          textposition = 'inside',
          textinfo = 'label+percent',
          insidetextfont = list(color = '#FFFFFF'),
          hoverinfo = 'text',
          text = ~paste(" Age Group:", characteristic_group, "<br>",
                        "Total:", cumulative_cases, "<br>",
                        "Population Estimation:", population_estimate,
                        paste("(", round(100 * cumulative_cases / population_estimate, 1), "%)", sep = ""))) %>%
  layout(title = ~ paste("Total Cases Dist. by Age Group as of", max(specimen_collection_date)),
         margin = list(t = 60, b = 20, l = 30, r = 60))
Similarly, we can review the distribution of cases by gender group:
df_gender <- covid19sf_population %>%
  filter(characteristic_type == "Gender")

head(df_gender)
#>   specimen_collection_date characteristic_type characteristic_group
#> 1               2020-03-03              Gender               Female
#> 2               2020-03-03              Gender                 Male
#> 3               2020-03-03              Gender         Trans Female
#> 4               2020-03-03              Gender           Trans Male
#> 5               2020-03-03              Gender                Other
#> 6               2020-03-03              Gender              Unknown
#>   characteristic_group_sort_order new_cases cumulative_cases
#> 1                               1        NA               NA
#> 2                               2        NA               NA
#> 3                               3        NA               NA
#> 4                               4        NA               NA
#> 5                               5        NA               NA
#> 6                               6        NA               NA
#>   population_estimate
#> 1                  NA
#> 2                  NA
#> 3                  NA
#> 4                  NA
#> 5                  NA
#> 6                  NA
The gender group has the following categories:
unique(df_gender$characteristic_group)
#> [1] "Female" "Male" "Trans Female" "Trans Male" "Other"
#> [6] "Unknown"
The following table provides the cumulative number of cases by gender as of the most recent date. We will use the pivot_wider function from the tidyr package to spread the data by gender group:
library(tidyr)

df_gender %>%
  filter(specimen_collection_date == max(specimen_collection_date)) %>%
  select(specimen_collection_date, characteristic_group, cumulative_cases) %>%
  pivot_wider(names_from = characteristic_group, values_from = cumulative_cases)
#> # A tibble: 1 × 7
#>   specimen_collection_da… Female  Male `Trans Female` `Trans Male` Other Unknown
#>   <date>                   <int> <int>          <int>        <int> <int>   <int>
#> 1 2021-12-11               25518 29303             38           14   183     185
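The raw counts are easier to compare as shares of the total. The sketch below re-enters the counts from the table above by hand (gender_totals is a hypothetical helper object, not part of the package) and adds a percentage column; with the real data, the same mutate step could be applied to the filtered df_gender before pivoting.

```r
library(dplyr)

# Cumulative cases by gender as of 2021-12-11, copied from the table above
gender_totals <- data.frame(
  characteristic_group = c("Female", "Male", "Trans Female",
                           "Trans Male", "Other", "Unknown"),
  cumulative_cases     = c(25518L, 29303L, 38L, 14L, 183L, 185L)
)

# Share of total cumulative cases per gender group, largest first
gender_share <- gender_totals %>%
  mutate(share_pct = round(100 * cumulative_cases / sum(cumulative_cases), 1)) %>%
  arrange(desc(cumulative_cases))

gender_share
```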
In the next example, we will plot the cumulative confirmed cases by race and ethnicity group:
covid19sf_population %>%
  filter(characteristic_type == "Race/Ethnicity") %>%
  arrange(specimen_collection_date) %>%
  plot_ly(x = ~ specimen_collection_date,
          y = ~ cumulative_cases,
          type = 'scatter',
          mode = 'none',
          color = ~characteristic_group,
          stackgroup = 'one') %>%
  layout(title = "San Francisco Total COVID-19 Confirmed Cases Dist. by Race and Ethnicity",
         legend = list(x = 0.05, y = 0.9),
         yaxis = list(title = "Number of Cases", tickformat = ".0f"),
         xaxis = list(title = "Source: San Francisco Department of Public Health"),
         hovermode = "compare")
The Homelessness group provides the number of new and cumulative COVID-19 cases among people experiencing homelessness in San Francisco.
The following plot shows the daily number of new COVID-19 cases by housing status:
covid19sf_population %>%
  filter(characteristic_type == "Homelessness") %>%
  plot_ly(x = ~ specimen_collection_date,
          y = ~ new_cases,
          color = ~ characteristic_group,
          type = "scatter",
          mode = "lines") %>%
  layout(title = "Confirmed New COVID-19 Cases by Housing Status",
         yaxis = list(title = "New Cases"),
         xaxis = list(title = "Source: San Francisco Department of Public Health"),
         hovermode = "compare")
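Daily counts like these tend to be noisy, and one optional way to make a trend easier to read is a trailing 7-day average. This step is not part of the original example; the sketch below uses made-up counts (new_cases here is a hypothetical vector, not the dataset column) and base R's stats::filter, called with its namespace because dplyr also exports a filter function.

```r
# Made-up daily counts for one group (hypothetical values)
new_cases <- c(2, 0, 5, 3, 1, 4, 6, 2, 3, 7)

# Trailing 7-day mean: each value averages the current day and the six before it.
# The first six positions are NA because a full window is not yet available.
avg_7d <- as.numeric(stats::filter(new_cases, rep(1 / 7, 7), sides = 1))

avg_7d
```

With the real data, the same call could be applied to new_cases within each characteristic_group after sorting by specimen_collection_date, and the result plotted in place of the raw series.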