v3_sankey.Rmd
A Sankey diagram is a common method for visualizing the distribution of a numeric variable across multiple categorical variables. The sankey_ly function is a wrapper for the plotly Sankey functionality. It lets you create a Sankey diagram quickly, without manual data transformation or node mapping.
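To make "data transformation and node mapping" concrete, here is a minimal sketch of the bookkeeping a plotly Sankey trace needs when it is built by hand. The toy data frame and the variable names are illustrative assumptions, and this is not a description of the internals of sankey_ly:

```r
# Toy aggregated data (illustrative values, not taken from sfo_passengers)
toy <- data.frame(
  geo_summary = c("Domestic", "International", "International"),
  geo_region  = c("US", "Asia", "Europe"),
  total       = c(100, 30, 50),
  stringsAsFactors = FALSE
)

# Every distinct label becomes one node
node_labels <- unique(c(toy$geo_summary, toy$geo_region))

# plotly refers to nodes by zero-based position, so shift match() by one
link_source <- match(toy$geo_summary, node_labels) - 1
link_target <- match(toy$geo_region, node_labels) - 1

# Build the diagram only if plotly is installed
if (requireNamespace("plotly", quietly = TRUE)) {
  p <- plotly::plot_ly(
    type = "sankey",
    node = list(label = node_labels),
    link = list(source = link_source, target = link_target, value = toy$total)
  )
}
```

With more than two categorical columns, the same mapping has to be repeated for each pair of adjacent columns.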
The function has the following arguments:
- x: an aggregated data.frame object with at least two categorical variables and one numeric variable
- cat_cols: a vector of at least two categorical column names
- num_col: a single numeric column name
- title: optional, sets the plot title

Let’s see some use cases of the function with the sfo_passengers dataset:
library(sfo)
data("sfo_passengers")
str(sfo_passengers)
#> 'data.frame': 50730 obs. of 12 variables:
#> $ activity_period : int 202212 202212 202212 202212 202212 202212 202212 202212 202212 202212 ...
#> $ operating_airline : chr "EVA Airways" "EVA Airways" "Emirates" "Emirates" ...
#> $ operating_airline_iata_code: chr "BR" "BR" "EK" "EK" ...
#> $ published_airline : chr "EVA Airways" "EVA Airways" "Emirates" "Emirates" ...
#> $ published_airline_iata_code: chr "BR" "BR" "EK" "EK" ...
#> $ geo_summary : chr "International" "International" "International" "International" ...
#> $ geo_region : chr "Asia" "Asia" "Middle East" "Middle East" ...
#> $ activity_type_code : chr "Deplaned" "Enplaned" "Deplaned" "Enplaned" ...
#> $ price_category_code : chr "Other" "Other" "Other" "Other" ...
#> $ terminal : chr "International" "International" "International" "International" ...
#> $ boarding_area : chr "G" "G" "A" "A" ...
#> $ passenger_count : int 12405 15151 13131 14985 2543 2883 1772 1370 2817 1987 ...
In the sfo_passengers dataset, ignoring the date indicator (activity_period), we have one numeric variable, passenger_count, and ten categorical variables. For the following examples, we will focus on the total number of passengers during 2022, the last full year in the dataset. The examples below assume this 2022 subset is stored in a data frame named d.
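The filtering step that creates d is not shown in the text. A minimal sketch of it, assuming the dplyr package is used (the later examples rely on %>%, group_by, and summarise) and that activity_period stores the month as an integer in YYYYMM form, as the str output suggests:

```r
library(dplyr)
library(sfo)

data("sfo_passengers")

# Keep only the 2022 records; activity_period is an integer such as 202212
d <- sfo_passengers %>%
  filter(activity_period >= 202201, activity_period <= 202212)
```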
Now we can start exploring the distribution of passengers across different combinations of categorical variables. Let’s start with a simple example, plotting the distribution of passengers by geo_summary and geo_region. The sankey_ly function requires the data to be aggregated by the categorical variables:
d1 <- d %>%
  group_by(geo_summary, geo_region) %>%
  summarise(total = sum(passenger_count), .groups = "drop")

head(d1)
#> # A tibble: 6 × 3
#> geo_summary geo_region total
#> <chr> <chr> <int>
#> 1 Domestic US 46955185
#> 2 International Asia 3102633
#> 3 International Australia / Oceania 753566
#> 4 International Canada 1919214
#> 5 International Central America 431108
#> 6 International Europe 4905078
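Because the function expects one row per combination of categories, a quick sanity check before plotting can help. The helper below is a hypothetical convenience written for illustration (it is not part of the sfo package), shown on toy data:

```r
# Hypothetical helper: TRUE when each combination of cat_cols appears only once
is_aggregated <- function(x, cat_cols) {
  !any(duplicated(x[, cat_cols, drop = FALSE]))
}

# Toy data with a repeated combination (illustrative values)
toy_raw <- data.frame(
  geo_summary = c("Domestic", "Domestic", "International"),
  geo_region  = c("US", "US", "Asia"),
  total       = c(10, 20, 5)
)

# Base R aggregation, equivalent in spirit to group_by() + summarise()
toy_agg <- aggregate(total ~ geo_summary + geo_region, data = toy_raw, FUN = sum)

is_aggregated(toy_raw, c("geo_summary", "geo_region"))  # repeated rows present
is_aggregated(toy_agg, c("geo_summary", "geo_region"))  # one row per combination
```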
Now that the data is ready, we can visualize it:
sankey_ly(x = d1,
          cat_cols = c("geo_summary", "geo_region"),
          num_col = "total",
          title = "Distribution of Passengers by Geo Type and Region During 2022")
Similarly, we can add additional variables:
d %>%
  filter(operating_airline == "United Airlines") %>%
  group_by(operating_airline, activity_type_code, geo_summary, geo_region) %>%
  summarise(total = sum(passenger_count), .groups = "drop") %>%
  sankey_ly(cat_cols = c("operating_airline", "geo_summary", "geo_region", "activity_type_code"),
            num_col = "total",
            title = "Distribution of United Airlines Passengers at SFO During 2022")
One pitfall to be aware of: the function is sensitive to the labels (values) of the categorical variables. If two variables share the same label, the function will fail to map the different categories correctly. The following example, where both the geo_summary and terminal variables include the label International, demonstrates this issue:
d %>%
  filter(operating_airline == "United Airlines") %>%
  group_by(operating_airline, activity_type_code, geo_summary, geo_region, terminal) %>%
  summarise(total = sum(passenger_count), .groups = "drop") %>%
  sankey_ly(cat_cols = c("operating_airline", "geo_summary", "geo_region", "activity_type_code", "terminal"),
            num_col = "total",
            title = "Distribution of United Airlines Passengers at SFO During 2022")
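One way to spot this problem before plotting is to look for labels that occur in more than one categorical column. The function below is a hypothetical helper written for illustration (it is not part of the sfo package), and the toy values are made up:

```r
# Hypothetical helper: labels that appear in more than one of cat_cols
shared_labels <- function(x, cat_cols) {
  # Unique labels per column, so repeats within a column are not counted
  per_col <- lapply(cat_cols, function(col) unique(as.character(x[[col]])))
  all_labels <- unlist(per_col)
  unique(all_labels[duplicated(all_labels)])
}

toy <- data.frame(
  geo_summary = c("International", "Domestic"),
  terminal    = c("International", "Terminal 3")
)

shared_labels(toy, c("geo_summary", "terminal"))
```

An empty result means that no label is shared between the listed columns.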
To overcome this issue, you have to ensure that the variables’ labels do not overlap. Let’s modify the cases where the terminal variable is set to International, changing them to international (I -> i):
d %>%
  filter(operating_airline == "United Airlines") %>%
  # Lower-case the terminal label so it no longer matches the geo_summary label
  mutate(terminal = ifelse(terminal == "International", "international", terminal)) %>%
  group_by(operating_airline, activity_type_code, geo_summary, geo_region, terminal) %>%
  summarise(total = sum(passenger_count), .groups = "drop") %>%
  sankey_ly(cat_cols = c("operating_airline", "terminal", "geo_summary", "geo_region", "activity_type_code"),
            num_col = "total",
            title = "Distribution of United Airlines Passengers at SFO During 2022")
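Changing the case works here, but a more general way to guarantee distinct labels is to tag the values of one column with the column name. A minimal sketch on toy data, using base R only; the suffix format and the toy values are arbitrary choices for illustration:

```r
toy <- data.frame(
  geo_summary = c("International", "Domestic"),
  terminal    = c("International", "Terminal 3"),
  stringsAsFactors = FALSE
)

# Append the column name so terminal labels cannot collide with geo_summary labels
toy$terminal <- paste0(toy$terminal, " (terminal)")

# Labels shared by the two columns after the change
intersect(toy$geo_summary, toy$terminal)
```

The trade-off is longer node labels in the plot, so this is mainly useful when many columns share values.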