Create Sankey Diagram • sfo

A Sankey diagram is a common method for visualizing a numeric variable’s distribution across multiple categorical variables. The sankey_ly function is a wrapper for the plotly Sankey functionality. It lets you create a Sankey diagram quickly, without transforming the data or mapping the nodes by hand.
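
To give a sense of the node mapping that the wrapper saves you from, here is a minimal base R sketch of the idea. It is an illustration under stated assumptions, not the actual internals of sankey_ly: the toy data frame and its values are made up, and the code simply follows the general shape of a plotly Sankey trace, which takes a vector of node labels plus links given as zero-based source and target indices with a value.

```r
# Toy aggregated data (hypothetical values, for illustration only)
df <- data.frame(
  geo_summary = c("Domestic", "International", "International"),
  geo_region  = c("US", "Asia", "Europe"),
  total       = c(100, 30, 50),
  stringsAsFactors = FALSE
)

cat_cols <- c("geo_summary", "geo_region")

# 1. Every unique label across the categorical columns becomes a node
labels <- unique(unlist(lapply(cat_cols, function(col) df[[col]])))

# 2. Each pair of adjacent columns contributes links; node positions are
#    converted to zero-based indices
links <- do.call(rbind, lapply(seq_len(length(cat_cols) - 1), function(i) {
  pair <- cat_cols[c(i, i + 1)]
  agg <- aggregate(df["total"], by = df[pair], FUN = sum)
  data.frame(
    source = match(agg[[1]], labels) - 1,
    target = match(agg[[2]], labels) - 1,
    value  = agg$total
  )
}))

labels
links
```

Because nodes are looked up by label in this sketch, a label that appears in two different columns would resolve to a single node. That is worth keeping in mind when choosing the categorical columns.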

The function has the following arguments:

x - an aggregated data.frame object with at least two categorical variables and one numeric variable
cat_cols - a character vector with the names of at least two categorical columns
num_col - the name of a single numeric column
title - optional, sets the title of the plot
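
The requirements above can be checked before plotting. The helper below is a hypothetical sketch in base R; check_sankey_input is not part of the sfo package, and the "one row per combination" check is my reading of what "aggregated" means here.

```r
# Hypothetical helper, not part of the sfo package
check_sankey_input <- function(x, cat_cols, num_col) {
  stopifnot(
    is.data.frame(x),
    length(cat_cols) >= 2,                     # at least two categorical columns
    length(num_col) == 1,                      # exactly one numeric column
    all(c(cat_cols, num_col) %in% names(x)),   # all columns exist in x
    is.numeric(x[[num_col]])
  )
  # Assumption: aggregated data has one row per combination of categories
  if (anyDuplicated(x[, cat_cols, drop = FALSE]) > 0) {
    stop("`x` is not aggregated: duplicated category combinations found")
  }
  invisible(TRUE)
}

# Toy input (made-up values)
toy <- data.frame(
  geo_summary = c("Domestic", "International"),
  geo_region  = c("US", "Asia"),
  total       = c(100, 30)
)

check_sankey_input(toy, cat_cols = c("geo_summary", "geo_region"), num_col = "total")
```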

Usage

Let’s see some use cases of the function with the sfo_passengers dataset:

library(sfo)

data("sfo_passengers")

str(sfo_passengers)
#> 'data.frame':    50730 obs. of  12 variables:
#>  $ activity_period            : int  202212 202212 202212 202212 202212 202212 202212 202212 202212 202212 ...
#>  $ operating_airline          : chr  "EVA Airways" "EVA Airways" "Emirates" "Emirates" ...
#>  $ operating_airline_iata_code: chr  "BR" "BR" "EK" "EK" ...
#>  $ published_airline          : chr  "EVA Airways" "EVA Airways" "Emirates" "Emirates" ...
#>  $ published_airline_iata_code: chr  "BR" "BR" "EK" "EK" ...
#>  $ geo_summary                : chr  "International" "International" "International" "International" ...
#>  $ geo_region                 : chr  "Asia" "Asia" "Middle East" "Middle East" ...
#>  $ activity_type_code         : chr  "Deplaned" "Enplaned" "Deplaned" "Enplaned" ...
#>  $ price_category_code        : chr  "Other" "Other" "Other" "Other" ...
#>  $ terminal                   : chr  "International" "International" "International" "International" ...
#>  $ boarding_area              : chr  "G" "G" "A" "A" ...
#>  $ passenger_count            : int  12405 15151 13131 14985 2543 2883 1772 1370 2817 1987 ...

In the case of the sfo_passengers dataset, ignoring the date indicator, we have one numeric variable, passenger_count, and ten categorical variables. For the following examples, we will focus on the total number of passengers during 2022, the last full year in the dataset:

library(dplyr)

d <- sfo_passengers %>% filter(activity_period >= 202201 & activity_period < 202301)

Now, we can start exploring the distribution of the passengers across different combinations of categorical variables. Let’s start with a simple example, plotting the distribution of passengers by geo_summary and geo_region. The sankey_ly function requires the data to be aggregated by the categorical variables:

d1 <- d %>% group_by(geo_summary, geo_region) %>%
  summarise(total = sum(passenger_count), .groups = "drop")

head(d1)
#> # A tibble: 6 × 3
#>   geo_summary   geo_region             total
#>   <chr>         <chr>                  <int>
#> 1 Domestic      US                  46955185
#> 2 International Asia                 3102633
#> 3 International Australia / Oceania   753566
#> 4 International Canada               1919214
#> 5 International Central America       431108
#> 6 International Europe               4905078

Now that the data is ready, we can visualize it:

sankey_ly(x = d1, 
          cat_cols = c("geo_summary", "geo_region"), 
          num_col = "total", 
          title = "Distribution of Passengers by Geo Type and Region During 2022")

Similarly, we can add additional variables:

d %>% 
  filter(operating_airline == "United Airlines") %>%
  group_by(operating_airline, activity_type_code, geo_summary, geo_region) %>%
  summarise(total = sum(passenger_count), .groups = "drop") %>%
  sankey_ly(cat_cols = c("operating_airline", "geo_summary", "geo_region", "activity_type_code"), 
            num_col = "total",
            title = "Distribution of United Airlines Passengers at SFO During 2022")

Failure!

Here is a pitfall you should be aware of: the function is sensitive to the labels of the categorical variables. If two variables share the same label, it will fail to map the categories correctly. The following example, where both the geo_summary and terminal variables contain the label International, demonstrates this issue:

d %>% 
  filter(operating_airline == "United Airlines") %>%
  group_by(operating_airline, activity_type_code, geo_summary, geo_region, terminal) %>%
  summarise(total = sum(passenger_count), .groups = "drop") %>%
  sankey_ly(cat_cols = c("operating_airline", "geo_summary", "geo_region", "activity_type_code", "terminal"), 
            num_col = "total",
            title = "Distribution of United Airlines Passengers at SFO During 2022")
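
A quick way to spot this situation before plotting is to look for labels that occur in more than one categorical column. The following is a minimal base R sketch; find_shared_labels is a hypothetical helper (not part of the sfo package), and the toy data frame uses made-up values.

```r
# Hypothetical helper: return labels that appear in more than one column
find_shared_labels <- function(x, cat_cols) {
  # Unique labels within each column, so repeats inside a column do not count
  per_col <- lapply(cat_cols, function(col) unique(as.character(x[[col]])))
  all_labels <- unlist(per_col)
  unique(all_labels[duplicated(all_labels)])
}

# Toy data where geo_summary and terminal share the label "International"
toy <- data.frame(
  geo_summary = c("International", "Domestic"),
  terminal    = c("International", "Terminal 3"),
  total       = c(10, 20)
)

find_shared_labels(toy, c("geo_summary", "terminal"))
#> [1] "International"
```

An empty result (character(0)) means no label is shared between the selected columns.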

To overcome this issue, you have to ensure that no label appears in more than one variable. Let’s change the International value of the terminal variable to international (I -> i):

d %>% 
  filter(operating_airline == "United Airlines") %>%
  mutate(terminal = ifelse(terminal == "International", "international", terminal)) %>%
  group_by(operating_airline, activity_type_code, geo_summary, geo_region, terminal) %>%
  summarise(total = sum(passenger_count), .groups = "drop") %>%
  sankey_ly(cat_cols = c("operating_airline", "terminal", "geo_summary", "geo_region", "activity_type_code"), 
            num_col = "total",
            title = "Distribution of United Airlines Passengers at SFO During 2022")
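
Changing a single letter works for one known collision. When there are several variables, a more general option is to tag every shared label with the name of its column. The sketch below shows one way to do that in base R; disambiguate_labels is a hypothetical helper (not part of the sfo package), and the suffix format is an arbitrary choice.

```r
# Hypothetical helper: append the column name to labels shared across columns
disambiguate_labels <- function(x, cat_cols) {
  per_col <- lapply(cat_cols, function(col) unique(as.character(x[[col]])))
  all_labels <- unlist(per_col)
  shared <- unique(all_labels[duplicated(all_labels)])

  for (col in cat_cols) {
    values <- as.character(x[[col]])
    hit <- values %in% shared
    # Only shared labels are changed; all other labels stay as they are
    values[hit] <- paste0(values[hit], " (", col, ")")
    x[[col]] <- values
  }
  x
}

# Toy data with a label shared by two columns (made-up values)
toy <- data.frame(
  geo_summary = c("International", "Domestic"),
  terminal    = c("International", "Terminal 3"),
  total       = c(10, 20)
)

fixed <- disambiguate_labels(toy, c("geo_summary", "terminal"))
fixed
```

The relabeled data frame can then be passed to sankey_ly with the same cat_cols and num_col arguments as before.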