The sfo_passengers dataset provides monthly statistics about the air traffic passengers at San Francisco International Airport (SFO) between July 2005 and December 2022. This vignette focuses on exploratory analysis approaches for this dataset.

Data prep

We will start with reviewing the data structure:

library(sfo)

data("sfo_passengers")

For simplicity, let’s use a shorter name, d, for the dataset:


d <- sfo_passengers

Next let’s review the characteristics of the dataset:


head(d)
#>   activity_period    operating_airline operating_airline_iata_code
#> 1          202212          EVA Airways                          BR
#> 2          202212          EVA Airways                          BR
#> 3          202212             Emirates                          EK
#> 4          202212             Emirates                          EK
#> 5          202212 Flair Airlines, Ltd.                          F8
#> 6          202212 Flair Airlines, Ltd.                          F8
#>      published_airline published_airline_iata_code   geo_summary  geo_region
#> 1          EVA Airways                          BR International        Asia
#> 2          EVA Airways                          BR International        Asia
#> 3             Emirates                          EK International Middle East
#> 4             Emirates                          EK International Middle East
#> 5 Flair Airlines, Ltd.                          F8 International      Canada
#> 6 Flair Airlines, Ltd.                          F8 International      Canada
#>   activity_type_code price_category_code      terminal boarding_area
#> 1           Deplaned               Other International             G
#> 2           Enplaned               Other International             G
#> 3           Deplaned               Other International             A
#> 4           Enplaned               Other International             A
#> 5           Deplaned            Low Fare International             A
#> 6           Enplaned            Low Fare International             A
#>   passenger_count
#> 1           12405
#> 2           15151
#> 3           13131
#> 4           14985
#> 5            2543
#> 6            2883

str(d)
#> 'data.frame':    50730 obs. of  12 variables:
#>  $ activity_period            : int  202212 202212 202212 202212 202212 202212 202212 202212 202212 202212 ...
#>  $ operating_airline          : chr  "EVA Airways" "EVA Airways" "Emirates" "Emirates" ...
#>  $ operating_airline_iata_code: chr  "BR" "BR" "EK" "EK" ...
#>  $ published_airline          : chr  "EVA Airways" "EVA Airways" "Emirates" "Emirates" ...
#>  $ published_airline_iata_code: chr  "BR" "BR" "EK" "EK" ...
#>  $ geo_summary                : chr  "International" "International" "International" "International" ...
#>  $ geo_region                 : chr  "Asia" "Asia" "Middle East" "Middle East" ...
#>  $ activity_type_code         : chr  "Deplaned" "Enplaned" "Deplaned" "Enplaned" ...
#>  $ price_category_code        : chr  "Other" "Other" "Other" "Other" ...
#>  $ terminal                   : chr  "International" "International" "International" "International" ...
#>  $ boarding_area              : chr  "G" "G" "A" "A" ...
#>  $ passenger_count            : int  12405 15151 13131 14985 2543 2883 1772 1370 2817 1987 ...

The activity_period variable provides the year-month of the observation using integer format. That is convenient if we filter by year and month, but less convenient if we want to plot the data as time series. Therefore, we will reformat the variable from YYYYMM to YYYYMMDD, setting the month’s day as 1st with the as.Date function:

d$date <- as.Date(paste(substr(d$activity_period, 1,4), substr(d$activity_period, 5,6), "01", sep ="/"))

head(d[,c("activity_period", "date")])
#>   activity_period       date
#> 1          202212 2022-12-01
#> 2          202212 2022-12-01
#> 3          202212 2022-12-01
#> 4          202212 2022-12-01
#> 5          202212 2022-12-01
#> 6          202212 2022-12-01

min(d$date)
#> [1] "2005-07-01"
max(d$date)
#> [1] "2022-12-01"

Exploratory analysis

After we reviewed and reformatted the data, we can start exploring the data and answer some questions such as:

  • The monthly number of passengers at SFO

  • The airlines’ distribution

  • The geographic distribution of the passengers

Monthly number of passengers

There are multiple ways to aggregate the datasets for monthly time-series data, such as the passengers’ travel type (arrival vs. departure), geo type, and their combination. Let’s start with the total aggregation of the data:

library(dplyr)
library(plotly)

d_total <- d %>%
  group_by(date) %>%
  summarise(total = sum(passenger_count))

head(d_total)
#> # A tibble: 6 × 2
#>   date         total
#>   <date>       <int>
#> 1 2005-07-01 6451538
#> 2 2005-08-01 6391732
#> 3 2005-09-01 5481106
#> 4 2005-10-01 5541430
#> 5 2005-11-01 5234666
#> 6 2005-12-01 5343594

plot_ly(data = d_total,
        x = ~ date,
        y = ~ total,
        type = "scatter",
        mode = "lines") %>%
  layout(title = "Monthly Air Traffic Passengers at SFO",
         yaxis = list(title = "Number of Passengers"),
         xaxis = list(title = "Source: San Francisco data portal (DataSF)"))

The COVID-19 pandemic effect can be seen in the series above, as the number of passengers sharply dropped in March 2020.

Likewise, we can aggregate the data by the activity type (arrival, departure, transit):


d_activity_type <- d %>%
  group_by(date, activity_type_code) %>%
  summarise(total = sum(passenger_count), .groups = "drop") %>%
  as.data.frame()

head(d_activity_type)
#>         date activity_type_code   total
#> 1 2005-07-01           Deplaned 3254658
#> 2 2005-07-01           Enplaned 3138824
#> 3 2005-07-01     Thru / Transit   58056
#> 4 2005-08-01           Deplaned 3228936
#> 5 2005-08-01           Enplaned 3106292
#> 6 2005-08-01     Thru / Transit   56504

plot_ly(data = d_activity_type,
        x = ~ date,
        y = ~ total,
        color = ~ activity_type_code,
        type = "scatter",
        mode = "lines") %>%
  layout(title = "Monthly Air Traffic Passengers at SFO by Activity Type",
         yaxis = list(title = "Number of Passengers"),
         xaxis = list(title = "Source: San Francisco data portal (DataSF)"))

The following plot provides the distribution of passengers by the flight geo type (international vs. domestic). We will use the pivot_wider function from the tidyr package to transform the data.frame into a wide format:

library(tidyr)

d_geo <- d %>%
  group_by(date, geo_summary) %>%
  summarise(total = sum(passenger_count), .groups = "drop") %>%
  pivot_wider(names_from = geo_summary, values_from = total) %>%
  as.data.frame()

head(d_geo)
#>         date Domestic International
#> 1 2005-07-01  4851804       1599734
#> 2 2005-08-01  4811808       1579924
#> 3 2005-09-01  4135846       1345260
#> 4 2005-10-01  4190238       1351192
#> 5 2005-11-01  3997758       1236908
#> 6 2005-12-01  4007680       1335914

plot_ly(data = d_geo, 
        x = ~ date, 
        y = ~ Domestic, 
        name = "Domestic", 
        type = "scatter", 
        mode = "none", 
        stackgroup = "one", 
        groupnorm = "percent", fillcolor = "#4C74C9") %>%
  add_trace(y = ~ International,
            name = "International",
            fillcolor = "#50CB86") %>%
  layout(title = 'United States Personal Expenditures by Categories',
         yaxis = list(title = "Proportion of Passengers", 
                      showgrid = FALSE,
                      ticksuffix = "%"),
         xaxis = list(title = "Source: San Francisco data portal (DataSF)",
                      showgrid = FALSE))

As expected, the distribution of passengers by geo type has changed since the start of the COVID-19 pandemic, increasing domestic flights’ proportion.

Airlines Analysis

The dataset provides two indicators about the airline company:

  • operating_airline field describes the aircraft operator’s name, and

  • published_airline filed describes the name of the airline that issues the ticket and books revenue

In most cases, it is the same company. In other cases, such as code-sharing operators, it could be two different companies. Let’s review the characteristics of those variables:


length(unique(d$operating_airline))
#> [1] 102
length(unique(d$published_airline))
#> [1] 92

You can notice that there are fewer published airlines compared to operating airlines. This potentially means that some operating airlines have lines with code sharing. For simplicity, we will focus on the operating_airline variable. Let’s review the unique combination of this variable:

sort(unique(d$operating_airline))
#>   [1] "ABC Aerolineas S.A. de C.V. dba Interjet"        
#>   [2] "Aer Lingus, Ltd."                                
#>   [3] "Aeromexico"                                      
#>   [4] "Air Berlin"                                      
#>   [5] "Air Canada"                                      
#>   [6] "Air Canada Jazz"                                 
#>   [7] "Air China"                                       
#>   [8] "Air France"                                      
#>   [9] "Air India Limited"                               
#>  [10] "Air Italy S.P.A"                                 
#>  [11] "Air New Zealand"                                 
#>  [12] "Air Pacific Limited dba Fiji Airways"            
#>  [13] "Air Transat"                                     
#>  [14] "AirTran Airways"                                 
#>  [15] "Alaska Airlines"                                 
#>  [16] "All Nippon Airways"                              
#>  [17] "Allegiant Air"                                   
#>  [18] "American Airlines"                               
#>  [19] "American Eagle Airlines"                         
#>  [20] "Ameriflight"                                     
#>  [21] "Asiana Airlines"                                 
#>  [22] "ATA Airlines"                                    
#>  [23] "Atlantic Southeast Airlines"                     
#>  [24] "Atlas Air, Inc"                                  
#>  [25] "BelAir Airlines"                                 
#>  [26] "Boeing Company"                                  
#>  [27] "Breeze Aviation Group, Inc."                     
#>  [28] "British Airways"                                 
#>  [29] "Cathay Pacific"                                  
#>  [30] "China Airlines"                                  
#>  [31] "China Eastern"                                   
#>  [32] "China Southern"                                  
#>  [33] "Compass Airlines"                                
#>  [34] "Condor Flugdienst GmbH"                          
#>  [35] "COPA Airlines, Inc."                             
#>  [36] "Delta Air Lines"                                 
#>  [37] "El Al Israel Airlines LTD."                      
#>  [38] "Emirates"                                        
#>  [39] "Etihad Airways"                                  
#>  [40] "EVA Airways"                                     
#>  [41] "Evergreen International Airlines"                
#>  [42] "ExpressJet Airlines"                             
#>  [43] "Finnair"                                         
#>  [44] "Flair Airlines, Ltd."                            
#>  [45] "French Bee"                                      
#>  [46] "Frontier Airlines"                               
#>  [47] "Hawaiian Airlines"                               
#>  [48] "Hong Kong Airlines Limited"                      
#>  [49] "Horizon Air"                                     
#>  [50] "Iberia"                                          
#>  [51] "Icelandair (Inactive)"                           
#>  [52] "Icelandair EHF"                                  
#>  [53] "Independence Air"                                
#>  [54] "Japan Airlines"                                  
#>  [55] "Jazz Aviation"                                   
#>  [56] "Jet Airways"                                     
#>  [57] "JetBlue Airways"                                 
#>  [58] "KLM Royal Dutch Airlines"                        
#>  [59] "Korean Air Lines"                                
#>  [60] "LAN Peru"                                        
#>  [61] "Lufthansa German Airlines"                       
#>  [62] "Mesa Airlines"                                   
#>  [63] "Mesaba Airlines"                                 
#>  [64] "Mexicana Airlines"                               
#>  [65] "Miami Air International"                         
#>  [66] "Midwest Airlines"                                
#>  [67] "Northwest Airlines (became Delta)"               
#>  [68] "Norwegian Air Shuttle ASA"                       
#>  [69] "Norwegian Air UK Ltd"                            
#>  [70] "Pacific Aviation"                                
#>  [71] "Philippine Airline, Inc. (INACTIVE - DO NOT USE)"
#>  [72] "Philippine Airlines"                             
#>  [73] "Qantas Airways"                                  
#>  [74] "Qatar Airways"                                   
#>  [75] "Republic Airlines"                               
#>  [76] "Samsic Airport America, LLC"                     
#>  [77] "SAS Airlines"                                    
#>  [78] "Servisair"                                       
#>  [79] "Singapore Airlines"                              
#>  [80] "SkyWest Airlines"                                
#>  [81] "Southwest Airlines"                              
#>  [82] "Spirit Airlines"                                 
#>  [83] "Sun Country Airlines"                            
#>  [84] "Swiss International"                             
#>  [85] "Swissport USA"                                   
#>  [86] "TACA International Airlines, S.A."               
#>  [87] "TAP Air Portugal"                                
#>  [88] "Thomas Cook Airlines"                            
#>  [89] "Trego Dugan Aviation"                            
#>  [90] "Turkish Airlines"                                
#>  [91] "United Airlines"                                 
#>  [92] "United Airlines - Pre 07/01/2013"                
#>  [93] "US Airways"                                      
#>  [94] "Vietnam Airlines JSC"                            
#>  [95] "Virgin America"                                  
#>  [96] "Virgin Atlantic"                                 
#>  [97] "Volaris Airlines"                                
#>  [98] "WestJet Airlines"                                
#>  [99] "World Airways"                                   
#> [100] "WOW Air"                                         
#> [101] "XL Airways France"                               
#> [102] "Xtra Airways"

The main thing that immediately pops out of this list is that United Airlines, the main carrier at SFO, is listed twice as United Airlines and United Airlines - Pre 07/01/2013. I could not find any reference that explains the second label. If analyzing data before July 2013, I would relabel those cases as United Airlines. For this analysis, we will focus on 2022, the most recent full year. Let’s start by filtering the data for 2022 flights:


d22 <- d %>% filter(activity_period >= 202201 & activity_period < 202301) %>%
  group_by(published_airline, geo_summary) %>%
  summarise(total = sum(passenger_count), .groups = "drop")

Before we summarize the data by airlines, let’s see the distribution of flights by geo type (domestic vs. international):


d22 %>% group_by(geo_summary) %>%
  summarise(total_geo = sum(total)) %>%
  plot_ly(labels = ~ geo_summary,
          values = ~ total_geo) %>%
  add_pie(hole = 0.6)

More than 75% of the air passenger traffic in SFO during 2022 were domestic. That is close to 47M passengers out of 60.8M passengers. Let’s use a treemap plot to see the distribution of passengers by airline and geo:

  plot_ly(data = d22 %>% filter(geo_summary == "Domestic"),
          labels = ~ published_airline, 
          values = ~ total,
          type = "treemap",
          parents=  ~geo_summary,
          domain = list(row=0),
          name = "Domestic",
          textinfo="label+value+percent parent") %>%
    add_trace(data = d22 %>% filter(geo_summary == "International"),
            labels = ~ published_airline, 
            values = ~ total,
            type = "treemap",
            parents=  ~geo_summary,
            domain = list(row=1),
            name = "International",
            textinfo="label+value+percent parent") %>%
    layout(title = "Passengers Distribution During 2022 by Geo",
      grid=list(columns=1, rows=2))

Air Traffic Summary

Last but not least, it would be interesting to leverage the different categorical variables to create a summary plot with Sankey. We will use the sankey_ly function, a plotly wrapper that automatically creates the data transformation and a Sankey plot with plotly. We will rank the top 20 airlines during 2022 and plot the passengers’ distribution by the airline, travel type (domestic or international), travel geo, activity type (deplaned, enplaned, and transit), and fare type (low or other):

rank <- sfo_passengers %>%
  filter(activity_period >= 202201 & activity_period <= 202212) %>%
  group_by(operating_airline) %>%
  summarise(total = sum(passenger_count), .groups = "drop") %>%
  arrange(-total) %>%
  slice_head(n = 20) %>%
  select(operating_airline)


sfo_passengers %>%
  filter(activity_period >= 202201 & activity_period <= 202212, operating_airline %in% rank$operating_airline) %>%
  group_by(operating_airline, geo_summary, geo_region, activity_type_code, 
           price_category_code, terminal) %>%
  summarise(total = sum(passenger_count), .groups = "drop") %>%
  sankey_ly(cat_cols = c("operating_airline","geo_summary",
                      "geo_region", "activity_type_code",
                      "price_category_code"),
          num_col = "total",
          title = "SFO Air Traffic Passengers Dist. During 2022")