v2_analyzing_sfo.Rmd
The sfo_passengers
dataset provides monthly statistics about the air traffic passengers at San Francisco International Airport (SFO) between July 2005 and December 2022. This vignette focuses on exploratory analysis approaches for this dataset.
We will start with reviewing the data structure:
For simplicity, let’s use a shorter name, d
, for the dataset:
d <- sfo_passengers
Next let’s review the characteristics of the dataset:
head(d)
#> activity_period operating_airline operating_airline_iata_code
#> 1 202212 EVA Airways BR
#> 2 202212 EVA Airways BR
#> 3 202212 Emirates EK
#> 4 202212 Emirates EK
#> 5 202212 Flair Airlines, Ltd. F8
#> 6 202212 Flair Airlines, Ltd. F8
#> published_airline published_airline_iata_code geo_summary geo_region
#> 1 EVA Airways BR International Asia
#> 2 EVA Airways BR International Asia
#> 3 Emirates EK International Middle East
#> 4 Emirates EK International Middle East
#> 5 Flair Airlines, Ltd. F8 International Canada
#> 6 Flair Airlines, Ltd. F8 International Canada
#> activity_type_code price_category_code terminal boarding_area
#> 1 Deplaned Other International G
#> 2 Enplaned Other International G
#> 3 Deplaned Other International A
#> 4 Enplaned Other International A
#> 5 Deplaned Low Fare International A
#> 6 Enplaned Low Fare International A
#> passenger_count
#> 1 12405
#> 2 15151
#> 3 13131
#> 4 14985
#> 5 2543
#> 6 2883
str(d)
#> 'data.frame': 50730 obs. of 12 variables:
#> $ activity_period : int 202212 202212 202212 202212 202212 202212 202212 202212 202212 202212 ...
#> $ operating_airline : chr "EVA Airways" "EVA Airways" "Emirates" "Emirates" ...
#> $ operating_airline_iata_code: chr "BR" "BR" "EK" "EK" ...
#> $ published_airline : chr "EVA Airways" "EVA Airways" "Emirates" "Emirates" ...
#> $ published_airline_iata_code: chr "BR" "BR" "EK" "EK" ...
#> $ geo_summary : chr "International" "International" "International" "International" ...
#> $ geo_region : chr "Asia" "Asia" "Middle East" "Middle East" ...
#> $ activity_type_code : chr "Deplaned" "Enplaned" "Deplaned" "Enplaned" ...
#> $ price_category_code : chr "Other" "Other" "Other" "Other" ...
#> $ terminal : chr "International" "International" "International" "International" ...
#> $ boarding_area : chr "G" "G" "A" "A" ...
#> $ passenger_count : int 12405 15151 13131 14985 2543 2883 1772 1370 2817 1987 ...
The activity_period
variable provides the year-month of the observation using integer format. That is convenient if we filter by year and month, but less convenient if we want to plot the data as time series. Therefore, we will reformat the variable from YYYYMM to YYYYMMDD, setting the month’s day as 1st with the as.Date
function:
d$date <- as.Date(paste(substr(d$activity_period, 1,4), substr(d$activity_period, 5,6), "01", sep ="/"))
head(d[,c("activity_period", "date")])
#> activity_period date
#> 1 202212 2022-12-01
#> 2 202212 2022-12-01
#> 3 202212 2022-12-01
#> 4 202212 2022-12-01
#> 5 202212 2022-12-01
#> 6 202212 2022-12-01
min(d$date)
#> [1] "2005-07-01"
max(d$date)
#> [1] "2022-12-01"
After we reviewed and reformatted the data, we can start exploring the data and answer some questions such as:
The monthly number of passengers at SFO
The airlines’ distribution
The geographic distribution of the passengers
There are multiple ways to aggregate the datasets for monthly time-series data, such as the passengers’ travel type (arrival vs. departure), geo type, and their combination. Let’s start with the total aggregation of the data:
library(dplyr)
library(plotly)
d_total <- d %>%
group_by(date) %>%
summarise(total = sum(passenger_count))
head(d_total)
#> # A tibble: 6 × 2
#> date total
#> <date> <int>
#> 1 2005-07-01 6451538
#> 2 2005-08-01 6391732
#> 3 2005-09-01 5481106
#> 4 2005-10-01 5541430
#> 5 2005-11-01 5234666
#> 6 2005-12-01 5343594
plot_ly(data = d_total,
x = ~ date,
y = ~ total,
type = "scatter",
mode = "lines") %>%
layout(title = "Monthly Air Traffic Passengers at SFO",
yaxis = list(title = "Number of Passengers"),
xaxis = list(title = "Source: San Francisco data portal (DataSF)"))
The COVID-19 pandemic effect can be seen in the series above, as the number of passengers sharply dropped in March 2020.
Likewise, we can aggregate the data by the activity type (arrival, departure, transit):
d_activity_type <- d %>%
group_by(date, activity_type_code) %>%
summarise(total = sum(passenger_count), .groups = "drop") %>%
as.data.frame()
head(d_activity_type)
#> date activity_type_code total
#> 1 2005-07-01 Deplaned 3254658
#> 2 2005-07-01 Enplaned 3138824
#> 3 2005-07-01 Thru / Transit 58056
#> 4 2005-08-01 Deplaned 3228936
#> 5 2005-08-01 Enplaned 3106292
#> 6 2005-08-01 Thru / Transit 56504
plot_ly(data = d_activity_type,
x = ~ date,
y = ~ total,
color = ~ activity_type_code,
type = "scatter",
mode = "lines") %>%
layout(title = "Monthly Air Traffic Passengers at SFO by Activity Type",
yaxis = list(title = "Number of Passengers"),
xaxis = list(title = "Source: San Francisco data portal (DataSF)"))
The following plot provides the distribution of passengers by the flight geo type (international vs. domestic). We will use the pivot_wider
function from the tidyr package to transform the data.frame
into a wide format:
library(tidyr)
d_geo <- d %>%
group_by(date, geo_summary) %>%
summarise(total = sum(passenger_count), .groups = "drop") %>%
pivot_wider(names_from = geo_summary, values_from = total) %>%
as.data.frame()
head(d_geo)
#> date Domestic International
#> 1 2005-07-01 4851804 1599734
#> 2 2005-08-01 4811808 1579924
#> 3 2005-09-01 4135846 1345260
#> 4 2005-10-01 4190238 1351192
#> 5 2005-11-01 3997758 1236908
#> 6 2005-12-01 4007680 1335914
plot_ly(data = d_geo,
x = ~ date,
y = ~ Domestic,
name = "Domestic",
type = "scatter",
mode = "none",
stackgroup = "one",
groupnorm = "percent", fillcolor = "#4C74C9") %>%
add_trace(y = ~ International,
name = "International",
fillcolor = "#50CB86") %>%
layout(title = 'United States Personal Expenditures by Categories',
yaxis = list(title = "Proportion of Passengers",
showgrid = FALSE,
ticksuffix = "%"),
xaxis = list(title = "Source: San Francisco data portal (DataSF)",
showgrid = FALSE))
As expected, the distribution of passengers by geo type has changed since the start of the COVID-19 pandemic, increasing domestic flights’ proportion.
The dataset provides two indicators about the airline company:
operating_airline
field describes the aircraft operator’s name, and
published_airline
filed describes the name of the airline that issues the ticket and books revenue
In most cases, it is the same company. In other cases, such as code-sharing operators, it could be two different companies. Let’s review the characteristics of those variables:
You can notice that there are fewer published airlines compared to operating airlines. This potentially means that some operating airlines have lines with code sharing. For simplicity, we will focus on the operating_airline
variable. Let’s review the unique combination of this variable:
sort(unique(d$operating_airline))
#> [1] "ABC Aerolineas S.A. de C.V. dba Interjet"
#> [2] "Aer Lingus, Ltd."
#> [3] "Aeromexico"
#> [4] "Air Berlin"
#> [5] "Air Canada"
#> [6] "Air Canada Jazz"
#> [7] "Air China"
#> [8] "Air France"
#> [9] "Air India Limited"
#> [10] "Air Italy S.P.A"
#> [11] "Air New Zealand"
#> [12] "Air Pacific Limited dba Fiji Airways"
#> [13] "Air Transat"
#> [14] "AirTran Airways"
#> [15] "Alaska Airlines"
#> [16] "All Nippon Airways"
#> [17] "Allegiant Air"
#> [18] "American Airlines"
#> [19] "American Eagle Airlines"
#> [20] "Ameriflight"
#> [21] "Asiana Airlines"
#> [22] "ATA Airlines"
#> [23] "Atlantic Southeast Airlines"
#> [24] "Atlas Air, Inc"
#> [25] "BelAir Airlines"
#> [26] "Boeing Company"
#> [27] "Breeze Aviation Group, Inc."
#> [28] "British Airways"
#> [29] "Cathay Pacific"
#> [30] "China Airlines"
#> [31] "China Eastern"
#> [32] "China Southern"
#> [33] "Compass Airlines"
#> [34] "Condor Flugdienst GmbH"
#> [35] "COPA Airlines, Inc."
#> [36] "Delta Air Lines"
#> [37] "El Al Israel Airlines LTD."
#> [38] "Emirates"
#> [39] "Etihad Airways"
#> [40] "EVA Airways"
#> [41] "Evergreen International Airlines"
#> [42] "ExpressJet Airlines"
#> [43] "Finnair"
#> [44] "Flair Airlines, Ltd."
#> [45] "French Bee"
#> [46] "Frontier Airlines"
#> [47] "Hawaiian Airlines"
#> [48] "Hong Kong Airlines Limited"
#> [49] "Horizon Air"
#> [50] "Iberia"
#> [51] "Icelandair (Inactive)"
#> [52] "Icelandair EHF"
#> [53] "Independence Air"
#> [54] "Japan Airlines"
#> [55] "Jazz Aviation"
#> [56] "Jet Airways"
#> [57] "JetBlue Airways"
#> [58] "KLM Royal Dutch Airlines"
#> [59] "Korean Air Lines"
#> [60] "LAN Peru"
#> [61] "Lufthansa German Airlines"
#> [62] "Mesa Airlines"
#> [63] "Mesaba Airlines"
#> [64] "Mexicana Airlines"
#> [65] "Miami Air International"
#> [66] "Midwest Airlines"
#> [67] "Northwest Airlines (became Delta)"
#> [68] "Norwegian Air Shuttle ASA"
#> [69] "Norwegian Air UK Ltd"
#> [70] "Pacific Aviation"
#> [71] "Philippine Airline, Inc. (INACTIVE - DO NOT USE)"
#> [72] "Philippine Airlines"
#> [73] "Qantas Airways"
#> [74] "Qatar Airways"
#> [75] "Republic Airlines"
#> [76] "Samsic Airport America, LLC"
#> [77] "SAS Airlines"
#> [78] "Servisair"
#> [79] "Singapore Airlines"
#> [80] "SkyWest Airlines"
#> [81] "Southwest Airlines"
#> [82] "Spirit Airlines"
#> [83] "Sun Country Airlines"
#> [84] "Swiss International"
#> [85] "Swissport USA"
#> [86] "TACA International Airlines, S.A."
#> [87] "TAP Air Portugal"
#> [88] "Thomas Cook Airlines"
#> [89] "Trego Dugan Aviation"
#> [90] "Turkish Airlines"
#> [91] "United Airlines"
#> [92] "United Airlines - Pre 07/01/2013"
#> [93] "US Airways"
#> [94] "Vietnam Airlines JSC"
#> [95] "Virgin America"
#> [96] "Virgin Atlantic"
#> [97] "Volaris Airlines"
#> [98] "WestJet Airlines"
#> [99] "World Airways"
#> [100] "WOW Air"
#> [101] "XL Airways France"
#> [102] "Xtra Airways"
The main thing that immediately pops out of this list is that United Airlines, the main carrier at SFO, is listed twice as United Airlines
and United Airlines - Pre 07/01/2013
. I could not find any reference that explains the second label. If analyzing data before July 2013, I would relabel those cases as United Airlines
. For this analysis, we will focus on 2022, the most recent full year. Let’s start by filtering the data for 2022 flights:
d22 <- d %>% filter(activity_period >= 202201 & activity_period < 202301) %>%
group_by(published_airline, geo_summary) %>%
summarise(total = sum(passenger_count), .groups = "drop")
Before we summarize the data by airlines, let’s see the distribution of flights by geo type (domestic vs. international):
d22 %>% group_by(geo_summary) %>%
summarise(total_geo = sum(total)) %>%
plot_ly(labels = ~ geo_summary,
values = ~ total_geo) %>%
add_pie(hole = 0.6)
More than 75% of the air passenger traffic in SFO during 2022 were domestic. That is close to 47M passengers out of 60.8M passengers. Let’s use a treemap plot to see the distribution of passengers by airline and geo:
plot_ly(data = d22 %>% filter(geo_summary == "Domestic"),
labels = ~ published_airline,
values = ~ total,
type = "treemap",
parents= ~geo_summary,
domain = list(row=0),
name = "Domestic",
textinfo="label+value+percent parent") %>%
add_trace(data = d22 %>% filter(geo_summary == "International"),
labels = ~ published_airline,
values = ~ total,
type = "treemap",
parents= ~geo_summary,
domain = list(row=1),
name = "International",
textinfo="label+value+percent parent") %>%
layout(title = "Passengers Distribution During 2022 by Geo",
grid=list(columns=1, rows=2))
Last but not least, it would be interesting to leverage the different categorical variables to create a summary plot with Sankey. We will use the sankey_ly
function, a plotly wrapper that automatically creates the data transformation and a Sankey plot with plotly. We will rank the top 20 airlines during 2022 and plot the passengers’ distribution by the airline, travel type (domestic or international), travel geo, activity type (deplaned, enplaned, and transit), and fare type (low or other):
rank <- sfo_passengers %>%
filter(activity_period >= 202201 & activity_period <= 202212) %>%
group_by(operating_airline) %>%
summarise(total = sum(passenger_count), .groups = "drop") %>%
arrange(-total) %>%
slice_head(n = 20) %>%
select(operating_airline)
sfo_passengers %>%
filter(activity_period >= 202201 & activity_period <= 202212, operating_airline %in% rank$operating_airline) %>%
group_by(operating_airline, geo_summary, geo_region, activity_type_code,
price_category_code, terminal) %>%
summarise(total = sum(passenger_count), .groups = "drop") %>%
sankey_ly(cat_cols = c("operating_airline","geo_summary",
"geo_region", "activity_type_code",
"price_category_code"),
num_col = "total",
title = "SFO Air Traffic Passengers Dist. During 2022")