The sfo_stats dataset provides monthly statistics on San Francisco International Airport’s air traffic landing between July 2005 and December 2020. The following vignette demonstrate some approaches for exploring the dataset. As the structure of the sfo_stats is similar to the sfo_passengers dataset, we will repeat the same data prep steps as shown on the previous vignette. We will use the dplyr and plotly packages for data manipulation and visualization, respectively.

Data prep

For simplicity, let’s use a shorter name , d, for the dataset:

library(sfo)
library(dplyr)
library(plotly)

d <- sfo_stats

head(d)
#>   activity_period operating_airline operating_airline_iata_code
#> 1          202212        Air Canada                          AC
#> 2          202212   Delta Air Lines                          DL
#> 3          202212   Delta Air Lines                          DL
#> 4          202212  WestJet Airlines                          WS
#> 5          202212  WestJet Airlines                          WS
#> 6          202212  WestJet Airlines                          WS
#>   published_airline published_airline_iata_code   geo_summary geo_region
#> 1        Air Canada                          AC International     Canada
#> 2   Delta Air Lines                          DL      Domestic         US
#> 3   Delta Air Lines                          DL      Domestic         US
#> 4  WestJet Airlines                          WS International     Canada
#> 5  WestJet Airlines                          WS International     Canada
#> 6  WestJet Airlines                          WS International     Canada
#>   landing_aircraft_type aircraft_body_type aircraft_manufacturer aircraft_model
#> 1             Passenger          Wide Body                Boeing           B773
#> 2             Passenger        Narrow Body                Airbus           A319
#> 3             Passenger        Narrow Body                Airbus           A320
#> 4             Passenger        Narrow Body                Boeing           B38M
#> 5             Passenger        Narrow Body                Boeing           B737
#> 6             Passenger        Narrow Body                Boeing           B738
#>   aircraft_version landing_count total_landed_weight
#> 1                -             1              554000
#> 2                -           163            22459444
#> 3                -            43             6114514
#> 4                -             5              764000
#> 5                -             5              646000
#> 6                -             3              438900

Next, let’s reformat the period indicator, activity_period to a Date object, setting the first day of the month as the default day:

d$date <- as.Date(paste(substr(d$activity_period, 1,4), 
                        substr(d$activity_period, 5,6), 
                        "01", sep ="/"))

We can see, with the str command, the stucture of the dataset:

str(d)
#> 'data.frame':    57381 obs. of  15 variables:
#>  $ activity_period            : int  202212 202212 202212 202212 202212 202212 202212 202212 202212 202212 ...
#>  $ operating_airline          : chr  "Air Canada" "Delta Air Lines" "Delta Air Lines" "WestJet Airlines" ...
#>  $ operating_airline_iata_code: chr  "AC" "DL" "DL" "WS" ...
#>  $ published_airline          : chr  "Air Canada" "Delta Air Lines" "Delta Air Lines" "WestJet Airlines" ...
#>  $ published_airline_iata_code: chr  "AC" "DL" "DL" "WS" ...
#>  $ geo_summary                : chr  "International" "Domestic" "Domestic" "International" ...
#>  $ geo_region                 : chr  "Canada" "US" "US" "Canada" ...
#>  $ landing_aircraft_type      : chr  "Passenger" "Passenger" "Passenger" "Passenger" ...
#>  $ aircraft_body_type         : chr  "Wide Body" "Narrow Body" "Narrow Body" "Narrow Body" ...
#>  $ aircraft_manufacturer      : chr  "Boeing" "Airbus" "Airbus" "Boeing" ...
#>  $ aircraft_model             : chr  "B773" "A319" "A320" "B38M" ...
#>  $ aircraft_version           : chr  "-" "-" "-" "-" ...
#>  $ landing_count              : int  1 163 43 5 5 3 22 33 19 269 ...
#>  $ total_landed_weight        : int  554000 22459444 6114514 764000 646000 438900 8360000 5660127 2779700 42313700 ...
#>  $ date                       : Date, format: "2022-12-01" "2022-12-01" ...

The data set has 11 categorical variables and two numeric variables - landing_count and total_landed_weight.

Exploratory analysis

Let’s start with viewing the total monthly number of landing in SFO:

d %>% 
  group_by(date) %>%
  summarise(landing_count = sum(landing_count)) %>%
  plot_ly(x = ~ date, y = ~ landing_count,
          type = "scatter", mode = "lines") %>% 
  layout(title = "Montly Landing in SFO Airport",
         yaxis = list(title = "Number of Landing"),
         xaxis = list(title = "Source: San Francisco data portal (DataSF)"))

As can seen in the aggregate plot above, the data has:

  • Strong seasonality, high activity during the summer mounts and low during winter time.
  • A significant drop in the number of landing since March 2020 due to the Covid19 pandemic

We can use plotly’s fill plot to review the distribution of landing at SFO by geo region:

d %>% 
  group_by(date, geo_region) %>%
  summarise(landing_count = sum(landing_count)) %>%
  as.data.frame() %>%
plot_ly(x = ~ date, 
        y = ~ landing_count,
        type = 'scatter', 
        mode = 'none', 
        stackgroup = 'one', 
        groupnorm = 'percent', fillcolor = ~ geo_region) %>%
  layout(title = "Dist. of Landing at SFO by Region",
         yaxis = list(title = "Percentage",
                      ticksuffix = "%"))

As expected, we can notice the change in geo’s landing distribution since March 2020 due to the Covid19 pandemic.

The aircraft_manufacturer column, as the name implies, provides the the aircraft manufacture. Let’s summarize the total landing during 2019, the most recent full calendar year, by the manufacturer type:

d %>% 
      filter(activity_period >= 201901 & activity_period < 202001,
             aircraft_manufacturer != "") %>%
      group_by(aircraft_manufacturer) %>%
      summarise(total_landing = sum(landing_count),
                `.groups` = "drop") %>%
      arrange(-total_landing) %>%
      plot_ly(labels = ~ aircraft_manufacturer,
              values = ~ total_landing) %>%
      add_pie(hole = 0.6) %>%
      layout(title = "Landing Distribution by Aircraft Manufacturer During 2019")

Similarly, we can add the aircract_body_type and get the distribution of landing airplans during 2019 by manufacturer and body type (e.g., wide, narrow, etc.):

d %>% 
      filter(activity_period >= 201901 & activity_period < 202001,
             aircraft_manufacturer != "") %>%
      group_by(aircraft_manufacturer, aircraft_body_type) %>%
      summarise(total_landing = sum(landing_count),
                `.groups` = "drop") %>%
      arrange(-total_landing)
#> # A tibble: 9 × 3
#>   aircraft_manufacturer aircraft_body_type total_landing
#>   <chr>                 <chr>                      <int>
#> 1 Boeing                Narrow Body               156286
#> 2 Airbus                Narrow Body               112296
#> 3 Boeing                Wide Body                  51900
#> 4 Embraer               Regional Jet               48648
#> 5 Bombardier            Regional Jet               41724
#> 6 Airbus                Wide Body                  11506
#> 7 Bombardier            Narrow Body                 2028
#> 8 McDonnell Douglas     Narrow Body                    6
#> 9 McDonnell Douglas     Wide Body                      2

A Sankey plot enables us to get a distribution flow of some numeric value by multiple categorical variables. In the following example, we will use the sankey_ly function to plot the distribution of landing during 2019 by geo, flight type, and aircraft details:

d %>%
  filter(activity_period >= 201901 & activity_period < 202001,
             aircraft_manufacturer != "") %>%
  group_by(geo_region, landing_aircraft_type, 
           aircraft_manufacturer, aircraft_model, 
           aircraft_body_type) %>%
  summarise(total_landing = sum(landing_count),
            groups = "drop") %>%
  sankey_ly(cat_cols = c("geo_region", 
                         "landing_aircraft_type", 
                         "aircraft_manufacturer",
                         "aircraft_model",
                         "aircraft_body_type"),
            num_col = "total_landing",
            title = "SFO Landing Summary by Geo Region and Aircraft Type During 2019")