Forecasting with the Fable Library

This is the second week of the forecasting sequence. Last week we reviewed Nixtla’s statsforecast Python library; this week we will review the fable R library.

The fable library provides a set of univariate and multivariate time series forecasting models, such as ARIMA, ETS, and time series linear regression. It also provides tools to train, test, and evaluate the models, as well as functions to visualize the outputs.

The fable library goes hand-in-hand with the book of the week, Forecasting: Principles and Practice by Rob J. Hyndman and George Athanasopoulos. The library is part of the tidyverts ecosystem, along with libraries such as tsibble, fabletools, and feasts.
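If you do not have the libraries used in this post installed yet, they are all available on CRAN. A minimal setup sketch (fabletools is installed automatically as a dependency of fable):

```r
# Install the tidyverts packages and the supporting libraries from CRAN
install.packages(c(
    "fable", "tsibble", "feasts",       # tidyverts forecasting stack
    "dplyr", "tidyr", "lubridate",      # data and time processing
    "ggplot2", "GGally"                 # visualization
))
```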

Forecasting the demand for electricity in California

Here is a quick demo showing how to train and forecast multiple forecasting models for multiple time series using the hourly demand for electricity in California. We will focus on the core functionality of the library, covering some of its data visualization tools and forecasting models. I plan to create a more in-depth tutorial at some point in the future. All the code and Docker settings for VS Code are available in the following repository.

We will start by loading the required libraries:

library(fable)
Loading required package: fabletools
Registered S3 method overwritten by 'tsibble':
  method               from 
  as_tibble.grouped_df dplyr
library(tsibble)

Attaching package: 'tsibble'
The following objects are masked from 'package:base':

    intersect, setdiff, union
library(feasts)
library(dplyr)

Attaching package: 'dplyr'
The following objects are masked from 'package:stats':

    filter, lag
The following objects are masked from 'package:base':

    intersect, setdiff, setequal, union
library(tidyr)
library(lubridate)

Attaching package: 'lubridate'
The following object is masked from 'package:tsibble':

    interval
The following objects are masked from 'package:base':

    date, intersect, setdiff, union
library(ggplot2)
library(GGally)
Registered S3 method overwritten by 'GGally':
  method from   
  +.gg   ggplot2

Besides the tidyverts libraries, we will use dplyr, tidyr, and lubridate to process the data and time objects, and ggplot2 and GGally to visualize the data.

Data

We will continue to use the hourly demand for electricity in California by provider. The dataset includes the following four providers:

  • Pacific Gas and Electric
  • Southern California Edison
  • San Diego Gas and Electric
  • Valley Electric Association

The data source is the EIA website, and a curated dataset is available in my useR!2024 workshop - Deploy and Monitor ML Pipelines with Open Source and Free Applications. We will use the last two years of data for the following demonstration:

url <- "https://raw.githubusercontent.com/RamiKrispin/useR2024-pipeline-workshop/main/data/data.csv"

data <- read.csv(url)

head(data)
               period subba               subba_name parent
1 2018-07-01 08:00:00  PGAE Pacific Gas and Electric   CISO
2 2018-07-01 09:00:00  PGAE Pacific Gas and Electric   CISO
3 2018-07-01 10:00:00  PGAE Pacific Gas and Electric   CISO
4 2018-07-01 11:00:00  PGAE Pacific Gas and Electric   CISO
5 2018-07-01 12:00:00  PGAE Pacific Gas and Electric   CISO
6 2018-07-01 13:00:00  PGAE Pacific Gas and Electric   CISO
                             parent_name value   value_units   type
1 California Independent System Operator 12522 megawatthours actual
2 California Independent System Operator 11745 megawatthours actual
3 California Independent System Operator 11200 megawatthours actual
4 California Independent System Operator 10822 megawatthours actual
5 California Independent System Operator 10644 megawatthours actual
6 California Independent System Operator 10559 megawatthours actual

The fable library follows the tidyverse workflow, and it uses the tsibble object as input. Let’s reformat the input object by reformatting the series timestamp and dropping unused columns:

start <- as.POSIXct("2022/8/1 0:00:00")
end <- as.POSIXct("2024/8/20 23:00:00")

data$time_temp <- ifelse(nchar(data$period) == 10, paste(data$period, "00:00:00", sep = " "), data$period)
data$time <- as.POSIXct(data$time_temp)


ts <- data |>
    dplyr::select(time, subba, y = value) |>
    dplyr::filter(
        time >= start & time <= end
    ) |>
    dplyr::arrange(subba, time) |>
    as_tsibble(index = time, key = subba)

The object is now ready to use:

head(ts)
# A tsibble: 6 x 3 [1h] <?>
# Key:       subba [1]
  time                subba     y
  <dttm>              <chr> <dbl>
1 2022-08-01 00:00:00 PGAE  12375
2 2022-08-01 01:00:00 PGAE  13233
3 2022-08-01 02:00:00 PGAE  14115
4 2022-08-01 03:00:00 PGAE  14813
5 2022-08-01 04:00:00 PGAE  14737
6 2022-08-01 05:00:00 PGAE  14831

Note: We define the object key as the electricity provider and the time column as the series index. The tsibble object uses the key to set the hierarchy of the series, which enables us to model all four series in a single call.
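You can verify how the tsibble object registered these attributes with the tsibble accessor functions (a quick inspection sketch; the printed values will match the header of the output above):

```r
# Inspect the tsibble attributes
tsibble::key_vars(ts)  # the key column(s), here "subba"
tsibble::index_var(ts) # the index column, here "time"
tsibble::interval(ts)  # the series interval, here one hour
```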

We will use the autoplot function to visualize the series:

ts |>
    autoplot() +
    labs(
        y = "MWh", x = "Time",
        title = "California Hourly Demand for Electricity by Sub Provider"
    )
Plot variable not specified, automatically selected `.vars = y`

The fable ecosystem provides a set of visualization functions for time series analysis. This includes functions for seasonal and correlation plots, as well as decomposition methods. Let’s review a few of those functions, starting with the gg_season function, which provides a seasonal plot:

ts |>
    gg_season(y, period = "day") +
    theme(legend.position = "none") +
    labs(y = "MWh", x = "Hour", title = "Hourly Demand for Electricity in California")

This view provides an hourly view of the series. You can modify, when applicable, the seasonal period displayed using the period argument. In addition, you can filter the input object to check the seasonal patterns during a specific time window. For example, the last 30 days:

ts |>
    dplyr::filter(time > max(ts$time) - 60 * 60 * 24 * 30) |>
    gg_season(y, period = "day") +
    theme(legend.position = "none") +
    labs(
        y = "MWh", x = "Hour",
        title = "Hourly Demand for Electricity in California"
    )

Another nice visualization function is ggpairs from the GGally library, which visualizes the cross-correlations between the four series:

ts |>
    pivot_wider(values_from = y, names_from = subba) |>
    ggpairs(columns = 2:5)

The last visualization function we will review is gg_lag, which provides a visual representation of the relationship between the series and its lags (similar to the ACF plot):

ts |>
    dplyr::filter(subba == "PGAE") |>
    gg_lag(y,
        lags = c(1:5, 24, 48, 168),
        geom = "point"
    ) +
    labs(x = "lag(y, k)")

Modeling

We will keep it simple, leaving the last 72 hours as a testing partition and training the models with the rest of the data. The library has a cross-validation (i.e., backtesting) function, but this is outside the scope of this review.

h <- 72

train <- ts |> dplyr::filter(time <= max(time) - hours(h))

test <- ts |> dplyr::filter(time > max(time) - hours(h))
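For completeness, here is what a backtesting setup could look like with tsibble’s stretch_tsibble function, which creates expanding training windows (a sketch only, not run here; the .init and .step values are illustrative, and fitting a model per fold per series can be slow):

```r
# Expanding-window cross-validation sketch:
# start with 30 days of training data, expand by one week per fold
cv_fc <- train |>
    tsibble::stretch_tsibble(.init = 24 * 30, .step = 24 * 7) |>
    model(arima = ARIMA(y)) |>
    forecast(h = 72)
```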

Let’s visualize the testing partition:

test |>
    autoplot() +
    labs(
        y = "MWh", x = "Time",
        title = "Testing Set"
    )
Plot variable not specified, automatically selected `.vars = y`

md <- train |>
    model(
        ets = ETS(y),
        arima = ARIMA(y),
        lm1 = TSLM(y ~ trend() + season()),
        lm2 = TSLM(y ~ trend() + fourier(period = 24, K = 12)),
        lm3 = TSLM(y ~ trend() + fourier(period = 24, K = 3)),
        snaive1 = SNAIVE(y),
        snaive2 = SNAIVE(y ~ drift())
    )

Let’s review the trained models object (a mable, i.e., a model table with one row per series and one column per model):

md
# A mable: 4 x 8
# Key:     subba [4]
  subba          ets                     arima     lm1     lm2     lm3  snaive1
  <chr>      <model>                   <model> <model> <model> <model>  <model>
1 PGAE  <ETS(M,N,M)> <ARIMA(2,0,2)(2,1,0)[24]>  <TSLM>  <TSLM>  <TSLM> <SNAIVE>
2 SCE   <ETS(M,N,M)> <ARIMA(2,0,2)(2,1,0)[24]>  <TSLM>  <TSLM>  <TSLM> <SNAIVE>
3 SDGE  <ETS(A,N,A)> <ARIMA(1,0,2)(2,1,0)[24]>  <TSLM>  <TSLM>  <TSLM> <SNAIVE>
4 VEA   <ETS(A,N,A)> <ARIMA(1,0,4)(0,1,1)[24]>  <TSLM>  <TSLM>  <TSLM> <SNAIVE>
# ℹ 1 more variable: snaive2 <model>

Once we train the models, we can go ahead and create a forecast:

fc <- md |>
    forecast(h = h)

The output object provides the point estimate and its distribution:

fc
# A fable: 2,016 x 5 [1h] <?>
# Key:     subba, .model [28]
   subba .model time                               y  .mean
   <chr> <chr>  <dttm>                        <dist>  <dbl>
 1 PGAE  ets    2024-08-18 00:00:00 N(12129, 117134) 12129.
 2 PGAE  ets    2024-08-18 01:00:00 N(13101, 224908) 13101.
 3 PGAE  ets    2024-08-18 02:00:00 N(13808, 347921) 13808.
 4 PGAE  ets    2024-08-18 03:00:00 N(14315, 479430) 14315.
 5 PGAE  ets    2024-08-18 04:00:00 N(14391, 591237) 14391.
 6 PGAE  ets    2024-08-18 05:00:00 N(14044, 664676) 14044.
 7 PGAE  ets    2024-08-18 06:00:00  N(13457, 7e+05) 13457.
 8 PGAE  ets    2024-08-18 07:00:00 N(12437, 680861) 12437.
 9 PGAE  ets    2024-08-18 08:00:00 N(11500, 650340) 11500.
10 PGAE  ets    2024-08-18 09:00:00 N(10911, 646950) 10911.
# ℹ 2,006 more rows
fc |>
    autoplot(test)
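Since each forecast is returned as a full distribution, you can also extract prediction intervals directly with the hilo function from fabletools (a short sketch; the levels are illustrative):

```r
# Extract 80% and 95% prediction intervals from the forecast distributions
fc |>
    hilo(level = c(80, 95))
```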

Last but not least, let’s evaluate the forecast performance on the testing partition using basic dplyr functions:

fc |>
    dplyr::left_join(test |> dplyr::select(time, subba, actual = y), by = c("time", "subba")) |>
    as.data.frame() |>
    dplyr::group_by(subba, .model) |>
    dplyr::summarise(
        mape = mean(abs(actual - .mean) / actual),
        rmse = (mean((actual - .mean)^2))^0.5
    ) |>
    dplyr::arrange(subba, mape)
`summarise()` has grouped output by 'subba'. You can override using the
`.groups` argument.
# A tibble: 28 × 4
# Groups:   subba [4]
   subba .model    mape  rmse
   <chr> <chr>    <dbl> <dbl>
 1 PGAE  arima   0.0588  930.
 2 PGAE  lm3     0.0764 1035.
 3 PGAE  ets     0.0765 1002.
 4 PGAE  lm1     0.0769 1030.
 5 PGAE  lm2     0.0769 1030.
 6 PGAE  snaive2 0.0807 1214.
 7 PGAE  snaive1 0.0807 1215.
 8 SCE   arima   0.0532 1092.
 9 SCE   ets     0.0792 1794.
10 SCE   snaive1 0.0985 1786.
# ℹ 18 more rows
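As an alternative to computing the metrics manually, fabletools ships with an accuracy function that scores a fable against the observed data and returns a set of standard metrics (MAPE, RMSE, MASE, and others). A sketch of the equivalent evaluation:

```r
# Score the forecasts against the testing partition with the built-in
# accuracy function, then keep and sort the metrics of interest
accuracy(fc, test) |>
    dplyr::select(subba, .model, MAPE, RMSE) |>
    dplyr::arrange(subba, MAPE)
```

Note that accuracy reports MAPE in percentage terms, so the values will be 100 times the ratios computed above.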