library(feasts)
library(fabletools)
library(tsibble)
library(dplyr)
library(plotly)
Forecasting with Linear Regression
Required Libraries
Loading supporting functions:
source("./functions.R")
Load Data
load(file = "./data/ts.RData")
Modeling Trend
head(ts1)
# A tsibble: 6 x 3 [1Y]
index y series_id
<int> <int> <chr>
1 1986 7626 SCA
2 1987 7904858 SCA
3 1988 8113034 SCA
4 1989 8313776 SCA
5 1990 8497848 SCA
6 1991 8634774 SCA
<- ts1 |>
ts1 ::filter(index > 1986)
dplyr
<- plot_ly() |>
p add_lines(x = ts1$index, y = ts1$y, type = "scatter", mode = "lines", name = "Actual")
p
Fitting a Linear Trend
To model the trend we will create an index variable and use linear regression model fit the index against the dependent variable - the series of interest:
<- ts1 |>
ts1 ::mutate(trend = 1:nrow(ts1))
dplyr
head(ts1)
# A tsibble: 6 x 4 [1Y]
index y series_id trend
<int> <int> <chr> <int>
1 1987 7904858 SCA 1
2 1988 8113034 SCA 2
3 1989 8313776 SCA 3
4 1990 8497848 SCA 4
5 1991 8634774 SCA 5
6 1992 8680613 SCA 6
Let’s now set a regression model:
<- lm(y ~ trend, data = ts1)
md1
summary(md1)
Call:
lm(formula = y ~ trend, data = ts1)
Residuals:
Min 1Q Median 3Q Max
-298308 -94696 -14364 98742 332048
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 8107966 54683 148.27 <2e-16 ***
trend 95200 2509 37.94 <2e-16 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 163000 on 35 degrees of freedom
Multiple R-squared: 0.9763, Adjusted R-squared: 0.9756
F-statistic: 1440 on 1 and 35 DF, p-value: < 2.2e-16
We will fit the model on the series to see the trend fit:
<- predict(object = md1, newdata = ts1, interval = "confidence", level = 0.95)
fit1 $fit1 <- fit1[, 1]
ts1<- p |>
p1 add_lines(x = ts1$index, y = ts1$fit1, mode = "lines", line = list(color = "black", dash = "dash"), name = "Fitted")
p1
Residuals Analysis
The goal of the residual analysis is to check the goodness of fit of the model with data the it was encountered during the fit process. The residual could indicate whether the model has a good fit or or some patterns left and a direction about the type of features that are needed for better predictions.
$res1 <- ts1$y - ts1$fit1
ts1
plot_ly(x = ts1$index, y = ts1$res1, type = "scatter", mode = "markers")
Clearly, you can see that some patterns are left in the residuals. We can check for autocorrelation:
plot_acf(ts = ts1, var = "res1", lag_max = 60, frequency = NULL, alpha = 0.05)
The ACF plot indicates that the residuals have a high correlation with the first lags. This is classic case, that adding an ARIMA model (i.e., regression with ARIMA errors) could improve the fit of the model.
Forecast the Trend
To forecast the future observations of the series, we will have to generate a new data frame with the corresponding features we used to train the model. Let’ say that we want to generate a 5 year forecast. We will use the new_data
function from the tsibble library to generate a new data frame with the corresponding features for the next 5 years:
<- 5
h <- new_data(ts1, n = h)
future_data future_data
# A tsibble: 5 x 1 [1Y]
index
<dbl>
1 2024
2 2025
3 2026
4 2027
5 2028
Next, we will generate the trend feature:
<- max(ts1$trend) + 1
trend_start <- trend_start + h - 1
trend_end $trend <- trend_start:trend_end
future_data future_data
# A tsibble: 5 x 2 [1Y]
index trend
<dbl> <int>
1 2024 38
2 2025 39
3 2026 40
4 2027 41
5 2028 42
Now we can use the predict
function to create the forecast:
<- predict(object = md1, newdata = future_data, interval = "prediction", level = 0.95)
fc1
$yhat <- fc1[,1]
future_data$lower <- fc1[,2]
future_data$upper <- fc1[,3]
future_data future_data
# A tsibble: 5 x 5 [1Y]
index trend yhat lower upper
<dbl> <int> <dbl> <dbl> <dbl>
1 2024 38 11725578. 11376638. 12074518.
2 2025 39 11820778. 11470391. 12171165.
3 2026 40 11915978. 11564077. 12267880.
4 2027 41 12011178. 11657695. 12364662.
5 2028 42 12106379. 11751248. 12461510.
Let’s plot the forecast
|>
p1 add_ribbons(x = future_data$index, ymin = future_data$lower,
ymax = future_data$upper, name = "95% Prediction Interval",
line = list(color = 'rgba(7, 164, 181, 0.05)'),
fillcolor = 'rgba(7, 164, 181, 0.2)') |>
add_lines(x = future_data$index,
y = future_data$yhat,
name = "Forecast",
line = list(color = "black", dash = "dash"))
Improving the model
Reviewing the results, there are two features we can use to improve the forecast:
- Piecewise linear trend
- Adding AR model
- Combing both approaches
Piecewise Linear Trend
Eyeballing the series, you can notice a change in the trend at around 2008. We will use this information to create a piecewise linear trend.
<- ts1$trend[which(ts1$index == 2008)]
s $trend2 <- pmax(0, ts1$trend - s)
ts1
<- lm(y ~ trend + trend2, data = ts1)
md2
summary(md2)
Call:
lm(formula = y ~ trend + trend2, data = ts1)
Residuals:
Min 1Q Median 3Q Max
-144490 -28221 -6315 24236 165205
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 7888615 30832 255.86 < 2e-16 ***
trend 116191 2107 55.14 < 2e-16 ***
trend2 -55336 4692 -11.79 1.45e-13 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 73280 on 34 degrees of freedom
Multiple R-squared: 0.9953, Adjusted R-squared: 0.9951
F-statistic: 3629 on 2 and 34 DF, p-value: < 2.2e-16
Let’s fit the model and plot it:
<- predict(object = md2, newdata = ts1, interval = "confidence", level = 0.95)
fit2$fit2 <- fit2[, 1]
ts1<- p |>
p2 add_lines(x = ts1$index, y = ts1$fit2, mode = "lines", line = list(color = "black", dash = "dash"), name = "Fitted")
p2
We will add to the plot the features:
<- plot_ly() |>
trend1 add_lines(x = ts1$index, y = ts1$trend, name = "Main Trend Feature")
<- plot_ly() |>
trend2 add_lines(x = ts1$index, y = ts1$trend2, name = "Piecewise Trend Feature")
subplot(p2, trend1, trend2, nrows = 3)
Let’s review the residuals:
$res2 <- ts1$y - ts1$fit2
ts1
plot_ly(x = ts1$index, y = ts1$res2, type = "scatter", mode = "markers")
And the ACF of the residuals:
plot_acf(ts = ts1, var = "res2", lag_max = 60, frequency = NULL, alpha = 0.05)
We can noticed from the residuals and ACF plots that the model goodness of fit improved with the addition of a second trend feature. Yet, the ACF plot indicate that some correlation left on the residuals within the first lag. An AR model with order 1 can help to remove this correlation.
AR model
The Auto Regressive (AR) model is a common approach in time series forecasting to model time series with high correlation with its past lags. The AR model is part of a family of models that construct the ARIMA family of models.
At its core, AR model is a regression of the series with its previous lags. For example, AR 1 model would be a regression of the series with its lag one:
\[ Y_t = \alpha + \beta * Y_{t-1} \]
Let’s create a lag 1 feature:
$lag1 <- dplyr::lag(ts1$y, n = 1)
ts1
ts1
# A tsibble: 37 x 10 [1Y]
index y series_id trend fit1 res1 trend2 fit2 res2 lag1
<int> <int> <chr> <int> <dbl> <dbl> <dbl> <dbl> <dbl> <int>
1 1987 7904858 SCA 1 8203166. -298308. 0 8.00e6 -99948. NA
2 1988 8113034 SCA 2 8298367. -185333. 0 8.12e6 -7963. 7904858
3 1989 8313776 SCA 3 8393567. -79791. 0 8.24e6 76588. 8113034
4 1990 8497848 SCA 4 8488767. 9081. 0 8.35e6 144469. 8313776
5 1991 8634774 SCA 5 8583968. 50806. 0 8.47e6 165205. 8497848
6 1992 8680613 SCA 6 8679168. 1445. 0 8.59e6 94853. 8634774
7 1993 8726187 SCA 7 8774368. -48181. 0 8.70e6 24236. 8680613
8 1994 8790733 SCA 8 8869569. -78836. 0 8.82e6 -27409. 8726187
9 1995 8865541 SCA 9 8964769. -99228. 0 8.93e6 -68792. 8790733
10 1996 8969308 SCA 10 9059969. -90661. 0 9.05e6 -81215. 8865541
# ℹ 27 more rows
Let’s now fit the model:
<- lm(y ~ trend + lag1, data = ts1)
md3
summary(md3)
Call:
lm(formula = y ~ trend + lag1, data = ts1)
Residuals:
Min 1Q Median 3Q Max
-114027 -37159 3657 32800 126391
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 7.944e+05 5.156e+05 1.541 0.133
trend 5.062e+03 6.291e+03 0.805 0.427
lag1 9.193e-01 6.443e-02 14.267 1.14e-15 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 59240 on 33 degrees of freedom
(1 observation deleted due to missingness)
Multiple R-squared: 0.9967, Adjusted R-squared: 0.9965
F-statistic: 4970 on 2 and 33 DF, p-value: < 2.2e-16
We will refit the model and plot it:
<- predict(object = md3, newdata = ts1, interval = "confidence", level = 0.95)
fit3$fit3 <- fit3[, 1]
ts1<- p |>
p3 add_lines(x = ts1$index, y = ts1$fit3, mode = "lines", line = list(color = "black", dash = "dash"), name = "Fitted")
Let’s review the residual ACF plot:
$res3 <- ts1$y - ts1$fit3
ts1
plot_acf(ts = ts1, var = "res3", lag_max = 60, frequency = NULL, alpha = 0.05)
The main downside of using lags are that you will have to produce the lags for your forecast assuming the lag number is smaller than the forecast horizon. In this case, you will have to use a recursive function and use previous predication values as lags for your next prediction.
A great article about piecewise linear trends modeling can be found in the following blog post by Prof. Rob J Hyndman.
Modeling Seasonal Time Series
Let’s now see how can we model a seasonal time series with linear regression.
<- plot_ly() |>
p add_lines(x = ts2$date, y = ts2$y, name = "Actual")
p
Let’s decompose the series:
<- ts2 |> model( STL(y)) |> components()
stl_d
plot_decomposition(stl_d, var = "y", outliers = TRUE)
We will build the features step-by-step to review the impact of each feature on the goodness of fit. Let’s start by defining the trend feature:
$trend <- 1:nrow(ts2)
ts2
ts2
# A tsibble: 292 x 5 [1M]
index date y series_id trend
<mth> <date> <int> <chr> <int>
1 2001 Jan 2001-01-01 2505011 NUS 1
2 2001 Feb 2001-02-01 2156873 NUS 2
3 2001 Mar 2001-03-01 2086568 NUS 3
4 2001 Apr 2001-04-01 1663832 NUS 4
5 2001 May 2001-05-01 1385163 NUS 5
6 2001 Jun 2001-06-01 1313119 NUS 6
7 2001 Jul 2001-07-01 1459919 NUS 7
8 2001 Aug 2001-08-01 1528483 NUS 8
9 2001 Sep 2001-09-01 1360871 NUS 9
10 2001 Oct 2001-10-01 1507428 NUS 10
# ℹ 282 more rows
<- lm(y ~ trend, data = ts2)
mds1
summary(mds1)
Call:
lm(formula = y ~ trend, data = ts2)
Residuals:
Min 1Q Median 3Q Max
-520488 -312830 -136594 304148 1143010
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 1554363.7 46010.9 33.78 <2e-16 ***
trend 3161.1 272.2 11.61 <2e-16 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 392100 on 290 degrees of freedom
Multiple R-squared: 0.3174, Adjusted R-squared: 0.315
F-statistic: 134.8 on 1 and 290 DF, p-value: < 2.2e-16
Let’s fit the model and plot it:
<- predict(object = mds1, newdata = ts2, interval = "confidence", level = 0.95)
fit_s1 $fit1 <- fit_s1[, 1]
ts2<- p |>
p1 add_lines(x = ts2$date, y = ts2$fit1, mode = "lines", line = list(color = "black", dash = "dash"), name = "Fitted")
p1
$res1 <- ts2$y - ts2$fit1 ts2
If you check the model residuals, you will noticed that the residuals are detrained version of the original series.
We will now add the seasonal feature by creating a categorical variable for each month of the year:
$month <- lubridate::month(ts2$index, label = TRUE)
ts2 ts2
# A tsibble: 292 x 8 [1M]
index date y series_id trend fit1 res1 month
<mth> <date> <int> <chr> <int> <dbl> <dbl> <ord>
1 2001 Jan 2001-01-01 2505011 NUS 1 1557525. 947486. Jan
2 2001 Feb 2001-02-01 2156873 NUS 2 1560686. 596187. Feb
3 2001 Mar 2001-03-01 2086568 NUS 3 1563847. 522721. Mar
4 2001 Apr 2001-04-01 1663832 NUS 4 1567008. 96824. Apr
5 2001 May 2001-05-01 1385163 NUS 5 1570169. -185006. May
6 2001 Jun 2001-06-01 1313119 NUS 6 1573330. -260211. Jun
7 2001 Jul 2001-07-01 1459919 NUS 7 1576492. -116573. Jul
8 2001 Aug 2001-08-01 1528483 NUS 8 1579653. -51170. Aug
9 2001 Sep 2001-09-01 1360871 NUS 9 1582814. -221943. Sep
10 2001 Oct 2001-10-01 1507428 NUS 10 1585975. -78547. Oct
# ℹ 282 more rows
<- lm(y ~ trend + month, data = ts2)
mds2
summary(mds2)
Call:
lm(formula = y ~ trend + month, data = ts2)
Residuals:
Min 1Q Median 3Q Max
-502902 -80824 -9854 89213 381288
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 1553263.86 14898.52 104.256 < 2e-16 ***
trend 3139.58 88.15 35.617 < 2e-16 ***
month.L -462479.01 25661.43 -18.022 < 2e-16 ***
month.Q 1112284.49 25741.67 43.209 < 2e-16 ***
month.C 25030.45 25691.87 0.974 0.330774
month^4 202426.46 25711.62 7.873 7.72e-14 ***
month^5 277448.53 25726.76 10.784 < 2e-16 ***
month^6 -87480.59 25698.73 -3.404 0.000761 ***
month^7 -208327.63 25736.91 -8.095 1.79e-14 ***
month^8 79379.32 25733.10 3.085 0.002242 **
month^9 41075.62 25719.98 1.597 0.111391
month^10 71934.03 25803.57 2.788 0.005672 **
month^11 -17836.16 25888.14 -0.689 0.491414
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 126900 on 279 degrees of freedom
Multiple R-squared: 0.9312, Adjusted R-squared: 0.9282
F-statistic: 314.6 on 12 and 279 DF, p-value: < 2.2e-16
As you can noticed from the model summary, adding the seasonal feature improved the model goodness of fit.
<- predict(object = mds2, newdata = ts2, interval = "confidence", level = 0.95)
fit_s2 $fit2 <- fit_s2[, 1]
ts2<- p |>
p2 add_lines(x = ts2$date, y = ts2$fit2, mode = "lines", line = list(color = "black", dash = "dash"), name = "Fitted")
p2
Let’s review now the model residuals:
$res2 <- ts2$y - ts2$fit2
ts2
plot_ly(x = ts2$date, y = ts2$res2, type = "scatter", mode = "markers")
And the ACF of the residuals:
plot_acf(ts = ts2, var = "res2", lag_max = 60, frequency = NULL, alpha = 0.05)
The is a ton of information that left on the residuals!
We will explore the following features:
- Structural breaks
- Piecewise Linear Trend
- AR
- Outliers
Structural Break Feature
Often, time series would have an abrupt shift in its structure due to some event. A good example, is the Covid-19 pandemic impact on some macro-economic indicators such as the unemployment rate and number of passengers traveling on airplanes. At the end of this change, the series would be in a new level.
We can notice that the natural gas series had a structural break on the trend component around September 2018. To model a structural break, we will create a new variable that is equal to 0 prior to September 2018 and 1 after this date:
$structural_break <- ifelse(ts2$date >= as.Date("2018-09-01"), 1, 0)
ts2
|> head() ts2
# A tsibble: 6 x 11 [1M]
index date y series_id trend fit1 res1 month fit2 res2
<mth> <date> <int> <chr> <int> <dbl> <dbl> <ord> <dbl> <dbl>
1 2001 Jan 2001-01-01 2505011 NUS 1 1.56e6 9.47e5 Jan 2.33e6 1.80e5
2 2001 Feb 2001-02-01 2156873 NUS 2 1.56e6 5.96e5 Feb 2.00e6 1.59e5
3 2001 Mar 2001-03-01 2086568 NUS 3 1.56e6 5.23e5 Mar 1.80e6 2.91e5
4 2001 Apr 2001-04-01 1663832 NUS 4 1.57e6 9.68e4 Apr 1.37e6 2.96e5
5 2001 May 2001-05-01 1385163 NUS 5 1.57e6 -1.85e5 May 1.21e6 1.78e5
6 2001 Jun 2001-06-01 1313119 NUS 6 1.57e6 -2.60e5 Jun 1.22e6 9.50e4
# ℹ 1 more variable: structural_break <dbl>
|> tail() ts2
# A tsibble: 6 x 11 [1M]
index date y series_id trend fit1 res1 month fit2 res2
<mth> <date> <int> <chr> <int> <dbl> <dbl> <ord> <dbl> <dbl>
1 2024 Nov 2024-11-01 2.45e6 NUS 287 2.46e6 -1.16e4 Nov 2.44e6 1.33e4
2 2024 Dec 2024-12-01 3.06e6 NUS 288 2.46e6 5.97e5 Dec 2.93e6 1.29e5
3 2025 Jan 2025-01-01 3.61e6 NUS 289 2.47e6 1.14e6 Jan 3.23e6 3.81e5
4 2025 Feb 2025-02-01 2.97e6 NUS 290 2.47e6 4.96e5 Feb 2.90e6 6.43e4
5 2025 Mar 2025-03-01 2.48e6 NUS 291 2.47e6 2.53e3 Mar 2.70e6 -2.23e5
6 2025 Apr 2025-04-01 2.13e6 NUS 292 2.48e6 -3.46e5 Apr 2.27e6 -1.40e5
# ℹ 1 more variable: structural_break <dbl>
<- lm(y ~ trend + month + structural_break, data = ts2)
mds3
summary(mds3)
Call:
lm(formula = y ~ trend + month + structural_break, data = ts2)
Residuals:
Min 1Q Median 3Q Max
-513310 -69642 -11897 75605 359602
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 1606736.2 15935.8 100.825 < 2e-16 ***
trend 2466.7 128.9 19.137 < 2e-16 ***
month.L -462405.2 23825.1 -19.408 < 2e-16 ***
month.Q 1107749.4 23909.1 46.332 < 2e-16 ***
month.C 26359.3 23854.2 1.105 0.27011
month^4 205897.9 23877.3 8.623 5.05e-16 ***
month^5 278204.4 23886.1 11.647 < 2e-16 ***
month^6 -89335.4 23861.3 -3.744 0.00022 ***
month^7 -209935.3 23896.4 -8.785 < 2e-16 ***
month^8 79257.7 23891.7 3.317 0.00103 **
month^9 42418.4 23880.3 1.776 0.07678 .
month^10 73968.0 23959.0 3.087 0.00222 **
month^11 -16984.8 24036.0 -0.707 0.48038
structural_break 164703.4 24373.3 6.758 8.22e-11 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 117900 on 278 degrees of freedom
Multiple R-squared: 0.9409, Adjusted R-squared: 0.9381
F-statistic: 340.4 on 13 and 278 DF, p-value: < 2.2e-16
As you can noticed from the model summary, adding the seasonal feature improved the model goodness of fit.
<- predict(object = mds3, newdata = ts2, interval = "confidence", level = 0.95)
fit_s3 $fit3 <- fit_s3[, 1]
ts2<- p |>
p3 add_lines(x = ts2$date, y = ts2$fit3, mode = "lines", line = list(color = "black", dash = "dash"), name = "Fitted")
p3
let’s plot the residuals:
$res3 <- ts2$y - ts2$fit3
ts2
plot_ly(x = ts2$date, y = ts2$res3, type = "scatter", mode = "markers")
Let’s generate a forecast for the next 5 years (60 months):
<- 60
h <- new_data(ts2, n = h)
future_data future_data
# A tsibble: 60 x 1 [1M]
index
<mth>
1 2025 May
2 2025 Jun
3 2025 Jul
4 2025 Aug
5 2025 Sep
6 2025 Oct
7 2025 Nov
8 2025 Dec
9 2026 Jan
10 2026 Feb
# ℹ 50 more rows
We will have to build all the features we used so far:
<- max(ts2$trend) + 1
trend_start <- trend_start + h - 1
trend_end $trend <- trend_start:trend_end
future_data$month <- lubridate::month(future_data$index, label = TRUE)
future_data$structural_break <- 1
future_data$date <- as.Date(future_data$index)
future_data future_data
# A tsibble: 60 x 5 [1M]
index trend month structural_break date
<mth> <int> <ord> <dbl> <date>
1 2025 May 293 May 1 2025-05-01
2 2025 Jun 294 Jun 1 2025-06-01
3 2025 Jul 295 Jul 1 2025-07-01
4 2025 Aug 296 Aug 1 2025-08-01
5 2025 Sep 297 Sep 1 2025-09-01
6 2025 Oct 298 Oct 1 2025-10-01
7 2025 Nov 299 Nov 1 2025-11-01
8 2025 Dec 300 Dec 1 2025-12-01
9 2026 Jan 301 Jan 1 2026-01-01
10 2026 Feb 302 Feb 1 2026-02-01
# ℹ 50 more rows
Now we can use the predict
function to create the forecast:
<- predict(object = mds3, newdata = future_data, interval = "prediction", level = 0.95)
fcs1
$yhat <- fcs1[,1]
future_data$lower <- fcs1[,2]
future_data$upper <- fcs1[,3]
future_data future_data
# A tsibble: 60 x 8 [1M]
index trend month structural_break date yhat lower upper
<mth> <int> <ord> <dbl> <date> <dbl> <dbl> <dbl>
1 2025 May 293 May 1 2025-05-01 2134082. 1895983. 2372180.
2 2025 Jun 294 Jun 1 2025-06-01 2144918. 1906819. 2383017.
3 2025 Jul 295 Jul 1 2025-07-01 2335056. 2096957. 2573154.
4 2025 Aug 296 Aug 1 2025-08-01 2346698. 2108600. 2584797.
5 2025 Sep 297 Sep 1 2025-09-01 2131186. 1893134. 2369237.
6 2025 Oct 298 Oct 1 2025-10-01 2214975. 1976923. 2453027.
7 2025 Nov 299 Nov 1 2025-11-01 2490166. 2252114. 2728218.
8 2025 Dec 300 Dec 1 2025-12-01 2986358. 2748306. 3224410.
9 2026 Jan 301 Jan 1 2026-01-01 3280937. 3042985. 3518890.
10 2026 Feb 302 Feb 1 2026-02-01 2953839. 2715887. 3191792.
# ℹ 50 more rows
Let’s plot the forecast
|>
p3 add_ribbons(x = future_data$date, ymin = future_data$lower,
ymax = future_data$upper, name = "95% Prediction Interval",
line = list(color = 'rgba(7, 164, 181, 0.05)'),
fillcolor = 'rgba(7, 164, 181, 0.2)') |>
add_lines(x = future_data$date,
y = future_data$yhat,
name = "Forecast",
line = list(color = "black", dash = "dash"))
Handling Outliers
We can notices that the series has some years where the consumption is higher than normal years. We can segment two groups of outliers:
- High: Jan 2025
- Medium: Jan 2014, Jan 2018, Jan 2019, Jan 2022, Jan 2024
Let’s create the features:
$outlier_high <- ifelse(ts2$date == as.Date("2025-01-01"), 1, 0)
ts2<- c(as.Date("2014-01-01"), as.Date("2018-01-01"),as.Date("2019-01-01"), as.Date("2022-01-01"),as.Date("2024-01-01"))
outlier_medium_dates $outlier_medium <- ifelse(ts2$date %in% outlier_medium_dates, 1, 0)
ts2
table(ts2$outlier_high)
0 1
291 1
table(ts2$outlier_medium)
0 1
287 5
Let’s train the model with the new features:
<- lm(y ~ trend + month + structural_break + outlier_high + outlier_medium, data = ts2)
mds4
summary(mds4)
Call:
lm(formula = y ~ trend + month + structural_break + outlier_high +
outlier_medium, data = ts2)
Residuals:
Min 1Q Median 3Q Max
-452244 -68307 -8478 70375 277148
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 1611430.0 15218.6 105.886 < 2e-16 ***
trend 2401.6 123.5 19.447 < 2e-16 ***
month.L -431143.8 23482.6 -18.360 < 2e-16 ***
month.Q 1074312.6 23671.1 45.385 < 2e-16 ***
month.C 57400.9 23499.8 2.443 0.015209 *
month^4 180808.9 23260.8 7.773 1.52e-13 ***
month^5 295860.7 23021.2 12.852 < 2e-16 ***
month^6 -100247.7 22845.7 -4.388 1.63e-05 ***
month^7 -203941.0 22812.7 -8.940 < 2e-16 ***
month^8 76367.1 22786.3 3.351 0.000916 ***
month^9 43657.7 22770.0 1.917 0.056228 .
month^10 73445.8 22844.0 3.215 0.001459 **
month^11 -16853.5 22917.2 -0.735 0.462716
structural_break 161903.8 23248.4 6.964 2.42e-11 ***
outlier_high 438310.4 116133.7 3.774 0.000197 ***
outlier_medium 248928.4 57084.5 4.361 1.83e-05 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 112400 on 276 degrees of freedom
Multiple R-squared: 0.9466, Adjusted R-squared: 0.9438
F-statistic: 326.5 on 15 and 276 DF, p-value: < 2.2e-16
Let’s refit the model and plot the residuals:
<- predict(object = mds4, newdata = ts2, interval = "confidence", level = 0.95)
fit_s4 $fit4 <- fit_s4[, 1]
ts2<- p |>
p4 add_lines(x = ts2$date, y = ts2$fit4, mode = "lines", line = list(color = "black", dash = "dash"), name = "Fitted")
p4
$res4 <- ts2$y - ts2$fit4
ts2
plot_ly(x = ts2$date, y = ts2$res4, type = "scatter", mode = "markers")
What else can be done?
- Segment the residual with standard deviation
- Add AR model to capture the correlation between residuals
Simulation
Often, there are unexpected events that we won’t know if and when they will occur in the future. We saw in the series that there are years that the consumption during the month of January is significantly higher. One approach to mitigate this risk is to use external information about future behavior such as weather prediction. Alternatively, we can set some assumption and about the probability of those events in the future and conduct simulation.
In the following example, we will illustrate the impact of training outlier and add its impact to the forecast. Let’s first create a baseline forecast:
$outlier_high <- 0
future_data$outlier_medium <- 0
future_data
<- predict(object = mds4, newdata = future_data, interval = "prediction", level = 0.95)
fcs2
$yhat2 <- fcs2[,1]
future_data$lower2 <- fcs2[,2]
future_data$upper2 <- fcs2[,3]
future_data
|>
p4 add_ribbons(x = future_data$date, ymin = future_data$lower2,
ymax = future_data$upper2, name = "95% Prediction Interval",
line = list(color = 'rgba(7, 164, 181, 0.05)'),
fillcolor = 'rgba(7, 164, 181, 0.2)') |>
add_lines(x = future_data$date,
y = future_data$yhat2,
name = "Forecast",
line = list(color = "black", dash = "dash"))
Now, let’s assume that: - High consumption is expected in January 2029 - Medium consumption is expected in January 2027
Let’s update the features on the future data set:
$outlier_high <-ifelse(future_data$date == "2029-01-01", 1, future_data$outlier_high)
future_data
$outlier_medium <-ifelse(future_data$date == "2027-01-01", 1, future_data$outlier_medium) future_data
<- predict(object = mds4, newdata = future_data, interval = "prediction", level = 0.95)
fcs3
$yhat3 <- fcs3[,1]
future_data$lower3 <- fcs3[,2]
future_data$upper3 <- fcs3[,3]
future_data
|>
p4 add_ribbons(x = future_data$date, ymin = future_data$lower3,
ymax = future_data$upper3, name = "95% Prediction Interval",
line = list(color = 'rgba(7, 164, 181, 0.05)'),
fillcolor = 'rgba(7, 164, 181, 0.2)') |>
add_lines(x = future_data$date,
y = future_data$yhat3,
name = "Forecast",
line = list(color = "black", dash = "dash"))