Forecasting with Linear Regression

Author

Rami Krispin

Published

July 5, 2025

Required Libraries

library(feasts)
library(fabletools)
library(tsibble)
library(dplyr)
library(plotly)

Loading supporting functions:

source("./functions.R")

Load Data

load(file = "./data/ts.RData")

Modeling Trend

head(ts1)
# A tsibble: 6 x 3 [1Y]
  index       y series_id
  <int>   <int> <chr>    
1  1986    7626 SCA      
2  1987 7904858 SCA      
3  1988 8113034 SCA      
4  1989 8313776 SCA      
5  1990 8497848 SCA      
6  1991 8634774 SCA      
ts1 <- ts1 |> 
  dplyr::filter(index > 1986)


p <- plot_ly() |>
  add_lines(x = ts1$index, y = ts1$y, name = "Actual")

p

Fitting a Linear Trend

To model the trend, we will create an index variable and use a linear regression model to fit the index against the dependent variable, the series of interest.
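
In equation form, the trend model is:

\[ Y_t = \beta_0 + \beta_1 t + \epsilon_t \]

where \(t\) is the index variable we create below: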

ts1 <- ts1 |> 
  dplyr::mutate(trend = 1:nrow(ts1))

head(ts1)
# A tsibble: 6 x 4 [1Y]
  index       y series_id trend
  <int>   <int> <chr>     <int>
1  1987 7904858 SCA           1
2  1988 8113034 SCA           2
3  1989 8313776 SCA           3
4  1990 8497848 SCA           4
5  1991 8634774 SCA           5
6  1992 8680613 SCA           6

Let’s now fit a regression model:

md1 <- lm(y ~ trend, data = ts1)

summary(md1)

Call:
lm(formula = y ~ trend, data = ts1)

Residuals:
    Min      1Q  Median      3Q     Max 
-298308  -94696  -14364   98742  332048 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept)  8107966      54683  148.27   <2e-16 ***
trend          95200       2509   37.94   <2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 163000 on 35 degrees of freedom
Multiple R-squared:  0.9763,    Adjusted R-squared:  0.9756 
F-statistic:  1440 on 1 and 35 DF,  p-value: < 2.2e-16

We will apply the model to the series to see how well the trend fits:

fit1 <- predict(object = md1, newdata = ts1, interval = "confidence", level = 0.95)
ts1$fit1 <- fit1[, 1]
p1 <- p |> 
  add_lines(x = ts1$index, y = ts1$fit1, line = list(color = "black", dash = "dash"), name = "Fitted")

p1

Residuals Analysis

The goal of residual analysis is to check the goodness of fit of the model on the data it encountered during the fitting process. The residuals indicate whether the model fits well or whether some patterns were left behind, and they point to the types of features needed for better predictions.

ts1$res1 <- ts1$y - ts1$fit1
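
As a quick sanity check, the manually computed residuals should match the residuals stored on the lm object:

all.equal(unname(ts1$res1), unname(residuals(md1)))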


plot_ly(x = ts1$index, y = ts1$res1, type = "scatter", mode = "markers")

Clearly, you can see that some patterns are left in the residuals. We can check for autocorrelation:

plot_acf(ts = ts1, var = "res1", lag_max = 60, frequency = NULL, alpha = 0.05)

The ACF plot indicates that the residuals have a high correlation with the first lags. This is a classic case where adding an ARIMA error structure (i.e., regression with ARIMA errors) could improve the fit of the model.
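
As an aside (not used in the workflow below), a regression with AR(1) errors can be fit with base R’s arima() function, using the trend as an external regressor:

# Sketch: linear trend with AR(1) errors via base R's arima()
md_ar_err <- arima(ts1$y, order = c(1, 0, 0), xreg = ts1$trend)
md_ar_err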

Forecast the Trend

To forecast future observations of the series, we will have to generate a new data frame with the features used to train the model. Let’s say that we want to generate a 5-year forecast. We will use the new_data function from the tsibble library to generate a new data frame with the index of the next 5 years:

h <- 5
future_data <- new_data(ts1, n = h)
future_data
# A tsibble: 5 x 1 [1Y]
  index
  <dbl>
1  2024
2  2025
3  2026
4  2027
5  2028

Next, we will generate the trend feature:

trend_start <- max(ts1$trend) + 1
trend_end <- trend_start + h - 1
future_data$trend <- trend_start:trend_end
future_data
# A tsibble: 5 x 2 [1Y]
  index trend
  <dbl> <int>
1  2024    38
2  2025    39
3  2026    40
4  2027    41
5  2028    42

Now we can use the predict function to create the forecast:

fc1 <- predict(object = md1, newdata = future_data, interval = "prediction", level = 0.95)

future_data$yhat <- fc1[,1]
future_data$lower <- fc1[,2]
future_data$upper <- fc1[,3]
future_data
# A tsibble: 5 x 5 [1Y]
  index trend      yhat     lower     upper
  <dbl> <int>     <dbl>     <dbl>     <dbl>
1  2024    38 11725578. 11376638. 12074518.
2  2025    39 11820778. 11470391. 12171165.
3  2026    40 11915978. 11564077. 12267880.
4  2027    41 12011178. 11657695. 12364662.
5  2028    42 12106379. 11751248. 12461510.

Let’s plot the forecast:

p1 |> 
  add_ribbons(x = future_data$index, ymin = future_data$lower,
              ymax = future_data$upper, name = "95% Prediction Interval",
              line = list(color = 'rgba(7, 164, 181, 0.05)'),
              fillcolor = 'rgba(7, 164, 181, 0.2)') |>
  add_lines(x = future_data$index, 
            y = future_data$yhat, 
            name = "Forecast", 
            line = list(color = "black", dash = "dash"))

Improving the Model

Reviewing the results, there are a few approaches we can use to improve the forecast:

  • Piecewise linear trend
  • Adding an AR term
  • Combining both approaches

Piecewise Linear Trend

Eyeballing the series, you can notice a change in the trend around 2008. We will use this information to create a piecewise linear trend.
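
In other words, the piecewise model adds a second trend term that is zero up to the knot \(s\) (the trend index of 2008) and grows linearly afterward:

\[ Y_t = \beta_0 + \beta_1 t + \beta_2 \max(0, t - s) + \epsilon_t \]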

s <- ts1$trend[which(ts1$index == 2008)]
ts1$trend2 <- pmax(0, ts1$trend - s)

md2 <- lm(y ~ trend + trend2, data = ts1)

summary(md2)

Call:
lm(formula = y ~ trend + trend2, data = ts1)

Residuals:
    Min      1Q  Median      3Q     Max 
-144490  -28221   -6315   24236  165205 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept)  7888615      30832  255.86  < 2e-16 ***
trend         116191       2107   55.14  < 2e-16 ***
trend2        -55336       4692  -11.79 1.45e-13 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 73280 on 34 degrees of freedom
Multiple R-squared:  0.9953,    Adjusted R-squared:  0.9951 
F-statistic:  3629 on 2 and 34 DF,  p-value: < 2.2e-16

Let’s fit the model and plot it:

fit2 <- predict(object = md2, newdata = ts1, interval = "confidence", level = 0.95)
ts1$fit2 <- fit2[, 1]
p2 <- p |> 
  add_lines(x = ts1$index, y = ts1$fit2, line = list(color = "black", dash = "dash"), name = "Fitted")

p2

We will add the trend features to the plot:

trend1 <- plot_ly() |>
  add_lines(x = ts1$index, y = ts1$trend, name = "Main Trend Feature")

trend2 <- plot_ly() |>
  add_lines(x = ts1$index, y = ts1$trend2, name = "Piecewise Trend Feature")

subplot(p2, trend1, trend2, nrows = 3)

Let’s review the residuals:

ts1$res2 <- ts1$y - ts1$fit2


plot_ly(x = ts1$index, y = ts1$res2, type = "scatter", mode = "markers")

And the ACF of the residuals:

plot_acf(ts = ts1, var = "res2", lag_max = 60, frequency = NULL, alpha = 0.05)

We can see from the residuals and ACF plots that the model’s goodness of fit improved with the addition of the second trend feature. Yet, the ACF plot indicates that some correlation is left in the residuals at the first lag. An AR term of order 1 can help remove this correlation.

AR Model

The autoregressive (AR) model is a common approach in time series forecasting for modeling series that are highly correlated with their past lags. It is one of the building blocks of the ARIMA family of models.

At its core, an AR model is a regression of the series on its previous lags. For example, an AR(1) model is a regression of the series on its first lag:

\[ Y_t = \alpha + \beta_1 Y_{t-1} + \epsilon_t \]
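
More generally, an AR(p) model regresses the series on its first \(p\) lags:

\[ Y_t = \alpha + \sum_{i=1}^{p} \beta_i Y_{t-i} + \epsilon_t \]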

Let’s create a lag 1 feature:

ts1$lag1 <- dplyr::lag(ts1$y, n = 1)

ts1
# A tsibble: 37 x 10 [1Y]
   index       y series_id trend     fit1     res1 trend2   fit2    res2    lag1
   <int>   <int> <chr>     <int>    <dbl>    <dbl>  <dbl>  <dbl>   <dbl>   <int>
 1  1987 7904858 SCA           1 8203166. -298308.      0 8.00e6 -99948.      NA
 2  1988 8113034 SCA           2 8298367. -185333.      0 8.12e6  -7963. 7904858
 3  1989 8313776 SCA           3 8393567.  -79791.      0 8.24e6  76588. 8113034
 4  1990 8497848 SCA           4 8488767.    9081.      0 8.35e6 144469. 8313776
 5  1991 8634774 SCA           5 8583968.   50806.      0 8.47e6 165205. 8497848
 6  1992 8680613 SCA           6 8679168.    1445.      0 8.59e6  94853. 8634774
 7  1993 8726187 SCA           7 8774368.  -48181.      0 8.70e6  24236. 8680613
 8  1994 8790733 SCA           8 8869569.  -78836.      0 8.82e6 -27409. 8726187
 9  1995 8865541 SCA           9 8964769.  -99228.      0 8.93e6 -68792. 8790733
10  1996 8969308 SCA          10 9059969.  -90661.      0 9.05e6 -81215. 8865541
# ℹ 27 more rows

Let’s now fit the model:

md3 <- lm(y ~ trend + lag1, data = ts1)

summary(md3)

Call:
lm(formula = y ~ trend + lag1, data = ts1)

Residuals:
    Min      1Q  Median      3Q     Max 
-114027  -37159    3657   32800  126391 

Coefficients:
             Estimate Std. Error t value Pr(>|t|)    
(Intercept) 7.944e+05  5.156e+05   1.541    0.133    
trend       5.062e+03  6.291e+03   0.805    0.427    
lag1        9.193e-01  6.443e-02  14.267 1.14e-15 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 59240 on 33 degrees of freedom
  (1 observation deleted due to missingness)
Multiple R-squared:  0.9967,    Adjusted R-squared:  0.9965 
F-statistic:  4970 on 2 and 33 DF,  p-value: < 2.2e-16

We will refit the model and plot it:

fit3 <- predict(object = md3, newdata = ts1, interval = "confidence", level = 0.95)
ts1$fit3 <- fit3[, 1]
p3 <- p |> 
  add_lines(x = ts1$index, y = ts1$fit3, line = list(color = "black", dash = "dash"), name = "Fitted")

p3

Let’s review the residual ACF plot:

ts1$res3 <- ts1$y - ts1$fit3

plot_acf(ts = ts1, var = "res3", lag_max = 60, frequency = NULL, alpha = 0.05)
Using Lags

The main downside of using lags is that you will have to produce the lag values for your forecast whenever the lag order is smaller than the forecast horizon. In this case, you will have to forecast recursively, using previous prediction values as the lag inputs for the next predictions.
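
A minimal sketch of such a recursive loop with md3 (illustrative only; the helper objects below are not part of the original code):

h <- 5
last_lag <- tail(ts1$y, 1)
rec_fc <- numeric(h)
for (i in seq_len(h)) {
  # A one-row data frame with the features md3 expects
  step <- data.frame(trend = max(ts1$trend) + i, lag1 = last_lag)
  rec_fc[i] <- predict(md3, newdata = step)
  # Feed the prediction back as the lag input of the next step
  last_lag <- rec_fc[i]
}
rec_fc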

A great article about modeling piecewise linear trends can be found in a blog post by Prof. Rob J Hyndman.

Modeling Seasonal Time Series

Let’s now see how we can model a seasonal time series with linear regression.

p <- plot_ly() |>
  add_lines(x = ts2$date, y = ts2$y, name = "Actual")

p

Let’s decompose the series:

stl_d <- ts2 |> model(STL(y)) |> components()

plot_decomposition(stl_d, var = "y", outliers = TRUE)

We will build the features step-by-step to review the impact of each feature on the goodness of fit. Let’s start by defining the trend feature:

ts2$trend <- 1:nrow(ts2)

ts2
# A tsibble: 292 x 5 [1M]
      index date             y series_id trend
      <mth> <date>       <int> <chr>     <int>
 1 2001 Jan 2001-01-01 2505011 NUS           1
 2 2001 Feb 2001-02-01 2156873 NUS           2
 3 2001 Mar 2001-03-01 2086568 NUS           3
 4 2001 Apr 2001-04-01 1663832 NUS           4
 5 2001 May 2001-05-01 1385163 NUS           5
 6 2001 Jun 2001-06-01 1313119 NUS           6
 7 2001 Jul 2001-07-01 1459919 NUS           7
 8 2001 Aug 2001-08-01 1528483 NUS           8
 9 2001 Sep 2001-09-01 1360871 NUS           9
10 2001 Oct 2001-10-01 1507428 NUS          10
# ℹ 282 more rows
mds1 <- lm(y ~ trend, data = ts2)

summary(mds1)

Call:
lm(formula = y ~ trend, data = ts2)

Residuals:
    Min      1Q  Median      3Q     Max 
-520488 -312830 -136594  304148 1143010 

Coefficients:
             Estimate Std. Error t value Pr(>|t|)    
(Intercept) 1554363.7    46010.9   33.78   <2e-16 ***
trend          3161.1      272.2   11.61   <2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 392100 on 290 degrees of freedom
Multiple R-squared:  0.3174,    Adjusted R-squared:  0.315 
F-statistic: 134.8 on 1 and 290 DF,  p-value: < 2.2e-16

Let’s fit the model and plot it:

fit_s1 <- predict(object = mds1, newdata = ts2, interval = "confidence", level = 0.95)
ts2$fit1 <- fit_s1[, 1]
p1 <- p |> 
  add_lines(x = ts2$date, y = ts2$fit1, line = list(color = "black", dash = "dash"), name = "Fitted")

p1

ts2$res1 <- ts2$y - ts2$fit1
Detrending the Series

If you check the model residuals, you will notice that they are a detrended version of the original series.
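
For example, plotting the residuals of the trend model shows the series with the linear trend removed:

plot_ly(x = ts2$date, y = ts2$res1, type = "scatter", mode = "lines")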

We will now add the seasonal feature by creating a categorical variable for each month of the year:

ts2$month <- lubridate::month(ts2$index, label = TRUE)
ts2
# A tsibble: 292 x 8 [1M]
      index date             y series_id trend     fit1     res1 month
      <mth> <date>       <int> <chr>     <int>    <dbl>    <dbl> <ord>
 1 2001 Jan 2001-01-01 2505011 NUS           1 1557525.  947486. Jan  
 2 2001 Feb 2001-02-01 2156873 NUS           2 1560686.  596187. Feb  
 3 2001 Mar 2001-03-01 2086568 NUS           3 1563847.  522721. Mar  
 4 2001 Apr 2001-04-01 1663832 NUS           4 1567008.   96824. Apr  
 5 2001 May 2001-05-01 1385163 NUS           5 1570169. -185006. May  
 6 2001 Jun 2001-06-01 1313119 NUS           6 1573330. -260211. Jun  
 7 2001 Jul 2001-07-01 1459919 NUS           7 1576492. -116573. Jul  
 8 2001 Aug 2001-08-01 1528483 NUS           8 1579653.  -51170. Aug  
 9 2001 Sep 2001-09-01 1360871 NUS           9 1582814. -221943. Sep  
10 2001 Oct 2001-10-01 1507428 NUS          10 1585975.  -78547. Oct  
# ℹ 282 more rows
mds2 <- lm(y ~ trend + month, data = ts2)

summary(mds2)

Call:
lm(formula = y ~ trend + month, data = ts2)

Residuals:
    Min      1Q  Median      3Q     Max 
-502902  -80824   -9854   89213  381288 

Coefficients:
              Estimate Std. Error t value Pr(>|t|)    
(Intercept) 1553263.86   14898.52 104.256  < 2e-16 ***
trend          3139.58      88.15  35.617  < 2e-16 ***
month.L     -462479.01   25661.43 -18.022  < 2e-16 ***
month.Q     1112284.49   25741.67  43.209  < 2e-16 ***
month.C       25030.45   25691.87   0.974 0.330774    
month^4      202426.46   25711.62   7.873 7.72e-14 ***
month^5      277448.53   25726.76  10.784  < 2e-16 ***
month^6      -87480.59   25698.73  -3.404 0.000761 ***
month^7     -208327.63   25736.91  -8.095 1.79e-14 ***
month^8       79379.32   25733.10   3.085 0.002242 ** 
month^9       41075.62   25719.98   1.597 0.111391    
month^10      71934.03   25803.57   2.788 0.005672 ** 
month^11     -17836.16   25888.14  -0.689 0.491414    
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 126900 on 279 degrees of freedom
Multiple R-squared:  0.9312,    Adjusted R-squared:  0.9282 
F-statistic: 314.6 on 12 and 279 DF,  p-value: < 2.2e-16

As you can see from the model summary, adding the seasonal feature improved the model’s goodness of fit.
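
Note that lubridate::month(label = TRUE) returns an ordered factor, so lm() encodes it with orthogonal polynomial contrasts (the month.L, month.Q, … terms in the summary above). If you prefer one coefficient per month, a quick sketch (month_f is an illustrative name; the fitted values are identical either way):

# Sketch: encode the months as one dummy variable per month
ts2$month_f <- factor(ts2$month, ordered = FALSE)
mds2b <- lm(y ~ trend + month_f, data = ts2)

Let’s fit the model and plot it: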

fit_s2 <- predict(object = mds2, newdata = ts2, interval = "confidence", level = 0.95)
ts2$fit2 <- fit_s2[, 1]
p2 <- p |> 
  add_lines(x = ts2$date, y = ts2$fit2, line = list(color = "black", dash = "dash"), name = "Fitted")

p2

Let’s now review the model residuals:

ts2$res2 <- ts2$y - ts2$fit2


plot_ly(x = ts2$date, y = ts2$res2, type = "scatter", mode = "markers")

And the ACF of the residuals:

plot_acf(ts = ts2, var = "res2", lag_max = 60, frequency = NULL, alpha = 0.05)

There is a ton of information left in the residuals!

We will explore the following features:

  • Structural breaks
  • Piecewise linear trend
  • AR terms
  • Outliers

Structural Break Feature

Often, a time series has an abrupt shift in its structure due to some event. A good example is the impact of the Covid-19 pandemic on macroeconomic indicators such as the unemployment rate and the number of airline passengers. After such a change, the series settles at a new level.

We can notice that the natural gas series had a structural break in the trend component around September 2018. To model the structural break, we will create a new variable that is equal to 0 prior to September 2018 and 1 from that date onward:

ts2$structural_break <- ifelse(ts2$date >= as.Date("2018-09-01"), 1, 0)

ts2 |> head()
# A tsibble: 6 x 11 [1M]
     index date             y series_id trend   fit1    res1 month   fit2   res2
     <mth> <date>       <int> <chr>     <int>  <dbl>   <dbl> <ord>  <dbl>  <dbl>
1 2001 Jan 2001-01-01 2505011 NUS           1 1.56e6  9.47e5 Jan   2.33e6 1.80e5
2 2001 Feb 2001-02-01 2156873 NUS           2 1.56e6  5.96e5 Feb   2.00e6 1.59e5
3 2001 Mar 2001-03-01 2086568 NUS           3 1.56e6  5.23e5 Mar   1.80e6 2.91e5
4 2001 Apr 2001-04-01 1663832 NUS           4 1.57e6  9.68e4 Apr   1.37e6 2.96e5
5 2001 May 2001-05-01 1385163 NUS           5 1.57e6 -1.85e5 May   1.21e6 1.78e5
6 2001 Jun 2001-06-01 1313119 NUS           6 1.57e6 -2.60e5 Jun   1.22e6 9.50e4
# ℹ 1 more variable: structural_break <dbl>
ts2 |> tail()
# A tsibble: 6 x 11 [1M]
     index date            y series_id trend   fit1    res1 month   fit2    res2
     <mth> <date>      <int> <chr>     <int>  <dbl>   <dbl> <ord>  <dbl>   <dbl>
1 2024 Nov 2024-11-01 2.45e6 NUS         287 2.46e6 -1.16e4 Nov   2.44e6  1.33e4
2 2024 Dec 2024-12-01 3.06e6 NUS         288 2.46e6  5.97e5 Dec   2.93e6  1.29e5
3 2025 Jan 2025-01-01 3.61e6 NUS         289 2.47e6  1.14e6 Jan   3.23e6  3.81e5
4 2025 Feb 2025-02-01 2.97e6 NUS         290 2.47e6  4.96e5 Feb   2.90e6  6.43e4
5 2025 Mar 2025-03-01 2.48e6 NUS         291 2.47e6  2.53e3 Mar   2.70e6 -2.23e5
6 2025 Apr 2025-04-01 2.13e6 NUS         292 2.48e6 -3.46e5 Apr   2.27e6 -1.40e5
# ℹ 1 more variable: structural_break <dbl>
mds3 <- lm(y ~ trend + month + structural_break, data = ts2)

summary(mds3)

Call:
lm(formula = y ~ trend + month + structural_break, data = ts2)

Residuals:
    Min      1Q  Median      3Q     Max 
-513310  -69642  -11897   75605  359602 

Coefficients:
                  Estimate Std. Error t value Pr(>|t|)    
(Intercept)      1606736.2    15935.8 100.825  < 2e-16 ***
trend               2466.7      128.9  19.137  < 2e-16 ***
month.L          -462405.2    23825.1 -19.408  < 2e-16 ***
month.Q          1107749.4    23909.1  46.332  < 2e-16 ***
month.C            26359.3    23854.2   1.105  0.27011    
month^4           205897.9    23877.3   8.623 5.05e-16 ***
month^5           278204.4    23886.1  11.647  < 2e-16 ***
month^6           -89335.4    23861.3  -3.744  0.00022 ***
month^7          -209935.3    23896.4  -8.785  < 2e-16 ***
month^8            79257.7    23891.7   3.317  0.00103 ** 
month^9            42418.4    23880.3   1.776  0.07678 .  
month^10           73968.0    23959.0   3.087  0.00222 ** 
month^11          -16984.8    24036.0  -0.707  0.48038    
structural_break  164703.4    24373.3   6.758 8.22e-11 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 117900 on 278 degrees of freedom
Multiple R-squared:  0.9409,    Adjusted R-squared:  0.9381 
F-statistic: 340.4 on 13 and 278 DF,  p-value: < 2.2e-16

As you can see from the model summary, adding the structural break feature further improved the model’s goodness of fit.

fit_s3 <- predict(object = mds3, newdata = ts2, interval = "confidence", level = 0.95)
ts2$fit3 <- fit_s3[, 1]
p3 <- p |> 
  add_lines(x = ts2$date, y = ts2$fit3, line = list(color = "black", dash = "dash"), name = "Fitted")

p3

Let’s plot the residuals:

ts2$res3 <- ts2$y - ts2$fit3


plot_ly(x = ts2$date, y = ts2$res3, type = "scatter", mode = "markers")

Let’s generate a forecast for the next 5 years (60 months):

h <- 60
future_data <- new_data(ts2, n = h)
future_data
# A tsibble: 60 x 1 [1M]
      index
      <mth>
 1 2025 May
 2 2025 Jun
 3 2025 Jul
 4 2025 Aug
 5 2025 Sep
 6 2025 Oct
 7 2025 Nov
 8 2025 Dec
 9 2026 Jan
10 2026 Feb
# ℹ 50 more rows

We will have to build all the features we used so far:

trend_start <- max(ts2$trend) + 1
trend_end <- trend_start + h - 1
future_data$trend <- trend_start:trend_end
future_data$month <- lubridate::month(future_data$index, label = TRUE)
future_data$structural_break <- 1
future_data$date <- as.Date(future_data$index)
future_data
# A tsibble: 60 x 5 [1M]
      index trend month structural_break date      
      <mth> <int> <ord>            <dbl> <date>    
 1 2025 May   293 May                  1 2025-05-01
 2 2025 Jun   294 Jun                  1 2025-06-01
 3 2025 Jul   295 Jul                  1 2025-07-01
 4 2025 Aug   296 Aug                  1 2025-08-01
 5 2025 Sep   297 Sep                  1 2025-09-01
 6 2025 Oct   298 Oct                  1 2025-10-01
 7 2025 Nov   299 Nov                  1 2025-11-01
 8 2025 Dec   300 Dec                  1 2025-12-01
 9 2026 Jan   301 Jan                  1 2026-01-01
10 2026 Feb   302 Feb                  1 2026-02-01
# ℹ 50 more rows

Now we can use the predict function to create the forecast:

fcs1 <- predict(object = mds3, newdata = future_data, interval = "prediction", level = 0.95)

future_data$yhat <- fcs1[,1]
future_data$lower <- fcs1[,2]
future_data$upper <- fcs1[,3]
future_data
# A tsibble: 60 x 8 [1M]
      index trend month structural_break date           yhat    lower    upper
      <mth> <int> <ord>            <dbl> <date>        <dbl>    <dbl>    <dbl>
 1 2025 May   293 May                  1 2025-05-01 2134082. 1895983. 2372180.
 2 2025 Jun   294 Jun                  1 2025-06-01 2144918. 1906819. 2383017.
 3 2025 Jul   295 Jul                  1 2025-07-01 2335056. 2096957. 2573154.
 4 2025 Aug   296 Aug                  1 2025-08-01 2346698. 2108600. 2584797.
 5 2025 Sep   297 Sep                  1 2025-09-01 2131186. 1893134. 2369237.
 6 2025 Oct   298 Oct                  1 2025-10-01 2214975. 1976923. 2453027.
 7 2025 Nov   299 Nov                  1 2025-11-01 2490166. 2252114. 2728218.
 8 2025 Dec   300 Dec                  1 2025-12-01 2986358. 2748306. 3224410.
 9 2026 Jan   301 Jan                  1 2026-01-01 3280937. 3042985. 3518890.
10 2026 Feb   302 Feb                  1 2026-02-01 2953839. 2715887. 3191792.
# ℹ 50 more rows

Let’s plot the forecast:

p3 |> 
  add_ribbons(x = future_data$date, ymin = future_data$lower,
              ymax = future_data$upper, name = "95% Prediction Interval",
              line = list(color = 'rgba(7, 164, 181, 0.05)'),
              fillcolor = 'rgba(7, 164, 181, 0.2)') |>
  add_lines(x = future_data$date, 
            y = future_data$yhat, 
            name = "Forecast", 
            line = list(color = "black", dash = "dash"))

Handling Outliers

We can notice that the series has some years where the consumption is higher than in normal years. We can segment the outliers into two groups:

  • High: Jan 2025
  • Medium: Jan 2014, Jan 2018, Jan 2019, Jan 2022, Jan 2024

Let’s create the features:

ts2$outlier_high <- ifelse(ts2$date == as.Date("2025-01-01"), 1, 0)
outlier_medium_dates <- as.Date(c("2014-01-01", "2018-01-01", "2019-01-01", "2022-01-01", "2024-01-01"))
ts2$outlier_medium <- ifelse(ts2$date %in% outlier_medium_dates, 1, 0)

table(ts2$outlier_high)

  0   1 
291   1 
table(ts2$outlier_medium)

  0   1 
287   5 

Let’s train the model with the new features:

mds4 <- lm(y ~ trend + month + structural_break + outlier_high + outlier_medium, data = ts2)

summary(mds4)

Call:
lm(formula = y ~ trend + month + structural_break + outlier_high + 
    outlier_medium, data = ts2)

Residuals:
    Min      1Q  Median      3Q     Max 
-452244  -68307   -8478   70375  277148 

Coefficients:
                  Estimate Std. Error t value Pr(>|t|)    
(Intercept)      1611430.0    15218.6 105.886  < 2e-16 ***
trend               2401.6      123.5  19.447  < 2e-16 ***
month.L          -431143.8    23482.6 -18.360  < 2e-16 ***
month.Q          1074312.6    23671.1  45.385  < 2e-16 ***
month.C            57400.9    23499.8   2.443 0.015209 *  
month^4           180808.9    23260.8   7.773 1.52e-13 ***
month^5           295860.7    23021.2  12.852  < 2e-16 ***
month^6          -100247.7    22845.7  -4.388 1.63e-05 ***
month^7          -203941.0    22812.7  -8.940  < 2e-16 ***
month^8            76367.1    22786.3   3.351 0.000916 ***
month^9            43657.7    22770.0   1.917 0.056228 .  
month^10           73445.8    22844.0   3.215 0.001459 ** 
month^11          -16853.5    22917.2  -0.735 0.462716    
structural_break  161903.8    23248.4   6.964 2.42e-11 ***
outlier_high      438310.4   116133.7   3.774 0.000197 ***
outlier_medium    248928.4    57084.5   4.361 1.83e-05 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 112400 on 276 degrees of freedom
Multiple R-squared:  0.9466,    Adjusted R-squared:  0.9438 
F-statistic: 326.5 on 15 and 276 DF,  p-value: < 2.2e-16

Let’s refit the model and plot the residuals:

fit_s4 <- predict(object = mds4, newdata = ts2, interval = "confidence", level = 0.95)
ts2$fit4 <- fit_s4[, 1]
p4 <- p |> 
  add_lines(x = ts2$date, y = ts2$fit4, line = list(color = "black", dash = "dash"), name = "Fitted")

p4

ts2$res4 <- ts2$y - ts2$fit4


plot_ly(x = ts2$date, y = ts2$res4, type = "scatter", mode = "markers")

What else can be done?

  • Segment the residuals by standard deviation (see the sketch below)
  • Add an AR term to capture the remaining correlation in the residuals
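
A minimal sketch of the standard deviation segmentation (the 2-SD threshold is an assumption):

# Flag observations whose residual is more than 2 standard deviations from zero
res_sd <- sd(ts2$res4)
ts2$date[abs(ts2$res4) > 2 * res_sd]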

Simulation

Often, there are unexpected events, and we won’t know if and when they will occur in the future. We saw in the series that there are years in which the consumption during the month of January is significantly higher. One approach to mitigate this risk is to use external information about future behavior, such as weather predictions. Alternatively, we can set some assumptions about the probability of those events occurring in the future and conduct a simulation.

In the following example, we will illustrate the impact of the outlier features by adding them to the forecast. Let’s first create a baseline forecast:

future_data$outlier_high <- 0
future_data$outlier_medium <- 0

fcs2 <- predict(object = mds4, newdata = future_data, interval = "prediction", level = 0.95)

future_data$yhat2 <- fcs2[,1]
future_data$lower2 <- fcs2[,2]
future_data$upper2 <- fcs2[,3]

p4 |> 
  add_ribbons(x = future_data$date, ymin = future_data$lower2,
              ymax = future_data$upper2, name = "95% Prediction Interval",
              line = list(color = 'rgba(7, 164, 181, 0.05)'),
              fillcolor = 'rgba(7, 164, 181, 0.2)') |>
  add_lines(x = future_data$date, 
            y = future_data$yhat2, 
            name = "Forecast", 
            line = list(color = "black", dash = "dash"))

Now, let’s assume that:

  • High consumption is expected in January 2029
  • Medium consumption is expected in January 2027

Let’s update the features on the future data set:

future_data$outlier_high <- ifelse(future_data$date == "2029-01-01", 1, future_data$outlier_high)
future_data$outlier_medium <- ifelse(future_data$date == "2027-01-01", 1, future_data$outlier_medium)

fcs3 <- predict(object = mds4, newdata = future_data, interval = "prediction", level = 0.95)

future_data$yhat3 <- fcs3[,1]
future_data$lower3 <- fcs3[,2]
future_data$upper3 <- fcs3[,3]

p4 |> 
  add_ribbons(x = future_data$date, ymin = future_data$lower3,
              ymax = future_data$upper3, name = "95% Prediction Interval",
              line = list(color = 'rgba(7, 164, 181, 0.05)'),
              fillcolor = 'rgba(7, 164, 181, 0.2)') |>
  add_lines(x = future_data$date, 
            y = future_data$yhat3, 
            name = "Forecast", 
            line = list(color = "black", dash = "dash"))
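
To take the simulation idea one step further, here is a minimal sketch of a Monte Carlo version, assuming (for illustration only) a 10% chance of a high-consumption January and a 30% chance of a medium one:

set.seed(12345)
n_sim <- 500
jan_idx <- which(future_data$month == "Jan")
sim_paths <- matrix(NA_real_, nrow = nrow(future_data), ncol = n_sim)
for (i in seq_len(n_sim)) {
  sim_data <- future_data
  sim_data$outlier_high <- 0
  sim_data$outlier_medium <- 0
  # Draw the January outlier events with the assumed probabilities
  sim_data$outlier_high[jan_idx] <- rbinom(length(jan_idx), 1, 0.1)
  sim_data$outlier_medium[jan_idx] <- rbinom(length(jan_idx), 1, 0.3) *
    (1 - sim_data$outlier_high[jan_idx])
  sim_paths[, i] <- predict(mds4, newdata = sim_data)
}
# Averaging the simulated paths yields an expected forecast under these assumptions
future_data$yhat_sim <- rowMeans(sim_paths)

In practice, you would also want to propagate the uncertainty of the simulated paths into the prediction intervals, not just the mean path.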