Data Prep

This document provides the steps of loading a dataset from a CSV file and reformating it to a time series object following Nixtla format. This includes the following steps:

Data

We will use a preload dataset of the US hourly demand for electricity sourced from the EIA API.

The header of the query from the EIA API:

{
    "frequency": "hourly",
    "data": [
        "value"
    ],
    "facets": {
        "respondent": [
            "US48"
        ],
        "type": [
            "D"
        ]
    },
    "start": null,
    "end": null,
    "sort": [
        {
            "column": "period",
            "direction": "desc"
        }
    ],
    "offset": 0,
    "length": 5000
}

Note: The EIA API has a 5000-observation limit per call. Therefore, if you wish to pull the full series history, you will have to send multiple GET requests by iterate over the date range.

Required Libraries

Throughout this notebook, we will use the pandas library to handle the data and the datetime library to reformat the object timestamp:

import pandas as pd
import datetime

Load the Raw Data

We will load the US hourly demand for electricity from a flat file using the pandas’ read_csv function:

data = pd.read_csv("data/us48.csv")
data = data.rename(columns={"index": "period"})

Let’s use the head and dtypes methods to review the object attributes:

print(data.head())
                period respondent type   value    value-units         y
0  2019-01-01 00:00:00       US48    D  461392  megawatthours  461392.0
1  2019-01-01 01:00:00       US48    D  459577  megawatthours  459577.0
2  2019-01-01 02:00:00       US48    D  451601  megawatthours  451601.0
3  2019-01-01 03:00:00       US48    D  437803  megawatthours  437803.0
4  2019-01-01 04:00:00       US48    D  422742  megawatthours  422742.0

The raw data has the following seven columns:

  • period - the series timestamp
  • respondent - the series code
  • respondent-name - the series name
  • type - the series type code
  • type-name - the series type name
  • value - the series values
  • value-units - the series values units
print(data.dtypes)
period          object
respondent      object
type            object
value            int64
value-units     object
y              float64
dtype: object

As you can notice from the above output, the series value is set as a float64 (e.g., numeric), however, the series timestamp - period is a string. Therefore, we will have to reformat it into a datetime object using the to_datetime function:

data["period"] = pd.to_datetime(data["period"])

We can validate the object format after the conversion of the period column:

print(data.dtypes)
period         datetime64[ns]
respondent             object
type                   object
value                   int64
value-units            object
y                     float64
dtype: object

Reformat the DataFrame to Nixtla Input format

Nixtla libraries inputs use the following there columns data structure:

  • unique_id - the series ID, enabling the use of a time series object with multiple time series
  • ds - the series timestamp
  • y - the series values

Let’s reformat the DataFrame object by reducing unused dimensions and keeping the series timestamp and values:

ts = data[["period", "value"]]
ts = ts.sort_values("period")
ts = ts.rename(columns = {"period": "ds", "value": "y"})
ts["unique_id"] = "1"

ts.head()
ts.tail()
ds y unique_id
52604 2024-12-31 20:00:00 456107 1
52605 2024-12-31 21:00:00 457670 1
52606 2024-12-31 22:00:00 464025 1
52607 2024-12-31 23:00:00 473192 1
52608 2025-01-01 00:00:00 477136 1

Let’s now plot the series using the plot function from the StatsForecast module:

from statsforecast import StatsForecast

StatsForecast.plot(ts)

By default, the plot series uses the matplotlib library as the visualization engine. You can modify it to plotly using the engine argument:

StatsForecast.plot(ts, engine = "plotly").update_layout(height=300)

Subsetting the series

As we want to use later two full yearly cycles to train the models, we will subset the series to the last 25 months (leaving at least a month for testing). Let’s start by defining the end time of the series as the last full day (e.g., that has at least 24 hours) by finding the most recent timestamp floor and subtracting 1 hour.

end = ts["ds"].max().floor(freq = "d") - datetime.timedelta(hours = 1)

end = datetime.datetime(2024, 12, 1, 0, 0) - datetime.timedelta(hours = 1)
print(end)
2024-11-30 23:00:00

Next, let’s subtract from the end value 25 months:

start = end - datetime.timedelta(hours = 24 * 30 * 25)
print(start)
2022-11-11 23:00:00

Now, we can use the start and end values to subset the series:

df = ts[(ts["ds"] <= end) & (ts["ds"] >= start)]

df.tail()
ds y unique_id
51859 2024-11-30 19:00:00 460592 1
51860 2024-11-30 20:00:00 456492 1
51861 2024-11-30 21:00:00 455024 1
51862 2024-11-30 22:00:00 462062 1
51863 2024-11-30 23:00:00 476419 1

Reploting the subset series:

StatsForecast.plot(df, engine = "plotly").update_layout(height=300)
df2 = df.copy()

df2 = df2[["ds", "y"]]
df2 = df2.rename(columns = {"ds": "period"})
df2.head()
period y
33863 2022-11-11 23:00:00 456403
33864 2022-11-12 00:00:00 458842
33865 2022-11-12 01:00:00 455111
33866 2022-11-12 02:00:00 448035
33867 2022-11-12 03:00:00 438165

Saving the Data Back to CSV

last but not least, let’s save the series:

df.to_csv("data/data.csv", index = False)
df2.to_csv("data/data2.csv", index = False)
ts.to_csv("data/ts_data.csv", index = False)