import pandas as pd
import datetime
Data Prep
This document provides the steps of loading a dataset from a CSV file and reformating it to a time series object following Nixtla format. This includes the following steps:
- Loading the raw data from a CSV file
- Reformat the data
- Filter the data by time range
Data
We will use a preload dataset of the US hourly demand for electricity sourced from the EIA API.
The header of the query from the EIA API:
{
"frequency": "hourly",
"data": [
"value"
],
"facets": {
"respondent": [
"US48"
],
"type": [
"D"
]
},
"start": null,
"end": null,
"sort": [
{
"column": "period",
"direction": "desc"
}
],
"offset": 0,
"length": 5000
}
Note: The EIA API has a 5000-observation limit per call. Therefore, if you wish to pull the full series history, you will have to send multiple GET requests by iterate over the date range.
Required Libraries
Throughout this notebook, we will use the pandas
library to handle the data and the datetime
library to reformat the object timestamp:
Load the Raw Data
We will load the US hourly demand for electricity from a flat file using the pandas’ read_csv
function:
= pd.read_csv("data/us48.csv")
data = data.rename(columns={"index": "period"}) data
Let’s use the head
and dtypes
methods to review the object attributes:
print(data.head())
period respondent type value value-units y
0 2019-01-01 00:00:00 US48 D 461392 megawatthours 461392.0
1 2019-01-01 01:00:00 US48 D 459577 megawatthours 459577.0
2 2019-01-01 02:00:00 US48 D 451601 megawatthours 451601.0
3 2019-01-01 03:00:00 US48 D 437803 megawatthours 437803.0
4 2019-01-01 04:00:00 US48 D 422742 megawatthours 422742.0
The raw data has the following seven columns:
- period - the series timestamp
- respondent - the series code
- respondent-name - the series name
- type - the series type code
- type-name - the series type name
- value - the series values
- value-units - the series values units
print(data.dtypes)
period object
respondent object
type object
value int64
value-units object
y float64
dtype: object
As you can notice from the above output, the series value is set as a float64
(e.g., numeric), however, the series timestamp - period
is a string. Therefore, we will have to reformat it into a datetime
object using the to_datetime
function:
"period"] = pd.to_datetime(data["period"]) data[
We can validate the object format after the conversion of the period
column:
print(data.dtypes)
period datetime64[ns]
respondent object
type object
value int64
value-units object
y float64
dtype: object
Reformat the DataFrame to Nixtla Input format
Nixtla libraries inputs use the following there columns data structure:
unique_id
- the series ID, enabling the use of a time series object with multiple time seriesds
- the series timestampy
- the series values
Let’s reformat the DataFrame object by reducing unused dimensions and keeping the series timestamp and values:
= data[["period", "value"]]
ts = ts.sort_values("period")
ts = ts.rename(columns = {"period": "ds", "value": "y"})
ts "unique_id"] = "1"
ts[
ts.head() ts.tail()
ds | y | unique_id | |
---|---|---|---|
52604 | 2024-12-31 20:00:00 | 456107 | 1 |
52605 | 2024-12-31 21:00:00 | 457670 | 1 |
52606 | 2024-12-31 22:00:00 | 464025 | 1 |
52607 | 2024-12-31 23:00:00 | 473192 | 1 |
52608 | 2025-01-01 00:00:00 | 477136 | 1 |
Let’s now plot the series using the plot function from the StatsForecast
module:
from statsforecast import StatsForecast
StatsForecast.plot(ts)
By default, the plot series uses the matplotlib
library as the visualization engine. You can modify it to plotly
using the engine
argument:
= "plotly").update_layout(height=300) StatsForecast.plot(ts, engine
Subsetting the series
As we want to use later two full yearly cycles to train the models, we will subset the series to the last 25 months (leaving at least a month for testing). Let’s start by defining the end time of the series as the last full day (e.g., that has at least 24 hours) by finding the most recent timestamp floor and subtracting 1 hour.
= ts["ds"].max().floor(freq = "d") - datetime.timedelta(hours = 1)
end
= datetime.datetime(2024, 12, 1, 0, 0) - datetime.timedelta(hours = 1)
end print(end)
2024-11-30 23:00:00
Next, let’s subtract from the end
value 25 months:
= end - datetime.timedelta(hours = 24 * 30 * 25)
start print(start)
2022-11-11 23:00:00
Now, we can use the start
and end
values to subset the series:
= ts[(ts["ds"] <= end) & (ts["ds"] >= start)]
df
df.tail()
ds | y | unique_id | |
---|---|---|---|
51859 | 2024-11-30 19:00:00 | 460592 | 1 |
51860 | 2024-11-30 20:00:00 | 456492 | 1 |
51861 | 2024-11-30 21:00:00 | 455024 | 1 |
51862 | 2024-11-30 22:00:00 | 462062 | 1 |
51863 | 2024-11-30 23:00:00 | 476419 | 1 |
Reploting the subset series:
= "plotly").update_layout(height=300) StatsForecast.plot(df, engine
= df.copy()
df2
= df2[["ds", "y"]]
df2 = df2.rename(columns = {"ds": "period"})
df2 df2.head()
period | y | |
---|---|---|
33863 | 2022-11-11 23:00:00 | 456403 |
33864 | 2022-11-12 00:00:00 | 458842 |
33865 | 2022-11-12 01:00:00 | 455111 |
33866 | 2022-11-12 02:00:00 | 448035 |
33867 | 2022-11-12 03:00:00 | 438165 |
Saving the Data Back to CSV
last but not least, let’s save the series:
"data/data.csv", index = False)
df.to_csv("data/data2.csv", index = False)
df2.to_csv("data/ts_data.csv", index = False) ts.to_csv(