EIA API - US48 Demand for Electricity Backfill
The goal of this doc is to execute an initial data pull of the demand for electricity in the US (lower 48) from the EIA API. This includes the following functionality:
- Setting parameters and pulling the data
- Data quality checks
- Saving the data and creating a log file
- Data profiling
Load libraries
We will pull the data from the EIA API using a set of functions in the eia_api.py file. This includes the following functions:
- eia_get - A function for querying data from the API; it can pull up to 5,000 rows per call
- eia_backfill - A wrapper function that batches requests to the API with the eia_get function to serve large requests (more than 5,000 rows)
- day_offset - A helper function that creates a vector of dates equally spaced by days
- hour_offset - A helper function that creates a vector of dates equally spaced by hours (a rough sketch of this helper is shown below)
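For illustration, a minimal sketch of what the hour_offset helper might look like, assuming it simply walks from start to end in steps of offset hours (the actual implementation lives in eia_api.py and may differ):
```python
import datetime

def hour_offset_sketch(start, end, offset):
    # Hypothetical sketch of hour_offset: build a vector of timestamps between
    # start and end, equally spaced by `offset` hours (the last step is capped at end)
    out = [start]
    current = start
    while current < end:
        current = min(current + datetime.timedelta(hours = offset), end)
        out.append(current)
    return out

# Example: hour_offset_sketch(datetime.datetime(2019, 1, 1, 0), datetime.datetime(2019, 1, 2, 0), 6)
# returns the five timestamps 00:00, 06:00, 12:00, 18:00, and 00:00 of the next day
```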
The eia_api.py file imports the following libraries:
- pandas - for data processing
- datetime - to work with date and time objects
- requests - to send GET requests to the EIA API
In addition, we will use the following libraries:
- os - to load environment variables
- numpy - to create sequences (vectors)
- plotly - to visualize the data
- pathlib - to set file paths
- ydata_profiling - for data profiling
import eia_api
import eia_etl as etl
import requests
import json
import os
import datetime
from zoneinfo import ZoneInfo
import numpy as np
import pandas as pd
import plotly.graph_objects as go
Setting Parameters
Let’s start by loading the pipeline metadata:
raw_json = open("./settings/settings.json")
meta_json = json.load(raw_json)
api_path = meta_json["api"]["api_path"]
facets = meta_json["api"]["facets"]
s = meta_json["backfill"]["start"]
e = meta_json["backfill"]["end"]
start = datetime.datetime(s["year"], s["month"], s["day"], s["hour"])
end = datetime.datetime(e["year"], e["month"], e["day"], e["hour"])
offset = meta_json["backfill"]["offset"]
data_path = meta_json["data"]["data_path"]
log_path = meta_json["data"]["log_path"]
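For reference, the structure the code above expects from settings.json, shown here as a Python dict. All values are illustrative placeholders (the real route path, offset, and file paths live in the settings file):
```python
# Illustrative structure only; every value below is a placeholder, not the actual setting
settings_example = {
    "api": {
        "api_path": "<EIA series route path>/data",     # route of the hourly demand series
        "facets": {"respondent": "US48", "type": "D"}   # filters applied to the request
    },
    "backfill": {
        "start": {"year": 2019, "month": 1, "day": 1, "hour": 0},
        "end": {"year": 2025, "month": 1, "day": 1, "hour": 0},
        "offset": 2000                                  # batch window (hours) used by eia_backfill
    },
    "data": {
        "data_path": "<path to the output CSV>",
        "log_path": "<path to the log CSV>"
    }
}
```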
We will use the getenv function from the os library to pull the EIA API key, which is stored as an environment variable named EIA_API_KEY:
eia_api_key = os.getenv('EIA_API_KEY')
We will use the eia_metadata function to pull the series metadata. Note that to pull metadata from the API, we use the series route path as the api_path and drop the /data extension (which is used to pull data):
metadata = eia_api.eia_metadata(api_key = eia_api_key, api_path = api_path)
Depending on the API path, the API returns some useful information about the series available on the path:
::: {#cell-Metadata keys .cell execution_count=6}
metadata.meta.keys()
dict_keys(['warnings', 'total', 'dateFormat', 'frequency', 'data', 'description'])
:::
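Individual fields can then be pulled from the metadata dictionary, for example (illustrative, output omitted):
```python
# Illustrative: inspect two of the metadata fields listed above
print(metadata.meta["description"])
print(metadata.meta["frequency"])
```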
Pulling the Data
By default, the API limits each request to 5,000 rows. The eia_backfill function handles larger requests by splitting the pull into a sequence of batch requests and appending the results.
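As a rough sketch of that batching idea (illustrative only; the eia_get call signature and return object below are assumptions, the actual implementation is in eia_api.py):
```python
import datetime
import pandas as pd
import eia_api

def backfill_sketch(api_key, api_path, facets, start, end, offset):
    # Hypothetical sketch: split the requested window into chunks of `offset` hours,
    # each small enough to stay below the 5,000-row limit, pull every chunk with
    # eia_get, and append the results into a single DataFrame
    frames = []
    chunk_start = start
    while chunk_start < end:
        chunk_end = min(chunk_start + datetime.timedelta(hours = offset), end)
        batch = eia_api.eia_get(api_key = api_key, api_path = api_path, facets = facets,
                                start = chunk_start, end = chunk_end)  # assumed signature
        frames.append(batch.data)  # assumes the returned object exposes a .data DataFrame
        chunk_start = chunk_end
    return pd.concat(frames)
```
The actual backfill call: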
df = eia_api.eia_backfill(api_key = eia_api_key,
                          api_path = api_path,
                          facets = facets,
                          start = start - datetime.timedelta(hours = 1),
                          end = end,
                          offset = offset)
The DataFrame head:
::: {#cell-Review the data 1 .cell execution_count=8}
df.data.head(10)
|      | period | respondent | respondent-name | type | type-name | value | value-units |
|---|---|---|---|---|---|---|---|
| 2256 | 2019-01-01 00:00:00 | US48 | United States Lower 48 | D | Demand | 461392 | megawatthours |
| 2255 | 2019-01-01 01:00:00 | US48 | United States Lower 48 | D | Demand | 459577 | megawatthours |
| 2254 | 2019-01-01 02:00:00 | US48 | United States Lower 48 | D | Demand | 451601 | megawatthours |
| 2253 | 2019-01-01 03:00:00 | US48 | United States Lower 48 | D | Demand | 437803 | megawatthours |
| 2252 | 2019-01-01 04:00:00 | US48 | United States Lower 48 | D | Demand | 422742 | megawatthours |
| 2251 | 2019-01-01 05:00:00 | US48 | United States Lower 48 | D | Demand | 407689 | megawatthours |
| 2250 | 2019-01-01 06:00:00 | US48 | United States Lower 48 | D | Demand | 395452 | megawatthours |
| 2249 | 2019-01-01 07:00:00 | US48 | United States Lower 48 | D | Demand | 384688 | megawatthours |
| 2248 | 2019-01-01 08:00:00 | US48 | United States Lower 48 | D | Demand | 375154 | megawatthours |
| 2247 | 2019-01-01 09:00:00 | US48 | United States Lower 48 | D | Demand | 368486 | megawatthours |
:::
And, the DataFrame tail:
::: {#cell-Review the data 2 .cell execution_count=9}
df.data.tail(10)
|      | period | respondent | respondent-name | type | type-name | value | value-units |
|---|---|---|---|---|---|---|---|
| 9 | 2024-12-31 15:00:00 | US48 | United States Lower 48 | D | Demand | 452809 | megawatthours |
| 8 | 2024-12-31 16:00:00 | US48 | United States Lower 48 | D | Demand | 455632 | megawatthours |
| 7 | 2024-12-31 17:00:00 | US48 | United States Lower 48 | D | Demand | 456684 | megawatthours |
| 6 | 2024-12-31 18:00:00 | US48 | United States Lower 48 | D | Demand | 458200 | megawatthours |
| 5 | 2024-12-31 19:00:00 | US48 | United States Lower 48 | D | Demand | 456795 | megawatthours |
| 4 | 2024-12-31 20:00:00 | US48 | United States Lower 48 | D | Demand | 456107 | megawatthours |
| 3 | 2024-12-31 21:00:00 | US48 | United States Lower 48 | D | Demand | 457670 | megawatthours |
| 2 | 2024-12-31 22:00:00 | US48 | United States Lower 48 | D | Demand | 464025 | megawatthours |
| 1 | 2024-12-31 23:00:00 | US48 | United States Lower 48 | D | Demand | 473192 | megawatthours |
| 0 | 2025-01-01 00:00:00 | US48 | United States Lower 48 | D | Demand | 477136 | megawatthours |
:::
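Before moving to the data quality checks, a quick overall summary of the pulled series can be useful (illustrative):
```python
# Illustrative: dimensions and value distribution of the pulled data
print(df.data.shape)
print(df.data["value"].describe())
```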
Data Quality checks
We will run the following data quality checks:
- Check that the series timestamp is regular (i.e., equally spaced)
- Check for missing values
- Check for match between the start and end of the series and the request settings
- Create a log file
Check the Series Timestamp
# Create a regular hourly timestamp vector that spans the full request window
ts_obj = pd.DataFrame(np.arange(start = start, stop = end + datetime.timedelta(hours = 1), step = datetime.timedelta(hours = 1)).astype(datetime.datetime), columns = ["index"])
print(len(ts_obj))
# Left-join the pulled series onto the regular index; any gap in the series shows up as NaN
ts_obj = ts_obj.merge(df.data, left_on = "index", right_on = "period", how = "left")
print(len(ts_obj))
52609
52609
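As a complementary check, the spacing between consecutive timestamps can be inspected directly (illustrative, assuming the period column is parsed as datetimes):
```python
# Illustrative: every consecutive pair of timestamps should be exactly one hour apart
gaps = df.data["period"].sort_values().diff().dropna()
print((gaps != pd.Timedelta(hours = 1)).sum())  # 0 indicates a regular hourly series
```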
Data validation
print(ts_obj["type"].unique())
print(ts_obj["value"].isna().sum())['D']
0
Impute Missing Values
ts_obj["impute"] = np.NaN
ts_obj = ts_obj.sort_values(by = ["index"])
missing_index = ts_obj[ts_obj["value"].isnull()].index.tolist()
non_missing_index = ts_obj.index.difference(missing_index).tolist()
for i in missing_index:
if i > 336:
ts_obj.loc[i ,"impute"] = (ts_obj.loc[i - 336 ,"value"] + ts_obj.loc[i - 168 ,"value"] + ts_obj.loc[i - 24 ,"value"]) / 3
elif i > 168:
ts_obj.loc[i ,"impute"] = (ts_obj.loc[i - 168 ,"value"] + ts_obj.loc[i - 24 ,"value"]) / 2
elif i > 24:
ts_obj.loc[i ,"impute"] = ts_obj.loc[i - 24 ,"value"]
else:
print("There are not enough observations to impute observation:", i)
ts_obj["y"] = np.NaN
ts_obj.loc[missing_index, "y"] = ts_obj.loc[missing_index, "impute"]
ts_obj.loc[non_missing_index, "y"] = ts_obj.loc[non_missing_index, "value"]Plot the Series
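A quick sanity check that the combined series has no remaining gaps (illustrative):
```python
# Illustrative: after imputation, the combined series should contain no missing values
print(ts_obj["y"].isna().sum())
```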
Plot the Series
We will use Plotly to visualize the series:
::: {#cell-Plot the data .cell execution_count=13}
p = go.Figure()
p.add_trace(go.Scatter(x = ts_obj["index"], y = ts_obj["y"],
mode='lines',
name='Actual',
line=dict(color='royalblue', width=2)))
p.add_trace(go.Scatter(x = ts_obj["index"], y = ts_obj["impute"], name = "Imputed"))
p.update_layout(title = "US Hourly Demand for Electricity")
p.show()
:::
Reformat the Data
df = ts_obj[["index", "respondent", "type", "value", "value-units", "y"]]
Create Log and Check for Missing values
::: {#cell-Set log .cell execution_count=15}
log = {
"index": 1,
"respondent": "US48",
"respondent_type": "Demand",
"time": datetime.datetime.now(),
"start": start,
"end": end,
"start_act": ts_obj["period"].min(),
"end_act": ts_obj["period"].max(),
"start_match": ts_obj["period"].min() == start,
"end_match": ts_obj["period"].max() == end,
"n_obs": len(ts_obj),
"na": ts_obj["value"].isna().sum(),
"imputed": ts_obj["impute"].count(),
"na_status": "imputed",
"type": "backfill",
"update": False,
"success": False,
"comments": "Initial data backfill"
}
if ts_obj["value"].isna().sum() == ts_obj["impute"].count():
log["update"] = True
if log["end"] == log["end_act"] and log["start"] == log["start_act"] and (log["na"] == 0 or log["update"]):
log["success"] = True
log_file = pd.DataFrame([log])
log_file
|   | index | respondent | respondent_type | time | start | end | start_act | end_act | start_match | end_match | n_obs | na | imputed | na_status | type | update | success | comments |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | US48 | Demand | 2025-01-08 12:42:35.567398 | 2019-01-01 | 2025-01-01 | 2019-01-01 | 2025-01-01 | True | True | 52609 | 0 | 0 | imputed | backfill | True | True | Initial data backfill |
:::
Last but not least, we will check whether the pull was successful (i.e., the start and end of the series match the GET request settings) and, if so, save the data and the log:
if log_file.loc[0, "success"]:
print("Save the data into CSV file")
df.to_csv(data_path, index = False)
log_file.loc[0, "update"] = True
print("Save the metadata into CSV file")
log_file.to_csv(log_path, index = False)
else:
log_file.iloc[0]["success"] = False
log_file.iloc[0]["update"] = FalseSave the data into CSV file
Save the metadata into CSV file
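Optionally, the saved files can be read back to confirm the write (illustrative; only meaningful when the pull was flagged as successful):
```python
# Illustrative: reload the saved CSV files and confirm their dimensions
print(pd.read_csv(data_path).shape)
print(pd.read_csv(log_path))
```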