EIA API - US48 Demand for Electricity Backfill

The goal of this doc is to execute an initial data pull of the hourly demand for electricity in the US lower 48 (US48) from the EIA API. This includes pulling the data, running data quality checks, imputing missing values, visualizing the series, and saving the data along with a log of the run.

Load libraries

We will pull the data from the EIA API using a set of functions defined in the eia_api.py file. This includes the following functions:

  • eia_get - A function for querying data from the API; it can pull up to 5000 rows per call
  • eia_backfill - A wrapper function that batches requests to the API with the eia_get function in order to serve large requests (more than 5000 rows)
  • day_offset - A helper function that creates a vector of dates equally spaced by days
  • hour_offset - A helper function that creates a vector of dates equally spaced by hours (see the sketch below)
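To make the helpers concrete, here is a minimal sketch of what hour_offset might look like; the real implementation lives in eia_api.py, and this version is only an assumption based on the description above:

import datetime

def hour_offset_sketch(start, end, offset):
    # Return a list of datetimes from start to end (inclusive),
    # spaced offset hours apart, clipping the last step at end
    out = [start]
    while out[-1] < end:
        out.append(min(out[-1] + datetime.timedelta(hours = offset), end))
    return out

hour_offset_sketch(datetime.datetime(2019, 1, 1, 0), datetime.datetime(2019, 1, 1, 6), offset = 4)
# [datetime(2019, 1, 1, 0, 0), datetime(2019, 1, 1, 4, 0), datetime(2019, 1, 1, 6, 0)]

We load the pipeline modules: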
import eia_api
import eia_etl as etl

The eia_api.py file imports the following libraries:

  • pandas - for data processing
  • datetime - to work with dates and time objects
  • requests - to send GET requests to the EIA API

In addition, we will use the following libraries:

  • os - to load environment variables
  • json - to parse the pipeline settings file
  • numpy - to create sequences (vectors)
  • plotly - to visualize the data
  • pathlib - to set file paths
  • ydata_profiling - for data profiling
  • zoneinfo - to work with time zones
import requests
import json
import os
import datetime
from zoneinfo import ZoneInfo
import numpy as np
import pandas as pd
import plotly.graph_objects as go

Setting Parameters

Let’s start by loading the pipeline metadata:

with open("./settings/settings.json") as raw_json:
    meta_json = json.load(raw_json)

api_path = meta_json["api"]["api_path"]
facets = meta_json["api"]["facets"]

# Build the backfill window from the settings
s = meta_json["backfill"]["start"]
e = meta_json["backfill"]["end"]
start = datetime.datetime(s["year"], s["month"], s["day"], s["hour"])
end = datetime.datetime(e["year"], e["month"], e["day"], e["hour"])

offset = meta_json["backfill"]["offset"]
data_path = meta_json["data"]["data_path"]
log_path = meta_json["data"]["log_path"]
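For reference, a settings.json consistent with the keys accessed above might look like the following. The API path, facet values, offset, and file paths are illustrative assumptions; the start and end values match the log output later in this doc:

{
    "api": {
        "api_path": "electricity/rto/region-data/data/",
        "facets": {"respondent": "US48", "type": "D"}
    },
    "backfill": {
        "start": {"year": 2019, "month": 1, "day": 1, "hour": 0},
        "end": {"year": 2025, "month": 1, "day": 1, "hour": 0},
        "offset": 2000
    },
    "data": {
        "data_path": "data/us48.csv",
        "log_path": "data/us48_log.csv"
    }
}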

We will use the getenv function from the os library to pull the EIA API key, which is stored as an environment variable named EIA_API_KEY:

eia_api_key = os.getenv('EIA_API_KEY')
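A quick guard, added here for robustness rather than taken from the original pipeline, fails fast if the variable is not set:

if eia_api_key is None:
    raise ValueError("The EIA_API_KEY environment variable is not set")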

We will use the eia_metadata function to pull the series metadata. Note that to pull metadata from the API, we use the series route path as the api_path and drop the /data suffix (which is used to pull data). For example, a data endpoint such as electricity/rto/region-data/data/ would have its metadata at electricity/rto/region-data/:

metadata = eia_api.eia_metadata(api_key = eia_api_key, api_path = api_path)
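For context, eia_metadata presumably wraps a GET request along these lines; this is a simplified sketch of the EIA API v2 convention, not the exact implementation in eia_api.py:

# Querying the route path (without the /data suffix) returns the series metadata
url = "https://api.eia.gov/v2/" + api_path
response = requests.get(url, params = {"api_key": eia_api_key})
meta = response.json()["response"]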

Depending on the API path, the API returns some useful information about the series available on the path:

::: {#cell-Metadata keys .cell execution_count=6}

metadata.meta.keys()
dict_keys(['warnings', 'total', 'dateFormat', 'frequency', 'data', 'description'])

:::

Pulling the Data

By default, the API has a 5000-row limit per pull. The eia_backfill function handles larger requests by splitting them into batches and appending the results:

df = eia_api.eia_backfill(api_key = eia_api_key, 
        api_path = api_path, 
        facets = facets, 
        start = start - datetime.timedelta(hours = 1),
        end = end,
        offset = offset) 
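Under the hood, the batching presumably looks something like the following sketch. The signatures of eia_get and hour_offset here are assumptions for illustration, not the actual API of eia_api.py:

def backfill_sketch(api_key, api_path, facets, start, end, offset):
    frames = []
    # hour_offset is assumed to return the chunk boundary timestamps
    breaks = eia_api.hour_offset(start = start, end = end, offset = offset)
    for chunk_start, chunk_end in zip(breaks[:-1], breaks[1:]):
        # Each chunk must stay under the 5000-row per-call limit;
        # boundary handling is simplified here
        chunk = eia_api.eia_get(api_key = api_key, api_path = api_path,
                                facets = facets, start = chunk_start, end = chunk_end)
        frames.append(chunk.data)
    return pd.concat(frames)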

The DataFrame head:

::: {#cell-Review the data 1 .cell execution_count=8}

df.data.head(10)
|      | period              | respondent | respondent-name        | type | type-name | value  | value-units   |
|------|---------------------|------------|------------------------|------|-----------|--------|---------------|
| 2256 | 2019-01-01 00:00:00 | US48       | United States Lower 48 | D    | Demand    | 461392 | megawatthours |
| 2255 | 2019-01-01 01:00:00 | US48       | United States Lower 48 | D    | Demand    | 459577 | megawatthours |
| 2254 | 2019-01-01 02:00:00 | US48       | United States Lower 48 | D    | Demand    | 451601 | megawatthours |
| 2253 | 2019-01-01 03:00:00 | US48       | United States Lower 48 | D    | Demand    | 437803 | megawatthours |
| 2252 | 2019-01-01 04:00:00 | US48       | United States Lower 48 | D    | Demand    | 422742 | megawatthours |
| 2251 | 2019-01-01 05:00:00 | US48       | United States Lower 48 | D    | Demand    | 407689 | megawatthours |
| 2250 | 2019-01-01 06:00:00 | US48       | United States Lower 48 | D    | Demand    | 395452 | megawatthours |
| 2249 | 2019-01-01 07:00:00 | US48       | United States Lower 48 | D    | Demand    | 384688 | megawatthours |
| 2248 | 2019-01-01 08:00:00 | US48       | United States Lower 48 | D    | Demand    | 375154 | megawatthours |
| 2247 | 2019-01-01 09:00:00 | US48       | United States Lower 48 | D    | Demand    | 368486 | megawatthours |

:::

And the DataFrame tail:

::: {#cell-Review the data 2 .cell execution_count=9}

df.data.tail(10)
|   | period              | respondent | respondent-name        | type | type-name | value  | value-units   |
|---|---------------------|------------|------------------------|------|-----------|--------|---------------|
| 9 | 2024-12-31 15:00:00 | US48       | United States Lower 48 | D    | Demand    | 452809 | megawatthours |
| 8 | 2024-12-31 16:00:00 | US48       | United States Lower 48 | D    | Demand    | 455632 | megawatthours |
| 7 | 2024-12-31 17:00:00 | US48       | United States Lower 48 | D    | Demand    | 456684 | megawatthours |
| 6 | 2024-12-31 18:00:00 | US48       | United States Lower 48 | D    | Demand    | 458200 | megawatthours |
| 5 | 2024-12-31 19:00:00 | US48       | United States Lower 48 | D    | Demand    | 456795 | megawatthours |
| 4 | 2024-12-31 20:00:00 | US48       | United States Lower 48 | D    | Demand    | 456107 | megawatthours |
| 3 | 2024-12-31 21:00:00 | US48       | United States Lower 48 | D    | Demand    | 457670 | megawatthours |
| 2 | 2024-12-31 22:00:00 | US48       | United States Lower 48 | D    | Demand    | 464025 | megawatthours |
| 1 | 2024-12-31 23:00:00 | US48       | United States Lower 48 | D    | Demand    | 473192 | megawatthours |
| 0 | 2025-01-01 00:00:00 | US48       | United States Lower 48 | D    | Demand    | 477136 | megawatthours |

:::

Data Quality Checks

We will run the following data quality checks:

  • Check that the series timestamps are regular (i.e., equally spaced)
  • Check for missing values
  • Check that the start and end of the series match the request settings
  • Create a log file

Check the Series Timestamp

# Create a regular hourly index covering the full request window (inclusive)
ts_obj = pd.DataFrame(np.arange(start = start, stop = end + datetime.timedelta(hours = 1),
                                step = datetime.timedelta(hours = 1)).astype(datetime.datetime),
                      columns = ["index"])
print(len(ts_obj))
# Left-join the pulled data onto the regular index; any gap becomes a NaN row
ts_obj = ts_obj.merge(df.data, left_on = "index", right_on = "period", how = "left")
print(len(ts_obj))
52609
52609
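The length checks out: the request window spans six full years (2019 through 2024, including the leap days in 2020 and 2024), i.e., 2,192 days × 24 hours = 52,608 hours, plus one for the inclusive endpoint, for 52,609 hourly timestamps. The unchanged row count after the left join also confirms that the merge did not duplicate any periods.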

Data Validation

print(ts_obj["type"].unique())
print(ts_obj["value"].isna().sum())
['D']
0

Impute Missing Values
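The imputation strategy below uses seasonal lags: each missing hour is filled with the average of the values observed 24, 168, and 336 hours earlier (i.e., the same hour one day, one week, and two weeks before), falling back to fewer lags near the beginning of the series. In this run there are no missing values, so the loop is effectively a no-op, but it protects future refreshes of the pipeline.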

ts_obj["impute"] = np.NaN

ts_obj = ts_obj.sort_values(by = ["index"])

missing_index = ts_obj[ts_obj["value"].isnull()].index.tolist()
non_missing_index = ts_obj.index.difference(missing_index).tolist()

# Fill each missing hour from its seasonal lags: the same hour one day (24h),
# one week (168h), and two weeks (336h) earlier, averaging the available lags
for i in missing_index:
    if i > 336:
        ts_obj.loc[i, "impute"] = (ts_obj.loc[i - 336, "value"] + ts_obj.loc[i - 168, "value"] + ts_obj.loc[i - 24, "value"]) / 3
    elif i > 168:
        ts_obj.loc[i, "impute"] = (ts_obj.loc[i - 168, "value"] + ts_obj.loc[i - 24, "value"]) / 2
    elif i > 24:
        ts_obj.loc[i, "impute"] = ts_obj.loc[i - 24, "value"]
    else:
        print("There are not enough observations to impute observation:", i)

ts_obj["y"] = np.NaN

ts_obj.loc[missing_index, "y"] = ts_obj.loc[missing_index, "impute"]
ts_obj.loc[non_missing_index, "y"] = ts_obj.loc[non_missing_index, "value"]
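As a quick follow-up check (an illustrative addition, not part of the original flow), the assembled series should have no remaining gaps:

# Expect 0 once all missing values have been imputed
print(ts_obj["y"].isna().sum())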

Plot the Series

We will use Plotly to visualize the series:

::: {#cell-Plot the data .cell execution_count=13}

p = go.Figure()
p.add_trace(go.Scatter(x = ts_obj["index"], y = ts_obj["y"],
                       mode = "lines",
                       name = "Actual",
                       line = dict(color = "royalblue", width = 2)))
p.add_trace(go.Scatter(x = ts_obj["index"], y = ts_obj["impute"], name = "Imputed"))
p.update_layout(title = "US Hourly Demand for Electricity")
p.show()

:::

Reformat the Data

# Keep only the columns needed downstream
df = ts_obj[["index", "respondent", "type", "value", "value-units", "y"]]

Create Log and Check for Missing Values

::: {#cell-Set log .cell execution_count=15}

log = {
    "index": 1,
    "respondent": "US48",
    "respondent_type": "Demand",
    "time": datetime.datetime.now(),
    "start": start,
    "end": end,
    "start_act": ts_obj["period"].min(),
    "end_act": ts_obj["period"].max(),
    "start_match": ts_obj["period"].min() == start, 
    "end_match": ts_obj["period"].max() == end, 
    "n_obs": len(ts_obj),
    "na": ts_obj["value"].isna().sum(),
    "imputed": ts_obj["impute"].count(),
    "na_status": "imputed",
    "type": "backfill",
    "update": False,
    "success": False,
    "comments": "Initial data backfill"

}
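The next two checks derive the flags: update is set to True when every missing value was successfully imputed, and success requires that the actual start and end of the series match the requested window and that the series is either complete or fully imputed.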

if ts_obj["value"].isna().sum() == ts_obj["impute"].count():
    log["update"] = True

if log["end"] == log["end_act"] and log["start"] == log["start_act"] and (log["na"] == 0 or  log["update"]):
    log["success"] = True

log_file = pd.DataFrame([log])

log_file
| field           | value                      |
|-----------------|----------------------------|
| index           | 1                          |
| respondent      | US48                       |
| respondent_type | Demand                     |
| time            | 2025-01-08 12:42:35.567398 |
| start           | 2019-01-01                 |
| end             | 2025-01-01                 |
| start_act       | 2019-01-01                 |
| end_act         | 2025-01-01                 |
| start_match     | True                       |
| end_match       | True                       |
| n_obs           | 52609                      |
| na              | 0                          |
| imputed         | 0                          |
| na_status       | imputed                    |
| type            | backfill                   |
| update          | True                       |
| success         | True                       |
| comments        | Initial data backfill      |

:::

Last but not least, if the start and end of the series match the GET request settings (i.e., the success flag is True), we will save the data and the log:

if log_file.loc[0, "success"]:
    print("Save the data into CSV file")
    df.to_csv(data_path, index = False)
    log_file.loc[0, "update"] = True
    print("Save the metadata into CSV file")
    log_file.to_csv(log_path, index = False)
else:
    # Use .loc (not chained iloc indexing) so the assignment takes effect
    log_file.loc[0, "success"] = False
    log_file.loc[0, "update"] = False
Save the data into CSV file
Save the metadata into CSV file
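As an optional final step, added here for illustration rather than taken from the original pipeline, the saved artifacts can be reloaded to confirm they round-trip cleanly:

# Reload the saved series and log to verify the files were written as expected
check = pd.read_csv(data_path, parse_dates = ["index"])
print(len(check) == len(df))  # expect True: row counts match
print(pd.read_csv(log_path).loc[0, "success"])  # expect True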