EIA API - US48 Demand for Electricity Backfill

The goal of this doc is to execute an initial data pull of the hourly demand for electricity in the US (lower 48) from the EIA API. This includes the following steps: pulling the data, running data quality checks, imputing missing values, plotting the series, and saving the data and a log file.

Load libraries

We will pull the data from the EIA API using a set of functions defined in the eia_api.py file. This includes the following functions:

  • eia_get - A function for querying data from the API; it can pull up to 5000 rows per call
  • eia_backfill - A wrapper function that serves large requests (more than 5000 rows) by splitting them into batches of eia_get calls
  • day_offset - A helper function that creates a vector of dates equally spaced by days
  • hour_offset - A helper function that creates a vector of dates equally spaced by hours (a sketch follows the imports below)
import eia_api
import eia_etl as etl
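
As a rough illustration, the hour_offset helper presumably walks from the start to the end of the range in fixed hourly steps. Below is a minimal, hypothetical sketch of that logic (the actual implementation lives in eia_api.py and may differ):

import datetime

# Hypothetical sketch of the hour_offset logic, not the eia_api.py implementation
def hour_offset_sketch(start, end, offset):
    """Return datetimes from start to end (inclusive), spaced by `offset` hours."""
    out = [start]
    while out[-1] < end:
        out.append(min(out[-1] + datetime.timedelta(hours = offset), end))
    return out

hour_offset_sketch(datetime.datetime(2019, 1, 1), datetime.datetime(2019, 1, 1, 6), 2)
# [2019-01-01 00:00, 2019-01-01 02:00, 2019-01-01 04:00, 2019-01-01 06:00]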

The eia_api.py file imports the following libraries:

  • pandas - for data processing
  • datetime - to work with dates and time objects
  • requests - to send GET requests to the EIA API

In addition, we will use the following libraries:

  • os - to load environment variables
  • numpy - to create sequences (vectors)
  • plotly - to visualize the data
  • pathlib - to set file paths
  • ydata_profiling - for data profiling
import requests
import json
import os
import datetime
from zoneinfo import ZoneInfo
import numpy as np
import pandas as pd
import plotly.graph_objects as go

Setting Parameters

Let’s start by loading the pipeline metadata:

with open("./settings/settings.json") as raw_json:
    meta_json = json.load(raw_json)

api_path = meta_json["api"]["api_path"]
facets = meta_json["api"]["facets"]

s = meta_json["backfill"]["start"]
e = meta_json["backfill"]["end"]
start = datetime.datetime(s["year"], s["month"], s["day"], s["hour"])
end = datetime.datetime(e["year"], e["month"], e["day"], e["hour"])

offset = meta_json["backfill"]["offset"]
data_path = meta_json["data"]["data_path"]
log_path = meta_json["data"]["log_path"]
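
For reference, the loaded meta_json object is expected to follow roughly the structure below, based on the keys accessed above. The values shown are illustrative assumptions (including the api_path, which is assumed to be the EIA v2 hourly demand route), not the contents of the actual settings.json:

# Illustrative structure of settings.json after json.load (assumed values)
meta_json_example = {
    "api": {
        "api_path": "electricity/rto/region-data/data/",  # assumed route
        "facets": {"respondent": "US48", "type": "D"}
    },
    "backfill": {
        "start": {"year": 2019, "month": 1, "day": 1, "hour": 0},
        "end": {"year": 2024, "month": 12, "day": 1, "hour": 0},
        "offset": 2250  # hours per batch request (hypothetical value)
    },
    "data": {
        "data_path": "data/us48.csv",   # hypothetical paths
        "log_path": "data/us48_log.csv"
    }
}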

We will use the getenv function from the os library to pull the EIA API key, which is stored as an environment variable named EIA_API_KEY:

eia_api_key = os.getenv('EIA_API_KEY')
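
Note that getenv returns None when the variable is not set; an explicit check (an optional addition) makes that failure mode obvious before the first API call:

# Optional sanity check: fail fast if the key is missing
if eia_api_key is None:
    raise ValueError("The EIA_API_KEY environment variable is not set")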

We will use the eia_metadata function to pull the series metadata. Note that to pull metadata from the API, we will use the series route path as the api_path and drop the /data suffix (which is used to pull data):

metadata = eia_api.eia_metadata(api_key = eia_api_key, api_path = api_path)

Depending on the API path, the API returns some useful information about the series available on the path:

::: {#cell-Metadata keys .cell execution_count=6}

metadata.meta.keys()
dict_keys(['warnings', 'total', 'dateFormat', 'frequency', 'data', 'description'])

:::
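
For example, the series description and frequency settings can be read directly from this dictionary:

print(metadata.meta["description"])
print(metadata.meta["frequency"])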

Pulling the Data

By default, the API limits each pull to 5000 rows. The eia_backfill function handles larger requests by splitting them into batches and appending the results:

df = eia_api.eia_backfill(api_key = eia_api_key, 
        api_path = api_path, 
        facets = facets, 
        start = start,
        end = end,
        offset = offset) 
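
Under the hood, the batching presumably follows the pattern sketched below: split the request window into offset-hour chunks and call eia_get on each. This is a hypothetical sketch that assumes eia_get accepts start and end arguments; the actual implementation is in eia_api.py:

# Hypothetical sketch of the batching pattern, not the eia_api.py implementation
def backfill_sketch(api_key, api_path, facets, start, end, offset):
    frames = []
    batch_start = start
    while batch_start < end:
        batch_end = min(batch_start + datetime.timedelta(hours = offset), end)
        batch = eia_api.eia_get(api_key = api_key, api_path = api_path,
                                facets = facets, start = batch_start, end = batch_end)
        frames.append(batch.data)
        batch_start = batch_end
    return pd.concat(frames)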

The DataFrame head:

::: {#cell-Review the data 1 .cell execution_count=8}

df.data.head(10)
period respondent respondent-name type type-name value value-units
2255 2019-01-01 01:00:00 US48 United States Lower 48 D Demand 459577 megawatthours
2254 2019-01-01 02:00:00 US48 United States Lower 48 D Demand 451601 megawatthours
2253 2019-01-01 03:00:00 US48 United States Lower 48 D Demand 437803 megawatthours
2252 2019-01-01 04:00:00 US48 United States Lower 48 D Demand 422742 megawatthours
2251 2019-01-01 05:00:00 US48 United States Lower 48 D Demand 407689 megawatthours
2250 2019-01-01 06:00:00 US48 United States Lower 48 D Demand 395452 megawatthours
2249 2019-01-01 07:00:00 US48 United States Lower 48 D Demand 384688 megawatthours
2248 2019-01-01 08:00:00 US48 United States Lower 48 D Demand 375154 megawatthours
2247 2019-01-01 09:00:00 US48 United States Lower 48 D Demand 368486 megawatthours
2246 2019-01-01 10:00:00 US48 United States Lower 48 D Demand 365478 megawatthours

:::

And the DataFrame tail:

::: {#cell-Review the data 2 .cell execution_count=9}

df.data.tail(10)
period respondent respondent-name type type-name value value-units
9 2024-11-30 15:00:00 US48 United States Lower 48 D Demand 478477 megawatthours
8 2024-11-30 16:00:00 US48 United States Lower 48 D Demand 475688 megawatthours
7 2024-11-30 17:00:00 US48 United States Lower 48 D Demand 470691 megawatthours
6 2024-11-30 18:00:00 US48 United States Lower 48 D Demand 463602 megawatthours
5 2024-11-30 19:00:00 US48 United States Lower 48 D Demand 457759 megawatthours
4 2024-11-30 20:00:00 US48 United States Lower 48 D Demand 453658 megawatthours
3 2024-11-30 21:00:00 US48 United States Lower 48 D Demand 452453 megawatthours
2 2024-11-30 22:00:00 US48 United States Lower 48 D Demand 460692 megawatthours
1 2024-11-30 23:00:00 US48 United States Lower 48 D Demand 475827 megawatthours
0 2024-12-01 00:00:00 US48 United States Lower 48 D Demand 487827 megawatthours

:::

Data Quality Checks

We will run the following data quality checks:

  • Check that the series timestamps are regular (i.e., equally spaced)
  • Check for missing values
  • Check that the start and end of the series match the request settings
  • Create a log file

Check the Series Timestamp

# Create a regular hourly index from start to end, and left-join the series to it;
# if the timestamps are regular, the row count should not change after the merge
ts_obj = pd.DataFrame(
    np.arange(
        start = start,
        stop = end + datetime.timedelta(hours = 1),
        step = datetime.timedelta(hours = 1)
    ).astype(datetime.datetime),
    columns = ["index"]
)
print(len(ts_obj))
ts_obj = ts_obj.merge(df.data, left_on = "index", right_on = "period", how = "left")
print(len(ts_obj))
51865
51865
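
Since the left join onto a complete hourly index kept the row count unchanged, the series timestamps are regular. An explicit assertion of this (an optional addition) could be:

# Optional explicit checks for a regular hourly series
assert len(ts_obj) == int((end - start) / datetime.timedelta(hours = 1)) + 1
assert ts_obj["index"].is_monotonic_increasing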

Data Validation

print(ts_obj["type"].unique())
print(ts_obj["value"].isna().sum())
[nan 'D']
1
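
There is a single missing value. To see which timestamp it falls on:

print(ts_obj.loc[ts_obj["value"].isna(), "index"])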

Impute Missing Values

We will impute each missing value with the average of the observations 24 hours (one day), 168 hours (one week), and 336 hours (two weeks) before it, leveraging the series' daily and weekly seasonality. When there is not enough history for all three lags, we fall back to the available ones:

ts_obj["impute"] = np.nan

ts_obj = ts_obj.sort_values(by = ["index"])

missing_index = ts_obj[ts_obj["value"].isnull()].index.tolist()
non_missing_index = ts_obj.index.difference(missing_index).tolist()

for i in missing_index:
    if i > 336:
        # Average of the values one day, one week, and two weeks back
        ts_obj.loc[i, "impute"] = (ts_obj.loc[i - 336, "value"] + ts_obj.loc[i - 168, "value"] + ts_obj.loc[i - 24, "value"]) / 3
    elif i > 168:
        # Not enough history for the two-week lag; average one week and one day back
        ts_obj.loc[i, "impute"] = (ts_obj.loc[i - 168, "value"] + ts_obj.loc[i - 24, "value"]) / 2
    elif i > 24:
        # Only the one-day lag is available
        ts_obj.loc[i, "impute"] = ts_obj.loc[i - 24, "value"]
    else:
        print("There are not enough observations to impute observation:", i)

ts_obj["y"] = np.NaN

ts_obj.loc[missing_index, "y"] = ts_obj.loc[missing_index, "impute"]
ts_obj.loc[non_missing_index, "y"] = ts_obj.loc[non_missing_index, "value"]
There are not enough observations to impute observation: 0
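
The first observation of the series has no history to impute from, which matches the message above. A quick check that y is otherwise complete:

print(ts_obj["y"].isna().sum())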

Plot the Series

We will use Plotly to visualize the series:

::: {#cell-Plot the data .cell execution_count=13}

p = go.Figure()
p.add_trace(go.Scatter(x = ts_obj["index"], y = ts_obj["y"],
                       mode = 'lines',
                       name = 'Actual',
                       line = dict(color = 'royalblue', width = 2)))

p.add_trace(go.Scatter(x = ts_obj["index"], y = ts_obj["impute"], name = "Imputed"))
p.update_layout(title = "US Hourly Demand for Electricity")
p.show()

:::

Reformat the Data

We will keep the regular index, the series attributes, and the final y variable (the actual values with imputations filled in). Taking an explicit copy avoids pandas' view-versus-copy ambiguity downstream:

df = ts_obj[["index", "respondent", "type", "value", "value-units", "y"]].copy()

Create a Log and Check for Missing Values

::: {#cell-Set log .cell execution_count=15}

log = {
    "index": 1,
    "respondent": "US48",
    "respondent_type": "Demand",
    "time": datetime.datetime.now(),
    "start": start,
    "end": end,
    "start_act": ts_obj["period"].min(),
    "end_act": ts_obj["period"].max(),
    "start_match": ts_obj["period"].min() == start, 
    "end_match": ts_obj["period"].max() == end, 
    "n_obs": len(ts_obj),
    "na": ts_obj["value"].isna().sum(),
    "imputed": ts_obj["impute"].count(),
    "na_status": "imputed",
    "type": "backfill",
    "update": False,
    "success": False,
    "comments": "Initial data backfill"

}

if ts_obj["value"].isna().sum() == ts_obj["impute"].count():
    log["update"] = True

if log["end"] == log["end_act"] and log["start"] == log["start_act"] and (log["na"] == 0 or  log["update"]):
    log["success"] = True

log_file = pd.DataFrame([log])

log_file
index respondent respondent_type time start end start_act end_act start_match end_match n_obs na imputed na_status type update success comments
0 1 US48 Demand 2024-12-09 17:24:30.844626 2019-01-01 2024-12-01 2019-01-01 01:00:00 2024-12-01 False True 51865 1 0 imputed backfill False False Initial data backfill

:::

Last but not least, we will save the data and the log file if the success flag is set (i.e., the start and end of the series match the GET request settings and any missing values were imputed):

if log_file.loc[0, "success"]:
    print("Save the data into CSV file")
    df.to_csv(data_path, index = False)
    log_file.loc[0, "update"] = True
    print("Save the metadata into CSV file")
    log_file.to_csv(log_path, index = False)

    
else:
    log_file.loc[0, "success"] = False
    log_file.loc[0, "update"] = False