EIA API - US48 Demand for Electricity Backfill
The goal of this doc is to execute an initial data pull of the demand for electricity in the US (lower 48) from the EIA API. This includes the following functionality:
- Setting parameters and pulling the data
- Data quality checks
- Saving the data and creating a log file
- Data profiling
Load libraries
We will pull the data from the EIA API using a set of functions in the eia_api.py file. This includes the following functions:
- eia_get - A function for querying data from the API; it can pull up to 5,000 rows per call
- eia_backfill - A wrapper function that batches requests to the API with the eia_get function to serve large requests (more than 5,000 rows)
- day_offset - A helper function that creates a vector of dates equally spaced by days
- hour_offset - A helper function that creates a vector of dates equally spaced by hours (a rough sketch of this helper is shown below)
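For illustration, a minimal sketch of what the hour_offset helper might look like, assuming it simply walks from start to end in steps of offset hours (the actual implementation lives in eia_api.py and may differ):
```python
import datetime

def hour_offset_sketch(start, end, offset):
    # Hypothetical sketch of hour_offset: build a vector of timestamps between
    # start and end, equally spaced by `offset` hours (the last step is capped at end)
    out = [start]
    current = start
    while current < end:
        current = min(current + datetime.timedelta(hours = offset), end)
        out.append(current)
    return out

# Example: hour_offset_sketch(datetime.datetime(2019, 1, 1, 0), datetime.datetime(2019, 1, 2, 0), 6)
# returns the five timestamps 00:00, 06:00, 12:00, 18:00, and 00:00 of the next day
```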
The eia_api.py file imports the following libraries:
- pandas - for data processing
- datetime - to work with date and time objects
- requests - to send GET requests to the EIA API
In addition, we will use the following libraries:
- os - to load environment variables
- numpy - to create sequences (vectors)
- plotly - to visualize the data
- pathlib - to set file paths
- ydata_profiling - for data profiling
import eia_api
import eia_etl as etl
import requests
import json
import os
import datetime
from zoneinfo import ZoneInfo
import numpy as np
import pandas as pd
import plotly.graph_objects as go
Setting Parameters
Let’s start by loading the pipeline metadata:
raw_json = open("./settings/settings.json")
meta_json = json.load(raw_json)
api_path = meta_json["api"]["api_path"]
facets = meta_json["api"]["facets"]
s = meta_json["backfill"]["start"]
e = meta_json["backfill"]["end"]
start = datetime.datetime(s["year"], s["month"], s["day"], s["hour"])
end = datetime.datetime(e["year"], e["month"], e["day"], e["hour"])
offset = meta_json["backfill"]["offset"]
data_path = meta_json["data"]["data_path"]
log_path = meta_json["data"]["log_path"]
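For reference, the structure the code above expects from settings.json, shown here as a Python dict. All values are illustrative placeholders (the real route path, offset, and file paths live in the settings file):
```python
# Illustrative structure only; every value below is a placeholder, not the actual setting
settings_example = {
    "api": {
        "api_path": "<EIA series route path>/data",     # route of the hourly demand series
        "facets": {"respondent": "US48", "type": "D"}   # filters applied to the request
    },
    "backfill": {
        "start": {"year": 2019, "month": 1, "day": 1, "hour": 0},
        "end": {"year": 2025, "month": 1, "day": 1, "hour": 0},
        "offset": 2000                                  # batch window (hours) used by eia_backfill
    },
    "data": {
        "data_path": "<path to the output CSV>",
        "log_path": "<path to the log CSV>"
    }
}
```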
We will use the getenv function from the os library to pull the EIA API key, which is stored as an environment variable named EIA_API_KEY:
eia_api_key = os.getenv('EIA_API_KEY')
We will use the eia_metadata function to pull the series metadata. Note that to pull metadata from the API, we use the series route path as the api_path and drop the /data extension (which is used to pull data):
metadata = eia_api.eia_metadata(api_key = eia_api_key, api_path = api_path)
Depending on the API path, the API returns some useful information about the series available on the path:
::: {#cell-Metadata keys .cell execution_count=6}
metadata.meta.keys()
dict_keys(['warnings', 'total', 'dateFormat', 'frequency', 'data', 'description'])
:::
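Individual fields can then be pulled from the metadata dictionary, for example (illustrative, output omitted):
```python
# Illustrative: inspect two of the metadata fields listed above
print(metadata.meta["description"])
print(metadata.meta["frequency"])
```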
Pulling the Data
By default, the API limits each request to 5,000 rows. The eia_backfill function handles larger requests by splitting the pull into a sequence of batch requests and appending the results.
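As a rough sketch of that batching idea (illustrative only; the eia_get call signature and return object below are assumptions, the actual implementation is in eia_api.py):
```python
import datetime
import pandas as pd
import eia_api

def backfill_sketch(api_key, api_path, facets, start, end, offset):
    # Hypothetical sketch: split the requested window into chunks of `offset` hours,
    # each small enough to stay below the 5,000-row limit, pull every chunk with
    # eia_get, and append the results into a single DataFrame
    frames = []
    chunk_start = start
    while chunk_start < end:
        chunk_end = min(chunk_start + datetime.timedelta(hours = offset), end)
        batch = eia_api.eia_get(api_key = api_key, api_path = api_path, facets = facets,
                                start = chunk_start, end = chunk_end)  # assumed signature
        frames.append(batch.data)  # assumes the returned object exposes a .data DataFrame
        chunk_start = chunk_end
    return pd.concat(frames)
```
The actual backfill call: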
df = eia_api.eia_backfill(api_key = eia_api_key,
                          api_path = api_path,
                          facets = facets,
                          start = start - datetime.timedelta(hours = 1),
                          end = end,
                          offset = offset)
The DataFrame head:
::: {#cell-Review the data 1 .cell execution_count=8}
df.data.head(10)
|      | period | respondent | respondent-name | type | type-name | value | value-units |
|---|---|---|---|---|---|---|---|
| 2256 | 2019-01-01 00:00:00 | US48 | United States Lower 48 | D | Demand | 461392 | megawatthours |
| 2255 | 2019-01-01 01:00:00 | US48 | United States Lower 48 | D | Demand | 459577 | megawatthours |
| 2254 | 2019-01-01 02:00:00 | US48 | United States Lower 48 | D | Demand | 451601 | megawatthours |
| 2253 | 2019-01-01 03:00:00 | US48 | United States Lower 48 | D | Demand | 437803 | megawatthours |
| 2252 | 2019-01-01 04:00:00 | US48 | United States Lower 48 | D | Demand | 422742 | megawatthours |
| 2251 | 2019-01-01 05:00:00 | US48 | United States Lower 48 | D | Demand | 407689 | megawatthours |
| 2250 | 2019-01-01 06:00:00 | US48 | United States Lower 48 | D | Demand | 395452 | megawatthours |
| 2249 | 2019-01-01 07:00:00 | US48 | United States Lower 48 | D | Demand | 384688 | megawatthours |
| 2248 | 2019-01-01 08:00:00 | US48 | United States Lower 48 | D | Demand | 375154 | megawatthours |
| 2247 | 2019-01-01 09:00:00 | US48 | United States Lower 48 | D | Demand | 368486 | megawatthours |
:::
And, the DataFrame tail:
::: {#cell-Review the data 2 .cell execution_count=9}
df.data.tail(10)
|      | period | respondent | respondent-name | type | type-name | value | value-units |
|---|---|---|---|---|---|---|---|
| 9 | 2024-12-31 15:00:00 | US48 | United States Lower 48 | D | Demand | 452809 | megawatthours |
| 8 | 2024-12-31 16:00:00 | US48 | United States Lower 48 | D | Demand | 455632 | megawatthours |
| 7 | 2024-12-31 17:00:00 | US48 | United States Lower 48 | D | Demand | 456684 | megawatthours |
| 6 | 2024-12-31 18:00:00 | US48 | United States Lower 48 | D | Demand | 458200 | megawatthours |
| 5 | 2024-12-31 19:00:00 | US48 | United States Lower 48 | D | Demand | 456795 | megawatthours |
| 4 | 2024-12-31 20:00:00 | US48 | United States Lower 48 | D | Demand | 456107 | megawatthours |
| 3 | 2024-12-31 21:00:00 | US48 | United States Lower 48 | D | Demand | 457670 | megawatthours |
| 2 | 2024-12-31 22:00:00 | US48 | United States Lower 48 | D | Demand | 464025 | megawatthours |
| 1 | 2024-12-31 23:00:00 | US48 | United States Lower 48 | D | Demand | 473192 | megawatthours |
| 0 | 2025-01-01 00:00:00 | US48 | United States Lower 48 | D | Demand | 477136 | megawatthours |
:::
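Before moving to the data quality checks, a quick overall summary of the pulled series can be useful (illustrative):
```python
# Illustrative: dimensions and value distribution of the pulled data
print(df.data.shape)
print(df.data["value"].describe())
```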
Data Quality checks
We will run the following data quality checks:
- Check that the series timestamp is regular (i.e., equally spaced)
- Check for missing values
- Check for match between the start and end of the series and the request settings
- Create a log file
Check the Series Timestamp
# Create a regular hourly timestamp vector that spans the full request window
ts_obj = pd.DataFrame(np.arange(start = start, stop = end + datetime.timedelta(hours = 1), step = datetime.timedelta(hours = 1)).astype(datetime.datetime), columns = ["index"])
print(len(ts_obj))
# Left-join the pulled series onto the regular index; any gap in the series shows up as NaN
ts_obj = ts_obj.merge(df.data, left_on = "index", right_on = "period", how = "left")
print(len(ts_obj))
52609
52609
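As a complementary check, the spacing between consecutive timestamps can be inspected directly (illustrative, assuming the period column is parsed as datetimes):
```python
# Illustrative: every consecutive pair of timestamps should be exactly one hour apart
gaps = df.data["period"].sort_values().diff().dropna()
print((gaps != pd.Timedelta(hours = 1)).sum())  # 0 indicates a regular hourly series
```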
Data validation
print(ts_obj["type"].unique())
print(ts_obj["value"].isna().sum())['D']
0
Impute Missing Values
ts_obj["impute"] = np.NaN
ts_obj = ts_obj.sort_values(by = ["index"])
missing_index = ts_obj[ts_obj["value"].isnull()].index.tolist()
non_missing_index = ts_obj.index.difference(missing_index).tolist()
for i in missing_index:
if i > 336:
ts_obj.loc[i ,"impute"] = (ts_obj.loc[i - 336 ,"value"] + ts_obj.loc[i - 168 ,"value"] + ts_obj.loc[i - 24 ,"value"]) / 3
elif i > 168:
ts_obj.loc[i ,"impute"] = (ts_obj.loc[i - 168 ,"value"] + ts_obj.loc[i - 24 ,"value"]) / 2
elif i > 24:
ts_obj.loc[i ,"impute"] = ts_obj.loc[i - 24 ,"value"]
else:
print("There are not enough observations to impute observation:", i)
ts_obj["y"] = np.NaN
ts_obj.loc[missing_index, "y"] = ts_obj.loc[missing_index, "impute"]
ts_obj.loc[non_missing_index, "y"] = ts_obj.loc[non_missing_index, "value"]Plot the Series
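A quick sanity check that the combined series has no remaining gaps (illustrative):
```python
# Illustrative: after imputation, the combined series should contain no missing values
print(ts_obj["y"].isna().sum())
```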
Plot the Series
We will use Plotly to visualize the series:
::: {#cell-Plot the data .cell execution_count=13}
p = go.Figure()
p.add_trace(go.Scatter(x = ts_obj["index"], y = ts_obj["y"],
mode='lines',
name='Actual',
line=dict(color='royalblue', width=2)))
p.add_trace(go.Scatter(x = ts_obj["index"], y = ts_obj["impute"], name = "Imputed"))
p.update_layout(title = "US Hourly Demand for Electricity")
p.show()
:::
Reformat the Data
df = ts_obj[["index", "respondent", "type", "value", "value-units", "y"]]
Create Log and Check for Missing values
::: {#cell-Set log .cell execution_count=15}
log = {
"index": 1,
"respondent": "US48",
"respondent_type": "Demand",
"time": datetime.datetime.now(),
"start": start,
"end": end,
"start_act": ts_obj["period"].min(),
"end_act": ts_obj["period"].max(),
"start_match": ts_obj["period"].min() == start,
"end_match": ts_obj["period"].max() == end,
"n_obs": len(ts_obj),
"na": ts_obj["value"].isna().sum(),
"imputed": ts_obj["impute"].count(),
"na_status": "imputed",
"type": "backfill",
"update": False,
"success": False,
"comments": "Initial data backfill"
}
if ts_obj["value"].isna().sum() == ts_obj["impute"].count():
log["update"] = True
if log["end"] == log["end_act"] and log["start"] == log["start_act"] and (log["na"] == 0 or log["update"]):
log["success"] = True
log_file = pd.DataFrame([log])
log_file
|   | index | respondent | respondent_type | time | start | end | start_act | end_act | start_match | end_match | n_obs | na | imputed | na_status | type | update | success | comments |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | US48 | Demand | 2025-01-08 12:42:35.567398 | 2019-01-01 | 2025-01-01 | 2019-01-01 | 2025-01-01 | True | True | 52609 | 0 | 0 | imputed | backfill | True | True | Initial data backfill |
:::
Last but not least, we will check whether the pull was successful (i.e., the start and end of the series match the GET request settings) and, if so, save the data and the log:
if log_file.loc[0, "success"]:
print("Save the data into CSV file")
df.to_csv(data_path, index = False)
log_file.loc[0, "update"] = True
print("Save the metadata into CSV file")
log_file.to_csv(log_path, index = False)
else:
log_file.iloc[0]["success"] = False
log_file.iloc[0]["update"] = FalseSave the data into CSV file
Save the metadata into CSV file
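Optionally, the saved files can be read back to confirm the write (illustrative; only meaningful when the pull was flagged as successful):
```python
# Illustrative: reload the saved CSV files and confirm their dimensions
print(pd.read_csv(data_path).shape)
print(pd.read_csv(log_path))
```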