EIA API - US48 Demand for Electricity Backfill
The goal of this doc is to execute an initial data pull (backfill) of the hourly demand for electricity in the US lower 48 from the EIA API. This includes the following functionality:
- Setting parameters and pulling the data
- Data quality checks
- Saving the data and creating a log file
- Data profiling
Load libraries
We will pull the data from the EIA API using a set of functions from the eia_api.py file. This includes the following functions:
- eia_get - queries data from the API; it can pull up to 5000 rows per call
- eia_backfill - a wrapper function that serves large requests (more than 5000 rows) by sending batched eia_get requests to the API and appending the results
- day_offset - a helper function that creates a vector of dates equally spaced by days
- hour_offset - a helper function that creates a vector of dates equally spaced by hours (sketched below)
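The actual implementations of these helpers live in eia_api.py and are not shown in this doc. As a rough illustration only, an hour_offset-style helper might look like the following sketch (hour_offset_sketch is a hypothetical name, not the function in eia_api.py):

```python
import datetime

def hour_offset_sketch(start, end, offset):
    """Illustrative sketch: return datetimes from start to end (inclusive),
    equally spaced by `offset` hours, e.g., batch boundaries for a backfill."""
    out, current = [], start
    while current <= end:
        out.append(current)
        current += datetime.timedelta(hours=offset)
    return out

# Example: daily boundaries across one week
# hour_offset_sketch(datetime.datetime(2019, 1, 1), datetime.datetime(2019, 1, 8), 24)
```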
The eia_api.py file imports the following libraries:
- pandas - for data processing
- datetime - to work with dates and time objects
- requests - to send GET requests to the EIA API
In addition, we will use the following libraries:
- os - to load environment variables
- numpy - to create sequences (vectors)
- plotly - to visualize the data
- pathlib - to set the file paths
- ydata_profiling - for data profiling
import eia_api
import eia_etl as etl
import requests
import json
import os
import datetime
from zoneinfo import ZoneInfo
import numpy as np
import pandas as pd
import plotly.graph_objects as go
Setting Parameters
Let’s start by loading the pipeline metadata:
= open("./settings/settings.json")
raw_json = json.load(raw_json)
meta_json = meta_json["api"]["api_path"]
api_path = meta_json["api"]["facets"]
facets
= meta_json["backfill"]["start"]
s = meta_json["backfill"]["end"]
e = datetime.datetime(s["year"], s["month"], s["day"], s["hour"])
start = datetime.datetime(e["year"], e["month"], e["day"], e["hour"])
end
= meta_json["backfill"]["offset"]
offset = meta_json["data"]["data_path"]
data_path = meta_json["data"]["log_path"] log_path
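The exact contents of settings.json are not shown in this doc. The structure implied by the fields read above might look roughly like the following (all values here are placeholders, not the pipeline's actual settings):

```python
# Illustrative structure only -- the actual settings.json ships with the pipeline
# and its API route, facets, backfill window, offset, and file paths may differ.
settings_example = {
    "api": {
        "api_path": "electricity/rto/region-data/data",   # placeholder series route
        "facets": {"respondent": "US48", "type": "D"}      # placeholder filter values
    },
    "backfill": {
        "start": {"year": 2019, "month": 1, "day": 1, "hour": 0},
        "end": {"year": 2024, "month": 12, "day": 1, "hour": 0},
        "offset": 2250   # batch size per request (units depend on eia_backfill's implementation)
    },
    "data": {
        "data_path": "./data/us48.csv",
        "log_path": "./data/us48_metadata.csv"
    }
}
```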
We will use the getenv function from the os library to pull the EIA API key, which is stored as an environment variable named EIA_API_KEY:
eia_api_key = os.getenv('EIA_API_KEY')
We will use the eia_metadata function to pull the series metadata. Note that to pull metadata from the API, we use the series route path as the api_path and drop the /data extension (which is used to pull data):
metadata = eia_api.eia_metadata(api_key = eia_api_key, api_path = api_path)
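For example (hypothetical paths only; the real value is read from settings.json), the data route and its metadata counterpart differ only by the /data suffix:

```python
# Hypothetical example paths -- the actual api_path comes from settings.json
data_route     = "electricity/rto/region-data/data"   # route used to pull observations
metadata_route = "electricity/rto/region-data"        # same route with /data dropped, used for metadata
```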
Depending on the API path, the API returns some useful information about the series available on the path:
::: {#cell-Metadata keys .cell execution_count=6}
metadata.meta.keys()
dict_keys(['warnings', 'total', 'dateFormat', 'frequency', 'data', 'description'])
:::
Pulling the Data
By default, the API limits each request to 5000 rows. The eia_backfill function handles larger requests by pulling the data in batches and appending the results:
df = eia_api.eia_backfill(api_key = eia_api_key,
                          api_path = api_path,
                          facets = facets,
                          start = start,
                          end = end,
                          offset = offset)
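Conceptually, the batching splits the requested window into smaller chunks, pulls each chunk, and appends the results. Below is a minimal sketch of that idea, not the actual eia_backfill implementation (backfill_sketch and fetch_window are hypothetical; fetch_window stands in for a single eia_get-style request):

```python
import pandas as pd

def backfill_sketch(boundaries, fetch_window):
    """Illustrative only: given a list of equally spaced datetimes (e.g., from an
    hour_offset-style helper), pull each [boundaries[i], boundaries[i+1]) window
    with fetch_window and append the partial results into one DataFrame."""
    frames = [fetch_window(b0, b1) for b0, b1 in zip(boundaries[:-1], boundaries[1:])]
    return pd.concat(frames).reset_index(drop=True)
```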
The DataFrame head:
::: {#cell-Review the data 1 .cell execution_count=8}
df.data.head(10)
period | respondent | respondent-name | type | type-name | value | value-units | |
---|---|---|---|---|---|---|---|
2255 | 2019-01-01 01:00:00 | US48 | United States Lower 48 | D | Demand | 459577 | megawatthours |
2254 | 2019-01-01 02:00:00 | US48 | United States Lower 48 | D | Demand | 451601 | megawatthours |
2253 | 2019-01-01 03:00:00 | US48 | United States Lower 48 | D | Demand | 437803 | megawatthours |
2252 | 2019-01-01 04:00:00 | US48 | United States Lower 48 | D | Demand | 422742 | megawatthours |
2251 | 2019-01-01 05:00:00 | US48 | United States Lower 48 | D | Demand | 407689 | megawatthours |
2250 | 2019-01-01 06:00:00 | US48 | United States Lower 48 | D | Demand | 395452 | megawatthours |
2249 | 2019-01-01 07:00:00 | US48 | United States Lower 48 | D | Demand | 384688 | megawatthours |
2248 | 2019-01-01 08:00:00 | US48 | United States Lower 48 | D | Demand | 375154 | megawatthours |
2247 | 2019-01-01 09:00:00 | US48 | United States Lower 48 | D | Demand | 368486 | megawatthours |
2246 | 2019-01-01 10:00:00 | US48 | United States Lower 48 | D | Demand | 365478 | megawatthours |
:::
And the DataFrame tail:
::: {#cell-Review the data 2 .cell execution_count=9}
df.data.tail(10)
period | respondent | respondent-name | type | type-name | value | value-units | |
---|---|---|---|---|---|---|---|
9 | 2024-11-30 15:00:00 | US48 | United States Lower 48 | D | Demand | 478477 | megawatthours |
8 | 2024-11-30 16:00:00 | US48 | United States Lower 48 | D | Demand | 475688 | megawatthours |
7 | 2024-11-30 17:00:00 | US48 | United States Lower 48 | D | Demand | 470691 | megawatthours |
6 | 2024-11-30 18:00:00 | US48 | United States Lower 48 | D | Demand | 463602 | megawatthours |
5 | 2024-11-30 19:00:00 | US48 | United States Lower 48 | D | Demand | 457759 | megawatthours |
4 | 2024-11-30 20:00:00 | US48 | United States Lower 48 | D | Demand | 453658 | megawatthours |
3 | 2024-11-30 21:00:00 | US48 | United States Lower 48 | D | Demand | 452453 | megawatthours |
2 | 2024-11-30 22:00:00 | US48 | United States Lower 48 | D | Demand | 460692 | megawatthours |
1 | 2024-11-30 23:00:00 | US48 | United States Lower 48 | D | Demand | 475827 | megawatthours |
0 | 2024-12-01 00:00:00 | US48 | United States Lower 48 | D | Demand | 487827 | megawatthours |
:::
Data Quality Checks
We will run the following data quality checks:
- Check that the series timestamps are regular (i.e., equally spaced)
- Check for missing values
- Check that the start and end of the series match the request settings
- Create a log file
Check the Series Timestamp
# Create a regular hourly index that spans the full request window
ts_obj = pd.DataFrame(np.arange(start = start, stop = end + datetime.timedelta(hours = 1), step = datetime.timedelta(hours = 1)).astype(datetime.datetime), columns=["index"])
print(len(ts_obj))

# Left-join the pulled data onto the regular index; the row count should not change
ts_obj = ts_obj.merge(df.data, left_on = "index", right_on = "period", how="left")
print(len(ts_obj))
51865
51865
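As an extra sanity check, not part of the original pipeline and shown here only as an illustration, the hourly spacing can also be verified directly from the timestamp differences:

```python
# Illustrative check: every consecutive pair of timestamps should be one hour apart
idx = pd.to_datetime(ts_obj["index"])
gaps = idx.diff().dropna()
print(gaps.value_counts())                     # expect a single one-hour gap value
print((gaps == pd.Timedelta(hours=1)).all())   # True when the series is regular
```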
Data validation
print(ts_obj["type"].unique())
print(ts_obj["value"].isna().sum())
[nan 'D']
1
Impute Missing Values
"impute"] = np.NaN
ts_obj[
= ts_obj.sort_values(by = ["index"])
ts_obj
= ts_obj[ts_obj["value"].isnull()].index.tolist()
missing_index = ts_obj.index.difference(missing_index).tolist()
non_missing_index
for i in missing_index:
if i > 336:
"impute"] = (ts_obj.loc[i - 336 ,"value"] + ts_obj.loc[i - 168 ,"value"] + ts_obj.loc[i - 24 ,"value"]) / 3
ts_obj.loc[i ,elif i > 168:
"impute"] = (ts_obj.loc[i - 168 ,"value"] + ts_obj.loc[i - 24 ,"value"]) / 2
ts_obj.loc[i ,elif i > 24:
"impute"] = ts_obj.loc[i - 24 ,"value"]
ts_obj.loc[i ,else:
print("There are not enough observations to impute observation:", i)
"y"] = np.NaN
ts_obj[
"y"] = ts_obj.loc[missing_index, "impute"]
ts_obj.loc[missing_index, "y"] = ts_obj.loc[non_missing_index, "value"] ts_obj.loc[non_missing_index,
There are not enough observations to impute observation: 0
Plot the Series
We will use Plotly to visualize the series:
::: {#cell-Plot the data .cell execution_count=13}
p = go.Figure()
p.add_trace(go.Scatter(x = ts_obj["index"], y = ts_obj["y"],
                       mode='lines',
                       name='Actual',
                       line=dict(color='royalblue', width=2)))

p.add_trace(go.Scatter(x = ts_obj["index"], y = ts_obj["impute"], name = "Imputed"))
p.update_layout(title = "US Hourly Demand for Electricity")
p.show()
:::
Reformat the Data
= ts_obj[["index", "respondent", "type", "value", "value-units", "y"]] df
Create Log and Check for Missing values
::: {#cell-Set log .cell execution_count=15}
log = {
    "index": 1,
    "respondent": "US48",
    "respondent_type": "Demand",
    "time": datetime.datetime.now(),
    "start": start,
    "end": end,
    "start_act": ts_obj["period"].min(),
    "end_act": ts_obj["period"].max(),
    "start_match": ts_obj["period"].min() == start,
    "end_match": ts_obj["period"].max() == end,
    "n_obs": len(ts_obj),
    "na": ts_obj["value"].isna().sum(),
    "imputed": ts_obj["impute"].count(),
    "na_status": "imputed",
    "type": "backfill",
    "update": False,
    "success": False,
    "comments": "Initial data backfill"
}

if ts_obj["value"].isna().sum() == ts_obj["impute"].count():
    log["update"] = True

if log["end"] == log["end_act"] and log["start"] == log["start_act"] and (log["na"] == 0 or log["update"]):
    log["success"] = True

log_file = pd.DataFrame([log])

log_file
index | respondent | respondent_type | time | start | end | start_act | end_act | start_match | end_match | n_obs | na | imputed | na_status | type | update | success | comments | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 1 | US48 | Demand | 2024-12-09 17:24:30.844626 | 2019-01-01 | 2024-12-01 | 2019-01-01 01:00:00 | 2024-12-01 | False | True | 51865 | 1 | 0 | imputed | backfill | False | False | Initial data backfill |
:::
Last but not least, we will check if the start and end of the series match the GET request settings and save the data and log:
if log_file.loc[0, "success"]:
print("Save the data into CSV file")
= False)
df.to_csv(data_path, index 0, "update"] = True
log_file.loc[print("Save the metadata into CSV file")
= False)
log_file.to_csv(log_path, index
else:
0]["success"] = False
log_file.iloc[0]["update"] = False log_file.iloc[
/tmp/ipykernel_20783/329208777.py:10: SettingWithCopyWarning:
A value is trying to be set on a copy of a slice from a DataFrame
See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
/tmp/ipykernel_20783/329208777.py:11: SettingWithCopyWarning:
A value is trying to be set on a copy of a slice from a DataFrame
See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
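The SettingWithCopyWarning above is raised by the chained indexing in the else branch (log_file.iloc[0]["success"] = ...), which assigns into a temporary copy and therefore does not actually update log_file. A direct .loc assignment avoids both the warning and the silent no-op:

```python
# Equivalent else branch without chained indexing (assigns on the DataFrame itself)
log_file.loc[0, "success"] = False
log_file.loc[0, "update"] = False
```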