EPF - Electricity Spot Price Forecasting

February 10, 2023
@mzaffran

The aim of this notebook is to introduce some basic notions and tools from machine learning in order to forecast electricity spot prices (a time series) in a supervised approach.
To do so, one has to deal with some pre-processing, data analysis, choice of a (or multiple) model(s), and the optimization of hyper-parameters.
We look for accurate forecasting models but also for models able to generalize to unobserved data.

Let us denote by $Y_t$ the price at time $t$.
Assume we have $d$ explanatory variables gathered in some $X_t$; $X_t$ contains $d$ columns, one for each variable. We assume the following model: $$ Y_t = f(X_t) + \varepsilon_t$$ where $\varepsilon_t$ is stationary.
We aim at finding the best $\hat{f}$ such that: $$ Y_t = \hat{f}(X_t) + \hat{\varepsilon}_t$$ where $\hat{\varepsilon}_t$ is stationary.

Table of Contents

  1. Data preparation
    1. Data investigation
    2. Split the data before learning anything
  2. Exploratory Data Analysis
  3. Forecasting last semester of 2019
    1. Tools
    2. Optimize hyper-parameters
    3. Evaluation
    4. Forecast last semester
  4. Adding useful features!
    1. Lagged price
    2. Lagged price w.r.t. weekdays/weekends
  5. Can we forecast 2020? 2021?
    1. 2020
    2. 2021
      1. Growing window

References

We start with some basic imports.
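For instance, a minimal set of imports (assuming the usual scientific Python stack) could be:

```python
# Libraries used throughout the notebook.
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from datetime import datetime
```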

Data preparation

Back to table of contents.

Data investigation

**Q1** Print the dimensions of the dataframe data.

**Q2** Show the first 10 and last 10 rows of data.

**Q3** Show the list of the column names.

Tips: use Pandas commands, such as shape, head, and tail
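A possible solution, assuming the dataset has already been loaded into a dataframe named data:

```python
# Q1: dimensions (rows, columns) of the dataframe.
print(data.shape)

# Q2: first and last 10 rows.
print(data.head(10))
print(data.tail(10))

# Q3: list of the column names.
print(list(data.columns))
```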

**Q4** Analyze the date column. What do you think of its format? Is it adequate?
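One way to check and fix the format, assuming the column is named date:

```python
# If the date column is stored as strings (dtype 'object'), it is not adequate
# for time-based operations: convert it to proper datetime objects.
print(data['date'].dtype)
data['date'] = pd.to_datetime(data['date'])
```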

NaN values must be handled with great care. In this practical session, we will not focus on an adequate treatment since it is not the aim. Nevertheless, we will have to impute these values! This will be done after splitting the data set: we cannot use the target / future values to impute the old ones!

**Q5** Report the number of not-a-number (NaN) values for each column.
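A one-liner does the job:

```python
# Q5: number of NaN values per column.
print(data.isna().sum())
```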

Every data analysis starts by having a look at your data. So let's go!

**Q6** Plot the French spot price over the whole period.

Here we are cheating a bit, for pedagogical purposes: we should not look at what we will try to predict!
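A possible plot, where 'SpotPrice_FR' is an assumed name for the price column (adapt it to your data):

```python
# Q6: French spot price over the whole period.
plt.figure(figsize=(15, 4))
plt.plot(data['date'], data['SpotPrice_FR'])
plt.xlabel('Date')
plt.ylabel('FR spot price (€/MWh)')
plt.show()
```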

**Q7** What do you think of our explanatory variables? What is their temporality with respect to the temporality of the target (i.e. of what we aim at predicting, the spot prices)? What do you propose? Can you think of a better approach?

Some variables are not always available at the same time. To avoid the forward-looking bias, we propose to shift some variables by 48 hours.

Shift the explanatory variables by 48h. Tips: use .shift()
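A sketch of the shift, assuming hourly data (so 48 hours correspond to 48 rows):

```python
# Shift the explanatory variables by 48 hours, so that at time t we only use
# information that was already available 48h earlier.
explanatory_cols = [c for c in data.columns if c not in ['date', 'SpotPrice_FR']]
data[explanatory_cols] = data[explanatory_cols].shift(48)
```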

**Q8** What do you think will be the important and useful information contained in the date variable? Extract it!

Tips: use date.dt
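For instance, typical calendar features extracted with the .dt accessor:

```python
# Q8: extract calendar information from the datetime column.
data['Hour'] = data['date'].dt.hour
data['Weekday'] = data['date'].dt.weekday   # 0 = Monday, ..., 6 = Sunday
data['Month'] = data['date'].dt.month
data['Year'] = data['date'].dt.year
```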

Split the data before learning anything

Back to table of contents.

We aim at learning a general relationship between input features and a label/target.
We wish to use this learnt relationship to predict on future unseen values.
We will start by re-creating this setting: we will keep some final values unused and try to predict them as if we were back at that time! This time, however, we will be able to quantify our errors.

In addition to the pedagogical objective, it can also help us regarding overfitting.
To avoid overfitting, or at least quantify it, we need to test our forecasting models on unobserved data during the training phase.
A classical approach is to split the data into a training and a testing set. The training set allows the calibration of the parameters of a model, while the testing set evaluates its generalization ability.

**Q9** Create an extract of data, _data_train_, where the data stops at 2019-06-30 23h. We keep the last semester of 2019 to predict.

_Tips: Define limit_train with datetime.strptime, the date where the training set stops, and use this time stamp to extract data_train._
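Following the tip, a possible solution:

```python
# Q9: training data stops at 2019-06-30 23h.
limit_train = datetime.strptime('2019-06-30 23:00', '%Y-%m-%d %H:%M')
data_train = data[data['date'] <= limit_train].copy()
```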

A common flaw in data science is missing values.
Sometimes, using incomplete data can lead to ineffective analysis or erratic training of models.
Data imputation is a research field in its own right, with various methods (see [1]).
An intuitive but inappropriate approach is to drop the observations containing missing values, in order to keep a complete dataset.
However, by definition, all the incomplete information is then lost. Furthermore, if the missing values depend on your features or on your response, you could lose an entire relationship.
There are many other ways to handle an incomplete dataset: imputing by the mean, by a regression on the other variables, by a stochastic regression, by k-nearest neighbors, and so on.
In addition, you can sometimes include the treatment of your missing values in your procedure, without imputing beforehand! As said, it is a very active field of research!

Here, we will impute our data and regress afterwards (not combining both tasks). This is motivated by the work of Le Morvan et al. (2021) [2].
We propose to fill the missing values with the mean, as it is simple and this practical session is not dedicated to the study of missing values.

[1] https://rmisstastic.netlify.app/
[2] Le Morvan et al. (2021)

We can now impute our data set, without using the unseen values.

**Q10** Replace NaN values by the associated variable's means of _data_train_.

The test set (not created yet; it will contain the last 6 months of 2019) shall not contain missing values either!

**Q11** Also replace the NaN values of the overall data with the same _data_train_ means.
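A sketch for both questions, using only the training means:

```python
# Q10 & Q11: impute NaNs with the column means computed on data_train only.
train_means = data_train.mean(numeric_only=True)
data_train = data_train.fillna(train_means)
data = data.fillna(train_means)   # the same means are applied to the full dataframe
```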

Exploratory Data Analysis

Back to table of contents.

Several basic tools are useful to better understand the data.
Here we propose to explore some of them.
We focus on analysing the training set, to avoid incorporating knowledge of events which have not occurred yet.
A major task in data analysis is the study of the relationship between a given feature and the target, since it is what we are trying to learn. We will try to get some insights on these relationships.

**Q12** Plot the histogram of the French spot prices. It illustrates the probability distribution of the data.

The previous histogram only displays the overall distribution. Since we added some calendar information, we can try to assess the relationship between the price and these calendar variables!

**Q13** Show the boxplot of the French spot prices by Weekday, Month and Hour.

Boxplots illustrate the central tendency (the median) and spread of a given variable.
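A possible solution for Q12 and Q13 (again, 'SpotPrice_FR' is our assumed column name):

```python
# Q12: histogram of the French spot prices (training set only).
data_train['SpotPrice_FR'].hist(bins=50)
plt.xlabel('FR spot price (€/MWh)')
plt.ylabel('Count')
plt.show()

# Q13: boxplots of the price by calendar variable.
for cal_var in ['Weekday', 'Month', 'Hour']:
    sns.boxplot(data=data_train, x=cal_var, y='SpotPrice_FR')
    plt.show()
```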

We observe patterns, in the mean but also in the variance, depending on these calendar variables.

**Q14** What is the limitation of boxplots?

Violin plots are similar to box plots but also show the estimated probability density.
See the following references to go further on violin plots:
https://en.wikipedia.org/wiki/Violin_plot
https://seaborn.pydata.org/generated/seaborn.violinplot.html
https://chartio.com/learn/charts/violin-plot-complete-guide/
https://mode.com/blog/violin-plot-examples/

**Q15** Show the violin plot of the French spot prices by Weekday, Month and Hour.
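Same layout as the boxplots, with seaborn's violinplot:

```python
# Q15: violin plots of the price by calendar variable.
for cal_var in ['Weekday', 'Month', 'Hour']:
    sns.violinplot(data=data_train, x=cal_var, y='SpotPrice_FR')
    plt.show()
```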

Sometimes, plotting the relationship between two series of data helps to better understand it, or exhibits linear or polynomial relations.

**Q16** Plot the French spot price against each variable, with the variable on the x-axis.

**Q17** What can you conclude?

Tips: use scatter plot
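A sketch looping over the explanatory columns defined earlier:

```python
# Q16: price against each explanatory variable, variable on the x-axis.
for col in explanatory_cols:
    plt.scatter(data_train[col], data_train['SpotPrice_FR'], s=1)
    plt.xlabel(col)
    plt.ylabel('FR spot price (€/MWh)')
    plt.show()
```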

**Q18** Plot the correlation matrix between all variables. What can you say?

Tips: use corr function and heatmap

Warning: correlation is not causation. This is just a hint, nothing more!
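Following the tips, one possible implementation:

```python
# Q18: correlation matrix of all numeric variables, displayed as a heatmap.
corr = data_train.corr(numeric_only=True)
plt.figure(figsize=(10, 8))
sns.heatmap(corr, cmap='coolwarm', center=0)
plt.show()
```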

Forecasting last semester of 2019

Back to table of contents.

Having analysed our data, we can proceed to the forecasting task. First, we need to create our testing set.

**Q19** Define _data_test_, stopping at 2019-12-31 23h.
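Mirroring the construction of data_train:

```python
# Q19: test set = second semester of 2019.
limit_test = datetime.strptime('2019-12-31 23:00', '%Y-%m-%d %H:%M')
data_test = data[(data['date'] > limit_train) & (data['date'] <= limit_test)].copy()
```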

Tools

Back to table of contents.

Split X and Y

Models will need to distinguish between variables/features (the X) and the response (the Y).

**Q20** Define X_train and Y_train.
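A possible split, keeping every column except the date and the target as a feature:

```python
# Q20: separate the features (X) from the target (Y).
features = [c for c in data_train.columns if c not in ['date', 'SpotPrice_FR']]
X_train = data_train[features]
Y_train = data_train['SpotPrice_FR']
```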

Metrics

We will evaluate the different forecasting models with three metrics. Many more are available!

R2: $$R^2 = 1 - \frac{\sum^T_{t=1} (y_t - \hat{y}_t)^2}{\sum^T_{t=1} (y_t - \bar{y})^2}$$

Mean Absolute Error: $$\mathrm{MAE}(y, \hat{y}) = \frac{1}{T}\sum^T_{t=1} |y_t - \hat{y}_t|$$

Mean Squared Error: $$\mathrm{MSE}(y, \hat{y}) = \frac{1}{T}\sum^T_{t=1} (y_t - \hat{y}_t)^2$$

where $y_t$ and $\hat{y}_t$ are respectively the realized values and the predictions, and $\bar{y}$ is the average of the $y_t$.

**Q21** Define a function _regression_results_ taking true and predicted values, and printing the R2, MAE and RMSE scores.
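A minimal sketch of such a function:

```python
from sklearn import metrics

def regression_results(y_true, y_pred):
    # Print the three evaluation scores (RMSE = square root of the MSE).
    print('R2:  ', round(metrics.r2_score(y_true, y_pred), 4))
    print('MAE: ', round(metrics.mean_absolute_error(y_true, y_pred), 4))
    print('RMSE:', round(np.sqrt(metrics.mean_squared_error(y_true, y_pred)), 4))
```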

Models

Linear regression
Linear regression models the label (or target) $y\in\mathbb{R}$ as a linear combination of the $d$-dimensional input $x \in \mathbb{R}^d$: $$\hat{y}(w,x)=w_0+w_1x_1+\dots+w_dx_d$$

The parameter $w=(w_0,\dots,w_d)\in \mathbb{R}^{d+1}$ is fitted by minimizing the loss function $\ell(w)$ on the data, with respect to $w$: $$\ell(w) = ||y - X w||^2_2 = \sum_{i=1}^N (y_i - x_i^T w)^2$$

For further details, see https://scikit-learn.org/stable/modules/linear_model.html

Decision tree
Decision trees learn a hierarchy of “if-else” questions, leading to a decision. For instance, is the given value (e.g. the consumption) greater than 1.3? The collection of questions can be expressed as a decision tree, where each node characterizes a question and a leaf (a terminal node) indicates an answer.
Find more details on CART trees.

The more nodes a tree has, the more precise it is, but it may lack generalization! The rules would be overfitted so as to represent perfectly the data the tree has already seen.
The depth of a tree is a hyper-parameter to be optimized, as well as the minimum number of examples per leaf.
Overall, this can be done by using the following procedure (detailed here).

Random Forests
Random Forests belong to the ensemble methods, which combine multiple models to improve the prediction.
As mentioned, a main drawback of decision trees is their tendency to overfit the training data. A random forest considers a collection of (slightly) different decision trees.
By averaging many trees, a random forest is able to preserve the prediction efficiency of trees while preventing overfitting.
Each tree is built with only a random subset of features and random observations from the training set.
There are two main hyper-parameters: the number of trees and their depth.

For further details, see https://en.wikipedia.org/wiki/Random_forest
To see examples, see https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.RandomForestRegressor.html#examples-using-sklearn-ensemble-randomforestregressor

**Q22** Import from sklearn the models: LinearRegression, DecisionTreeRegressor, RandomForestRegressor
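The three models live in different sklearn submodules:

```python
from sklearn.linear_model import LinearRegression
from sklearn.tree import DecisionTreeRegressor
from sklearn.ensemble import RandomForestRegressor
```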

How to select hyper-parameters, models? Cross-validation!

Cross-validation is a statistical method for evaluating the performance of machine learning models. When a model is trained on labeled data, it is assumed that it should also work on new data.
First, the data set is randomly split into K folds; the single parameter K refers to the number of groups into which the sample is split.
Each of the K folds serves once as the validation set, while the remaining folds form the training set, and the process is repeated K times.
The average of the recorded scores is the performance metric of the model.

[Figure: K-fold cross-validation. Source: scikit-learn (link below).]

When considering time series forecasting, some precautions have to be taken in order to preserve temporality. We do not want to learn to predict past values from future ones! That is why we consider a time series split.

[Figure: time series split preserving temporal order. Source: scikit-learn (link below).]

After training a model for several hyper-parameters, one can compare the performances of each associated prediction and identify the best parameter.

See https://scikit-learn.org/stable/modules/cross_validation.html#cross-validation-of-time-series-data

**Q23** Import TimeSeriesSplit, GridSearchCV and _cross_val_score_ from sklearn, and create a time series splitter into approximately 2-month folds.

**Q24** Create a scorer based on the MSE.

_Tips: use make_scorer from sklearn.metrics to ease the task, and think about whether you prefer a high or a low score_

Note: TimeSeriesSplit with k splits partitions the data into k+1 consecutive blocks!
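A sketch for Q23 and Q24; the value of n_splits is an assumption, chosen so that each fold covers roughly two months given our training span:

```python
from sklearn.model_selection import TimeSeriesSplit, GridSearchCV, cross_val_score
from sklearn.metrics import make_scorer, mean_squared_error

# Q23: 8 splits partition the training period into 9 consecutive blocks,
# i.e. roughly 2-month folds here (adapt n_splits to your data span).
tscv = TimeSeriesSplit(n_splits=8)

# Q24: GridSearchCV *maximizes* the score, so we take the negative MSE.
mse_scorer = make_scorer(mean_squared_error, greater_is_better=False)
```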

Find the "best" tree and forest

Back to table of contents.

**Q25** Find the optimal tree: instantiate your tree model, and compare different versions of it using various complexity parameters (say, from 0 to 0.5 with 50 equally spaced values).

Tips: use linspace from numpy to create your set of control values, and use GridSearchCV to compare the different values

We fix the seed to ensure reproducibility!

**Q26** Plot the MSE score with respect to the control parameter.
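A sketch for Q25 and Q26, where we interpret the "complexity parameter" as the cost-complexity pruning parameter ccp_alpha of sklearn trees (an assumption):

```python
# Q25: grid search over the cost-complexity pruning parameter.
alphas = np.linspace(0, 0.5, 50)
tree = DecisionTreeRegressor(random_state=42)   # fixed seed for reproducibility
grid_tree = GridSearchCV(tree, {'ccp_alpha': alphas}, scoring=mse_scorer, cv=tscv)
grid_tree.fit(X_train, Y_train)
print(grid_tree.best_params_)

# Q26: cross-validated MSE (sign flipped back) against the control parameter.
plt.plot(alphas, -grid_tree.cv_results_['mean_test_score'])
plt.xlabel('ccp_alpha')
plt.ylabel('CV MSE')
plt.show()
```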

**Q27** Optimize the Random Forest hyper-parameters, using a similar approach to the previous one. We will compare forests with 50, 100 and 200 trees, with maximal depths ranging from 10 to 30.

Tips: find which parameter of RandomForestRegressor corresponds to the number of trees, and which to the maximal depth; use arange from numpy to create a sequential list of integers between two bounds
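A possible grid; the step of 5 on the depth is our choice to limit the runtime:

```python
# Q27: grid over the number of trees (n_estimators) and the maximal depth.
param_grid = {'n_estimators': [50, 100, 200],
              'max_depth': np.arange(10, 31, 5)}
rf = RandomForestRegressor(random_state=42)
grid_rf = GridSearchCV(rf, param_grid, scoring=mse_scorer, cv=tscv, n_jobs=-1)
grid_rf.fit(X_train, Y_train)
print(grid_rf.best_params_)
```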

Evaluation of our 3 models

Back to table of contents.

**Q28** Using _cross_val_score_, print the averaged scores of each model. We have access to the variability of this score, since we have 8 evaluations. Also display their boxplots!

Tips: recall that we took the opposite of the MSE in the scorer
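A sketch comparing the three models (8 scores each, one per fold):

```python
models = {'Linear regression': LinearRegression(),
          'Tree': grid_tree.best_estimator_,
          'Forest': grid_rf.best_estimator_}

scores = {}
for name, model in models.items():
    # cross_val_score returns one (negative) MSE per fold: 8 values here.
    scores[name] = -cross_val_score(model, X_train, Y_train,
                                    scoring=mse_scorer, cv=tscv)
    print(name, scores[name].mean())

plt.boxplot(list(scores.values()), labels=list(scores.keys()))
plt.ylabel('CV MSE')
plt.show()
```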

Forecast last semester

Back to table of contents.

We can now try to forecast the last semester of 2019!

**Q29** As for the training, define X_test and Y_test.
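Mirroring Q20:

```python
# Q29: same feature/target split on the test set.
X_test = data_test[features]
Y_test = data_test['SpotPrice_FR']
```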

The values of the RMSE and the MAE alone are not enough to quantify the performance of our algorithms. We need to compare against a baseline/naive model.

**Q30** Which naive predictor do you propose? Implement it!
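One reasonable choice, among others, is a persistence model using the price observed 24 hours earlier (the lag is an assumption about what is available at prediction time):

```python
# Q30: naive predictor = price 24 hours before the target hour.
data['Naive'] = data['SpotPrice_FR'].shift(24)
Y_naive = data.loc[data_test.index, 'Naive']
```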

**Q31** Fit our 3 models on the training data, and predict on the testing one.

**Q32** Print and compare the performances of the predictions.

_Tips: use our custom function regression_results_
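A sketch of the whole fit/predict/evaluate loop:

```python
# Q31 & Q32: fit on the training set, predict the last semester of 2019,
# and compare all predictors with our custom function.
predictions = {'Naive': Y_naive}
for name, model in models.items():
    model.fit(X_train, Y_train)
    predictions[name] = model.predict(X_test)

for name, pred in predictions.items():
    print('---', name)
    regression_results(Y_test, pred)
```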

Our predictions are worse than the naive predictor's; we can do better! But first, let's analyze our models further.

**Q33** Plot the realized prices and our predictions for the first week of our test data. Add a legend.

Analyzing our forest. What is feature importance?

Feature importance assigns a score to each input feature based on its influence on predicting the target.
The more a feature is "responsible" for predicting the target, the greater its score. Feature importance allows a better interpretability of the variables and of the forest (indeed, how can we interpret an average of 200 trees?). Several approaches to compute these scores have been proposed in the literature, depending on the chosen model.

Mean Decrease Impurity (MDI) is one of the variable importance measures in random forests. Few theoretical justifications for using MDI are available. In particular, a paper (Scornet (2021)) analyzes MDI and proves that if input variables are independent and in the absence of interactions, MDI provides a variance decomposition of the output, where the contribution of each variable is clearly identified.

https://scikit-learn.org/stable/auto_examples/inspection/plot_permutation_importance.html
Scornet (2021)
Genuer, Poggi and Tuleau (2008)
Genuer, Poggi and Tuleau (2010)
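In sklearn, the MDI scores of a fitted forest are exposed through the feature_importances_ attribute; a possible display:

```python
# MDI importance of each feature for the fitted forest.
forest = models['Forest']
importances = pd.Series(forest.feature_importances_, index=features).sort_values()
importances.plot.barh()
plt.xlabel('MDI importance')
plt.show()
```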

Adding useful features!

Back to table of contents.

**Q34** Given that the naive predictor is better than our models, what could you suggest?

Lagged price

**Q35** Create this new feature and add it.
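A sketch, assuming a 24-hour lag is available at prediction time:

```python
# Q35: price observed 24 hours earlier, as a new explanatory variable.
# The first 24 rows become NaN: be careful when recreating the training
# and test sets (Q36), e.g. drop these rows.
data['Lag_24'] = data['SpotPrice_FR'].shift(24)
```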

**Q36** Recreate the training and test sets with this added feature. What do you have to be careful about?

**Q37** Fit again the models, predict and analyze.

NB: we should re-do the optimization of the hyper-parameters of the tree and the forest. Because of the computational cost, we do not do it here and keep the previous models.

Lagged price w.r.t. weekdays/weekends

**Q38** It is better, but do you think our lags make sense given the weekly pattern of our data? Improve the lags! And redo all the analysis.
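One possible improvement (a choice among several): keep the 24h lag when the previous day is of the same type, and fall back on last week's price (168h) for Mondays and Saturdays, whose previous day belongs to the other regime:

```python
# Q38: weekday-aware lag of the price.
lag24 = data['SpotPrice_FR'].shift(24)
lag168 = data['SpotPrice_FR'].shift(168)
# Monday (0) follows a Sunday, Saturday (5) follows a Friday: different regimes.
same_type = ~data['Weekday'].isin([0, 5])
data['Lag_price'] = np.where(same_type, lag24, lag168)
```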

We have assessed their predictive performance, but not the shape of their errors. This should always be done before any forecast (again, using cross-validation): if your errors still contain dependence, or are not zero-centered (among other things), this is highly problematic!

**Q39** Get the residuals ($y_t - \hat{y}_t$) of the cross-validation.

Tips: we cannot use any built-in method from sklearn. Code your own loop, using the tscv object previously created
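A sketch of such a loop, shown here for the linear regression:

```python
# Q39: out-of-fold residuals of the cross-validation.
lin_reg = LinearRegression()
residuals = []
for idx_fit, idx_val in tscv.split(X_train):
    lin_reg.fit(X_train.iloc[idx_fit], Y_train.iloc[idx_fit])
    pred = lin_reg.predict(X_train.iloc[idx_val])
    residuals.append(Y_train.iloc[idx_val].values - pred)
residuals = np.concatenate(residuals)
```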

**Q40** Plot these residuals. What do you observe? Can you explain it?

**Q41** Display their histogram.

The autocorrelation and partial autocorrelation functions allow us to analyse the correlation of a series with its past.

**Q42** Using _plot_acf_, _plot_pacf_ from statsmodels.graphics.tsaplots, show their autocorrelations. What do you observe? Do you expect it?

Note here that this should absolutely be done for our 3 models. For time considerations, we only do it for the linear regression here.
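For instance:

```python
from statsmodels.graphics.tsaplots import plot_acf, plot_pacf

# Q42: (partial) autocorrelations of the cross-validation residuals.
plot_acf(residuals, lags=50)
plot_pacf(residuals, lags=50)
plt.show()
```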

Can we forecast 2020? 2021?

Back to table of contents.

Now that we have selected our models and assessed that their errors are approximately acceptable, we can try to forecast more difficult periods: 2020 with the quarantine, and 2021 with the price explosion in the fall.

2020

Let's start with 2020.

**Q43** Redefine the training and testing data sets: the training set ends at 2019-12-31 23h and the testing set at 2020-12-31 23h.

**Q44** Regenerate the dumb predictor.

**Q45** And let's go! Fit, predict and compare. What can you say?

The errors deteriorate, but not that much, and remain better than the naive predictor's. In practice, we should dive into what happens in March and April, with the huge negative prices. If you are interested, you can do this job and try to improve your model :) In class, we will jump to 2021, which is also super interesting!

2021

Back to table of contents.

**Q46** Redefine the training and testing data sets: the training set ends at 2020-12-31 23h and the testing set at 2021-12-31 23h.

**Q47** Regenerate the dumb predictor.

**Q48** Here we go, predict and analyze.

**Q49** Let's cheat a bit: plot the prices in 2021.

We can evaluate the evolution of our errors over time. To do so, implement a moving average function that averages only the $w$ points symmetrically distributed around each point $x$.

**Q50** Create the moving average function.

Tips: use convolve from numpy

**Q51** Apply this function to our errors (RMSE) with a window of approximately one month. Plot these errors.
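A sketch for Q50 and Q51; pred_2021 is a placeholder for the predictions of one of our models on the 2021 test set:

```python
# Q50: symmetric moving average; mode='valid' keeps only the positions where
# the full window fits, so the output is slightly shorter than the input.
def moving_average(x, w):
    return np.convolve(x, np.ones(w) / w, mode='valid')

# Q51: sliding RMSE over roughly one month of hourly data (24 * 30 points).
squared_errors = (Y_test.values - pred_2021) ** 2
sliding_rmse = np.sqrt(moving_average(squared_errors, 24 * 30))
plt.plot(sliding_rmse)
plt.ylabel('Sliding RMSE')
plt.show()
```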

**Q52** What can you propose as a solution to improve the models?

Growing window?

Back to table of contents.

TQDM is a really useful package allowing you to monitor the progress of a loop, as well as its expected remaining time.
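A sketch of a growing-window strategy, monitored with tqdm: each month of 2021, we retrain on everything observed so far and predict the coming month (the monthly re-fitting step is our choice):

```python
from tqdm import tqdm

# Drop the first rows made NaN by the lagged features;
# 'features' is the current feature list (including the lags).
df = data.dropna(subset=features).reset_index(drop=True)

preds = []
for start in tqdm(pd.date_range('2021-01-01', '2021-12-01', freq='MS')):
    end = start + pd.DateOffset(months=1)
    fit_mask = df['date'] < start              # growing window: all past data
    pred_mask = (df['date'] >= start) & (df['date'] < end)
    model = LinearRegression()
    model.fit(df.loc[fit_mask, features], df.loc[fit_mask, 'SpotPrice_FR'])
    preds.append(model.predict(df.loc[pred_mask, features]))
pred_growing = np.concatenate(preds)
```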

We have managed to improve our performance, and beat our naive predictor.

References

Some references to go further on the topic:

Weron (2014). Electricity price forecasting: A review of the state-of-the-art with a look into the future. International Journal of Forecasting.

Lago et al. (2021). Forecasting day-ahead electricity prices: A review of state-of-the-art algorithms, best practices and an open-access benchmark. Applied Energy.

Back to table of contents.