# Normalised Radial Basis Function Neural Networks for Predicting Total UK Energy Demand

This post details Normalised Radial Basis Function Neural Networks (NRBFNNs) as a function approximator for predicting national grid power usage.

Radial Basis Function Neural Networks (RBFNNs) are an older technique and are not commonly used anymore. However, it is still a useful exercise to create one from scratch (also fun!).

I don't expect the prediction to be remotely accurate.

A Radial Basis Function Network (RBFN) is a feed-forward neural network (FFNN) which uses a radial basis function as its activation function, instead of the sigmoid functions usually found in Multilayer Perceptrons (MLPs).

## Network Architecture

An RBF network consists of three layers: input, hidden and output. The input weights are the RBF node centres $\mu_{n,k}$.

A number of basis functions have been proposed; in this case a Gaussian function $(\phi)$ will be used:
$\phi_{k} \left (\left | x_{n} - \mu_{k}\right | \right ) = \exp\left ( -\frac{1}{2} \left ( \frac{\left | x_{n} -\mu_{k}\right |}{\sigma_k} \right )^2 \right )$

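Written out for an input vector with $K$ components, with the centre of hidden node $j$ stored as its input weights $w_{jk}$, the activation is:
$\phi_j = \exp(-\frac{1}{2 \sigma^2} \sum_{k=1}^{K}{(x_k - w_{jk})}^2)$

The output of the network is the linear sum of the activations and the output weights $W$:
$y = \phi W$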
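As a rough illustration, the Gaussian activation and the weighted-sum output $y = \phi W$ could be sketched in plain Python as below. The function names, the two centres and the two output weights are made up for the example, and a single shared $\sigma$ is assumed for every node.

```python
import math


def gaussian_rbf(x, centre, sigma):
    """Gaussian activation of one hidden node: exp(-0.5 * (|x - mu| / sigma)^2)."""
    dist_sq = sum((xi - ci) ** 2 for xi, ci in zip(x, centre))
    return math.exp(-0.5 * dist_sq / sigma ** 2)


def rbf_forward(x, centres, sigma, weights):
    """Return (phi, y): the hidden activations and their weighted sum."""
    phi = [gaussian_rbf(x, c, sigma) for c in centres]
    y = sum(p * w for p, w in zip(phi, weights))
    return phi, y


# Hypothetical network: two hidden nodes on a one-dimensional input.
centres = [[0.0], [1.0]]
weights = [2.0, -1.0]
phi, y = rbf_forward([0.0], centres, sigma=1.0, weights=weights)
```

The input sits exactly on the first centre, so that node's activation is 1 and the second node contributes $e^{-1/2}$ scaled by its weight.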

## Training

There are many training algorithms that could be employed to train an RBFNN. This section will briefly describe some methods.

### Fixed Centers

In this methodology the centres are assumed fixed. The centres of the RBFs are selected randomly from the training dataset, starting with a large number of centres and then, after training, pruning the nodes which are the least significant. This approach was proposed in earlier work.

The use of gradient descent has also been proposed. In an MLP this would be used as part of backpropagation, though for RBFs it has a slow convergence time.

### Two Phase

This approach breaks learning into two phases.

1. Unsupervised learning of the centres and $\sigma$, where the receptive fields of the hidden nodes are placed over regions of the input space where data are found.
2. Training of the output weights $w_{k,i}$.

For the first phase, K-means clustering can be used to find the optimal node placement. K-means is used because it can place the receptive fields of any smaller number of nodes optimally over the input space. This is the algorithm that will be used to train the network. To adjust the output weights $W$, the following weight update equation can be used:

$W = W + \alpha (t - y) \phi$

Where:

$\alpha$ is the learning rate.
$W$ is the output weight matrix.
$y$ is the output of the network.
$t$ is the target output.
$\phi$ is the vector of hidden node activations.
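The two phases can be sketched in plain Python as follows. This is a minimal illustration rather than the code used for the results in this post: the function names, the toy data, the shared $\sigma = 0.3$, the four centres and the epoch count are all assumptions. The weight update is written with the error as target minus output, so that each step reduces the squared error.

```python
import math
import random


def kmeans(points, k, iterations=20, seed=0):
    """Phase 1: Lloyd's k-means, returning k centres for a list of vectors."""
    rng = random.Random(seed)
    centres = [list(p) for p in rng.sample(points, k)]
    for _ in range(iterations):
        clusters = [[] for _ in range(k)]
        for p in points:
            # Assign each point to its nearest centre (squared Euclidean distance).
            nearest = min(
                range(k),
                key=lambda j: sum((a - b) ** 2 for a, b in zip(p, centres[j])),
            )
            clusters[nearest].append(p)
        for j, members in enumerate(clusters):
            if members:  # an empty cluster keeps its previous centre
                centres[j] = [sum(col) / len(members) for col in zip(*members)]
    return centres


def activations(x, centres, sigma):
    """Gaussian activation of every hidden node for input vector x."""
    return [
        math.exp(-0.5 * sum((a - b) ** 2 for a, b in zip(x, c)) / sigma ** 2)
        for c in centres
    ]


def predict(x, centres, sigma, weights):
    """Network output: weighted sum of the hidden activations."""
    return sum(p * w for p, w in zip(activations(x, centres, sigma), weights))


def train_output_weights(xs, ts, centres, sigma, alpha=0.2, epochs=200):
    """Phase 2: per-pattern update W <- W + alpha * (t - y) * phi."""
    weights = [0.0] * len(centres)
    for _ in range(epochs):
        for x, t in zip(xs, ts):
            phi = activations(x, centres, sigma)
            y = sum(p * w for p, w in zip(phi, weights))
            weights = [w + alpha * (t - y) * p for w, p in zip(weights, phi)]
    return weights


def sum_squared_error(xs, ts, centres, sigma, weights):
    return sum((predict(x, centres, sigma, weights) - t) ** 2 for x, t in zip(xs, ts))


# Toy data (made up for illustration): ten points on the line t = 2x.
xs = [[i / 9] for i in range(10)]
ts = [2 * x[0] for x in xs]

centres = kmeans(xs, k=4)
weights = train_output_weights(xs, ts, centres, sigma=0.3)
error_before = sum_squared_error(xs, ts, centres, 0.3, [0.0] * 4)
error_after = sum_squared_error(xs, ts, centres, 0.3, weights)
```

Comparing `error_after` with `error_before` gives a quick check that the output weights have moved in the right direction.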

## Testing

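The root mean squared error (RMSE) is one of many metrics that can be used to measure the error of a network. The RMS error for pattern $p$, taken over $M$ outputs and compared with the desired outputs $y_{id}^{p}$, and its average over $P$ patterns are calculated as:
$RMS^{p} = \sqrt{\frac{1}{M}\sum_{i=1}^{M}{(y_{i}^{p} - y_{id}^{p})}^2}$
$RMS_{avg} = \frac{1}{P}\sum_{p=1}^{P}RMS^{p}$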
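A small sketch of the two error measures, assuming the outputs and desired values are plain lists of numbers; the function names and the sample values are illustrative only.

```python
import math


def rms_error(outputs, desired):
    """Root mean squared error between network outputs and desired outputs."""
    if not outputs or len(outputs) != len(desired):
        raise ValueError("outputs and desired must be non-empty and the same length")
    squared = sum((y - d) ** 2 for y, d in zip(outputs, desired))
    return math.sqrt(squared / len(outputs))


def average_rms(all_outputs, all_desired):
    """Mean of the per-pattern RMS errors."""
    errors = [rms_error(y, d) for y, d in zip(all_outputs, all_desired)]
    return sum(errors) / len(errors)


# Made-up values: two patterns with two outputs each.
single = rms_error([0.0, 0.0], [3.0, 4.0])
average = average_rms([[0.0, 0.0], [1.0, 1.0]], [[3.0, 4.0], [1.0, 1.0]])
```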

## Hyperparameter Optimization

There are two main parameters that can be changed: sigma ($\sigma$), the width of the radial basis function, and the number of nodes in the hidden layer. The learning rate $\alpha$ can also be changed, but this doesn’t have much effect; a value of $0.2$ is sufficient.

### Sigma ($\sigma$)

Increasing the width of the RBF allows each hidden node’s receptive field to cover more of the input space.

### Hidden Nodes ($\phi$)

Having the right number of hidden nodes is important. Having too many nodes can effectively enable the network to memorise the training set (overfitting). Having too few nodes can lead to underfitting, though fewer nodes means faster network training.

# Normalised Radial Basis Function Networks

Normalising the RBF has been proposed previously. The network output becomes an activity-weighted average of the input weights, in which the weights from the most active inputs contribute most to the value of the output. However, that work performs normalisation at each hidden node, whereas here normalisation is performed at the output layer, which preserves locality. See equation [eq:normOut](#eq:normOut) below.

$y(x) = \frac{\sum_{n = 1}^{N}w_n \phi(\left | x - x_n \right |)} {\sum_{n = 1}^{N}\phi(\left | x - x_n \right |)}$<span id="eq:normOut" label="eq:normOut">$eq:normOut$</span>
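The normalised output can be sketched as a weighted average of the weights, with the Gaussian activations acting as the averaging coefficients. The function name, centres, weights and $\sigma$ below are hypothetical, and the guard against a zero denominator is an addition for numerical safety rather than part of the equation.

```python
import math


def nrbf_output(x, centres, sigma, weights):
    """Normalised RBF output: sum(w_n * phi_n) / sum(phi_n)."""
    phi = [
        math.exp(-0.5 * sum((a - b) ** 2 for a, b in zip(x, c)) / sigma ** 2)
        for c in centres
    ]
    total = sum(phi)
    if total == 0.0:
        # Every activation underflowed: the input is far from all centres.
        raise ValueError("input is too far from every centre")
    return sum(w * p for w, p in zip(weights, phi)) / total


centres = [[0.0], [1.0]]
weights = [1.0, 3.0]
midpoint = nrbf_output([0.5], centres, 0.5, weights)  # equal activations
far_away = nrbf_output([5.0], centres, 0.5, weights)  # dominated by the nearer node
```

Halfway between the two centres the output is the plain mean of the two weights. Far from both centres it tends towards the weight of the nearest node instead of decaying to zero, and it always stays between the smallest and largest weight.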

This decreases error in the network by increasing generalisation.

# Task 1: Implementing One Dimensional NRBFNN

Using the algorithms described above, a one-dimensional NRBFNN was created which fits a simple 10-point linear function using fixed centres, to test basic functionality. Given a set of $x$ values and target $y$ values it outputs $y$. The architecture is the same as shown in Figure [RBFArch].

## Sigma Optimisation

An optimal sigma value can be found by testing a range of values. In this case sigma has been tested between $0.1$ and $1$ with a step of $0.1$. See Table [sigopttwo] below.

Sigma trial values, optimal value highlighted.

| Sigma | Train RMS | Test RMS |
|---|---|---|
| 0.1 | 1.37E-10 | 0.10053 |
| 0.2 | 3.16E-09 | 0.10158 |
| 0.3 | 0.00043878 | 0.10941 |
| 0.4 | 0.017692 | 0.11483 |
| 0.5 | 0.065148 | 0.096863 |
| 0.6 | 0.06484 | 0.097723 |
| 0.7 | 0.075326 | 0.093764 |
| 0.8 | 0.082668 | 0.093089 |
| **0.9** | **0.081227** | **0.091509** |
| 1 | 0.080652 | 0.093284 |

The optimal sigma for the dataset is $0.9$. If the application of the network requires higher accuracy, another round of trials could be done over a finer range, e.g. $0.91:0.01:0.99$.
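The selection step can be reproduced from the trial values in the sigma table: pick the sigma with the lowest test RMS, then build the finer grid for a second round. The variable names are illustrative, and the snippet only re-reads the published numbers; it does not retrain anything.

```python
# (sigma, train RMS, test RMS) copied from the sigma trial table.
trials = [
    (0.1, 1.37e-10, 0.10053),
    (0.2, 3.16e-09, 0.10158),
    (0.3, 0.00043878, 0.10941),
    (0.4, 0.017692, 0.11483),
    (0.5, 0.065148, 0.096863),
    (0.6, 0.06484, 0.097723),
    (0.7, 0.075326, 0.093764),
    (0.8, 0.082668, 0.093089),
    (0.9, 0.081227, 0.091509),
    (1.0, 0.080652, 0.093284),
]

# Choose on test error: the smallest sigmas fit the training set almost
# perfectly but do not have the lowest test error.
best_sigma, _, best_test_rms = min(trials, key=lambda row: row[2])

# Second-round candidates, equivalent to the range 0.91:0.01:0.99.
refined = [round(0.91 + 0.01 * i, 2) for i in range(9)]
```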

# Task 2: Multi Dimensional NRBFNN & Prediction

Use NRBFNN to predict the total UK energy demand.

## Data & Data preprocessing

The training dataset comprises data from 2012 to 2016. The 2017 data will be used as a validation set against the network’s prediction. Another possible solution would be to use data from 2012 to 2015 as a training set, 2016 as a testing set and 2017 as a validation set. However, this was not chosen because 2016 is the most valuable set to train on, due to its proximity and quality compared to earlier sets.

The dataset comprises many components of the overall power demand which we don’t need to be concerned with for predicting power usage, such as a breakdown of each source of power fulfilment (e.g. coal, nuclear, wind etc.). These were removed, leaving just the timestamp and the demand in megawatts (MW).

2012-01-01 01:00:01 31093

There are $524277$ patterns in the training set; each pattern is a reading taken around every five minutes. The readings are not uniformly spaced and can be anywhere up to a whole minute late.[late]

Each dataset is a year long, taking an hourly average over the twelve five-minute samples. This makes a year’s dataset $\sim 8760$ patterns, depending on whether the year was a leap year. The data is averaged over each hour for a couple of key reasons:

1. Computational Complexity: If all patterns were used from 2012-2016 there would be $\sim 420480$ patterns. This is a very large training dataset; with an hourly average $\sim 35040$ patterns can be achieved, which is much easier to compute.
2. Error Reduction: By averaging hourly, the ability to accurately predict power demand with a resolution of five minutes is lost. In the application of predicting power demand this resolution would be desirable. However, by averaging this data we improve hourly, weekly and yearly predictions by averaging out readings which may be erroneous. More importantly, we somewhat counteract the highly variable five-minute reading time.
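The hourly averaging could be sketched as below, assuming each reading is a `(timestamp, demand)` pair with the timestamp in the `YYYY-MM-DD HH:MM:SS` form of the example row. Only the first reading is taken from the dataset; the other two are invented for the example, and `hourly_average` is a hypothetical helper name.

```python
from collections import defaultdict
from datetime import datetime


def hourly_average(readings):
    """Average demand per clock hour for (timestamp string, demand MW) pairs."""
    buckets = defaultdict(list)
    for stamp, demand in readings:
        t = datetime.strptime(stamp, "%Y-%m-%d %H:%M:%S")
        # Truncate to the start of the hour so slightly late readings still
        # fall into the bucket of the hour they were taken in.
        buckets[t.replace(minute=0, second=0)].append(demand)
    return {hour: sum(buckets[hour]) / len(buckets[hour]) for hour in sorted(buckets)}


readings = [
    ("2012-01-01 01:00:01", 31093),  # example row from the dataset
    ("2012-01-01 01:05:02", 31000),  # invented
    ("2012-01-01 02:00:00", 30000),  # invented
]
averaged = hourly_average(readings)
```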

After averaging the dataset, the timestamp is converted to the day of the year, day of week and hour of day. Using these metrics gives the network a clear idea of where each data point sits in time.

1. Day of year: could give insight into seasons. This is useful because power usage possibly increases in the latter months due to the need for heating.
2. Day of week: less energy demand at weekends due to a large percentage of the workforce not being at work.
3. Hour of day: less energy demand when people are asleep.

There is no need to give the network day of month, due to it technically being included in day of year.

These inputs and the demand are then rescaled to between $0$ and $1$, e.g. day of year is divided by the number of days in a year, so day 1 becomes $\frac{1}{365} \approx 0.0027$, as can be seen in Table [tbl:data-rescale].

Example of data scaling.

| | DoY | DoW | HoD | Demand (MW) |
|---|---|---|---|---|
| Raw | 1 | 1 | 1 | 31093 |
| Rescaled | 0.0027 | 0.1429 | 0.0417 | 0.31093 |
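The scaling table can be reproduced from the example row with a short sketch. Several details are inferred rather than stated: the week is assumed to start on Sunday (2012-01-01 was a Sunday and appears as day 1), the demand divisor of 100000 is read off from $31093 \rightarrow 0.31093$, and the day of year is divided by 365 even in leap years. The name `encode` is hypothetical.

```python
from datetime import datetime


def encode(stamp, demand_mw):
    """Turn a timestamp and demand into (DoY, DoW, HoD, demand), each rescaled."""
    t = datetime.strptime(stamp, "%Y-%m-%d %H:%M:%S")
    day_of_year = t.timetuple().tm_yday      # 1..366
    day_of_week = t.isoweekday() % 7 + 1     # assumed: Sunday = 1 ... Saturday = 7
    return (
        day_of_year / 365,    # divisor taken from the text
        day_of_week / 7,
        t.hour / 24,
        demand_mw / 100000,   # divisor inferred from the example table
    )


scaled = encode("2012-01-01 01:00:01", 31093)
rounded = tuple(round(value, 4) for value in scaled[:3])
```

Rounded to four decimal places, the three time features match the rescaled row of the table.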

## Dataset Anomalies

In many of the yearly datasets there are a number of spikes in usage (See Figure [fig:2013_data]) which are difficult to predict. These usually correspond to events such as football finals and terrorist attacks. However there is no definite way of correlating the spike in demand with the event.

These spikes cannot be predicted as they are often totally unexpected. Even if the network was aware that there was an event at a given time, it would be very hard to predict demand. Like weather prediction, there are too many variables to gain very high accuracy without the possibility of creating simulated realities.

Other interesting features of the dataset:

1. Total UK power demand is decreasing over time. (Adds to inaccuracy)
2. The 2017 data has two areas of missing or erroneous readings. See Figure [fig:2017_data].
3. Some years have large negative demand spikes which could be related to grid outages, solar flares or erroneous readings over the span of a few hours.
4. The aforementioned inaccuracy in reading timing. See Section [late].

The input to the network is now a four by one vector, so the network needs to be altered to reflect this: the input layer becomes 4 nodes.

## Optimization

Optimisation was carried out by creating a wrapper around existing code. See Algorithm [agl:opt-wrapper]. For more detail about optimisation see Section 1.4.

Create array of networks (M).
iterations = (15)
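Since only a fragment of the wrapper survives here, the following is a guess at its general shape rather than the actual algorithm: a grid search that scores every combination of $\sigma$ and hidden layer size and keeps the best. `evaluate` is a hypothetical callback standing in for "train a network and return its test RMS", and the dummy scoring function exists only to make the sketch runnable.

```python
from itertools import product


def grid_search(sigmas, node_counts, evaluate):
    """Score every (sigma, nodes) pair and return the best pair plus all scores."""
    results = {
        (sigma, nodes): evaluate(sigma, nodes)
        for sigma, nodes in product(sigmas, node_counts)
    }
    best = min(results, key=results.get)
    return best, results


def dummy_score(sigma, nodes):
    # Stand-in for training and testing a network; not a real error surface.
    return abs(sigma - 0.3) + 1.0 / nodes


best, results = grid_search([0.1, 0.2, 0.3], [10, 20, 40], dummy_score)
```

Keeping the full `results` dictionary, not just the winner, makes it possible to tabulate the error for every setting afterwards.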

### Sigma ($\sigma$)

Sigma values were tested in the range $\{0.1, 0.2, \ldots, 1\}$. From this set of tests $\sigma = 0.1$ was optimal, with a Test RMS of $0.03026319$ at $10916$ hidden nodes. After concluding this, $\sigma = 0.09$ was tested to see if any improvement was made.

### Hidden Nodes ($\phi$)

To optimise the hidden nodes, a range of hidden layer sizes were tested with respect to the training set $T = \{t_1, t_2, \ldots, t_i\}$. The following hidden node values were tested:
$\frac{|T|}{512}, \frac{|T|}{256}, \frac{|T|}{128}, \frac{|T|}{64}, \frac{|T|}{32}, \frac{|T|}{16}, \frac{|T|}{8}, \frac{|T|}{4}, \frac{|T|}{2}, |T|$
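The candidate layer sizes can be generated by repeatedly halving the divisor. The training set size used below, $|T| = 43664$, is not stated in the text; it is inferred from $\frac{|T|}{4} = 10916$, and rounding halves upwards is likewise an assumption, chosen because it reproduces the node counts reported in the results.

```python
import math


def candidate_node_counts(training_size, divisors=(512, 256, 128, 64, 32, 16, 8, 4, 2, 1)):
    """Hidden layer sizes |T| / d, rounded to the nearest integer (halves round up)."""
    return [math.floor(training_size / d + 0.5) for d in divisors]


TRAINING_SIZE = 43664  # assumed: 4 * 10916, not given explicitly in the text
counts = candidate_node_counts(TRAINING_SIZE)
```

Python's built-in `round` rounds halves to the nearest even number, so `math.floor(x + 0.5)` is used instead to turn 1364.5 into 1365.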

The optimal number of hidden nodes depends on your definition of optimal. If you wish to have the most accurate prediction then a large number of nodes is preferable; in this case this is $\frac{|T|}{4} = 10916$[1]. If time to produce a prediction is more important then $\frac{|T|}{16} = 2729$ nodes is optimal: the RMSE is $0.186\%$ higher, but training and calculation time become much lower. See Table [tbl:node:opt].

Hidden node optimisation at $(\sigma = 0.1)$

| Nodes | Train RMS | Test RMS | $\delta$ (%) |
|---|---|---|---|
| 10916 | 0.03026319 | 0.01755478 | 0.14 |
| 5458 | 0.03030537 | 0.01759055 | 0.05 |
| 2729 | 0.03031952 | 0.01763330 | 0.74 |
| 1365 | 0.03054313 | 0.01769537 | 0.43 |
| 682 | 0.03067295 | 0.01794032 | 5.57 |
| 341 | 0.03238211 | 0.01894481 | 16.76 |
| 171 | 0.03780812 | 0.02392022 | 19.63 |
| 85 | 0.04522908 | 0.03144102 | - |
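The $\delta$ column is not defined in the text. Working backwards from the numbers, it appears to be the relative increase in the first RMS column when moving to the next smaller network, and the $0.186\%$ figure quoted for 2729 nodes appears to be the same measure taken against the 10916-node network. The sketch below recomputes both under that assumption; the names are illustrative.

```python
# (nodes, first RMS column) copied from the hidden node table.
rows = [
    (10916, 0.03026319),
    (5458, 0.03030537),
    (2729, 0.03031952),
    (1365, 0.03054313),
    (682, 0.03067295),
    (341, 0.03238211),
    (171, 0.03780812),
    (85, 0.04522908),
]


def relative_increase(reference, other):
    """Percentage by which `other` exceeds `reference`."""
    return (other - reference) / reference * 100


# Each row compared with the next smaller network.
deltas = [
    round(relative_increase(rows[i][1], rows[i + 1][1]), 2)
    for i in range(len(rows) - 1)
]

# 2729 nodes compared with 10916 nodes.
cost_of_2729 = round(relative_increase(rows[0][1], rows[2][1]), 3)
```

Under this reading, `deltas` reproduces the table column and `cost_of_2729` reproduces the quoted $0.186\%$.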

From the results in Table [tbl:node:opt] it is clear that the number of hidden nodes affects the error less than optimising $\sigma$. Furthermore, as the number of nodes increases, the benefit in reduced error diminishes.

## Results

1. Only up to $\frac{|T|}{4}$ nodes were tested due to lack of compute.