Modeling Seasonality

Seasonality represents the periodic variations in a time series. The residual error plots made this repetitive behavior more pronounced by removing the linear trend from the time series, as we saw in the previous section. There are different ways to visualize the seasonal effect on temperature fluctuations. In the following plot, we use Seaborn's boxenplot() function to look at the seasonal distribution (spread around the mean) of the residual error over the 1895-2019 period.
fig, ax = plt.subplots(2,1,figsize=(12,12),dpi=300, gridspec_kw={'height_ratios': [1,5]})
plt.subplots_adjust(left=0, bottom=0, right=1, top=1, wspace=0, hspace=0.2)
season_groupby = month_to_season_arr[state_df.index.month]
groupby_ = [(state_df.index.year), season_groupby]
linear_model_s = LinearRegression(normalize=True, fit_intercept=True) 
y_data_s = state_df.groupby(groupby_).mean()
x_data_s = np.linspace(1895, 2019, len(y_data_s)).reshape(-1,1)
linear_model_s.fit(x_data_s, y_data_s) 
last_many_data = 30*4 # past 30 years 
a_s, b_s = linear_model_s.coef_[0], linear_model_s.intercept_ # f(x) = a.x + b (season)
y_pred_s = a_s*x_data_s + b_s
# Seasonal residual plot
ax[0].plot(x_data_s[-last_many_data:], (y_data_s-y_pred_s)[-last_many_data:], color='teal', linestyle='-.', label='residual error')
ax[0].set_xlabel('year', fontsize=14, labelpad=10)
ax[0].set_ylabel('residual', fontsize=14, labelpad=10)
ax[0].set_title(f'Residual error of the linear fit of the drift (using seasonal data) in temperature data of {state} from 1989-2019', fontsize=16);
ax[0].grid(False)
# Boxplot
residual_data = (y_data_s-y_pred_s)
residual_data.reset_index(inplace=True)
residual_data.columns = ['year', 'season', 'residual']
g = sns.boxenplot(x="season", y="residual",width=0.8,
                  palette="pastel", data=residual_data, 
                  ax=ax[1], order=['Spring', 'Summer', 'Fall', 'Winter'])
ax[1].set_title(f'Seasonal boxplots of the residual error from a linear fit of the drift in temperature data of {state} from 1895-2019', fontsize=16);
g.set_xlabel(g.get_xlabel(), fontsize=14)
g.set_ylabel(g.get_ylabel(), fontsize=14)
for i in range(2):
    plt.setp(ax[i].get_xticklabels(), fontsize=12)
    plt.setp(ax[i].get_yticklabels(), fontsize=12);

Notice that the residual fluctuations in winter are larger than in the other seasons. Temperature fluctuations occur with periods that are known to us: on a daily basis, around 3 pm is usually the hottest time, and on a seasonal basis, summer is the hottest season of the year. These frequencies are important in modeling the residual fluctuations; however, thanks to the power of math, we can find these frequencies even if we don't know them in advance. In the next part, we will talk about Fourier analysis and, more specifically, the Fast Fourier Transform (FFT), a powerful method to identify these frequencies and use them to approximate the residuals.

Fourier Series

In short, Fourier analysis is a way of approximating a signal/function by a sum of sine and cosine terms, called the Fourier series. The Fourier series is an expansion of a periodic function $T(t)$ of period $P$ as a (theoretically) infinite sum of sines and cosines:

$$ T(t) = \frac{a_0}{2} + \sum_{k=1}^{\infty}{a_k\cos\dfrac{2\pi kt}{P}} + \sum_{k=1}^{\infty}{b_k\sin\dfrac{2\pi kt}{P}} $$ For instance, consider the function $$ f(t) = \sin\Big(\frac{3\pi t}{2}\Big)+0.5. $$

Reminder:

$\sin(t)$ has period $2\pi$ because $2\pi$ is the smallest number for which $\sin(t+2\pi) = \sin(t)$ for all $t$. Similarly, the period of $\sin(\dfrac{3\pi t}{2})$ can be found by setting: $$ \frac{3\pi P}{2}=2\pi \ \rightarrow \ P=\dfrac{4}{3}. $$
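A one-line numeric check of this period (a minimal sketch; the grid of $t$ values is arbitrary):

import numpy as np

t = np.linspace(-4, 4, 1000)
P = 4/3
# shifting the argument by one period leaves the function unchanged
print(np.allclose(np.sin(3*np.pi*(t + P)/2), np.sin(3*np.pi*t/2)))   # True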
fig, ax = plt.subplots(1, 1, figsize=(12, 4), dpi=200)
ax.grid(False)
x_range = np.linspace(-4,4,1000)
ax.plot(x_range.reshape(-1,1), np.sin(3*np.pi*x_range/2)+0.5, label = r"$f(t) = \sin(\frac{3\pi t}{2})+0.5$")
ax.legend(bbox_to_anchor=(0.5,1,0,0),loc='lower center');
step=4/3
for i in np.arange(-4,4+0.1,4/3):
    ax.axvline(i, c='r',ls=':');
func = lambda x, pos: f"${x:0.3f}$"
ax.xaxis.set_major_locator(ticker.MultipleLocator(step))
ax.xaxis.set_major_formatter(ticker.FuncFormatter(func))

We can see that the function repeats every period $P$, which is shown by the dotted red vertical lines. In this case, we can write $f(t)$ as:

$$ f(t) = \frac{1}{2} + \sum_{k=1}^{1}{1\times\sin\bigg(\dfrac{2\pi \times 1\times t}{4/3}\bigg)} $$

When we compare this form with the definition of Fourier series provided above, we have:

$$ a_0=1 \ \Big(\text{so that } \frac{a_0}{2}=\frac{1}{2}\Big), \ a_k=0 \ (\text{for all }k>0) $$ $$ b_1=1, \ b_k=0 \ (\text{for all }k>1) $$ $$ P=\dfrac{4}{3} $$

In fact, we don't need a whole bunch of $k$s to capture the function precisely; $k=1$ plus the constant term $\frac{a_0}{2}=\frac{1}{2}$ does the job flawlessly! Now, this was pretty obvious. What about a more complex function like the one shown in the next figure?

We can tell by looking at the figure that the period of $f(t)$ is about $2\pi$, but we can't really say more than that just by looking at it! Thankfully, the Fourier coefficients $a_k$ and $b_k$ of a periodic function $f(t)$ can be calculated as:

$$ a_k = \dfrac{2}{P}\int_{P}f(t)\cdot\cos\big(\dfrac{2\pi}{P}kt\big)\,dt, \ \ \ b_k = \dfrac{2}{P}\int_{P}f(t)\cdot\sin\big(\dfrac{2\pi}{P}kt\big)\,dt $$

Let's calculate the coefficients for $0\leq k \leq 10$ for the function above over one period $P:[0,2\pi]$, assuming that the function is y (we will see shortly what the function actually is!)

import scipy.integrate as integrate
P = 2*np.pi
a_k = []
b_k = []
def integrand_cos(x,k): 
    return y(x)*np.cos(2*np.pi/P*k*x)
def integrand_sin(x,k): 
    return y(x)*np.sin(2*np.pi/P*k*x)
for k in range(11):
    a_k.append(2/P*integrate.quad(lambda x: integrand_cos(x,k), 0, 2*np.pi)[0])
    b_k.append(2/P*integrate.quad(lambda x: integrand_sin(x,k), 0, 2*np.pi)[0])
for i in range(11):
    print(r"k={0}: a_{1}={2:.3f}, b_{3}={4:.3f}".format(i,i,a_k[i],i,b_k[i]))
    k=0: a_0=0.000, b_0=0.000
    k=1: a_1=-0.000, b_1=0.000
    k=2: a_2=1.000, b_2=0.000
    k=3: a_3=-0.000, b_3=4.000
    k=4: a_4=0.000, b_4=-0.500
    k=5: a_5=0.000, b_5=-0.000
    k=6: a_6=0.000, b_6=-0.000
    k=7: a_7=-0.000, b_7=-0.000
    k=8: a_8=-0.000, b_8=0.000
    k=9: a_9=-0.000, b_9=0.000
    k=10: a_10=0.000, b_10=0.000

The nonzero coefficients are: $$ a_2=1, \ b_3=4, \ b_4=-0.5 $$ which leads to the Fourier series for ${y}$: $$ y(t) = \cos(2t) + 4\sin(3t) -\dfrac{1}{2}\sin(4t) $$ which is exactly the function plotted in the previous Figure (note that $a_0=0$, so there is no constant offset):

import matplotlib.patches as patches
init_notebook_mode()
def col_to_hex(colmap, n, alpha=1):
    """
    A simple function that given the named colormap and the number of colors needed, returns
    a linearly spaced hex color codes belonging to that colormap
    example:
    num_colors = 5
    colors = col_to_hex('coolwarm',num_colors)
    print(colors)
    
    ['#3a4cc0', '#83a6fbff', '#cdd9ebff', '#f6b79cff', '#d95947ff']
    """
    #     transparency = "%0.2X" % int(alpha*255) # map 0-1 to 0-255 then to hex
    out = []
    for i in range(n):
        r,g,b,_ = plt.cm.get_cmap(colmap,n)(i)
        out.append(f"#{int(r*255):02x}{int(g*255):02x}{int(b*255):02x}{int(alpha*255):02x}")
    return out
def y(x):
    return  4*np.sin(3*x) + np.cos(2*x) - 0.5*np.sin(4*x)
def y_components(x):
    return  4*np.sin(3*x), np.cos(2*x), -0.5*np.sin(4*x)
sns.set(rc={'axes.facecolor':'silver', 'figure.facecolor':'gainsboro', 'legend.facecolor':'white'})
fig, ax = plt.subplots(1, 1, figsize=(12, 6),dpi=80)
ax.grid(False)
amplitudes = [4, 1, 0.5, 0]
names = [r"$y_1=4\sin(3t)$", r"$y_2=\cos(2t)$",r"$y_3=-\frac{1}{2}\sin(4t)$",r"$y=y_1+y_2+y_3$"]
colors = col_to_hex('jet', 5)
num_divs = 500
x_range = np.linspace(-4*np.pi,4*np.pi,num_divs)
y_components = y_components(x_range)
step=np.pi
rectangle_width = 0.1
for j in range(4):
    if j==3:
        y_ = list(y(x_range)) 
    else:
        y_ = list(y_components[j])
        ax.add_patch(patches.Rectangle((-(rectangle_width+0.05)*(j+1)-4*np.pi-0.05, 0),
                                       rectangle_width, amplitudes[j],
                                       edgecolor=colors[j], facecolor=colors[j], fill=True))
    ax.plot(x_range[:125], y_[:125], color=colors[j], linestyle=':', zorder=-j)
    ax.plot(x_range[125:], y_[125:], color=colors[j], label=names[j])
    
func = lambda x, pos: f"{x/step:.0f}$\pi$"
ax.xaxis.set_major_locator(ticker.MultipleLocator(step))
ax.xaxis.set_major_formatter(ticker.FuncFormatter(func))
ax.legend(bbox_to_anchor=(0.5,1,0,0),loc='lower center');
  • $y=y_1+y_2+y_3$, i.e., Fourier Series expressed the function $y$ as a sum of three periodic functions $y_1$, $y_2$, and $y_3$.
  • The oscillation with $\omega=3$ has the biggest impact on $y$, whereas the green sine wave with $\omega=4$ has the smallest impact because its amplitude is much smaller. This is manifested in the corresponding Fourier coefficients of each component.
  • Note that the Fourier coefficients are the amplitudes of the sinusoidal components (shown with the rectangles)
  • Period of $y$ is the Least Common Multiple (LCM) of the periods of its sinusoidal components, i.e., $2\pi = \text{LCM}(\pi,\dfrac{2\pi}{3},\dfrac{\pi}{2})$
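As a quick numeric check of the last point, we can verify that shifting $t$ by $2\pi$ leaves $y$ unchanged (a minimal sketch, reusing the definition of y from the code above):

t_check = np.linspace(-4*np.pi, 4*np.pi, 2000)
# 2*pi is a period of y: y(t + 2*pi) == y(t) for all t
print(np.allclose(y(t_check + 2*np.pi), y(t_check)))   # True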

We can think of the Fourier series as transforming a function from the time domain, e.g., $y(t)$, to the frequency domain, where it is expressed using amplitudes and frequencies/periods. This is the idea behind the Fourier transform, which can be written as:

$$ F(\omega)=\int_{-\infty}^{\infty}f(t)\,e^{-i\omega t}\,dt $$

We can say $f(t)$ lives in the time domain, and $F(\omega)$ lives in the frequency domain. Similar to Fourier series, the Inverse Fourier Transform can be used to obtain $f(t)$ from $F(\omega)$:

$$ f(t)=\dfrac{1}{2\pi}\int_{-\infty}^{\infty}F(\omega)\,e^{i\omega t}\,d\omega $$

The Fourier transform can be viewed as the limit of the Fourier series of a function as its period approaches $\infty$, so the limits of integration change from one period $P$ to $(−\infty,\infty)$. While a periodic function can (provably) be expressed as a "discrete superposition" of exponentials, namely a summation, a non-periodic function cannot! It only admits a "continuous superposition", namely an integral. Note that instead of using sines and cosines, we can use Euler's formula to serve the same purpose: $$ e^{i\omega t}=\cos(\omega t)+i\sin(\omega t) $$ Exponential functions have the virtue that they are eigenfunctions of differentiation, i.e., differentiation just multiplies them by their exponent multiplier ($i \omega$ in this example): $$ \dfrac{d}{dt}e^{i\omega t}=i\omega\,e^{i\omega t} $$ which makes them very convenient for numerical operations.
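As a quick numeric sanity check of this eigenfunction property (a minimal sketch with an arbitrary choice of $\omega=3$):

import numpy as np

omega = 3.0
t = np.linspace(0, 2*np.pi, 10001)
f = np.exp(1j*omega*t)
# numerical derivative of exp(i*omega*t) along t
df_dt = np.gradient(f, t)
# away from the endpoints (where np.gradient is only first-order accurate),
# the derivative is just i*omega times the function itself
print(np.allclose(df_dt[1:-1], 1j*omega*f[1:-1]))   # True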

Fast Fourier Transform (FFT) and Discrete Fourier Transform (DFT)

So far we have only talked about continuous functions; however, temporal data in many practical applications is not continuous. For instance, in our example, temperature measurements are available at monthly intervals. If the original input function/variable contains equally spaced observations (for example, equal time intervals), then the Fourier transform is known as a discrete Fourier transform (DFT).

The discrete Fourier transform (DFT) converts a finite sequence of equally spaced samples of a function into a same-length sequence of equally spaced samples of a complex-valued function of frequency. For a real function/observation $x_n$ where $0\leq n \leq N-1$, the discrete Fourier transform $X_k$ is defined as: $$ X_k = \sum_{n=0}^{N-1}{x_n e^{-i\,2\pi{kn/N}}} = \sum_{n=0}^{N-1}{x_n[\cos(2\pi{kn/N}) -i \,\sin(2\pi{kn/N})]} $$ where $N$ is the total number of samples and $k$ is the frequency at which the response is computed ($0\leq k \leq N-1$). Similar to the inverse Fourier transform, $x_n$ can be retrieved from $X_k$ using the inverse discrete Fourier transform: $$ x_n = \frac{1}{N} \sum_{k=0}^{N-1} X_k e^{i\,2\pi kn/N} $$
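To make the definition concrete, here is a minimal sketch that evaluates the DFT sum directly for a short, arbitrary sequence and compares it with numpy.fft.fft:

import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=8)                      # a short, real-valued sample sequence
N = len(x)
n = np.arange(N)
# evaluate X_k = sum_n x_n * exp(-i 2 pi k n / N) directly from the definition
X = np.array([np.sum(x * np.exp(-2j*np.pi*k*n/N)) for k in range(N)])
print(np.allclose(X, np.fft.fft(x)))             # True: matches the FFT
print(np.isclose(X[0], x.sum()))                 # True: the zero-frequency term is the plain sum
print(np.allclose(X[1:], np.conj(X[1:][::-1])))  # True: conjugate symmetry for real-valued input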

Positive and negative frequencies

I always had trouble making sense of negative frequencies, so I hope this helps! From Euler's formula we know:

$$ \cos(\theta) = \dfrac{e^{i\,\theta} + e^{-i\,\theta}}{2} \\ \sin(\theta) = \dfrac{e^{i\,\theta} - e^{-i\,\theta}}{2i} $$

This means that both sine and cosine (and hence all real sinusoids) consist of a sum of equal and opposite circular motions. It might be easier to think of the $\sin(\theta)$ function as being made from equal contributions of an anti-clockwise rotating vector $e^{i\theta}$ and a clockwise rotating vector $e^{-i\theta}$, as shown in the figure below.
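Before looking at the figure, here is a quick numeric check of these two identities (a minimal sketch over an arbitrary grid of angles):

import numpy as np

theta = np.linspace(-2*np.pi, 2*np.pi, 500)
# cosine and sine as sums/differences of two counter-rotating complex exponentials
cos_from_exp = (np.exp(1j*theta) + np.exp(-1j*theta)) / 2
sin_from_exp = (np.exp(1j*theta) - np.exp(-1j*theta)) / (2j)
print(np.allclose(cos_from_exp, np.cos(theta)))   # True
print(np.allclose(sin_from_exp, np.sin(theta)))   # True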

import matplotlib as mpl
import matplotlib.pyplot as plt
import matplotlib.ticker as ticker
import numpy as np
import seaborn as sns
from matplotlib import cm
from mpl_toolkits.mplot3d import axes3d, Axes3D
sns.set(rc={'axes.facecolor':'white', 'figure.facecolor':'white', 'legend.facecolor':'white'})
mpl.rcParams['legend.fontsize'] = 10
fig = plt.figure(figsize=(10,10), dpi=90)
ax = fig.gca(projection='3d')
ax.grid(False)
theta = np.linspace(-2 * np.pi, 2 * np.pi, 100)
# z = np.linspace(-2, 2, 100)
r = 1
x = r * np.sin(theta)
y = r * np.cos(theta)
# ax.plot(x, y, z, label='parametric curve')
ax.plot(x, y, theta, label=r"$e^{i \omega t}$", color='lightgray',zorder=10)
ax.plot(x, 1*np.ones(100), theta, label=r"$\sin(\omega t)$",color="#0C55D5")
ax.plot(-1*np.ones(100), y, theta, label=r"$\cos(\omega t)$",color="#FF0700")
ax.plot(x, y, theta[0]*np.ones(100), alpha=0.2)
ax.set_xlabel('$\sin(\omega t)$', color="#0C55D5")
ax.set_xlim(-1, 1)
ax.set_ylabel('$\cos(\omega t)$', color="#FF0700")
ax.set_ylim(-1, 1)
ax.set_zlabel(r'$\omega t$')
ax.set_zlim(theta[0], theta[-1])
ax.set_xticks([-1,0,1])
ax.set_yticks([-1,0,1])
step=np.pi
func = lambda x, pos: f"{x/step:.0f}$\pi$"
ax.zaxis.set_major_locator(ticker.MultipleLocator(step))
ax.zaxis.set_major_formatter(ticker.FuncFormatter(func))
ax.legend(bbox_to_anchor=(1,1,0,0),loc='upper right');
plt.savefig('fourier_3d.png')
plt.show()

As a result, $X_k$ contains both positive and negative frequency terms. We can see that $X_0=\sum \limits_{n=0}^{N-1}x_n$. If we subtract the mean from $x_n$ before the transformation, then $X_0=0$. When $x_n$ is real-valued, as in our case, the DFT output $X_k$ at positive frequencies is the complex conjugate of $X_k$ at the corresponding negative frequencies, which means the frequency spectrum is symmetric. The numpy.fft module provides a straightforward way to compute the DFT using the FFT. Before demonstrating the application of the FFT, let's introduce some terminology that we will use.

  • Sampling: when we deal with time series data, the data could have been gathered every day, week, month, etc. The process of recording data at constant time intervals is called sampling.
  • Sampling rate ($\mathbf{f_s}$) denotes the number of data samples recorded per unit of time, usually seconds. If the time unit is seconds, the unit of the sampling rate is Hertz ($Hz$). For instance, if we sample a variable twice a minute, the sampling rate is $\dfrac{1}{30}\,Hz$
Ok! Now let's try the FFT on sample data gathered from $x_n= 5\sin(2\pi t) + \sin(3\pi t) - 4\sin(5 \pi t) + 10 \sin(7 \pi t)$ with a sampling rate of 10 $Hz$ for a total of 2 seconds.
# sampling rate 
fs = 10
# total_time (sec)
total_time = 2
sampling_times = np.arange(0,total_time,1/fs)
sampling_times_continuous = np.arange(0,total_time,1/fs/100)
N = len(sampling_times)
n = np.arange(N)
T = N/fs
# calculate the frequencies
freq = n/T
def frequency_labels(freq, fs=fs):
    ticks = freq
    labels = [f"{x:.1f}" for x in list(freq[freq>fs/2]) + list(-freq[freq<fs/2][::-1][:-1])]
    if len(labels)==len(ticks)-1:
        labels.insert(len(labels)//2+1, r"$\pm${:.1f}".format(freq[freq>=fs/2][0]))
    return labels
def y(x, cos_amplitude=0, sin_amplitude=0, frequency=1):
    return  cos_amplitude*np.cos(2*np.pi*frequency*x) + sin_amplitude*np.sin(2*np.pi*frequency*x)
# time domain samples
x_n_continuous = y(sampling_times_continuous,sin_amplitude=5,frequency=1) \
               + y(sampling_times_continuous,sin_amplitude=1,frequency=1.5) \
               - y(sampling_times_continuous,sin_amplitude=4,frequency=2.5) \
               + y(sampling_times_continuous,sin_amplitude=10,frequency=3.5)
# time domain samples
x_n = y(sampling_times,sin_amplitude=5,frequency=1) \
    + y(sampling_times,sin_amplitude=1,frequency=1.5) \
    - y(sampling_times,sin_amplitude=4,frequency=2.5) \
    + y(sampling_times,sin_amplitude=10,frequency=3.5)
X_k = np.abs(np.fft.fft(x_n-x_n.mean()))/(N/2)
# sns.set_style("white")
sns.set(rc={'axes.facecolor':'whitesmoke', 'figure.facecolor':'gainsboro', 'legend.facecolor':'white'})
fig, ax = plt.subplots(2, 1, figsize=(12, 12), sharex=False, dpi=80)
plt.subplots_adjust(left=0, bottom=0, right=1, top=1, wspace=0, hspace=0.25)
colors = col_to_hex('tab10', 8)
ax[0].scatter(sampling_times, x_n, marker='o', s=100, label='data samples')
ax[0].plot(sampling_times_continuous, x_n_continuous, 'r:', linewidth=3, label='reconstructed f')
# individual frequencies
ax[0].plot(sampling_times_continuous, y(sampling_times_continuous,sin_amplitude=5,frequency=1), 
           c=colors[0], linestyle='--',label=r'$5\sin(2\pi t)$', alpha=0.5)
ax[0].plot(sampling_times_continuous, y(sampling_times_continuous,sin_amplitude=1,frequency=1.5), 
           c=colors[1], linestyle='--',label=r'$\sin(3 \pi t)$', alpha=0.5)
ax[0].plot(sampling_times_continuous, -y(sampling_times_continuous,sin_amplitude=4,frequency=2.5), 
           c=colors[2], linestyle='--',label=r'$-4\sin(5 \pi t)$', alpha=0.5)
ax[0].plot(sampling_times_continuous, y(sampling_times_continuous,sin_amplitude=10,frequency=3.5), 
           c=colors[3], linestyle='--',label=r'$10\sin(7 \pi t)$', alpha=0.5)
ax[0].set_xlabel(r'$t$', labelpad=5, fontsize=20)
ax[0].set_ylabel(r'$x_n$', labelpad=5, fontsize=20)
ax[0].set_title('Original data(time domain)', fontsize=24, y=1.02)
ax[1].stem(freq, X_k, 'r')
ax[1].set_xticks(freq)
ax[1].set_xticklabels(frequency_labels(freq))
ax[1].scatter(fs/2,0,marker='x', s=100, label=r'$f_{\mathrm{Nyquist}}$')
ax[1].set_xlabel(r'$f\,(Hz)$', labelpad=5, fontsize=20)
ax[1].set_ylabel(r'$\vert X_k \vert$', labelpad=5, fontsize=20)
ax[1].set_title('FFT (frequency domain)', fontsize=24, y=1.02)
for i in range(2):
    plt.setp(ax[i].get_xticklabels(), fontsize=14)
    plt.setp(ax[i].get_yticklabels(), fontsize=14)
    ax[i].legend(bbox_to_anchor=(1,1,0,0), loc='upper left', fontsize=18)
plt.show()

Looks great! There are some important points worth noting about this figure:

  • We subtracted the mean before computing the FFT. In this example it doesn't make a difference because $\bar{x}_n=0$. In general though, not subtracting it leads to a large zero-frequency component because, as mentioned previously, $X_0=\sum \limits_{n=0}^{N-1}x_n$.
  • Note that the FFT has no knowledge of any of the curves plotted in the top figure. We only feed the 20 data samples shown with blue circles to the FFT. The curves are shown to demonstrate that the FFT can reconstruct the original signal that the samples were drawn from.
  • Note that the amplitude corresponding to each frequency is recovered exactly when we divide $\vert X_k \vert$ by half the number of samples, $N/2$. Equivalently, dividing by $N$ gives half the amplitude at each bin, because each real sinusoid is the sum of two equal contributions from a positive and a negative frequency; the full amplitude is then the sum over the two symmetric frequencies.
  • Note that the frequency response exhibits symmetry (called "folding") around the so-called Nyquist frequency, which is equal to half the sampling rate: $$ f_{\text{Nyquist}}=\dfrac{f_s}{2} $$
  • According to the Nyquist criterion, the sampling frequency, $f_s$, must be at least twice the maximum frequency component in the signal; otherwise, a phenomenon known as aliasing occurs (See here). In other words, the time domain data sampled at a sampling rate $f_s$ can be fully reconstructed if it contains only frequency components below half that sampling rate. Therefore, FFT's highest output frequency is half the sampling rate.
  • We can think about this in the context of the bottom figure as follows: reducing the sampling rate causes the highest negative frequency ($-3.5\,Hz$ in this example) to get closer and closer to its conjugate ($f=3.5\,Hz$) until they touch, which is when aliasing happens. A small sketch of this aliasing effect follows the next code block.
# sampling rates 
fs = 12
# total_time (sec)
total_time = 2
sampling_times = np.arange(0,total_time,1/fs)
sampling_times_continuous = np.arange(0,total_time,1/fs/100)
N = len(sampling_times)
n = np.arange(N)
T = N/fs
# calculate the frequencies
freq = n/T
# time domain samples
x_n_continuous = y(sampling_times_continuous,sin_amplitude=2,frequency=5) 
x_n = y(sampling_times,sin_amplitude=2,frequency=5) 
# Frequency response
X_k = np.abs(np.fft.fft(x_n-x_n.mean()))/(N/2)
# Plot
sns.set(rc={'axes.facecolor':'whitesmoke', 'figure.facecolor':'gainsboro', 'legend.facecolor':'white'})
fig, ax = plt.subplots(2, 1, figsize=(12, 12), sharex=False, dpi=80)
plt.subplots_adjust(left=0, bottom=0, right=1, top=1, wspace=0, hspace=0.25)
colors = col_to_hex('tab10', 8)
ax[0].scatter(sampling_times, x_n, marker='o', s=100, label='data samples')
ax[0].plot(sampling_times_continuous, x_n_continuous, 'r:', linewidth=3, label='continuous f')
ax[0].set_xlabel(r'$t$', labelpad=5, fontsize=20)
ax[0].set_ylabel(r'$x_n$', labelpad=5, fontsize=20)
ax[0].set_title('Original data(time domain)', fontsize=24, y=1.02)
ax[1].stem(freq, X_k, 'r')
ax[1].scatter(fs/2,0,marker='x', s=100, label=r'$f_{\mathrm{Nyquist}}$')
ax[1].set_xlabel(r'$f\,(Hz)$', labelpad=5, fontsize=20)
ax[1].set_ylabel(r'$\vert X_k \vert$', labelpad=5, fontsize=20)
ax[1].set_title('FFT (frequency domain)', fontsize=24, y=1.02)
for i in range(2):
    plt.setp(ax[i].get_xticklabels(), fontsize=14)
    plt.setp(ax[i].get_yticklabels(), fontsize=14)
    ax[i].legend(bbox_to_anchor=(1,1,0,0), loc='upper left', fontsize=18)
plt.show()
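The cell above samples the 5 Hz sine at 12 Hz, so its positive and negative frequency components sit just below and just above the Nyquist frequency of 6 Hz. As a quick sketch of what happens when the Nyquist criterion is actually violated (reusing the y() helper and total_time defined above), dropping the sampling rate to 8 Hz folds the 5 Hz component down to 8 - 5 = 3 Hz:

# undersample the same 5 Hz sine: fs = 8 Hz is below 2 x 5 Hz = 10 Hz
fs_alias = 8
t_alias = np.arange(0, total_time, 1/fs_alias)
x_alias = y(t_alias, sin_amplitude=2, frequency=5)
X_alias = np.abs(np.fft.fft(x_alias - x_alias.mean())) / (len(x_alias)/2)
freq_alias = np.arange(len(x_alias)) / (len(x_alias)/fs_alias)
# the strongest bin below the Nyquist frequency (4 Hz) is at 3 Hz, not at the true 5 Hz
k_peak = np.argmax(X_alias[:len(x_alias)//2])
print(freq_alias[k_peak])   # 3.0 -> the 5 Hz component has been aliased down to 3 Hz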

Ok! Now that we are done with the math-heavy part, we can examine how to apply Fourier analysis to the seasonal component of the temperature data. After we subtracted the trend from the original data using linear regression, we were left with the residual data:

residual_data = (y_data_m-y_pred_m)
residual_data.index.set_names(['year','month'], inplace=True)
residual_data.reset_index(inplace=True)
residual_data.columns = ['year','month','residual']
residual_data.head()
year month residual
0 1895 1 -26.679
1 1895 2 -30.361
2 1895 3 -16.022
3 1895 4 -0.574
4 1895 5 11.378
residual_data.residual.plot();

fft = np.fft.fft((residual_data.residual - residual_data.residual.mean()).values)
plt.plot(np.abs(fft))
plt.title("FFT of temperature data")
plt.xlabel('# Cycles in full window of data (~125 years)');

We can take a look at the dominant frequencies that contribute the most by sorting the resulting Fourier transform (fft). Note that we only focus on the first half of the response and also normalize it by the number of data points in that half. Here, we print the 10 most dominant frequencies:

dominant_freqs_10 = np.argsort((np.abs(fft)[:len(fft)//2]))[-10:][::-1]
dominant_freqs_10
    array([125, 250,   1,   2,  62,  91, 233,  59, 235, 114])

The X_k_10 variable shows the amplitude corresponding to each of these frequencies:

X_k_10 = np.abs(fft)[dominant_freqs_10]/(len(fft)//2)
X_k_10
    array([22.71 ,  0.694,  0.448,  0.443,  0.441,  0.438,  0.425,  0.405,
           0.4  ,  0.394])

Nice! So what this means is that the most dominant frequency corresponds to something that repeats 125 times in our time series, which is a year! Said differently, it is something that happens every

$$ \dfrac{\mathrm{len(fft)}}{\mathrm{125}}=\dfrac{\mathrm{1500}}{\mathrm{125}}=\text{12 months} = \text{1 year} $$

Note that our unit of time here is a month. Next, what has been repeated 250 times?

$$ \dfrac{\mathrm{1500}}{\mathrm{250}}=\text{6 months} $$

Every 6 months, i.e., half a year: the second dominant frequency is due to the variation between the hot and cold halves of the year ((spring + summer) months versus (fall + winter) months). The third and fourth frequencies are associated with periods of the entire 125 years and 62.5 years, respectively. Next, we see the biennial effect, corresponding to roughly every two years:

$$ \dfrac{\mathrm{1500}}{\mathrm{62\times 12}}\approx\text{2 years} $$
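The same conversion from frequency index to period can be done for all ten dominant frequencies at once (a small sketch, reusing the fft and dominant_freqs_10 variables from above):

# period in months = number of samples / number of repetitions over the full window
periods_months = len(fft) / dominant_freqs_10
for k, p in zip(dominant_freqs_10, periods_months):
    print(f"frequency index {k:>3}: period = {p:7.1f} months = {p/12:6.2f} years")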

The phase angle $\phi=\tan^{-1}\bigg(\dfrac{\mathrm{Im}(X_k)}{\mathrm{Re}(X_k)}\bigg)$ can easily be obtained using numpy.angle():

np.angle(fft[dominant_freqs_10], deg=True)
    array([ 174.552, -177.222,  -40.071,   77.718,   18.964,   -2.654,
           -105.005, -163.641, -160.028,   -4.573])
fig, ax = plt.subplots(3, 1, figsize=(12, 12), sharex=False, dpi=80)
freq_ranges = [np.arange(125-5,125+5+1), np.arange(250-5,250+5+1), np.arange(0,10)]
labels = [r"$\dfrac{{{:.0f}}}{{{:.0f}}}={{{:.1f}}}$".format(125,x,y)+' years' \
          for (_,(x,y)) in enumerate([[125, 1], [250,0.5], [1,125]])]
max_indices = [5, 5, 1]
data = np.abs(fft/(len(fft)//2))
plt.subplots_adjust(left=0, bottom=0, right=1, top=1, wspace=0, hspace=0.25)
for i, range_ in enumerate(freq_ranges): 
    ax[i].stem(range_, data[range_])#, label=labels[i]
    markers, stems, bases = ax[i].stem([range_[max_indices[i]]], [data[range_][max_indices[i]]], label=labels[i])
    markers.set_markerfacecolor('red')
    markers.set_markeredgecolor('red')
    stems.set_edgecolor('red')
    ax[i].set_ylabel('Amplitude', fontsize=18, labelpad=10)
    ax[i].legend(loc='upper right')
plt.suptitle('Three dominant frequencies', y=1.05, fontsize=24)
plt.xlabel('Number of occurrences in 125 years ($f$)', fontsize=18)
plt.show()

We can define a simple function to approximate the time series residual using the fft data:

temp_data = state_df.groupby([(state_df.index.year),(state_df.index.month)]).mean()
def model_trend(ts_data):
    x_range = np.arange(0, len(ts_data)).reshape(-1,1)
    linReg_model = LinearRegression(normalize=True, fit_intercept=True) 
    linReg_model.fit(x_range, ts_data)  
    a_, b_ = linReg_model.coef_[0][0], linReg_model.intercept_[0]
    ts_model_trend = a_*x_range + b_
    return ts_model_trend
def model_seasonal(ts_data, n_dominant_modes=10):    
    ts_model_trend = model_trend(ts_data)
    residual = np.ravel((ts_data - ts_model_trend).values)
    fft = np.fft.fft(residual - residual.mean())
    dominant_freqs = np.argsort((np.abs(fft)[:len(fft)//2]))[-n_dominant_modes:][::-1]
    phase_angles = np.angle(fft[dominant_freqs])
    amplitudes = np.abs(fft[dominant_freqs]) / (len(fft)/2)
    ts_model_seasonal = np.zeros((len(fft),1))
    t = 2 * np.pi * np.arange(0, len(fft)) / len(fft)
    for i,freq in enumerate(dominant_freqs):
        ts_model_seasonal += amplitudes[i] * np.cos(freq*t + phase_angles[i]).reshape(-1,1)
    return ts_model_seasonal + ts_model_trend
fig, ax = plt.subplots(3, 1, figsize=(8,8), sharex=True)
plt.subplots_adjust(left=0, bottom=0, right=1, top=0.92, wspace=0, hspace=0.1)
state = 'AL'
state_df = climate_df[climate_df['state']==state]
state_df.set_index('timestamp', inplace=True)
temp_data = state_df.groupby([(state_df.index.year),(state_df.index.month)]).mean()
num_modes = [1, 10, 50]
first_year = 1895
last_year = 2019
start_year = 2000
end_year = last_year
num_years = end_year-start_year+1
start_index = (start_year-first_year)*12
end_index = start_index + num_years*12 
for i, n_modes in enumerate(num_modes):
    ts_pred = model_seasonal(temp_data, n_dominant_modes=n_modes)
    err = (temp_data-ts_pred).values[start_index:end_index]
    mae = np.abs(err).mean()
    lab = f'prediction using first {n_modes}\n' + f'modes (MAE = {mae:.1f}$^\circ$F)'
    ax[i].plot(temp_data.values[start_index:end_index], color='r', label='real data')
    ax[i].plot(ts_pred[start_index:end_index], color='b', label=lab)
    ax[i].legend(bbox_to_anchor=(1,1.05,0,0), loc='upper left')
func = lambda x, pos: f"{start_year + x/12:.0f}"
ax[-1].xaxis.set_major_locator(ticker.MultipleLocator(24))
ax[-1].xaxis.set_major_formatter(ticker.FuncFormatter(func))
plt.setp(ax[-1].get_xticklabels(), fontsize=14, rotation=45)
plt.xlabel('year', fontsize=18, labelpad=10)
plt.suptitle(f'Seasonal model predictions of {state} temperatures from {start_year} to {end_year}', fontsize=18);

Intuitively, including more frequency modes in our predictor leads to a more accurate prediction.

seasonal_model_residual = (temp_data-ts_pred).iloc[start_index:end_index,:]
seasonal_model_residual.index.set_names(['year','month'], inplace=True)
seasonal_model_residual.reset_index(inplace=True)
seasonal_model_residual.columns = ['year','month','error']
seasonal_model_residual['MAE'] = seasonal_model_residual['error'].apply(lambda x: abs(x))
months = [month for month in month_name][1:]  # month names (assumes calendar.month_name was imported earlier)
seasonal_model_residual['month_name'] = seasonal_model_residual['month'].apply(lambda x: months[x-1])
fig, axes = plt.subplots(2, 1, figsize=(12, 8), sharex=True, dpi=100)
plt.subplots_adjust(left=0, bottom=0, right=1, top=0.95, wspace=0, hspace=0.02)
g = sns.boxplot(x="month_name", y="error",
                  palette="tab20", data=seasonal_model_residual, 
                  order=months, showmeans=True, ax=axes[0])
g.set_xlabel('')
g.set_ylabel(r'Seasonal model raw error ($^\circ F$)', fontsize=16, labelpad=10)
plt.setp(g.get_xticklabels(), fontsize=14, rotation=45, ha='center');
g = sns.boxplot(x="month_name", y="MAE",
                palette="tab20", data=seasonal_model_residual, 
                order=months, showmeans=True,ax=axes[1])
g.set_xlabel(r'month', fontsize=16, labelpad=10)
g.set_ylabel(r'Seasonal model MAE ($^\circ F$)', fontsize=16, labelpad=10)
plt.setp(g.get_xticklabels(), fontsize=14, rotation=45, ha='center')
plt.suptitle('Raw and Mean Absolute Error (MAE)', fontsize=22, y=1.0);

Notice that the warmer months are associated with lower errors, which makes sense because, as we saw previously, temperature fluctuations in the cold months are more pronounced than in the warm months. Hence, the model predictions for the cold seasons are prone to larger errors. Also, comparing the two subplots in the figure above shows that looking at the mean absolute error might be a better way of interpreting the model performance. For instance, comparing the model performance for different months using the mean raw error (green markers in the top subplot) is not a good idea due to cancellation of errors. This is why the mean absolute error is a common measure of forecast error in time series analysis (ref).
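A tiny numeric example of this cancellation effect (the error values are made up for illustration):

import numpy as np

errors = np.array([5.0, -5.0, 4.0, -4.0])   # raw errors that cancel each other out
print(errors.mean())             # 0.0 -> the mean raw error suggests a perfect model
print(np.abs(errors).mean())     # 4.5 -> the MAE reveals the true size of the errors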

So far, we have only tested the model performance on the data that the model was trained on, without any kind of validation on future data that the model has not seen. This is our main concern in the next part of the project. In addition, we will investigate how

  • the number of data points used in training, i.e., the number of years,
and
  • the number of modes
influence the quality of the predictions.

Modeling noise

We can further improve our seasonal model's temperature predictions by modeling the model's residual error, a.k.a. noise. Noise represents the temperature patterns that cannot be predicted using the seasonal model. One of the most common ways of modeling noise in the context of time series is using autocorrelation functions.

Autocorrelation

Before proceeding further, and for the sake of clarification, let's recall some important statistical definitions. The covariance of two samples $X$ and $Y$ of size $N$ is defined as: $$ \mathrm{Cov}(X,Y) = \frac{1}{N-1} \sum_{i=1}^N (X_i - \bar{X})(Y_i - \bar{Y}) $$ The reason the sample covariance has $N-1$ in the denominator rather than $N$ is essentially that the population mean $E(X)$ is unknown and is replaced by the sample mean $\bar {X}$. Similarly, the variance of a sample $X$, or the unbiased sample variance, is equal to $\mathrm{Cov}(X,X)$, or $$ {\sigma^2}(X) = \frac{1}{N-1} \sum_{i=1}^N (X_i - \bar{X})^2 $$ Finally, the correlation between two vectors $X$ and $Y$, which reveals their similarity/dissimilarity, is defined as: $$ \rho(X,Y) = \frac{\mathrm{Cov}(X,Y)}{\sigma(X)\sigma(Y)} = \frac {\sum\limits_{i=1}^{n}(X_{i}-{\bar {X}})(Y_{i}-{\bar {Y}})}{{\sqrt {\sum\limits_{i=1}^{n}(X_{i}-{\bar {X}})^{2}}}{\sqrt {\sum\limits_{i=1}^{n}(Y_{i}-{\bar {Y}})^{2}}}} $$ The figure below demonstrates positive, zero, and negative correlation for three synthetic examples:

import scipy.stats
from scipy.stats import pearsonr
num_points = 20
multipliers = [1, 0, -1]
labels = ['positive', 'no', 'negative']
cols = col_to_hex('coolwarm', 3)
X = np.arange(num_points)
amplitude = 2
noise = amplitude*scipy.stats.uniform(-1, 2).rvs(num_points)
fig, axes = plt.subplots(1, 3, figsize=(14, 4), dpi=200)
plt.subplots_adjust(wspace=0.25)
for i, mp in enumerate(multipliers):
    axes[i].grid(False)
    Y = mp*X + noise
    pearson_corr = pearsonr(X, Y)
    axes[i].scatter(X, Y, s=20, label=f'{labels[i]} correlation ($\\rho=${pearson_corr[0]:.2f})', c=cols[i])
    axes[i].set_xlabel('$X$', fontsize=16, labelpad=5)
    axes[i].set_ylabel('$Y$', fontsize=16)
    axes[i].plot([-5,25], [-5*mp,25*mp], 'r:')
    axes[i].set_xlim([-5,25])
    axes[i].legend(bbox_to_anchor=(0.5,1,0,0), loc='lower center')

In the figure above, $\rho$ is the Pearson correlation coefficient, measuring the relationship between the two variables $X$ and $Y$. We used scipy.stats's pearsonr method to calculate $\rho$. $\rho$ spans the range $[-1,1]$, where $\rho=0$ implies no correlation, and $+1$ and $-1$ imply perfect correlation and anti-correlation, respectively.
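As a quick check that the definitions above match scipy's implementation, the following sketch computes $\rho$ directly from the covariance formula for two arbitrary random vectors:

import numpy as np
from scipy.stats import pearsonr

rng = np.random.default_rng(1)
X = rng.normal(size=100)
Y = 0.5*X + rng.normal(size=100)            # correlated with X by construction
# Pearson correlation from the definition: Cov(X, Y) / (sigma_X * sigma_Y)
rho_manual = np.cov(X, Y, ddof=1)[0, 1] / (np.std(X, ddof=1) * np.std(Y, ddof=1))
print(np.isclose(rho_manual, pearsonr(X, Y)[0]))   # True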

In time series analysis, autocorrelation is the Pearson correlation between values of the process as a function of the two times, or of the time lag. For example, think of a time series $X_t$ with $n=168$ hourly recorded temperatures, $X_t=[x_1,x_2,\dots,x_n]$, so that $X_t$ represents temperatures for 7 days. The data actually shows the hourly temperatures for Manchester, NH from 07/03/2021 19:53:00 to 07/10/2021 18:53:00 (obtained from here).

X_t = [58, 58, 58, 57, 57, 57, 57, 57, 56, 56, 56, 56, 57, 56, 57, 57, 58, 59, 61, 62, 63, 66, 68, 70, 
       71, 69, 70, 68, 66, 63, 60, 58, 58, 58, 58, 58, 58, 58, 59, 59, 60, 62, 64, 68, 70, 72, 74, 76, 
       77, 78, 78, 77, 75, 73, 71, 70, 69, 69, 69, 68, 69, 68, 68, 68, 70, 72, 73, 74, 79, 83, 86, 87, 
       89, 88, 86, 85, 82, 80, 78, 77, 75, 74, 72, 72, 73, 73, 71, 72, 76, 80, 82, 84, 86, 86, 88, 87, 
       87, 87, 84, 82, 82, 78, 75, 72, 71, 70, 69, 70, 69, 69, 64, 63, 63, 64, 65, 67, 67, 70, 70, 68, 
       65, 64, 63, 63, 62, 61, 61, 61, 61, 61, 60, 60, 61, 61, 61, 61, 61, 62, 62, 63, 63, 64, 65, 66, 
       68, 72, 73, 75, 76, 76, 73, 72, 72, 72, 72, 71, 69, 69, 68, 67, 66, 66, 64, 66, 68, 73, 74, 72]
X_t = np.array(X_t)
fig, ax = plt.subplots(1, 1, figsize=(12, 4), dpi=100)
ax.plot(X_t)
func = lambda x, pos: f"{x/24:.0f}"
ax.xaxis.set_major_locator(ticker.MultipleLocator(24))
ax.xaxis.set_major_formatter(ticker.FuncFormatter(func))
ax.set_xlabel('day')
ax.set_ylabel('T($^\circ F$)');

We know that the correlation of $X_t$ with itself gives $\rho(X_t,X_t)=1$.

print(f'{pearsonr(X_t, X_t)[0]:.1f}')
    1.0

One would expect to see a strong correlation between $X_t$ and $X_{t-1}=[x_0,x_1,\dots,x_{n-1}]$ since a temperature $x_i$ is not expected to be significantly different from $x_{i-1}$.

print(f'{pearsonr(X_t[1:], X_t[:-1])[0]:.2f}')
    0.98

Similarly, you would expect to see a relatively strong correlation if we compare $X_t$ and $X_{t-2}$:

print(f'{pearsonr(X_t[2:], X_t[:-2])[0]:.2f}')
    0.94

The autocorrelation function generalizes this idea by calculating $\rho(X_t,X_{t-k})$ for $0\leq k \leq N-1$, where $N$ is the length of the time series. We can demonstrate this by defining a simple function autocorr as:

def autocorr(x,lags):
    '''non partial autocorrelation function'''
    mean=np.mean(x)
    var=np.var(x)
    xp=x-mean
    corr=[1. if l==0 else np.sum(xp[l:]*xp[:-l])/len(x)/var for l in lags]
    return np.array(corr)
fig, ax = plt.subplots(1, 1, figsize=(12, 4), dpi=100)
lags = np.arange(0, len(X_t))
ax.plot(autocorr(X_t, lags))
func = lambda x, pos: f"{x:.0f}"
ax.xaxis.set_major_locator(ticker.MultipleLocator(24))
ax.xaxis.set_major_formatter(ticker.FuncFormatter(func))
ax.set_xlabel('Lag (hours)')
ax.set_ylabel(r'$\rho$');

def corr(x, lags):
    '''autocorrelation based on numpy.corrcoef (uses only the overlapping samples at each lag)'''
    corr = [1. if l==0 else np.corrcoef(x[l:], x[:-l])[0][1] for l in lags]
    return np.array(corr)

from pandas.plotting import autocorrelation_plot
f = plt.figure()
# autocorrelation_plot(temp_data['tempf'])
demeaned = (temp_data['tempf'] - temp_data['tempf'].mean()).values
lags = np.arange(0, len(demeaned)-2)
# compare the two autocorrelation estimates defined above
plt.plot(autocorr(demeaned, lags), 'r:')
plt.plot(corr(demeaned, lags), 'b:')
plt.xlabel('Lag (months)')

(temp_data['tempf']-temp_data['tempf'].mean()).values
    array([-20.26 , -25.894,  -8.825, ...,   4.578, -12.609, -11.931])
fig, axes = plt.subplots(2, 1, figsize=(12, 8), sharex=True, dpi=100)
plt.subplots_adjust(left=0, bottom=0, right=1, top=0.95, wspace=0, hspace=0.02)
axes[0].plot(seasonal_model_residual['error'])
axes[1].plot(seasonal_model_residual['MAE'])

seasonal_model_residual['error'].mean()
    0.04747016383969698
fft_noise = np.fft.fft(seasonal_model_residual['error']-seasonal_model_residual['error'].mean())
plt.stem(np.abs(fft_noise))

temp_data['tempf']
    timestamp  timestamp
    1895       1           42.718
               2           37.084
               3           54.152
               4           63.169
               5           69.307
                            ...  
    2019       8           80.390
               9           80.012
               10          67.555
               11          50.369
               12          51.046
Name: tempf, Length: 1500, dtype: float64
from pandas.plotting import autocorrelation_plot
f = plt.figure()
ax = autocorrelation_plot(temp_data['tempf'])
plt.xlabel('Lag (months)')

fig, axes = plt.subplots(2, 1, figsize=(12, 8), sharex=True, dpi=100)
plt.subplots_adjust(left=0, bottom=0, right=1, top=0.95, wspace=0, hspace=0.02)
seasonal = model_seasonal(temp_data, n_dominant_modes=5)
autocorrelation_plot((temp_data-seasonal)[-20*12:], ax=axes[0])
autocorrelation_plot(np.abs((temp_data-seasonal))[-20*12:], ax=axes[1])
axes[0].set_ylim([-0.3,0.3])
axes[1].set_ylim([-0.3,0.3])
plt.xlabel('Lag (months)')

Cross-validation of the full model

As mentioned previously, cross-validation for time series data is different from that of other machine learning settings. In this project, I use the sliding-window validation technique, which is similar to a train-test split in which the temporal order of observations is preserved. Assuming a time series of size $N$, the sliding-window method uses $N_{tr}$ data points $x_{(i-N_{tr})},x_{(i-N_{tr}+1)},...,x_{(i-1)},x_{(i)}$ to predict the next $N_{ts}$ data points $x_{(i+1)},...,x_{(i+N_{ts}-1)},x_{(i+N_{ts})}$ (a minimal sketch of such a split is shown below). After that, I define a simple function that incorporates the trend and seasonal models to make predictions for future times, given the past data.
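As a quick illustration of the split itself, a minimal sliding-window generator might look like the sketch below (the window sizes used in the example call are arbitrary, and this helper is not used in the rest of the analysis):

import numpy as np

def sliding_window_splits(n_samples, n_train, n_test, step=None):
    """Yield (train_indices, test_indices) pairs that preserve temporal order."""
    step = step or n_test
    for start in range(0, n_samples - n_train - n_test + 1, step):
        yield (np.arange(start, start + n_train),
               np.arange(start + n_train, start + n_train + n_test))

# e.g., 120 monthly observations: train on 48 months, validate on the following 12
for tr_idx, val_idx in sliding_window_splits(120, n_train=48, n_test=12):
    print(tr_idx[0], tr_idx[-1], '->', val_idx[0], val_idx[-1])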

def model_full(ts_data, n_dominant_modes=10, n_pred=24):  
    """
    Given the time series data, predicts the temperature for the next n_pred months
    """
    # train range
    N = len(ts_data)
    t_range_tr = np.arange(0, N).reshape(-1,1)
    linReg_model = LinearRegression(normalize=True, fit_intercept=True) 
    linReg_model.fit(t_range_tr, ts_data)  
    a_, b_ = linReg_model.coef_[0][0], linReg_model.intercept_[0]
    # detrend
    residual = np.ravel((ts_data - (a_*t_range_tr + b_)).values)
    fft = np.fft.fft(residual - residual.mean())
    dominant_freqs = np.argsort((np.abs(fft)[:N//2]))[-n_dominant_modes:][::-1]
    phase_angles = np.angle(fft[dominant_freqs])
    amplitudes = np.abs(fft[dominant_freqs]) / (N/2)
    ts_model_full = np.zeros((N+n_pred, 1))
    # train+valid range
    t_range_val = np.arange(0, N+n_pred)
    for i,freq in enumerate(dominant_freqs):
        ts_model_full += amplitudes[i]*np.cos(2*np.pi*freq*t_range_val/N + phase_angles[i]).reshape(-1,1)
    return ts_model_full + (a_*t_range_val + b_).reshape(-1,1)
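For example, a one-off forecast with this function might look like the following sketch (assuming temp_data is the monthly grouped DataFrame built earlier):

# forecast the next 24 months from the full available history
preds = model_full(temp_data, n_dominant_modes=10, n_pred=24)
forecast = preds[-24:]   # the last n_pred entries are the out-of-sample forecast
print(forecast.shape)    # (24, 1)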

We can now use this function for the purpose of cross-validation. I test the full model by predicting temperatures from 2010-2019 while assessing the influence of multiple parameters:

  • Number of training data, i.e., how much do we know from past.
  • Number of frequency modes included

How far back in time works better?

fig, ax = plt.subplots(6, 1, figsize=(12, 18), dpi=200)
plt.subplots_adjust(left=0, bottom=0, right=1, top=1, wspace=0, hspace=0.35)
state = 'AL'
state_df = climate_df[climate_df['state']==state]
state_df.set_index('timestamp', inplace=True)
temp_data = state_df.groupby([(state_df.index.year),(state_df.index.month)]).mean()
num_modes = [25]
num_years_tr = [10, 30, 50]
n_pred = 12 * 10 # 10 years (2010-2019)
t_range = np.arange(0, len(temp_data)).reshape(-1,1)
xticklabel_steps = [25*12, # every 25-years, 1 xticklabel
                    12, # every year, 1 xticklabel
                   ]
axes_colors = ["#fff6f3","#fffcf4","#f3f6f8"]
xticklabel_start_years = [1895, 2010]
sns.set(rc={'figure.facecolor':"whitesmoke",'axes.edgecolor':"gray"})
for i, n_modes in enumerate(num_modes):
    for j, n_years in enumerate(num_years_tr):
        ts_data_tr = temp_data.iloc[-n_years*12-n_pred : -n_pred, :]
        ts_data_val = temp_data.iloc[-n_pred:, :]
        ts_pred = model_full(ts_data_tr, n_dominant_modes=n_modes, n_pred=n_pred)
        ts_pred_tr = ts_pred[:-n_pred]
        ts_pred_val = ts_pred[-n_pred:]
        mae_tr = np.abs(ts_data_tr - ts_pred_tr).values.mean()
        mae_val = np.abs(ts_data_val - ts_pred_val).values.mean()
        ax[j*2+1].plot(t_range[-n_pred:], ts_pred_val, color='orange')
        ax[j*2+1].plot(t_range[-n_pred:], ts_data_val, color='xkcd:crimson', linestyle=':')
        ax[j*2].plot(t_range[-n_pred:], ts_data_val, color='orange', label=f'val pred (MAE={mae_val:.2f}$^\circ\!F$)')
        ax[j*2].plot(t_range[-n_years*12-n_pred:-n_pred], ts_data_tr, color='blue', label=f'tr pred (MAE={mae_tr:.2f}$^\circ\!F$)')
        ax[j*2].plot(t_range, temp_data.values, color='xkcd:crimson', linestyle=':', label='real data')
        for k,(start_year,step) in enumerate(zip(xticklabel_start_years,xticklabel_steps)):
            func = lambda x, pos: f"{start_year-115+x/12:.0f}"
            ax[j*2+k].xaxis.set_major_locator(ticker.MultipleLocator(step))
            ax[j*2+k].xaxis.set_major_formatter(ticker.FuncFormatter(func))
            ax[j*2+k].set_facecolor(axes_colors[j])
            ax[j*2+k].grid(False)
        ax[j*2].legend(bbox_to_anchor=(1,-0.18,0,0), loc='center left', fontsize=10, facecolor=axes_colors[j])
        ax[j*2].set_title(f'Train and validation data - Model trained using past {n_years} years', fontsize=16, y=1.02)
plt.show()

Assessing the influence of num_modes and train_size

fig, ax = plt.subplots(3, 3, figsize=(12, 12), dpi=100)
plt.subplots_adjust(left=0, bottom=0, right=1, top=1, wspace=0.15, hspace=0.4)
state = 'AL'
state_df = climate_df[climate_df['state']==state]
state_df.set_index('timestamp', inplace=True)
temp_data = state_df.groupby([(state_df.index.year),(state_df.index.month)]).mean()
num_modes = [1, 5, 10]
num_years_tr = [10, 30, 50]
n_pred = 12 * 10 # 10 years (2010-2019)
t_range = np.arange(0, len(temp_data)).reshape(-1,1)
xticklabel_step = 5*12 # every 5 years, 1 xticklabel
axes_colors = col_to_hex('Spectral', len(num_modes)*len(num_years_tr), 0.08)
xticklabel_start_year = 2010
sns.set(rc={'figure.facecolor':"white",'axes.edgecolor':"gray",'legend.facecolor':"whitesmoke"})
for i, n_modes in enumerate(num_modes):
    for j, n_years in enumerate(num_years_tr):
        ts_data_tr = temp_data.iloc[-n_years*12-n_pred : -n_pred, :]
        ts_data_val = temp_data.iloc[-n_pred:, :]
        ts_pred = model_full(ts_data_tr, n_dominant_modes=n_modes, n_pred=n_pred)
        ts_pred_tr = ts_pred[:-n_pred]
        ts_pred_val = ts_pred[-n_pred:]
        mae_tr = np.abs(ts_data_tr - ts_pred_tr).values.mean()
        mae_val = np.abs(ts_data_val - ts_pred_val).values.mean()
        ax[i][j].plot(t_range[-n_pred:], ts_pred_val, 
                       color='orange', 
                       label= f'tr pred (MAE={mae_tr:.2f}$^\circ\!F$) - ' \
                            + f'val pred (MAE={mae_val:.2f}$^\circ\!F$)'
                      )
        ax[i][j].plot(t_range[-n_pred:], ts_data_val, color='xkcd:crimson', linestyle=':', label='real data')
        func = lambda x, pos: f"{xticklabel_start_year-115+x/12:.0f}"
        ax[i][j].xaxis.set_major_locator(ticker.MultipleLocator(xticklabel_step))
        ax[i][j].xaxis.set_major_formatter(ticker.FuncFormatter(func))
        ax[i][j].set_facecolor(axes_colors[i*len(num_years_tr)+j])
        ax[i][j].grid(False)
        ax[i][j].legend(bbox_to_anchor=(0.5,1,0,0), loc='lower center', fontsize=10)
        ax[i][j].set_title(f'Using {n_years} years and {n_modes} #modes', fontsize=16, y=1.18)
plt.suptitle(f'Temperature predictions of {state} for 2010-2019 period using different n_modes and n_years', 
             fontsize=22, y=1.12)
plt.show()

In this specific case (state='AL', n_pred=120 months, and pred_period=2010-2019), training the model using data from the past 30 years and n_modes=10 gives the lowest Mean Absolute Error (MAE) on the validation data. Notice that regardless of the number of modes and years, the highest prediction errors occur in Winter.

fig, ax = plt.subplots(1,1,figsize=(12,12),dpi=100)
sns.set(rc={'figure.facecolor':"whitesmoke"})
ax.grid(False)
n_modes = 10
n_years_tr = 30
ts_data_tr = temp_data.iloc[-n_years_tr*12-n_pred : -n_pred, :]
ts_data_val = temp_data.iloc[-n_pred:, :]
ts_pred = model_full(ts_data_tr, n_dominant_modes=n_modes, n_pred=n_pred)
ts_pred_val = ts_pred[-n_pred:]
# Boxplot
residual_data = ts_data_val - ts_pred_val
residual_data.index.set_names(['year','month'], inplace=True)
residual_data.reset_index(inplace=True)
residual_data.columns = ['year', 'month', 'error']
season_groupby = month_to_season_arr[residual_data.month]
groupby_ = [residual_data.year, season_groupby]
seasonal_residual = residual_data.groupby(groupby_).mean().drop(columns='month')
seasonal_residual.reset_index(inplace=True)
seasonal_residual.columns = ['year', 'season', 'error']
seasonal_residual['error'] = seasonal_residual['error'].apply(lambda x: abs(x))
g = sns.boxplot(x="season", y="error",width=0.8,
             palette="pastel", data=seasonal_residual, 
                  ax=ax, order=['Spring', 'Summer', 'Fall', 'Winter'], showmeans=True)
g.set_xlabel('season', fontsize=14)
g.set_ylabel('MAE', fontsize=14)
plt.suptitle(f'MAE of temperature predictions in {state} from 2010-2019 grouped by season', fontsize=18, y=0.92);
plt.show()

We can perform a more comprehensive search by expanding the search space. The following code snippet finds the 5 most accurate predictions of the validation data for different combinations of n_modes and n_years.

state = 'AL'
state_df = climate_df[climate_df['state']==state]
state_df.set_index('timestamp', inplace=True)
temp_data = state_df.groupby([(state_df.index.year),(state_df.index.month)]).mean()
num_modes = [1, 2, 5, 10, 20, 50, 100]
num_years_tr = [1, 2, 5, 10, 20, 30, 40, 50, 60, 70]
n_pred = 12 * 10 # 10 years (2010-2019)
t_range = np.arange(0, len(temp_data)).reshape(-1,1)
results = []
for i, n_modes in enumerate(num_modes):
    for j, n_years in enumerate(num_years_tr):
        ts_data_tr = temp_data.iloc[-n_years*12-n_pred : -n_pred, :]
        ts_data_val = temp_data.iloc[-n_pred:, :]
        ts_pred = model_full(ts_data_tr, n_dominant_modes=n_modes, n_pred=n_pred)
        ts_pred_tr = ts_pred[:-n_pred]
        ts_pred_val = ts_pred[-n_pred:]
        mae_tr = np.abs(ts_data_tr - ts_pred_tr).values.mean()
        mae_val = np.abs(ts_data_val - ts_pred_val).values.mean()
        results.append([state, n_modes, n_years, mae_tr, mae_val, mae_val-mae_tr])
result_df = pd.DataFrame(data=results,
                         columns = ['state', 'num_modes', 'num_years_tr', 'mae_tr', 'mae_val', 'tr_val_diff'])
result_df.sort_values('mae_val', ascending=True).head()
state num_modes num_years_tr mae_tr mae_val tr_val_diff
35 AL 10 30 1.835 2.526 0.691
27 AL 5 50 2.101 2.538 0.437
15 AL 2 30 2.064 2.569 0.506
13 AL 2 10 2.022 2.571 0.549
25 AL 5 30 1.935 2.571 0.637

Repeating this for all states reveals more insights about how the optimal num_modes and num_years_tr evolve spatially, considering the location of each state.

states_results = []
for state in climate_df.state.unique():
    state_df = climate_df[climate_df['state']==state]
    state_df.set_index('timestamp', inplace=True)
    temp_data = state_df.groupby([(state_df.index.year),(state_df.index.month)]).mean()
    mean_temp = temp_data.values.mean()
    std_temp = temp_data.values.std()  
    num_modes = [2, 10, 50, 100]
    num_years_tr = [2, 10, 50, 100]
    n_pred = 12 * 10 # 10 years (2010-2019)
    t_range = np.arange(0, len(temp_data)).reshape(-1,1)
    results = []
    for i, n_modes in enumerate(num_modes):
        for j, n_years in enumerate(num_years_tr):
            ts_data_tr = temp_data.iloc[-n_years*12-n_pred : -n_pred, :]
            ts_data_val = temp_data.iloc[-n_pred:, :]
            ts_pred = model_full(ts_data_tr, n_dominant_modes=n_modes, n_pred=n_pred)
            ts_pred_tr = ts_pred[:-n_pred]
            ts_pred_val = ts_pred[-n_pred:]
            mae_tr = np.abs(ts_data_tr - ts_pred_tr).values.mean()
            mae_val = np.abs(ts_data_val - ts_pred_val).values.mean()
            results.append([state, mean_temp, std_temp, n_modes, n_years, mae_tr, mae_val, mae_val-mae_tr])
    state_best = pd.DataFrame(data=results,
                              columns=['state', 
                                       'mean_temp', 
                                       'std_temp', 
                                       'num_modes', 
                                       'num_years_tr', 
                                       'mae_tr', 
                                       'mae_val', 
                                       'tr_val_diff']
                             ).sort_values('mae_val', ascending=True) \
                              .iloc[0,:] \
                              .values
    states_results.append(state_best)
all_states_df = pd.DataFrame(data=states_results,
                             columns=['state', 
                                      'mean_temp', 
                                      'std_temp', 
                                       'num_modes', 
                                      'num_years_tr', 
                                      'mae_tr', 
                                      'mae_val', 
                                      'tr_val_diff'])
all_states_df.sort_values('mae_val', ascending=True)
state mean_temp std_temp num_modes num_years_tr mae_tr mae_val tr_val_diff
3 CA 56.557 10.489 2 10 1.673 1.844 0.170
28 NM 53.030 14.197 2 50 1.830 1.922 0.093
7 FL 69.841 9.350 2 50 1.701 1.966 0.265
1 AZ 61.063 13.646 10 50 1.702 1.968 0.266
34 OR 47.886 11.160 2 50 1.945 2.046 0.100
40 TX 64.836 13.604 2 50 1.985 2.091 0.105
44 WA 47.220 11.794 2 50 1.970 2.100 0.131
4 CO 44.107 15.250 2 50 2.141 2.113 -0.028
9 ID 43.753 15.316 2 50 2.413 2.221 -0.192
25 NV 49.748 14.494 2 50 2.366 2.256 -0.109
41 UT 46.122 15.840 10 50 2.152 2.261 0.109
36 RI 49.258 15.440 2 50 1.991 2.279 0.288
18 MA 47.837 15.907 2 50 2.032 2.334 0.302
15 LA 66.474 12.192 2 50 1.995 2.374 0.379
37 SC 62.467 13.241 10 50 1.954 2.375 0.421
30 NC 58.295 13.702 2 50 2.103 2.385 0.282
6 DE 54.343 15.492 2 50 2.203 2.398 0.195
27 NJ 51.821 15.973 2 50 2.190 2.400 0.210
8 GA 63.140 12.783 10 50 1.941 2.401 0.460
5 CT 48.435 16.238 2 50 2.118 2.414 0.296
43 VA 55.455 14.695 2 50 2.196 2.435 0.239
17 MD 54.282 15.572 2 50 2.213 2.448 0.235
16 ME 41.982 17.778 2 100 2.314 2.476 0.162
33 OK 60.098 16.163 2 50 2.420 2.481 0.061
47 WY 41.577 16.629 2 50 2.591 2.503 -0.088
26 NH 43.491 17.483 2 50 2.251 2.525 0.274
2 AR 60.507 14.903 10 50 2.176 2.529 0.352
35 PA 48.654 16.396 2 50 2.437 2.541 0.104
21 MS 63.649 13.313 2 10 2.013 2.542 0.529
0 AL 62.978 13.134 2 10 2.022 2.571 0.549
29 NY 46.209 17.187 2 50 2.435 2.617 0.182
13 KS 54.350 17.846 10 50 2.490 2.667 0.178
45 WV 52.068 15.330 2 50 2.487 2.668 0.181
39 TN 57.676 14.673 2 50 2.387 2.681 0.294
48 AK 31.575 14.669 2 50 2.606 2.687 0.081
19 MI 44.306 17.983 2 10 2.553 2.714 0.160
42 VT 41.730 18.296 2 100 2.564 2.751 0.187
14 KY 55.511 15.648 2 50 2.572 2.758 0.185
24 NE 49.044 18.763 10 50 2.697 2.796 0.098
32 OH 50.627 16.801 2 50 2.684 2.816 0.132
23 MT 41.260 17.082 2 10 2.812 2.849 0.038
22 MO 54.523 17.257 2 50 2.690 2.869 0.178
11 IN 51.603 17.279 2 50 2.752 2.954 0.202
10 IL 52.155 17.867 2 100 2.870 3.012 0.142
46 WI 43.142 19.865 2 10 2.910 3.136 0.226
12 IA 47.692 19.860 2 100 3.181 3.272 0.091
38 SD 44.704 20.619 2 100 3.500 3.322 -0.178
20 MN 41.726 21.786 2 100 3.402 3.373 -0.029
31 ND 39.782 22.189 2 10 3.510 3.412 -0.098
sns.set(rc={'axes.facecolor':'whitesmoke', 'figure.facecolor':'lightgray', 
            'grid.color':'gainsboro', 'grid.linestyle':':'})
fig, ax = plt.subplots(1, 1, figsize=(4, 4), dpi=100)
sns.despine(left=False)
max_count = all_states_df[['num_modes', 'num_years_tr']].value_counts().max()
for i,n_modes in enumerate(num_modes):
    for j,n_years in enumerate(num_years_tr):
        count = len(all_states_df[
                                  (all_states_df['num_modes']==n_modes) 
                                  & 
                                  (all_states_df['num_years_tr']==n_years)
                                 ])
        ax.scatter(i+1, j+1, s=count*10, c='r', alpha=count/max_count);
plt.xlabel('n_modes')
plt.ylabel('n_years')
ax.set_xticks(np.arange(1,len(num_modes)+1))
ax.set_yticks(np.arange(1,len(num_years_tr)+1))
ax.set_xticklabels(num_modes)
ax.set_yticklabels(num_years_tr)
plt.show()

The figure shows that in most cases the (n_modes=2, n_years=50) pair gives the most accurate result. A correlation plot might reveal more interesting details regarding the potential relationships among the variables.

fig, ax = plt.subplots(1, 1, figsize=(14,14))
corr = all_states_df.corr()
g = sns.heatmap(
    corr,
    vmin=-1, vmax=1, center=0,
    cmap='Spectral',
    square=True,
    ax=ax,
    cbar_kws={'shrink': 0.8}
)
plt.setp(ax.get_xticklabels(), rotation=45, horizontalalignment='right')
ax.set_title('Correlation Matrix', fontsize=18, y=1.05)
ax.xaxis.tick_top();