Submitted by AutoModerator t3_110j0cp in MachineLearning
Downtown_Finance_661 t1_j93g1nw wrote
I want to thank the community for the chance to ask a simple time series question. Please don't reply "jump out of the window" (it is bad advice from a statistical PoV, since I'm on the second floor).
I'm new to time series in particular and to ML in general. I have tried an ARDL model with no seasonal part and no exogenous variables (from statsmodels.tsa.api import ARDL). I'm working with a very small dataset of 16 points (see Appendix 1) with a strong trend component.
This TS is stationary according to the adfuller test, even though it is not stationary by simple criteria like "the moving average has to be roughly constant". I'm not sure this test is even applicable to such a small number of points.
Imagine I want to forecast the next nine (sic!) points and I have no idea how to choose the best number of lags. So I fit my model for several different lag values on TS[:-9] (train set) and choose the best one by comparing MAE/MSE/R2 on TS[-9:] (test set). The best choice is lags = 1.
Despite all the ugliness of forecasting 9 points from 16-9=7 points, the prediction plot fits the test data plot well. This result convinced me to go further (despite common mathematical sense).
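For reference, here is a minimal sketch of the train/test check described above. It assumes a hand-rolled least-squares fit of y(t) = a*y(t-1) + b as a stand-in for ARDL(lags=1) (statsmodels may differ in details), and it assumes the nine test points are forecast recursively from the last training value. The helper names fit_ar1 and forecast_ar1 are made up for illustration; the data is the series from Appendix 1.

```python
# Series from Appendix 1 (num_of_citizens, 2006..2021).
TS = [87287, 86649, 86036, 85394, 84845, 84542, 84034, 83881,
      83414, 83035, 82656, 82280, 81654, 81745, 81614, 81367]


def fit_ar1(series):
    """Least-squares fit of y(t) = a*y(t-1) + b (hypothetical helper)."""
    x = series[:-1]          # y(t-1)
    y = series[1:]           # y(t)
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    a = sxy / sxx
    b = mean_y - a * mean_x
    return a, b


def forecast_ar1(last_value, a, b, steps):
    """Recursive forecast: each prediction is fed back in as the next input."""
    predictions = []
    value = last_value
    for _ in range(steps):
        value = a * value + b
        predictions.append(value)
    return predictions


train, test = TS[:-9], TS[-9:]
a, b = fit_ar1(train)
pred = forecast_ar1(train[-1], a, b, steps=len(test))
mae = sum(abs(p - t) for p, t in zip(pred, test)) / len(test)
print(f"a = {a:.4f}, b = {b:.1f}, MAE on the 9 held-out points = {mae:.1f}")
```

Only six (y(t-1), y(t)) pairs go into this fit, so the numbers it prints should be read with that in mind.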
Now I have to decide:
(1) either use the model above (trained on TS[:-9]), which has a very good R2 on nine predictions, to predict the TS[16:26] values,
(2) or refit the lags = 1 model on all my points (TS[:]), but without the chance to test it on nine predictions.
And I have no idea how to choose the best option, so I decided to study the convergence of the model's coefficients (m.params). My plan is to fit models on the growing sets TS[:-9], TS[:-8], TS[:-7], ..., TS[:] and to check whether a and b in the consecutive models y(t) = a*y(t-1) + b tend to converge to two constants a_lim, b_lim. They do not. Not even close to convergence. They look random... This is the end, I don't know how to choose.
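The refit-on-growing-prefixes plan can be sketched as below. This is not the ARDL code itself: it assumes a plain least-squares fit of y(t) = a*y(t-1) + b via a made-up helper fit_ar1, and it uses explicit end indices because TS[:-0] is an empty slice in Python. That gives ten prefixes, from 7 to 16 points.

```python
# Series from Appendix 1 (num_of_citizens, 2006..2021).
TS = [87287, 86649, 86036, 85394, 84845, 84542, 84034, 83881,
      83414, 83035, 82656, 82280, 81654, 81745, 81614, 81367]


def fit_ar1(series):
    """Least-squares fit of y(t) = a*y(t-1) + b (hypothetical helper)."""
    x = series[:-1]
    y = series[1:]
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    a = sxy / sxx
    return a, mean_y - a * mean_x


# Prefix lengths 7, 8, ..., 16; the last one is the full series.
results = []
for end in range(len(TS) - 9, len(TS) + 1):
    a, b = fit_ar1(TS[:end])
    results.append((end, a, b))
    print(f"n = {end:2d}  a = {a:.4f}  b = {b:10.1f}")
```

Printing a and b side by side makes it easy to see whether they move together from one prefix to the next.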
My very last idea was to freeze b = constant for all of these models and retest the convergence of a under this restriction, but I see no such option in ARDL (and to be honest I have no idea how to program an ARDL-like function myself, even for lag = 1).
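For lag = 1 the restricted fit is small enough to write by hand. A minimal sketch, assuming ordinary least squares: with b frozen, minimizing the sum of (y(t) - b - a*y(t-1))^2 over a gives a = sum(y(t-1) * (y(t) - b)) / sum(y(t-1)^2). The function name fit_ar1_fixed_b and the value B_FIXED = 7000 are hypothetical and chosen only for illustration.

```python
def fit_ar1_fixed_b(series, b):
    """Least-squares estimate of a in y(t) = a*y(t-1) + b with b held fixed."""
    x = series[:-1]          # y(t-1)
    y = series[1:]           # y(t)
    numerator = sum(xi * (yi - b) for xi, yi in zip(x, y))
    denominator = sum(xi * xi for xi in x)
    return numerator / denominator


# Sanity check on an exact series y(t) = 0.5*y(t-1) + 2: the fit recovers a = 0.5.
check_a = fit_ar1_fixed_b([10.0, 7.0, 5.5, 4.75, 4.375], 2.0)
print(f"sanity check: a = {check_a}")

# Series from Appendix 1 (num_of_citizens, 2006..2021).
TS = [87287, 86649, 86036, 85394, 84845, 84542, 84034, 83881,
      83414, 83035, 82656, 82280, 81654, 81745, 81614, 81367]

B_FIXED = 7000.0  # hypothetical frozen intercept, for illustration only

# Refit a on prefixes of 7..16 points with the same frozen b.
a_by_size = {}
for end in range(len(TS) - 9, len(TS) + 1):
    a_by_size[end] = fit_ar1_fixed_b(TS[:end], B_FIXED)
    print(f"n = {end:2d}  a = {a_by_size[end]:.5f}")
```

How stable a looks will depend on the b that is frozen, so it may be worth repeating the loop for a few different values of B_FIXED.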
My question is: any ideas what I can and should do?
Btw, in Appendix 2 I have tried to study the convergence of the coefficients for the function:
f[i] = 1.01*f[i-1]+0.01+random noise
I see some problems with convergence even in this scenario.
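One thing that may be worth checking in this toy setup: the noise std in Appendix 2 is tied to the full range of an exponentially growing series, so it can be far larger than both b = 0.01 and the early values. A quick sketch that assumes nothing beyond the generator in Appendix 2 (the variable names are mine):

```python
# Rebuild the noise-free series exactly as in Appendix 2.
f = [1, 1]
for i in range(2, 2000):
    f.append(1.01 * f[i - 1] + 0.01)

# For i >= 1 the recursion has the closed form f[i] = 2 * 1.01**(i - 1) - 1,
# because f[i] + 1 = 1.01 * (f[i - 1] + 1) and f[1] = 1.
closed_form_error = max(
    abs(f[i] - (2 * 1.01 ** (i - 1) - 1)) / f[i] for i in range(1, len(f))
)

# Same noise scale as in Appendix 2: 0.01% of the full range of the series.
noise_std = (max(f) - min(f)) * 0.0001

# Count the points whose noise-free value is smaller than one noise std.
below_noise = sum(1 for value in f if value < noise_std)

print(f"max relative error of the closed form: {closed_form_error:.2e}")
print(f"noise std = {noise_std:.1f}, intercept b = 0.01")
print(f"points smaller than one noise std: {below_noise} of {len(f)}")
```

If many early points sit below the noise level, fits on short prefixes are mostly fitting noise, which could explain at least part of the unstable coefficients.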
Appendix 1: Demographic data (fact)
year
2006-01-01 87287
2007-01-01 86649
2008-01-01 86036
2009-01-01 85394
2010-01-01 84845
2011-01-01 84542
2012-01-01 84034
2013-01-01 83881
2014-01-01 83414
2015-01-01 83035
2016-01-01 82656
2017-01-01 82280
2018-01-01 81654
2019-01-01 81745
2020-01-01 81614
2021-01-01 81367
Name: num_of_citizens, dtype: int64
Appendix 2: convergence on a toy problem
import numpy as np
import pandas as pd
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score
from statsmodels.tsa.api import ARDL

# generate data: f[i] = 1.01*f[i-1] + 0.01
f = [1, 1]
for i in range(2, 2000):
    f.append(1.01 * f[i - 1] + 0.01)
print(len(f))
df = pd.DataFrame({'fib_num': f})
df.head(10)
#df.plot(subplots=True, layout=(1,1), legend = True, figsize = (7,7))

# add Gaussian noise with std equal to 0.01% of the range of the series
std = (max(f) - min(f)) * 0.0001
f_noise = [x + np.random.normal(loc=0, scale=std) for x in f]
print(f'Max = {max(f_noise)}, Min = {min(f_noise)}')
df_noise = pd.DataFrame({'fib_num_noise': f_noise})
#df_noise.plot(subplots=True, layout=(1,1), legend = True, figsize = (5,5))
df = df_noise.rename(columns={'fib_num_noise': 'fib_num'})

# refit ARDL(lags=1) on growing prefixes; store coefficients and in-sample metrics
fib_par = {}
r2s = []
mae = []
rmse = []
for k in range(15, df.shape[0]):
    partial_set = np.asarray(df['fib_num'][0:k])
    m = ARDL(partial_set, lags=1)
    mfitted = m.fit()
    partial_set_pred = (mfitted.predict(start=0, end=k - 1))[2:]
    r2s.append(r2_score(partial_set[2:], partial_set_pred))
    mae.append(mean_absolute_error(partial_set[2:], partial_set_pred))
    rmse.append(np.sqrt(mean_squared_error(partial_set[2:], partial_set_pred)))
    fib_par[k] = mfitted.params

# print one of the last coeff-s in coef dict:
print(fib_par[df.shape[0] - 20])
# this is plot for 'a' (Y = a*Y + b); change to != 1 to see plot for 'b'
for v in range(len(fib_par[15])):
    if v != 0:
        # name the series directly: rename(..., inplace=True) returns None, so .plot() would fail
        pd.Series([x[v] for x in fib_par.values()], name=v).plot(legend=True, figsize=(25, 7), title='Model coeffs')
edf = pd.DataFrame({'r2score': r2s, 'mae': mae, 'rmse': rmse}).iloc[:200]
edf.plot(legend=True, figsize=(15, 7), subplots=True, layout=(3, 1), title='Model quality params')