Downtown_Finance_661

Downtown_Finance_661 t1_j96nj5k wrote

ML is a mathematical discipline. You have to read books to dive into it. Collaboration becomes possible once you are useful. Try "Grokking Deep Learning" for a simple introduction to neural networks. Also work through the classical ML tasks (regression, classification, trees) and drill them. This is hard work which cannot be substituted by being part of some community.

Update: Before that, you had better learn the basics of the Python programming language. Find lectures with homework that are not connected with ML itself (16 hours + 40 hours will be enough).


Downtown_Finance_661 t1_j93g1nw wrote

I want to thank the community for this opportunity to ask a simple time series question. Please don't reply "jump out the window" (it is bad advice from a statistical PoV since I'm on the second floor).

I'm new to time series in particular and to ML in general. I have tried the ARDL model with no seasonal part and no exogenous variables (from statsmodels.tsa.api import ARDL). I'm working with a very small dataset of 16 points (see Appendix 1) with a strong trend component.

This TS is stationary according to the adfuller test, even though it is not stationary by simple criteria like "the moving average has to be roughly constant". I'm not sure this test is even applicable to such a small number of points.

Imagine I want to forecast the next nine (sic!) points and I have no idea how to choose the best number of lags. Hence I fit my model for several different numbers of lags on the TS[:-9] dataset (train set) and choose the best one by comparing MAE/MSE/R2 on the TS[-9:] dataset (test set). The best value is lags = 1.
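The hold-out procedure described above can be sketched without statsmodels. This is not the ARDL implementation: it is a plain least-squares autoregression with hypothetical helper names (fit_ar, forecast), compared on MAE only and for lags 1 and 2 only, since 7 training points cannot support more.

```python
import numpy as np

ts = np.array([87287, 86649, 86036, 85394, 84845, 84542, 84034, 83881,
               83414, 83035, 82656, 82280, 81654, 81745, 81614, 81367],
              dtype=float)

def fit_ar(y, p):
    """Least-squares fit of y[t] = c + a_1*y[t-1] + ... + a_p*y[t-p]."""
    rows = [np.concatenate(([1.0], y[t - p:t][::-1])) for t in range(p, len(y))]
    coef, *_ = np.linalg.lstsq(np.array(rows), y[p:], rcond=None)
    return coef  # [c, a_1, ..., a_p]

def forecast(y, coef, steps):
    """Recursive forecast: every prediction is fed back in as a lag."""
    p = len(coef) - 1
    hist = list(y)
    for _ in range(steps):
        lags = hist[-1:-p - 1:-1]  # most recent value first
        hist.append(coef[0] + float(np.dot(coef[1:], lags)))
    return np.array(hist[len(y):])

train, test = ts[:-9], ts[-9:]
mae = {}
for p in (1, 2):
    pred = forecast(train, fit_ar(train, p), len(test))
    mae[p] = float(np.mean(np.abs(pred - test)))

best = min(mae, key=mae.get)
print(mae, "best lag by MAE:", best)
```

With a single 7/9 split the ranking rests on one comparison, so a small difference in MAE between lags should not be over-interpreted.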

Despite all the ugliness of forecasting 9 points from 16-9=7 points, the prediction plot fits the test data plot well. This result convinced me to go further (from common mathematical sense).

Now I have to decide whether:

(1) to use the model above (trained on the TS[:-9] set) to predict the TS[16:26] values, given that it has a very good R2 on nine predictions,

(2) or to refit the lags = 1 model on all my points (TS[:]), but without the chance to test it on nine predictions.

I have no idea how to choose the best option, so I decided to study the convergence of the model's coefficients (m.params). My plan is to fit nine models on nine sets TS[:-9], TS[:-8], TS[:-7], ..., up to the full TS[:], and to check whether a and b in the consecutive models y(t) = a*y(t-1) + b tend to converge to two constants a_lim, b_lim. They do not. Not even close to convergence. They look random... This is the end, I don't know how to choose.
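A minimal sketch of this expanding-window check, assuming an ordinary least-squares fit of y(t) = a*y(t-1) + b via np.polyfit instead of ARDL. One practical detail: a slice written literally as TS[:-0] is empty in Python, so the windows below are cut by length rather than by negative index.

```python
import numpy as np

ts = np.array([87287, 86649, 86036, 85394, 84845, 84542, 84034, 83881,
               83414, 83035, 82656, 82280, 81654, 81745, 81614, 81367],
              dtype=float)

n = len(ts)
params = {}
for drop in range(9, -1, -1):      # drop the last 9, 8, ..., 0 points
    y = ts[:n - drop]              # ts[:-0] would be empty, so slice by length
    a, b = np.polyfit(y[:-1], y[1:], 1)  # slope a, intercept b
    params[len(y)] = (a, b)

for size, (a, b) in params.items():
    print(f"{size:2d} points: a = {a:.4f}, b = {b:.1f}")
```

In a fit like this a and b are strongly tied together (a small change in a is compensated by a large change in b), which is one possible reason the pair looks unstable on so few points.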

My very last idea was to freeze b = constant for all nine models and retest the convergence of a under this restriction, but I see no such option in ARDL (and to be honest I have no idea how to program an ARDL-like function myself, even for lag = 1).
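For lag = 1 the restricted fit can be written by hand. With b frozen, minimising the sum of (y(t) - b - a*y(t-1))^2 over a gives a = sum(y(t-1)*(y(t) - b)) / sum(y(t-1)^2). The sketch below uses a hypothetical helper fit_a_fixed_b, and b_fixed = 0.0 is an arbitrary illustrative value, not a recommendation.

```python
import numpy as np

ts = np.array([87287, 86649, 86036, 85394, 84845, 84542, 84034, 83881,
               83414, 83035, 82656, 82280, 81654, 81745, 81614, 81367],
              dtype=float)

def fit_a_fixed_b(y, b):
    """Least-squares a in y[t] = a*y[t-1] + b with b held fixed."""
    x = y[:-1]          # lagged values y[t-1]
    target = y[1:] - b  # what a*y[t-1] has to explain
    return float(np.dot(x, target) / np.dot(x, x))

b_fixed = 0.0  # assumption: illustrative value only
n = len(ts)
a_by_size = {n - drop: fit_a_fixed_b(ts[:n - drop], b_fixed)
             for drop in range(9, -1, -1)}

for size, a in a_by_size.items():
    print(f"{size:2d} points: a = {a:.5f}")
```

With b frozen there is only one free parameter left, so a can no longer trade off against the intercept and should move much less from window to window.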

My question is: any ideas what I can and should do?

Btw, in Appendix 2 I have tried to study the coefficients' convergence for the function:

f[i] = 1.01*f[i-1]+0.01+random noise

I see some problems with convergence even in this scenario.
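One thing worth checking in this toy problem is the scale of the noise relative to b. Without noise the recursion has the closed form f[i] = 2*1.01^(i-1) - 1 for i >= 1 (the fixed point of f = 1.01*f + 0.01 is -1), so the series grows to hundreds of millions, and a noise standard deviation defined as 0.01% of the range (as in Appendix 2) ends up in the tens of thousands. The sketch below only verifies that arithmetic; variable names are mine.

```python
a, b, n = 1.01, 0.01, 2000

# Same recursion as in Appendix 2, without noise
f = [1.0, 1.0]
for i in range(2, n):
    f.append(a * f[i - 1] + b)

# Closed form: f[i] - fp = a**(i-1) * (f[1] - fp), where fp = b / (1 - a) = -1
fixed_point = b / (1 - a)
closed = [(f[1] - fixed_point) * a ** (i - 1) + fixed_point for i in range(1, n)]

# Noise scale used in Appendix 2: 0.01% of the data range
noise_std = (max(f) - min(f)) * 0.0001

print(f"last value      = {f[-1]:.3e}")
print(f"noise std       = {noise_std:.3e}")
print(f"noise std / b   = {noise_std / b:.3e}")
```

Since the noise is millions of times larger than b = 0.01, and also far larger than the early values of the series themselves, it seems plausible that the intercept cannot be recovered and that fits on short prefixes mostly see noise.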

Appendix 1: Demographic data (fact)

year
2006-01-01    87287
2007-01-01    86649
2008-01-01    86036
2009-01-01    85394
2010-01-01    84845
2011-01-01    84542
2012-01-01    84034
2013-01-01    83881
2014-01-01    83414
2015-01-01    83035
2016-01-01    82656
2017-01-01    82280
2018-01-01    81654
2019-01-01    81745
2020-01-01    81614
2021-01-01    81367
Name: num_of_citizens, dtype: int64


Appendix 2: convergence in model task

import numpy as np
import pandas as pd
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score
from statsmodels.tsa.api import ARDL

# Generate data: f[i] = 1.01*f[i-1] + 0.01
f = [1, 1]
for i in range(2, 2000):
    f.append(1.01 * f[i - 1] + 0.01)
print(len(f))
df = pd.DataFrame({'fib_num': f})
df.head(10)
#df.plot(subplots=True, layout=(1,1), legend = True, figsize = (7,7))

# Add Gaussian noise with std equal to 0.01% of the data range
std = (max(f) - min(f)) * 0.0001
f_noise = [x + np.random.normal(loc=0, scale=std) for x in f]
print(f'Max = {max(f_noise)}, Min = {min(f_noise)}')
df_noise = pd.DataFrame({'fib_num_noise': f_noise})
#df_noise.plot(subplots=True, layout=(1,1), legend = True, figsize = (5,5))
df = df_noise.rename(columns={'fib_num_noise': 'fib_num'})

# Refit ARDL(lags=1) on growing prefixes; store coefficients and in-sample metrics
fib_par = {}
r2s = []
mae = []
rmse = []
for k in range(15, df.shape[0]):
    partial_set = np.asarray(df['fib_num'][0:k])
    m = ARDL(partial_set, lags=1)
    mfitted = m.fit()
    # Skip the first two points (the first one has no lagged value to predict from)
    partial_set_pred = (mfitted.predict(start=0, end=k - 1))[2:]
    r2s.append(r2_score(partial_set[2:], partial_set_pred))
    mae.append(mean_absolute_error(partial_set[2:], partial_set_pred))
    rmse.append(np.sqrt(mean_squared_error(partial_set[2:], partial_set_pred)))
    fib_par[k] = mfitted.params

# Print one of the last coefficient sets in the coef dict:
print(fib_par[df.shape[0] - 20])

# This plots 'a' (Y = a*Y + b); change the condition to v != 1 to plot 'b'
for v in range(len(fib_par[15])):
    if v != 0:
        # Name the series at construction instead of chaining rename(..., inplace=True)
        pd.Series([x[v] for x in fib_par.values()], name=v).plot(legend=True, figsize=(25, 7), title='Model coeffs')

edf = pd.DataFrame({'r2score': r2s, 'mae': mae, 'rmse': rmse}).iloc[:200]
edf.plot(legend=True, figsize=(15, 7), subplots=True, layout=(3, 1), title='Model quality params')
