Submitted by fromnighttilldawn t3_y11a7r in MachineLearning
nickkon1 t1_is09o1x wrote
Reply to comment by popcornn1 in [D] Looking for some critiques on recent development of machine learning by fromnighttilldawn
A lot of papers, articles, and YouTube videos on time series start from the same premise:
Our data is dependent on time. Not only does new data come in regularly, the coefficients of our model may also change over time, and features that were important in 2020 (e.g. the number of people ill with COVID) are less relevant now in 2022. To combat that, you retrain your model at regular intervals. Let us retrain our model daily.
That is totally fine and a sensible approach.
The key is: How far into the future do you want to predict something?
Because a lot of Medium, Towards Data Science, and plenty of other blogs do exactly this: let us try to predict the 7-day return of a stock.
To train a new model today at t_{n}, I need data from the next week. But since I can't look into the future, I do not know the future 7-day return of my stock, so I don't have my y variable. The same holds for time step t_{n-1} and so on, until I reach time step t_{n-prediction window}. Only there can I calculate the future 7-day return of my stock with today's information.
This means that the last data point of my training data always lags my evaluation date by 7 days.
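A minimal pandas sketch of this (the prices, dates, and the 7-row horizon are all illustrative, not from the original post): the forward-return label is simply undefined for the last #{prediction window} rows, because the price 7 days ahead has not been observed yet.

```python
import numpy as np
import pandas as pd

# Toy daily price series -- values are random, purely for illustration.
rng = np.random.default_rng(0)
dates = pd.bdate_range("2022-01-03", periods=30)
prices = pd.Series(100 * np.cumprod(1 + rng.normal(0, 0.01, len(dates))),
                   index=dates)

window = 7  # prediction horizon in rows

# Forward 7-day return: needs the price 7 rows ahead, hence shift(-window).
y = prices.shift(-window) / prices - 1

# The last `window` rows have no label yet -- the future is unobserved.
assert y.tail(window).isna().all()
assert y.iloc[:-window].notna().all()
```

So the freshest fully labeled training sample is always `window` rows behind today.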
The issue is: this becomes a problem only at your most recent data points (specifically the last #{prediction window} of them). Since you are creating a blog, publishing a paper... who cares? You don't actually use that model daily for your business anyway. But: in your backtest you can still train right up to the present, iterating through each time step t_{i}, taking the last 2 years of training data up until t_{i}, and making your prediction.
Your backtest is suddenly a lot better, your error becomes smaller, BAM, 80% accuracy on a stock prediction! You beat the live-tested performance of your competition! It is a great achievement, so let us write a paper about it! But the reality is: your model is unusable in a live setting, and the errors you reported from your backtest are wrong. The reason is a subtle way of accidentally giving your model information about the future: throughout the whole backtest, you retrained your model's parameters at time t_{i} with data about your target variable from t_{i+1} to t_{i+prediction_window-1}. You need a gap between your training data and your validation/test data.
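A sketch of the leak-free version of such a walk-forward backtest, under assumed names: a toy DataFrame `df` with one feature `x` and a `price` column, and an ordinary least-squares fit standing in for whatever model you actually use. The key line is that the training slice stops `window` rows before the prediction date, so every training label is fully observed at prediction time.

```python
import numpy as np
import pandas as pd

# Hypothetical data: one feature and a price series driven by it plus noise.
rng = np.random.default_rng(1)
n = 400
df = pd.DataFrame({"x": rng.normal(size=n)})
df["price"] = 100 * np.cumprod(1 + 0.001 * df["x"] + rng.normal(0, 0.01, n))

window = 7  # 7-day prediction horizon
df["y"] = df["price"].shift(-window) / df["price"] - 1  # forward 7-day return

train_len = 250  # roughly one trading year of history

preds = []
for i in range(train_len + window, n - window):
    # The leaky version would slice up to row i; the correct slice stops
    # `window` rows earlier, leaving a gap before the prediction date.
    train = df.iloc[i - train_len - window : i - window]
    assert train["y"].notna().all()  # every label is fully observed at i

    # Stand-in model: least-squares fit of y on x.
    slope, intercept = np.polyfit(train["x"], train["y"], 1)
    preds.append((i, slope * df["x"].iloc[i] + intercept))
```

Out-of-the-box tooling does the same thing: scikit-learn's `TimeSeriesSplit` has a `gap` parameter for exactly this purpose.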
Specifically in numbers (YYYY-MM-DD):
Wrong:
Training: 2020-10-10 - 2022-10-10
You try to retrain your model on 2022-10-10 and make a prediction on that date.
Correct:
Training: 2020-10-03 - 2022-10-03
You retrain your model on 2022-10-10 and make a prediction on that date. Notice that the last data point of your training data is not today, but today - #{prediction window}.
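The date arithmetic above can be sketched with pandas Timestamps (the 2-year lookback is taken from the comment's own example; the variable names are mine):

```python
import pandas as pd

window = pd.Timedelta(days=7)           # prediction horizon
eval_date = pd.Timestamp("2022-10-10")  # today: retrain and predict here

# Wrong: training data runs right up to the evaluation date.
wrong_train_end = eval_date                       # 2022-10-10

# Correct: stop the training data `window` before the evaluation date,
# so every 7-day-ahead label in it is fully observed today.
train_end = eval_date - window                    # 2022-10-03
train_start = train_end - pd.Timedelta(days=730)  # ~2 years of history

print(train_start.date(), "->", train_end.date())  # 2020-10-03 -> 2022-10-03
```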