nickkon1 t1_irxid6a wrote

> ML used to be very cautious about data leakage, but this is simply ignored in most cases when it's about those models.

I work on economic stuff. Either I am super unlucky or the number of papers that have data leakage is incredibly high. A decent chunk of papers that try to predict some macroeconomic data one quarter ahead don't leave a gap of one quarter between their training date and the prediction. Their backtest is awesome, the error is small, nice, a new paper! But it can't be used in production: how can I train a model on 01.09.2022 if I need the data from 1st Oct to 31st Dec for my target value?

It is incredibly frustrating. There have been papers, master's theses and even a dissertation that did this. I have stopped trusting anything without code/data.

16

scarynut t1_irxshd1 wrote

I noticed this in a lot of YouTube stock prediction tutorials. Made me conclude that people are idiots. Shocking that this mistake makes its way into papers.

7

popcornn1 t1_is03bja wrote

Sorry, but I cannot understand your comment. What do you mean by "don't leave a gap"? So how do they make the forecast? Training data from January 2021 to December 2021 and then a forecast for October 2021 to December 2021?

1

nickkon1 t1_is09o1x wrote

A lot of papers, articles, and YouTube videos on time series start from this premise:
Our data is dependent on time. Not only does new data come in regularly, it might also happen that the coefficients of our model change over time, and features that were important in 2020 (e.g. the number of people who are ill with covid) are less relevant now in 2022. To combat that, you retrain your model at regular intervals. Let us retrain our model daily.
That is totally fine and a sensible approach.

The key is: How far into the future do you want to predict something?


Because this is what a lot of posts on Medium, Towards Data Science, and plenty of other blogs do: let us try to predict the 7-day return of a stock.

To train a new model today at t_{n}, I need data from the next week. But since I can't see into the future and do not know the future 7-day return of my stock, I don't have my y variable. The same holds for time step t_{n-1} and so on until I reach time step t_{n-prediction_window}. That is the most recent time step for which I can calculate the 7-day return with today's information.
This means that the last data point of my training data always lags my evaluation date by 7 days.
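A minimal sketch of why the labels run out, assuming a made-up price series and a hypothetical helper `forward_returns` (neither is from any of the blogs mentioned). It only illustrates that the last `horizon` time steps have no target yet:

```python
def forward_returns(prices, horizon):
    """Forward return over `horizon` steps for each time step.

    Returns None where the future price has not been observed yet.
    """
    labels = []
    for i in range(len(prices)):
        j = i + horizon
        if j < len(prices):
            labels.append(prices[j] / prices[i] - 1.0)
        else:
            labels.append(None)  # target lies in the future, so no y variable
    return labels


# Made-up daily closes for illustration only; the last entry is "today" (t_n).
prices = [100.0 + k for k in range(10)]
horizon = 7

labels = forward_returns(prices, horizon)

# Most recent time step that can be used for training as of today.
last_usable = max(i for i, y in enumerate(labels) if y is not None)
print(labels)
print(last_usable)
```

Here `last_usable` is the index of today minus `horizon`, which is the 7-day lag described above.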

The catch is: in live use, this is only a problem for your most recent data points (specifically the last #{prediction_window} data points), because their targets do not exist yet. Since you are writing a blog post or publishing a paper... who cares? You don't really use that model daily for your business anyway. But in a backtest nothing stops you from including them: when you iterate through each time step t_{i}, take the last 2 years of training data up until t_{i} and make your prediction, those targets are already sitting in your historical data.

Your backtest is suddenly a lot better, your error becomes smaller, BAM 80% accuracy on a stock prediction! You beat the live-tested performance of your competition! It is a great achievement, so let us write a paper about it! But the reality is: your model is actually unusable in a live setting and the errors you reported from your backtest are wrong. The reason is a subtle way of accidentally giving your model info about the future. Throughout the whole backtest you have retrained your model's parameters at time t_{i} with data about your target variable at t_{i+1} to t_{i+prediction_window-1}. You need a gap between your training data and validation/test data.
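To make the gap concrete, here is a rough sketch of picking training indices in a walk-forward backtest. The helpers `train_indices` and `leaks_future` and all numbers are hypothetical, and samples are assumed to be indexed by integer time step with a label that needs data up to `t + horizon`:

```python
def train_indices(i, train_size, gap):
    """Training sample indices used when predicting at time step i.

    gap=0 is the leaky setup (training runs right up to i);
    gap=horizon keeps only samples whose label was fully observed by i.
    """
    end = i - gap                          # last training index (inclusive)
    start = max(0, end - train_size + 1)
    return list(range(start, end + 1))


def leaks_future(i, train_idx, horizon):
    """True if any training label needs data observed after time step i."""
    return any(t + horizon > i for t in train_idx)


horizon = 7       # 7-step prediction window
i = 100           # time step at which we retrain and predict
train_size = 20   # illustrative rolling window length

leaky = train_indices(i, train_size, gap=0)
clean = train_indices(i, train_size, gap=horizon)

print(leaky[-1], leaks_future(i, leaky, horizon))
print(clean[-1], leaks_future(i, clean, horizon))
```

If you use scikit-learn, `TimeSeriesSplit` has a `gap` parameter that serves a similar purpose; whether it fits depends on how your samples and targets are laid out.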

Specifically in numbers (YYYY-MM-DD):
Wrong:
Training: 2020-10-10 - 2022-10-10
You retrain your model on 2022-10-10 and make a prediction on that date.

Correct:
Training: 2020-10-03 - 2022-10-03
You retrain your model on 2022-10-10 and make a prediction on that date. Notice that the last data point of your training data is not today, but today - #{prediction_window}.
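The same dates can be reproduced with a few lines of date arithmetic. This is only a sketch: `training_window` is a hypothetical helper, and the 730-day lookback is an assumption standing in for "the last 2 years":

```python
from datetime import date, timedelta


def training_window(today, horizon_days, lookback_days):
    """Start and end date of the training data when retraining on `today`."""
    train_end = today - timedelta(days=horizon_days)   # today - prediction window
    train_start = train_end - timedelta(days=lookback_days)
    return train_start, train_end


start, end = training_window(date(2022, 10, 10), horizon_days=7, lookback_days=730)
print(start, end)
```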

4