https://preview.redd.it/c59sra8nwb3a1.png?width=1190&format=png&auto=webp&s=80b3f1a83d190ac0349ec97908aa806aaa03abc3

Machine learning progress is plagued by the conflict between competing ideas, with no shortage of failed reviews, underdelivering models, and failed investments in expensive over-engineered solutions.

We don't subscribe the Deep Learning hype for time series and present a fully reproducible experiment that shows that:

A simple statistical ensemble outperforms most individual deep-learning models.
A simple statistical ensemble is 25,000 faster and only slightly less accurate than an ensemble of deep learning models.

In other words, deep-learning ensembles outperform statistical ensembles just by 0.36 points in SMAPE. However, the DL ensemble takes more than 14 days to run and costs around USD 11,000, while the statistical ensemble takes 6 minutes to run and costs $0.5c.

For the 3,003 series of M3, these are the results.

https://preview.redd.it/89bhlcg9wb3a1.png?width=1678&format=png&auto=webp&s=e5471331b081142ba201b81ba3346a890d474c50

In conclusion: in terms of speed, costs, simplicity and interpretability, deep learning is far behind the simple statistical ensemble. In terms of accuracy, they are rather close.

You can read the full report and reproduce the experiments in this Github repo: https://github.com/Nixtla/statsforecast/tree/main/experiments/m3

Comments

You must log in or register to comment.

picardythird t1_iyj138u wrote on December 1, 2022 at 7:35 PM

IIRC there was a recent paper that demonstrated how almost all deep learning approaches for time series forecasting use flawed evaluation procedures, resulting in misleading claims about performance and terrible out-of-distribution performance.

whatsafrigger t1_iyjaub9 wrote on December 1, 2022 at 8:37 PM

It's so so so important to set up good experiments with solid baselines and comparisons to other methods.

notdelet t1_iyjhvqd wrote on December 1, 2022 at 9:21 PM

If you use a flawed evaluation procedure, does a solid baseline do you any good?

Ulfgardleo t1_iylr164 wrote on December 2, 2022 at 9:30 AM

The "and" in the post you replied to was a logical "and". The best evaluation procedure does not help if you use poor, underperforming baselines.

csreid t1_iykq7xn wrote on December 2, 2022 at 2:48 AM

And it's sometimes kinda hard to realize you're doing a bad job, especially if your bunk experiments give good results

I didn't have a ton of guidance when I was writing my thesis (so, my first actual research work) and was so disheartened when I realized my excellent groundbreaking results were actually just from bad experimental setup.

Still published tho! ^^jk

Pikalima t1_iylah8s wrote on December 2, 2022 at 5:51 AM

Sometimes I consider retracting my very first paper because of this.

maxToTheJ t1_iyjw8b8 wrote on December 1, 2022 at 10:59 PM

A lot of people are doing hyperopt with how they set up their experiment to get better results so that they get in prestigious conferences.

SlowFourierT198 t1_iyjyu05 wrote on December 1, 2022 at 11:18 PM

By any chance do you have the name or a reference?

peepeeECKSDEE t1_iykk4dx wrote on December 2, 2022 at 2:02 AM

There's https://arxiv.org/abs/2205.13504, but that's specifically targeted at transformers.

kraegarthegreat t1_iyolbmo wrote on December 2, 2022 at 11:08 PM

This PLAGUES my research.

The amount of detail that most papers provide about their statistical methods used as a baseline is not enough to replicate. "We outperformed ARIMA". Didn't provide values, etc.

uoftsuxalot t1_iyjrkll wrote on December 1, 2022 at 10:26 PM

I would say forecasting in general is bs.

ragamufin t1_iyl761r wrote on December 2, 2022 at 5:16 AM

I’ve been doing it for a decade+ and I’m inclined to agree but it pays well and there’s no shortage of buyers. Even straight up named a model GIPSy once with a crystal ball logo, had a pretty good run.

uoftsuxalot t1_iyshw3h wrote on December 3, 2022 at 8:37 PM

Lol, I'm minus 7 and you're positive 7 karma yet agreeing 😂. Reddit is so stupid sometimes

visualard t1_iyk19ac wrote on December 1, 2022 at 11:36 PM

Then what is your take on physics?

butyrospermumparkii t1_iyjt6yu wrote on December 1, 2022 at 10:38 PM

Why would you say that?

marr75 t1_iyjvo41 wrote on December 1, 2022 at 10:55 PM

That answer is hard to predict.

butyrospermumparkii t1_iyk10ec wrote on December 1, 2022 at 11:34 PM

A lot of time series' are really easy to predict to an acceptable level though.

No-Yogurtcloset-6838 t1_iyjcar2 wrote on December 1, 2022 at 8:46 PM

I will stick to my Exponential Smoothing good old Boomer technology.

The obvious implication of publish or perish mentality is that you cannot trust papers anymore, given all the hastily produced and broken Deep Learning conference methods.

StefaniaLVS t1_iyjtxji wrote on December 1, 2022 at 10:43 PM

Hahaha hundreds of days burning GPUs. One can only start to suspect that the purpose of the conferences and deep learning literature is to promote GPU usage rather than improve the forecasting methods knowledge.

💵💵🤖💵💵

Nowado t1_iyjz1gv wrote on December 1, 2022 at 11:20 PM

I'm not saying there is some conflict of interest, but...

jonestown_aloha t1_iylozc1 wrote on December 2, 2022 at 8:59 AM

what do you mean, conflict of interest? doesn't anyone else just buy auditoriums for their besties?

obsquire t1_iyjgih0 wrote on December 1, 2022 at 9:12 PM

But those conference papered are Peer Reviewed (TM), the gold standard of those who Believe Science, and hence beyond reproach. You are hereby cancelled.

[deleted] t1_iylst8q wrote on December 2, 2022 at 9:57 AM

[deleted]

CyberPun-K t1_iyj4snj wrote on December 1, 2022 at 7:59 PM

The M3 dataset consists only of 3,003 series, a minimal improvement of DL is not a surprise. Everybody knows that neural networks require large datasets to show substantial improvements over statistical baselines.

What is truly surprising is the time it takes to train the networks, 13 days for thousand series

=> there must be something broken with the experiments

HateRedditCantQuitit t1_iyj6yb6 wrote on December 1, 2022 at 8:12 PM

14 days is 20k minutes, so it’s about 6.7 minutes per time series. I don’t know how many models are in the ensemble, but let’s assume it’s 13 models for even math, making an average deep model take 30s to train on an average time series.

Is that so crazy?

CyberPun-K t1_iyj7gb9 wrote on December 1, 2022 at 8:16 PM

All the models are global models, trained using cross learning. Not single models per series. Unless the experiments were done like that.

I_LOVE_SOURCES t1_iykxl0a wrote on December 2, 2022 at 3:48 AM

…. am i failing to detect humour/sarcasm? those words don’t appear to say anything

BrisklyBrusque t1_iyj6bja wrote on December 1, 2022 at 8:09 PM

13 days to tune multiple deep neural networks is not at all unrealistic depending on the number of gpus.

CyberPun-K t1_iyj6q2r wrote on December 1, 2022 at 8:11 PM

NBEATs hyper-parameters are minimally explored in the original paper the ensemble was not tuned. There is something broken with the reported times.

Historical_Ad2338 t1_iylgux6 wrote on December 2, 2022 at 7:07 AM

I was thinking the same thing when I looked into this. I'm not sure if the experiments are necessarily 'broken' (there may be at least reasonable justification for why it took 13 days to train), but the first point about dataset size is a smoking gun.

mantissa t1_iylhzuj wrote on December 2, 2022 at 7:22 AM

I have not read the paper yet, but the time DL ensemble takes may be due to some kind of hyperparameter search

cristianic18 t1_iyixd70 wrote on December 1, 2022 at 7:10 PM

Also, how would someone know this particular combination of stats methods in the ensemble will produce good results beforehand?

SherbertTiny2366 t1_iyj00d2 wrote on December 1, 2022 at 7:28 PM

>This ensemble is formed by averaging four statistical models: AutoARIMA, ETS, CES and DynamicOptimizedTheta. This combination won sixth place and was the simplest ensemble among the top 10 performers in the M4 competition.

TheBrain85 t1_iymw874 wrote on December 2, 2022 at 4:11 PM

Pretty biased selection method: the best ensemble in the M4 competition, evaluated on the M3 competition. Although I'm not familiar with these datasets, they're from the same author, so presumably they have significant overlap and similarity. The real question is how hard is it to find such an ensemble without overfitting to the dataset.

SherbertTiny2366 t1_iynjxon wrote on December 2, 2022 at 6:47 PM

How is it biased to try good-performing ensembles in another data set?

And how is that overfitting?

Furthermore, just because the data sets begin with "M" it does not mean that they "have significant overlap and similarity. "

TheBrain85 t1_iyp1qrz wrote on December 3, 2022 at 1:17 AM

Because if there's overlap in the datasets, or they contain similar data, the exact ensemble you use is essentially an optimized hyperparameter specific for the dataset. It is exactly the reason that for any hyperparameter optimization cross-validation is used on a set separate from the test set. So using the results on the M4 dataset is akin to optimizing hyperparameters on the test set, which is a form of overfitting.

The datasets are from the same author, same series of competitions: https://en.wikipedia.org/wiki/Makridakis_Competitions#Fourth_competition,_started_on_January_1,_2018,_ended_on_May_31,_2018

"The M4 extended and replicated the results of the previous three competitions"

WikiSummarizerBot t1_iyp1s69 wrote on December 3, 2022 at 1:17 AM

Makridakis Competitions

Fourth competition, started on January 1, 2018, ended on May 31, 2018

>The fourth competition, M4, was announced in November 2017. The competition started on January 1, 2018 and ended on May 31, 2018. Initial results were published in the International Journal of Forecasting on June 21, 2018. The M4 extended and replicated the results of the previous three competitions, using an extended and diverse set of time series to identify the most accurate forecasting method(s) for different types of predictions.

^([ )^(F.A.Q)^( | )^(Opt Out)^( | )^(Opt Out Of Subreddit)^( | )^(GitHub)^( ] Downvote to remove | v1.5)

SherbertTiny2366 t1_iyq6prw wrote on December 3, 2022 at 8:01 AM

There is no overlap at all. It’s a completely new dataset. There might be similarities in the sense that there are time series or certain frequencies but in no way could it be the talk of “training in the test” set.

Puzzleheaded_Pin_379 t1_iylb596 wrote on December 2, 2022 at 5:58 AM

In practice you don’t, but combiniation forecast still works. This is like saying, “how did someone know that the Total Stock Market Index would outperform bitcoin beforehand”. Combining forecast has been studied in the literature and in practice. It is effective.

dataslacker t1_iyjblnp wrote on December 1, 2022 at 8:42 PM

I’m going to read this paper in detail but I’m wondering if there’s any insight into why DL methods underperform in TS prediction?

marr75 t1_iyjvtdc wrote on December 1, 2022 at 10:57 PM

Just guessing here, but: overfitting.

Internal-Diet-514 t1_iykhg3s wrote on December 2, 2022 at 1:41 AM

I think so too, I’m confused why they would need to train for 14 days, from skimming the paper it doesn’t seem like the dataset itself is that large. I bet a DL solution that was parameterized correctly to the problem would outperform the traditional statistical approaches.

marr75 t1_iykwulm wrote on December 2, 2022 at 3:42 AM

While I agree with your general statement, my gut says a well parameterized/regularized deep learning solution would perform as well as an ensemble of statistical approaches (without the expertise needed to select the statistical approaches) but would be harder to explain/interpret.

TheDrownedKraken t1_iyko6jf wrote on December 2, 2022 at 2:33 AM

I’m just curious, why do you think that?

Internal-Diet-514 t1_iymjci2 wrote on December 2, 2022 at 2:41 PM

If a model has more parameters than datapoints in the training set it can quickly just learn the training set resulting in an over-fit model. You don’t always need 16+ attention heads to have the best model for a given dataset. A single self attention layer with one head still has the ability to model more complex relationships among the inputs than something like arima would.

kraegarthegreat t1_iyor5g6 wrote on December 2, 2022 at 11:53 PM

This is something I have found in my research. I keep seeing people making models with millions of parameters when I am able to achieve 99% of the performance with roughly 1k.

TropicalAudio t1_iylsprn wrote on December 2, 2022 at 9:55 AM

Little need to speculate in this case: they're trying to fit giant models on a dataset that's a fraction of a megabyte, without any targeted pretraining or prior. That's like trying to prove trains are slower than running humans by having the two compete in a 100m race from standstill. The biggest set (monthly observations) is around 105kB of data. If anyone is surprised your average 10GB+ network doesn't perform very well there, well... I suppose now you know.

marr75 t1_iymo8k3 wrote on December 2, 2022 at 3:17 PM

Yeah

> Just guessing here, but

is a common US English idiom that typically means, "Obviously".

You're absolutely right, though. Just by comparing the training data to the training process and serialized weights, you can see how clearly this should overfit. Once your model is noticeably bigger than a dictionary of X, Y pairs of all of your training data, it's very hard to avoid overfitting.

I volunteer with a group that develops interest and skills in science and tech for kids from historically excluded groups. I was teaching a lab on CV last month and my best student was like, "What if I train for 20 epochs, tho? What about 30?" and the performance improved (but didn't generalize as well). He didn't understand generalization yet so instead, he looked at the improvement trend and had a lightbulb moment and was like, "What if I train for 10,000 epochs???" I should check to see if his name is on the list of collaborators for the paper 😂

psyyduck t1_iykfb3f wrote on December 2, 2022 at 1:24 AM

My guess is it’s the same reason we don’t have self-driving cars: bad out of distribution performance. Teslas get confused when they see new leaves where they’ve never been seen before. In the real world, distributions change a lot over time.

TrueBirch t1_iymehou wrote on December 2, 2022 at 2:03 PM

In addition to what other people have said, I'll add this: classical methods work really well. In fields like text and image generation, we didn't have great approaches 20 years ago, and DL models represented a massive improvement.

ThePhantomPhoton t1_iyj7b4a wrote on December 1, 2022 at 8:15 PM

Depends on the problem. For physical phenomena, statistical techniques are very effective. For more abstract applications, like language and vision, I just don’t know how the purely statistical methods could compete.

TotallyNotGunnar t1_iyjpbzs wrote on December 1, 2022 at 10:11 PM

Even then. I dabble in image processing at work and haven't found a need for deep learning yet. Every time, there's some trick I can pull with a rule based classifier to address the business need. It's like Duck Hunt: why recognize ducks when you can scan for white vs. black pixels?

ThePhantomPhoton t1_iyk2c66 wrote on December 1, 2022 at 11:44 PM

Upvoted because I agree with you-- for many simple image problems you can even just grayscale and use the distance from the Frobenius Norm of each class as input to a logistic regression and nail many of the cases.

TrueBirch t1_iymfk2r wrote on December 2, 2022 at 2:11 PM

When I first read your comment, I thought you were still talking about Duck Hunt. I'd read the heck out of that whitepaper.

ragamufin t1_iyl7qwv wrote on December 2, 2022 at 5:22 AM

Amen we’ve been doing satellite image time series analytics and deep learning keeps getting pushed off in favor of classification models based on complex features

bushrod t1_iyjxns1 wrote on December 1, 2022 at 11:10 PM

The analysis relates to time series prediction problems. Isn't it fair to say vision and language do not fall under that umbrella?

mtocrat t1_iyk1n65 wrote on December 1, 2022 at 11:39 PM

Consider spoken language, and you're back in the realm of time-series. Obviously simple statistical methods can't deal with those though.

bushrod t1_iyk33jc wrote on December 1, 2022 at 11:50 PM

Right, even though language is a form of time series, in practice it doesn't use TSP methods. Transformers are not surprisingly being applied to TSP problems though.

Warhouse512 t1_iykw25k wrote on December 2, 2022 at 3:35 AM

Eh, predicting where pedestrians are going, or predicting next frames in general. Even images have temporal forecasting use cases

ThePhantomPhoton t1_iyk2wnq wrote on December 1, 2022 at 11:49 PM

I think you have a good argument for images, but language is more challenging because we rely on positional encodings (a kind of "time") to provide us with contextual clues which beat out the following form of statistical language model: Pr{x_{t+1}|x_0, x_1, ..., x_{t}} (Edit-- that is, predicting the next word in sequence given all preceding words in the sequence)

eeaxoe t1_iyn2zwu wrote on December 2, 2022 at 4:56 PM

Tabular data is another problem setting where DL has a tough time stacking up to simpler statistical or even ML methods.

cristianic18 t1_iyixblu wrote on December 1, 2022 at 7:10 PM

The results are interesting, but you should include more recent deep learning approaches (not only from GluonTS).

GreatBigBagOfNope t1_iyk3y0g wrote on December 1, 2022 at 11:56 PM

Yes, DL is a sophisticated tool for the most intractable of tasks, and for most problems is like using the Death Star to crack a nut. This is well known and should just be a normal known thing any analyst of any flavour should have in mind - if you're using DL, especially anything not really big or that isn't natural language or image related, it should be for a good reason, because either a random forest, a GAM, or an autofit ARIMA will get you 80+% of the way there 80+% of the time for tabular data. Not everything needs to start with the biggest guns.

TrueBirch t1_iymerps wrote on December 2, 2022 at 2:05 PM

>like using the Death Star to crack a nut

Or a sledgehammer.

I completely agree with you. I instruct the juniors where I work to start with the most basic possible statistical tests and add complexity only when necessary. A good-enough linear regression is easier to implement, replicate, and understand than a slightly-improved DL model.

michelin_chalupa t1_iykvrqx wrote on December 2, 2022 at 3:33 AM

Isn’t it common knowledge that deep learning is usually not the best solution for modeling time series?

Puzzleheaded_Pin_379 t1_iylbioq wrote on December 2, 2022 at 6:03 AM

To forecasters… yes.

TrueBirch t1_iymfbbz wrote on December 2, 2022 at 2:09 PM

Yes, but I've seen many proposals to apply DL to everyday problems where it's not well suited. Heck, even I briefly went down that rabbit hole with a graph theory problem at work. Tried out a basic greedy algorithm first and it worked well enough that I didn't see the need to get any more complicated.

SrPinko t1_iyjsraw wrote on December 1, 2022 at 10:35 PM

I agree, for univariate timer series an statistical model should be enough in the most of the cases; however, I still thinks that DL models would outperform statistical models in multivariate time series with a big set of variables, like the MIMIC-III database. Am I wrong with this belief?

mtocrat t1_iyk1se1 wrote on December 1, 2022 at 11:40 PM

Even for univariate time series, when you have the data & complexity, DL will obviously outperform simple methods. Show me the simple statistical method that can generate speech, a univariate time-series.

TrueBirch t1_iymf42w wrote on December 2, 2022 at 2:08 PM

Wouldn't a DL model trained on a waveform just assume you were going to keep repeating the same words over and over?

mtocrat t1_iymi8i7 wrote on December 2, 2022 at 2:33 PM

You could already tape together a deep learning solution consisting of neural speech recognition, an LLM and Wavenet. Counts as a deep learning solution in my book. I'm not sure if anyone has built an end-to-end solution and I expect it would be worse, but I'm sure if someone put their mind and money to it you'd get decent results

kraegarthegreat t1_iyorlrr wrote on December 2, 2022 at 11:56 PM

From my personal experience:

- Univariate with a few timesteps: XGBoost or statistical methods.

- Multivariate with many timesteps: NN-based models.

SrPinko t1_iyov8fk wrote on December 3, 2022 at 12:25 AM

I agree with you

TrueBirch t1_iymf0eo wrote on December 2, 2022 at 2:07 PM

Depends how much data you have and how much signal there is. Separating signal from noise in a high-dimensional time series is always a challenge.

[deleted] t1_iyj3ct1 wrote on December 1, 2022 at 7:50 PM

[deleted]

AceOfSpades0711 t1_iykea0o wrote on December 2, 2022 at 1:16 AM

The current, rather excessive, employment of deep learning methods is majorly motivated by the desire to understand them better through the experience gained in applying them.

A good paper that puts this into perspective is from Lea Breiman called "Statistical Modeling: The Two Cultures". He argues in the paper that data based statistical models are preventing statisticians from new and exciting discoveries with algorithmic models. Coincidentally, the author is the creator of the ensemble idea that you are using here as explanation. Now take into account that this was written in 2001 where ensembles were what deep learning is in 2022.

Basically, deep learning is preferred in order to improve it to a point where it will by far outperform all other methods, which it is believed to have the potential for. For it may one day lead us to new and exciting discoveries.

abhasatin t1_iylsdrn wrote on December 2, 2022 at 9:50 AM

!RemindMe 3 days

RemindMeBot t1_iylsg90 wrote on December 2, 2022 at 9:51 AM

I will be messaging you in 3 days on 2022-12-05 09:50:47 UTC to remind you of this link

2 OTHERS CLICKED THIS LINK to send a PM to also be reminded and to reduce spam.

^(Parent commenter can ) ^(delete this message to hide from others.)

^(Info)	^(Custom)	^(Your Reminders)	^(Feedback)

bigb0w t1_iykxhz9 wrote on December 2, 2022 at 3:47 AM

Essentially simple statistical models are much more eco-friendlier to the planet!

Sallao t1_iymcgzn wrote on December 2, 2022 at 1:46 PM

Lost 9 months of this shit

TrueBirch t1_iymdxy3 wrote on December 2, 2022 at 1:58 PM

Great writeup! Reminds me of the excellently named "Cracking nuts with a sledgehammer: when modern graph neural networks do worse than classical greedy algorithms" (https://arxiv.org/abs/2206.13211).

The_Bundaberg_Joey t1_iynpo1a wrote on December 2, 2022 at 7:26 PM

Thanks for sharing the link! This’ll actually work really nicely for a paper I’m writing!

TrueBirch t1_iynqhc9 wrote on December 2, 2022 at 7:31 PM

Happy to help!

[deleted] t1_iyitfkz wrote on December 1, 2022 at 6:45 PM

[deleted]

serge_cell t1_j05qcrj wrote on December 14, 2022 at 6:45 AM

DL is not working well on low-dimentional samples data, data with low correlation between sample elements, and especially bad for time series prediction which is both. Many people put that kind of senseless projects (DL for time series) on their CV and that is instant black mark for candidate, at least for me. They say "but that approach did work!" I ask "did you try anything else?" "No".

[deleted] t1_iykmkxm wrote on December 2, 2022 at 2:21 AM

[deleted]