Submitted by Worth-Advance-1232 t3_10asgah in MachineLearning
Basically what the title says. It seems to me that Super Learners / Stacking are used infrequently, both in business and in the literature, so I was wondering why that is, especially since stacking should guarantee performance at least as good as that of its base learners. One reason that comes to mind is the curse of data: the more levels the architecture has, the more data splits are needed, which reduces the training data available to each individual model and thus its performance. Another might be the complexity of building a stacked learner. Still, that doesn't seem like that bad of a trade-off. Is there anything I'm totally missing here?
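For concreteness, here is a minimal sketch of the kind of two-level stack the question is about, using scikit-learn's `StackingClassifier`. The dataset and the particular base/meta learners are illustrative choices, not anything from the thread; the point is that the `cv` parameter is where the data-splitting cost mentioned above shows up, since the meta-learner is trained on out-of-fold predictions from the base learners.

```python
# Minimal two-level stacking sketch (illustrative estimator choices).
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier, StackingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

X, y = make_classification(n_samples=1000, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

base_learners = [
    ("rf", RandomForestClassifier(n_estimators=100, random_state=0)),
    ("svc", SVC(probability=True, random_state=0)),
]

# cv=5: the meta-learner is fit on out-of-fold predictions of the base
# learners, so each base model effectively trains on 4/5 of the data per
# fold -- the "curse of data" cost the post describes.
stack = StackingClassifier(
    estimators=base_learners,
    final_estimator=LogisticRegression(),
    cv=5,
)
stack.fit(X_train, y_train)
print("stacked accuracy:", stack.score(X_test, y_test))
```

Note that adding more stacking levels multiplies this splitting cost, which is one concrete reading of why deep stacks are rare in practice.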
chaosmosis t1_j47d0ev wrote
In addition to being more straightforward, applying the same total amount of compute to a single model doing end-to-end learning often yields better performance than splitting that compute across multiple models. As far as I'm aware, there is no systematic way to tell when one approach will beat the other; this is just a rule-of-thumb opinion.