Comments

michaelaalcorn t1_iyyhfhw wrote

Training a single model on three target variables is equivalent to training three separate models that have shared parameters except for the final layer (assuming a mean squared error loss in both cases), so training a single model effectively regularizes the three models. Whether or not this is a good thing will depend on the dataset, but in the limit of infinite data, three separate models will give you better overall performance than a single model since they won't be regularized.
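
A minimal sketch of that equivalence (PyTorch assumed; the layer sizes and toy data here are made up): with a summed MSE objective, one 3-output head on a shared trunk gives the same loss structure as three single-output heads that share that trunk, up to how the heads are initialized.

```python
import torch
import torch.nn as nn

shared_trunk = nn.Sequential(nn.Linear(16, 64), nn.ReLU())  # hypothetical sizes

multi_head = nn.Linear(64, 3)                 # one model, three target variables
heads = [nn.Linear(64, 1) for _ in range(3)]  # "three separate models" sharing the trunk

x, y = torch.randn(8, 16), torch.randn(8, 3)  # toy batch
h = shared_trunk(x)

# Summed squared error over the three outputs of the single model...
loss_multi = nn.functional.mse_loss(multi_head(h), y, reduction="sum")
# ...has the same form as summing three per-output squared errors.
loss_separate = sum(
    nn.functional.mse_loss(head(h), y[:, i:i + 1], reduction="sum")
    for i, head in enumerate(heads)
)
```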

38

pyepyepie t1_iyzvql1 wrote

Great answer, but I am a little unsure about the last line. If you are using an ANN you can get stuck in a local minimum of the loss function, and I am not sure that learning multiple tasks in parallel wouldn't still be beneficial for the model. I am not saying you are incorrect, just trying to learn something new :).

edit: my TL;DR question is whether sharing weights can help an ANN avoid getting stuck in a local minimum, i.e., improve performance.

0

trnka t1_iz06hbj wrote

Multi-task learning has a long history with mixed results - sometimes very beneficial, and sometimes it just flops. At my previous job, we had one situation in which it was helpful and another situation in which it was harmful.

In the harmful situation, adding outputs and keeping the other layers the same led to slight reductions in quality at both tasks. I assume that it could've been salvaged if we'd increased the number of parameters -- I think the different outputs were effectively "competing" for hidden params.

Another way to look at this is that multi-task learning is effective regularization, so you can increase the number of parameters without as much risk of horrible overfitting. If I remember correctly, there's research showing that overparameterized networks tend to get stuck in local minima less often.

One last story from the field -- in one of our multi-task learning situations, we found that it was easier to observe local minima by just checking per-output metrics. Two training runs might have the same aggregate metric, but one might be far better at output A and the other far better at output B.
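
A quick sketch of that per-output check (NumPy, with invented numbers): two runs can tie on the aggregate metric while each is far better at a different output.

```python
import numpy as np

rng = np.random.default_rng(0)
y_true = rng.normal(size=(1000, 2))                        # two target variables (toy data)
pred_a = y_true + rng.normal(size=(1000, 2)) * [0.1, 1.0]  # run A: strong on output 0
pred_b = y_true + rng.normal(size=(1000, 2)) * [1.0, 0.1]  # run B: strong on output 1

for name, pred in [("run A", pred_a), ("run B", pred_b)]:
    per_output_mse = ((pred - y_true) ** 2).mean(axis=0)   # MSE per output column
    print(name, "aggregate MSE:", per_output_mse.mean().round(3),
          "per-output MSE:", per_output_mse.round(3))
```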

2

pyepyepie t1_iz0wj9m wrote

Super interesting. I like the story about the metrics; it's very useful for people who are new to data science. Even when the imbalance isn't solvable (I assume in your case it was, but in MARL, for example, if you just aim for Pareto optimality you sometimes get a weird division of "goods"), most of the time you would rather have two models at x-5% accuracy than one model at x+15% and another at x-15%. We get paid to know the systems we build :)

BTW, what you're talking about seems related to this: https://openai.com/blog/deep-double-descent/ (deep double descent). That phenomenon is clearly magic :D I have heard some explanations about weight initialization at a conference, but to be honest I really don't have anything intelligent to say about it. It would be interesting to see whether it's still the standard type of network in 20 years.

1

trnka t1_iz1nk60 wrote

Oh, interesting paper - I hadn't seen it before.

For what it's worth, I haven't observed double descent personally, though I suppose I'd only notice it for sure along the training-time axis. We almost always had typical learning curves with epochs - training loss decreases smoothly as expected, and test loss hits a bottom then starts climbing unless there's a TON of regularization.

We probably would've seen it in the number of model parameters if it were there, because we did random searches on those periodically and graphed the correlations. I only remember seeing one peak on those, though we generally didn't evaluate beyond 2x the number of params of our most recent best.

I probably wouldn't have observed the effect with more data because our distribution shifted over the years. For instance, in 2020 we got a lot more respiratory infections coming in due to COVID, which temporarily decreased our numbers and then increased them, since respiratory infections are easier to guess than other conditions.

2

a6nkc7 t1_iyy22ml wrote

Generally, you do it when you want to get some idea of the covariance between the outputs conditional on the inputs.
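
Here's my reading of that as a minimal sketch (scikit-learn and NumPy; the data and noise covariance are invented): fit one multi-output model, then look at the covariance of the joint residuals, which gives an estimate of how the targets co-vary given the inputs.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 4))
noise_cov = [[1.0, 0.7], [0.7, 1.0]]                     # correlated noise across targets
Y = X @ rng.normal(size=(4, 2)) + rng.multivariate_normal([0, 0], noise_cov, size=500)

residuals = Y - LinearRegression().fit(X, Y).predict(X)  # joint residuals from one model
print(np.cov(residuals, rowvar=False))                   # roughly recovers noise_cov
```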

12

Dubgarden t1_iz0xbac wrote

Could you explain that in a bit more detail please? I'm curious.

2

purplebrown_updown t1_iyz8ykc wrote

Efficiency, mostly. But it can also be a matter of accuracy. You should also be hyperparameter tuning each model, so that becomes cumbersome, especially if you have thousands of outputs.

1

smsorin t1_iyza1zg wrote

If you are inference-constrained, a single model might be better. Since a good chunk of the model is shared, you need less compute and perhaps even less time if you can't parallelize sufficiently. The other comments here have other good arguments.

1

anjmon t1_iyzmeog wrote

On a related note, I am curious about what kind of data you are working on. I am a beginner and looking to try out regression on real and novel datasets.

1

PredictorX1 t1_iyzsby0 wrote

For modeling solutions featuring intermediate calculations (such as the hidden layers of multilayer perceptrons), the hope is that what is learned about each target variable might be "shared" with the others. Whether this effect yields a net gain depends on the nature of the data. Outputs in a multiple-output model that is trained iteratively tend to reach their optimum performance at differing numbers of iterations. There is also the logistical benefit of only having to train one larger model rather than several.
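
As a toy illustration of the differing-optima point (made-up validation curves, not real data): tracking error per output shows each output bottoming out at its own iteration, with the aggregate optimum landing somewhere in between.

```python
import numpy as np

epochs = np.arange(1, 101)
# Invented U-shaped validation curves: output A bottoms out early, output B late.
val_a = (epochs - 20) ** 2 / 400.0 + 1.0
val_b = (epochs - 70) ** 2 / 400.0 + 1.0

print("best epoch for output A:", epochs[np.argmin(val_a)])          # 20
print("best epoch for output B:", epochs[np.argmin(val_b)])          # 70
print("best epoch for the sum: ", epochs[np.argmin(val_a + val_b)])  # 45, a compromise
```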

1