BrisklyBrusque t1_j0i9ldc wrote on December 16, 2022 at 9:04 PM

Reply to comment by chaosmosis in [R] Are there open research problems in random forests? by SpookyTardigrade

Thanks for the suggestion.

chaosmosis t1_j0ib3ja wrote on December 16, 2022 at 9:14 PM

No problem at all. I'm leaving ML research for at least the next couple years, and I want my best ideas to get adopted by others. I figured out all of the above in a three month summer internship in 2020 and nobody there cared because it couldn't immediately be used to blow things up more effectively, which was incredibly disappointing.

As far as I can tell, nobody but me and this one footnote in an obscure economics paper I've forgotten the citation of has ever noted that ensembles and financial portfolios deal with the same problem if you cast both in terms of control variates. In theory, bridging between the two by way of control variates should allow for stealing lots and lots of ideas from finance literature for ML papers. Would really like seeing someone make something of the connection someday.

chaosmosis t1_j0icgvf wrote on December 16, 2022 at 9:23 PM

As an example, imagine that Bob and Susan are estimating the height of a dinosaur and Bob makes errors that are exaggerated versions of Susan's, so if Susan underestimates its height by ten feet then Bob underestimates it by twenty, or if Susan overestimates its height by thirty feet then Bob overestimates it by forty. You can "artificially construct" a new prediction to average with Susan's predictions by taking the difference between her prediction and Bob's, flipping its sign, and adding it to her prediction. Then you conduct traditional linear averaging on the constructed prediction with Susan's prediction.

Visually, you can think about it as if normal averaging draws a straight line between two different models' individual outputs in R^n , then chooses some point between them, while control variates extend that line further in both directions and allow you to choose a point that's more extreme.

It's a little more complicated with more predictors and when issuing predictions in higher dimensions than in one dimension, but not by much. Intuitively, you have to avoid "overcounting" certain relationships when you're trying to build a flipped predictor. This is why the financial portfolio framework is helpful; they're already used to thinking about correlations between lots of different investments.

The tl;dr version is, you want models with errors that balance each other out.