Submitted by 4bedoe t3_yas9k0 in MachineLearning
Bot-69912020 t1_itfqpz6 wrote
Reply to comment by rehrev in [D] What things did you learn in ML theory that are, in practice, different? by 4bedoe
An illustration of why it happens in low dimensions: https://twitter.com/adad8m/status/1582231644223987712
I think the main problem is that all textbooks introduce the bias-variance tradeoff as something close to a theoretical law, while in reality it is just an empirical observation, and we simply hadn't bothered to check that observation across more settings... until now.
comradeswitch t1_itmfawj wrote
There's quite a bit of missing context and misconceptions about this in general, but there is a theoretical backing that is really just a mathematical fact that isn't falsifiable by observation.
Bias, variance, and mean squared error for an estimator can be related quite simply. Say the quantity to be estimated is y and the estimate is x (forgive the lack of LaTeX making more conventional names difficult). Then MSE is E[(y - x)^2] = E[x^2] + E[y^2] - 2 E[xy] = Var[x] + Var[y] - 2Cov[x,y] + (E[x]-E[y])^2. The covariance term is typically eliminated by assuming an error structure with Cov[x,y] = 0, which leaves a decomposition of MSE into 3 terms:
1. Var[y] - the variance of the true underlying quantity, which is inherent to any estimator.
2. Var[x] - the variance of the estimator regardless of bias (what's typically referred to as "the variance" in the bias-variance terminology).
3. (E[y] - E[x])^2 = Bias[x]^2 - the squared bias of the estimator.
Since (1) is constant across all estimators, the difference in MSE between two estimators comes down entirely to (2) and (3).
The bias-variance tradeoff is then:
For a fixed value of mean squared error, decreasing the bias of an estimator must increase the variance by the same amount and vice versa.
An unbiased estimator will have the maximum variance among all estimators with the same MSE. It's entirely possible to improve an estimator's MSE by increasing or decreasing bias. In fact, the two commonly used estimators of sample variance are a great example*.
The most important pieces here are that we're talking about MSE of an estimator, and that for a fixed value of MSE there's a 1:1 tradeoff between bias and variance. It makes no statement about the case when MSE is not held constant, and it's actually very well known in mathematical statistics that biased estimators can provide significant improvements in MSE over unbiased ones. The tradeoff doesn't hold if you're not measuring error with MSE, or if you allow MSE to change. It can be extended to multivariate cases, and "estimator" could be anything from a mean of a few samples to the output of a deep belief network.
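As a quick sanity check of the decomposition above, here is a minimal simulation (my own sketch, not part of the original comment; the distributions are arbitrary choices) comparing E[(y - x)^2] against Var[x] + Var[y] - 2Cov[x,y] + (E[x] - E[y])^2:

```python
# Numerical check of MSE = Var[x] + Var[y] - 2*Cov[x,y] + (E[x] - E[y])^2
# (illustrative sketch; the distributions below are arbitrary choices)
import numpy as np

rng = np.random.default_rng(0)
n = 1_000_000
y = rng.normal(3.0, 2.0, n)              # the "true" quantity
x = 0.8 * y + rng.normal(0.5, 1.0, n)    # a noisy, biased estimate correlated with y

mse = np.mean((y - x) ** 2)
decomp = (x.var() + y.var()
          - 2 * np.cov(x, y, ddof=0)[0, 1]
          + (x.mean() - y.mean()) ** 2)
print(mse, decomp)  # identical up to floating-point error (ddof=0 throughout)
```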
* Take a = 1/(n-1) * sum(x^2) - 1/(n*(n-1)) * sum(x)^2 and b = (n-1)/n * a, and let v = Var[x] be the population variance. We have E[a] = v, which means a is unbiased (this is Bessel's correction), and obviously E[b] = (n-1)/n * v, so Bias[b] = E[b] - v = -v/n. Computing Var[a] requires knowledge of the 4th moments of x; however, for normally distributed data the MSE of b is less than that of a. A property that holds in general is Var[b] = ((n-1)/n)^2 * Var[a], and MSE[b] = Var[b] + (v/n)^2 = 1/n^2 * ((n-1)^2 * Var[a] + v^2). Thus MSE[b] < MSE[a] exactly when (2n-1) * Var[a] > v^2, which is true for normal distributions and many others.
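A small Monte Carlo illustration of that footnote (again my own sketch; the sample size and population variance are arbitrary), comparing the bias and MSE of the two variance estimators on normal data:

```python
# Monte Carlo comparison of the unbiased estimator a (Bessel-corrected, divides by n-1)
# and the biased estimator b = (n-1)/n * a (divides by n) on normal data.
# Illustrative sketch: n and the population variance v are arbitrary choices.
import numpy as np

rng = np.random.default_rng(0)
n, trials, v = 10, 200_000, 4.0

x = rng.normal(0.0, np.sqrt(v), size=(trials, n))
a = x.var(axis=1, ddof=1)   # unbiased sample variance
b = x.var(axis=1, ddof=0)   # biased sample variance, equals (n-1)/n * a

print("bias:", a.mean() - v, b.mean() - v)                    # ~0 vs ~ -v/n
print("MSE: ", ((a - v) ** 2).mean(), ((b - v) ** 2).mean())  # b wins for normal data
```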
Bot-69912020 t1_ittvs0g wrote
Thanks for your clarification - I'm sure it's useful for other readers! Given your background, you might be interested in our latest preprint, which offers a more general bias-variance decomposition: https://arxiv.org/pdf/2210.12256.pdf
comradeswitch t1_itwmbkg wrote
Oh interesting! This looks to hit on a lot of my favorite topics; I'll take a more in-depth look later.
It's not surprising to me that a decomposition based on Bregman divergences has properties similar to the one for MSE, but the connections through convex conjugates and proper scores are clever.
rehrev t1_itftbm1 wrote
It's not a theoretical law. But it sure as hell makes intuitive sense, and I can't really imagine how a complex model could fail to overfit. I mean, that's what I mean by a complex model: something that is prone to overfitting. Otherwise, what does model complexity even mean?
Lugi t1_itiq8ir wrote
>Otherwise, what does model complexity even mean?
People generally refer to bigger models (more parameters) as more complex.
Come to think of it, redundancy in networks with more parameters can act as a regularizer by giving similar branches an effectively higher learning rate, making them less prone to overfitting. Let me give you an example of what I have in mind: a simple network with just one parameter, y = wx. You can pass some data through it, compute the loss, backpropagate to get the gradient, and update the weight with it.
But see what happens if we reparametrize w as w1 + w2: the gradient for each of these is the same as in the one-parameter case, but after the weight update we end up moving the effective weight twice as far, which is equivalent to the original one-parameter case with a learning rate twice as large.
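Here's a tiny numeric sketch of that argument (my own illustration; the input, target, and learning rate are arbitrary), doing one gradient step on a squared-error loss for y = wx versus the reparametrized y = (w1 + w2)x:

```python
# One gradient step on a squared-error loss, comparing y = w*x with the
# reparametrized y = (w1 + w2)*x. The input, target, and learning rate are
# arbitrary values chosen for illustration.
x, target, lr = 2.0, 3.0, 0.1

# single parameter
w = 0.5
grad_w = 2 * (w * x - target) * x            # d/dw of (w*x - target)^2
w_new = w - lr * grad_w

# reparametrized: w = w1 + w2, same initial effective weight
w1 = w2 = 0.25
grad = 2 * ((w1 + w2) * x - target) * x      # identical gradient for w1 and w2
w1_new, w2_new = w1 - lr * grad, w2 - lr * grad

print(w_new - w)                             # -lr * grad
print((w1_new + w2_new) - (w1 + w2))         # -2 * lr * grad: twice as far
```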
Another thing that could be linked to this phenomenon: on one hand, the parameter space of a 1-hidden-layer neural network grows exponentially with the number of neurons, while on the other hand, the number of equivalent minima grows factorially, so at a certain number of neurons the factorial takes over and your optimization problem becomes much simpler, because you are always close to your desired minimum. But I don't know shit about high-dimensional math, so don't quote me on that.
Bot-69912020 t1_ittw3bj wrote
Maybe this will help your intuition: do more parameters really increase model complexity if they are fitted less? Check out this post: https://twitter.com/tengyuma/status/1545101994150531073