Submitted by hardmaru t3_ys36do in MachineLearning
jrkirby t1_ivx9xjl wrote
What happens when all the weights to a ReLU neuron are 0? The ReLU function's derivative is discontinuous at zero. I figure in most practical situations this doesn't matter, because the odds of many floating point numbers summing to exactly 0.0 are negligible. But this paper raises the question of what that would do. Is the derivative of ReLU at 0.0 equal to NaN, 0, or 1?
maybelator t1_ivxgacq wrote
> Is the derivative of ReLU at 0.0 equal to NaN, 0 or 1?
The derivative of ReLU is not defined at 0, but its subdifferential is, and it is the set [0,1].
You can pick any value in this set; you then end up with (stochastic) subgradient descent, which converges (to a critical point) for small enough learning rates.
For ReLU, the discontinuities have mass 0 and are not "attractive", i.e., there is no reason for the iterate to end up exactly at 0, so they can be safely ignored. This is not the case for the L1 norm, for example, whose subdifferential at 0 is [-1,1]. It presents a "kink" at 0, as the subdifferential contains a neighborhood of 0, and is hence attractive: your iterate will get stuck there. In these cases, it is recommended to use proximal algorithms, typically forward-backward schemes.
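For intuition, here is a minimal NumPy sketch of that difference on a 1-D toy problem (the problem, the constants, and the `soft_threshold` helper are my own illustration, not anything from a specific implementation):

```python
import numpy as np

# Toy 1-D problem: minimize f(x) = 0.5*(x - 0.3)**2 + lam*|x|.
# With lam = 0.5 > 0.3, the exact minimizer is x = 0, i.e. the kink is "attractive".
a, lam, t = 0.3, 0.5, 0.1

def soft_threshold(z, thresh):
    # Proximal operator of thresh*|.|: shrinks toward 0 and clips small values to exactly 0.
    return np.sign(z) * np.maximum(np.abs(z) - thresh, 0.0)

x_sub = 2.0  # plain subgradient descent
x_fb = 2.0   # forward-backward (proximal gradient)
for _ in range(200):
    # Subgradient step: use sign(x) as the subgradient of |.|, with sign(0) = 0.
    x_sub -= t * ((x_sub - a) + lam * np.sign(x_sub))
    # Forward step on the smooth part, backward (prox) step on the L1 part.
    x_fb = soft_threshold(x_fb - t * (x_fb - a), t * lam)

print(x_sub)        # hovers near 0 but keeps hopping across the kink, never exactly 0
print(x_fb == 0.0)  # True: the prox step lands exactly on 0 and stays there
```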
Phoneaccount25732 t1_ivydmgs wrote
I want more comments like this.
9182763498761234 t1_ivy1mud wrote
Cool, thanks for sharing :-)
robbsc t1_ivypqg0 wrote
Thanks for taking the time to type this out
samloveshummus t1_iw1o1jg wrote
This has to be one of the most useful comments I've read in nearly ten years on Reddit! You must be a gifted teacher.
zimonitrome t1_iwbmzoq wrote
Huber loss let's go.
maybelator t1_iwbpkjo wrote
Not if you want true sparsity!
zimonitrome t1_iwbst8p wrote
Can you elaborate?
maybelator t1_iwbxutj wrote
The Huber loss encourages the regularized variable to be close to 0. However, this loss is also smooth: the amplitude of its gradient decreases as the variable nears its stationary point. As a consequence, the variable will have many coordinates close to 0, but not exactly 0. Achieving true sparsity then requires thresholding, which adds a lot of other complications.
In contrast, the amplitude of the gradient of the L1 norm (absolute value in dim 1) remains the same no matter how close the variable gets to 0. The functional has a kink (the subdifferential at 0 contains a neighborhood of 0). As a consequence, if you use a well-suited optimization algorithm, the variable will have true sparsity, i.e., a lot of exact 0s.
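A toy comparison of the two behaviours, assuming NumPy and a made-up random least-squares problem (the `huber_grad` helper, the constants, and the iteration counts are just illustrative, not a recipe):

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.normal(size=(50, 100))
x_true = rng.normal(size=100) * (rng.random(100) < 0.1)  # sparse ground truth
b = A @ x_true

lam, delta = 0.5, 0.01
t = 0.5 / np.linalg.norm(A, 2) ** 2  # step size from the spectral norm of A

def huber_grad(x, delta):
    # Gradient of the Huber penalty: linear near 0, so it never pushes a coordinate to exactly 0.
    return np.where(np.abs(x) <= delta, x / delta, np.sign(x))

x_h = np.zeros(100)   # gradient descent on 0.5*||Ax-b||^2 + lam*Huber(x)
x_l1 = np.zeros(100)  # ISTA / forward-backward on 0.5*||Ax-b||^2 + lam*||x||_1
for _ in range(2000):
    x_h -= t * (A.T @ (A @ x_h - b) + lam * huber_grad(x_h, delta))
    z = x_l1 - t * (A.T @ (A @ x_l1 - b))
    x_l1 = np.sign(z) * np.maximum(np.abs(z) - t * lam, 0.0)  # soft-thresholding

print(np.sum(x_h == 0.0))   # typically 0: Huber leaves tiny but nonzero coordinates
print(np.sum(x_l1 == 0.0))  # many exact zeros: the prox clips them to 0
```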
zimonitrome t1_iwc14i5 wrote
Wow thanks for the explanation, it does make sense.
I had a preconception that optimizers dealing with piecewise-linear functions (like the L1 norm) would still only produce values close to 0.
I can see someone disregarding tiny values when exploiting said sparsity (pruning, quantization), but I didn't think they would be exactly 0.
ThisIsMyStonerAcount t1_ivy34sr wrote
Knowing about subgradients (see other answers) is nice and all, but in the real world what matters is what your framework does. Last time I checked, both PyTorch and JAX say that the derivative of max(x, 0) is 0 when x = 0.
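A quick way to check this yourself, assuming PyTorch is installed (a similar one-liner with `jax.grad` on `jax.nn.relu` works for JAX):

```python
import torch

x = torch.tensor(0.0, requires_grad=True)
torch.relu(x).backward()
print(x.grad)  # tensor(0.): the framework picks 0 as ReLU's "derivative" at x = 0
```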
samloveshummus t1_iw1ofup wrote
Good point. But it's not the end of the world; those frameworks are open source, after all!
Bot-69912020 t1_ivxbxml wrote
I don't know about each specific implementation, but via the definition of subgradients you can get 'derivatives' of convex but non-differentiable functions (which ReLU is).
More formally: A subgradient at a point x of a convex function f is any x' such that f(y) >= f(x) + < x', y - x > for all y. The set of all possible subgradients at a point x is called the subdifferential of f at x.
For more details, see here.
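As a sanity check of that definition applied to ReLU at x = 0, here is a small NumPy sketch of my own that samples a grid of y values:

```python
import numpy as np

# Check the subgradient inequality for f = ReLU at x = 0:
#   ReLU(y) >= ReLU(0) + g*(y - 0) for all y   <=>   ReLU(y) >= g*y
relu = lambda v: np.maximum(v, 0.0)
ys = np.linspace(-5.0, 5.0, 1001)

for g in [0.0, 0.3, 1.0]:  # candidate subgradients inside [0, 1]
    assert np.all(relu(ys) >= g * ys)
for g in [-0.1, 1.1]:      # values outside [0, 1] violate the inequality somewhere
    assert not np.all(relu(ys) >= g * ys)
print("every sampled g in [0, 1] works: the subdifferential of ReLU at 0 is [0, 1]")
```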