maybelator t1_ixcyhrp wrote

Papers are not selected based on their average score, but on whether or not there is a consensus toward acceptance. Show the meta-reviewers that you can meaningfully address the reservations of the 4 and the 5, or show that those reservations are not valid (be very careful with this route).

Based on the scores, the scale tips in your favor. But the rebuttal will be critical.

1

maybelator t1_iwbxutj wrote

The Huber loss encourages the regularized variable to be close to 0. However, this loss is also smooth: the amplitude of the gradient decreases as the variable nears its stationary point. As a consequence, the solution will have many coordinates close to 0 but not exactly 0. Achieving true sparsity then requires thresholding, which adds a lot of other complications.

In contrast, the amplitude of the gradient of the L1 norm (the absolute value in dimension 1) remains the same no matter how close the variable gets to 0. The functional has a kink at 0 (its subgradient there contains a neighborhood of 0). As a consequence, if you use a well-suited optimization algorithm, the variable will exhibit true sparsity, i.e. a lot of exact zeros.
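
A minimal numpy sketch of this difference (the toy least-squares problem, the regularization weight `lam`, and the Huber width `delta` are illustrative assumptions, not from the comment): plain gradient descent on a Huber-regularized objective leaves small but nonzero coordinates, while a forward-backward scheme using the L1 proximal operator (soft-thresholding) returns exact zeros.

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.normal(size=(50, 20))
x_true = np.zeros(20)
x_true[:3] = rng.normal(size=3)              # only 3 truly nonzero coefficients
b = A @ x_true + 0.01 * rng.normal(size=50)

lam = 0.5                                    # regularization weight (assumed)
delta = 0.01                                 # Huber smoothing width (assumed)
L_data = np.linalg.norm(A, 2) ** 2           # Lipschitz constant of the data-fit gradient

def huber_grad(x):
    # Gradient of the Huber penalty: linear inside [-delta, delta], +/-1 outside.
    return np.where(np.abs(x) <= delta, x / delta, np.sign(x))

def soft_threshold(x, t):
    # Proximal operator of t * ||.||_1: shrinks toward 0 and clips small entries to exactly 0.
    return np.sign(x) * np.maximum(np.abs(x) - t, 0.0)

x_huber = np.zeros(20)
x_l1 = np.zeros(20)
step_h = 1.0 / (L_data + lam / delta)        # the Huber gradient is (lam/delta)-Lipschitz
step_l1 = 1.0 / L_data

for _ in range(5000):
    # Plain gradient descent on the smooth Huber-regularized objective.
    x_huber -= step_h * (A.T @ (A @ x_huber - b) + lam * huber_grad(x_huber))
    # Forward-backward (ISTA): gradient step on the data fit, then the L1 prox.
    x_l1 = soft_threshold(x_l1 - step_l1 * A.T @ (A @ x_l1 - b), step_l1 * lam)

print("exact zeros, Huber:", np.count_nonzero(x_huber == 0.0))  # expected: 0
print("exact zeros, L1   :", np.count_nonzero(x_l1 == 0.0))     # expected: close to 17
```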

2

maybelator t1_ivxgacq wrote

> Is the derivative of ReLU at 0.0 equal to NaN, 0 or 1?

The derivative of ReLU is not defined at 0, but its subderivative is: it is the interval [0, 1].

You can pick any value in this set, and you end up with (stochastic) subgradient descent, which converges to a critical point for small enough learning rates.

For ReLU, the points of non-differentiability have measure zero and are not "attractive", i.e. there is no reason for the iterates to land exactly at 0, so the issue can be safely ignored. This is not the case for the L1 norm, for example, whose subgradient at 0 is [-1, 1]. It presents a "kink" at 0 because the subdifferential there contains a neighborhood of 0, and is therefore attractive: your iterates will get stuck there. In these cases, it is recommended to use proximal algorithms, typically forward-backward schemes.
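
A small numpy sketch of both points (the 1-D objective 0.5*(x - 0.1)^2 + |x| and the step sizes are illustrative assumptions): any value in [0, 1] is a valid ReLU subgradient at 0, and for the L1 kink, subgradient descent hovers around 0 without reaching it exactly, whereas the forward-backward (proximal) scheme lands exactly on 0.

```python
import numpy as np

def relu_subgrad(x, value_at_zero=0.0):
    # Any value_at_zero in [0, 1] is a valid element of the ReLU subdifferential at 0
    # (picking 0 here is just one common convention, assumed for this sketch).
    return np.where(x > 0, 1.0, np.where(x < 0, 0.0, value_at_zero))

print(relu_subgrad(np.array([-1.0, 0.0, 2.0])))   # -> [0. 0. 1.]

# 1-D illustration of the "attractive" kink of the L1 norm:
# minimize f(x) = 0.5 * (x - 0.1)**2 + |x|, whose exact minimizer is x* = 0.
def subgradient_descent(x0, steps=200):
    x = x0
    for k in range(1, steps + 1):
        g = (x - 0.1) + np.sign(x)   # one valid subgradient (sign(0) = 0 at the kink)
        x = x - (1.0 / k) * g        # decaying step size
    return x

def forward_backward(x0, steps=200, tau=0.5):
    x = x0
    for _ in range(steps):
        x = x - tau * (x - 0.1)                      # forward step on the smooth part
        x = np.sign(x) * max(abs(x) - tau, 0.0)      # backward step: prox of tau*|.| (soft-threshold)
    return x

print("subgradient descent:", subgradient_descent(2.0))  # hovers near 0 but is not exactly 0
print("forward-backward   :", forward_backward(2.0))     # lands exactly on 0.0
```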

91