Submitted by MLNoober t3_xuogm3 in MachineLearning
MLNoober OP t1_iqwyu29 wrote
Thank you for the replies.
I understand that neural networks can represent non-linear complex functions.
To clarify further:
My question is that a single neuron still computes F(X) = WX + b, which is a linear (affine) function of its input.
Why not use a higher-order function instead, F(X) = W_n X^n + W_(n-1) X^(n-1) + ... + W_1 X + b?
I can imagine the increase in computation needed to implement this, but neural networks were also considered too time-consuming until we started using GPUs for parallel computation.
So if we ignore the implementation details to accomplish this for large networks, are there any inherent advantages to using higher-order neurons?
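A rough numpy sketch of the two pre-activations being compared (the polynomial version below is a hypothetical construction with element-wise powers only, no cross-terms between inputs):

```python
import numpy as np

rng = np.random.default_rng(0)

def linear_neuron(x, w, b):
    # Standard pre-activation: weighted sum plus bias, F(X) = WX + b.
    return w @ x + b

def polynomial_neuron(x, W, b):
    # Hypothetical "higher-order" pre-activation:
    # W[k] weights the element-wise (k+1)-th power of the input,
    # i.e. F(X) = W_1 X + W_2 X^2 + W_3 X^3 + b, with no cross-terms.
    return sum(W[k] @ x ** (k + 1) for k in range(len(W))) + b

x = rng.normal(size=4)        # one 4-dimensional input
w = rng.normal(size=4)        # linear weights
W = rng.normal(size=(3, 4))   # one weight vector per power X^1, X^2, X^3
b = 0.1

print(linear_neuron(x, w, b))
print(polynomial_neuron(x, W, b))
```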
bushrod t1_iqx4ze2 wrote
If I'm understanding correctly, you're proposing each link (dendrite) could have a polynomial transfer function as a way to introduce additional nonlinearity. Is that correct?
First of all, there's the significantly increased computational cost (no free lunch). Second, what is it buying you? Neural nets as they're currently formulated can already approximate any function to arbitrary precision. Your method would do that in a different way, but it would be much less efficient while adding no extra expressive power. Making the activation function non-monotonic seems like a bad idea for obvious reasons (at least for typical neural nets), and making it more complex than a sigmoid seems pointless. The success of ReLU units relative to sigmoids shows that reducing the complexity of the activation function has benefits without significant drawbacks.
It's not a bad question, but I think there's a clear answer.
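(As a side note on the ReLU-vs-sigmoid point, here is a tiny numpy illustration of one commonly cited reason the simpler activation trains better: sigmoid gradients saturate for large pre-activations, while ReLU's gradient stays at 1 for any positive input.)

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

z = np.array([-10.0, -2.0, 0.0, 2.0, 10.0])

# Sigmoid gradients shrink toward 0 for large |z| (saturation),
# while ReLU's gradient is exactly 1 for any positive pre-activation.
sigmoid_grad = sigmoid(z) * (1.0 - sigmoid(z))
relu_grad = (z > 0).astype(float)

print(sigmoid_grad)  # roughly [4.5e-05, 0.105, 0.25, 0.105, 4.5e-05]
print(relu_grad)     # [0. 0. 0. 1. 1.]
```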
jms4607 t1_iqxg1vm wrote
It's not a clear answer. Biological neurons actually have multiplicative effects, not only additive ones. The paper that discusses this is, I think, the Active Dendrites one, something about catastrophic forgetting. The real reason we don't use polynomials is the combinatorial scaling of a d-variable polynomial. However, an MLP cannot approximate y = x^2 to arbitrary accuracy on (-inf, inf), no matter how large your network is. I can think of a proof of this for sigmoid, tanh, and ReLU activations. A polynomial kernel (x^0, x^1, ..., x^n) could fit y = x^2 perfectly, however. An MLP that allowed you to multiply two inputs at each neuron could also learn the function perfectly. I'd be interested in papers that use multiple activation functions and allow input interactions while enforcing Occam's razor through weight regularization or something. I bet nets like that would generalize better.
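A quick sketch of the y = x^2 extrapolation point; using scikit-learn's MLPRegressor as the MLP here is a convenience assumption, not something from the thread:

```python
import numpy as np
from sklearn.neural_network import MLPRegressor  # assumes scikit-learn is installed

# Train both models on y = x^2, but only on the bounded interval [-2, 2].
x_train = np.linspace(-2, 2, 200)
y_train = x_train ** 2

# Degree-2 polynomial fit: recovers y = x^2 essentially exactly,
# so it extrapolates correctly everywhere.
poly = np.polyfit(x_train, y_train, deg=2)

# Small ReLU MLP: fits well inside [-2, 2], but a ReLU net is piecewise
# linear, so outside the training range it can only extrapolate linearly.
mlp = MLPRegressor(hidden_layer_sizes=(64, 64), activation="relu",
                   max_iter=5000, random_state=0)
mlp.fit(x_train.reshape(-1, 1), y_train)

x_test = np.array([0.5, 2.0, 10.0])        # last point is far outside the training data
print(np.polyval(poly, x_test))            # roughly [0.25, 4.0, 100.0]
print(mlp.predict(x_test.reshape(-1, 1)))  # reasonable inside [-2, 2], nowhere near 100 at x = 10
```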
bushrod t1_iqxklya wrote
What's the benefit of neural nets being able to approximate analytic functions perfectly on (-inf, inf)? Standard neural nets can approximate any continuous function to arbitrary accuracy on a bounded range, and training data will always be bounded. If you want to deal with unbounded ranges, there are various forms of symbolic regression designed for that.
jms4607 t1_iqxuph2 wrote
Generalization out of distribution might be the biggest thing holding back ML right now. It's worth thinking about whether the priors we encode in NNs now are to blame. A large MLP is required just to approximate a single biological neuron. Maybe the additive-only nonlinearity we use in each unit now is too simple. I'm sure there is a sweet spot between complex interactions/few neurons and simple interactions/many neurons.
graphicteadatasci t1_iqzr880 wrote
Taylor series are famously bad at generalizing and making predictions on out-of-distribution data. But you are absolutely free to do feature engineering on your inputs. It is very common to take the log of a numeric input, and you always standardize your inputs in some way, either bounding them between 0 and 1 or giving the data mean 0 and std 1. In the same way, you could totally look at x*y effects. If you don't have a reason why two particular values should be multiplied with each other, you could try all combinations, feed them to a decision forest or a logistic regression, and see if any come out as being very important.
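A small numpy sketch of the preprocessing described above (log transform, standardization, and pairwise x*y interaction features; the data and shapes are made up for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.lognormal(size=(1000, 3))   # three skewed, strictly positive raw features

# Common transforms before feeding a model:
log_x = np.log(x)                                # tame heavy tails
std_x = (log_x - log_x.mean(0)) / log_x.std(0)   # mean 0, std 1 per column

# Hand-crafted interaction features: all pairwise products x_i * x_j.
i, j = np.triu_indices(x.shape[1], k=1)
interactions = x[:, i] * x[:, j]

features = np.hstack([std_x, interactions])
print(features.shape)   # (1000, 6): 3 standardized columns + 3 pairwise products
```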
dumbmachines t1_iqx2g66 wrote
>So if we ignore the implementation details to accomplish this for large networks, are there any inherent advantages to using higher-order neurons?
I don't know what that might be, but there is an inherent advantage in stacking layers of act(WX + b), where act is some non-linear function. Instead of guessing which higher-order function you should use for each neuron, you can learn the higher-order function by stacking many simpler non-linear functions. That way the solution is general and can work across many different datasets and modalities.
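A minimal numpy forward pass showing the "stack simple nonlinearities" idea; the weights here are random just to show the structure, and in practice they would be learned:

```python
import numpy as np

rng = np.random.default_rng(0)

def relu(z):
    return np.maximum(0.0, z)

# Three stacked layers of act(Wx + b): each layer alone is simple, but the
# composition relu(W3 @ relu(W2 @ relu(W1 @ x + b1) + b2) + b3) can represent
# complicated nonlinear functions once the weights are learned.
sizes = [4, 32, 32, 1]
weights = [rng.normal(scale=0.5, size=(m, n)) for n, m in zip(sizes[:-1], sizes[1:])]
biases = [np.zeros(m) for m in sizes[1:]]

def forward(x):
    h = x
    for W, b in zip(weights[:-1], biases[:-1]):
        h = relu(W @ h + b)                   # simple nonlinearity, applied repeatedly
    return weights[-1] @ h + biases[-1]       # linear output layer

print(forward(rng.normal(size=4)))
```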
Tgs91 t1_ir0n6hb wrote
You are missing the activation function, which is part of the neuron. It's sometimes presented as a separate layer, but that's just a way to represent nested functions. So it isn't:
F(X) = WX + b
It is:
F(X) = A(WX + b), where A is a nonlinear function.
You could make A a polynomial function, and that would be equivalent to your suggestion. However, polynomials have poor convergence properties and are expensive to compute. Early neural nets used sigmoid activations for non-linearity; now various versions of ReLU are most popular. It turns out that basically any non-linear function gives the model enough freedom to approximate any non-linear relationship, because so many neurons get recombined. In the case of ReLU, it's like using the Epcot ball to approximate a sphere.
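A small numpy sketch of this framing: the proposal amounts to a different choice of A applied to the same WX + b (poly_act and its coefficients below are made up for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)

def relu(z):
    return np.maximum(0.0, z)

def poly_act(z, coeffs=(0.0, 1.0, 0.5, 0.1)):
    # Hypothetical polynomial activation: c0 + c1*z + c2*z^2 + c3*z^3.
    return sum(c * z ** k for k, c in enumerate(coeffs))

W = rng.normal(size=(8, 4))
b = np.zeros(8)
x = rng.normal(size=4)

pre = W @ x + b          # the WX + b part of the neuron
print(relu(pre))         # the usual choice of A
print(poly_act(pre))     # the polynomial idea, expressed purely as a different A
# The polynomial's output grows like z^3 for large |z|, one reason
# training with it tends to be less stable than with ReLU.
```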