jms4607 t1_iqxg1vm wrote on October 3, 2022 at 8:18 PM

Reply to comment by bushrod in [D] Why restrict to using a linear function to represent neurons? by MLNoober

It’s not a clear answer. Our neurons actually have multiplicative effects, not only additive ones. The paper that talks about it is, I think, Active Dendrites, something Catastrophic Forgetting. The real reason we don’t use polynomials is the combinatorial scaling of the number of terms in a d-variable polynomial. However, an MLP cannot approximate y=x^2 to arbitrary accuracy on (-inf, inf), no matter how large the network is. I can think of a proof of this for sigmoid, tanh, and ReLU activations. A polynomial kernel (x^0, x^1, …, x^n) could fit y=x^2 perfectly, however. An MLP that allowed each neuron to multiply two of its inputs could also learn the function perfectly. I’d be interested in papers that use multiple activation functions and allow input interactions, enforcing Occam’s razor through weight regularization or something. I’m sure nets like that would generalize better.
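
To put a rough number on the scaling argument above: a full polynomial of total degree at most n in d variables has C(n + d, d) monomials. The sketch below just evaluates that count. The function name `num_monomials` is made up for illustration, and it assumes a dense polynomial with every cross term included.

```python
from math import comb


def num_monomials(d: int, n: int) -> int:
    """Count monomials of total degree <= n in d variables: C(n + d, d)."""
    return comb(n + d, d)


if __name__ == "__main__":
    # Print how the number of coefficients grows with input dimension d
    # for polynomial degrees 1, 2 and 3.
    for d in (2, 10, 100, 1000):
        print(d, [num_monomials(d, n) for n in (1, 2, 3)])
```

For comparison, a single additive neuron on the same d inputs needs only d weights and a bias, which is the degree-1 column of that printout.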

bushrod t1_iqxklya wrote on October 3, 2022 at 8:47 PM

What's the benefit of neural nets being able to approximate analytic functions perfectly on (-inf, inf)? Standard neural nets can approximate to arbitrary accuracy on a bounded range, and training data will always be bounded. If you want to deal with unbounded ranges, there are various ways of doing symbolic regression that are designed for that.
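
Both sides of this exchange can be seen in a small hand-built example. The sketch below constructs (rather than trains) a one-hidden-layer ReLU network that interpolates y = x^2 at evenly spaced knots on a bounded interval. The helper name `make_relu_interpolant` and the choice of knots are assumptions made for illustration only.

```python
def relu(z: float) -> float:
    return max(0.0, z)


def make_relu_interpolant(lo: float, hi: float, segments: int):
    """Build a one-hidden-layer ReLU net that interpolates x**2 on [lo, hi].

    The weights are set by hand, not learned: one hidden unit per knot,
    with output weight equal to the change in slope at that knot.
    """
    knots = [lo + (hi - lo) * k / segments for k in range(segments + 1)]
    # The chord of x**2 between t_k and t_{k+1} has slope t_k + t_{k+1}.
    slopes = [knots[k] + knots[k + 1] for k in range(segments)]
    weights = [slopes[0]] + [slopes[k] - slopes[k - 1] for k in range(1, segments)]
    bias = lo ** 2  # network output at the left edge of the interval

    def net(x: float) -> float:
        # zip stops at the shorter list, so the last knot gets no hidden unit.
        return bias + sum(w * relu(x - t) for w, t in zip(weights, knots))

    return net


if __name__ == "__main__":
    net = make_relu_interpolant(-1.0, 1.0, segments=20)
    grid = [-1.0 + 2.0 * i / 2000 for i in range(2001)]
    print("max error on [-1, 1]:", max(abs(net(x) - x * x) for x in grid))
    for x in (2.0, 10.0, -10.0):
        print("error at", x, ":", abs(net(x) - x * x))
```

Adding segments shrinks the error inside the interval as far as you like, which is the bounded-range point. Outside it, the network is affine to the right of the last knot and constant to the left of the first, so the gap to x^2 grows without bound. Any finite ReLU network has finitely many linear pieces, so the same thing happens eventually no matter how it is built or trained.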

jms4607 t1_iqxuph2 wrote on October 3, 2022 at 9:56 PM

Generalization out of distribution might be the biggest thing holding back ML rn. It’s worth thinking about whether the priors we encode in NNs now are to blame. A large MLP is required just to approximate a single biological neuron. Maybe the additive unit with a nonlinearity that we use now is too simple. I’m sure there is a sweet spot between complex interactions/few neurons and simple interactions/many neurons.

graphicteadatasci t1_iqzr880 wrote on October 4, 2022 at 8:33 AM

Taylor series are famously bad at generalizing and making predictions on out-of-distribution data. But you are absolutely free to add feature engineering on your inputs. It is very common to take the log of a numeric input, and you always standardize your inputs in some way, either trying to bound them between 0 and 1 or giving the data mean 0 and std 1. In the same way, you could totally look at x*y effects. If you don't have a reason why two values should be multiplied with each other, then you could try all combinations, feed them to a decision forest or logistic regression, and see if any come out as being very important.
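
A minimal sketch of that kind of feature engineering, using only the standard library. The function names and the column-dictionary layout are assumptions for illustration. It also assumes strictly positive, non-constant inputs so that the log and the standardization are defined. Fitting a forest or logistic regression on the result is left out.

```python
import math
from itertools import combinations
from statistics import mean, pstdev


def standardize(column):
    """Rescale a column to mean 0 and (population) std 1."""
    mu, sigma = mean(column), pstdev(column)
    return [(v - mu) / sigma for v in column]  # assumes sigma > 0


def engineer(columns):
    """columns: dict mapping feature name -> list of positive floats."""
    features = {}
    for name, values in columns.items():
        features[name] = standardize(values)
        features[f"log_{name}"] = standardize([math.log(v) for v in values])
    # Try every pairwise product as a candidate x*y interaction feature.
    for a, b in combinations(columns, 2):
        products = [x * y for x, y in zip(columns[a], columns[b])]
        features[f"{a}*{b}"] = standardize(products)
    return features


if __name__ == "__main__":
    data = {"a": [1.0, 2.0, 3.0, 4.0], "b": [2.0, 1.0, 4.0, 3.0]}
    for name, values in engineer(data).items():
        print(name, [round(v, 3) for v in values])
```

With d raw inputs this adds d log features and d(d-1)/2 product features, so trying all pairs stays cheap for modest d. A downstream model's feature importances or coefficients can then show which products matter.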