solresol t1_j16ed0l wrote
Reply to comment by [deleted] in [D] Techniques to optimize a model when the loss over the training dataset has a Power Law type curve. by Dartagnjan
A real-world example I encountered with a client that sells software to ecommerce stores... they wanted to know the number of products in a typical ecommerce store.
It turns out that there's a power law at work. If you sample N stores and count the total number of products across all of them, you get XN products. Great! The mean is X products per store.
But if you sample 2N stores, the total number of products across all the stores comes out at 4XN. That's because you've doubled your chances of finding a single store that on its own has around 2XN products, while the rest of the stores contribute the 2XN you would have expected.
When you only sampled N stores, the average number of products per store was X. When you doubled the size of the sample, the average number of products was 2X.
Similar things happen to the variance.
As you keep increasing the sample size, the average number of products keeps going up.
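If you want to see this effect without any ecommerce data, here's a minimal sketch (assuming numpy, with a Pareto distribution of tail index below 1 as a stand-in for "products per store"; none of this is the client's actual data):

```python
# Minimal sketch: "products per store" drawn from a Pareto distribution with
# tail index alpha < 1, for which the true mean is infinite. The running
# sample mean never settles down -- it keeps climbing as the sample grows.
import numpy as np

rng = np.random.default_rng(0)
alpha = 0.8                                        # alpha <= 1 => the mean does not exist
products = rng.pareto(alpha, size=1_000_000) + 1   # classic Pareto with minimum value 1

for n in (1_000, 10_000, 100_000, 1_000_000):
    print(f"sampled {n:>9,} stores -> average products per store: {products[:n].mean():>12,.1f}")
```

Each bigger sample is likely to contain one store larger than everything seen before, and that single store drags the average up.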
In a sane universe you would expect this to end eventually. This particular client is still growing, still analysing more stores, and they are always finding bigger and bigger stores (stores which on their own have more products than all other stores put together). Eventually they will have analysed every store in the world, and then they will be able to answer the question of "what's the average number of products in an ecommerce store that exists right now?"
But who knows? Maybe stores are being created algorithmically. It wouldn't surprise me. Certainly there will be more ecommerce stores in the future, so we probably can't even ask "what's the average number of products in an ecommerce store over all time?"
Anyway, the punchline is: you can't sample this data to find its mean, and you can't find its variance either.
The original poster is finding that his residuals follow a power law. Depending on how steep the exponent is, it's possible that there is no well-defined mean for his residuals: as he collects more data, his sample mean will keep climbing with the number of data points. If he is defining his loss function in terms of the mean of the residuals (or anything along those lines) then gradient descent is going to have some unresolvable[*] problems. If this is true, gradient descent will take his parameters on an exciting adventure through fractal saddles, where there's always a direction that reduces the loss function while making no improvement to the majority of his data.
This looks to me like what is happening to him.
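If he wants to check, one rough diagnostic (a sketch, not something from the thread) is to estimate the tail exponent of the absolute residuals with a Hill estimator: an estimate at or below 1 means the mean of the residuals isn't well defined, and at or below 2 means the variance isn't either. `residuals` below is a placeholder for whatever array of per-example errors he has.

```python
# Hedged diagnostic sketch: Hill estimate of the tail index of the residuals.
# `residuals` is a placeholder array of per-example errors/losses.
import numpy as np

def hill_estimate(x: np.ndarray, k: int = 500) -> float:
    """Hill estimator of the tail index alpha, using the k largest values of |x|."""
    order = np.sort(np.abs(x))
    top_k = order[-k:]             # the k largest absolute residuals
    threshold = order[-k - 1]      # the (k+1)-th largest, used as the cut-off
    return 1.0 / np.mean(np.log(top_k / threshold))

# Synthetic check: residuals drawn from a Pareto tail with alpha ~ 0.9,
# i.e. heavy enough that the mean (and hence any mean-based loss) is undefined.
rng = np.random.default_rng(1)
residuals = rng.pareto(0.9, size=50_000) + 1
print(f"estimated tail index: {hill_estimate(residuals):.2f}")   # ~0.9
```

If the estimate comes out at or below 1, a mean-based loss is dominated by the handful of largest residuals, which is exactly the behaviour described above.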
[*] Unresolvable with the state of the art at the moment, AFAICT. I'm going to put this on my PhD research to-do list.