bloc97 t1_j49ft0g wrote
Reply to comment by mugbrushteeth in [D] Bitter lesson 2.0? by Tea_Pearce
My bet is on "mortal computers" (a term coined by Hinton). Our current methods for training deep nets are extremely inefficient: CPUs and GPUs have to load data from memory, process it, then write it back. We could eliminate this bandwidth limitation by printing, in effect, a very large differentiable memory cell whose hardware connections represent the connections between neurons, which would let us do inference or backprop in a single step.
bloc97 t1_j2s05hy wrote
It's curious that a 40% pruning of OPT-175 decreases perplexity, but the same effect is not seen in BLOOM... Could be a fluke but might warrant further investigation.
bloc97 t1_j2q0aio wrote
Reply to comment by Agreeable-Run-9152 in [R] On Time Embeddings in Diffusion models by Agreeable-Run-9152
I'm not too familiar with FNOs, but I guess you could start experimenting by adding the time embedding to the "DC component" of the Fourier transform; that would be at least equivalent to adding the time embedding to the entire feature map in a ResNet. A sketch of what I mean is below.
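Here is a rough sketch (PyTorch assumed) of a 1D spectral-convolution layer where the time embedding is injected only into the zero-frequency mode. The class name `SpectralConvWithTime` and the layer structure are illustrative, not taken from an FNO reference implementation.

```python
import torch
import torch.nn as nn

class SpectralConvWithTime(nn.Module):
    """Spectral convolution where the time embedding is added to the DC mode only."""

    def __init__(self, channels, modes, t_dim):
        super().__init__()
        self.modes = modes  # number of low-frequency modes kept
        self.weight = nn.Parameter(
            0.02 * torch.randn(channels, channels, modes, dtype=torch.cfloat)
        )
        self.t_proj = nn.Linear(t_dim, channels)  # time embedding -> per-channel shift

    def forward(self, x, t_emb):
        # x: (batch, channels, length), t_emb: (batch, t_dim)
        x_ft = torch.fft.rfft(x, dim=-1)
        out_ft = torch.zeros_like(x_ft)
        # spectral multiplication on the retained modes
        out_ft[:, :, : self.modes] = torch.einsum(
            "bim,iom->bom", x_ft[:, :, : self.modes], self.weight
        )
        # add the time embedding to the DC component only; a shift of the DC mode
        # amounts to a constant offset over the whole signal, analogous to
        # broadcasting a time embedding over an entire ResNet feature map
        out_ft[:, :, 0] = out_ft[:, :, 0] + self.t_proj(t_emb).to(out_ft.dtype)
        return torch.fft.irfft(out_ft, n=x.shape[-1], dim=-1)
```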
bloc97 t1_j2pz7r6 wrote
Reply to comment by Agreeable-Run-9152 in [R] On Time Embeddings in Diffusion models by Agreeable-Run-9152
FNO? Are you referring to the Fourier Neural Operator?
bloc97 t1_j2pj1c6 wrote
There are many ways to condition a diffusion model on time, but concatenating it to the input is the least efficient method because:
- The first layer of your model is a convolutional layer, and applying a convolution to a "time" image that has the same value everywhere is computationally wasteful. Early conv layers exist to detect variations in an image (e.g. texture); applying the same kernel over and over to a constant image accomplishes nothing.
- By giving t only to the first layer, the network has to waste resources/neurons just to propagate that information forward. This waste is compounded by the fact that the time information has to be carried for every "pixel" of every convolutional feature map (because it is a ConvNet). Why not skip all that and give the time embedding directly to the deeper layers of the network, as in the sketch below?
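A minimal sketch of that usual approach (PyTorch assumed; the function and class names are illustrative, not from a specific codebase): the timestep is embedded once and then injected into every residual block as a per-channel shift.

```python
import math
import torch
import torch.nn as nn

def sinusoidal_embedding(t, dim):
    # standard sinusoidal embedding of the timestep, t: (batch,) -> (batch, dim)
    half = dim // 2
    freqs = torch.exp(-math.log(10000.0) * torch.arange(half, dtype=torch.float32) / half)
    args = t[:, None].float() * freqs[None, :]
    return torch.cat([torch.sin(args), torch.cos(args)], dim=-1)

class TimeConditionedResBlock(nn.Module):
    def __init__(self, channels, t_dim):
        super().__init__()
        self.conv1 = nn.Conv2d(channels, channels, 3, padding=1)
        self.conv2 = nn.Conv2d(channels, channels, 3, padding=1)
        self.t_proj = nn.Linear(t_dim, channels)  # time embedding -> per-channel shift
        self.act = nn.SiLU()

    def forward(self, x, t_emb):
        h = self.act(self.conv1(x))
        # every block receives t directly, broadcast over the spatial dimensions,
        # instead of relying on the first layer to propagate it through the network
        h = h + self.t_proj(t_emb)[:, :, None, None]
        h = self.conv2(self.act(h))
        return x + h
```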
bloc97 t1_j2pfp6l wrote
GANs are generative models; what you want is a discriminative model (for regression). You could start by predicting keypoints, similar to pose estimation: in your case, predict 3D coordinates for the four corners of the QR code plus two points that determine the axis of the cylinder. Then you can easily remove the distortion by inverting the cylindrical projection.
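As a rough illustration of the keypoint-regression idea (PyTorch/torchvision assumed; all names are hypothetical and the backbone choice is arbitrary):

```python
import torch
import torch.nn as nn
from torchvision.models import resnet18

class QRKeypointRegressor(nn.Module):
    """Regresses 3D coordinates for 6 keypoints:
    4 QR-code corners + 2 points defining the cylinder axis."""

    def __init__(self, num_keypoints=6):
        super().__init__()
        self.num_keypoints = num_keypoints
        self.backbone = resnet18(weights=None)
        self.backbone.fc = nn.Identity()               # keep the 512-d pooled feature
        self.head = nn.Linear(512, num_keypoints * 3)  # (x, y, z) per keypoint

    def forward(self, img):                            # img: (batch, 3, H, W)
        feat = self.backbone(img)                      # (batch, 512)
        return self.head(feat).view(-1, self.num_keypoints, 3)

# trained as plain regression, e.g.:
# loss = nn.functional.mse_loss(model(images), target_keypoints)
```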
bloc97 t1_iy57elh wrote
Reply to [D] Training LLMs collaboratively by dogonix
With traditional gradient descent, probably not, as most operators in modern NN architectures are bottlenecked by bandwidth rather than by compute. There's active research on alternative training methods, but they mostly have difficulty detecting malicious agents in the training pool. If you own all of the machines, those algorithms work; but when a non-negligible fraction of agents is malicious, training might fail, or even produce models that contain backdoors or leak private training data.
bloc97 t1_ixjuivv wrote
Reply to [R] Will we run out of data? An analysis of the limits of scaling datasets in Machine Learning - Epochai Pablo Villalobos et al - Trend of ever-growing ML models might slow down if data efficiency is not drastically improved! by Singularian2501
This can be considered good news. If all data is exhausted, people will actually be forced to research more data-efficient algorithms. We humans don't ingest 100 GB of arXiv papers to do research, and we don't need billions of images to paint a cat sitting on a sofa. Until we figure out how to run GPT-3 on smartphones (maybe using neuromorphic computing?), we shouldn't be too worried about the trend of ever-bigger datasets, because small(er) networks can be trained successfully without that much data.
bloc97 t1_iwyeh1x wrote
Reply to comment by pilooch in [D] My embarrassing trouble with inverting a GAN generator. Do GAN questions still get answered? ;-) by _Ruffy_
>the GAN latent space is too compressed/folded
I remember reading a paper showing that GANs often fold many dimensions of the "internal" latent space into singularities, with large swathes of flat space between them (it's related to the mode collapse problem of GANs).
Back to the question: when OP tries to invert the GAN using gradient descent, he is probably getting stuck in a local minimum. Try layering a global-search metaheuristic on top of the gradient descent, like simulated annealing or a genetic algorithm? Something along the lines of the sketch below.
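Even something as simple as random restarts around the gradient descent already gives a crude global search. A minimal sketch (PyTorch assumed, `generator` being any pretrained GAN generator; names and hyperparameters are illustrative):

```python
import torch

def invert_gan(generator, target, latent_dim=512, restarts=16, steps=500, lr=0.05):
    """Find a latent z such that generator(z) reconstructs `target`,
    using gradient descent restarted from several random initializations."""
    best_z, best_loss = None, float("inf")
    for _ in range(restarts):                              # crude global search
        z = torch.randn(1, latent_dim, requires_grad=True)
        opt = torch.optim.Adam([z], lr=lr)
        for _ in range(steps):                             # local gradient descent
            opt.zero_grad()
            loss = torch.nn.functional.mse_loss(generator(z), target)
            loss.backward()
            opt.step()
        if loss.item() < best_loss:                        # keep the best basin found
            best_loss, best_z = loss.item(), z.detach().clone()
    return best_z, best_loss
```

A proper simulated-annealing or genetic-algorithm wrapper would perturb and recombine the candidate latents between rounds instead of sampling them independently, but the structure is the same.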
bloc97 t1_ivqgf0q wrote
Reply to comment by [deleted] in [Discussion] Could someone explain the math behind the number of distinct images that can be generated with a latent diffusion model? by [deleted]
I was considering an unconditional latent diffusion model; for conditional models, the computation becomes much more complex (we might have to use Bayes' rule here). If we use Score-Based Generative Modeling (https://arxiv.org/abs/2011.13456), we could try to find and count all the unique local minima and saddle points, but it is not clear how to do this...
bloc97 t1_ivpzu4j wrote
Reply to [Discussion] Could someone explain the math behind the number of distinct images that can be generated with a latent diffusion model? by [deleted]
Theoretically, the upper bound on the number of distinct images is set by the number of bits required to encode each latent: a 64x64x4 latent whose entries are 32-bit numbers allows (2^32)^(64x64x4) combinations. However, many of those combinations would not be considered "images" (they are "out of distribution"), so the real number could be much, much smaller, depending on the dataset and the network size.
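Spelled out, the exponent in that bound is just the total number of bits in the latent:

```python
bits = 32 * 64 * 64 * 4  # 32-bit entries in a 64x64x4 latent tensor
print(f"(2^32)^(64*64*4) = 2^{bits} possible latents")  # 2^524288
```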
bloc97 t1_ivpper1 wrote
Reply to comment by CPOOCPOS in [D] Is there an advantage in learning when taking the average Gradient compared to the Gradient of just one point by CPOOCPOS
I mean, having the divergence would definitely help, as it gives us additional information about the shape of the loss landscape with respect to the parameters. The general idea would be to prefer areas with negative divergence, while moving through zero-divergence areas as quickly as possible.
Edit: In a sense, the gradient alone only gives us information about the loss function at a single point, while the Laplacian gives us a larger "field of view" of the landscape.
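For what it's worth, the Laplacian of the loss with respect to the parameters (i.e. the trace of the Hessian) can be estimated fairly cheaply with Hutchinson's trick and Hessian-vector products, without ever forming the Hessian. A minimal sketch (PyTorch assumed; `params` would be e.g. `[p for p in model.parameters() if p.requires_grad]`):

```python
import torch

def laplacian_estimate(loss, params, num_samples=10):
    """Stochastic estimate of trace(H), H = Hessian of `loss` w.r.t. `params`."""
    grads = torch.autograd.grad(loss, params, create_graph=True)
    flat_grad = torch.cat([g.reshape(-1) for g in grads])
    trace = 0.0
    for _ in range(num_samples):
        v = torch.randint_like(flat_grad, high=2) * 2 - 1        # Rademacher vector
        hv = torch.autograd.grad(flat_grad, params, grad_outputs=v, retain_graph=True)
        flat_hv = torch.cat([h.reshape(-1) for h in hv])
        trace = trace + torch.dot(v, flat_hv)                    # v^T H v
    return trace / num_samples
```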
bloc97 t1_ivphtmz wrote
Reply to [D] Is there an advantage in learning when taking the average Gradient compared to the Gradient of just one point by CPOOCPOS
Instead of the average, would it be possible to compute the divergence (or Laplacian) very quickly? That might lead to a higher-order optimization method that is faster than simple gradient descent.
bloc97 t1_ivbmwbb wrote
Reply to comment by king_of_walrus in [D] Has anyone tried coding latent diffusion from scratch? or tried other conditioning information aside from image classes and text? by yamakeeen
I'm just guessing, but it's probably pairs of visual-cortex activations and the images seen by an animal (maybe mice)...
bloc97 t1_j63q1nk wrote
Reply to comment by HateRedditCantQuitit in [D] Why are GANs worse than (Latent) Diffusion Models for text2img generation? by TheCockatoo
>It's simpler (which leads to progress)
I wouldn't say current diffusion models are simpler; in fact, they are much more complex than even the most "complex" GAN architectures. However, it's exactly because of all the other points that they have been able to become this complex. A vanilla GAN would never endure this much tweaking without mode collapse. Compare that to even the most basic score-based models, which are always stable.
Sometimes, the "It just works™" proposition is much more appealing than pipeline simplicity or speed.
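Concretely, "basic score-based model" here means a training objective that is nothing more than a regression on the added noise; there is no adversarial game that could collapse. A rough sketch (PyTorch assumed, with a toy noise schedule for illustration):

```python
import math
import torch

def diffusion_loss(model, x, num_timesteps=1000):
    """One training step's loss: add noise to x at a random timestep and
    regress the model's prediction of that noise. Plain MSE, no discriminator."""
    t = torch.randint(0, num_timesteps, (x.shape[0],), device=x.device)
    noise = torch.randn_like(x)
    # toy cosine-style schedule for the signal fraction alpha_bar in (0, 1]
    alpha_bar = torch.cos(t.float() / num_timesteps * math.pi / 2) ** 2
    while alpha_bar.dim() < x.dim():
        alpha_bar = alpha_bar.unsqueeze(-1)
    x_noisy = alpha_bar.sqrt() * x + (1.0 - alpha_bar).sqrt() * noise
    return torch.nn.functional.mse_loss(model(x_noisy, t), noise)
```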