jimmymvp t1_j83v503 wrote

I'm not sure you have a good overview of ML research if this is your claim. Sounds like you've read too many blog posts on transformers. I'd suggest going through some conference proceedings to get a good overview; there's some pretty rigorous (not just stats) stuff out there. I agree, though, that a substantial subset of ML research works on tweaking and pushing the boundaries of what can be achieved with existing methods, which for me personally is exciting to see! A lot of cool stuff came out of scaling up and tweaking the architectures.

2

jimmymvp t1_j7oybk9 wrote

Ok, first off, I'm very curious what the actual problem is that you're solving. Can you describe it in a bit more detail or give a link?

If you have a perfect model that's cheap to compute, you can go with sampling approaches; I don't know what your constraints look like, though. If your state/action space is too big, you might want to reduce it somehow by learning an embedding.

Is the model differentiable? I guess it is if you're using a MILP approach.

I guess some combination of MCTS with value-function learning is plausible if your search space is big, as is done in AlphaZero etc. I find the hybrid aspect of it very interesting, though. It sounds like if you want to do amortized search, you need to combine MCTS with search in continuous space (sampling). Should be simple enough with a perfect model. Some ideas from MuZero would probably come in handy.
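For the "perfect, cheap model" case, here's a minimal sketch of a sampling-based (random-shooting) planner. The 1D task, goal, and action bounds are all made up for illustration; in a real problem the `model` would be your simulator and the cost your actual objective:

```python
import numpy as np

rng = np.random.default_rng(0)

GOAL = 3.0  # hypothetical goal position on a line

def model(state, action):
    # Perfect, cheap dynamics: a bounded step along the line.
    return state + np.clip(action, -1.0, 1.0)

def plan(state, horizon=10, n_samples=500):
    # Random shooting: sample action sequences, roll each through the
    # model, keep the one whose endpoint lands closest to the goal.
    best_cost, best_seq = np.inf, None
    for _ in range(n_samples):
        actions = rng.uniform(-1.0, 1.0, size=horizon)
        s = state
        for a in actions:
            s = model(s, a)
        cost = abs(s - GOAL)
        if cost < best_cost:
            best_cost, best_seq = cost, actions
    return best_seq, best_cost

seq, cost = plan(0.0)
```

In a receding-horizon loop you'd execute only `seq[0]` and replan; MCTS or a learned value function would replace the blind uniform sampling when the search space gets big.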

1

jimmymvp t1_j7aend6 wrote

Indeed, if your model is bad at modeling the data, there's not much use in computing the likelihoods. If you just want to sample images that look cool, you don't care that much about likelihoods. However, there are certain use-cases where we do care about exact likelihoods, such as estimating normalizing constants and providing guarantees for MCMC. Granted, you can always run MCMC with something close to a proposal distribution, but obtaining nice guarantees on convergence and mixing times (correctness?) is difficult then; I don't know how you're supposed to do that with a proposal whose likelihood you can't evaluate. Similarly with importance sampling: you only obtain correct weights if you have the exact likelihoods; otherwise the approximation is not just in the model but also in the estimator.
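As a minimal illustration of why the weights need exact densities, here's self-normalised importance sampling where both the target and the proposal likelihoods are evaluable exactly (toy Gaussians, hypothetical numbers; replace `log_q` with a flow's exact `log_prob` in the setting discussed above):

```python
import numpy as np

rng = np.random.default_rng(0)

# Target p = N(0, 1); proposal q = N(1, 2²).  Both log-densities are
# known exactly here, which is what makes the weights correct.
def log_p(x):
    return -0.5 * x**2 - 0.5 * np.log(2.0 * np.pi)

def log_q(x):
    return -0.5 * ((x - 1.0) / 2.0)**2 - np.log(2.0) - 0.5 * np.log(2.0 * np.pi)

x = rng.normal(1.0, 2.0, size=200_000)   # samples from q
log_w = log_p(x) - log_q(x)              # exact importance weights
w = np.exp(log_w - log_w.max())
w /= w.sum()                             # self-normalise

# Estimate E_p[x²] (true value: 1 for a standard normal).
est = np.sum(w * x**2)
```

If `log_q` were only a lower bound (as with a VAE), the weights themselves would be biased, not just the model.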

This is the way I see it at least, but I'll be sure to read the aforementioned paper. I'm also not sure how much having the lower bound hurts you in estimation.

2

jimmymvp t1_j71cgkw wrote

There's a trick for gradually expanding your latent dimension with normalising flows: if you assume the dimensions are independent up to a certain point, you can sample the extra dimensions from a base distribution and concatenate them in the middle of the flow.

Again, MCMC sampling and simulation-based inference are examples. Imagine you have an energy function that describes the distribution (you don't have data): how do you sample from this distribution? You would do some MCMC. How would you arrive at a good proposal distribution to make the MCMC algorithm more efficient? You would fit the proposal based on the limited data that you have, or on inductive biases such as certain invariances, etc.
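A minimal sketch of that energy-function setting, with a toy double-well energy and a hand-picked wide Gaussian standing in for a fitted proposal. The point is that the independence Metropolis-Hastings acceptance ratio needs the proposal's log-likelihood at both the current and the proposed point, which is exactly what a flow gives you:

```python
import numpy as np

rng = np.random.default_rng(0)

def energy(x):
    # Toy double-well; the target is p(x) ∝ exp(-energy(x)).
    return (x**2 - 1.0)**2

def log_q(x):
    # Wide Gaussian proposal N(0, 2²); in practice this could be a
    # normalising flow fitted to pilot samples or symmetries.
    return -0.5 * (x / 2.0)**2 - np.log(2.0) - 0.5 * np.log(2.0 * np.pi)

def independence_mh(n_steps=30_000):
    x = 0.0
    chain = np.empty(n_steps)
    for t in range(n_steps):
        x_prop = rng.normal(0.0, 2.0)
        # Acceptance ratio needs log q at BOTH points — this is where
        # an evaluable proposal likelihood is essential.
        log_alpha = (energy(x) - energy(x_prop)) + (log_q(x) - log_q(x_prop))
        if np.log(rng.uniform()) < log_alpha:
            x = x_prop
        chain[t] = x
    return chain

chain = independence_mh()
```

A better-fitted proposal raises the acceptance rate; the correctness guarantee comes from the exact `log_q` in the ratio.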

3

jimmymvp t1_j71bvhf wrote

The problem with diffusion from an SDE view is that you still don't have exact likelihoods: to keep things tractable you're again not computing the exact Jacobian, and you have ODE-solver errors on top. People mostly resort to the Hutchinson trace estimator, since otherwise the computation would be too expensive, so I don't think diffusion in this form is going to enter the MCMC world anytime soon.
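The Hutchinson estimator itself is simple. A sketch with a plain matrix standing in for the Jacobian; in a continuous normalising flow the `matvec` would be a vector-Jacobian product, so the full Jacobian is never materialised, at the price of the stochastic error discussed above:

```python
import numpy as np

rng = np.random.default_rng(0)

def hutchinson_trace(matvec, dim, n_probes=20_000):
    # tr(A) ≈ (1/K) Σ_k v_kᵀ (A v_k) with Rademacher probes v_k.
    # Only matrix-vector products are needed, never the full matrix.
    est = 0.0
    for _ in range(n_probes):
        v = rng.choice([-1.0, 1.0], size=dim)
        est += v @ matvec(v)
    return est / n_probes

A = rng.normal(size=(50, 50))
true_trace = np.trace(A)
approx = hutchinson_trace(lambda v: A @ v, 50)
```

The estimate is unbiased but noisy, which is precisely why the resulting log-likelihoods are stochastic rather than exact.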

1

jimmymvp t1_j6w4ezb wrote

Any application where you need exact likelihoods, flows are king. Such is the case, for example, if you're learning a sampling distribution for MCMC, estimating normalizing constants (I believe physics has a lot of these problems), etc.
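A minimal example of the exact-likelihood property, using a single affine transform as the "flow" (parameters `a`, `b` are hypothetical). The change-of-variables formula gives the density exactly, with no bound and no estimator:

```python
import numpy as np

# One-layer "flow": x = f(z) = a·z + b with base z ~ N(0, 1).
# Change of variables:  log p(x) = log p_base(f⁻¹(x)) + log |det ∂f⁻¹/∂x|
a, b = 2.0, 1.0   # hypothetical flow parameters

def log_prob_flow(x):
    z = (x - b) / a                               # invert the flow
    log_base = -0.5 * z**2 - 0.5 * np.log(2.0 * np.pi)
    log_det = -np.log(abs(a))                     # |df⁻¹/dx| = 1/|a|
    return log_base + log_det

# Sanity check: this flow's density is exactly that of N(b, a²).
x = 0.3
closed_form = (-0.5 * ((x - b) / a)**2
               - np.log(abs(a)) - 0.5 * np.log(2.0 * np.pi))
```

Deep flows stack many such invertible layers and sum the log-determinants, keeping the likelihood exact throughout.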

9

jimmymvp t1_j67uru2 wrote

Diffusion models are effectively score-based: the reversal of the forward process is Gaussian, and via the noise estimate you're effectively using scores of Gaussians in the reverse process. The time variable is irrelevant in the sense of scale; discrete time and continuous time do roughly the same thing. The difference is that one is tied to a specific discretization of the SDE while the other can be solved to arbitrary precision (it also matters whether you take steps with respect to variance or with respect to time). Essentially, the continuous formulation is the limit of the discrete one, so you can take a discrete sampling method and turn it into a continuous SDE/ODE.
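A toy sketch of the continuous-time view, in the simplest setting where the score is known in closed form: take the data distribution to be N(0, 1), which is stationary under the VP-style forward SDE dx = -x/2 dt + dW, so the score of every marginal is s(x) = -x. Euler-Maruyama on the reverse-time SDE then recovers the data distribution, and halving `dt` just refines the same continuous process (a learned noise/score network would replace the exact score in a real model):

```python
import numpy as np

rng = np.random.default_rng(0)

def reverse_sde_sample(n, dt=1e-2, T=8.0):
    # Euler–Maruyama on the reverse-time SDE dx = [f - g²·s] dt + g dW̄
    # with f(x) = -x/2, g = 1, and the exact score s(x) = -x.
    x = rng.normal(0.0, 3.0, size=n)   # start from a deliberately wrong prior
    for _ in range(int(T / dt)):
        score = -x
        drift = -0.5 * x - score        # f - g²·s = x/2
        x = x - drift * dt + np.sqrt(dt) * rng.normal(size=n)
    return x

samples = reverse_sde_sample(50_000)
```

Swapping the stochastic step for the probability-flow ODE would give a deterministic sampler for the same marginals.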

2

jimmymvp t1_j4fcjly wrote

Hm, I'm not sure about that. There's the mixture-of-experts idea, which doesn't exactly stack models but rather specializes multiple shallow models on parts of the data, so each data point gets assigned to a specific expert. What you need then is an assignment rule, mostly implemented as a classifier, and it's been shown that this is cheaper in compute at evaluation time. I'm not sure if the idea has been abandoned by now, but Google Brain published a paper on this and there were subsequent works.
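A minimal sketch of the assignment idea, with hand-set experts and a trivial sign-based "classifier" standing in for learned ones (in practice both the experts and the gate are trained jointly):

```python
# Two shallow "experts", each specialised on part of the input space.
experts = [
    lambda x: 2.0 * x,   # expert for negative inputs
    lambda x: x * x,     # expert for non-negative inputs
]

def gate(x):
    # Assignment rule: route each data point to exactly one expert.
    return 0 if x < 0 else 1

def moe_predict(x):
    # Only the selected shallow expert runs -> cheap at eval time.
    return experts[gate(x)](x)
```

The compute saving is exactly this sparsity: evaluation cost is one expert plus the gate, not the sum of all experts.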

2

jimmymvp t1_j3q6za7 wrote

I would like someone to point me to arguments for why diffusion in the latent representation space makes sense (since I already have a generative model with the VAE, and I can do Langevin MCMC sampling in the latent). Why should the samples be better compared to a standard VAE with more sophisticated sampling (MCMC), or to diffusion alone? I.e., why do I need a double generative model? Is it because it's faster? It seems to me like there should be a better way, but I'm genuinely curious what the arguments are :) (except in this case, where we have discrete data, for which formulations also exist (e.g. simplex diffusion)).
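For reference, the Langevin-in-latent-space alternative mentioned above, sketched with a standard-normal energy standing in for a real VAE latent energy (a decoder would then map the resulting `z`'s to images; the energy, step size, and dimensions are all illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)

def grad_energy(z):
    # E(z) = ||z||²/2, i.e. a standard-normal "latent prior";
    # a real VAE latent energy would supply its own gradient.
    return z

def langevin(n_samples=20_000, dim=2, step=1e-2, n_steps=1500):
    # Unadjusted Langevin dynamics: z ← z − step·∇E(z) + √(2·step)·ξ.
    z = 3.0 * rng.normal(size=(n_samples, dim))   # poor initialisation
    for _ in range(n_steps):
        noise = rng.normal(size=z.shape)
        z = z - step * grad_energy(z) + np.sqrt(2.0 * step) * noise
    return z

z = langevin()
```

The question above is then why training a second (diffusion) model in this space beats running such a sampler on the model you already have.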

3

jimmymvp t1_j3q5wmj wrote

Sorry, what's the "active" part here? Is the model actually generative? I'm aware of Karl Friston and the free-energy principle. Is the active part the input-stream selection? I thought the active part refers to learning, in the sense that I get to pick my training data along the way. Sounds like what you're doing is akin to Gato from DeepMind, with tokenization, and is about multi-modal policies (modulo the hierarchical processing and attention).

Is there a math writeup somewhere?

1