jimmymvp t1_j807b94 wrote
There's a bunch of cool work on using constrained optimization as a layer in neural nets, i.e. differentiating through the argmin. I'm not sure if this answers your question.
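To make the "differentiation through argmin" idea concrete, here's a rough numpy sketch (my own toy example, not from any specific paper): for a strictly convex quadratic objective, the implicit function theorem gives the Jacobian of the argmin with respect to the parameters in closed form, and we can check it against finite differences.

```python
import numpy as np

# Toy example: differentiate through x*(theta) = argmin_x 0.5 x^T A x - theta^T x.
# The optimality condition A x* - theta = 0 plus the implicit function theorem
# give dx*/dtheta = A^{-1}, without unrolling any inner solver.

A = np.array([[3.0, 1.0], [1.0, 2.0]])  # symmetric positive definite
theta = np.array([1.0, -1.0])

def argmin_layer(theta):
    # Closed-form minimizer of the quadratic objective.
    return np.linalg.solve(A, theta)

# Analytic Jacobian from implicit differentiation.
jac_implicit = np.linalg.inv(A)

# Finite-difference check of the same Jacobian.
eps = 1e-6
jac_fd = np.stack([
    (argmin_layer(theta + eps * e) - argmin_layer(theta - eps * e)) / (2 * eps)
    for e in np.eye(2)
], axis=1)

print(np.allclose(jac_implicit, jac_fd, atol=1e-5))  # True
```

The same recipe (differentiate the optimality conditions) is what the "optimization as a layer" line of work generalizes to constrained and non-quadratic problems.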
jimmymvp t1_j806dx2 wrote
Reply to comment by AdFew4357 in [D] Critique of statistics research from machine learning perspectives (and vice versa)? by fromnighttilldawn
Just communicating what I've heard. Nevertheless, I think the whole interpretable-ML community (at the very least) would disagree with you on this one :). Reducing ML to "plug and chug", well... that speaks for itself :D
jimmymvp t1_j7yubak wrote
Reply to [D] Critique of statistics research from machine learning perspectives (and vice versa)? by fromnighttilldawn
A pretty famous stats professor once told me that he should've switched to ML a long time ago. Now he does ML research, obviously very rigorous. He said that stats, to a large extent, makes up questions that aren't practically useful.
jimmymvp t1_j7oybk9 wrote
Reply to Model/paper ideas: reinforcement learning with a deterministic environment [D] by EmbarrassedFuel
Ok, first off, I'm very curious what the actual problem you're solving is. Can you describe it in a bit more detail or give a link?
If you have a perfect model that's cheap to compute, you can go with sampling approaches, though I don't know what your constraints look like. If your state/action space is too big, you might want to reduce it somehow by learning an embedding.
Is the model differentiable? I guess it is if you're using a MILP approach.
I guess some combination of MCTS with value-function learning is plausible if your search space is big, as is done in AlphaZero etc. I find the hybrid aspect of it very interesting though. It sounds like if you want to do amortized search, you need to combine MCTS with search in continuous space (sampling). That should be simple enough with a perfect model. Some ideas from MuZero would probably come in handy.
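To sketch what I mean by MCTS plus a value function with a perfect model, here's a toy example I made up (chain environment, hand-coded heuristic value function standing in for a learned one; none of this is from the original post):

```python
import math, random

# Toy deterministic chain on states 0..GOAL; reward 1 for reaching GOAL.
# UCT-style MCTS with a perfect model and a value-function bootstrap at the
# leaves, loosely in the spirit of AlphaZero's search.

GOAL = 5
ACTIONS = (-1, +1)  # move left or right

def step(state, action):            # the perfect, cheap model
    nxt = min(max(state + action, 0), GOAL)
    return nxt, (1.0 if nxt == GOAL else 0.0), nxt == GOAL

def value(state):                   # stand-in for a learned value network
    return state / GOAL

class Node:
    def __init__(self, state):
        self.state, self.visits, self.total, self.children = state, 0, 0.0, {}

def search(root_state, n_sims=200, c=1.4, depth=4):
    root = Node(root_state)
    for _ in range(n_sims):
        node, path, ret, done = root, [root], 0.0, False
        for _ in range(depth):
            if done:
                break
            if len(node.children) < len(ACTIONS):      # expand one new child
                a = random.choice([a for a in ACTIONS if a not in node.children])
                s, r, done = step(node.state, a)
                node.children[a] = Node(s)
                node, ret = node.children[a], ret + r
                path.append(node)
                if not done:
                    ret += value(node.state)           # bootstrap at the leaf
                break
            # UCB selection among fully expanded children
            a = max(node.children, key=lambda a: node.children[a].total /
                    (node.children[a].visits + 1e-9) +
                    c * math.sqrt(math.log(node.visits + 1) /
                                  (node.children[a].visits + 1e-9)))
            s, r, done = step(node.state, a)
            node, ret = node.children[a], ret + r
            path.append(node)
        for n in path:                                 # backup
            n.visits += 1
            n.total += ret
    return max(root.children, key=lambda a: root.children[a].visits)

random.seed(0)
best = search(2)
print(best)  # should prefer +1, i.e. moving toward the goal
```

The value bootstrap is what lets you cut the search depth; swapping the tabular toy model for your MILP/simulator is the part that would need real work.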
jimmymvp t1_j7afxx4 wrote
Reply to comment by schwagggg in [D] Normalizing Flows in 2023? by wellfriedbeans
You can perfectly well do the reverse KL with diffusion models, see here:
jimmymvp t1_j7aex0t wrote
Reply to comment by PHEEEEELLLLLEEEEP in [D] Normalizing Flows in 2023? by wellfriedbeans
In theory yes; in practice it's not exact, since it's approximated via a trace estimator and an ODE solver.
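For what the trace estimator part means concretely, here's a minimal numpy sketch (my own toy, with a random matrix standing in for the Jacobian): the Hutchinson estimator gives tr(J) = E[v^T J v] for probe vectors v with E[v v^T] = I, so you never have to materialize or fully evaluate the Jacobian.

```python
import numpy as np

# Hutchinson trace estimator: tr(J) = E_v[v^T J v] with Rademacher probes v.
# In continuous-time flows, v^T J v is computed with Jacobian-vector products;
# here a dense random matrix stands in for the Jacobian.

rng = np.random.default_rng(0)
J = rng.normal(size=(50, 50))

def hutchinson_trace(J, n_samples=20000):
    d = J.shape[0]
    v = rng.choice([-1.0, 1.0], size=(n_samples, d))   # Rademacher probes
    # v^T J v for each probe, averaged over probes.
    return np.einsum('ni,ij,nj->n', v, J, v).mean()

est = hutchinson_trace(J)
exact = np.trace(J)
print(est, exact)  # close, but only in expectation: the estimate is stochastic
```

The estimator is unbiased but noisy, which is exactly why the resulting log-likelihood is approximate rather than exact.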
jimmymvp t1_j7aend6 wrote
Reply to comment by badabummbadabing in [D] Normalizing Flows in 2023? by wellfriedbeans
Indeed, if your model is bad at modeling the data, there's not much use in computing the likelihoods. If you just want to sample images that look cool, you don't care much about likelihoods. However, there are certain use-cases where we care about exact likelihoods: estimating normalizing constants and providing guarantees for MCMC. Granted, you can always run MCMC with something close to a proposal distribution. However, obtaining nice guarantees on convergence and mixing times (correctness?) is difficult then; I don't know how you're supposed to do this with a proposal whose likelihood you can't evaluate. Similarly for importance sampling: you can only obtain correct weights if you have the correct likelihoods; otherwise it's approximate not just in the model but also in the estimator.
This is the way I see it at least, but I'll be sure to read the aforementioned paper. I'm also not sure how much having only a lower bound hurts the estimation.
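To illustrate the importance-sampling point numerically, here's a toy sketch (my own example): estimating E_p[x^2] under p = N(0,1) from samples of q = N(0, 1.5^2), once with q's exact log-density and once with a deliberately mis-scaled one, mimicking a density you can only bound.

```python
import numpy as np

# Importance sampling: w = p(x)/q(x) is only correct if q's density is exact.
# A constant offset in log q (a stand-in for a bound / approximation error)
# biases the estimate of E_p[x^2] = 1.

rng = np.random.default_rng(1)

def log_normal_pdf(x, mu, sigma):
    return -0.5 * ((x - mu) / sigma) ** 2 - np.log(sigma * np.sqrt(2 * np.pi))

n = 200_000
x = rng.normal(0.0, 1.5, size=n)                  # samples from q

w_exact = np.exp(log_normal_pdf(x, 0, 1) - log_normal_pdf(x, 0, 1.5))
w_wrong = np.exp(log_normal_pdf(x, 0, 1) - (log_normal_pdf(x, 0, 1.5) - 0.3))

est_exact = np.mean(w_exact * x**2)               # consistent, tends to 1.0
est_wrong = np.mean(w_wrong * x**2)               # systematically off

print(est_exact, est_wrong)
```

With the exact density the estimate converges to 1; with the perturbed log-density the weights are all scaled by e^{0.3}, so the estimator stays biased no matter how many samples you draw.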
jimmymvp t1_j75qyff wrote
Reply to comment by based_goats in [D] Normalizing Flows in 2023? by wellfriedbeans
Would be interested in that, yes.
jimmymvp t1_j71cgkw wrote
Reply to comment by badabummbadabing in [D] Normalizing Flows in 2023? by wellfriedbeans
There's a trick for gradually expanding your latent dimension with normalizing flows: if you assume the dimensions are independent up to a certain point, you can sample from a base distribution and concatenate in the middle of the flow.
Again, MCMC sampling and simulation-based inference are examples. Imagine you have an energy function that describes the distribution (you don't have data); how do you sample from this distribution? You would do some MCMC. How would you arrive at a good proposal distribution to make the MCMC algorithm more efficient? You would fit the proposal based on some limited data that you have, or on inductive biases such as certain invariances.
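Here's a minimal sketch of that MCMC use-case (my own toy: a plain Gaussian stands in for a fitted flow proposal): an independence Metropolis-Hastings sampler targeting an unnormalized energy-based density. The point is that the acceptance ratio needs the proposal's density evaluated exactly.

```python
import numpy as np

# Independence MH: accept y ~ q with prob min(1, pi(y) q(x) / (pi(x) q(y))).
# pi is known only up to a constant (an energy function); q must have an
# exactly evaluable density, which is what a normalizing flow provides.

rng = np.random.default_rng(2)

def log_target(x):                 # unnormalized log-density of N(1, 0.5^2)
    return -0.5 * ((x - 1.0) / 0.5) ** 2

def log_proposal(x):               # log-density of the proposal N(1, 1)
    return -0.5 * (x - 1.0) ** 2   # constants cancel in the MH ratio

samples, x = [], 0.0
for _ in range(50_000):
    y = rng.normal(1.0, 1.0)
    log_alpha = (log_target(y) + log_proposal(x)) - (log_target(x) + log_proposal(y))
    if np.log(rng.uniform()) < log_alpha:
        x = y
    samples.append(x)

samples = np.array(samples[5_000:])    # drop burn-in
print(samples.mean(), samples.std())   # close to the target's 1.0 and 0.5
```

The better the proposal matches the target (which is what you'd fit the flow for), the higher the acceptance rate and the faster the mixing.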
jimmymvp t1_j71bvhf wrote
Reply to comment by based_goats in [D] Normalizing Flows in 2023? by wellfriedbeans
The problem with diffusion from an SDE view is that you still don't have exact likelihoods: you're again not computing the exact Jacobian (to keep it tractable), and you have ODE-solver errors. People mostly resort to the Hutchinson trace estimator, since otherwise it would be too expensive to compute, so I don't think diffusion in this form is going to enter the MCMC world anytime soon.
jimmymvp t1_j6w4ezb wrote
Reply to [D] Normalizing Flows in 2023? by wellfriedbeans
For any application where you need exact likelihoods, flows are king. Such is the case, for example, if you're learning a sampling distribution for MCMC or estimating normalizing constants (I believe physics has a lot of these problems), etc.
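To show what "exact likelihoods" means here, a minimal change-of-variables sketch (my own toy, with a single affine map playing the role of the flow):

```python
import numpy as np

# For an invertible map y = f(x) = a*x + b with base density N(0, 1), the
# change-of-variables formula gives the pushforward density exactly:
#   log p(y) = log N(f^{-1}(y); 0, 1) - log|a|.
# Real flows compose many such invertible maps, summing the log-det terms.

a, b = 2.0, 1.0   # toy flow parameters

def flow_log_prob(y):
    x = (y - b) / a                                   # invert the flow
    log_base = -0.5 * x**2 - 0.5 * np.log(2 * np.pi)  # N(0, 1) log-density
    return log_base - np.log(abs(a))                  # log-det correction

# The pushforward of N(0, 1) through f is N(b, a^2); compare directly.
y = np.linspace(-3, 5, 9)
closed_form = -0.5 * ((y - b) / a) ** 2 - np.log(a * np.sqrt(2 * np.pi))
print(np.allclose(flow_log_prob(y), closed_form))  # True
```

No estimator, no solver: the log-likelihood is exact up to floating point, which is what the MCMC and normalizing-constant use-cases rely on.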
jimmymvp t1_j67uru2 wrote
Diffusion models are effectively score-based: the reversal of the forward process is Gaussian, and through the noise estimate you're effectively using scores of Gaussians in the reverse process. The time variable is irrelevant in the sense of scale; discrete time and continuous time do roughly the same thing. The difference is that one is tied to a specific discretization of the SDE while the other can be solved to arbitrary precision; it also matters whether you take steps with respect to variance or with respect to time. Essentially, the continuous formulation should be the limit of the discrete one, so you can take a discrete sampling method and turn it into a continuous SDE/ODE.
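A small numerical sketch of the "discrete is a discretization of the continuous SDE" point (my own toy, forward process only): Euler-Maruyama steps on the variance-preserving SDE dX = -0.5 β X dt + sqrt(β) dW reproduce the known continuous-time marginal as the step size shrinks.

```python
import numpy as np

# Forward VP-SDE with constant beta. The continuous-time marginal at time t is
# mean exp(-0.5*beta*t) * x0 and variance 1 - exp(-beta*t); a fine enough
# Euler-Maruyama discretization recovers both.

rng = np.random.default_rng(3)
beta, t, x0 = 1.0, 1.0, 2.0
n_steps, n_paths = 200, 50_000
dt = t / n_steps

x = np.full(n_paths, x0)
for _ in range(n_steps):
    x = x - 0.5 * beta * x * dt + np.sqrt(beta * dt) * rng.normal(size=n_paths)

print(x.mean(), np.exp(-0.5 * beta * t) * x0)   # both near 1.213
print(x.var(), 1 - np.exp(-beta * t))           # both near 0.632
```

A DDPM-style discrete chain corresponds to one particular choice of these steps; other step schedules (or an ODE solver on the probability-flow ODE) approximate the same continuous limit.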
jimmymvp t1_j4fcjly wrote
Reply to comment by chaosmosis in Why is Super Learning / Stacking used rather rarely in practice? [D] by Worth-Advance-1232
Hm, I'm not sure about that. There's the mixture-of-experts idea, which isn't exactly stacking; rather, it specializes multiple models on parts of the data, so each data point gets assigned to a specific shallow model. What you need then is an assignment rule, mostly implemented as a classifier, and it's been shown that this is cheaper in terms of compute at evaluation time. I'm not sure if the idea has been abandoned by now, but Google Brain published a paper on this and there were subsequent works.
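A toy sketch of that hard-gating idea (entirely my own construction, not from any specific MoE paper): two shallow linear experts, each fit on one region of the input, with a trivial gate standing in for a learned classifier. Only one expert is evaluated per point, which is where the evaluation-time savings come from.

```python
import numpy as np

# Piecewise-linear target y = |x|: one linear expert per region, a gate
# assigning each point to exactly one expert.

rng = np.random.default_rng(4)
x = rng.uniform(-1, 1, size=500)
y = np.abs(x) + 0.01 * rng.normal(size=500)

def fit_expert(xs, ys):
    # Least-squares fit of y = w*x + c on one region of the data.
    A = np.stack([xs, np.ones_like(xs)], axis=1)
    return np.linalg.lstsq(A, ys, rcond=None)[0]

gate = x < 0                                   # stand-in for a learned classifier
experts = [fit_expert(x[gate], y[gate]), fit_expert(x[~gate], y[~gate])]

def predict(xq):
    w, c = experts[0] if xq < 0 else experts[1]  # evaluate one expert only
    return w * xq + c

preds = np.array([predict(v) for v in x])
mse = np.mean((preds - y) ** 2)
print(mse)  # small: each expert fits its own region well
```

A single linear model couldn't fit |x| at all; the gate plus two shallow experts can, at the cost of one extra (cheap) routing decision per point.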
jimmymvp t1_j3q74ms wrote
Reply to comment by Mental-Swordfish7129 in [N] What's next for AI? by vsmolyakov
So the active part is the self-predictive part?
jimmymvp t1_j3q6za7 wrote
Reply to comment by eyeswideshhh in [R] Diffusion language models by benanne
I would like someone to point me to arguments for why diffusion in the latent representation space makes sense (since I already have a generative model with the VAE, and I can do Langevin MCMC sampling in the latent space). Why should the samples be better compared to a standard VAE with more sophisticated sampling (MCMC), or to plain diffusion? I.e., why do I need a double generative model? Is it because it's faster? It seems to me like there should be a better way, but I'm genuinely curious what the arguments are :) (except that in this case we have discrete data, for which formulations also exist, e.g. simplex diffusion).
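For concreteness, this is the kind of latent-space Langevin sampling I mean (a toy sketch of my own: unadjusted Langevin dynamics targeting a standard-normal latent density; with a real VAE the score would come from the learned latent distribution instead):

```python
import numpy as np

# Unadjusted Langevin dynamics: z_{k+1} = z_k + eps * score(z_k) + sqrt(2*eps) * noise.
# Here the target latent density is N(0, 1), whose score is simply -z.

rng = np.random.default_rng(5)

def score(z):             # gradient of log N(0, 1); a VAE would supply this
    return -z

eps = 0.01
z = np.zeros(2000)        # 2000 parallel chains
for _ in range(2000):
    z = z + eps * score(z) + np.sqrt(2 * eps) * rng.normal(size=z.shape)

print(z.mean(), z.var())  # near 0 and 1, up to O(eps) discretization bias
```

Decoding such z's through the VAE decoder is the "VAE + sophisticated sampling" alternative I'm contrasting with stacking a second (diffusion) generative model on top.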
jimmymvp t1_j3q66hi wrote
Reply to comment by Mental-Swordfish7129 in [N] What's next for AI? by vsmolyakov
I meant more like research papers from top conferences in ML (NeurIPS, ICLR, ICML).
jimmymvp t1_j3q5wmj wrote
Reply to comment by Mental-Swordfish7129 in [N] What's next for AI? by vsmolyakov
Sry, what's the "active" part here? Is the model actually generative? I'm aware of Karl Friston and the free-energy principle. Is the active part the input-stream selection? I thought the active part refers to learning, in the sense that I get to pick my training data along the way. Sounds like what you're doing is akin to Gato from DeepMind, with tokenization and multi-modal policies (modulo the hierarchical processing and attention).
Is there a math writeup somewhere?
jimmymvp t1_j3ldd4r wrote
Reply to comment by Mental-Swordfish7129 in [N] What's next for AI? by vsmolyakov
Can you reference some works along these lines? "online unsupervised generative models implementing something akin to the free energy principle and active inference"
jimmymvp t1_j83v503 wrote
Reply to comment by AdFew4357 in [D] Critique of statistics research from machine learning perspectives (and vice versa)? by fromnighttilldawn
I'm not sure you have a good overview of ML research if this is your claim; it sounds like you've read too many blog posts on transformers. I'd suggest going through some conference proceedings to get a good overview: there's some pretty rigorous (not just stats) stuff out there. I agree, though, that a substantial subset of ML research works towards tweaking and pushing the boundaries of what can be achieved with existing methods, which for me personally is exciting to see! A lot of cool stuff came out of scaling up and tweaking architectures.