ZombieRickyB t1_iy6md04 wrote on November 29, 2022 at 3:14 AM

Reply to comment by backtorealite in [D] What method is state of the art dimensionality reduction by olmec-akeru

Yep, same idea. You're optimizing a function, there is zero guarantees on any particular distance based distortion. You can use it for visualization but little else

backtorealite t1_iy6uivr wrote on November 29, 2022 at 4:22 AM

But saying it’s good for visualization is equivalent to saying it’s good for decomposing data into a 2D framework. So either it’s completely useless and shouldn’t be used for visualization or has some utility in downstream analysis. Doesn’t really make sense to say both. And we all know it’s not completely useless so I think it’s a bit unfair to say it should only ever be used for visualization.

ZombieRickyB t1_iy70dqa wrote on November 29, 2022 at 5:19 AM

Visualization does not mean it's good for working in. The nonisometric nature is the killer. Your space is fundamentally distorted, what you input is not necessarily reflected in the visualizations.

Here is a stackexchange thread discussing the main issue (same holds for UMAP) but I'll highlight a different example: https://stats.stackexchange.com/questions/263539/clustering-on-the-output-of-t-sne

Probably the most famous example of showing why you really can't assume meaning in general comes from the stereographic projection. This is a diffeomorphism between the sphere without a point and the plane. Points close by on the punctured sphere, specifically around the puncture, absolutely need not be close on the plane; they'll actually be quite far apart. The minute you start working with diffeomorphisms of arbitrarily high conformal distortion, the worse you get your results. Is there data that is reasonably "sphere-valued"? Absolutely, I see it a bunch. Any attempt to flatten it will destroy the geometry in some fashion. This is just one example.

You have two things with these setups that put you at risk: you're taking a minimum of some sort, which never needs to be a smooth mapping, and generally a log of something appears somewhere, which fundamentally changes geometry. It's there, it's just usually hidden because someone at some point is assuming something is approximately Gaussian. That's why the energies look the way they do.

olmec-akeru OP t1_iy7bv3u wrote on November 29, 2022 at 7:36 AM

You can resolve the isometric constraint by using a local distance metric dependent on the local curvature: hint, look at the Riemann curvature tensor.

backtorealite t1_iy7z5lu wrote on November 29, 2022 at 12:49 PM

But I feel like this is equivalent to a statistician telling me to not trust my XGBoost model with 99% accuracy but is fine with my linear model with 80% accuracy. If it works, it works. Unrealistic model data transformations happen in all types of models and as long as you aren’t just selecting the prettiest picture that you arrived on by chance I see now problem with relying on a unsupervised transformation that may consistent of some unrealistic transformations if it fundamentally is still highly effective in getting what you want. If I know my data has interaction and non linear effects but don’t know which variables will have such effects, it seems like a UMAP or tsne transformation to two dimensions is a perfectly reasonable option and preferable to PCA in that situation. I feel like the problems you describe are mostly solved by just adjusting the parameters and making sure the clusters you find are robust to those alterations.

ZombieRickyB t1_iy94v38 wrote on November 29, 2022 at 6:00 PM

I mean, yeah, you're not wrong. If it works for you, it works for you. It's problem space dependent, and there's virtually no research that exists suggesting how much, if at all, things will be distorted in the process for given scenarios. For my work, I need to have theoretical bounds on conformal/isometric distortion, the distances involved are infinitely more important than classification accuracy. I work in settings where near perfect classification accuracy is absolutely not expected, so well separation of clusters just will call for question of results.

There have been a number of cases, both theoretical and in practice, where t-SNE and UMAP give results with questionable reliability. I'm sure I could get an example in my space with little effort as well, and I'd rather go through some nonlinear transforms I know well in terms of how they work than spend a bunch of time tuning optimization hyperparameters that could take forever.

trutheality t1_iy953rr wrote on November 29, 2022 at 6:01 PM

It's "good" for visualization in the sense that it can give you something to look at, but it's not really good for visualization. You can't even guarantee that the nearest neighbor of a point in the projection is its nearest neighbor in the input space.

This paper demonstrates that you can make the output look like anything you want and still minimize the UMAP/tSNE objectives: https://www.biorxiv.org/content/10.1101/2021.08.25.457696v3

resented_ape t1_iybprpj wrote on November 30, 2022 at 5:07 AM

FWIW, I don't think that is what the paper demonstrates. The Picasso method the authors introduce uses a totally different cost function based on distance reconstruction. For a specific set of metrics the authors are interested in, they say that Picasso produces results comparable with UMAP and t-SNE. But it's not the UMAP or t-SNE objective.

With the scRNAseq dataset in the python notebook at the github page for picasso, I found that for the metrics one might usually be interested in with UMAP and t-SNE, e.g. neighborhood preservation (what proportion of the k-nearest neighbors in the input and output space are preserved) or Spearman rank correlation (or triplet ordering preservation) of input vs output distances, Picasso did (quite a bit) worse than UMAP and t-SNE.

This might not be relevant to for downstream scRNAseq workflows -- I will take the authors' word on that. At any rate, on my machine Picasso runs very slowly, and I found its own output to be visually unsatisfactory with the default settings with other datasets that I tried it with (e.g. MNIST), so I have been unable to generate a similar analysis for a wide range of datasets. So take that for what it's worth.