backtorealite t1_iy7z5lu wrote on November 29, 2022 at 12:49 PM

Reply to comment by ZombieRickyB in [D] What method is state of the art dimensionality reduction by olmec-akeru

But I feel like this is equivalent to a statistician telling me to not trust my XGBoost model with 99% accuracy but is fine with my linear model with 80% accuracy. If it works, it works. Unrealistic model data transformations happen in all types of models and as long as you aren’t just selecting the prettiest picture that you arrived on by chance I see now problem with relying on a unsupervised transformation that may consistent of some unrealistic transformations if it fundamentally is still highly effective in getting what you want. If I know my data has interaction and non linear effects but don’t know which variables will have such effects, it seems like a UMAP or tsne transformation to two dimensions is a perfectly reasonable option and preferable to PCA in that situation. I feel like the problems you describe are mostly solved by just adjusting the parameters and making sure the clusters you find are robust to those alterations.

ZombieRickyB t1_iy94v38 wrote on November 29, 2022 at 6:00 PM

I mean, yeah, you're not wrong. If it works for you, it works for you. It's problem space dependent, and there's virtually no research that exists suggesting how much, if at all, things will be distorted in the process for given scenarios. For my work, I need to have theoretical bounds on conformal/isometric distortion, the distances involved are infinitely more important than classification accuracy. I work in settings where near perfect classification accuracy is absolutely not expected, so well separation of clusters just will call for question of results.

There have been a number of cases, both theoretical and in practice, where t-SNE and UMAP give results with questionable reliability. I'm sure I could get an example in my space with little effort as well, and I'd rather go through some nonlinear transforms I know well in terms of how they work than spend a bunch of time tuning optimization hyperparameters that could take forever.