Chrysomite t1_iu5kcxr wrote
Reply to [D] [R] Large-scale clustering by jesusfbes
I've done this to some extent with PCA-KNN.
You can reduce the number of dimensions prior to clustering using principal component analysis: keep the first n components that together explain as much of the variance as you want (I usually stop at 95%).
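A minimal sketch with scikit-learn, assuming your data fits in memory (the array shape here is a placeholder): passing a float to `n_components` tells `PCA` to keep just enough components to reach that explained-variance fraction.

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
X = rng.normal(size=(10_000, 512))  # placeholder data: 10k points, 512 dims

# A float n_components means "keep enough components to explain this
# fraction of the total variance" rather than a fixed component count.
pca = PCA(n_components=0.95)
X_reduced = pca.fit_transform(X)
print(X_reduced.shape)  # (10000, k), where k is whatever hits 95%
```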
You can also borrow from computer graphics and use hierarchical spatial partitioning to speed up clustering and searching: binary space partitioning, or k-d trees with splitting hyperplanes. Data points live in the leaf nodes, and when a leaf exceeds a certain density you split it (see the sketch below). I haven't tried it, and I'm a little unsure about the geometry, but simple spatial hashing might work too. Then keep track of neighboring cells and only apply clustering to a subset of the space.
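Here's a rough sketch of the leaf-splitting idea in Python. The class name, the `capacity` threshold, and the median-split rule are all my own choices, not anything canonical, and a real version would also need the neighboring-cell bookkeeping mentioned above:

```python
import numpy as np

class KDNode:
    """Leaf-splitting k-d tree sketch. Points live in leaves; once a leaf
    holds more than `capacity` points it splits on the median of one axis,
    cycling axes by depth. `capacity` is an assumed tuning knob."""

    def __init__(self, depth=0, capacity=64):
        self.depth = depth
        self.capacity = capacity
        self.points = []        # populated only while this node is a leaf
        self.axis = None
        self.threshold = None
        self.left = None
        self.right = None

    def insert(self, point):
        if self.left is None:   # still a leaf
            self.points.append(point)
            if len(self.points) > self.capacity:
                self._split()
        elif point[self.axis] < self.threshold:
            self.left.insert(point)
        else:
            self.right.insert(point)

    def _split(self):
        pts = np.asarray(self.points)
        axis = self.depth % pts.shape[1]      # cycle the split axis by depth
        threshold = np.median(pts[:, axis])
        mask = pts[:, axis] < threshold
        if mask.all() or not mask.any():      # degenerate split: stay a leaf
            return
        self.axis, self.threshold = axis, threshold
        self.left = KDNode(self.depth + 1, self.capacity)
        self.right = KDNode(self.depth + 1, self.capacity)
        self.left.points = [p for p, m in zip(self.points, mask) if m]
        self.right.points = [p for p, m in zip(self.points, mask) if not m]
        self.points = []

    def leaf_for(self, point):
        """Descend to the leaf cell containing `point`; cluster within it
        (plus neighboring cells, which this sketch doesn't track)."""
        node = self
        while node.left is not None:
            node = node.left if point[node.axis] < node.threshold else node.right
        return node

# Usage: cluster only the points sharing a cell, not the whole dataset.
rng = np.random.default_rng(0)
root = KDNode(capacity=32)
for p in rng.normal(size=(1_000, 8)):
    root.insert(p)
cell = root.leaf_for(rng.normal(size=8))
print(len(cell.points))  # cluster just these, instead of all 1,000
```

Median splits keep the tree roughly balanced, and the degenerate-split guard avoids infinite recursion when a cell's points are identical along the split axis.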
It's admittedly imperfect, but I expect it's a decent approximation at scale.