Is there any computationally efficient implementation of clustering methods, beyond vanilla kmeans, for lot of data? Let's say 1M data points of 100s dimensions. I know about sci-kit learn and a couple others but I wonder if the will suit that data sizes.

Comments

You must log in or register to comment.

sapnupuasop t1_iu0pqof wrote on October 27, 2022 at 6:13 PM

#263,793

Does clustering on 100s of features make sense? Maybe spark could solve your problems but you must look if there are spark implementations of other algorithms

jesusfbes OP t1_iu0vao5 wrote on October 27, 2022 at 6:49 PM

#264,584

Replying to sapnupuasop (#263,793)

It does make sense, if your data point are of high dimension, say vector embedding for example. I believe that spark is an option, for algorithms that are not implemented as spectral clustering you have the primitives to make it yourself. Thank you for your response.

AvailableIron2403 t1_iu0yqfd wrote on October 27, 2022 at 7:11 PM

#265,121

You can try faiss, a library developed by Facebook for very efficient vector similarity search

sapnupuasop t1_iu0zjy5 wrote on October 27, 2022 at 7:17 PM

#265,241

Replying to jesusfbes (#264,584)

Yeah was thinking of curse of dimensionality, with standard Euclidean distance for example, distances in high dimensions lose their meaning, but there are surely other distance which could function there. Btw I have used sklearn to cluster on couple millions of rows with sklearn successfully

MUSEy69 t1_iu10tpw wrote on October 27, 2022 at 7:25 PM

#265,445

try https://rapids.ai/

SnowFP t1_iu16owi wrote on October 27, 2022 at 8:03 PM

#266,276

Hi, I think Faiss might work. It has GPU support for KMeans clustering as well.

TheLionKing2020 t1_iu1bcw8 wrote on October 27, 2022 at 8:33 PM

#266,949

Well, you don't need to train on all of these data

First take samples of 10k, 50k and 100k and see if you have different results. Do you get different number of clusters?

illtakeboththankyou t1_iu2bavr wrote on October 28, 2022 at 12:56 AM

#272,653

UMAP+HDBSCAN

jesusfbes OP t1_iu3915z wrote on October 28, 2022 at 5:58 AM

#277,752

Replying to TheLionKing2020 (#266,949)

That was an initial idea, probably it is what I would do. However, it is good to now about efficient approaches

TheLionKing2020 t1_iu3g14t wrote on October 28, 2022 at 7:32 AM

#278,619

Replying to jesusfbes (#277,752)

Also before going to make tests over 100k of samples check if you can lower the dimensions: feature selection, low variance, PCA, etc.

PassionatePossum t1_iu3gete wrote on October 28, 2022 at 7:38 AM

#278,673

Replying to sapnupuasop (#263,793)

I have done it before and it works well. But I guess it depends on the use-case. It is a classic technique in computer vision to cluster SIFT vectors (128 dimensions) on a training dataset. You then describe any image as a set of „visual words“ (i.e. the IDs of the clusters its SIFT vectors fall into).

A colleague of mine wrote the clustering algorithm himself. It was just a normal k-means with the nearest neighbor search replaced by an approximate nearest neighbor search to speed things up.

ResponsibilityNo7189 t1_iu3ng2v wrote on October 28, 2022 at 9:21 AM

#279,538

https://spark.apache.org/docs/latest/api/python/reference/api/pyspark.mllib.clustering.PowerIterationClustering.html

Chrysomite t1_iu5kcxr wrote on October 28, 2022 at 6:36 PM

#292,700

I've done this to some extent with PCA-KNN.

You can reduce the number of dimensions prior to clustering using principal component analysis. You'll select the first n components based on how much of the variance you want explained (I usually stop at 95%).

You can borrow from computer graphics and use hierarchical spatial partitioning techniques to speed up clustering/searching. You can use binary space partitioning or k-d trees with hyperplanes. Data points reside in the leaf nodes. If a leaf node reaches a certain density, split it. I haven't tried it, and I'm a little unsure on the geometry, but maybe simple spatial hashing techniques will also work? Then keep track of neighboring spaces and only apply clustering to a subset of the space.

It's admittedly imperfect, but I expect it's a decent approximation at scale.

bgighjigftuik t1_iucephe wrote on October 30, 2022 at 7:12 AM

#336,714

Replying to SnowFP (#266,276)

Indeed, Faiss is the fastest he may be able to get; at least in a single machine