Numerous-Carrot3910 t1_j5jhhkg wrote

Hi, I’m trying to build a model with a large number of categorical predictor variables, each of which has a large number of internal categories. One-hot encoding (OHE) them produces a higher-dimensional dataset than I want to work with. Does anyone have advice for dealing with this, other than using subject-matter expertise or iteration to perform feature selection? Thanks!


trnka t1_j5k4ldr wrote

It depends on the data and on what problems the high dimensionality is actually causing you.

  • If the variables are phrases like "acute sinusitis, site not specified", you could use a one-hot encoding of the n-grams that appear in them.
  • If you have many rare values, you can just retain the top K values per feature.
  • If those don't work, the hashing trick is another great option to try; it's just not easily interpretable (see the sketch after this list).
  • If there's any internal structure to the categories, like if they're hierarchical in some way, you can cut them off at a higher level of the hierarchy.
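For the hashing trick, here's a minimal sketch using scikit-learn's `FeatureHasher`; the feature names, example rows, and column count are made-up placeholders, not from your data:

```python
from sklearn.feature_extraction import FeatureHasher

# Hypothetical rows: each is a dict of categorical feature -> value.
# Hashing maps every (feature, value) pair into a fixed number of
# columns, so the width stays bounded no matter how many distinct
# categories show up at training or prediction time.
rows = [
    {"diagnosis": "acute sinusitis", "region": "northwest"},
    {"diagnosis": "chronic bronchitis", "region": "southeast"},
]

# With input_type="dict", string values are hashed as "feature=value"
# pairs; n_features caps the output dimensionality.
hasher = FeatureHasher(n_features=256, input_type="dict")
X = hasher.transform(rows)  # scipy sparse matrix, shape (2, 256)
print(X.shape)
```

The trade-off in `n_features` is collisions versus width: fewer columns means more unrelated categories sharing a column, which is also why the result is hard to interpret.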

Numerous-Carrot3910 t1_j5ka168 wrote

Thanks for your response! Even after retaining the top K values of each feature, there are still a large number of features to consider. I haven’t tried the hashing trick, so I will look into that.


trnka t1_j5kksex wrote

Hmm, you might also try feature selection. I'm not sure what you mean by not iterating, unless you mean recursive feature elimination? There are a lot of fast correlation-style scoring functions you can use for feature selection -- scikit-learn has some popular options. They run quickly, and if you have lots of data you can probably fit the feature-selection step on a random subset of the training data (see the sketch below).
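A rough sketch of that subset approach with scikit-learn's `SelectKBest`; the matrix shapes, the `chi2` score function, and `k` are illustrative placeholders:

```python
import numpy as np
from sklearn.feature_selection import SelectKBest, chi2

# Hypothetical wide binary design matrix (e.g. one-hot or hashed
# features) and binary labels; sizes here are placeholders.
rng = np.random.default_rng(0)
X = rng.integers(0, 2, size=(20_000, 2_000), dtype=np.int8)
y = rng.integers(0, 2, size=20_000)

# Fit the scoring on a random subset of rows to keep it fast...
subset = rng.choice(X.shape[0], size=5_000, replace=False)
selector = SelectKBest(chi2, k=200).fit(X[subset], y[subset])

# ...then apply the learned feature mask to the full matrix.
X_reduced = selector.transform(X)
print(X_reduced.shape)  # (20000, 200)
```

Note that `chi2` assumes non-negative features; for other data, `mutual_info_classif` or `f_classif` are drop-in alternatives.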

You could also learn a dimensionality reduction from a subset of the training data, whether PCA or a neural-network approach; a sketch follows.
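A minimal sketch of the PCA-style option, assuming a sparse one-hot matrix; `TruncatedSVD` is the usual choice there because plain PCA would require densifying and centering the data. The shapes and component count are made up:

```python
import numpy as np
from scipy import sparse
from sklearn.decomposition import TruncatedSVD

# Hypothetical sparse one-hot-style matrix; TruncatedSVD accepts
# sparse input directly, unlike plain PCA.
rng = np.random.default_rng(0)
X = sparse.random(20_000, 2_000, density=0.01, format="csr",
                  random_state=0)

# Learn the projection on a random subset of rows, then apply it
# to the full dataset.
subset = rng.choice(X.shape[0], size=5_000, replace=False)
svd = TruncatedSVD(n_components=50, random_state=0).fit(X[subset])
X_reduced = svd.transform(X)
print(X_reduced.shape)  # (20000, 50)
```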


Numerous-Carrot3910 t1_j5kuwul wrote

Yes, I was referring to recursive feature elimination. Thanks for the recommendations.
