Numerous-Carrot3910 t1_j5jhhkg wrote

Hi, I’m trying to build a model with a large number of categorical predictor variables, each of which has a large number of internal categories. One-hot encoding (OHE) them produces a higher-dimensional dataset than I want to work with. Does anyone have advice for dealing with this, other than using subject-matter expertise or iteration to perform feature selection? Thanks!


trnka t1_j5k4ldr wrote

It depends on the data and on what problems the high dimensionality is actually causing you.

  • If the variables are phrases like "acute sinusitis, site not specified", you could use a one-hot encoding of the n-grams that appear in them.
  • If you have many rare values, you can just retain the top K values per feature.
  • If those don't work, the hashing trick is another great option to try; it's just not easily interpretable (see the sketch after this list).
  • If there's any internal structure to the categories, like if they're hierarchical in some way, you can cut them off at a higher level of the hierarchy.
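For the hashing trick, here's a minimal sketch using scikit-learn's `FeatureHasher`; the feature names, example rows, and column count are made-up placeholders, not from your data:

```python
from sklearn.feature_extraction import FeatureHasher

# Hypothetical rows: each is a dict of categorical feature -> value.
# Hashing maps every (feature, value) pair into a fixed number of
# columns, so the width stays bounded no matter how many distinct
# categories show up at training or prediction time.
rows = [
    {"diagnosis": "acute sinusitis", "region": "northwest"},
    {"diagnosis": "chronic bronchitis", "region": "southeast"},
]

# With input_type="dict", string values are hashed as "feature=value"
# pairs; n_features caps the output dimensionality.
hasher = FeatureHasher(n_features=256, input_type="dict")
X = hasher.transform(rows)  # scipy sparse matrix, shape (2, 256)
print(X.shape)
```

The trade-off in `n_features` is collisions versus width: fewer columns means more unrelated categories sharing a column, which is also why the result is hard to interpret.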

Numerous-Carrot3910 t1_j5ka168 wrote

Thanks for your response! Even after retaining the top K values of each feature, there are still a large number of features to consider. I haven’t tried the hashing trick, so I will look into that.


trnka t1_j5kksex wrote

Hmm, you might also try feature selection. I'm not sure what you mean by not iterating, unless you mean recursive feature elimination? There are a lot of fast correlation-style scoring functions you can use for feature selection -- scikit-learn has some popular options. They run quickly, and if you have lots of data you can probably fit the feature-selection step on a random subset of the training data (see the sketch below).
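A rough sketch of that subset approach with scikit-learn's `SelectKBest`; the matrix shapes, the `chi2` score function, and `k` are illustrative placeholders:

```python
import numpy as np
from sklearn.feature_selection import SelectKBest, chi2

# Hypothetical wide binary design matrix (e.g. one-hot or hashed
# features) and binary labels; sizes here are placeholders.
rng = np.random.default_rng(0)
X = rng.integers(0, 2, size=(20_000, 2_000), dtype=np.int8)
y = rng.integers(0, 2, size=20_000)

# Fit the scoring on a random subset of rows to keep it fast...
subset = rng.choice(X.shape[0], size=5_000, replace=False)
selector = SelectKBest(chi2, k=200).fit(X[subset], y[subset])

# ...then apply the learned feature mask to the full matrix.
X_reduced = selector.transform(X)
print(X_reduced.shape)  # (20000, 200)
```

Note that `chi2` assumes non-negative features; for other data, `mutual_info_classif` or `f_classif` are drop-in alternatives.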

You could also learn a dimensionality reduction from a subset of the training data, whether PCA or a neural-network approach; a sketch follows.
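A minimal sketch of the PCA-style option, assuming a sparse one-hot matrix; `TruncatedSVD` is the usual choice there because plain PCA would require densifying and centering the data. The shapes and component count are made up:

```python
import numpy as np
from scipy import sparse
from sklearn.decomposition import TruncatedSVD

# Hypothetical sparse one-hot-style matrix; TruncatedSVD accepts
# sparse input directly, unlike plain PCA.
rng = np.random.default_rng(0)
X = sparse.random(20_000, 2_000, density=0.01, format="csr",
                  random_state=0)

# Learn the projection on a random subset of rows, then apply it
# to the full dataset.
subset = rng.choice(X.shape[0], size=5_000, replace=False)
svd = TruncatedSVD(n_components=50, random_state=0).fit(X[subset])
X_reduced = svd.transform(X)
print(X_reduced.shape)  # (20000, 50)
```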


Numerous-Carrot3910 t1_j5kuwul wrote

Yes, I was referring to recursive feature elimination. Thanks for the recommendations.
