Submitted by AutoModerator t3_10cn8pw in MachineLearning
Numerous-Carrot3910 t1_j5jhhkg wrote
Hi, I’m trying to build a model with a large number of categorical predictor variables that each have a large number of internal categories. One-hot encoding (OHE) them leads to a higher-dimensional dataset than I want to work with. Does anyone have advice for dealing with this, other than using subject-matter expertise or iteration to perform feature selection? Thanks!
trnka t1_j5k4ldr wrote
It depends on the data and the problems you're having with high-dimensional data.
- If the variables are phrases like "acute sinusitis, site not specified", you could use a one-hot encoding of the n-grams that appear in them.
- If you have many rare values, you can just retain the top K values per feature.
- If those don't work, the hashing trick is another great thing to try; it's just not easily interpretable. (There's a rough sketch of this and the top-K option after this list.)
- If there's any internal structure to the categories, like if they're hierarchical in some way, you can cut them off at a higher level of the hierarchy.
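Here's a rough sketch of what the top-K and hashing options might look like with pandas and scikit-learn. The column names, values, K, and the hash width are all made up, not from your data:

```python
import pandas as pd
from sklearn.feature_extraction import FeatureHasher

# Toy frame standing in for the real data
df = pd.DataFrame({
    "diagnosis": ["acute sinusitis", "flu", "flu", "migraine", "rare_dx"],
    "clinic":    ["north", "north", "south", "east", "west"],
})

# Option 1: keep only the top-K values per column, lump the rest into "__other__"
K = 2
capped = df.copy()
for col in capped.columns:
    top_k = capped[col].value_counts().nlargest(K).index
    capped[col] = capped[col].where(capped[col].isin(top_k), "__other__")

# Option 2: the hashing trick -- fixed output width no matter how many
# distinct values there are, at the cost of interpretability and collisions
hasher = FeatureHasher(n_features=256, input_type="string")
# Prefix each value with its column name so the same string in two different
# columns hashes to different features
rows = [[f"{col}={val}" for col, val in row.items()] for _, row in df.iterrows()]
X_hashed = hasher.transform(rows)  # scipy.sparse matrix of shape (n_rows, 256)
```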
Numerous-Carrot3910 t1_j5ka168 wrote
Thanks for your response! Even after retaining the top K values of each feature, there are still a lot of features to consider. I haven’t tried the hashing trick, so I will look into that.
trnka t1_j5kksex wrote
Hmm, you might also try feature selection. I'm not sure what you mean by not iterating, unless you mean recursive feature elimination? There are a lot of fast correlation-style scoring functions you can use for feature selection -- scikit-learn has some popular options. They run quickly, and if you have lots of data you can probably do the feature selection part on a random subset of the training data.
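Roughly like this, assuming you already have an encoded feature matrix `X` and labels `y` (the subset size and k are arbitrary placeholders):

```python
import numpy as np
from sklearn.feature_selection import SelectKBest, chi2

# X: the (sparse) encoded feature matrix, y: the labels -- assumed to exist
rng = np.random.default_rng(0)
subset = rng.choice(X.shape[0], size=min(50_000, X.shape[0]), replace=False)

# chi2 expects non-negative features (fine for one-hot counts); swap in
# f_classif or mutual_info_classif if that doesn't hold
selector = SelectKBest(chi2, k=500)
selector.fit(X[subset], y[subset])   # score features on the subset only
X_selected = selector.transform(X)   # then keep the top-k columns everywhere
```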
Also, you could do things like dimensionality reduction learned from a subset of the training data, whether PCA or an NN approach.
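For the PCA-style route, something like TruncatedSVD (the usual choice for sparse one-hot/hashed matrices) fit on a subset would work; again, `X`, the subset size, and n_components are placeholders:

```python
import numpy as np
from sklearn.decomposition import TruncatedSVD

# X is assumed to be the (sparse) encoded feature matrix from earlier
rng = np.random.default_rng(0)
subset = rng.choice(X.shape[0], size=min(50_000, X.shape[0]), replace=False)

svd = TruncatedSVD(n_components=50, random_state=0)
svd.fit(X[subset])            # learn the projection from the subset only
X_reduced = svd.transform(X)  # project every row into 50 dimensions
```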
Numerous-Carrot3910 t1_j5kuwul wrote
Yes, I was referring to recursive feature elimination. Thanks for the recommendations