Submitted by hopedallas t3_zmaobm in MachineLearning
I am working on a problem where the negative/0 to positive/1 label ratio is 180MM/10MM. The data size is around 25GB and I have >500 features. Certainly, I don't want to use all 180MM rows of the majority class to train my model due to computational limitations. Currently, I simply perform under-sampling of the majority class. However, I have been reading that this may cause loss of useful information or make it harder to determine the decision boundary between the classes (see https://machinelearningmastery.com/undersampling-algorithms-for-imbalanced-classification/). When I do the under-sampling, I try to make sure that the distribution of my data stays the same. I am wondering if there is a better way to handle this?
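For concreteness, my current under-sampling step is roughly along these lines (a simplified sketch on toy data; the real table is ~190MM rows with >500 features and is loaded from disk):

```python
import numpy as np
import pandas as pd

# Toy stand-in for the real table (~180MM negatives / ~10MM positives in practice).
rng = np.random.default_rng(0)
df = pd.DataFrame({
    "feature_1": rng.normal(size=100_000),
    "label": rng.choice([0, 1], size=100_000, p=[0.947, 0.053]),
})

pos = df[df["label"] == 1]
neg = df[df["label"] == 0]

# Keep all positives, randomly downsample negatives to a fixed ratio (e.g. 5:1).
neg_sampled = neg.sample(n=5 * len(pos), random_state=42)

# Recombine and shuffle.
train = pd.concat([pos, neg_sampled]).sample(frac=1, random_state=42)
print(train["label"].value_counts())
```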
JackandFred t1_j0a1gv2 wrote
If you think the real-world data will be similar to your samples, it's fine. But that's unlikely given how skewed this dataset is. Look up alternative metrics like F-score etc. so that you can weight what actually matters when training (false positives vs. false negatives, etc.).
What you linked there covers algorithms for imbalanced classification; usually the same algorithm is fine, but you want a different loss/metric.
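Something like this, roughly (a toy sketch using sklearn's logistic regression and synthetic data; your actual model and framework will differ):

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import f1_score, average_precision_score
from sklearn.model_selection import train_test_split

# Toy imbalanced dataset standing in for the real data.
X, y = make_classification(n_samples=100_000, n_features=20,
                           weights=[0.95, 0.05], random_state=0)
X_train, X_val, y_train, y_val = train_test_split(X, y, stratify=y, random_state=0)

# class_weight="balanced" up-weights the minority class in the loss,
# an alternative (or complement) to undersampling the majority class.
clf = LogisticRegression(class_weight="balanced", max_iter=1000)
clf.fit(X_train, y_train)

probs = clf.predict_proba(X_val)[:, 1]
preds = (probs >= 0.5).astype(int)

# Imbalance-aware metrics instead of plain accuracy.
print("F1:    ", f1_score(y_val, preds))
print("PR-AUC:", average_precision_score(y_val, probs))
```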