Viewing a single comment thread. View all comments

bankCC t1_ix9o1ln wrote

Which approach would be best for a classification of text into 2 categories, where my dataset is realy small and unbalanced (4000, 250) each text containing around 200-300 words.

And most of the time just one or two words will lead to classification. I could just do a keyword search, but misspelled words might slip through and the dictionary would be pretty big and computational expensive to compare on each file. So I thought ML would be a better idea.

Maybe a CNN but the dataset seems to be way too small to accomplish acceptable results.

Any hints are welcome tyvm

1

Gazorpazzor t1_ixc37ng wrote

Hello,

  1. Extract Features using "TF-IDF" (If the classification is likely led by few specific words)
  2. Train an SVM classifier ( In your case, with few data samples, I would train different classifiers with different hyperparameters and keep the best model. NN architectures like GRUs and LSTMs give decent results, unfortunately they might need more data to produce good results)
  3. Increase your iteration / epochs to compensate for the really small dataset size (keep and eye on the evaluation set loss to prevent overfitting)

As for the data imbalance problem, I would try with undersampling the 4000 samples class set to 250 samples first, then try to improve results later on by data augmentation or cost sensitive algorithms ( cost-sensitive SVM, weighted cross-entropy,...)

2

bankCC t1_ixc6lk0 wrote

Thank you very much for the answer! I highly appreciate it. You gave me a realy good base to start from. Huge thanks

2