bankCC t1_ix9o1ln wrote on November 21, 2022 at 8:28 PM

Which approach would be best for classifying text into 2 categories when my dataset is really small and unbalanced (4000 vs. 250 samples), with each text containing around 200-300 words?

Most of the time just one or two words will decide the classification. I could just do a keyword search, but misspelled words might slip through, and the dictionary would be pretty big and computationally expensive to compare against each file. So I thought ML would be a better idea.

Maybe a CNN, but the dataset seems way too small to achieve acceptable results.

Any hints are welcome, tyvm

Gazorpazzor t1_ixc37ng wrote on November 22, 2022 at 9:14 AM

Hello,

Extract features using TF-IDF (a good fit if the classification is driven by a few specific words).
Train an SVM classifier. In your case, with so few samples, I would train several classifiers with different hyperparameters and keep the best model. NN architectures like GRUs and LSTMs can give decent results, but they would likely need more data than you have.
Increase your iterations/epochs to compensate for the really small dataset size (keep an eye on the evaluation set loss to prevent overfitting).
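
A minimal sketch of the steps above, assuming scikit-learn is available. The toy texts, the "refund" keyword, and the grid values are made-up placeholders and not anything from the original question; swap in the real texts and labels.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import GridSearchCV, StratifiedKFold
from sklearn.pipeline import Pipeline
from sklearn.svm import LinearSVC

# Hypothetical toy data: label 1 whenever the keyword "refund" appears.
neutral = [
    "the meeting is moved to monday",
    "lunch menu for this week",
    "please review the attached report",
    "server maintenance is planned tonight",
]
flagged = [
    "customer demands a refund today",
    "refund request for the last order",
]
texts = neutral * 3 + flagged * 3
labels = [0] * (len(neutral) * 3) + [1] * (len(flagged) * 3)

# TF-IDF features feeding a linear SVM.
pipeline = Pipeline([
    ("tfidf", TfidfVectorizer()),
    ("svm", LinearSVC()),
])

# Try several hyperparameter settings and keep the best model.
param_grid = {
    "tfidf__ngram_range": [(1, 1), (1, 2)],
    "svm__C": [0.1, 1.0, 10.0],
}

search = GridSearchCV(
    pipeline,
    param_grid,
    scoring="f1",  # more informative than accuracy when classes are unbalanced
    cv=StratifiedKFold(n_splits=3, shuffle=True, random_state=0),
)
search.fit(texts, labels)

print(search.best_params_)
pred = search.predict(["refund request", "lunch menu"])
print(pred)
```

To be more tolerant of misspellings, one option is `TfidfVectorizer(analyzer="char_wb", ngram_range=(2, 4))`, which builds features from character n-grams instead of whole words.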

As for the data imbalance problem, I would first try undersampling the 4000-sample class down to 250 samples, then try to improve results later with data augmentation or cost-sensitive algorithms (cost-sensitive SVM, weighted cross-entropy, ...).
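
Both options can be sketched with the standard library only. The helper names `undersample` and `balanced_class_weights` are made up for illustration, and the texts are dummies; only the 4000/250 class sizes come from the question. The weighting formula is the same heuristic scikit-learn applies for `class_weight="balanced"`.

```python
import random
from collections import Counter


def undersample(texts, labels, seed=0):
    """Randomly drop samples so every class matches the smallest class size."""
    rng = random.Random(seed)
    by_class = {}
    for text, label in zip(texts, labels):
        by_class.setdefault(label, []).append(text)
    target = min(len(items) for items in by_class.values())
    pairs = []
    for label, items in by_class.items():
        for text in rng.sample(items, target):
            pairs.append((text, label))
    rng.shuffle(pairs)  # avoid having all samples of one class first
    return [t for t, _ in pairs], [l for _, l in pairs]


def balanced_class_weights(labels):
    """Weight each class by n_samples / (n_classes * class_count)."""
    counts = Counter(labels)
    n_samples = len(labels)
    return {label: n_samples / (len(counts) * c) for label, c in counts.items()}


# Dummy dataset with the class sizes from the question.
labels = [0] * 4000 + [1] * 250
texts = [f"doc {i}" for i in range(len(labels))]

bal_texts, bal_labels = undersample(texts, labels)
print(Counter(bal_labels))  # 250 samples per class

weights = balanced_class_weights(labels)
print(weights)  # {0: 0.53125, 1: 8.5}
```

Undersampling throws away most of the majority class, so the weights are the cost-sensitive alternative: an error on the 250-sample class costs 16 times as much as one on the 4000-sample class, and all the data is kept.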

bankCC t1_ixc6lk0 wrote on November 22, 2022 at 10:03 AM

Thank you very much for the answer! I highly appreciate it. You gave me a really good base to start from. Huge thanks