Only_Television2030 t1_isl60kc wrote

I have a list of sentences. Examples:

  1. ${INS1}, Watch our latest webinar about flu vaccine
  2. Do you think patients would like to go up to 250 days without an attack?
  3. Watch our latest webinar about flu vaccine
  4. ??? See if more of your patients are ready for vaccine
  5. Important news for your invaccinated patients
  6. Important news for your inv?ccinated patients
  7. ...
I have around 30k sentences, and around 85% of them are considered 'good'. By 'good' I mean sentences with no strange characters or character sequences, such as '${INS1}', '???', or a '?' inside a word. Otherwise a sentence is considered 'bad'. I need to find 'good' patterns so that I can identify and exclude 'bad' sentences in the future, since the list will grow and new 'bad' sentences might appear.
Is there any way to identify 'good' sentences using regex, Python/R libraries, or any other tool?
Thank you

BakerInTheKitchen t1_isoqlm5 wrote

I would think you could just use a list of special characters, loop through each sentence, and create a binary indicator that is set whenever a character in the sentence is in the list.
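
A minimal sketch of that idea in Python, in case it helps. The character set `SPECIAL_CHARS` and the regex `BAD_PATTERNS` are assumptions made up for illustration, not a vetted list, so adjust them to your data. The regex is there because a plain character list cannot include '?' without also flagging ordinary questions such as example 2.

```python
import re

# Assumed set of characters that should never appear in a 'good' sentence.
SPECIAL_CHARS = set("${}<>[]|\\^~`#@")

# Assumed patterns for the cases in the question that depend on context.
BAD_PATTERNS = re.compile(
    r"\$\{[^}]*\}"   # template placeholders such as ${INS1}
    r"|\?{2,}"       # runs of two or more question marks
    r"|\w\?\w"       # a question mark inside a word
)

def has_special_char(sentence: str) -> int:
    """Binary indicator: 1 if any character is in SPECIAL_CHARS, else 0."""
    return int(any(ch in SPECIAL_CHARS for ch in sentence))

def is_bad(sentence: str) -> int:
    """Binary indicator: 1 if the character check or any pattern fires, else 0."""
    return int(bool(has_special_char(sentence)) or BAD_PATTERNS.search(sentence) is not None)

sentences = [
    "${INS1}, Watch our latest webinar about flu vaccine",
    "Do you think patients would like to go up to 250 days without an attack?",
    "Watch our latest webinar about flu vaccine",
    "??? See if more of your patients are ready for vaccine",
    "Important news for your invaccinated patients",
    "Important news for your inv?ccinated patients",
]

flags = [is_bad(s) for s in sentences]
print(flags)
```

Note that a check like this only looks at characters, so a sentence such as example 5 passes even if the word itself is wrong.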
