Submitted by cm_34978 t3_100rbhp in MachineLearning
cm_34978 OP t1_j2n0cym wrote
Update for the interested - after trying a few different packages suggested in the comments, I settled on the inelegant, yet functional solution of automating the import of PDFs to Microsoft Word, saving the PDF as a Word file, then using a library to extract only the body text from the Word file.
Definitely not ideal, since this will not work on Linux and runs only as fast as Microsoft Word can open, convert, and save each file. But it works.
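For anyone who wants to try the same route, here is a rough sketch of the two steps, not the exact code used above. It assumes Windows with Microsoft Word and the pywin32 package installed. The function names `pdf_to_docx` and `extract_body_text` are made up for illustration. Instead of a third-party Word library, the extraction step swaps in the standard-library `zipfile` and `xml.etree` modules: a .docx file is a zip archive whose main text lives in `word/document.xml`, while headers and footers are stored in separate parts, so reading only that part skips them.

```python
import zipfile
import xml.etree.ElementTree as ET
from pathlib import Path

# WordprocessingML namespace used by the elements inside a .docx file
W_NS = "{http://schemas.openxmlformats.org/wordprocessingml/2006/main}"
WD_FORMAT_DOCX = 16  # Word's wdFormatDocumentDefault constant (.docx)


def pdf_to_docx(pdf_path, docx_path):
    """Open a PDF in Word and save it as .docx (Windows + Word + pywin32 only)."""
    import win32com.client  # imported lazily so the extraction step runs anywhere

    word = win32com.client.Dispatch("Word.Application")
    word.Visible = False
    word.DisplayAlerts = 0  # ask Word not to show dialogs (may not cover every prompt)
    try:
        # Positional arguments: FileName, ConfirmConversions, ReadOnly
        doc = word.Documents.Open(str(Path(pdf_path).resolve()), False, True)
        doc.SaveAs2(str(Path(docx_path).resolve()), WD_FORMAT_DOCX)
        doc.Close(False)  # close without saving further changes
    finally:
        word.Quit()


def extract_body_text(docx_path):
    """Return the non-empty paragraphs of the main document part, one per line."""
    with zipfile.ZipFile(docx_path) as zf:
        root = ET.fromstring(zf.read("word/document.xml"))
    paragraphs = []
    for p in root.iter(W_NS + "p"):
        # A paragraph's text is split across runs; join the pieces back together
        text = "".join(t.text or "" for t in p.iter(W_NS + "t"))
        if text.strip():
            paragraphs.append(text)
    return "\n".join(paragraphs)
```

Usage would look like `pdf_to_docx("paper.pdf", "paper.docx")` followed by `extract_body_text("paper.docx")`. Table cells come through as ordinary paragraphs, and text inside text boxes may show up twice, so some cleanup is likely still needed.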
ypanagis t1_j2nkyk0 wrote
I was about to propose the same. For those who are interested, this seems to work on macOS too, but Windows is definitely the go-to. A VBA script can also come in handy for opening several PDFs in Word and saving each one as TXT.
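The batch idea can also be driven from Python rather than VBA. The following is only a sketch under the same assumptions (Windows, Word, and pywin32 installed); `plan_outputs` and `batch_pdf_to_txt` are hypothetical names, and whether Word shows an encoding or conversion prompt may depend on the Word version.

```python
from pathlib import Path

WD_FORMAT_TEXT = 2  # Word's wdFormatText constant (plain text)


def plan_outputs(pdf_dir, out_dir):
    """Pair every PDF in pdf_dir with the .txt path it will be saved to."""
    pdf_dir, out_dir = Path(pdf_dir), Path(out_dir)
    return [(pdf, out_dir / (pdf.stem + ".txt"))
            for pdf in sorted(pdf_dir.glob("*.pdf"))]


def batch_pdf_to_txt(pdf_dir, out_dir):
    """Convert all PDFs in a folder to TXT using a single Word instance."""
    import win32com.client  # Windows + Word + pywin32 only

    jobs = plan_outputs(pdf_dir, out_dir)
    Path(out_dir).mkdir(parents=True, exist_ok=True)
    word = win32com.client.Dispatch("Word.Application")
    word.Visible = False
    word.DisplayAlerts = 0  # ask Word not to show dialogs
    try:
        for pdf, txt in jobs:
            # Positional arguments: FileName, ConfirmConversions, ReadOnly
            doc = word.Documents.Open(str(pdf.resolve()), False, True)
            doc.SaveAs2(str(txt.resolve()), WD_FORMAT_TEXT)
            doc.Close(False)  # close without saving further changes
    finally:
        word.Quit()  # always shut Word down, even if a file fails
    return [txt for _, txt in jobs]
```

Starting Word once and reusing it for every file avoids paying the application start-up cost per PDF, which matters because Word is the bottleneck here.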
cm_34978 OP t1_j2nsi8g wrote
Definitely. With Windows, you get the advantage of the win32com library, whereas with macOS you need to play with AppleScript, which (in my hands) can be brittle and finicky.