Submitted by cm_34978 t3_100rbhp in MachineLearning
low_effort_shit-post t1_j2ks3qv wrote
I'm a data engineer by trade, and usually we say no. PDF isn't a data type or a type of storage; it's a print format. Go to the source and ask for the source data. Once the $$$ is discussed, along with how much harder PDFs are to work with and to maintain a process around, it only makes sense to grab the data from wherever the PDF gets it.
30katz t1_j2lilwg wrote
Our company is stuck with PDFs, but they're actually not too hard to work with using Amazon Textract or the Adobe PDF Extract API. But maybe the fact that the technology is owned by the two biggest tech giants in the space is a sign that it is hard.
VacuousWaffle t1_j2m3shr wrote
I remember at a hospital job I was asked to mine text from internally generated PDFs, after another team had spent a few months trying to build a solution themselves. The source data they were trying to mine was already in the data warehouse, in a surprisingly well-formatted table.
low_effort_shit-post t1_j2mpzz5 wrote
We get PDF feeds all the time, along with promises (which have financial implications) that we'll get a proper data feed. Usually we kick the can down the road, and when we get the feed we just pull it in and it moves through our usual ETL process within a day. PDFs are to be ignored.