low_effort_shit-post t1_j2ks3qv wrote on January 2, 2023 at 1:39 AM

I'm a data engineer by trade, and usually we say no. PDF isn't a data type or a type of storage; it's a print format. Go to whoever produces the PDF and ask for the source data. Once the $$$ is discussed and everyone understands how much harder PDFs are to work with and to maintain a process around, it only makes sense to grab the data from wherever the PDF gets it.

30katz t1_j2lilwg wrote on January 2, 2023 at 5:15 AM

Our company is stuck with PDFs, but they’re actually not too hard to work with using Amazon’s Textract or Adobe’s Extract API. But maybe that’s a sign that it is hard, since the technology is owned by the two biggest tech giants in the space.

VacuousWaffle t1_j2m3shr wrote on January 2, 2023 at 9:29 AM

I remember that at a hospital job I was asked to mine text from internally generated PDFs, after another team had spent a few months trying to build a solution themselves. The source data they were trying to mine was already in the data warehouse, in a surprisingly well-formatted table.

low_effort_shit-post t1_j2mpzz5 wrote on January 2, 2023 at 1:58 PM

We get PDF feeds all the time, along with promises (with financial implications attached) that a proper data feed will follow. Usually we kick the can down the road, and when we get the proper feed we just pull it in and it moves through our usual ETL process within a day. PDFs are to be ignored.