-xylon t1_jcokzje wrote on March 18, 2023 at 10:54 AM

Reply to comment by fullstackai in [D] Unit and Integration Testing for ML Pipelines by Fender6969

My go-to approach for testing is to define a schema and generate random or synthetic data from it.
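A minimal sketch of what that could look like, using only the standard library. The schema format, the column names, and the helpers `generate_rows` and `conforms` are all made up for illustration; a real pipeline would more likely use its own schema definitions or a dedicated library.

```python
import random

# Hypothetical schema (not from any real pipeline): column -> (kind, options).
SCHEMA = {
    "age": ("int", {"min": 18, "max": 90}),
    "income": ("float", {"min": 0.0, "max": 250_000.0}),
    "segment": ("category", {"values": ["a", "b", "c"]}),
}


def generate_rows(schema, n, seed=0):
    """Generate n random rows whose values follow the schema."""
    rng = random.Random(seed)  # seeded so a failing test can be reproduced
    rows = []
    for _ in range(n):
        row = {}
        for name, (kind, opts) in schema.items():
            if kind == "int":
                row[name] = rng.randint(opts["min"], opts["max"])
            elif kind == "float":
                row[name] = rng.uniform(opts["min"], opts["max"])
            elif kind == "category":
                row[name] = rng.choice(opts["values"])
            else:
                raise ValueError(f"unknown column kind: {kind}")
        rows.append(row)
    return rows


def conforms(row, schema):
    """Return True if the row has exactly the schema's columns with valid values."""
    if set(row) != set(schema):
        return False
    for name, (kind, opts) in schema.items():
        value = row[name]
        if kind == "int":
            ok = isinstance(value, int) and opts["min"] <= value <= opts["max"]
        elif kind == "float":
            ok = isinstance(value, float) and opts["min"] <= value <= opts["max"]
        elif kind == "category":
            ok = value in opts["values"]
        else:
            ok = False
        if not ok:
            return False
    return True


rows = generate_rows(SCHEMA, n=100, seed=42)
```

In a unit test you would feed `rows` into the pipeline step under test and use something like `conforms` to assert that the output still matches the expected output schema.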

nucLeaRStarcraft t1_jcoo30z wrote on March 18, 2023 at 11:32 AM

More or less the same. However, the simplest way to start, at least in my experience, is to randomize a subsample of real data. Synthetic data can be too simple or fail to capture the real distribution, which can hide bugs.

Using both is probably the ideal solution.
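One possible reading of "randomize a subsample", sketched with the standard library only: draw a random sample of real rows, then shuffle each column independently so the per-column values stay realistic while rows no longer match real records. The helper `randomized_subsample` and the toy data are assumptions for illustration, and shuffling columns is not an anonymization guarantee on its own.

```python
import random


def randomized_subsample(rows, k, seed=0):
    """Sample k rows, then shuffle every column independently.

    Per-column values are preserved within the sample, but the
    relationships between columns are deliberately broken.
    """
    if not 0 < k <= len(rows):
        raise ValueError("k must be between 1 and len(rows)")
    rng = random.Random(seed)  # seeded for reproducible fixtures
    sample = rng.sample(rows, k)
    # Pivot the sampled rows into columns so each can be shuffled on its own.
    columns = {name: [row[name] for row in sample] for name in sample[0]}
    for values in columns.values():
        rng.shuffle(values)
    # Pivot back into a list of row dicts.
    return [{name: columns[name][i] for name in columns} for i in range(k)]


# Toy stand-in for real data (hypothetical columns).
real_rows = [{"age": 20 + i, "segment": "abc"[i % 3]} for i in range(20)]
fixture = randomized_subsample(real_rows, k=5, seed=1)
```

Because the column shuffle destroys cross-column correlations, this kind of fixture suits tests of schema handling and value ranges better than tests that depend on relationships between features.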

gdpoc t1_jcperei wrote on March 18, 2023 at 3:20 PM

It also depends on privacy constraints; sometimes you can't persist the data.

Fender6969 OP t1_jcrnzzg wrote on March 19, 2023 at 1:03 AM

Many of the clients I support have rather sensitive data, and persisting it in a repo would be a security risk. I suppose creating synthetic data would be the next best alternative.