Submitted by Fender6969 t3_11ujf7d in MachineLearning
-xylon t1_jcokzje wrote
Reply to comment by fullstackai in [D] Unit and Integration Testing for ML Pipelines by Fender6969
Having a schema and generating random or synthetic data based on that schema is my go-to approach for testing.
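A rough sketch of what schema-driven generation could look like, using only the standard library. The schema format, the column names, and the `generate_rows` helper are all made up for illustration; a real pipeline would likely use whatever schema definition it already has.

```python
import random

# Hypothetical schema (not from any real project): column -> (kind, constraints).
SCHEMA = {
    "user_id": ("int", {"min": 1, "max": 10_000}),
    "age": ("int", {"min": 18, "max": 90}),
    "spend": ("float", {"min": 0.0, "max": 500.0}),
    "plan": ("category", {"values": ["free", "basic", "pro"]}),
}


def generate_rows(schema, n_rows, seed=0):
    """Generate n_rows random records that satisfy the schema constraints."""
    rng = random.Random(seed)  # fixed seed keeps test data reproducible
    rows = []
    for _ in range(n_rows):
        row = {}
        for name, (kind, spec) in schema.items():
            if kind == "int":
                row[name] = rng.randint(spec["min"], spec["max"])
            elif kind == "float":
                row[name] = rng.uniform(spec["min"], spec["max"])
            elif kind == "category":
                row[name] = rng.choice(spec["values"])
            else:
                raise ValueError(f"unsupported kind: {kind}")
        rows.append(row)
    return rows


rows = generate_rows(SCHEMA, n_rows=100, seed=42)
```

Because the generator is seeded, a failing test can be reproduced by rerunning with the same seed.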
nucLeaRStarcraft t1_jcoo30z wrote
More or less the same. However, the simplest way to start, at least in my experience, is to take a random subsample of real data. Synthetic data may be too simple or fail to capture the real distribution, which can hide bugs.
Using both is probably the ideal solution.
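A minimal sketch of the subsampling idea, assuming the real data is already loaded in memory as a list of records. `real_rows` here is a placeholder stand-in and `subsample` is a hypothetical helper name.

```python
import random


def subsample(rows, k, seed=0):
    """Return a reproducible random subsample of at most k records."""
    rng = random.Random(seed)  # fixed seed so the same rows are picked each run
    k = min(k, len(rows))      # avoid ValueError when k exceeds the data size
    return rng.sample(rows, k)


# Placeholder for real records; in practice these would come from the pipeline's source.
real_rows = [{"id": i, "value": i * 0.5} for i in range(1_000)]

fixture = subsample(real_rows, k=20, seed=7)
```

The subsample can then be fed to the same tests as schema-generated data, so both kinds of input exercise the pipeline.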
gdpoc t1_jcperei wrote
It also depends on privacy constraints; sometimes you can't persist the data.
Fender6969 OP t1_jcrnzzg wrote
Many of the clients I support have rather sensitive data, and persisting it in a repo would be a security risk. I suppose creating synthetic data would be the next best alternative.