Fender6969
Fender6969 OP t1_jcrnzzg wrote
Reply to comment by gdpoc in [D] Unit and Integration Testing for ML Pipelines by Fender6969
Many of the clients I support have rather sensitive data, and persisting it in a repo would be a security risk. I suppose creating synthetic data would be the next best alternative.
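E.g., a minimal sketch of what I mean, assuming a made-up transactions schema (all column names and distributions here are placeholders; the real version would mirror the production schema's dtypes and edge cases):

```python
import numpy as np
import pandas as pd

def make_synthetic_transactions(n_rows: int = 500, seed: int = 42) -> pd.DataFrame:
    """Generate a fake dataset shaped like the production table.

    Everything here is illustrative -- the point is a deterministic,
    committable fixture with no real customer data in it.
    """
    rng = np.random.default_rng(seed)
    return pd.DataFrame({
        "customer_id": rng.integers(1, 1_000, n_rows),
        "amount": rng.gamma(shape=2.0, scale=50.0, size=n_rows).round(2),
        "channel": rng.choice(["web", "store", "phone"], n_rows),
        "ts": pd.Timestamp("2023-01-01")
              + pd.to_timedelta(rng.integers(0, 90, n_rows), unit="D"),
    })
```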
Fender6969 OP t1_jcrnury wrote
Reply to comment by theAbominablySlowMan in [D] Unit and Integration Testing for ML Pipelines by Fender6969
Yeah, I am always uncomfortable pushing untested code to Production. I think I have some good ideas now for which unit tests to add to my CI pipeline.
Fender6969 OP t1_jcrnppi wrote
Reply to comment by TheGuywithTehHat in [D] Unit and Integration Testing for ML Pipelines by Fender6969
Thanks for the response. I think hardcoding things might make the most sense. Ignoring testing the actual data for a minute, let us say I have an ML pipeline with the following units (rough Pytest sketches of the first two tests follow the list):
- Data Engineering: method that queries data and performs further aggregation in Pandas/PySpark
    - Unit test: hardcode an input to pass into this function and use Pytest/unittest to check for the exact output
- Model Training: method that engineers features and passes data into an Sklearn pipeline, which scales/encodes the data and trains the ML model
    - Unit test: check for successful predictions on training data to a degree of accuracy based on your evaluation metric
- Model Serving: first method performs ETL on the prediction data; second method loads the Sklearn pipeline object to serve predictions
    - Unit tests:
        - Module 1: same as Data Engineering
        - Module 2: check for successful predictions
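Something like this, where `aggregate_orders` and `train_pipeline` are hypothetical stand-ins for the pipeline's actual data-engineering and training methods:

```python
import pandas as pd

# Hypothetical imports -- substitute the pipeline's real functions.
from pipeline import aggregate_orders, train_pipeline

def test_aggregate_orders_exact_output():
    # Data Engineering: hardcoded input frame and the exact frame we expect back.
    raw = pd.DataFrame({
        "customer_id": [1, 1, 2],
        "amount": [10.0, 5.0, 7.5],
    })
    expected = pd.DataFrame({
        "customer_id": [1, 2],
        "total_amount": [15.0, 7.5],
    })
    result = aggregate_orders(raw)
    pd.testing.assert_frame_equal(result.reset_index(drop=True), expected)

def test_model_beats_accuracy_floor():
    # Model Training: fit on a small fixed dataset and assert the evaluation
    # metric clears a floor, rather than checking exact predictions.
    # The 0.9 threshold is a placeholder for whatever metric/floor fits the model.
    X = pd.DataFrame({"f1": [0, 1, 0, 1, 0, 1], "f2": [1, 1, 0, 0, 1, 0]})
    y = pd.Series([0, 1, 0, 1, 0, 1])
    model = train_pipeline(X, y)
    assert model.score(X, y) >= 0.9
```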
Do the above unit tests make sense to add to a CI pipeline?
Fender6969 OP t1_jcrw759 wrote
Reply to comment by TheGuywithTehHat in [D] Unit and Integration Testing for ML Pipelines by Fender6969
This makes perfect sense, thank you. I’m going to think through this further. If you have any suggestions for verification/sanity testing for any of the components listed above, please let me know.
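For context, the kind of sanity check I'm imagining on the serving side is something like this (the `[lower, upper]` bounds and the error messages are placeholders, not part of any real pipeline):

```python
import pandas as pd

def sanity_check_predictions(preds: pd.Series,
                             lower: float = 0.0,
                             upper: float = 1.0) -> None:
    """Cheap runtime guards on a batch of served predictions.

    The bounds stand in for whatever range the model should emit,
    e.g. [0, 1] for probabilities.
    """
    if preds.empty:
        raise ValueError("empty prediction batch")
    if preds.isna().any():
        raise ValueError("NaNs in predictions")
    if not preds.between(lower, upper).all():
        raise ValueError("prediction outside expected range")
```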