Fender6969 OP t1_jcrw759 wrote on March 19, 2023 at 2:08 AM

Reply to comment by TheGuywithTehHat in [D] Unit and Integration Testing for ML Pipelines by Fender6969

This makes perfect sense, thank you. I'm going to think through this further. If you have any suggestions for verification/sanity testing of any of the components listed above, please let me know.

Fender6969 OP t1_jcrnzzg wrote on March 19, 2023 at 1:03 AM

Reply to comment by gdpoc in [D] Unit and Integration Testing for ML Pipelines by Fender6969

Many of the clients I support have rather sensitive data and persisting this into a repo would be a security risk. I suppose creating synthetic data would be the next best alternative.
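
For what it's worth, here is a rough sketch of what I mean by synthetic data: a seeded generator that mimics the shape of a table without containing any real records. The function name, column names, and distributions are all made up for illustration and are not from any client schema.

```python
import numpy as np
import pandas as pd


def make_synthetic_customers(n_rows: int = 100, seed: int = 42) -> pd.DataFrame:
    """Hypothetical generator: fake rows with a realistic-looking schema."""
    # A fixed seed keeps the data identical across CI runs.
    rng = np.random.default_rng(seed)
    return pd.DataFrame(
        {
            "customer_id": np.arange(1, n_rows + 1),
            # Upper bound is exclusive, so ages fall in [18, 89].
            "age": rng.integers(18, 90, size=n_rows),
            "region": rng.choice(["north", "south", "east", "west"], size=n_rows),
            "monthly_spend": rng.gamma(shape=2.0, scale=50.0, size=n_rows).round(2),
        }
    )
```

Since nothing here is derived from real data, the generator itself can live in the repo and the tests can call it instead of reading a persisted file.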

Fender6969 OP t1_jcrnury wrote on March 19, 2023 at 1:02 AM

Reply to comment by theAbominablySlowMan in [D] Unit and Integration Testing for ML Pipelines by Fender6969

Yeah I am always uncomfortable pushing untested code to Production. I think I have some good ideas for what to add to my CI pipeline regarding unit tests.

Fender6969 OP t1_jcrnppi wrote on March 19, 2023 at 1:01 AM

Reply to comment by TheGuywithTehHat in [D] Unit and Integration Testing for ML Pipelines by Fender6969

Thanks for the response. I think hardcoding things might make the most sense. Ignoring testing the actual data for a minute, let us say I have an ML pipeline with the following units:

Data Engineering: method that queries data, performs further aggregation in Pandas/PySpark
1. Unit test: hardcode an input to pass into this function and leverage Pytest/unittest to check for the exact output
Model Training: method that engineers features and passes data into Sklearn pipeline, which scales/encodes data and trains ML model
1. Unit test: check for successful predictions on training data to a degree of accuracy based on your evaluation metric
Model Serving: first method that performs ETL for prediction data and second method that loads Sklearn pipeline object to serve prediction
1. Unit test:
  1. Module 1: same as Data Engineering
  2. Module 2: check for successful predictions
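
To make the Data Engineering case concrete, this is roughly what I have in mind with Pytest. It is only a sketch: `aggregate_sales` and the `customer_id`/`amount` columns are hypothetical stand-ins for the real aggregation step, and the query itself is left out so the test only covers the Pandas logic.

```python
import pandas as pd
from pandas.testing import assert_frame_equal


def aggregate_sales(df: pd.DataFrame) -> pd.DataFrame:
    """Hypothetical aggregation step: total amount per customer."""
    return (
        df.groupby("customer_id", as_index=False)["amount"]
        .sum()
        .sort_values("customer_id")
        .reset_index(drop=True)
    )


def test_aggregate_sales_exact_output():
    # Hardcoded input standing in for whatever the query would return.
    raw = pd.DataFrame(
        {"customer_id": [2, 1, 1, 2], "amount": [5.0, 10.0, 2.5, 1.0]}
    )
    # Hardcoded expected output, checked for exact equality (values and dtypes).
    expected = pd.DataFrame({"customer_id": [1, 2], "amount": [12.5, 6.0]})
    assert_frame_equal(aggregate_sales(raw), expected)
```

The first module of Model Serving would follow the same pattern, just pointed at the prediction ETL function.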

Do the above unit tests make sense to add in a CI pipeline?
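
For the Model Training and second Model Serving checks, I am picturing something like the sketch below. Everything in it is an assumption for illustration: the `age`/`plan` features, the toy label rule, the `LogisticRegression` model, the 0.9 accuracy threshold, and using `pickle` as the way the pipeline object gets saved and loaded.

```python
import pickle

import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler


def build_pipeline() -> Pipeline:
    """Hypothetical Sklearn pipeline: scale numerics, encode categoricals, fit a model."""
    preprocess = ColumnTransformer(
        [
            ("num", StandardScaler(), ["age"]),
            ("cat", OneHotEncoder(handle_unknown="ignore"), ["plan"]),
        ]
    )
    return Pipeline([("preprocess", preprocess), ("model", LogisticRegression())])


def make_training_data(n_rows: int = 200, seed: int = 0):
    """Seeded toy data with an easily learnable label (age above 45)."""
    rng = np.random.default_rng(seed)
    age = rng.integers(18, 80, size=n_rows)
    plan = rng.choice(["basic", "pro"], size=n_rows)
    X = pd.DataFrame({"age": age, "plan": plan})
    y = (age > 45).astype(int)
    return X, y


def test_pipeline_trains_and_predicts():
    # Model Training: fit, then check predictions on the training data.
    X, y = make_training_data()
    pipeline = build_pipeline().fit(X, y)
    preds = pipeline.predict(X)
    assert preds.shape == (len(X),)
    assert set(preds) <= {0, 1}
    # Threshold is arbitrary here; pick one that suits your evaluation metric.
    assert pipeline.score(X, y) >= 0.9


def test_reloaded_pipeline_serves_same_predictions():
    # Model Serving, Module 2: a saved and reloaded pipeline predicts identically.
    X, y = make_training_data()
    pipeline = build_pipeline().fit(X, y)
    reloaded = pickle.loads(pickle.dumps(pipeline))
    assert np.array_equal(reloaded.predict(X), pipeline.predict(X))
```

The accuracy assertion is really a smoke test that training ran and learned something, not a substitute for proper model evaluation.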