Fender6969

Fender6969 OP t1_jcrnppi wrote

Thanks for the response. I think hardcoding things might make the most sense. Ignoring testing the actual data for a minute, let us say I have an ML pipeline with the following units:

  1. Data Engineering: method that queries data, performs further aggregation in Pandas/PySpark
    1. Unit test: hardcode an input to pass into this function and use Pytest/unittest to check for the exact output
  2. Model Training: method that engineers features and passes data into Sklearn pipeline, which scales/encodes data and trains ML model
    1. Unit test: check for successful predictions on training data to a degree of accuracy based on your evaluation metric
  3. Model Serving: first method that performs ETL for prediction data and second method that loads Sklearn pipeline object to serve prediction
    1. Unit test:
      1. Method 1 (ETL): same as Data Engineering
      2. Method 2 (serving): check for successful predictions
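
To make that concrete, here is a rough sketch of what I have in mind, written for Pytest. Everything in it is made up for illustration: `aggregate_sales`, `build_pipeline`, the column names, the toy data, and the 0.9 accuracy threshold are all placeholders, not my real code.

```python
import joblib
import pandas as pd
from pandas.testing import assert_frame_equal
from sklearn.compose import ColumnTransformer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler


def aggregate_sales(df: pd.DataFrame) -> pd.DataFrame:
    """Hypothetical data engineering step: total amount per customer."""
    return df.groupby("customer_id", as_index=False)["amount"].sum()


def build_pipeline() -> Pipeline:
    """Hypothetical training pipeline: scale/encode, then a classifier."""
    preprocess = ColumnTransformer([
        ("num", StandardScaler(), ["amount"]),
        ("cat", OneHotEncoder(handle_unknown="ignore"), ["region"]),
    ])
    return Pipeline([("preprocess", preprocess), ("model", LogisticRegression())])


def toy_training_data():
    """Hardcoded, clearly separable toy data (an assumption for the sketch)."""
    X = pd.DataFrame({
        "amount": [1.0, 2.0, 1.5, 10.0, 11.0, 9.5],
        "region": ["a", "a", "b", "b", "a", "b"],
    })
    y = [0, 0, 0, 1, 1, 1]
    return X, y


def test_aggregate_sales_exact_output():
    # 1. Data Engineering: hardcoded input, exact expected output.
    raw = pd.DataFrame({"customer_id": [1, 1, 2], "amount": [10.0, 5.0, 7.5]})
    expected = pd.DataFrame({"customer_id": [1, 2], "amount": [15.0, 7.5]})
    assert_frame_equal(aggregate_sales(raw), expected)


def test_pipeline_fits_training_data():
    # 2. Model Training: predictions on training data clear a chosen threshold.
    X, y = toy_training_data()
    pipeline = build_pipeline().fit(X, y)
    assert pipeline.score(X, y) >= 0.9


def test_serving_round_trip(tmp_path):
    # 3. Model Serving: a saved pipeline loads and returns one prediction per row.
    X, y = toy_training_data()
    model_path = tmp_path / "pipeline.joblib"
    joblib.dump(build_pipeline().fit(X, y), model_path)
    loaded = joblib.load(model_path)
    assert len(loaded.predict(X)) == len(X)
```

In CI this would just be a `pytest` run on every push. The toy data is small enough that training takes well under a second, so the training test would not slow the pipeline down.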

Do the above unit tests make sense to add to a CI pipeline?
