asraniel t1_jcoj075 wrote on March 18, 2023 at 10:27 AM

#2,261,466

i would love to know more about this. some of my ideas are super simple datasets that might very well overfit, but show that the code works. other than that, i would love to hear more about simple tests which do not need to run the full pipeline on the full dataset (which does not tell you that much ultimately)

fullstackai t1_jcokcsq wrote on March 18, 2023 at 10:45 AM

#2,261,523

I treat code artifacts of ML pipelines like any other software. I aim for 100% test coverage. Probably a bit controversial, but I always keep a small amount of example data in the repo for unit and integration tests. Could also be downloaded from blob in the CI pipeline, but repo size is usually not the limiting factor.

-xylon t1_jcokzje wrote on March 18, 2023 at 10:54 AM

#2,261,558

Replying to fullstackai (#2,261,523)

Having a schema and generating random or synthetic data based on that schema is my way to go for testing.
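A minimal sketch of the schema-driven approach described above, using only the Python standard library. The schema format, the column names, and `generate_rows` are assumptions made for illustration; they are not from the thread or any particular library.

```python
import random

# Hypothetical schema: column name -> (kind, parameters). Illustrative only.
SCHEMA = {
    "age": ("int", {"low": 18, "high": 90}),
    "income": ("float", {"low": 0.0, "high": 250_000.0}),
    "segment": ("category", {"choices": ["a", "b", "c"]}),
}


def generate_rows(schema, n_rows, seed=0):
    """Return n_rows dicts whose values respect the given schema."""
    rng = random.Random(seed)  # seeded so a failing test can be reproduced
    rows = []
    for _ in range(n_rows):
        row = {}
        for name, (kind, params) in schema.items():
            if kind == "int":
                row[name] = rng.randint(params["low"], params["high"])
            elif kind == "float":
                row[name] = rng.uniform(params["low"], params["high"])
            elif kind == "category":
                row[name] = rng.choice(params["choices"])
            else:
                raise ValueError(f"unknown column kind: {kind}")
        rows.append(row)
    return rows
```

Fixing the seed keeps the generated data identical between runs, so a test that fails in CI can be replayed locally with the same rows.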

blazejd t1_jcolu6i wrote on March 18, 2023 at 11:05 AM

#2,261,589

I've been trying to figure it out myself and I'm very curious to see other responses.

TheGuywithTehHat t1_jcomptw wrote on March 18, 2023 at 11:16 AM

#2,261,628

Any specific part you're wondering about? General advice applies here: test each unit of your software, and then integrate the units and test them that way. For each unit, hardcode the input and then test that the output is what you expect. For unit tests, make them as simple as possible while still testing as much of the functionality as possible. For integration tests, make a variety of them ranging from just a couple combined units & simple input/output to end-to-end tests that simulate the real world as closely as possible.

This is all advice that's not specific to ML in any way. Anything more specific depends on many factors, which boil down to two questions:

What is your environment like?
What do you expect to change between different runs of the test?

For example: Will your dataset change? Will it change just a little (MNIST to Fashion-MNIST) or a lot (MNIST to CIFAR)? Will your model change? Will it just be a new training run of the same model? Will the model architecture stay the same or will it change internally? Will the input or output format of the model change? How often will any of these changes happen? Which parts of the pipeline are manual, and which are automatic? For each part of the system, what are the consequences of it failing (does it merely block further development, or will you get angry calls from your clients)?

Edit: I think the best advice I can give is to test everything that can possibly be tested, but prioritize based on risk impact (chance_of_failure * consequences_of_failure).
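As a rough illustration of that prioritization, the sketch below ranks pipeline components by chance_of_failure * consequences_of_failure. The component names and the numbers are made up for the example; in practice they would be estimates.

```python
def rank_by_risk(components):
    """Sort components by chance_of_failure * consequences_of_failure, highest first."""
    return sorted(
        components,
        key=lambda c: c["chance_of_failure"] * c["consequences_of_failure"],
        reverse=True,
    )


# Hypothetical components with made-up estimates.
components = [
    {"name": "data_loading", "chance_of_failure": 0.3, "consequences_of_failure": 2},
    {"name": "model_serving", "chance_of_failure": 0.1, "consequences_of_failure": 10},
    {"name": "plotting", "chance_of_failure": 0.5, "consequences_of_failure": 1},
]

ranked_names = [c["name"] for c in rank_by_risk(components)]
```

Here the component that fails least often still ranks first, because its failure is the most costly.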

nucLeaRStarcraft t1_jcoo30z wrote on March 18, 2023 at 11:32 AM

#2,261,705

Replying to -xylon (#2,261,558)

More or less the same. However, the simplest way to start, at least that's what I found, is to take a random subsample of real data. It may be the case that synthetic data is simply too simple / does not capture the real distribution and can hide bugs.

Probably both is the ideal solution.
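One possible way to draw such a subsample reproducibly, sketched with the standard library. The helper name `subsample` and the toy records are assumptions for illustration.

```python
import random


def subsample(records, k, seed=0):
    """Return up to k records drawn without replacement, deterministic for a given seed."""
    records = list(records)  # random.sample needs a sequence
    rng = random.Random(seed)
    return rng.sample(records, min(k, len(records)))


# Toy stand-in for real data.
records = [{"id": i, "value": i * i} for i in range(100)]
fixture = subsample(records, 10, seed=42)
```

The resulting `fixture` could be stored once and reused by tests, provided the data is allowed to be persisted.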

gdpoc t1_jcperei wrote on March 18, 2023 at 3:20 PM

#2,263,241

Replying to nucLeaRStarcraft (#2,261,705)

Also depends on privacy constraints, sometimes you can't persist the data.

fleanend t1_jcq5dz7 wrote on March 18, 2023 at 6:21 PM

#2,264,555

Replying to fullstackai (#2,261,523)

I'm glad I'm not the only one

theAbominablySlowMan t1_jcqlq7g wrote on March 18, 2023 at 8:15 PM

#2,265,243

bash a big ole data set through as an integration test and call it a done job. in my experience, DS moves too fast for testing to be as effective as for SWEs (no matter how carefully I've written my tests, they've never lasted more than 12 months before becoming a nuisance that people started ignoring).

Fender6969 OP t1_jcrnppi wrote on March 19, 2023 at 1:01 AM

#2,266,931

Replying to TheGuywithTehHat (#2,261,628)

Thanks for the response. I think hardcoding things might make the most sense. Ignoring testing the actual data for a minute, let us say I have an ML pipeline with the following units:

Data Engineering: method that queries data, performs further aggregation in Pandas/PySpark
  1. Unit test: hardcode an input to pass into this function and leverage Pytest/unittest to check for the exact output

Model Training: method that engineers features and passes data into Sklearn pipeline, which scales/encodes data and trains ML model
  1. Unit test: check for successful predictions on training data to a degree of accuracy based on your evaluation metric

Model Serving: first method that performs ETL for prediction data and second method that loads Sklearn pipeline object to serve prediction
  1. Unit test:
    1. Module 1: same as Data Engineering
    2. Module 2: check for successful predictions

Do the above unit tests make sense to add to a CI pipeline?

Fender6969 OP t1_jcrnury wrote on March 19, 2023 at 1:02 AM

#2,266,939

Replying to theAbominablySlowMan (#2,265,243)

Yeah I am always uncomfortable pushing untested code to Production. I think I have some good ideas for what to add to my CI pipeline regarding unit tests.

Fender6969 OP t1_jcrnzzg wrote on March 19, 2023 at 1:03 AM

#2,266,942

Replying to gdpoc (#2,263,241)

Many of the clients I support have rather sensitive data and persisting this into a repo would be a security risk. I suppose creating synthetic data would be the next best alternative.

TheGuywithTehHat t1_jcrsjlo wrote on March 19, 2023 at 1:39 AM

#2,267,187

Replying to Fender6969 (#2,266,931)

Most of that makes sense. The only thing I would be concerned about is the model training test. Firstly, a unit test should test the smallest possible unit. You should have many unit tests to test your model, and you should focus on those tests being as simple as possible. Nearly every function you write should have its own unit test, and no unit test should test more than one function. Secondly, there is an important difference between verification and validation testing. Verification testing shouldn't test for any particular accuracy threshold or anything like that, it should at most verify things like "model.fit() causes the model to change" or "a linear regression model that is all zeroes produces an output of zero." Verification testing is what you put on your CI pipeline to sanity check your code before it gets merged to master. Validation testing, however, should test model accuracy. It should go on your CD pipeline, and should validate that the model you're trying to push to production isn't low quality.
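A minimal sketch of the two verification checks mentioned above. `TinyLinearModel` is a hypothetical pure-Python stand-in for a real model, written only so that the tests run without any ML library; the same assertions would apply to a real model's parameters and predictions.

```python
class TinyLinearModel:
    """Hypothetical stand-in for a real model: y = w * x + b, trained by gradient descent."""

    def __init__(self):
        self.w = 0.0
        self.b = 0.0

    def predict(self, xs):
        return [self.w * x + self.b for x in xs]

    def fit(self, xs, ys, lr=0.01, epochs=100):
        n = len(xs)
        for _ in range(epochs):
            preds = self.predict(xs)
            # Gradients of the mean squared error with respect to w and b.
            grad_w = sum(2 * (p - y) * x for p, y, x in zip(preds, ys, xs)) / n
            grad_b = sum(2 * (p - y) for p, y in zip(preds, ys)) / n
            self.w -= lr * grad_w
            self.b -= lr * grad_b
        return self


def test_zero_model_outputs_zero():
    # A linear model whose parameters are all zero must predict zero.
    model = TinyLinearModel()
    assert model.predict([1.0, -2.0, 3.5]) == [0.0, 0.0, 0.0]


def test_fit_changes_parameters():
    # fit() must actually update the model; no accuracy threshold is checked.
    model = TinyLinearModel()
    before = (model.w, model.b)
    model.fit([0.0, 1.0, 2.0], [1.0, 3.0, 5.0])
    assert (model.w, model.b) != before
```

Neither test asserts anything about accuracy, which is what keeps them fast and stable enough for a CI run.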

Fender6969 OP t1_jcrw759 wrote on March 19, 2023 at 2:08 AM

#2,267,376

Replying to TheGuywithTehHat (#2,267,187)

This makes perfect sense, thank you. I’m going to think through this further. If you have any suggestions for verification/sanity testing for any of the components listed above, please let me know.

gamerx88 t1_jctp6px wrote on March 19, 2023 at 2:19 PM

#2,270,340

Replying to fullstackai (#2,261,523)

Is there a reason why you feel there is a need for such rigour? 100% is quite an overkill even for typical software projects IMO.

You probably end up having to write tests for even simple one-liner functions, which gets exhausting.

gamerx88 t1_jctqruk wrote on March 19, 2023 at 2:32 PM

#2,270,418

For ETL, write unit tests to handle some input edge cases, e.g. null values, malformed input, and out-of-range values, as well as some simple working cases.
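A small sketch of what such edge-case tests might look like. The function `clean_age`, its accepted range, and the choice to return `None` for unusable input are all assumptions for illustration; a real ETL step would have its own rules.

```python
def clean_age(raw):
    """Hypothetical ETL step: parse an age field, returning None for unusable input."""
    if raw is None:
        return None
    try:
        age = int(str(raw).strip())
    except ValueError:
        return None  # malformed input such as "forty" or ""
    if not 0 <= age <= 120:
        return None  # out-of-range value
    return age


# (raw input, expected output) pairs: working cases first, then edge cases.
CASES = [
    ("42", 42),
    (" 7 ", 7),
    (None, None),
    ("forty", None),
    ("", None),
    ("-3", None),
    ("999", None),
]


def test_clean_age():
    for raw, expected in CASES:
        assert clean_age(raw) == expected, f"failed for {raw!r}"
```

With pytest, the same table of cases could be fed through `pytest.mark.parametrize` so that each case is reported separately.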

For model training, the test focus is on having "valid" hyperparams and configurations. I write test cases that try to overfit on a small training set, i.e. confirm that the model learns. There are also some robustness tests that I sometimes run post training, but those are very specific to certain NLP tasks and applications.
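The overfitting check could be sketched as follows. The one-feature linear model trained by gradient descent is a stand-in chosen so the example runs with the standard library alone; the function names, learning rate, and loss threshold are assumptions.

```python
def mse(w, b, xs, ys):
    """Mean squared error of the line y = w * x + b on the given points."""
    return sum((w * x + b - y) ** 2 for x, y in zip(xs, ys)) / len(xs)


def train(xs, ys, lr=0.1, epochs=2000):
    """Fit y = w * x + b by plain gradient descent, starting from zero."""
    w, b = 0.0, 0.0
    n = len(xs)
    for _ in range(epochs):
        grad_w = sum(2 * (w * x + b - y) * x for x, y in zip(xs, ys)) / n
        grad_b = sum(2 * (w * x + b - y) for x, y in zip(xs, ys)) / n
        w -= lr * grad_w
        b -= lr * grad_b
    return w, b


def test_model_can_overfit_tiny_dataset():
    # Three points that lie exactly on y = 2x + 1, so near-zero loss is reachable.
    xs, ys = [0.0, 1.0, 2.0], [1.0, 3.0, 5.0]
    initial_loss = mse(0.0, 0.0, xs, ys)
    w, b = train(xs, ys)
    final_loss = mse(w, b, xs, ys)
    assert final_loss < initial_loss
    assert final_loss < 1e-6
```

If a test like this fails, the cause is more likely a bug or a bad configuration (for example a learning rate that diverges) than a lack of data.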

For model serving, test successful parsing of the request and the subsequent feature transformation (if any); this is very similar to ETL.

fullstackai t1_jcttgyd wrote on March 19, 2023 at 2:51 PM

#2,270,554

Replying to gamerx88 (#2,270,340)

Should have been more precise. 100% of what goes into any pipeline or the deployment gets tested. We deploy many models on the edge in manufacturing. If the model fails, the production line might stand still. Can't risk that.

gamerx88 t1_jcx0t9r wrote on March 20, 2023 at 5:21 AM

#2,277,127

Replying to fullstackai (#2,270,554)

Ah, that makes sense.

[D] Unit and Integration Testing for ML Pipelines
