HandsomeMLE t1_iyz0511 wrote on December 5, 2022 at 5:52 AM

I've finished training a model, but I'm not confident about how to test or prepare it against unexpected risks to its trustworthiness and reliability once it's deployed. Are there any rules of thumb or recommended methods for thoroughly testing a model against those unseen risks?

trnka t1_iz0722k wrote on December 5, 2022 at 2:37 PM

If possible, find some beta testers. If you're in industry try to find some non-technical folks internally. Don't tell them how to use it, just observe. That will often uncover types of inputs you might not have tested, and can become test cases.

Also, look into monitoring in production. Much like regular software engineering, it's hard to prevent all defects. But some defects are easy to observe by monitoring, like changes in the types of inputs you're seeing over time.
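
A minimal sketch of that kind of input monitoring, assuming text inputs and using the population stability index (PSI) over coarse length buckets. The bucket edges, function names, and example texts are all hypothetical, and any alert threshold would need tuning for the actual traffic.

```python
import math
from collections import Counter


def length_bucket(text):
    """Assign a text to a coarse length bucket (bucket edges are arbitrary)."""
    n = len(text.split())
    if n <= 3:
        return "short"
    if n <= 20:
        return "medium"
    return "long"


def bucket_shares(texts, buckets=("short", "medium", "long"), eps=1e-6):
    """Fraction of texts in each bucket, floored at eps to avoid log(0)."""
    counts = Counter(length_bucket(t) for t in texts)
    total = max(len(texts), 1)
    return {b: max(counts[b] / total, eps) for b in buckets}


def population_stability_index(reference, current):
    """PSI between two bucketed samples; 0 means the bucket shares are identical."""
    ref = bucket_shares(reference)
    cur = bucket_shares(current)
    return sum((cur[b] - ref[b]) * math.log(cur[b] / ref[b]) for b in ref)


# Made-up example: inputs seen during testing vs. a recent production window.
reference = ["please reset my password"] * 50 + ["hi"] * 50
current = ["hi"] * 90 + ["please reset my password"] * 10
print(f"PSI = {population_stability_index(reference, current):.3f}")
```

A larger PSI means the mix of inputs has moved further from what the model was tested on. The same comparison works for any bucketed property of the input, such as language, source, or a category field.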

If you're relationship-oriented, definitely make friends with users if possible or people that study user feedback and data, so that they pass feedback along more readily.

HandsomeMLE t1_iz45m9t wrote on December 6, 2022 at 9:42 AM

Many thanks for your answer! I'll definitely do that. I'm also wondering whether there are any tools, services, or even methodologies that help pre-screen for potential model defects or catch unexpected reliability issues the model might have, so I can improve the model's quality and accuracy.

trnka t1_iz4nfux wrote on December 6, 2022 at 1:23 PM

Depends on the kind of model. Some examples:

For classification, a confusion matrix is a great way to find issues
For images of people, there's a good amount of work to detect and correct racial bias (probably there are tools to help too)
It can be helpful to use explainability tools like LIME or SHAP -- sometimes that will help you figure out that the model is sensitive to some unimportant inputs and not sensitive enough to important features
Just reviewing errors or poor predictions on held-out data will help you spot some issues.
For time-series, even just looking at graphs of predictions vs actuals on held-out data can help you discover issues
For text input, plot metrics vs text length to see if it does much worse with short texts or long texts
For text input, you can try typos or different capitalization. If it's a language with accents, try inputs that don't have proper accents
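
The last item in that list can be sketched roughly as follows: apply a few perturbations to each test text and count how often the prediction changes. This is only an illustration. `toy_predict` is a hypothetical stand-in for a real model, and the perturbation set and function names are assumptions.

```python
import unicodedata


def strip_accents(text):
    """Remove combining marks, e.g. 'café' becomes 'cafe'."""
    decomposed = unicodedata.normalize("NFD", text)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


def swap_adjacent(text, index):
    """Simulate a typo by swapping the characters at index and index + 1."""
    if index < 0 or index + 1 >= len(text):
        return text
    chars = list(text)
    chars[index], chars[index + 1] = chars[index + 1], chars[index]
    return "".join(chars)


PERTURBATIONS = {
    "lowercase": str.lower,
    "uppercase": str.upper,
    "no_accents": strip_accents,
    "typo": lambda t: swap_adjacent(t, len(t) // 2),
}


def flip_rates(predict, texts):
    """For each perturbation, the fraction of texts whose prediction changes."""
    rates = {}
    for name, perturb in PERTURBATIONS.items():
        flips = sum(predict(perturb(t)) != predict(t) for t in texts)
        rates[name] = flips / (len(texts) or 1)
    return rates


def toy_predict(text):
    """Hypothetical case-sensitive keyword model, used only for the demo."""
    return "refund" if "Refund" in text else "other"


print(flip_rates(toy_predict, ["Refund please", "Where is my order"]))
```

A high flip rate for a perturbation that shouldn't change the meaning (case, missing accents) points at a preprocessing or training-data gap worth looking into.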

I wish I had some tool or service recommendations. I'm sure they exist, but the methods to use are generally specific to the input type of the model (text, image, tabular, time-series) and/or the output of the model (classification, regression, etc). I haven't seen a single tool or service that works for everything.

For hyperparameter tuning, even general-purpose libraries like scikit-learn are great for running the search. At my last job I wrote some code to run statistical tests, assuming that each hyperparam affected the metric independently, and that helped a ton. Then I did various correlation plots. Generally it's good to check that you haven't made any big mistakes with hyperparams (for example, if the best value is the min or max of the ones you tried, you can probably try a wider range).
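
The min/max check is easy to automate. A minimal sketch, assuming the search results for one numeric hyperparameter have already been collected into a dict of tried value to validation metric (the function name and the example numbers are made up):

```python
def best_at_boundary(results, higher_is_better=True):
    """Return (best_value, at_edge) for one numeric hyperparameter.

    results maps each tried value to its validation metric. at_edge is True
    when the best value is the smallest or largest value tried, which
    suggests the search range may be too narrow.
    """
    if not results:
        raise ValueError("results must not be empty")
    pick = max if higher_is_better else min
    best_value = pick(results, key=results.get)
    at_edge = len(results) > 1 and best_value in (min(results), max(results))
    return best_value, at_edge


# Made-up validation accuracies for three values of a regularization strength.
best, at_edge = best_at_boundary({0.01: 0.81, 0.1: 0.84, 1.0: 0.88})
if at_edge:
    print(f"best value {best} is at the edge of the range; consider widening it")
```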

Some of the other issues that come to mind in deployment:

We had one PyTorch model that would occasionally have a latency spike (like <0.1% of the time). We never figured out why, except that the profiler said it was happening inside of PyTorch.
We had some issues with unicode input -- the upstream service was sending us latin-1 but we thought it was utf8. We'd tested Chinese input and it didn't crash because the upstream just dropped those chars, but then crashed with Spanish input
At one point the model was using like 99% of the memory of the instance, and there must've been a memory leak somewhere because after 1-3 weeks it'd reboot. It was easy enough to increase memory though
One time we had an issue where someone checked in a model that didn't match the evaluation report
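
Two of these issues lend themselves to small guards. The sketch below is an illustration, not what was actually deployed: it decodes request bytes strictly so an encoding mismatch fails with a clear message, and it compares a model file's SHA-256 against a hash recorded alongside the evaluation report. All function names are hypothetical, and recording a hash in the report is an assumed practice. The demo also simulates the upstream behavior described above by dropping characters that latin-1 can't represent.

```python
import hashlib


def decode_strict(raw, declared_encoding="utf-8"):
    """Decode request bytes, failing loudly with context instead of guessing."""
    try:
        return raw.decode(declared_encoding)
    except UnicodeDecodeError as err:
        raise ValueError(
            f"payload is not valid {declared_encoding} at byte {err.start}; "
            "check what encoding the upstream service really sends"
        ) from err


def sha256_hex(model_bytes):
    """Hex digest of the serialized model."""
    return hashlib.sha256(model_bytes).hexdigest()


def matches_report(model_bytes, reported_sha256):
    """True if the model being shipped is the one that was evaluated."""
    return sha256_hex(model_bytes) == reported_sha256.strip().lower()


# Simulated upstream: Chinese characters are silently dropped, so the test passes.
chinese = "你好 hello".encode("latin-1", errors="ignore")
print(repr(decode_strict(chinese)))

# Spanish survives as latin-1 bytes, which are not valid UTF-8.
spanish = "niño".encode("latin-1")
try:
    decode_strict(spanish)
except ValueError as err:
    print("rejected:", err)
```

This shows why the Chinese test gave false confidence: the risky characters never reached the model, while accented Latin characters did.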

HandsomeMLE t1_iz95iqy wrote on December 7, 2022 at 11:52 AM

Thank you very much for your detailed explanation, trnka. It's been really helpful! It seems inevitable to have lots of unexplained issues in the process and I guess we can't expect to be perfect all at once :)

How would you weigh the importance of validating/testing a model? (Maybe it depends on the sector/industry?) As a beginner, I hope I'm not putting more time and effort into it than I should be.

trnka t1_iz9j30k wrote on December 7, 2022 at 2:02 PM

It definitely depends on sector/industry and also the use case for the model. For example, if you're building a machine learning model that might influence medical decisions, you should put more time into validating it before anyone uses it. And even then, you need to proceed very cautiously in rolling it out and be ready to roll it back.

If it's a model for a small-scale shopping recommendation system, the harm from launching a bad model is probably much lower, especially if you can revert a bad model quickly.

To answer the question about the importance of validating, importance is relative to all the other things you could be doing. It's also about dealing with the unknown -- you don't really know if additional effort in validation will uncover any new issues. I generally like to list out all the different risks of the model, feature, and product, and try to guesstimate the amount of risk to the user, the business, the team, and myself. Then I list out a few things I could do to reduce risk in those areas and pick the work that I think is a good cost-benefit tradeoff.
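
One way to make that list concrete is a rough ranking by guesstimated risk reduction per day of effort. This is only a sketch of the idea: the ratio is an assumed scoring rule, and the mitigation names and numbers are invented placeholders, not recommendations.

```python
from dataclasses import dataclass


@dataclass
class Mitigation:
    name: str
    risk_reduced: float  # guesstimate on a 0-10 scale
    effort_days: float   # must be positive

    @property
    def value_per_day(self):
        """Guesstimated risk reduction per day of effort."""
        return self.risk_reduced / self.effort_days


def rank(mitigations):
    """Sort mitigations from best to worst guesstimated cost-benefit."""
    return sorted(mitigations, key=lambda m: m.value_per_day, reverse=True)


# Invented example entries; replace with your own list and guesses.
options = [
    Mitigation("beta test with internal users", risk_reduced=6, effort_days=3),
    Mitigation("add input drift monitoring", risk_reduced=4, effort_days=1),
    Mitigation("full bias audit", risk_reduced=8, effort_days=10),
]
for m in rank(options):
    print(f"{m.value_per_day:4.1f}  {m.name}")
```

The numbers are guesses, so the ranking is a conversation starter rather than an answer, but writing them down makes it easier to compare options and revisit them later.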

There's also a spectrum regarding how much people plan ahead in projects:

Planning-heavy: Spend months anticipating every possible failure that could happen. Sometimes people call this waterfall.
Planning-light: Just ship something, see what the feedback is, and fix it. The emphasis here is on a rapid iteration cycle from feedback rather than planning ahead. Sometimes people call this agile, sometimes people say "move fast and break things"

Planning-heavy workflows often waste time on unneeded things, and fail to act on user feedback quickly. Planning-light workflows often make mistakes on their first version that were knowable, and can sometimes permanently lose user trust. I tend to lean planning-light, but there is definite value in doing some planning upfront so long as it's aligned with the users and the business.

In your case, it's a spectrum of how much you test ahead of time vs monitor. Depending on your industry, you can save effort by doing a little of both rather than a lot of either.

I can't really tell you whether you're spending too much time in validation or too little, but hopefully this helps give you some ideas of how you can answer that question for yourself.

HandsomeMLE t1_izdtpzc wrote on December 8, 2022 at 11:10 AM

So I take it that it all depends on what kind of model we're working on, how much weight we give to the importance and likelihood of the risks associated with it, and how we act and measure accordingly.

Thank you very much for your thoughtful input. It's been really helpful!