How to Test AI Models: An Introductory Guide for QA
Get some simple answers to frequently asked questions regarding quality assurance within machine learning.
The latest trends show that machine learning is one of the most rapidly developing fields in computer science. Unfortunately, many customers who are neither data scientists nor ML developers are still unsure how to handle it, but they do know that they need to incorporate AI into their products.
Here are the most frequently asked questions we get from customers regarding quality assurance within ML:
- I want to run UAT; could you please provide your full regression test cases for AI?
- OK, I have the model running in production; how can we assure it doesn’t break when we update it?
- How can I make sure it produces the right values I need?
What are some simple answers here?
A Brief Introduction to Machine Learning
To understand how ML works, let’s take a closer look at what an ML model actually is.
What is the difference between classical algorithms/hardcoded functions and ML-based models?
From the black-box perspective, it’s the same box with input and output. Fill the inputs in, get the outputs — what a beautiful thing!
From the white-box perspective, and specifically in how the system is built, it’s a bit different. The core difference here is:
- You write the function, or
- The function is fitted by a specific algorithm based on your data.
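To make the contrast concrete, here is a minimal sketch; the spam-detection framing, the feature names, and the tiny training set are invented purely for illustration:

```python
# A minimal sketch of the contrast. The spam-detection framing, feature
# names, and training data are hypothetical, chosen only for illustration.
import numpy as np
from sklearn.linear_model import LogisticRegression

# 1. Classical approach: you write the function by hand.
def is_spam_handwritten(num_links: int, num_caps_words: int) -> bool:
    # The rule (and its thresholds) is encoded by a developer.
    return num_links > 3 and num_caps_words > 5

# 2. ML approach: the function is fitted by an algorithm from your data.
X_train = np.array([[0, 1], [1, 0], [5, 8], [7, 10]])  # [num_links, num_caps_words]
y_train = np.array([0, 0, 1, 1])                       # 0 = ham, 1 = spam

model = LogisticRegression().fit(X_train, y_train)

# Same black box from the outside: inputs in, outputs out.
print(is_spam_handwritten(5, 8))      # True
print(model.predict([[5, 8]])[0])     # 1 (learned from data, not hand-coded)
```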
You can verify the ETL part that feeds the model and its coefficients, but you can’t verify model quality as easily as you can other parameters.
So, What About QA?
The model review procedure is similar to code review, but tailored to data science teams. I haven’t seen many QA engineers participating in this particular procedure; after it come model quality assessment, improvement, and so on, and the assessment itself usually happens inside the data science team.
Traditional QA comes into play at the integration level. Here are five points indicating that you have reasonable quality assurance when dealing with machine learning models in production:
- You have a service based on ML functions that is deployed in production. It’s up and running, and you want to make sure it isn’t broken by an automatically deployed new version of the model. This is a pure black-box scenario: load the test dataset and verify that the output is acceptable (for example, compare it to the pre-deployment stage results). Keep in mind: it’s not about exact matching; it’s about the best suggested value, so you need to agree on an acceptable dispersion rate. A sketch of such a check follows this list.
- You want to verify that the deployed ML functions process the data correctly (e.g., +/- inversion). That’s where the white-box approach works best: use unit and integration tests to check that the input data is loaded into the model correctly, that the signs (+/-) are handled as expected, and that the feature output is right. Wherever you use ETL, it’s good to have white-box checks; see the pytest sketch after this list.
- Production data can mutate: over time, the same input starts to correspond to a new expected output. For example, something changes user behavior and the quality of the model falls. The other case is dynamically changing data. If that risk is high, there are two approaches:
- Simple but expensive approach: retrain daily on the new dataset. In this case, you need to find the right balance for your service, since retraining is closely tied to your infrastructure cost.
- Complex approach: depends on how you collect feedback. For binary classification, for example, you can calculate metrics such as precision, recall, and F1 score, and write a service that dynamically scores the model on these parameters. If a metric falls below 0.6, it’s an alert; if it falls below 0.5, it’s a critical incident. A monitoring sketch follows this list.
- Public beta tests work very well in certain cases. You assess your model quality on data that wasn’t used previously. For instance, add 300 more users to generate data and process it. Ideally, the more new data you test on, the better. The original dataset is good, but a larger amount of high-quality data is always better. Note: test data extrapolation is not a good idea here; your model should work well with real users, not with predicted or generated data.
- Automatically ping the service to make sure it’s alive (not specifically ML testing, but it shouldn’t be forgotten). Use Pingdom. Yes, this simple thing has saved us many times. There are plenty of more advanced DevOps solutions out there; however, for us, everything started with this one, and we benefited a lot from it. A minimal liveness-check sketch closes out the examples below.
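For the first point, a black-box regression check can be as simple as the sketch below. The REST endpoint, file names, response schema, and tolerance value are assumptions; agree on the actual dispersion rate with your data science team.

```python
# Black-box regression check for a redeployed model: send a fixed test
# dataset to the service and compare its predictions to those recorded
# before deployment, allowing a configurable dispersion rather than
# exact matches. The endpoint, files, and tolerance are assumptions.
import json
import numpy as np
import requests

PREDICT_URL = "http://localhost:8000/predict"   # hypothetical service endpoint
ALLOWED_MEAN_ABS_DIFF = 0.05                    # acceptable dispersion, agreed with the DS team

def regression_check(test_inputs_path: str, baseline_path: str) -> None:
    with open(test_inputs_path) as f:
        test_inputs = json.load(f)              # list of feature dicts
    with open(baseline_path) as f:
        baseline = np.array(json.load(f))       # pre-deployment predictions

    response = requests.post(PREDICT_URL, json=test_inputs, timeout=30)
    response.raise_for_status()
    current = np.array(response.json()["predictions"])  # assumed response schema

    mean_abs_diff = np.mean(np.abs(current - baseline))
    assert mean_abs_diff <= ALLOWED_MEAN_ABS_DIFF, (
        f"Model drifted after deployment: mean |diff| = {mean_abs_diff:.4f}"
    )

if __name__ == "__main__":
    regression_check("test_inputs.json", "baseline_predictions.json")
```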
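For the white-box checks on the ETL/feature layer, a couple of pytest-style tests are usually enough to catch a broken sign inversion or an unexpected column. `build_features` and its columns are hypothetical stand-ins for your own pipeline code.

```python
# White-box checks on the ETL/feature layer, runnable with pytest.
# `build_features` and its column names are hypothetical examples.
import pandas as pd

def build_features(raw: pd.DataFrame) -> pd.DataFrame:
    # Example transformation: refunds arrive as positive numbers
    # and must be inverted before they reach the model.
    features = raw.copy()
    features["amount"] = features["amount"].where(
        features["type"] != "refund", -features["amount"]
    )
    return features[["amount"]]

def test_refund_sign_is_inverted():
    raw = pd.DataFrame({"type": ["purchase", "refund"], "amount": [10.0, 4.0]})
    features = build_features(raw)
    assert features.loc[1, "amount"] == -4.0   # the +/- inversion is applied

def test_no_unexpected_columns():
    raw = pd.DataFrame({"type": ["purchase"], "amount": [1.0]})
    assert list(build_features(raw).columns) == ["amount"]
```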
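For the metric-based monitoring approach, here is a minimal sketch of dynamic model scoring for a binary classifier, using the 0.6/0.5 thresholds mentioned above. How you collect the labeled feedback and where the alerts go are assumptions.

```python
# Dynamic model scoring for a binary classifier: compute precision, recall,
# and F1 on labeled feedback collected from production, and raise an alert
# or a critical incident using the thresholds from the text.
import logging
from sklearn.metrics import precision_score, recall_score, f1_score

WARNING_THRESHOLD = 0.6    # below this: alert
CRITICAL_THRESHOLD = 0.5   # below this: critical incident

def score_and_alert(y_true, y_pred) -> dict:
    metrics = {
        "precision": precision_score(y_true, y_pred),
        "recall": recall_score(y_true, y_pred),
        "f1": f1_score(y_true, y_pred),
    }
    for name, value in metrics.items():
        if value < CRITICAL_THRESHOLD:
            logging.critical("Model quality incident: %s = %.2f", name, value)
        elif value < WARNING_THRESHOLD:
            logging.warning("Model quality alert: %s = %.2f", name, value)
    return metrics

# Example run with toy feedback data collected from production.
print(score_and_alert([1, 0, 1, 1, 0, 1], [1, 0, 0, 1, 1, 1]))
```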
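Finally, if you don’t use a hosted monitor yet, even a scheduled liveness check like the one below is better than nothing. The URL and endpoint name are assumptions about your service.

```python
# A minimal liveness check as a complement to a hosted monitor such as
# Pingdom: hit a health endpoint and fail loudly if it doesn't respond.
# The URL and endpoint name are assumptions about your service.
import sys
import requests

HEALTH_URL = "http://localhost:8000/health"  # hypothetical health endpoint

def ping() -> None:
    try:
        response = requests.get(HEALTH_URL, timeout=5)
        response.raise_for_status()
    except requests.RequestException as exc:
        print(f"Service is down: {exc}", file=sys.stderr)
        sys.exit(1)
    print("Service is alive")

if __name__ == "__main__":
    ping()  # schedule via cron or your CI/CD pipeline
```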
Answers
These points cover pretty much everything concerning QA participation. Now, let’s answer the customers’ questions posed at the beginning of this article.
- I want to run UAT; could you please provide your full regression test cases for AI?
- Describe the black box to the customer, and provide them with test data and a service that can process and visualize the output.
- Describe all the testing layers: whether you verify data and model features at the ETL layer, and how you do it.
- Produce a model quality report. Provide the customer with model quality metrics vs. standard values. Get these from your data scientist.
- OK, I have the model running in production; how can we assure it doesn’t break when we update it?
- You need a QA review of any production push, just as for any other software.
- Perform a black-box smoke test. Try various types of inputs based on the function.
- Verify model metrics on the production server with a sample of test data. If needed, isolate part of the production server so that users aren’t affected by the test.
- Of course, make sure your white-box tests are passing.
- How can I make sure it produces the right values I need?
- You should always be aware of the acceptable standard deviation for your model and data. Spend some time with your data scientist and dig deeper into the model type and the technical aspects of the algorithms.
Any other questions you have in mind? Let’s try to figure them out and get the answers!