What KPIs Measure the Success of an AI Project?
The performance of a machine learning model is assessed first by its success rate, then by how well that rate fits the business objectives.
A subject still little discussed in the business press, the key performance indicators (KPIs) of machine learning models are nevertheless central to steering an artificial intelligence project. In June 2020, an IDC study showed that around 28% of AI initiatives fail. The reasons given by the American research firm are a lack of expertise, a lack of relevant data, and a lack of sufficiently integrated development environments. To set up a process of continuous improvement for machine learning, and to avoid hitting a wall, identifying KPIs is now a priority.
Upstream, it is up to the data scientists to define the technical performance indicators of the models. These vary with the type of algorithm used. For a regression that aims to predict a person's height as a function of their age, one will turn, for example, to the coefficient of determination of the linear fit.
This metric measures the quality of the prediction: if the square of the correlation coefficient (R²) is 0, the regression line explains 0% of the distribution of the points; conversely, if the coefficient reaches its maximum of 1, the line explains 100% of that distribution, and the prediction is of excellent quality.
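As a rough illustration, here is a minimal Python sketch of this calculation using scikit-learn; the age and height values are made up for the example:

```python
# Minimal sketch: the coefficient of determination (R^2) for a regression
# predicting height from age. The data points below are illustrative.
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score

ages = np.array([[5], [10], [15], [20], [25], [30]])   # years
heights = np.array([110, 140, 165, 175, 178, 179])     # cm

model = LinearRegression().fit(ages, heights)
predicted = model.predict(ages)

# R^2 = 1 - (residual sum of squares / total sum of squares)
print(f"R^2 = {r2_score(heights, predicted):.3f}")  # 1.0 = perfect, 0.0 = no fit
```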
Deviation of the Prediction From Reality
Another indicator for evaluating a regression is the least-squares method, which refers to the loss function. It consists of quantifying the error by summing the squares of the deviations between the actual values and the fitted line, then adjusting the model by minimizing this squared error. In the same logic, one can use the mean absolute error method, which consists of averaging the absolute values of the deviations.
"In any case, this amounts to measuring the gap compared to what we are trying to predict," summarizes Charlotte Pierron-Perlès, in charge of strategy and data and AI services at Capgemini Invent in France, the ESN Capgemini advisory body.
In the case of classification algorithms, used for example for spam detection, one must watch for false positives and false negatives. "We worked, for example, with a cosmetics group on a machine learning solution designed to optimize the efficiency of its production lines. The objective was to identify, at the start of the line, defective bottles that could cause a production stoppage," explains Charlotte Pierron-Perlès.
"After discussions with management and the factory operators, we steered the customer toward a model that fulfills its role even if it means producing some false negatives, that is to say, bottles actually in good condition that can then be put back at the start of the line."
Based on the notions of false positives and false negatives, three other indicators allow classification models to be evaluated:
- Recall (R) measures the sensitivity of the model. It is the proportion of true positives (for example, Covid tests that are correctly positive) among all the positives that should have been detected (correctly positive tests + cases incorrectly reported negative): R = true positives / (true positives + false negatives)
- Precision (P) measures accuracy. It is the proportion of true positives (Covid tests that are correctly positive) among all the results identified as positive (correctly positive tests + tests incorrectly reported positive): P = true positives / (true positives + false positives)
- The harmonic mean of the two (F-score) measures the model's ability to give correct predictions and to reject the others: F = 2 × (Precision × Recall) / (Precision + Recall)
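As an illustration, these three formulas can be computed directly from the raw counts; the Covid-test figures below are invented for the example:

```python
# Minimal sketch: recall, precision and F-score from raw counts
# (the counts are made up for illustration).
true_positives = 90    # tests correctly flagged positive
false_negatives = 10   # positive cases the test missed
false_positives = 30   # negative cases incorrectly flagged positive

recall = true_positives / (true_positives + false_negatives)
precision = true_positives / (true_positives + false_positives)
f_score = 2 * precision * recall / (precision + recall)

print(f"R = {recall:.2f}, P = {precision:.2f}, F = {f_score:.2f}")
```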
Generalization of the Model
"Once the model has been formed, its ability to generalize will be a key indicator," underlines David Tsang Hin Sun, lead senior data scientist at the French ESN Keyrus. How to estimate it? By measuring the difference between the prediction and the expected result, then the evolution of this difference over time. "After a while, we can be confronted with a divergence. This can come from an under-learning ( or overfitting, note ) due to a data set insufficient training in quality and quantity ", explains David Tsang Hin Sun.
The solution? In the case of an image-recognition model, for example, generative adversarial networks can be used to grow the volume of training photos through rotation or distortion. Another technique (suited to classification algorithms) is synthetic minority oversampling, which consists of increasing the number of examples of the under-represented class in the data set by oversampling.
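A minimal sketch of this oversampling technique, assuming the imbalanced-learn package (not named in the article) and an invented 9:1 class imbalance:

```python
# Minimal sketch: oversampling a rare class with SMOTE, assuming the
# imbalanced-learn package (pip install imbalanced-learn).
from collections import Counter
from imblearn.over_sampling import SMOTE
from sklearn.datasets import make_classification

# Illustrative data set with a 9:1 class imbalance
X, y = make_classification(n_samples=1000, weights=[0.9, 0.1], random_state=42)
print(Counter(y))  # e.g. {0: ~900, 1: ~100}

X_resampled, y_resampled = SMOTE(random_state=42).fit_resample(X, y)
print(Counter(y_resampled))  # minority class oversampled to parity
```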
A divergence can also appear in the event of overfitting. In this configuration, the trained model is not limited to the expected correlations: being too specialized, it captures the noise in the field data and generates inconsistent results, and its error function slips into the red. "It will then be necessary to review the quality of the training data set and possibly regularize the weights of the variables," indicates David Tsang Hin Sun.
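One common form of such regularization, shown here as a hedged sketch rather than the specific approach Keyrus uses, is an L2 penalty on the weights via ridge regression; the data set and the alpha value are illustrative:

```python
# Minimal sketch: L2 regularization of the variables' weights with ridge
# regression, one common way to rein in an overfit model (alpha is illustrative).
from sklearn.linear_model import LinearRegression, Ridge
from sklearn.model_selection import train_test_split
from sklearn.datasets import make_regression

# Few samples, many features: a setup prone to overfitting
X, y = make_regression(n_samples=60, n_features=40, noise=15.0, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

for model in (LinearRegression(), Ridge(alpha=10.0)):
    model.fit(X_train, y_train)
    print(type(model).__name__,
          f"train R^2 = {model.score(X_train, y_train):.2f},",
          f"test R^2 = {model.score(X_test, y_test):.2f}")
# Expect the unregularized model to score high on train but poorly on test,
# while the ridge model narrows that gap.
```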
Then there are the economic KPIs. "We have to ask ourselves whether the error rate is compatible with the business challenges," insists Stéphane Roder, CEO of the French consulting firm AI Builders. "For example, the insurer Lemonade has developed a machine learning brick that reimburses a customer within three minutes of a claim, based on the information provided, including photos. Given the savings achieved, it accepts a certain error rate, which generates a cost." And Stéphane Roder adds: "It is important to check that this measure stays on target throughout the model's life cycle, in particular relative to its TCO, from development to maintenance."
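As a back-of-the-envelope sketch of this economic check, with all figures hypothetical rather than Lemonade's actual numbers:

```python
# Minimal sketch: is the model's error rate compatible with the business case?
# Every figure below is hypothetical, chosen only to illustrate the arithmetic.
claims_per_year = 100_000
saving_per_automated_claim = 15.0   # euros saved vs. manual processing
cost_per_wrong_payout = 250.0       # euros lost on an erroneous reimbursement
error_rate = 0.02                   # model's observed error rate
annual_tco = 400_000.0              # development + maintenance of the model

savings = claims_per_year * saving_per_automated_claim
error_cost = claims_per_year * error_rate * cost_per_wrong_payout
net_value = savings - error_cost - annual_tco
print(f"Net annual value: {net_value:,.0f} euros")  # negative = KPI not met
```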
Adoption Level
The expected level of performance may vary, even within the same company. "For a French retailer of international stature, we developed a consumption prediction engine. The model's precision targets turned out to differ between staple products and new products," notes Charlotte Pierron-Perlès of Capgemini Invent. "The sales dynamics of the latter depend on factors, linked in particular to market reactions, that are by definition less controllable." Hence a less ambitious target for new products, accompanied by a different choice of algorithms.
Last KPI, and not the least: the level of adoption. "A model, even a good one, is not enough on its own to be used. It requires building a product with a user experience designed around the artificial intelligence, or AI UX, one that is both accessible to the business and delivers on the promise of machine learning," insists Charlotte Pierron-Perlès. And Stéphane Roder concludes: "This UX will also allow users to give feedback, which will feed the AI's knowledge qualitatively, on top of the daily flow of production data."