10 Most Recommended Tests for your AI/ML/DL Models in 2022
In the recent past there has been a spate of accidents involving AI and machine learning models in practice and deployment, so much so that an active database now chronicles such accidents (https://incidentdatabase.ai/). At a time when AI is making strides in radical business transformation for enterprises, it is vital that AI deployments in real transformational scenarios be seamless. Seamless deployment in turn requires quality, trustworthy and responsible AI, and a critical part of achieving that is a focused effort to test AI, ML and DL models thoroughly.

In a previous article, Why Current Testing Processes In AI/ML Are Not Enough?, we showed that existing techniques and processes are not sufficient to ensure quality, trustworthy and responsible AI. In this article we set out the complete set of tests an AI model needs in order to meet that goal. We enumerate and define each of these tests for AI/ML/DL models below.

1. Randomized Testing with Train-Test Split: In the article Why Current Testing Processes In AI/ML Are Not Enough? we illustrated that the current foundation of testing in the ML life cycle rests on the principle of splitting the data into training and test sets and evaluating metrics on the test set. The metrics could vary from accuracy in classification to mean squared error (MSE) in regression. The basic idea is to test how the model performs on unseen data.

2. Cross Validation techniques: This is a set of effective model evaluation techniques currently in vogue as part of the ML process. Here again the basic idea is to test how the model performs on unseen data. The effectiveness of the model is evaluated by computing the metrics over several different splits of the data.
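The hold-out split from test 1 and the repeated-split evaluation from test 2 can be sketched in a few lines of plain Python. This is a minimal illustration under stated assumptions, not a reference implementation: the synthetic data, the nearest-centroid classifier and all function names (make_data, fit_centroids, holdout_score, k_fold_scores) are hypothetical and introduced here only for the example.

```python
import random
from statistics import mean


def make_data(n=200, seed=0):
    """Synthetic 1-D, two-class data (an assumption, for illustration only)."""
    rng = random.Random(seed)
    data = []
    for i in range(n):
        label = i % 2
        # Class 0 is centred at 0.0, class 1 at 3.0, both with unit spread.
        data.append((rng.gauss(3.0 * label, 1.0), label))
    rng.shuffle(data)  # randomize order so that splits are random
    return data


def fit_centroids(train):
    """'Train' a nearest-centroid classifier: store one mean per class."""
    classes = {y for _, y in train}
    return {c: mean(x for x, y in train if y == c) for c in classes}


def predict(centroids, x):
    """Predict the class whose centroid is closest to x."""
    return min(centroids, key=lambda c: abs(x - centroids[c]))


def accuracy(centroids, test):
    """Fraction of test items classified correctly."""
    return sum(predict(centroids, x) == y for x, y in test) / len(test)


def holdout_score(data, test_fraction=0.25):
    """Single train-test split: fit on the first part, score on the rest."""
    cut = int(len(data) * (1 - test_fraction))
    train, test = data[:cut], data[cut:]
    return accuracy(fit_centroids(train), test)


def k_fold_scores(data, k=5):
    """Repeated splits: each of the k folds serves as the test set once."""
    folds = [data[i::k] for i in range(k)]
    scores = []
    for i in range(k):
        test = folds[i]
        train = [row for j, fold in enumerate(folds) if j != i for row in fold]
        scores.append(accuracy(fit_centroids(train), test))
    return scores


data = make_data()
holdout = holdout_score(data)
cv_scores = k_fold_scores(data, k=5)
print(f"hold-out accuracy: {holdout:.3f}")
print(f"5-fold mean accuracy: {mean(cv_scores):.3f}")
```

In practice a library routine would normally replace the hand-written splitting. The point of the sketch is only that the repeated-split figure is an average over several train/test partitions rather than the result of a single one.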
Such repeated evaluation can be done with any of the three techniques below:

K-Fold Cross Validation: Here the data is split into k parts. In each iteration one of the k parts becomes the test set and the remaining k-1 parts become the training set, and the metrics are averaged across iterations.

LOOCV: An extreme form of K-Fold cross validation where a single data item is held out as the test set and the remaining n-1 items form the training set. The metrics are averaged over n iterations, where n is the size of the data.

Bootstrap: Here the idea is to create a new data set of the same size from the existing data set by sampling with replacement, and to evaluate the metrics over several such iterations.

The test techniques mentioned above are quite prevalent in today's AI/ML/DL deployments. However, as highlighted in https://medium.com/@srinivaspadmanabhuni/why-current-testing-processes-in-ai-ml-are-not-enough-f9a53b603ec6 these may not be enough to deal with scenarios like corner cases, performance issues, security issues, privacy issues, transparency issues, and fairness/bias issues. Hence we need to expand the scope of testing to cover broader aspects to ensure quality, trustworthy and responsible AI.

To set a benchmark for such a repertoire of tests, we shall refer to the quality dimensions of AI, proposed in addition to the standard ones defined in ISO 25010, in the talk by Rik Marselis at https://www.slideshare.net/RikMarselis/testing-intelligent-machines-approaches-and-techniques-qatest-bilbao-2018 In addition to the standard ISO 25010 quality metrics, three additional quality metrics are proposed for testing AI/ML systems. These are as below:

a. Intelligent Behaviour: A test for evaluating the intelligence of the system. The traits that can be tested here include the ability to learn, improvisation, transparency of choices, collaboration and naturalness of the interaction.

b. Morality: A test for evaluating the moral dimensions of the AI system.
This can include broad tests for ethics (including bias), privacy, and human friendliness.

c. Personality: This is closely related to testing the humanness of the AI system. It includes tests for dimensions such as mood, empathy, humour and charisma.

In view of this discussion, it is vital that we evolve a testing strategy with a comprehensive set of tests for AI/ML systems, covering both these additional dimensions of quality and the standard dimensions from the ISO 25010 perspective. Let us look at some of the important tests we need to incorporate from the perspective of these additional quality attributes.

3. Tests for Explainability: To enable testing for the “transparency of choices” element under Intelligent Behaviour above, we need a comprehensive approach to testing models for explainability. As we discussed in https://medium.com/@srinivaspadmanabhuni/why-some-ml-models-required-to-have-explainability-fc190906a9c8 such tests are specifically required when AI/ML models are not interpretable, as with neural networks. With interpretable models it is fairly easy to get information on the rationale behind an inference made by the model. Complex models like neural networks, however, have to be tested for explainability, where we test for the rationale behind any decision. This whole area is broadly referred to as XAI (Explainable AI), as framed by DARPA at https://www.darpa.mil/program/explainable-artificial-intelligence

These explainability tests can again be of two types:

Model Agnostic Tests: These tests do not take into account any specific details of the ML model and operate independently of the model, much like black box testing. Examples include LIME.

Model Specific Tests: These explainability tests take into account the specifics of the model under consideration. For example, for a CNN-like model you can use a technique such as Grad-CAM to inspect the rationale behind a decision.

4. Security Testing for AI/ML models: Among the quality attributes in ISO 25010, security, with its broad needs of confidentiality, integrity and availability, is a vital quality attribute to be tested. In the case of AI/ML, the specific security needs arise from a new category of threats, namely adversarial attacks, which attack models with poisoned data and fool the models. It is important that we include security testing for adversarial attacks as a primary component of any AI/ML test. We should test for potential attacks on current training data. This kind of test can simulate both kinds of attacks below:

White Box attacks: Here there is a knowledge of the parameters

