Test Automation Forum

Welcome to TAF - Your favourite Knowledge Base for the latest Quality Engineering updates.

(Focused on Functional, Performance, Security and AI/ML Testing)

Brought to you by MOHS10 Technologies

Srinivas Padmanabhuni

10 Most Recommended Tests for your AI/ML/DL Models in 2022

In the recent past there has been a spate of incidents involving AI and machine learning models in practice and deployment, so much so that an active database now chronicles them (https://incidentdatabase.ai/). At a time when AI is driving radical business transformation for enterprises, it is vital that AI deployments in real transformational scenarios are seamless, and that in turn requires a quality, trustworthy and responsible AI. A critical part of achieving this is a focused effort to test AI, ML and DL models thoroughly. In a previous article, Why Current Testing Processes In AI/ML Are Not Enough?, we showed that existing techniques and processes are not sufficient to ensure a quality, trustworthy and responsible AI. In this article we set out the complete set of tests an AI model requires to meet that goal. We enumerate and define each of these tests for AI/ML/DL models below.

1. Randomized Testing with Train-Test Split: In the article Why Current Testing Processes In AI/ML Are Not Enough? we illustrated that the current foundation of testing in the ML life cycle rests on splitting the data into training and test data and computing metrics on the test data. Metrics vary from accuracy in classification to MSE (mean squared error) in regression. The basic idea is to test how the model performs on unseen data.

2. Cross Validation Techniques: This is a set of effective model evaluation techniques currently in vogue as part of the ML process. Here again the basic idea is to test how the model performs on unseen data, this time by repeating the metrics evaluation over several different splits of the data.
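The two evaluation styles described so far can be sketched in a few lines of standard-library Python. This is only an illustration under stated assumptions: the data set is a made-up one-dimensional toy, `fit_threshold` is a hypothetical stand-in for real model training, and the function names are not taken from any library.

```python
import random
from statistics import mean

def fit_threshold(train):
    """Toy stand-in for model training: midpoint between the two class means."""
    mean0 = mean(x for x, y in train if y == 0)
    mean1 = mean(x for x, y in train if y == 1)
    return (mean0 + mean1) / 2

def accuracy(threshold, rows):
    """Fraction of rows where 'x above threshold' agrees with label 1."""
    return mean(1.0 if (x > threshold) == (y == 1) else 0.0 for x, y in rows)

def holdout_accuracy(data, test_fraction=0.25, seed=0):
    """Single randomized train-test split, scored on the held-out rows."""
    shuffled = data[:]
    random.Random(seed).shuffle(shuffled)
    n_test = max(1, int(len(shuffled) * test_fraction))
    test, train = shuffled[:n_test], shuffled[n_test:]
    return accuracy(fit_threshold(train), test)

def repeated_split_accuracy(data, k=5, seed=0):
    """Split into k disjoint parts, use each part once as the test set, average."""
    shuffled = data[:]
    random.Random(seed).shuffle(shuffled)
    folds = [shuffled[i::k] for i in range(k)]
    scores = []
    for i, test in enumerate(folds):
        train = [row for j, fold in enumerate(folds) if j != i for row in fold]
        scores.append(accuracy(fit_threshold(train), test))
    return mean(scores)

# Hypothetical toy data: class 0 lies in [0, 1.9], class 1 lies in [3, 4.9].
data = [(i * 0.1, 0) for i in range(20)] + [(3 + i * 0.1, 1) for i in range(20)]
holdout_score = holdout_accuracy(data)
repeated_score = repeated_split_accuracy(data)
```

On data this cleanly separated both scores are perfect, which is exactly why such numbers alone say little about rare or difficult inputs.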
Cross validation can be carried out with any of the three techniques below.

K-Fold Cross Validation: The data is split into k parts. In each iteration one of the k parts becomes the test set and the remaining k-1 parts become the training set; the metrics are averaged across iterations.

LOOCV (Leave-One-Out Cross Validation): An extreme form of k-fold cross validation in which a single data item is held out as the test set and the remaining n-1 items form the training set; the metrics are averaged over all n iterations, where n is the size of the data.

Bootstrap: A new data set of the same size is created from the existing data set by sampling with replacement, and the metrics are evaluated over several such iterations.

The test techniques above are quite prevalent in today's AI/ML/DL deployments. However, as highlighted in https://medium.com/@srinivaspadmanabhuni/why-current-testing-processes-in-ai-ml-are-not-enough-f9a53b603ec6 they may not be enough to deal with corner cases, performance issues, security issues, privacy issues, transparency issues, and fairness/bias issues. Hence we need to expand the scope of testing to cover broader aspects to ensure a quality, trustworthy and responsible AI.

To set a benchmark for such a repertoire of tests, we refer to the quality dimensions of AI, which extend the standard ones defined in ISO 25010, from the talk by Rik Marselis at https://www.slideshare.net/RikMarselis/testing-intelligent-machines-approaches-and-techniques-qatest-bilbao-2018 In addition to the standard ISO 25010 quality attributes, three additional quality attributes are proposed for testing AI/ML systems:

a. Intelligent Behaviour: Tests that evaluate the intelligence of the system. The traits that can be tested include the ability to learn, improvisation, transparency of choices, collaboration and naturalness of interaction.

b. Morality: Tests that evaluate the moral dimensions of the AI system.
This can include broad tests for ethics (including bias), privacy, and human friendliness.

c. Personality: This is closely related to testing the humanness of the AI system. It includes tests for dimensions such as mood, empathy, humour and charisma.

In view of this discussion, it is vital that we evolve a testing strategy with a comprehensive set of tests for AI/ML systems, covering both these additional dimensions of quality and the standard dimensions of ISO 25010. Let us look at some of the important tests we need to incorporate from the perspective of these additional quality attributes.

3. Tests for Explainability: To test the “transparency of choices” element under Intelligent Behaviour above, we need a comprehensive approach to testing models for explainability. As we discussed in https://medium.com/@srinivaspadmanabhuni/why-some-ml-models-required-to-have-explainability-fc190906a9c8 such tests are specifically required when AI/ML models are not interpretable, as is the case with neural networks. With interpretable models it is fairly easy to obtain the rationale for an inference. Complex models like neural networks, however, have to be tested for explainability, where we test for the rationale behind any decision. This whole area is broadly referred to as XAI (Explainable AI), as framed by DARPA at https://www.darpa.mil/program/explainable-artificial-intelligence

These explainability tests can be of two types:

Model Agnostic Tests: These tests do not take into account any specific details of the ML model and operate independently of it, much like black box testing. Examples include LIME.

Model Specific Tests: These explainability tests take into account the specifics of the model under consideration. For example, for a CNN-like model, a technique like Grad-CAM can be used to look transparently at the rationale for a decision.

4.
Security Testing for AI/ML models: Among the quality attributes in ISO 25010, security, with its broad needs of confidentiality, integrity and availability, is a vital attribute to test. In AI/ML, specific security needs arise from a new category of threats, namely adversarial attacks, which attack models with poisoned data and fool them. It is important that security testing for adversarial attacks is a primary component of any AI/ML test. We should test for potential attacks on the current training data. This kind of test can simulate both kinds of attacks below:

White Box attacks: Here there is a knowledge of the parameters
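As a rough illustration of what a white-box test can look like, the sketch below uses known model parameters to craft an input that flips a prediction. It is an assumption-laden example and not the article's method: the technique is the fast gradient sign method, which the article does not name and which perturbs an input at inference time instead of poisoning training data; the logistic-regression weights, the input and `epsilon` are all hypothetical values chosen for the demonstration.

```python
import math

def predict_proba(weights, bias, x):
    """Probability of class 1 under a logistic-regression model."""
    z = sum(w * xi for w, xi in zip(weights, x)) + bias
    return 1.0 / (1.0 + math.exp(-z))

def sign(value):
    """Return -1, 0 or 1 depending on the sign of value."""
    return (value > 0) - (value < 0)

def fgsm_perturb(weights, bias, x, label, epsilon):
    """Shift every feature by epsilon in the direction that increases the log-loss.

    For logistic regression with log-loss, d(loss)/d(x_i) = (p - label) * w_i,
    so the attacker only needs the weights, the bias and the input.
    """
    p = predict_proba(weights, bias, x)
    gradient = [(p - label) * w for w in weights]
    return [xi + epsilon * sign(g) for xi, g in zip(x, gradient)]

# Hypothetical model parameters and input, assumed to be known to the tester.
weights, bias = [2.0, -1.0], 0.0
x, label = [0.5, 0.2], 1

x_adv = fgsm_perturb(weights, bias, x, label, epsilon=0.4)
clean_prediction = predict_proba(weights, bias, x) > 0.5
adversarial_prediction = predict_proba(weights, bias, x_adv) > 0.5
```

A security test built on this idea would report the smallest `epsilon` at which predictions start to flip, giving a crude robustness figure for the model.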

Why current testing processes in AI/ML are not enough?

The current notion of quality assurance and testing in AI/ML pipelines is based on validation against a randomly set-aside portion of the data, on which the model is tested and metrics are computed. Metrics like accuracy on this set-aside data, ambiguously termed test data, are the usual rubric for evaluating the effectiveness of ML models. But this gives only a partial picture of the quality of the model, which is not sufficient to guarantee good performance on deployment. It is probably because of the term “test data” that the big picture of testing is missed in the ML life cycle.

Some additional validation mechanisms are also suggested to further strengthen the evaluation process, such as:

K-Fold cross validation
Bootstrap
Leave One Out Cross Validation

However, all of the above validation approaches, including randomized train-test splits, rest on the notion that testing the model on randomized unseen data is a good enough validation of the model. We feel that this is an incomplete picture, insufficient to guarantee the overall performance of the model in the field. Here is a set of reasons why we need to think beyond current model evaluation and validation approaches to guarantee AI/ML model quality.

The random selection of the test set, including in cross validation based approaches, does not guarantee comprehensive coverage of the input scenarios, especially corner cases, which are rare by nature. Even though cross validation approaches try to cover the overall spectrum via k-fold splits, they offer no systematic way to understand and debug the performance of the model across different input scenarios. Hence detecting which types of input variation are not sufficiently represented in the model is impossible with current approaches.
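One way to start closing this coverage gap is to break the usual single metric down by input scenario. The sketch below is a minimal illustration under assumptions: the `lighting` feature, the `min_count` cut-off and the example rows are all hypothetical, and `accuracy_by_slice` is an illustrative helper, not part of any library.

```python
from collections import defaultdict

def accuracy_by_slice(rows, slice_of, min_count=5):
    """Report accuracy and sample count per input scenario.

    rows     : iterable of (features, true_label, predicted_label)
    slice_of : function mapping features to a scenario name
    min_count: scenarios with fewer rows are flagged as under-represented
    """
    totals = defaultdict(int)
    correct = defaultdict(int)
    for features, truth, predicted in rows:
        key = slice_of(features)
        totals[key] += 1
        correct[key] += int(truth == predicted)
    return {
        key: {
            "count": n,
            "accuracy": correct[key] / n,
            "under_represented": n < min_count,
        }
        for key, n in totals.items()
    }

# Hypothetical predictions: 8 daytime inputs (7 correct), 2 night inputs (0 correct).
rows = (
    [({"lighting": "day"}, 1, 1)] * 7
    + [({"lighting": "day"}, 1, 0)]
    + [({"lighting": "night"}, 1, 0)] * 2
)
overall_accuracy = sum(t == p for _, t, p in rows) / len(rows)
report = accuracy_by_slice(rows, lambda features: features["lighting"])
```

Here the overall accuracy of 0.7 hides the fact that every night-time input is misclassified and that only two such inputs were tested at all, which is the kind of blind spot a purely random split does not surface.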
Testing for security, an important non-functional IT requirement, is totally absent from current model evaluation approaches. Beyond application security, AI models themselves now need to be audited for AI-specific attacks, hence the need for comprehensive security testing of AI/ML models.

In compliance-oriented sectors there is an increased push to generate explanations or rationale for AI/ML model decisions. So testing for explainability is a must for today's AI/ML models.

The performance of AI/ML models needs to be tested independently of the system in which they are deployed, because specific deployment formats like tinyML need comprehensive validation of performance at the model level.

Privacy requirements, including the constraints GDPR imposes on data and derived AI models, are a large set of desiderata for AI/ML applications. So testing AI/ML models for privacy breaches, attacks and leaks forms an important component of the overall requirements to certify and audit AI models.

Testing and assurance of fairness and freedom from bias is an important requirement of AI models, to ensure that they do not get recalled or rescinded.

Testing data quality at the input level, before the data is fed to the ML process, is vital, as many quality issues in a model originate in its input data.

In several scenarios there is not sufficient data to test the AI models. In those scenarios the data adequacy of the models needs to be tested and, if need be, mechanisms to augment the test data made available.

Overall, these desiderata point to the requirement for standalone frameworks, processes and products for AI testing that can handle all of the abovementioned tests for ML models of all types. To ensure a trustworthy and responsible AI, a comprehensive set of tests covering all the points above is mandatory.

- Dr.
Srinivas Padmanabhuni
testAIng.com

Note: The article has been republished here with prior approval from the author.

About the Author
Dr. Srinivas Padmanabhuni works for TestAIng as their CTO. He is a well-known personality in the field of Artificial Intelligence (AI) and is recognised for his significant contributions to AI. Dr. Srinivas Padmanabhuni holds a Ph.D. in Artificial Intelligence. He speaks at several premier institutes and forums and has authored several technical articles and books on AI and Data Science.

About TestAIng (testAIng.com)
testAIng.com (pronounced as tAI) is a leader in testing AI systems using state-of-the-art techniques, tools and technologies. They have combined their deep experience in testing with AI to create a one-of-its-kind proposition for testers who want to either use AI in their testing process or get their AI systems tested.
