Test Automation Forum

Welcome to TAF - Your favourite Knowledge Base for the latest Quality Engineering updates.

Celebrating our 2nd Anniversary!


(Focused on Functional, Performance, Security and AI/ML Testing)

Brought to you by MOHS10 Technologies

AI/ML Centric Testing

In-depth testing of AI applications that use images

Introduction

Generally, MLOps (the methodology used to develop ML-based applications) covers design, development, and operations phases. Wait, something important is missing... I hope by now you've got it: yes, there is no testing phase in MLOps (for security, bias, performance, etc.). So here is the question: how are ML applications tested in order to make them Responsible AI (RAI)? Have you ever wondered how AI/ML-based applications are tested? If you are curious about that, this article is for you. In it, I'm going to discuss how we tested an AI-based plant diagnostic application in order to make it reliable, robust, and accurate.

Business case

The challenge was to test a plant diagnosis application that supports various crop types. It was developed for farmers and gardeners to diagnose infected crops, offer treatments for diseases and nutrient deficiencies, enable collaboration with other farmers, and so on. Plant disease recognition is done using AI image recognition technology (an artificial-intelligence-based neural network algorithm).

How AI application testing is different

Compared to regular software applications, developing AI-based applications is different: with AI-based applications, we work with both data and code. AI application development goes through steps like data collection, data cleaning, feature engineering, model selection, training and testing, and so on, and this is where it differs from the traditional software development process. With most AI models, the data is split into two sets, one to train the model and the other to test it. Once certain metrics are used to gauge the model's performance on the test data, the model is either validated or sent back to the previous stage for revision. Do you think this level of testing is sufficient for an application that will make decisions, solve problems, and become part of people's daily lives? Probably not! Let's continue reading.

How to test an AI app to ensure its reliability

There are several things we can do to make an AI model more reliable, such as making it more robust. To achieve this, we need to test AI models in different ways:

- Randomized testing: Test the AI system to evaluate how the model performs on unseen data.
- Cross-validation techniques: Evaluate the effectiveness of the model by iterating the metrics evaluation across several splits of the data. Examples: K-Fold cross-validation, Bootstrap, LOOCV, etc.
- Test coverage: Pseudo-oracle-based metamorphic testing, white-box coverage-based testing, layer-level coverage, and neuron-coverage-based testing.
- Test for bias: Test the fairness of the ML model for any discriminatory behaviour based on specific attributes like gender, race, etc.
- Test for agency: Test for closeness to human behaviour, comparing models against dimensions of AI quality like natural interaction and personality.
- Test for concept drift: Continuously check for data drift, and hence model drift, which causes the deployed model to perform badly on newer data.
- Test for explainability: To enable testing for the "transparency of choices" element, we need a comprehensive approach to testing models for explainability.
- Security testing: Security testing for adversarial attacks is a primary component of any AI/ML test. We should test for potential attacks on current training data. Examples: white-box and black-box attacks.
- Test for privacy: Test at the model level for privacy attacks that make it possible to infer data, and then check whether the inferred data has PII embedded inside it.
- Test for performance: Check whether the system is able to handle different patterns of input load, including spike patterns, like an e-commerce site during Boxing Day sales.

How we tested the plant diagnosis application at our AI lab

In our process of testing the plant diagnosis application, we collected the data and the model from our client in the required format. We tested the model using our strategic partner's commercial state-of-the-art testing product, AIensured. The results, with insights on both data and model performance, were shared with the application owner. These are the key benefits we provided to our client:

- We generated corner cases (cases where the model fails to give the actual result) and retrained the model on them to increase its robustness.
- We used 11 attack-vector techniques, like DeepFool, Universal Perturbation, Pixel Attack, Spatial Transformation, etc., to establish how robust the model is against security attacks.
- Model explainability, covering both white-box and black-box explanations, helped the client understand which portion of the image their model focuses on, and in turn what caused misclassifications.
- To overcome the oracle problem (not having a defined output) we did metamorphic testing, using techniques like rotation, shear, brightness, etc., which showed them how the model performs (a minimal sketch of this idea appears at the end of this article).
- Model quantization allowed them to reduce their model size without losing accuracy, which let them deploy the model on low-end electronic devices as well.

(Graphic: the full list of tests performed on the model.)

Results

The bottom line is that after retraining the model with the generated corner cases, the performance of the model increased by around 12%. The report we shared helped them make their model explainable and ensured compliance with the required privacy governance, and above all, we made their model responsible, robust to security attacks, and better performing overall.

I hope this article was insightful! Please don't hesitate to contact me if you have a question or suggestion. Happy learning!
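To make the metamorphic testing described above concrete, here is a minimal sketch in Python. It applies label-preserving transformations (rotation, brightness, shear) to an image and checks that the model's prediction stays stable. The `model.predict` call and the image handling are hypothetical placeholders for illustration, not the AIensured product's API.

```python
# Minimal metamorphic-testing sketch: transformed inputs should keep
# the same label; a changed prediction flags a violated relation.
import numpy as np
from PIL import Image, ImageEnhance

def metamorphic_test(model, image_path, expected_label):
    img = Image.open(image_path)
    variants = {
        "original": img,
        "rotated_15": img.rotate(15),
        "brighter": ImageEnhance.Brightness(img).enhance(1.3),
        "sheared": img.transform(img.size, Image.AFFINE,
                                 (1, 0.2, 0, 0, 1, 0)),
    }
    failures = []
    for name, variant in variants.items():
        pred = model.predict(np.asarray(variant))  # hypothetical model API
        if pred != expected_label:
            failures.append(name)  # metamorphic relation violated
    return failures  # empty list: model is stable under these transforms
```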

10 Most Recommended Tests for your AI/ML/DL Models in 2022

In the recent past there has been a spate of accidents involving AI and machine learning models in practice and deployment, so much so that there is an active database chronicling all such accidents (https://incidentdatabase.ai/). At a time when AI is making strides in radical business transformation for enterprises, it is vital that we ensure seamless deployments of AI in real transformational scenarios, and to do that we must ensure quality, trustworthy, and responsible AI. A critical need towards that end is a focused effort to test AI, ML, and DL models thoroughly. In a previous article, "Why Current Testing Processes In AI/ML Are Not Enough?", we showed how existing techniques and processes are not sufficient. Here we intend to elucidate the complete set of tests required for an AI model to ensure quality, trustworthy, and responsible AI. We enumerate and define each of these tests for AI/ML/DL models below.

1. Randomized Testing with Train-Test Split: At the core of the article "Why Current Testing Processes In AI/ML Are Not Enough?" we illustrated that the current foundation of testing in the ML life cycle rests on the principle of splitting the data into training and test data and computing metrics on the test data. Metrics can vary from accuracy in classification to MSE in regression. The basic idea is to test how the model performs on unseen data.

2. Cross Validation techniques: This is an effective set of model evaluation techniques currently in vogue as part of the ML process. Here again the basic idea is to test how the model performs on unseen data, evaluating the effectiveness of the model by iterating the metrics evaluation across several splits of the data. This can be done with any of the three techniques below (a minimal sketch of all three appears after this section):

- K-Fold Cross Validation: The data is split into k parts; in each iteration one of the k parts becomes the test set and the remaining k-1 parts become the training set, and metrics are averaged across iterations.
- LOOCV: An extreme form of K-Fold cross validation where a single data item is held out as the test set and the remaining n-1 items are treated as the train set, with metrics averaged over n (the size of the data) iterations.
- Bootstrap: A new data set of the same size is created from the existing data set by sampling with replacement, and metrics are evaluated over several such iterations.

These test techniques are quite prevalent in today's AI/ML/DL deployments. However, as highlighted in https://medium.com/@srinivaspadmanabhuni/why-current-testing-processes-in-ai-ml-are-not-enough-f9a53b603ec6, they may not be enough to deal with scenarios like corner cases, performance issues, security issues, privacy issues, transparency issues, and fairness/bias issues. Hence we need to expand the scope of testing to cover broader aspects to ensure quality, trustworthy, and responsible AI. To set a benchmark for such a repertoire of tests, we refer to the quality dimensions of AI, in addition to the standard ones defined in ISO 25010, from the talk by Rik Marselis at https://www.slideshare.net/RikMarselis/testing-intelligent-machines-approaches-and-techniques-qatest-bilbao-2018. In addition to the standard ISO 25010 quality metrics, three additional quality metrics are proposed for testing AI/ML systems. These are as below:
a. Intelligent Behaviour: A test for evaluating the intelligence of the system. Traits that can be tested include the ability to learn, improvisation, transparency of choices, collaboration, and naturalness of interaction.
b. Morality: A test for evaluating the moral dimensions of the AI system. This can include broad tests for ethics (including bias), privacy, and human friendliness.
c. Personality: This is closely related to testing the humanness of the AI system. It includes tests for dimensions like mood, empathy, humour, and charisma.

In view of this discussion it is vital that we evolve a testing strategy involving a comprehensive set of tests for AI/ML systems covering both these additional dimensions of quality and the standard ISO 25010 dimensions. Let us look at some of the important tests we need to incorporate from the perspective of these additional quality attributes.

3. Tests for Explainability: To enable testing for the "transparency of choices" element under Intelligent Behaviour above, we need a comprehensive approach to testing models for explainability. As discussed in https://medium.com/@srinivaspadmanabhuni/why-some-ml-models-required-to-have-explainability-fc190906a9c8, these tests are specifically required when AI/ML models are not interpretable, like neural networks. For interpretable models it is fairly easy to get information on the rationale of an inference; for complex models like neural networks, we must test for the rationale behind any decision. This whole area is broadly referred to as XAI (Explainable AI), framed by DARPA at https://www.darpa.mil/program/explainable-artificial-intelligence. Explainability tests can be of two types:

- Model Agnostic Tests: These tests do not take into account any specific details of the ML model and work independently of the model, much like black-box testing. Examples include LIME.
- Model Specific Tests: These explainability tests take into account the specifics of the model under consideration. For example, with a CNN-like model you can use a technique like Grad-CAM to look transparently at the rationale of a decision.

4. Security Testing for AI/ML models: In the context of the ISO 25010 quality attributes, security, with its broad needs of confidentiality, integrity, and availability, becomes a vital quality attribute to be tested. For AI/ML, specific security needs arise from a new category of threats, namely adversarial attacks, which attack models with poisoned data and fool them. It is important that we include security testing for adversarial attacks as a primary component of any AI/ML test. We should test for potential attacks on current training data. Such a test can simulate both kinds of attacks below (a white-box sketch follows the cross-validation example after this section):

- White Box attacks: Here there is knowledge of the parameters of the model under attack.
- Black Box attacks: Here the attacker has no knowledge of the model's internals.
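To make the three validation techniques above concrete, here is a minimal sketch using scikit-learn. The toy dataset and the logistic-regression classifier are placeholders; substitute your own model and data.

```python
# K-Fold, LOOCV, and bootstrap evaluation of one model on a toy dataset.
import numpy as np
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import KFold, LeaveOneOut, cross_val_score
from sklearn.utils import resample

X, y = load_iris(return_X_y=True)
model = LogisticRegression(max_iter=1000)

# 1. K-Fold: k splits, metrics averaged across the folds.
kfold = cross_val_score(model, X, y, cv=KFold(n_splits=5, shuffle=True,
                                              random_state=0))
print("K-Fold mean accuracy:", kfold.mean())

# 2. LOOCV: each single item becomes the test set exactly once.
loo = cross_val_score(model, X, y, cv=LeaveOneOut())
print("LOOCV mean accuracy:", loo.mean())

# 3. Bootstrap: sample with replacement, score on the out-of-bag items.
scores = []
for i in range(100):
    idx = resample(np.arange(len(X)), replace=True, random_state=i)
    oob = np.setdiff1d(np.arange(len(X)), idx)  # items not drawn this round
    model.fit(X[idx], y[idx])
    scores.append(accuracy_score(y[oob], model.predict(X[oob])))
print("Bootstrap mean accuracy:", np.mean(scores))
```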
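As one concrete instance of a white-box adversarial test, here is a minimal sketch of the Fast Gradient Sign Method (FGSM) in PyTorch. FGSM is a well-known white-box attack; the article mentions white-box attacks generally rather than prescribing this one, and `model` and the image/label tensors are hypothetical placeholders.

```python
# FGSM is white-box because it needs the model's gradients, i.e. access
# to its parameters, to craft the perturbation.
import torch
import torch.nn.functional as F

def fgsm_attack(model, image, label, epsilon=0.03):
    """Perturb `image` in the direction that increases the loss."""
    image = image.clone().detach().requires_grad_(True)
    loss = F.cross_entropy(model(image), label)
    loss.backward()
    adversarial = image + epsilon * image.grad.sign()
    return adversarial.clamp(0, 1).detach()

# Usage sketch: the robustness test passes if most predictions survive
# the perturbation (threshold here is an arbitrary example).
# adv = fgsm_attack(model, image_batch, label_batch)
# robust = model(adv).argmax(dim=1).eq(label_batch).float().mean()
# assert robust > 0.9
```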

Why current testing processes in AI/ML are not enough?

The current notion of quality assurance and testing in AI/ML pipelines is based on the idea of validation using a random set-aside set of data on which the model is tested and metrics computed. Metrics like accuracy on this random set-aside data set, termed somewhat ambiguously "test data", are the usual rubric for evaluating the effectiveness of ML models. But this gives only a partial picture of the quality of the model, which is not sufficient to guarantee good performance on deployment. It is probably because of the "test data" terminology used in the process that the big picture of testing is missed in the ML life cycle. There are some additional validation mechanisms suggested to further boost the evaluation process:

- K-Fold cross validation
- Bootstrap
- Leave One Out Cross Validation (LOOCV)

However, all the above validation approaches, including randomized train-test split mechanisms, are based on the notion that testing the model on randomized unseen data is a good enough validation of the model. We feel that is an incomplete picture, insufficient to guarantee overall performance of the model in the field. Here is a set of qualified reasons why we need to think beyond current model evaluation and validation approaches to guarantee AI/ML model quality:

- The random selection of the test set, including in cross-validation-based approaches, does not guarantee comprehensive coverage of the input scenarios, especially corner cases, which are rare by nature.
- Even though cross-validation approaches try to cover the overall spectrum via the k-fold approach, a systematic way to understand and debug the model's performance for different variations of inputs is not available. Hence detecting which types of input variation are insufficiently represented in the model is impossible in current approaches.
- Testing for security, an important non-functional IT requirement, is totally absent in current model evaluation approaches. Beyond application security, AI models themselves now need to be audited for AI-specific attacks, hence the need for comprehensive security testing of AI/ML models.
- In compliance-oriented sectors there is an increased push for generating explanations or rationale for AI/ML model decisions, so testing for explainability is a must for today's AI/ML models.
- Performance of AI/ML models must be tested independently of the system in which the models are deployed, because there are specific deployment formats, like TinyML, which need comprehensive validation of performance at the model level.
- Privacy and GDPR-imposed constraints on data and derived AI models are a major set of desiderata for AI/ML applications, so testing AI/ML models for privacy breaches, attacks, and leaks forms an important component of the overall requirements to certify and audit AI models.
- Testing and assurance of fairness and bias in AI models is an important requirement, to ensure that models do not get recalled or rescinded.
- Testing data quality at the input level is vital, since many quality issues at the model level arise from the input data; we need to ensure quality testing of the input data before it is fed to the AI/ML model.
- Finally, in several scenarios there is not sufficient data to test the AI models.
In those scenarios the data adequacy of the models needs to be tested and, if need be, mechanisms to augment test data made available.

Overall, these desiderata point us to the requirement for standalone frameworks, processes, and products for AI testing which can handle all the abovementioned tests for ML models of all types. To ensure trustworthy and responsible AI, a comprehensive set of tests covering all the points above is mandatory.

— Dr. Srinivas Padmanabhuni, testAIng.com

Note: The article has been republished here with prior approval from the author.

About the Author

Dr. Srinivas Padmanabhuni works for TestAIng as their CTO. He is a well-known personality in the field of Artificial Intelligence (AI) and is recognised for his significant contributions to AI. Dr. Srinivas Padmanabhuni holds a Ph.D. in Artificial Intelligence. He speaks at several premiere institutes and forums and has authored several technical articles and books in AI/Data Science.

About TestAIng (testAIng.com)

testAIng.com (pronounced "tAI") is a leader in testing AI systems using their state-of-the-art techniques, tools, and technologies. They have combined their deep experience in testing with AI to create a unique, one-of-its-kind proposition for testers who want to either use AI in their testing process or get their AI systems tested.

Test data management using AI-powered synthetic data generators

Today’s Agile/DevOps setups need the ability to go faster, and the availability of a large amount of diversified test data can be critical to the success of a test automation effort. In this article we will discuss synthetic test data, its importance and applications, the various options available today for generating cheap and adequate test datasets using modern Test Data Management platforms, and, most importantly, how the power of AI/ML is being leveraged in this space.

Introduction

More than 45% of the global population now has access to social media, mobility, analytics, and cloud-based applications. Software testing needs many combinations of datasets to ensure a software product does its job flawlessly on its end users' systems and devices. Testing without adequate and diversified test data can lead to defects or flaws in software, which can even be a disaster. Here are a few examples of how testing can be misleading, or can go dangerously wrong, due to a lack of adequate test data:

- e-Commerce apps getting slow or even crashing during the annual sale season
- Some unfortunate air accidents that occurred in the past due to software malfunctioning on wrong data from sensors

The testing approach in Agile/DevOps, based on the "test early and test often" (shift-left) philosophy, demands large sets of production-like data in the desired formats during the initial development phase of the software product. During this phase, test engineers adopt various methods, or leverage traditional utilities like spreadsheets, to generate test data for their test scripts. Here are a few common ways to generate test data (a minimal sketch of the last approach appears at the end of this section):

- Manual creation of data files (spreadsheets, CSVs, audio/video files, etc.)
- Using SQL statements/stored procedures
- Getting a copy/dump of source data (risky if the data is confidential)
- Leveraging an automated data generator/Test Data Manager (TDM)

Test data is required in different formats not only for functional testing but throughout the development lifecycle:

- Functional testing (unit, integration, and system testing)
- Performance testing (load testing with thousands of concurrent users)
- Security testing (adequate user-profile data)
- Reliability testing (testing with negative data)
- Configuration/compatibility testing (localization, internationalization testing, etc.)
- And so on

What is Synthetic Data?

As the name suggests, synthetic means something that is created artificially, and in our context it is test data artificially created by a data generator. Data created by real customers or real end users, like user ID, password, name, age, sex, photo, address, telephone numbers, email ID, etc., are a few examples of real data. This data can be more complex and vaster as we get into domains like healthcare, automobile, digital, social media, and so on. It is not always practical to have a high volume of such diversified data during the testing phase, so we have to create it either manually or using a tool, as discussed in the introduction above.

A good example of synthetic data is the set of AI-generated face images at https://www.thispersondoesnotexist.com/ — the images amazingly look like real people, but these people do not actually exist. (Images not reproduced here; source: https://www.thispersondoesnotexist.com/.) We'll discuss a bit more about how these AI-powered models work to generate high volumes of synthetic data later in this article.
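As a small illustration of the "automated data generator" option listed above, here is a minimal sketch using the open-source Faker library, one of many possible tools; the article does not name a specific one. The user-profile field names are hypothetical placeholders.

```python
# Generate 1,000 realistic but entirely fictitious user records as CSV.
import csv
from faker import Faker

fake = Faker()

with open("test_users.csv", "w", newline="") as f:
    writer = csv.DictWriter(
        f, fieldnames=["name", "email", "address", "birthdate"])
    writer.writeheader()
    for _ in range(1000):
        writer.writerow({
            "name": fake.name(),
            "email": fake.email(),
            "address": fake.address().replace("\n", ", "),
            "birthdate": fake.date_of_birth(
                minimum_age=18, maximum_age=90).isoformat(),
        })
```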
Let’s discuss why we need high volumes of synthetic data.

Importance of production-like synthetic data

Production data (e.g. users' profile data in a banking application) is secured and can't be accessed for testing purposes. Hence, realistic, anonymized test data has to be created somehow. Here are a few reasons why we can't use real data and have to rely on synthetic data:

- Data usage restrictions or data protection standards: Real data might be protected under regulatory restrictions, e.g. GDPR rules (EU data privacy laws), export-controlled data, PII data, and so on. Synthetic data can replicate/mimic/mask the real data format to overcome this challenge.
- No real data exists: When we develop an application from scratch (e.g. in emerging technology spaces like autonomous vehicles), we need a good amount of test data, and here synthetic data is a big help from a testing standpoint.
- Cost effectiveness: Generating synthetic data through an AI-powered data generation model is considerably more cost-effective and efficient than creating it manually or by other methods.
- For testing AI/ML-based applications: AI/ML models need humongous amounts of data to train on and to test their accuracy. Synthetic data is used for AI/ML-based applications because real data is expensive and consumes time and effort.

How does an AI-powered synthetic data generator work?

AI models leverage deep neural networks, with some additional privacy logic, to generate unlimited amounts of synthetic data that comply with global standards like GDPR, CCPA, etc. Most modern synthetic data generators have user-friendly GUIs, and with a few clicks the platform enables you to generate an unlimited amount of highly realistic but completely anonymous synthetic data. This AI-generated synthetic data looks very much like your actual customer data, is remarkably accurate, and becomes a great alternative to your privacy-sensitive data.

Let's look at the AI model at the heart of a modern AI-powered synthetic data generator: Generative Adversarial Networks (GANs). GANs are modern machine learning models, built on deep learning methods, which create new data that closely resembles the input data (a minimal sketch follows this list). GANs can be used to solve complex problems like:

- Creating huge volumes of synthetic data for banking applications, or for any other domain where getting real test data is challenging (IoT, autonomous vehicle data, etc.)
- Creating new images, videos, and audio data from a few relevant input samples
- Composing new music without playing any musical instruments
- Enhancing image quality without using any external artifacts
- Converting grayscale images or videos to colour images and videos, and so on
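To make the GAN idea concrete, below is a minimal, self-contained PyTorch sketch of the adversarial setup described above: a generator learns to produce records that a discriminator cannot distinguish from real ones. The network sizes and the stand-in "real" data are toy assumptions, not a production synthetic-data generator.

```python
# Minimal GAN: generator vs. discriminator trained in alternation.
import torch
import torch.nn as nn

LATENT, DATA_DIM = 16, 8  # noise size and size of each synthetic record

generator = nn.Sequential(
    nn.Linear(LATENT, 64), nn.ReLU(), nn.Linear(64, DATA_DIM))
discriminator = nn.Sequential(
    nn.Linear(DATA_DIM, 64), nn.ReLU(), nn.Linear(64, 1), nn.Sigmoid())

g_opt = torch.optim.Adam(generator.parameters(), lr=2e-4)
d_opt = torch.optim.Adam(discriminator.parameters(), lr=2e-4)
bce = nn.BCELoss()

real_data = torch.randn(256, DATA_DIM) * 2 + 5  # stand-in for real records

for step in range(1000):
    # 1. Train discriminator: real records -> 1, generated records -> 0.
    fake = generator(torch.randn(64, LATENT)).detach()
    real = real_data[torch.randint(0, 256, (64,))]
    d_loss = (bce(discriminator(real), torch.ones(64, 1)) +
              bce(discriminator(fake), torch.zeros(64, 1)))
    d_opt.zero_grad(); d_loss.backward(); d_opt.step()

    # 2. Train generator: fool the discriminator into predicting 1.
    fake = generator(torch.randn(64, LATENT))
    g_loss = bce(discriminator(fake), torch.ones(64, 1))
    g_opt.zero_grad(); g_loss.backward(); g_opt.step()

# After training, generator(torch.randn(n, LATENT)) yields synthetic rows.
```

Commercial generators add privacy logic and tabular-data handling on top of this basic adversarial loop, but the generator/discriminator contest shown here is the core mechanism.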

Submit your article summary today!


Thank you for your interest in authoring an article for this forum. We are very excited about it!

Please provide a high-level summary of your topic using the form below. We will review it and reach out to you shortly to take it from there. Once your article is accepted for the forum, we will be glad to offer you some amazing Amazon gift coupons.

You can also reach out to us at info@testautomationforum.com