Today’s Agile/DevOps setups need to move fast, and the availability of large, diversified test data sets can be critical to the success of a test automation effort. In this article we will discuss synthetic test data, its importance and applications, the options available today for generating cheap and adequate test data sets with modern Test Data Management platforms, and, most importantly, how the power of AI/ML is being leveraged in this space.
More than 45% of the global population now has access to social media, mobility, analytics, and cloud-based applications. Software testing needs many combinations of data sets to ensure a software product works flawlessly on its end users’ systems and devices. Testing without adequate and diversified test data can let defects and flaws slip through, and the consequences can be disastrous.
Below are a few examples of how testing can be misleading, or even go dangerously wrong, when adequate test data is lacking:
- e-commerce apps slowing down or even crashing during the annual sale season
- Unfortunate air accidents in the past caused by software malfunctioning on incorrect sensor data
The Agile/DevOps testing approach, based on the “test early and test often” (shift-left) philosophy, demands large sets of production-like data in the desired formats during the initial development phase of a software product. During this phase, test engineers adopt various methods, or leverage traditional utilities such as spreadsheets, to generate test data for their test scripts. Below are a few common ways test data is generated:
- Manual creation of Data files (Spreadsheets, CSVs, Audio/video files etc.)
- Using SQL statements/stored procedures
- Getting a copy/dump of source data (risky if the data is confidential)
- Leveraging an automated data generator/Test Data Manager (TDM)
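As a minimal illustration of the last option, the sketch below uses only the Python standard library to generate a batch of production-like user records as a CSV. The field names and value ranges are illustrative assumptions for this article, not taken from any particular TDM tool:

```python
import csv
import io
import random
import string

def make_user_record(rng: random.Random) -> dict:
    """Generate one synthetic, production-like user record.
    Every field is fabricated; nothing comes from real users."""
    first = "".join(rng.choices(string.ascii_lowercase, k=6)).capitalize()
    last = "".join(rng.choices(string.ascii_lowercase, k=8)).capitalize()
    return {
        "user_id": rng.randrange(10_000, 99_999),
        "name": f"{first} {last}",
        "age": rng.randint(18, 90),
        "email": f"{first.lower()}.{last.lower()}@example.com",
        "phone": "-".join(str(rng.randrange(100, 999)) for _ in range(3)),
    }

def generate_csv(n: int, seed: int = 42) -> str:
    """Write n synthetic records to a CSV string, ready for a test script."""
    rng = random.Random(seed)  # fixed seed => reproducible test data
    rows = [make_user_record(rng) for _ in range(n)]
    buf = io.StringIO()
    writer = csv.DictWriter(buf, fieldnames=list(rows[0].keys()))
    writer.writeheader()
    writer.writerows(rows)
    return buf.getvalue()

print(generate_csv(3))
```

A fixed seed makes the generated data reproducible across test runs, which is usually what you want for automated regression suites.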
Test data is required in different formats not only for functional testing but throughout the development lifecycle:
- Functional testing (unit, integration, and system testing)
- Performance Testing (Load testing with thousands of concurrent users)
- Security Testing (adequate user profile data)
- Reliability Testing (testing with negative data)
- Configuration/compatibility testing (localization, internationalization, etc.)
- And so on
What is Synthetic Data?
As the name suggests, synthetic means something created artificially; in our context, it is test data artificially created by a data generator.
Data created by real customers or real end users, such as user IDs, passwords, names, age, sex, photos, addresses, telephone numbers, and email IDs, are a few examples of real data.
This data can become more complex and voluminous as we get into domains like healthcare, automotive, digital, and social media. It is not always practical to have a high volume of such diversified data during the testing phase, so we have to create it either manually or using a tool, as discussed in the introduction above.
The set of images below is a good example of synthetic data created by an AI-powered algorithm. Note that while the images look remarkably like real people, these people do not actually exist.
We’ll discuss how these AI-powered models generate high volumes of synthetic data later in this article. First, let’s discuss why we need high volumes of synthetic data.
Importance of production-like synthetic data
Production data (e.g., a user’s profile data in a banking application) is secured and cannot be accessed for testing purposes. Hence, realistic, anonymized test data has to be created some other way.
Below are a few reasons why we often cannot use real data and have to rely on synthetic data:
- Data usage restrictions and data protection standards: real data might be protected under regulatory restrictions such as the GDPR (EU data privacy law), export-control rules, or PII-handling policies. Synthetic data can replicate, mimic, or mask the real data format to overcome this challenge.
- No real data exists: when we develop an application from scratch (e.g., in emerging areas like autonomous vehicles), we need a good amount of test data, and synthetic data is a big help from a testing standpoint.
- Cost-effectiveness: generating synthetic data with an AI-powered data generation model is considerably cheaper and faster than creating it manually or by other methods.
- Testing AI/ML-based applications: AI/ML models need enormous amounts of data for training and for testing their accuracy. Synthetic data is used here because collecting real data is expensive and time-consuming.
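To make the first point concrete, here is a minimal sketch (plain Python, not any specific TDM product) of format-preserving masking: each digit and letter in a sensitive field is replaced with a random one of the same kind, so the shape of the data survives while the real values do not:

```python
import random
import string

def mask_value(value: str, seed: int = 0) -> str:
    """Replace letters with random letters and digits with random digits,
    preserving case, punctuation, and overall format (illustrative sketch)."""
    rng = random.Random(seed)
    out = []
    for ch in value:
        if ch.isdigit():
            out.append(rng.choice(string.digits))
        elif ch.isalpha():
            pool = string.ascii_uppercase if ch.isupper() else string.ascii_lowercase
            out.append(rng.choice(pool))
        else:
            out.append(ch)  # keep separators like '-', '@', '.' intact
    return "".join(out)

print(mask_value("john.smith@acme-bank.com"))  # same shape, different letters
```

Because the masked value keeps the length, separators, and character classes of the original, downstream code that parses or validates the field still behaves as it would on production data.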
How does an AI-powered synthetic data generator work?
AI models leverage deep neural networks, with additional privacy logic, to generate an unlimited amount of synthetic data that complies with global standards such as the GDPR and CCPA.
Most modern synthetic data generators have user-friendly GUIs: with a few clicks, the platform lets you generate an unlimited amount of highly realistic but completely anonymous synthetic data. This AI-generated data looks very much like your actual customer data, is remarkably accurate, and makes a great alternative to your privacy-sensitive data.
Let’s have a look at the AI model at the heart of many modern synthetic data generators: the Generative Adversarial Network (GAN).
GANs are machine learning models that use deep learning methods to create new data that closely resembles the input data. GANs can be used to solve complex problems like:
- Creating large volumes of synthetic data for banking applications, or for any other domain where obtaining realistic test data is challenging (IoT, autonomous-vehicle data, etc.)
- New images, videos and audio data can be created by inputting a few relevant sample data
- New music can be composed without playing any musical instruments
- Image quality can be enhanced with GAN networks without using any external artifacts
- Grayscale images or videos can be converted to color images and videos and so on
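To make the adversarial idea concrete, below is a deliberately tiny, pure-Python sketch of a GAN: a two-parameter generator tries to match a 1-D Gaussian “real” distribution while a logistic discriminator learns to tell real from fake, with the gradients written out by hand. This is a toy under stated assumptions (1-D data, linear generator), not a production GAN:

```python
import math
import random

rng = random.Random(0)
REAL_MEAN, REAL_STD = 4.0, 0.5     # the "real" data distribution (assumed)

def sigmoid(x: float) -> float:
    return 1.0 / (1.0 + math.exp(-x))

# Generator G(z) = a*z + b; Discriminator D(x) = sigmoid(w*x + c)
a, b = 1.0, 0.0
w, c = 0.1, 0.0
lr, steps, batch = 0.02, 4000, 8

for _ in range(steps):
    reals = [rng.gauss(REAL_MEAN, REAL_STD) for _ in range(batch)]
    zs = [rng.gauss(0.0, 1.0) for _ in range(batch)]
    fakes = [a * z + b for z in zs]

    # --- Discriminator step: maximize log D(real) + log(1 - D(fake)) ---
    gw = gc = 0.0
    for xr in reals:
        d = sigmoid(w * xr + c)
        gw += -(1.0 - d) * xr      # gradient of -log D(xr)
        gc += -(1.0 - d)
    for xf in fakes:
        d = sigmoid(w * xf + c)
        gw += d * xf               # gradient of -log(1 - D(xf))
        gc += d
    w -= lr * gw / (2 * batch)
    c -= lr * gc / (2 * batch)

    # --- Generator step: maximize log D(fake), i.e. fool the discriminator ---
    ga = gb = 0.0
    for z in zs:
        xf = a * z + b
        d = sigmoid(w * xf + c)
        ga += -(1.0 - d) * w * z   # gradient of -log D(G(z)) w.r.t. a
        gb += -(1.0 - d) * w       # ... w.r.t. b
    a -= lr * ga / batch
    b -= lr * gb / batch

samples = [a * rng.gauss(0.0, 1.0) + b for _ in range(200)]
print(sum(samples) / len(samples))  # drifts toward REAL_MEAN as training succeeds
```

Real GAN-based data generators work on the same adversarial principle, but with deep networks on both sides and many more dimensions, which is what lets them synthesize tabular records, images, or audio rather than a single number.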
I hope you can now imagine how easy it is for an AI-powered data generator or Test Data Manager (TDM) to generate huge volumes of data for your test automation needs. Now, let’s have a look at several popular players in this space.
Some popular Synthetic Data generating tools/TDMs in the market
Several commercial and open-source tools available in the market today can be highly cost-effective, and cycle-time effective as well, in handling the demand for test data.
Below is a list of a few popular commercial synthetic data generators and TDM platforms (most of them AI-powered):
| Platform/Tool Name | Provider | Description (as per their official product info web page) |
| --- | --- | --- |
| Tonic | Tonic.ai (the Fake Data Company) | Fake data that looks, feels, and behaves like production. With Tonic.ai, your data is modeled from your production data to help you tell an identical story in your testing environments. |
| Avo’s Intelligent Test Data Management | Avo Automation | Deliver better software faster with reliable test data management: build software using representative test data that mimics your production. Make the software development process more cost-efficient. |
| MOSTLY GENERATE | MOSTLY | MOSTLY GENERATE is an enterprise-grade synthetic data platform that preserves significantly more information and data value than any other data anonymization technique on the market. It enables you to overcome the barriers to AI and Big Data adoption, all while securely protecting your customers’ privacy. |
| EdgeCase | EdgeCase | Synthetic/simulated data are images and videos created to mirror the real world. Anything from the objects of a scene, to the weather, background, even camera placement and actions can be changed. Using the EdgeCase Data Platform, creating data sets is as easy as clicking a button and generating data. |
| YData | YData | The process of building datasets is now much faster and cheaper with automated preprocessing, labelling, and synthetic data generation. |
| BizDataX | BizDataX | BizDataX makes data masking/data anonymization simple by cloning production or extracting only a subset of data, and masking it on the way, making GDPR compliance easier to achieve. |
| Test Data Manager | Broadcom (CA Technologies) | Find, build, manage, and deliver test data to everyone on your team. Let Test Data Manager find and deliver test data whenever and wherever needed. |
| DATPROF | DATPROF | The company provides several tools for data masking and test data provisioning. These tools are used by teams to simplify getting the right test data at the right place and right time. |
| Test Data Management | Informatica | Enable secure, automated provisioning of non-production data sets to meet development and testing needs. |
| Test Data Management | Compuware | Compuware’s Test Data Management solution simplifies the complexity of test data management, for both test and production environments, through a standardized approach to managing data from multiple databases and file types. |
| Test Data Management | IBM | Optimizes and automates the test data management process, complete with workflows and services on demand for agile development and testing. |
| Test Data Management | Solix | Solix TDM automates the creation of intelligently sized database subsets (not clones) that save up to 80% on storage space while still providing a syntactically correct copy of the production database needed to achieve the most accurate test results. It also features Data Masking, which helps DevOps teams quickly identify sensitive data in the subsets and apply format-preserving masking to ensure compliance and protect against data breaches. |
Things to check while dealing with synthetic data
While synthetic data is cheap and easy to create, it also has limitations.
Below are a few scenarios where human intervention might be required to ensure synthetic data is accurate for its intended use:
- Generated synthetic data might be biased by the input data (some data can be under- or over-represented).
- Data might be meaningless unless proper constraints are imposed. For example: human body temperature can’t exceed 108.14°F, maximum heart rate can’t exceed 220 BPM, and so on.
- Privacy risks: if the input data contains outliers, there is a risk of the same records being reproduced in the synthetic data. This might expose the real data (a data leak), which is a serious problem if the data is private.
- Security risks: AI models can be vulnerable to certain attacks if no privacy rules are in place. There have been incidents where attackers substantially uncovered the input data via model inversion attacks.
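As a concrete illustration of the constraint problem above, the sketch below (plain Python, using the body-temperature and heart-rate limits from the bullet list, plus lower bounds assumed for this sketch) filters a batch of generated vitals and rejects records that violate domain rules, the kind of check a human reviewer would otherwise have to perform:

```python
import random

# Upper bounds taken from the discussion above; lower bounds are
# illustrative assumptions for this sketch.
CONSTRAINTS = {
    "body_temp_f": (90.0, 108.14),   # human body temp can't exceed 108.14 F
    "heart_rate_bpm": (30, 220),     # max heart rate can't exceed 220 BPM
}

def is_plausible(record: dict) -> bool:
    """Accept a synthetic record only if every field obeys its constraint."""
    return all(lo <= record[field] <= hi
               for field, (lo, hi) in CONSTRAINTS.items())

def generate_vitals(n: int, seed: int = 1) -> list:
    """Deliberately naive generator: it can emit impossible values,
    which the constraint check then filters out."""
    rng = random.Random(seed)
    raw = [{"body_temp_f": rng.uniform(85.0, 115.0),
            "heart_rate_bpm": rng.randint(20, 260)} for _ in range(n)]
    return [r for r in raw if is_plausible(r)]

valid = generate_vitals(100)
print(len(valid), "of 100 records passed the domain constraints")
```

In a real pipeline such rules would come from domain experts; the point is that constraint enforcement is a separate, deliberate step on top of raw generation.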
Test Data Management has evolved into a specialization in its own right. Using realistic and varied test data during the testing process is critical to the overall success of a software release. Smart automation setups need cheap, high-volume test data that can be provisioned easily by AI-powered synthetic data generators to meet the execution speed required in today’s Agile/DevOps space. While generating unlimited synthetic data is easy, the data’s acceptability has to be ensured, which might require some amount of human intervention.
Hope you enjoyed the article. Let us know what you think.