Synthetic Data Generation Using GANs - Impetus

Generative Adversarial Networks (GANs) are a powerful machine learning technique for generating synthetic data that closely resembles real data, in the best cases to the point of being hard to tell apart from it. GANs have been used to generate synthetic images, text, audio, and video, and have applications in a wide range of fields, including healthcare, finance, and security.

GANs work by pitting two neural networks against each other: a generator and a discriminator. The generator’s goal is to create synthetic data that is as realistic as possible, while the discriminator’s goal is to distinguish between real and synthetic data. The generator and discriminator are trained simultaneously, and over time, the generator learns to create increasingly realistic synthetic data.

This blog explores the fundamentals of GANs and how our Data Science team has been applying them to synthetic data generation.

Why do we need GANs?

Data teams all over the world face a dilemma: whether to use production data or to generate synthetic data for testing. Using production data risks exposing sensitive customer information, a risk that synthetic data avoids. GANs can generate such synthetic data at low cost, reducing the security risk of leaking confidential client data. With numerous GAN models readily available, the time needed to produce new data has also decreased. Synthetic data generated by a well-trained GAN can be of high quality, with a distribution similar to that of production data.

GAN models can synthesize data in a local deployment environment, and the same setup can be extended to a cloud service.

How does it work?

GANs are a deep-learning-based generative model and have two sub-models:

  • The generator model, which we train to generate new examples
  • The discriminator model, which classifies examples as real (from the domain) or fake (generated)

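The two models are trained together as adversaries:

  • The generator tries to produce samples that look as if they came from the input domain
  • The discriminator tries to tell real samples from fake ones

When the discriminator correctly separates real from fake samples, it is rewarded (little or no change is needed to its model parameters), while the generator is penalized and its model parameters are updated. When the generator fools the discriminator, the reverse happens: the generator is rewarded, and the discriminator is penalized and its model parameters are updated. Ideally, training continues until the discriminator can do no better than guessing.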
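The alternating updates described above can be sketched on a toy one-dimensional problem. This is a minimal illustration under stated assumptions, not our production setup: the "real" data is assumed to come from a normal distribution with mean 4, the generator is a single linear function G(z) = a*z + b, the discriminator is a single logistic unit D(x) = sigmoid(w*x + c), and all function names and hyperparameters are hypothetical. Gradients are written out by hand so the example needs only the standard library.

```python
import math
import random

def sigmoid(s):
    """Numerically stable logistic function."""
    if s >= 0:
        return 1.0 / (1.0 + math.exp(-s))
    e = math.exp(s)
    return e / (1.0 + e)

def discriminator_step(d, real, fake, lr):
    """One gradient step on the loss -[log D(real) + log(1 - D(fake))]."""
    w, c = d
    p_real = sigmoid(w * real + c)
    p_fake = sigmoid(w * fake + c)
    grad_w = (p_real - 1.0) * real + p_fake * fake
    grad_c = (p_real - 1.0) + p_fake
    return (w - lr * grad_w, c - lr * grad_c)

def generator_step(g, d, z, lr):
    """One gradient step on the loss -log D(G(z)), with D held fixed."""
    a, b = g
    w, c = d
    fake = a * z + b
    p_fake = sigmoid(w * fake + c)
    grad_out = (p_fake - 1.0) * w  # derivative of the loss w.r.t. the fake sample
    return (a - lr * grad_out * z, b - lr * grad_out)

def train(steps=2000, lr=0.01, seed=0):
    """Alternate discriminator and generator updates on single samples."""
    rng = random.Random(seed)
    d = (0.0, 0.0)  # discriminator parameters (w, c)
    g = (1.0, 0.0)  # generator parameters (a, b)
    for _ in range(steps):
        real = rng.gauss(4.0, 1.0)          # assumed "real" data distribution
        fake = g[0] * rng.gauss(0.0, 1.0) + g[1]
        d = discriminator_step(d, real, fake, lr)            # update D, G fixed
        g = generator_step(g, d, rng.gauss(0.0, 1.0), lr)    # update G, D fixed
    return g, d

generator, discriminator = train()
print("generator (a, b):", generator)
print("discriminator (w, c):", discriminator)
```

Each generator step moves the generated sample in the direction that raises the discriminator's "real" probability, which is the reward and penalty mechanism in miniature. Toy setups like this can oscillate, so the sketch makes no claim about how closely the final parameters match the real distribution.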

GAN model architecture

Let’s take a closer look at the GAN architecture:

The Discriminator

The task of the discriminator is to distinguish between real and fake data. To become proficient, it is trained on two data inputs:

  • The generator-produced data (which we can call fake)
  • The given data (which we can label as real)

Let’s say the generator-synthesized (fake) data is labelled as ‘0’ and the real data is labelled as ‘1’. The discriminator then processes this data, predicting either a ‘0’ or a ‘1’ for each example it sees.

Real images of numbers | Fake images of numbers

In the example above, the discriminator needs to become good at identifying the left group as the real numbers and the right group as the fake ones.

More technically, the discriminator will return the probability that a given example is a real example. If this probability is above a certain threshold (for example, 0.5), the discriminator will determine the example to be real and return 1. Otherwise, the discriminator will return 0.
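The thresholding step can be sketched as follows. This is a minimal illustration: the raw scores (logits) are made-up values, and `sigmoid` and `classify` are hypothetical helper names rather than functions from any particular library.

```python
import math

def sigmoid(logit):
    """Map a raw discriminator score (logit) to a probability in (0, 1)."""
    return 1.0 / (1.0 + math.exp(-logit))

def classify(logit, threshold=0.5):
    """Return 1 (real) if the predicted probability is above the threshold, else 0 (fake)."""
    return 1 if sigmoid(logit) > threshold else 0

# Hypothetical raw scores for three examples
logits = [2.0, -1.5, 0.1]
labels = [classify(s) for s in logits]
print(labels)  # [1, 0, 1]
```

The threshold of 0.5 is only the example value mentioned above; a stricter or looser threshold changes how readily examples are accepted as real.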

Note that training a discriminator is a supervised learning task. We explicitly provide target labels as ‘0’ and ‘1’ to the discriminator.
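Because the targets are 0 and 1, a common choice of loss for this supervised task is binary cross-entropy. The sketch below assumes that choice (the text above does not prescribe a specific loss), and the discriminator outputs are made-up probabilities.

```python
import math

def binary_cross_entropy(predictions, targets, eps=1e-12):
    """Mean binary cross-entropy between predicted probabilities and 0/1 labels."""
    total = 0.0
    for p, y in zip(predictions, targets):
        p = min(max(p, eps), 1.0 - eps)  # clamp to avoid log(0)
        total += -(y * math.log(p) + (1 - y) * math.log(1.0 - p))
    return total / len(predictions)

# Hypothetical discriminator outputs: two real examples, then two generated ones
d_outputs = [0.9, 0.8, 0.3, 0.1]
d_targets = [1, 1, 0, 0]

loss = binary_cross_entropy(d_outputs, d_targets)
guessing_loss = binary_cross_entropy([0.5] * 4, d_targets)
print(round(loss, 4), round(guessing_loss, 4))
```

A discriminator that always answers 0.5 gets a loss of log 2 (about 0.693), so any loss below that value means it is doing better than guessing.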

The Generator

The generator network takes random data as input (mathematically, we can think of it as an n-dimensional vector derived from a latent space) and transforms this data to generate examples that can fool the discriminator.
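As a rough illustration of this input-to-output mapping, the sketch below draws an n-dimensional latent vector and passes it through a single linear layer. A real generator would be a deeper network with learned weights; the weights, bias, and function names here are hypothetical placeholders.

```python
import random

def sample_latent(n_dims, rng):
    """Draw an n-dimensional latent vector from a standard normal distribution."""
    return [rng.gauss(0.0, 1.0) for _ in range(n_dims)]

def linear_generator(z, weights, bias):
    """Toy generator: one linear layer mapping a latent vector to an output vector."""
    return [sum(w * zi for w, zi in zip(row, z)) + b
            for row, b in zip(weights, bias)]

rng = random.Random(0)
z = sample_latent(3, rng)

# Placeholder parameters: 2 output features from 3 latent dimensions
weights = [[0.5, -0.2, 0.1],
           [0.3, 0.8, -0.5]]
bias = [1.0, -1.0]

fake_example = linear_generator(z, weights, bias)
print(fake_example)
```

Every new latent vector yields a different generated example, which is how one trained generator can produce an unlimited number of synthetic records.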

The generator never sees the real data directly; it learns only from the discriminator's feedback, so its performance depends on the quality of the discriminator. This makes training the generator more complicated. In practice, the two networks are usually updated in alternating steps: the output of the generator (synthesized data) is passed on as input to the discriminator, which attempts to classify the synthesized data as fake or real, and the resulting error is used to update the generator.

In short, the generator wants to fool the discriminator, and the discriminator wants to clearly tell fake data points from real ones.
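This tug-of-war is commonly written as a minimax objective. The formula below is the standard formulation from the GAN literature rather than something specific to our setup. The symbols D (discriminator), G (generator), p_data (distribution of the real data), and p_z (distribution of the latent input) are notation introduced here for illustration.

```latex
\min_G \max_D V(D, G) =
  \mathbb{E}_{x \sim p_{\text{data}}(x)}\big[\log D(x)\big]
  + \mathbb{E}_{z \sim p_z(z)}\big[\log\big(1 - D(G(z))\big)\big]
```

The discriminator maximizes V by pushing D(x) toward 1 on real data and D(G(z)) toward 0 on generated data, while the generator minimizes V by pushing D(G(z)) toward 1.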

Applications of GANs for synthetic data generation

GANs have been used for synthetic data generation in a wide range of fields, including:

  • Computer vision: GANs can be used to generate synthetic images, videos, and 3D models. This data can be used to train machine learning models for object detection, image segmentation, and image classification tasks.
  • Medical imaging: GANs can be used to generate synthetic medical images, such as MRI and CT scans. This data can be used to train machine learning models for tasks such as medical diagnosis and treatment planning.
  • Natural language processing: GANs can be used to generate synthetic text, such as news articles, code, and poems. This data can be used to train machine learning models for tasks such as machine translation, text summarization, and question answering.
  • Other applications: GANs have also been used to generate synthetic data for other applications such as fraud detection, financial forecasting, and cybersecurity.

GANs are also used for:

  • Image-to-image translation tasks such as translating photos of summer to winter or day to night
  • Anime character creation
  • Old photo restoration
  • Super-resolution
  • Generating realistic photographs
  • Learning the distribution of tabular data and generating similar kinds of data

As an example, refer to the image below: starting from the Mona Lisa portrait, a GAN can generate different facial expressions of the same portrait.

Tips for synthetic data generation using GANs

  • Choose a GAN architecture that is appropriate for the type of data you want to generate. There are many different types of GANs, each designed for a specific task. For example, there are GANs for generating images, GANs for generating text, and GANs for generating medical images.
  • Use a high-quality training dataset. The quality of the synthetic data generated by a GAN depends on the quality of the training dataset. It is important to use a training dataset that is representative of the real data that you want to generate.
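  • Train your GAN carefully. GANs can be challenging to train: the generator and discriminator need to stay roughly balanced, and training can become unstable or collapse to a narrow range of outputs (mode collapse). Monitor both losses, inspect generated samples regularly, and expect to tune hyperparameters over several runs. There are many resources available online that can help you train your GAN.
  • Evaluate your synthetic data. Once you have trained your GAN, evaluating the synthetic data it generates is essential. Compare it with the real data, for example by checking that the distributions of individual columns or features match, to make sure it is realistic and representative of the real data you want to generate.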
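One simple way to check a single numeric column is the two-sample Kolmogorov-Smirnov statistic, which measures the largest gap between the empirical distributions of the real and synthetic values (0 means identical, 1 means completely separated). The sketch below is a standard-library illustration with made-up data; the choice of this metric is an assumption, not a method prescribed above, and libraries such as SciPy provide a tested version (`scipy.stats.ks_2samp`).

```python
def ks_statistic(sample_a, sample_b):
    """Two-sample Kolmogorov-Smirnov statistic: largest gap between empirical CDFs."""
    a, b = sorted(sample_a), sorted(sample_b)
    i = j = 0
    max_gap = 0.0
    while i < len(a) and j < len(b):
        x = min(a[i], b[j])
        # Advance both pointers past every value <= x
        while i < len(a) and a[i] <= x:
            i += 1
        while j < len(b) and b[j] <= x:
            j += 1
        max_gap = max(max_gap, abs(i / len(a) - j / len(b)))
    return max_gap

# Made-up values for one numeric column
real_column = [1, 2, 3, 4]
good_synthetic = [1, 2, 3, 4]
poor_synthetic = [10, 11, 12, 13]

print(ks_statistic(real_column, good_synthetic))  # 0.0
print(ks_statistic(real_column, poor_synthetic))  # 1.0
```

A per-column check like this does not capture relationships between columns, so it is best treated as a first sanity check alongside other evaluations.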

Data science teams at Impetus are experts in using GANs to generate synthetic data that replicates the variety and veracity of production data. This expertise has helped our clients overcome data privacy and availability challenges and accelerate their machine learning model development and deployment. To learn more, write to us at inquiry@impetus.com.

Authors

Nehaa Bansal
Module Lead Analytics Engineer, Data Science

Samarth Tibdewal
Analytics Engineer, Data Science
