Synthetic Data Generation Using GANs

Generative Adversarial Networks (GANs) are a machine learning technique for generating synthetic data that closely resembles real data. GANs have been used to generate synthetic images, text, audio, and video, and have applications in a wide range of fields, including healthcare, finance, and security.

GANs work by pitting two neural networks against each other: a generator and a discriminator. The generator’s goal is to create synthetic data that is as realistic as possible, while the discriminator’s goal is to distinguish between real and synthetic data. The generator and discriminator are trained simultaneously, and over time, the generator learns to create increasingly realistic synthetic data.

This blog explores the fundamentals of GANs and how our Data Science team has been applying them to synthetic data generation.

Why do we need GANs?

Data teams all over the world face a dilemma: whether to use production data or generate synthetic data for testing. Using production data risks exposing sensitive customer information, a risk that synthetic data avoids. GANs can generate synthetic data at low cost, reducing the security risk of leaking confidential client data. With numerous pre-built GAN models available, the time needed to produce new synthetic data has also decreased. A well-trained GAN produces high-quality synthetic data whose distribution resembles that of production data.

GANs help synthesize data in the local deployment environment and can be extended to any cloud service.

How does it work?

GANs are a deep-learning-based generative model and have two sub-models:

The generator model, which we train to generate new examples
The discriminator model, which classifies examples as real (from the domain) or fake (generated)

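The two models are trained together with opposing goals:

The generator aims to generate samples that pass as real examples from the input domain
The discriminator aims to correctly identify real and fake samples

When the discriminator correctly identifies real and fake samples, it is rewarded, or no change is needed to its model parameters, while the generator is penalized and its model parameters are updated. When the generator fools the discriminator, it is rewarded, or no change is needed to its model parameters, but the discriminator is penalized, and its model parameters are updated.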
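The adversarial training described above can be sketched with a deliberately tiny example. This is a toy illustration, not our production setup: the "real" data is a single number drawn from a normal distribution, the generator is a line G(z) = a*z + b, and the discriminator is a logistic unit D(x) = sigmoid(w*x + c). All names, starting values, and hyperparameters here are assumptions chosen for illustration, and the gradients are written by hand so that the example needs only the Python standard library.

```python
import math
import random

def sigmoid(s):
    # Numerically stable logistic function.
    if s >= 0:
        return 1.0 / (1.0 + math.exp(-s))
    e = math.exp(s)
    return e / (1.0 + e)

def train_toy_gan(steps=500, lr=0.05, real_mean=4.0, real_std=0.5, seed=0):
    """Alternate discriminator and generator updates on 1-D data."""
    rng = random.Random(seed)
    a, b = 1.0, 0.0   # generator parameters: G(z) = a*z + b
    w, c = 0.1, 0.0   # discriminator parameters: D(x) = sigmoid(w*x + c)
    for _ in range(steps):
        # Discriminator step: push D(real) toward 1 and D(fake) toward 0.
        x_real = rng.gauss(real_mean, real_std)
        x_fake = a * rng.gauss(0.0, 1.0) + b
        for x, label in ((x_real, 1.0), (x_fake, 0.0)):
            err = sigmoid(w * x + c) - label  # gradient of cross-entropy w.r.t. the logit
            w -= lr * err * x
            c -= lr * err
        # Generator step: push D(G(z)) toward 1, i.e. try to fool the discriminator.
        z = rng.gauss(0.0, 1.0)
        err = sigmoid(w * (a * z + b) + c) - 1.0
        a -= lr * err * w * z
        b -= lr * err * w
    return a, b, w, c

if __name__ == "__main__":
    a, b, w, c = train_toy_gan()
    print(f"generator: G(z) = {a:.2f}*z + {b:.2f}")
```

Because the discriminator is updated when it misclassifies and the generator is updated when it fails to fool the discriminator, the two sets of parameters keep chasing each other. Real GANs replace both toy models with deep neural networks and rely on a framework's automatic differentiation instead of hand-written gradients.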

GAN model architecture

Let’s dive deep into the GAN architecture:

The Discriminator

The task of the discriminator is to distinguish between real and fake data. To become proficient, it is trained on two data inputs:

The generator-produced data (which we can call fake)
The given data (which we can label as real)

Let’s say the generator-synthesized (fake) data is labelled ‘0’ and the real data is labelled ‘1’. The discriminator then processes this data, predicting either ‘0’ or ‘1’ for each example it sees.

[Figure: real images of numbers (left) | fake images of numbers (right)]

In the example above, the discriminator needs to become good at identifying the left group as the real numbers and the right group as the fake ones.

More technically, the discriminator will return the probability that a given example is a real example. If this probability is above a certain threshold (for example, 0.5), the discriminator will determine the example to be real and return 1. Otherwise, the discriminator will return 0.
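As a minimal sketch of this decision rule, assuming the discriminator has already produced a probability between 0 and 1 (the function name `classify` and its parameters are ours, chosen for illustration):

```python
def classify(prob_real, threshold=0.5):
    """Return 1 (real) if the probability is above the threshold, else 0 (fake)."""
    return 1 if prob_real > threshold else 0

# Example: three discriminator outputs turned into hard labels.
labels = [classify(p) for p in (0.92, 0.50, 0.13)]
```

The threshold is a tunable choice. Raising it makes the discriminator stricter about what it accepts as real.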

Note that training a discriminator is a supervised learning task. We explicitly provide target labels as ‘0’ and ‘1’ to the discriminator.
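Since the discriminator is a binary classifier with known labels, a common way to score it is binary cross-entropy. The sketch below is one standard choice, not necessarily the loss used in every GAN variant. The function name and the small `eps` clamp (which avoids taking the log of zero) are our own assumptions.

```python
import math

def binary_cross_entropy(labels, probs, eps=1e-12):
    """Average cross-entropy between 0/1 labels and predicted probabilities."""
    total = 0.0
    for y, p in zip(labels, probs):
        p = min(max(p, eps), 1.0 - eps)  # keep p strictly inside (0, 1)
        total += -(y * math.log(p) + (1 - y) * math.log(1.0 - p))
    return total / len(labels)

# Two real examples (label 1) followed by two fake examples (label 0).
targets = [1, 1, 0, 0]
good_loss = binary_cross_entropy(targets, [0.9, 0.8, 0.1, 0.2])
bad_loss = binary_cross_entropy(targets, [0.1, 0.2, 0.9, 0.8])
```

A discriminator that assigns high probabilities to real examples and low probabilities to fake ones gets a lower loss (`good_loss`) than one that does the opposite (`bad_loss`).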

The Generator

The generator network takes random data as input (mathematically, we can think of it as an n-dimensional vector derived from a latent space) and transforms this data to generate examples that can fool the discriminator.
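To make the idea of a latent vector concrete, here is a rough sketch of a generator as a single untrained layer that maps random noise to an output vector. The layer sizes, the tanh activation, and all function names are assumptions for illustration. A real generator stacks many such layers and learns its weights during training.

```python
import math
import random

def sample_latent(latent_dim, rng):
    """Draw an n-dimensional latent vector from a standard normal distribution."""
    return [rng.gauss(0.0, 1.0) for _ in range(latent_dim)]

def make_generator(latent_dim, output_dim, rng):
    """Build a one-layer generator with small random (untrained) weights."""
    weights = [[rng.gauss(0.0, 0.1) for _ in range(latent_dim)]
               for _ in range(output_dim)]
    biases = [0.0] * output_dim

    def generate(z):
        # Weighted sum of the latent vector per output, squashed into [-1, 1].
        return [math.tanh(sum(w_ij * z_j for w_ij, z_j in zip(row, z)) + b)
                for row, b in zip(weights, biases)]

    return generate

rng = random.Random(0)
generator = make_generator(latent_dim=8, output_dim=3, rng=rng)
fake_example = generator(sample_latent(8, rng))
```

Each new latent vector yields a different output, which is how a single trained generator can produce an unlimited number of synthetic examples.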

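The performance of the generator depends on the quality of the discriminator, because the discriminator's feedback is the generator's only training signal. This makes training the generator more complicated. In practice, the two networks are typically trained in alternating steps: the discriminator is updated on real and generated examples, and then the generator is updated based on how the discriminator scored its output. During training, the output of the generator (synthesized data) is passed as input to the discriminator, which attempts to classify it as fake or real.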

In short, the generator tries to fool the discriminator, while the discriminator tries to tell fake data points from real ones.
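These opposing goals are commonly written as a single minimax objective. The notation below is the standard formulation rather than anything specific to this post: D(x) is the discriminator's estimated probability that x is real, G(z) is the generator's output for a latent vector z, and p_data and p_z are the real-data and latent distributions.

```latex
\min_{G} \max_{D} \; V(D, G) =
  \mathbb{E}_{x \sim p_{\text{data}}} \big[ \log D(x) \big]
  + \mathbb{E}_{z \sim p_{z}} \big[ \log \big( 1 - D(G(z)) \big) \big]
```

The discriminator maximizes V by scoring real data near 1 and generated data near 0, while the generator minimizes V by making D(G(z)) as close to 1 as it can.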

Applications of GANs for synthetic data generation

GANs have been used for synthetic data generation in a wide range of fields, including:

Computer vision: GANs can be used to generate synthetic images, videos, and 3D models. This data can be used to train machine learning models for object detection, image segmentation, and image classification tasks.
Medical imaging: GANs can be used to generate synthetic medical images, such as MRI and CT scans. This data can be used to train machine learning models for tasks such as medical diagnosis and treatment planning.
Natural language processing: GANs can be used to generate synthetic text, such as news articles, code, and poems. This data can be used to train machine learning models for tasks such as machine translation, text summarization, and question answering.
Other applications: GANs have also been used to generate synthetic data for other applications such as fraud detection, financial forecasting, and cybersecurity.

GANs are also used for:

Image-to-image translation tasks such as translating photos of summer to winter or day to night
Anime character creation
Old photo restoration
Super Resolution
Generating Realistic Photographs
Learning the distribution of tabular data and generating similar kinds of data

As an example, the image below shows how GANs can generate different facial expressions from the Mona Lisa portrait.

Tips for synthetic data generation using GANs

Choose a GAN architecture that is appropriate for the type of data you want to generate. There are many different types of GANs, each designed for a specific task. For example, there are GANs for generating images, GANs for generating text, and GANs for generating medical images.
Use a high-quality training dataset. The quality of the synthetic data generated by a GAN depends on the quality of the training dataset. It is important to use a training dataset that is representative of the real data that you want to generate.
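Train your GAN carefully. GANs can be challenging to train: training can be unstable, and the generator can collapse into producing only a few kinds of output (known as mode collapse). Monitor both networks' losses and inspect generated samples regularly during training. There are many resources available online that can help you train your GAN.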
Evaluate your synthetic data. Once you have trained your GAN, evaluating the synthetic data it generates is essential. Make sure the synthetic data is realistic and representative of the real data you want to generate.
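One simple way to start evaluating tabular output is to compare a real column with its synthetic counterpart. The sketch below is one possible check, not a complete evaluation suite: it compares means and standard deviations and computes the two-sample Kolmogorov-Smirnov statistic (the largest gap between the two empirical distribution functions, where 0 means identical and 1 means completely separated). The function names and sample values are our own, for illustration only.

```python
import statistics

def ks_statistic(sample_a, sample_b):
    """Two-sample Kolmogorov-Smirnov statistic: max gap between empirical CDFs."""
    a, b = sorted(sample_a), sorted(sample_b)
    i = j = 0
    gap = 0.0
    while i < len(a) and j < len(b):
        x = min(a[i], b[j])
        # Advance both pointers past every value <= x, then compare the CDFs.
        while i < len(a) and a[i] <= x:
            i += 1
        while j < len(b) and b[j] <= x:
            j += 1
        gap = max(gap, abs(i / len(a) - j / len(b)))
    return gap

def compare_columns(real, synthetic):
    """Summarize how closely a synthetic column tracks a real one."""
    return {
        "mean_gap": abs(statistics.mean(real) - statistics.mean(synthetic)),
        "std_gap": abs(statistics.stdev(real) - statistics.stdev(synthetic)),
        "ks": ks_statistic(real, synthetic),
    }

# Illustrative values only; in practice these come from your datasets.
report = compare_columns([1.0, 2.0, 3.0, 4.0], [1.1, 2.1, 2.9, 4.2])
```

Small gaps on every column are necessary but not sufficient. It is also worth checking relationships between columns and confirming that no real record has been copied into the synthetic set.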

Data science teams at Impetus are experts in using GANs to generate synthetic data that replicates the variety and veracity of production data. This expertise has helped our clients overcome data privacy and availability challenges and accelerate their machine learning model development and deployment. To learn more, write to us at inquiry@impetus.com.

Authors

Nehaa Bansal
Module Lead Analytics Engineer, Data Science

Samarth Tibdewal
Analytics Engineer, Data Science

