I’ve enjoyed learning about AI Engineering and the many technologies that make up this field. For three months, it’s felt like I’m discovering a new concept every day. I know it’s almost impossible to learn everything you come across, so I’ve learnt to give myself grace. Something I’ve ‘discovered’ recently is the Machine Learning (ML) sub-field. I mean, I’ve heard the term in the past, but for obvious reasons, I never thought too much about it. But I’m taking AI seriously now, and I can see how ML, deep learning, neural networks, and many other concepts work together to make the AI industry what it is today.
This may be a little vain (yeah, just a little), but I’ve found ML—for want of a better word—sexy. The idea, the practical application, the effect, and the real-world use case have me honestly smitten. I want to know more about this—you know, take a peek under the hood.
So I’m researching ML concepts, and I come across the Lex Fridman podcast on YouTube where he interviewed Ian Goodfellow, who is essentially the father of Generative Adversarial Networks (GANs). That hour-long video gave me real joy and plenty of insight into a concept I hadn’t come across until that point. One really amazing thing about that interview was Goodfellow talking about how the idea of GANs hit him during a drunken conversation with friends at a bar in 2014. I’m just going to say I know where I’d be tonight, haha. Ian Goodfellow and his team at the University of Montreal went on to write the original paper introducing GANs to the world.
Naturally, I went down a GANs rabbit hole and boy, did I learn a lot. I even tried my hand at a project so I could practice some of the things I was learning. In this piece, I’ll be telling you about GANs at a core level, along with some of the important techniques and concepts behind them. I’ll also do a walkthrough/explainer on the project I took on and the thinking that guided it.
What is a GAN?
A generative adversarial network is a type of machine learning model that is trained on a dataset, like images, audio, or text, and learns to generate new samples that look like they belong to that dataset. Before GANs, most machine learning models were discriminative, meaning they were mostly used for classification or regression tasks. GANs “changed the game”, ushering in an era of creativity in machine learning.
Yann LeCun, a prominent figure in artificial intelligence and machine learning and the Chief AI Scientist at Meta, described GANs as “the most interesting idea in the last 10 years in Machine Learning”, and I agree. At the heart of GANs is a deceptively simple idea: generate samples that look realistic by pitting two neural networks against each other in a game of deception vs detection.
When a GAN creates a new image, e.g., a cat, it’s using a neural network to produce a cat that has never existed before. It’s not compositing different photos of cats together, taking the eye off one cat and the ear off another for the final image. Instead, you train the neural network on a lot of data, and it generates images of entirely new cats by sampling from the probability distribution it has learned from that data.
Remember how I mentioned earlier that the idea behind GANs is pitting two neural networks against each other? Yes, GANs at their core have two models playing a competitive game (there’s a minimal code sketch of the two right after this list):
- The Generator (G): takes random noise (numbers) and tries to produce something that looks real. An example could be an image that looks like a photo of a house. In this great game, imagine the generator as a counterfeiter trying to print fake money.
- The Discriminator (D): looks at both real data (actual photos of a house) and fake data (the generator’s output) and decides which is which. Imagine the discriminator as the police, trained to spot counterfeit money.
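To make the two players concrete, here’s a minimal PyTorch sketch of what a tiny generator and discriminator can look like. The layer sizes and the 28×28 image shape are illustrative choices of mine, not anything from the original paper or my project.

```python
import torch
import torch.nn as nn

LATENT_DIM = 100  # size of the random noise vector fed to the generator

class Generator(nn.Module):
    """Maps random noise to a flattened 28x28 grayscale image."""
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(LATENT_DIM, 256),
            nn.ReLU(),
            nn.Linear(256, 28 * 28),
            nn.Tanh(),  # pixel values in [-1, 1]
        )

    def forward(self, z):
        return self.net(z)

class Discriminator(nn.Module):
    """Scores an image: the probability that it is real rather than generated."""
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(28 * 28, 256),
            nn.LeakyReLU(0.2),
            nn.Linear(256, 1),
            nn.Sigmoid(),  # output in (0, 1): D's "realness" score
        )

    def forward(self, x):
        return self.net(x)
```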
How Do GANs Work?
Here’s how the training loop plays out.
The generator first produces nonsense because it has no idea what real data looks like. Naturally, this results in the discriminator easily spotting the fakes. Over time, the generator learns how to fool the discriminator better, and the discriminator adjusts, adapts and gets sharper at spotting better fakes. This back-and-forth ‘duel’ makes the generator’s outputs more and more indistinguishable from real data. In the end, we end up with websites like this one where a totally fake—but very realistic—human face pops up on every refresh.
Basically, this is like a cat-and-mouse game. The generator in this instance is the mouse, and it keeps trying to sneak in fakes. The discriminator is the cat that gets sharper at catching these fakes. This duel is the essence of Generative Adversarial Networks (GANs).
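And here’s what that cat-and-mouse duel looks like as a single training step, reusing the Generator and Discriminator classes from the sketch above. This is a minimal sketch with placeholder hyperparameters, not the exact recipe of any particular implementation.

```python
import torch
import torch.nn as nn

# Generator, Discriminator, and LATENT_DIM come from the sketch in the previous section.
G, D = Generator(), Discriminator()
opt_G = torch.optim.Adam(G.parameters(), lr=2e-4)
opt_D = torch.optim.Adam(D.parameters(), lr=2e-4)
bce = nn.BCELoss()

def train_step(real_images):
    """One round of the duel. real_images: (batch, 28*28) tensor scaled to [-1, 1]."""
    batch = real_images.size(0)
    real_labels = torch.ones(batch, 1)
    fake_labels = torch.zeros(batch, 1)

    # 1) Train the discriminator: reward it for calling real "real" and fake "fake".
    z = torch.randn(batch, LATENT_DIM)
    fake_images = G(z).detach()  # detach so only D's weights update in this step
    loss_D = bce(D(real_images), real_labels) + bce(D(fake_images), fake_labels)
    opt_D.zero_grad()
    loss_D.backward()
    opt_D.step()

    # 2) Train the generator: reward it when D calls its fakes "real".
    z = torch.randn(batch, LATENT_DIM)
    loss_G = bce(D(G(z)), real_labels)
    opt_G.zero_grad()
    loss_G.backward()
    opt_G.step()
    return loss_D.item(), loss_G.item()
```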
Applications of GANs
- Image Translation: With GANs, you can translate sketches into photos, black-and-white images into color, and old houses into modern duplexes.
- Art & Design: They are used for AI-generated paintings, visual concepts, music covers, fashion styles, etc.
- Data Augmentation: GANs can create synthetic training data for other AI models in various applications, e.g., medical imaging.
- Super-Resolution: Used to upscale blurry, low-resolution images.
- Deepfakes: GANs can produce highly realistic fake images or videos. This is a controversial use case.
GANs are used a lot in generative design: Adobe uses them to build Photoshop tools, IBM uses them for data augmentation, Google has experimented with them for text generation, and social media platforms like Instagram, Snapchat, and TikTok use them to create image filters.
The Math Behind GANs
- The Minimax Value Function (V(D, G))
This minimax value function is a core mathematical representation of the GAN framework that both networks compete over:
min_G max_D V(D, G) = E_{x~p_data(x)} [log D(x)] + E_{z~p_z(z)} [log (1 - D(G(z)))]
The minimax value function above is at the heart of GANs. Breaking its two expectation terms down:
- E_{x~p_data(x)} [log D(x)]: This represents the expectation over real data samples. The discriminator (D) tries to maximize this term by outputting a high probability (close to 1) that the real data (x) is real.
- E_{z~p_z(z)} [log (1 - D(G(z)))]: This represents the expectation over the noise distribution. The discriminator tries to maximize this term by assigning fake data (G(z)) a probability close to 0, which pushes log(1 - D(G(z))) toward log(1) = 0, its maximum possible value. The generator works to minimize the same term by producing fakes that D scores close to 1, which drives log(1 - D(G(z))) toward negative infinity. Put simply, the generator works to minimize the chance of its fake images being labeled as fake.
- Discriminator Loss
The discriminator’s goal is to correctly classify real data as real and generated data as fake; this is how it maximizes the value function. In practice, that is written as minimizing the loss:
J_D = - (1/m) Σ log D(x_i) - (1/m) Σ log(1 - D(G(z_i)))
- D(x_i): The discriminator’s prediction for a real data sample.
- D(G(z_i)): The discriminator’s prediction for a generated sample.
- log(1 - D(G(z_i))): The part of the loss that penalizes the discriminator for wrongly labeling fake data as real.
- Generator Loss
The generator’s goal is to fool the discriminator into believing its generated samples are real; in other words, it wants to minimize how often its fakes get labeled as fake. This is how it minimizes the value function. The loss can be written as (a code sketch of both losses follows below):
J_G = (1/m) Σ log(1 - D(G(z_i)))
- J_G: Generator’s loss.
- G(z_i): Fake data generated from a noise vector z_i.
- D(G(z_i)): Probability that the discriminator assigns to the generated fake data. The generator aims to make this probability as close to 1 as possible.
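To see how these formulas translate into code, here’s a minimal, hedged sketch of J_D and J_G in PyTorch, written with raw log terms so they mirror the equations above. In practice, most implementations use the equivalent binary cross-entropy form, and the generator usually minimizes the “non-saturating” variant shown last, because the minimax form gives weak gradients early in training.

```python
import torch

EPS = 1e-8  # numerical safety so we never take log(0)

def discriminator_loss(D, G, real_x, z):
    """J_D = -(1/m) Σ log D(x_i) - (1/m) Σ log(1 - D(G(z_i)))."""
    d_real = D(real_x)
    d_fake = D(G(z).detach())  # detach: this loss only updates D
    return -(torch.log(d_real + EPS).mean() + torch.log(1 - d_fake + EPS).mean())

def generator_loss_minimax(D, G, z):
    """Minimax form: J_G = (1/m) Σ log(1 - D(G(z_i))), which G minimizes."""
    return torch.log(1 - D(G(z)) + EPS).mean()

def generator_loss_non_saturating(D, G, z):
    """Non-saturating form commonly used in practice: -(1/m) Σ log D(G(z_i))."""
    return -torch.log(D(G(z)) + EPS).mean()
```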
Why GANs Are Hard to Train
GANs are famously unstable because of the following challenges:
- Mode Collapse: The generator learns to produce only a few varieties of data (in my case, images), resulting in every generated image looking the same.
- Training Imbalance: If the discriminator becomes too strong, the generator can’t learn. If the generator becomes too strong, the discriminator becomes useless. As Goodfellow’s paper puts it, “D must be synchronized well with G during training (in particular, G must not be trained too much without updating D, to avoid ‘the Helvetica scenario’ in which G collapses too many values of z to the same value of x).”
- Gradient Issues: The generator cannot update properly when gradients vanish or explode.
- Hardware Issues: This one is from personal experience. The computational resources needed to train GAN models are significant (depending on how large your datasets are). A CPU struggles badly, whereas a GPU is ideal, so the latter is what you should use if you can. In my case, I used a CPU, and it gave me all sorts of issues fine-tuning datasets and running epochs.
Common Fixes
Over the years, researchers have come up with tricks like:
- Wasserstein loss (WGAN) for more stable training
- Spectral Normalization, which keeps the discriminator under control
- Feature matching, which pushes the generator to produce a richer variety of outputs
- Careful balancing of batch size, learning rate, and architecture
GANs are powerful but temperamental, so they need constant tuning. In other words, you’d have to tune a lot before your model reaches the evaluation targets you’ve set for it. (One of these fixes, spectral normalization, is sketched in code below.)
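To make one of these fixes concrete: PyTorch ships spectral normalization as a one-line wrapper around a layer, which constrains the discriminator’s weights and keeps its gradients from blowing up. The architecture here is illustrative, not the one from my project.

```python
import torch.nn as nn
from torch.nn.utils import spectral_norm

# A discriminator whose linear layers are wrapped in spectral normalization.
# Wrapping a layer rescales its weight matrix by its largest singular value
# at every forward pass, which keeps the discriminator "under control".
stable_discriminator = nn.Sequential(
    spectral_norm(nn.Linear(28 * 28, 256)),
    nn.LeakyReLU(0.2),
    spectral_norm(nn.Linear(256, 1)),
    nn.Sigmoid(),
)
```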
Types of GANs
- Deep Convolutional GAN (DCGAN) 2015 — One of the earliest and most influential GAN architectures, it uses convolutional neural networks for both the generator and discriminator to generate realistic images. The generator creates images from random noise, and the discriminator tries to differentiate between real and fake images. Both networks improve through adversarial training, leading the generator to produce high-quality synthetic images.
- Wasserstein Generative Adversarial Networks (WGAN) 2017 — This uses the Wasserstein distance, aka Earth Mover’s distance, as a loss function in measuring the difference between generated and real data. WGAN uses this distance to provide a more stable and informative gradient to the generator, which leads to less mode collapse and more stability in the training process.
- Cycle Generative Adversarial Network (CycleGAN) 2017 — This enabled unpaired image-to-image translation (converting an image from one domain to another without needing corresponding pairs of images). CycleGAN uses two generators, two discriminators, and a cycle-consistency loss to ensure that when an image is translated to a new domain and back, it results in a reconstruction that’s close to the original image. For example, even without paired training data, we could use this to transform images like horses to zebras, old houses to modern houses.
- Style Generative Adversarial Network (StyleGAN) 2018-2020 — This is NVIDIA’s breakthrough for generating extremely high-quality, high-resolution faces and images. It introduces a modular generator architecture with a mapping network, an intermediate latent space, and style vectors for better control of image features. StyleGAN improves on previous GAN models and is used in applications that include AI art generation, digital design, and medical imaging research.
- BigGAN (2018) — Developed by researchers at DeepMind and Heriot-Watt University, this scaled GANs up massively for class-conditional ImageNet generation, producing high-resolution, high-fidelity, and diverse images from large datasets. Key innovations include very large batch sizes, self-attention layers for improved image quality and diversity, and the “truncation trick” for trading off image fidelity against variety. All of this made it a state-of-the-art model for realistic image synthesis.
Evaluation Metrics
GANs do not have ‘accuracy’ like classifiers. They are instead judged on quality and diversity. The following metrics are used to evaluate them:
- FID (Fréchet Inception Distance): It compares distributions of real vs generated images using features extracted by an ImageNet-pretrained Inception-V3 network. The lower the FID, the better. A perfect score of 0 means identical image sets. It’s like comparing the “average vibe” of two photo albums. Basically, you’re checking how close the generated photo album feels to the real one in terms of looks. (There’s a short code sketch of FID and LPIPS after this list.)
- KID (Kernel Inception Distance): It measures the quality and diversity of generated images by comparing their feature representations by calculating the Maximum Mean Discrepancy (MMD) using polynomial kernels. While it is similar to FID, KID provides an unbiased estimate of the distance between distributions, which makes it more suitable for smaller sample sizes. A lower KID score means images of higher quality and diversity have been generated.
- LPIPS (Learned Perceptual Image Patch Similarity): This measures the perceptual closeness between two images using features from a pre-trained deep neural network. It considers whether two images look alike to humans, aligning better with human perception than traditional methods like Structural Similarity Index Measure (SSIM) or Peak Signal-to-Noise Ratio (PSNR). A low LPIPS score indicates that the images are perceptually similar.
- Human Evaluation: This remains the gold standard. It is basically asking people, “Which image looks more real?”
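Here’s the sketch I promised. It assumes the torchmetrics and lpips packages (my choice of tooling for this illustration, not necessarily what any given project uses), and it runs on tiny random tensors just to show the API; real evaluation uses thousands of images.

```python
import torch
from torchmetrics.image.fid import FrechetInceptionDistance
import lpips

# FID: compare feature statistics of a batch of real vs generated images.
# Images are uint8 tensors of shape (N, 3, H, W) with values in [0, 255].
real_imgs = torch.randint(0, 256, (16, 3, 256, 256), dtype=torch.uint8)
fake_imgs = torch.randint(0, 256, (16, 3, 256, 256), dtype=torch.uint8)

fid = FrechetInceptionDistance(feature=64)  # small feature layer to keep the toy example light; 2048 is standard
fid.update(real_imgs, real=True)
fid.update(fake_imgs, real=False)
print("FID:", fid.compute().item())  # lower is better

# LPIPS: perceptual distance between two individual images, scaled to [-1, 1].
lpips_fn = lpips.LPIPS(net="alex")
img_a = torch.rand(1, 3, 256, 256) * 2 - 1
img_b = torch.rand(1, 3, 256, 256) * 2 - 1
print("LPIPS:", lpips_fn(img_a, img_b).item())  # lower means more perceptually similar
```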
Why I Chose This Project
I knew the applications of GANs included (but were not limited to) creating hyper-realistic faces, imaginary landscapes, and even deepfakes of celebrities. But I wanted something that I could relate to. Something close to home and grounded in an African context.
Two ideas stood out:
- AfroCover: African music is global, but many upcoming artists can’t afford professional designers to help with their album art/covers. So I thought, what if a GAN could generate unique, Afro-inspired album art? It doesn’t have to be the complete cover design; even a gradient look and feel would have done the job here. So a StyleGAN trained to generate Afro-inspired music cover art stood out as the main idea.
- Lagos2Duplex: There is a lot of ageing housing stock in Lagos, Nigeria. If you take a walk through the city, you’ll find many colonial-era-looking bungalows, their paint peeling like forgotten history. And just a few streets away, glass-and-steel duplexes rise. I’ve always found this sharp contrast in the Lagos housing scene a little jarring, along with the general acceptance of the clear divide along status and class lines, often regardless of the neighborhood.
So I thought, what if a CycleGAN trained to translate old houses into modern duplex concepts could imagine these transformations?
Could we take a photo of an old Lagos house and instantly see a modern duplex version, not drawn by an architect but generated by an algorithm?
This is the kind of thing that’s possible with Generative Adversarial Networks, so for this project I built LagosGAN, a two-part experiment in synthetic creativity. The goal of the monorepo containing these projects was not to replace designers or architects, but to explore how AI might act as a sketch partner by providing visual starting points that spark creativity and conversation.
Data & Ethics
The soul of any GAN is data. To build the datasets for both models, I curated licensed and openly available African album art for AfroCover. For Lagos2Duplex, I collected photos of old Lagos houses and modern duplexes by building a script that crawled a few key platforms. The scraping results weren’t 100% clean, but I managed to gather almost 3,000 images, enough to pick a core sample from.
Every dataset entry was checked for usage rights and tagged with its source metadata. To keep the boundaries clear, I built Dataset Cards that documented sources, licenses, and limitations. Let me be explicit in this article: these images are concept art, not construction drawings.
Building LagosGAN
When I started the project, I had a simple question in mind: What does AI-generated creativity that celebrates African culture rather than just remixing Western datasets look like? The start of the answer to this question was a single workspace hosting two very different but complementary GAN projects:
AfroCover (StyleGAN2-ADA)
With this project, I curated roughly 1,200 album covers, cleaned them up to 256×256 resolution, and fine-tuned a StyleGAN2 generator with Adaptive Discriminator Augmentation (ADA) to stabilize learning on the small dataset. StyleGAN is perfect for this because it is famous for generating ultra-sharp, high-resolution images. It works by controlling “styles” at different layers of the network: things like color palettes, patterns, and textures.
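The cleanup was essentially a resize-and-validate pass over the raw downloads. Here’s a minimal sketch of that idea; the folder paths and file extension are placeholders, not the actual layout of my repo.

```python
from pathlib import Path
from PIL import Image

SRC = Path("data/afrocover/raw")        # placeholder input folder
DST = Path("data/afrocover/processed")  # placeholder output folder
DST.mkdir(parents=True, exist_ok=True)

for img_path in SRC.glob("*.jpg"):
    try:
        img = Image.open(img_path).convert("RGB")  # drop alpha channels / grayscale quirks
    except OSError:
        continue                                    # skip corrupted downloads
    img = img.resize((256, 256), Image.LANCZOS)     # square, fixed-size inputs for StyleGAN2-ADA
    img.save(DST / img_path.name, quality=95)
```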
A few practical notes:
- Training target: 256×256 imagery and 100 epochs (I could only get through 10 because of the hardware issues that played a big part in how training went), with the Adam optimizer at a 0.002 learning rate.
- Metrics: I tracked FID and LPIPS in Weights & Biases (wandb); a minimal logging sketch follows this list. A CPU-only run landed around FID ≈ 464, which is good enough to demo. I noted, though, that it’ll take extended GPU time to hit the <60 target in the roadmap.
- Challenge: Training on a small dataset, the model first produced noisy blobs. After tuning the augmentation strength, it began producing vibrant cover-like images, some wild and some strikingly usable.
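For the metric tracking mentioned above, the wandb side boils down to a few calls. Here’s a minimal sketch with placeholder project and run names, and dummy values standing in for the real losses and scores.

```python
import wandb

# Log GAN training metrics to Weights & Biases so runs can be compared later.
run = wandb.init(project="lagosgan", name="afrocover-cpu-baseline")  # names are placeholders

for epoch in range(10):
    # ... one epoch of training happens here ...
    wandb.log({
        "epoch": epoch,
        "train/loss_G": 0.0,   # placeholder values; log your real losses here
        "train/loss_D": 0.0,
        "val/FID": 0.0,
        "val/LPIPS": 0.0,
    })

run.finish()
```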
In the Gradio demo, the AfroCover tab samples from the generator. Every time you hit “Generate,” a fresh hybrid of vibrant palettes, bold typography, and geometric motifs is produced. Under the hood, the model loads its weights straight from the Hugging Face model repo (theelvace/afrocover/latest.pt), which makes updates as simple as pushing a new checkpoint.
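Loading weights straight from the Hub boils down to a single hf_hub_download call. Here’s a minimal sketch; the checkpoint layout and the generator constructor are assumptions about the setup, not the actual demo code.

```python
import torch
from huggingface_hub import hf_hub_download

# Pull the latest AfroCover checkpoint from the Hugging Face model repo.
ckpt_path = hf_hub_download(repo_id="theelvace/afrocover", filename="latest.pt")
state = torch.load(ckpt_path, map_location="cpu")

# generator = build_generator()          # hypothetical constructor for the demo's generator
# generator.load_state_dict(state["G"])  # assumed checkpoint layout; yours may differ
```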
Lagos2Duplex (CycleGAN)
I chose CycleGAN for the Lagos2Duplex project and built:
- Two generators: Lagos → Duplex and Duplex → Lagos
- Two PatchGAN discriminators, one for each domain (A & B)
The setup had Domain A = old Lagos houses, and Domain B = modern duplexes. There was a heavy reliance on cycle-consistency loss (set to 10× the adversarial loss), ensuring that A → B → A reconstructs the original.
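Here’s a minimal sketch of how those pieces fit together for one generator update. The module arguments are placeholders for the actual networks, and I’m assuming the least-squares adversarial term used in the original CycleGAN recipe.

```python
import torch
import torch.nn as nn

l1 = nn.L1Loss()
LAMBDA_CYCLE = 10.0  # cycle loss weighted 10x the adversarial loss, as noted above

def generator_objective(G_AB, G_BA, D_A, D_B, real_A, real_B):
    """One forward pass of the CycleGAN generator objective (sketch).

    G_AB: old Lagos house -> modern duplex, G_BA: duplex -> old house.
    D_A / D_B: PatchGAN discriminators for domains A and B.
    """
    fake_B = G_AB(real_A)  # A -> B
    fake_A = G_BA(real_B)  # B -> A

    # Adversarial terms (least-squares style): the generators want D to score fakes as real (1).
    adv = torch.mean((D_B(fake_B) - 1) ** 2) + torch.mean((D_A(fake_A) - 1) ** 2)

    # Cycle-consistency terms: A -> B -> A and B -> A -> B should reconstruct the inputs.
    cyc = l1(G_BA(fake_B), real_A) + l1(G_AB(fake_A), real_B)

    return adv + LAMBDA_CYCLE * cyc
```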
- Dataset: I curated ~1,000 images per domain. Domain A covers older Lagos residential styles, while Domain B features contemporary duplex facades. In every epoch, each image goes through resizing, normalization, and augmentation, which helped give the model enough variety.
- Training: I earmarked 200 epochs but could only complete 15 because of hardware issues. The Adam optimizer was set at a 0.0002 learning rate. I tracked adversarial losses (train/lossG, train/lossD_*), cycle losses, and an LPIPS perceptual distance metric. WandB’s generated image grids helped me visually compare runs; the run with the lowest cycle losses and the cleanest visuals became the production checkpoint.
- Output: Users can upload a Lagos house photo via the Gradio interface. The app then routes the photo through the deployed generator and produces a duplex concept that still hints at the original building’s structure. The results were far from perfect because of the already mentioned hardware difficulties, but you can still see the vision in the outputs.
- Challenge: Hardware (for the umpteenth time, lol), plus early runs warping windows and smearing rooftops. Adding an identity loss and switching to CUT (Contrastive Unpaired Translation) improved the structure.
Note: I’ve had a harder time training the Lagos2Duplex model, and I will update this article with a clear image of a successful run soon.
The Monorepo Glue
Everything, from the data prep scripts and training configs to the checkpoints and docs, lives in one repo, so shipping updates can be done quickly. Some highlights in this repo include:
- afrocover/ and lagos2duplex/ host the training code, configs, and evaluation scripts like FID, LPIPS, etc.
- docs/model_cards.md spells out architecture specifics, datasets, limitations, and ethical considerations for each model.
- demo/app.py is the shared Gradio interface for both AfroCover and Lagos2Duplex. It auto-downloads weights from Hugging Face and exposes controls like “style seed” or “number of covers” (a stripped-down sketch of the interface follows this list).
- scripts/ contains helper utilities like data preprocessing and short training/eval runs.
- Hugging Face hosts the production checkpoints:
- theelvace/afrocover/latest.pt
- theelvace/lagos2duplex/latest.pt
- The Gradio app loads them on boot, so refreshing them just means restarting the Space.
- A history of all training runs is kept in Weights & Biases. Sorting runs by VAL/FID or VAL/LPIPS lets me spot the best-performing checkpoints and pair them with generated galleries to see how the numbers translate into visuals.
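As a rough idea of what the demo/app.py interface boils down to, here’s a stripped-down Gradio sketch. The function bodies are placeholders and the exact controls are assumptions, not the real app.

```python
import gradio as gr
from PIL import Image

def generate_covers(seed, num_covers):
    # Placeholder: the real app samples from the AfroCover StyleGAN2-ADA checkpoint.
    return [Image.new("RGB", (256, 256), "purple") for _ in range(int(num_covers))]

def translate_house(photo):
    # Placeholder: the real app runs the photo through the Lagos2Duplex generator.
    return photo

with gr.Blocks(title="LagosGAN") as demo:
    with gr.Tab("AfroCover"):
        seed = gr.Number(label="Style seed", value=42)
        count = gr.Slider(1, 8, value=4, step=1, label="Number of covers")
        gallery = gr.Gallery(label="Generated covers")
        gr.Button("Generate").click(generate_covers, [seed, count], gallery)
    with gr.Tab("Lagos2Duplex"):
        photo = gr.Image(label="Old Lagos house", type="pil")
        result = gr.Image(label="Duplex concept")
        gr.Button("Translate").click(translate_house, photo, result)

demo.launch()
```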
Evaluation
Working on this project has made me realize how tricky evaluating GANs is. Accuracy here doesn’t make sense. Instead, metrics that capture realism and diversity are used. I already talked about some of these metrics—FID, KID, LPIPS. These were what I tracked, but numbers only tell part of the story. With this in mind, I ran a small human test by showing the outputs to some people and asking two questions:
- Does this look real?
- Would you believe this is an album art/duplex mockup?
For AfroCover, 75% of respondents said they could see one of the covers being used for an album or song. For Lagos2Duplex, 40% said the duplexes looked “real enough to spark ideas.”
Not perfect, but promising.
Lessons Learned
- I’ve learned how fragile GAN training can be, how you need plenty of computing power (GPUs over CPUs, as I’ve come to find out the hard way), and how essential ethical data sourcing is.
- GANs respond to curation. AfroCover’s output only feels authentic because the dataset is rich in real African art, not scraped Western covers.
- Metrics + visual inspection matter with GANs. FID, LPIPS, and cycle loss pointed me towards which runs to trust, but the image grids convinced me when a checkpoint was usable.
- Lean runtime is the way to go. Shipping the app without baked-in checkpoints (which can be quite large) and downloading weights on startup keeps the repo manageable and updates painless.
Conclusion
I decided to learn about GANs and how they work because I thought they were a pivotal part of the evolution of the AI/ML field. The podcast featuring Ian Goodfellow also had me sold. Building LagosGAN has been both a very frustrating and a very fun experience. This was less about chasing state-of-the-art metrics and more about exploring what happens when a technology like GANs is put to work on African problems.
Hugging Face Space — https://huggingface.co/spaces/theelvace/lagos-gan-demo