World of Software
Copyright © All Rights Reserved. World of Software.

The Limits of LLM-Generated Unit Tests | HackerNoon

News Room
Last updated: 2025/10/24 at 12:02 PM
News Room Published 24 October 2025

The OpenAI Codex documentation includes a simple example prompt:

Write unit tests for utils/date.ts.

It sounds effortless – just ask Codex to write tests, and it will. And in most cases, it does: the tests compile, run, and even pass. Everyone seems satisfied.

But this raises a crucial question: are those tests actually good?

Let’s take a step back and think: why do we write tests? We use tests to check our code against the requirements. When we simply ask an LLM to write tests, are we sure the LLM knows all those requirements?

If no additional context is provided, all the LLM has is the code and, at best, inline documentation and comments. But is that enough? Let’s find out with a few experiments, starting from a simple specification.

Requirements

Imagine that we have the following requirements:

  • We need to implement a new Product Service in the service layer of our application.
  • The service should have a method to retrieve the product price by product ID.
  • If the product ID is empty, an exception should be thrown with code 0.
  • The method should retrieve the product by ID from the database (using the product repository service).
  • If the product is not found, another exception should be thrown with code 1.
  • The product price should be returned.
  • The Product entity also has: ID, name, price, and cost price.

We will use PHP as an example, but the conclusions of this article are applicable to all languages.

Baseline Implementation

The following classes make up our starting point:

final class ProductService
{
    public function __construct(private ProductRepository $repository)
    {
    }

    /**
     * Returns the product price or throws on error.
     *
     * @throws EmptyProductIdException  When product ID is empty (code 0)
     * @throws ProductNotFoundException When product is not found (code 1)
     */
    public function getProductPrice(string $productId): float
    {
        $productId = trim($productId);
        if ($productId === '') {
            throw new EmptyProductIdException();
        }

        $product = $this->repository->findById($productId);
        if ($product === null) {
            throw new ProductNotFoundException($productId);
        }

        return $product->getPrice();
    }
}

Notice that the getProductPrice method is documented with a straightforward docblock describing its return value and expected exceptions.

The following supporting classes are not central to the article but are included for completeness. Feel free to skip them if you’re focusing on the main idea.

final class Product
{
    public function __construct(
        private string $id,
        private string $name,
        private float $price,
        private float $costPrice
    ) {
    }

    public function getId(): string
    {
        return $this->id;
    }

    public function getName(): string
    {
        return $this->name;
    }

    public function getPrice(): float
    {
        return $this->price;
    }

    public function getCostPrice(): float
    {
        return $this->costPrice;
    }
}

final class ProductNotFoundException extends RuntimeException
{
    public function __construct(string $productId, ?Throwable $previous = null)
    {
        parent::__construct("Product not found: {$productId}", 1, $previous);
    }
}

final class EmptyProductIdException extends InvalidArgumentException
{
    public function __construct(string $message = 'Product ID is empty', ?Throwable $previous = null)
    {
        parent::__construct($message, 0, $previous);
    }
}
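
The listings above never show ProductRepository itself. The following is a minimal sketch of what it could look like, assuming only what the service code implies: a findById method that takes a string ID and returns a Product or null. The InMemoryProductRepository class is hypothetical and not part of the original project; it is included only so the sketch runs on its own, and Product is repeated in trimmed form for the same reason.

```php
<?php
// Trimmed-down copy of Product so this sketch is self-contained.
final class Product
{
    public function __construct(
        private string $id,
        private string $name,
        private float $price,
        private float $costPrice
    ) {
    }

    public function getId(): string
    {
        return $this->id;
    }

    public function getPrice(): float
    {
        return $this->price;
    }
}

// Assumption: this shape is inferred from the findById() call in ProductService.
interface ProductRepository
{
    // Returns null when no product has the given ID.
    public function findById(string $productId): ?Product;
}

// Hypothetical in-memory implementation, usable as a simple test double.
final class InMemoryProductRepository implements ProductRepository
{
    /** @var array<string, Product> */
    private array $products = [];

    public function __construct(Product ...$products)
    {
        foreach ($products as $product) {
            $this->products[$product->getId()] = $product;
        }
    }

    public function findById(string $productId): ?Product
    {
        return $this->products[$productId] ?? null;
    }
}

$repository = new InMemoryProductRepository(new Product('p1', 'Product 1', 19.99, 10.0));
$found = $repository->findById('p1');        // a Product instance
$missing = $repository->findById('missing'); // null
```

The generated tests in the next section use PHPUnit mocks instead, which lets them also assert how often findById is called.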

Baseline Tests

Let’s use OpenAI Codex to generate some tests for our project. I used Codex v0.25 with the GPT-5 high model.

For this baseline, I deliberately took the naive approach and asked:

▌Write unit tests for all public methods of the ProductService

Codex generated reasonable tests: two successful cases and two unsuccessful cases. It even inferred from the docblock that exceptions should carry specific error codes.

use PHPUnit\Framework\TestCase;

final class ProductServiceTest extends TestCase
{
    public function testGetProductPriceReturnsPriceForExistingProduct(): void
    {
        $repo = $this->createMock(ProductRepository::class);
        $product = new Product('p1', 'Product 1', 19.99, 10.0);

        $repo->expects($this->once())
            ->method('findById')
            ->with('p1')
            ->willReturn($product);

        $service = new ProductService($repo);
        $price = $service->getProductPrice('p1');

        $this->assertEqualsWithDelta(19.99, $price, 0.0001);
    }

    public function testGetProductPriceTrimsIdAndReturnsPrice(): void
    {
        $repo = $this->createMock(ProductRepository::class);
        $product = new Product('ABC', 'Trimmed Product', 5.5, 2.0);

        $repo->expects($this->once())
            ->method('findById')
            ->with('ABC') // ensure the ID is trimmed before lookup
            ->willReturn($product);

        $service = new ProductService($repo);
        $price = $service->getProductPrice('  ABC  ');

        $this->assertEqualsWithDelta(5.5, $price, 0.0001);
    }

    public function testGetProductPriceThrowsOnEmptyId(): void
    {
        $repo = $this->createMock(ProductRepository::class);
        $service = new ProductService($repo);

        $this->expectException(EmptyProductIdException::class);
        $this->expectExceptionCode(0);
        $this->expectExceptionMessage('Product ID is empty');

        $service->getProductPrice('   '); // becomes empty after trim
    }

    public function testGetProductPriceThrowsWhenProductNotFound(): void
    {
        $repo = $this->createMock(ProductRepository::class);

        $repo->expects($this->once())
            ->method('findById')
            ->with('missing')
            ->willReturn(null);

        $service = new ProductService($repo);

        $this->expectException(ProductNotFoundException::class);
        $this->expectExceptionCode(1);
        $this->expectExceptionMessage('Product not found: missing');

        $service->getProductPrice('missing');
    }
}

At first glance, these tests look solid. But that’s also the limitation: they only reflect what’s written in the code and comments. Do they really guard against deeper mistakes?

Mutations

To explore this, I deliberately mutated the ProductService code and asked Codex to generate tests again. Each time, I deleted the existing tests, restarted Codex to clear the context, and used the same prompt: “Write unit tests for all public methods of the ProductService.”
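
To make the idea concrete, here is a rough sketch of what such mutations look like when checked against the requirements rather than against the code. It is not the setup I used with Codex: getProductPrice is reduced to a closure over a plain array, and the makeGetProductPrice factory, the mutation names, and violatedRequirement are all hypothetical helpers invented for this illustration.

```php
<?php
// Builds a stand-in for getProductPrice(); $mutation selects which line is altered.
function makeGetProductPrice(array $products, string $mutation = 'none'): Closure
{
    return function (string $id) use ($products, $mutation): float {
        $id = trim($id);
        $isEmpty = ($id === '');
        if ($mutation === 'flip-empty-check') {
            $isEmpty = !$isEmpty;
        }
        if ($isEmpty) {
            throw new InvalidArgumentException('Product ID is empty', 0);
        }

        $product = $products[$id] ?? null;
        $isMissing = ($product === null);
        if ($mutation === 'flip-found-check') {
            $isMissing = !$isMissing;
        }
        if ($isMissing) {
            throw new RuntimeException("Product not found: {$id}", 1);
        }

        return $mutation === 'return-cost-price' ? $product['costPrice'] : $product['price'];
    };
}

// Checks derived from the requirements; stops at the first violation.
// Returns a description of the violation, or null if every requirement holds.
function violatedRequirement(Closure $getProductPrice): ?string
{
    try {
        if ($getProductPrice('p1') !== 19.99) {
            return 'existing product: wrong value returned';
        }
    } catch (Throwable $e) {
        return 'existing product: unexpected ' . get_class($e);
    }

    try {
        $getProductPrice('   ');
        return 'empty ID: no exception';
    } catch (Throwable $e) {
        if (!$e instanceof InvalidArgumentException || $e->getCode() !== 0) {
            return 'empty ID: wrong exception or code';
        }
    }

    try {
        $getProductPrice('missing');
        return 'unknown ID: no exception';
    } catch (Throwable $e) {
        if (!$e instanceof RuntimeException || $e->getCode() !== 1) {
            return 'unknown ID: wrong exception or code';
        }
    }

    return null;
}

$products = ['p1' => ['price' => 19.99, 'costPrice' => 7.49]];

foreach (['none', 'flip-empty-check', 'flip-found-check', 'return-cost-price'] as $mutation) {
    $violation = violatedRequirement(makeGetProductPrice($products, $mutation));
    echo $mutation, ': ', $violation ?? 'all requirements hold', PHP_EOL;
}
```

Because the expected values here come from the specification, all three mutations are flagged by the very first check. The question in the cases below is whether tests generated from the mutated code alone do the same.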

Case 1 — Flipped Empty Condition

In this mutation, I inverted the condition that checks whether the product ID is empty:

-        if ($productId === '') {
+        if ($productId !== '') {

Codex wrote unit tests equivalent to the previous ones, with only minor differences in line order, product names, and similar details. The tests were correct: Codex identified the error in the code, wrote appropriate tests, and even proposed a fix for the method:

Why Tests Fail

- In src/Service/ProductService.php:
    - The condition is inverted:
    - Current: `if ($productId !== '') { throw new EmptyProductIdException(); }`
    - Intended: throw when empty/whitespace after trim.
- This causes:
    - Non-empty IDs to wrongly throw EmptyProductIdException.
    - Empty/whitespace IDs to continue and hit the repository call unexpectedly.

Suggested Fix (not applied)

- In src/Service/ProductService.php, change the condition to:
    - if ($productId === '') { throw new EmptyProductIdException(); }

This suggests that Codex is able to handle at least obvious logical inversions.

Case 2 — Flipped Condition During Product Search

In this mutation, I inverted the condition that checks whether the product exists in the repository:

-        if ($product === null) {
+        if ($product !== null) {

The outcome was the same as in Case 1: Codex produced correct tests and proposed a fix for the bug.

Case 3 — Return Cost Price Instead of Price

In this mutation, I changed the method to return the product’s cost price instead of its price:

-        return $product->getPrice();
+        return $product->getCostPrice();

Codex again generated four tests, but it did not realize that the method must return the product price rather than the cost price – even though this is stated in the docblock. The generated suite confirmed the mutated behavior instead of the requirement.

...
    public function testGetProductPriceReturnsCostPriceWhenFound(): void
    {
        $repo = $this->createMock(ProductRepository::class);
        $product = new Product('p1', 'Product 1', /* $price */ 19.99, /* $costPrice */ 7.49);
        $repo->expects($this->once())
            ->method('findById')
            ->with('p1')
            ->willReturn($product);

        $service = new ProductService($repo);

        $price = $service->getProductPrice('p1');

        $this->assertSame(7.49, $price);
    }
...

Unlike the earlier mutations, this one slipped through: Codex followed the code and accepted the cost price as if it were correct.
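
A side effect worth spelling out: a test like this does not just miss the bug, it locks it in. The sketch below reduces the scenario to its core, assuming a product represented as a plain array and the two versions of the return statement as closures. All names in it are hypothetical; only the values 19.99 and 7.49 are taken from the generated test.

```php
<?php
$product = ['price' => 19.99, 'costPrice' => 7.49];

$correct = fn (array $p): float => $p['price'];      // what the requirement asks for
$mutated = fn (array $p): float => $p['costPrice'];  // the Case 3 mutation

// The comparison at the heart of the generated test.
$generatedAssertion = fn (float $actual): bool => $actual === 7.49;

// A comparison derived from the requirement "the product price should be returned".
$requirementAssertion = fn (float $actual): bool => $actual === 19.99;

foreach (['correct' => $correct, 'mutated' => $mutated] as $name => $getPrice) {
    $actual = $getPrice($product);
    printf(
        "%s: generated assertion %s, requirement assertion %s\n",
        $name,
        $generatedAssertion($actual) ? 'passes' : 'fails',
        $requirementAssertion($actual) ? 'passes' : 'fails'
    );
}
// Prints:
// correct: generated assertion fails, requirement assertion passes
// mutated: generated assertion passes, requirement assertion fails
```

In other words, once someone fixes the method, the generated test starts failing and has to be argued with, which is the opposite of what a regression test is for.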

Single Coding Session Run

Before drawing conclusions, let’s run another experiment. This time, I asked Codex to write the code itself first, so it had more context before generating tests. All of the following prompts were issued within a single Codex session:

▌Create a PHP file containing a PHP class. The class is called ProductService. 
It has a method called getProductPrice. It receives a product ID as a string and checks that it is not empty. 
If it is empty, throw an exception with error code 0 (empty). 
It then retrieves the product by ID from the database using another service, ProductRepository. 
If the repository returns null, throw an exception with error code 1 (not found). 
Then get the product price. (Create a simple Product class with only ID, name, price, and costPrice.) 
All classes should be in their own files. File structure: an src folder containing 
Domain, Exception, Repository, and Service directories.

Codex produced a similar ProductService, though it didn’t add a docblock for the getProductPrice method. This made it even more interesting to see how it would handle mutations:

...
    public function getProductPrice(string $productId): float
    {
        if (trim($productId) === '') {
            throw ProductException::emptyId();
        }

        $product = $this->productRepository->findById($productId);

        if ($product === null) {
            throw ProductException::notFound($productId);
        }

        return $product->getPrice();
    }
...

First, I asked Codex to write tests. No surprises here: four correct unit tests were written, including checks for exception error codes in the negative cases.

Then I mutated the service in the same way as before and deleted the tests. The only difference was that I slightly modified the prompt so that Codex would notice the tests were gone:

▌Check whether tests for all public methods of ProductService still exist, and write them if they are missing  

Codex successfully handled the inverted conditions: bugs were fixed automatically, and correct tests were generated.

And the most interesting part: the same happened when I replaced getPrice with getCostPrice:

Fixes Made

- Restored missing test file tests/Service/ProductServiceTest.php.
- Corrected ProductService::getProductPrice to return $product->getPrice().

So, as expected, even without additional context from a docblock, Codex was able to generate correct tests and repair the code, relying on the initial requirements given at the start of the session.

Conclusion

These experiments show that the naive approach to writing tests with an LLM does not deliver the expected results. Tests will be generated, but they will simply mirror the current code, even if that code contains bugs. An LLM can spot obvious logic errors such as an inverted condition, but when correctness depends on a business rule that is not obvious from the code alone (here, price versus cost price), the generated tests will not meet the goals of unit testing.

Here are a few practical lessons:

  • Provide more context. Add inline comments and documentation blocks before generating tests. This may help, but it still cannot guarantee correct unit tests or meaningful bug detection.
  • Write code and tests in the same session. If the LLM writes the code and the tests together, it has a better chance of enforcing the original requirements, as the single-session run demonstrated.
  • Review everything. Unit tests from an LLM should never be committed blindly — they require the same careful review as hand-written tests.

LLMs can certainly help with testing, but without clear requirements and human review, they will only certify the code you already have — not the behavior you actually need.

Disclaimer: Although I’m currently working as a Lead Backend Engineer at Bumble, the content in this article does not refer to my work or experience at Bumble.
