Spotting Image Differences in Visual Software Testing with AI

News Room | Published 11 June 2025, last updated 9:27 AM

Key Takeaways

  • Spotting differences between two images is an important task in visual test automation when a screenshot needs to be compared to a previous version or a reference design.
  • Generative AI based on multimodal language models excels at recognizing and explaining an image’s content, but can only identify differences in aspects it has been explicitly trained on.
  • This problem is commonly solved with a convolutional neural network (CNN) that compares small image segments (e.g. 9×9 pixel regions) instead of individual pixels.
  • CNN solutions can be implemented with tools such as TensorFlow, PyTorch, and the Keras API.
  • High resolution displays can produce false positives because elements are displaced by more than a few pixels. Instead of increasing the window size to cover the length of such displacements, train the network to return the x and y values of the displacement vector rather than a boolean equality flag.

Comparing two images for structural changes is a task that AI surprisingly struggles with. Generative AI based on multimodal language models excels at recognizing and explaining an image’s content, but can only identify differences in aspects it has been explicitly trained on. Meanwhile, image comparison libraries require a high degree of alignment, often at pixel level, without tolerance for even the slightest distortion.

Spotting the difference is a relevant task in visual test automation when a screenshot needs to be compared to a previous version or a reference design. Ideally, we want to identify which elements have changed and how. Have things moved, scaled, or been replaced? Current technologies in visual test automation fail with one of two shortcomings: as this article shows, LLM-based methods miss even major layout issues with objects that they don’t recognize, while pixel-based algorithms report major differences when only a minor displacement of a few pixels has occurred.

Vision in Software Testing

The goal of software testing is to detect deviations from an expected state. The final and most difficult aspect of a program to verify is its eventual appearance on screen. Because our visual perception is extremely flexible, we are able to respond to tiny details and yet subconsciously compensate for major changes in position and coloring. Determining what constitutes a valid rearrangement versus a violation of expectation is therefore extremely difficult to formalize. To avoid these problems, most software tests only check internal technical states, such as DOM trees. However, with the increasing capabilities of AI, checking how a program appears to a human becomes more and more practical.

State of AI

Consider the following example of two maps. Since many of us played “spot the difference” as kids, we can find a missing street leading directly to the city center within several seconds. Because this task seems so incredibly simple to us, it is hard to believe how much trouble it causes computer vision algorithms. In addition to the removed street, the whole map was also shifted two pixels toward the bottom right. This seemingly minor detail makes all pixel based algorithms, such as Pixelmatch, resemble.js, Python Pillow, and OpenCV, fail. Generative AI models promise a deep understanding of images. Let’s see how they perform in this example.

Figure 1: Two versions of a map. The missing street is quickly detectable by humans, but not by AI.

With multimodal AI models we can directly upload both images and ask for a description of the difference. The analysis was done with Claude 3.7, Claude 4, Gemini 2.5 pro, ChatGPT-o3 and ChatGPT-4o. Claude and Gemini fail directly and answer that there is no substantial change. ChatGPT is powerful enough to fail only after generating Python code that analyzes the image pair with a number of available computer graphics libraries. Due to slight pixel misalignment between the two maps, all pixel based difference algorithms detect color changes at every edge, making the missing street visible but inconspicuous among all the false positives, as shown in figure 2. All of the previously mentioned image comparison algorithms produce similar results and are unable to highlight the relevant change.

Figure 2: Many false positives are found by pixel based image comparison algorithms.

The generative AI models reason about the images with impressive detail, recognizing the map’s origin, checking all names, and analyzing the color scheme. Yet, none of the models identifies the missing connection. In all cases, Claude, Gemini and ChatGPT explicitly confirmed the equivalence of all routes. In one of the runs, the latest model GPT-o3 summarizes its findings with: “Nothing substantive has changed”, as shown in figure 3.

Figure 3: No reference to the missing street is made by ChatGPT-o3 after 51 seconds of “thinking”.

How do humans do it?

Let’s take a moment to introspect how we, as humans, would approach such a task. You might even want to scroll up and observe how you compared the maps for differences. Most people need about 10 to 20 seconds to find the missing street.

Before we can start comparing image fragments we need to solve the correspondence problem. Your eyes would move left and right between the images, focusing on individual locations and alternating between various spots in both maps. Once you have found corresponding spots you can start examining the area of interest. Your gaze would not remain on one of the images. Tiny eye movements would allow your retina to fixate alternating positions, effectively overlaying corresponding image regions. Your eye muscles will compensate for the spatial distance between the two images. Color comparisons can already happen in the retina. The resulting signals then travel along the optic nerve to the thalamus, where significant visual features are filtered.

As you repeat this process across the entire image, you might find yourself forming hypotheses – such as “Is that road really missing?” – which prompt you to make an extra effort to verify your assumption. Humans perform hypothesis-driven perception: the visual cortex has ten times more connections leading from the cortex to the thalamus (and ultimately back to the eye) than in the other direction. This means we dedicate ten times more processing power to designing and refining our tests than to evaluating the resulting signals from the eye.

The procedure described in the previous section is a prime example of a chain of thought. Rather than directly producing a result, we consciously and subconsciously develop a strategy through multiple iterations, failures, and improved conclusions. This is precisely where generative AI struggles. While recent advancements have enabled chain-of-thought reasoning for textual content, applying the same concept to two-dimensional spatial analysis across multiple scales presents an even greater challenge. Generative AI models have learned to process image inputs much later than pure text prompts. Solving all vision related problems will likely take years – possibly even longer when considering that humans themselves only manage this task at the cost of being highly susceptible to optical illusions.

It’s important to note that AI excels at recognizing stimuli that frequently appear in its training set. This includes text, traffic situations, object close-ups, celebrities, and common visual patterns found in IQ tests. However, elements such as geographic maps, aesthetic alignments, and abstract or imaginative figures fall outside this domain – not only due to a lack of training data but also because they are difficult to label with words. It is much harder to describe visual differences than to derive differences in identified labels. As a result, image comparison remains a challenge and is likely to remain so for some time.

Tolerance to pixel movements

When comparing images, two distinct concepts must be considered. The first is determining whether the image pair is equal. The second is identifying the minimal change necessary to make them equal. In software testing, both steps are crucial. The first determines whether a test fails and requires human review, while the second facilitates a quick decision. Knowing that a button “only” moved has different implications than knowing it moved and changed. Without a reliable report of the actual change, software testers fall back to tedious manual comparisons in a “spot the difference” manner.

The simplest method for comparing two images is a pixel-by-pixel approach. This technique was attempted by ChatGPT in our previous map example. However, this method fails when applied to modern user interfaces with dynamic layouts. Pixel stability is not guaranteed – it can vary between versions or even at random when highly parallelized algorithms push the rendering engine to its speed limits. As a result, individual pixel colors are insufficient to deduce equivalent content. Spatial displacements require more complex forms of pattern recognition.
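As a minimal sketch of this naive approach (the file names and the tolerance threshold are placeholders, not taken from the article), a pixel-by-pixel diff with OpenCV could look like this:

import cv2
import numpy as np

# Naive pixel-by-pixel comparison: a shift of the whole layout by even a few
# pixels makes almost every edge light up as a difference.
img_a = cv2.imread("screenshot_old.png", cv2.IMREAD_GRAYSCALE)  # placeholder path
img_b = cv2.imread("screenshot_new.png", cv2.IMREAD_GRAYSCALE)  # placeholder path

diff = cv2.absdiff(img_a, img_b)          # per-pixel absolute difference
changed = diff > 16                       # arbitrary color tolerance
print(f"{changed.mean():.1%} of pixels differ")
cv2.imwrite("pixel_diff.png", changed.astype(np.uint8) * 255)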

A common solution to this problem is the use of a convolutional neural network (CNN). Training a network with entire image pairs would be computationally expensive, as realistic screenshots have high resolutions with millions of pixels. The problem can be simplified by comparing small segments instead of individual pixels. The following example uses a 9×9 pixel region. This is large enough to consider minor displacements when determining equality and small enough for a lightweight neural network. With 9x9x2 = 162 input nodes and one output node, the required network is minimal in size. Since each color channel can be treated separately, it is sufficient to train on monochrome samples.

Figure 4: 9×9 segment of the image pairs. The structural equality of image 1 and 3 can be ruled out. For image 1 and 2 it cannot.
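For illustration, a single training sample for such a network could be assembled as follows (make_patch_pair and its arguments are assumed names, not taken from the article's notebook):

import numpy as np

def make_patch_pair(img_a, img_b, x, y, size=9):
    # Stack two grayscale patches taken at the same location into one
    # (9, 9, 2) sample, i.e. 9 * 9 * 2 = 162 input values for the network.
    patch_a = img_a[y:y + size, x:x + size].astype(np.float32) / 255.0
    patch_b = img_b[y:y + size, x:x + size].astype(np.float32) / 255.0
    return np.stack([patch_a, patch_b], axis=-1)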

The training of such a network is straightforward using TensorFlow (and equally so with PyTorch). The following code sketches an effective network trained on 200k labeled grayscale images. To exploit all symmetries, the number of training examples can be increased by a factor of 16, accounting for 3 symmetry axes and the commutativity of the two input images. The training procedure is effectively defined in a single call to the Keras API and completes within a few minutes on stock hardware. Fifteen years ago this would have been cutting-edge AI research. Now it is straightforward machine learning.
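The exact architecture lives in the linked notebook; the following is a minimal Keras sketch under the assumptions above, with illustrative layer sizes rather than the article's exact 48,211-parameter layout, and with random placeholder data standing in for the real training set:

import numpy as np
from tensorflow import keras
from tensorflow.keras import layers

def build_patch_comparator(input_shape=(9, 9, 2)):
    # Two 9x9 grayscale patches stacked as two channels. Only 'valid'
    # convolutions are used, so a 9x9 input collapses to a single score,
    # while larger inputs later yield one score per location.
    return keras.Sequential([
        keras.Input(shape=input_shape),
        layers.Conv2D(16, 3, activation="relu"),    # 9x9 -> 7x7
        layers.Conv2D(32, 3, activation="relu"),    # 7x7 -> 5x5
        layers.Conv2D(32, 3, activation="relu"),    # 5x5 -> 3x3
        layers.Conv2D(1, 3, activation="sigmoid"),  # 3x3 -> 1x1 "equal" probability
    ])

model = build_patch_comparator()
model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])

# Placeholder data with the expected shapes; the real 200k labeled samples
# (augmented by flips and by swapping the two channels) come from the notebook.
x_train = np.random.rand(1000, 9, 9, 2).astype("float32")
y_train = np.random.randint(0, 2, size=(1000, 1, 1, 1)).astype("float32")
model.fit(x_train, y_train, epochs=5, batch_size=256, validation_split=0.1)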

There is a Colab notebook available complete with training instructions and training data.

The neural network design loosely follows the original LeNet from 1990 with the only difference being that it is applied to an image that is larger than the input window. Hence, it does not yield a single label, but a label for each point. This does solve the problem with our initial example of two maps that are not perfectly aligned. Figure 5 shows the result of this simple convolutional network with 162 input nodes, 8 layers and a total of 48,211 trainable parameters. All training data and the network layout are available from the linked notebook.
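Under the same assumptions, the trained weights can be reused on a whole screenshot pair, because the sketch above is purely convolutional (img_a and img_b below are placeholder grayscale arrays scaled to [0, 1] with identical shapes):

# Rebuild the same stack with an unconstrained spatial input size and copy
# the trained weights, yielding one equality score per image location
# (minus the 8-pixel border lost to the 'valid' convolutions).
full_model = build_patch_comparator(input_shape=(None, None, 2))
full_model.set_weights(model.get_weights())

pair = np.stack([img_a, img_b], axis=-1)[np.newaxis, ...].astype("float32")  # (1, H, W, 2)
equality_map = full_model.predict(pair)[0, ..., 0]
difference_mask = equality_map < 0.5   # locations flagged as structurally different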

Figure 5: The convolutional neural network can detect the deviating location on the map.

Coping with large displacements

On high resolution displays or with more complex layouts, the differences between two images might result from displacements of more than a few pixels. For applications in test automation, such rearrangements typically lead to failed tests. The manual effort of checking all slightly reformatted screenshots for consistency is immense. It would be extremely valuable if an AI tool could reliably detect whether substantial changes happened beyond such minor layout adjustments.

A naive thought would be to simply increase the window size of the neural network. Suppose we have a network that is able to compensate for shifts of length “n” and we want to scale its ability by a factor of 2. Since the additional displacement can happen in either of two dimensions, we need to extend our search to four different segments of the original size. Since we would accept a match with any of these four segments, we are four times as likely to catch false positives. Small regions might just look alike, e.g. contain the same stroke type, yet not be the result of a consistent displacement of an entire area. To avoid this deterioration of specificity we would have to compare screen regions of four times the size. This gives us a total of 16 times the effort, or a complexity of O(n^4), for an image comparison network that is tolerant to a displacement of length n. This is likely a lower limit, because for larger regions the displacement might not be uniform and more complex distortion scenarios must be considered.

Let’s consider the scenario where the displacement is non-uniform and large. Figure 6 shows an example of the Spot-the-Difference game as played by little children. The skewed display of the second image poses little additional difficulty for humans, because we are used to seeing things from an angle anyway. For the purpose of software testing, such a distortion might seem unrealistic. However, with fluid layouts and ever changing scales and positions, it is not too far from reality either. The relevant takeaway here is that AI totally fails to detect any of the relevant differences at all.

Figure 6: While little children have no problem spotting the relevant differences, AI totally fails.

A possible solution

As discussed previously, increasing the window size to cover the entire length of the displacements is not an option. Instead of training the network to yield a boolean equality flag, we can train it to return the x and y values of the displacement vector. At first, this doesn’t seem to make any difference, as the actual displacement still does not fit inside the sliding window of computation. However, we can now scale the image down, sacrificing the detail needed for an exact comparison but gaining an overview that lets us approximate the displacement direction, which narrows down the search region at finer levels.

The following pseudocode implements a solution to the correspondence problem. It derives a map of vectors pointing to corresponding positions in the left and the right image. The algorithm recursively calls itself with downscaled images. A trained network only needs to adjust for the errors made on the coarser level, i.e. it only needs to find matching features in a very small search region. The algorithm uses the OpenCV framework’s “resize” method to scale the images up and down and “remap” to apply the displacements predicted on coarser levels. It also uses a cnn_predict method to get the neural network’s estimate for the relative displacement. The prediction yields two output channels for the x and y components.


# Returns a tensor dxy, such that for every x,y:
#   img1(y + dxy(y,x,1)/2, x + dxy(y,x,0)/2) corresponds to 
#   img2(y - dxy(y,x,1)/2, x - dxy(y,x,0)/2)
def get_correspondence_map(img1, img2):
  assert_equal(img1.shape, img2.shape)
  if img1.shape > window_shape:
    dxy_1 = resize(         # The coarse estimate is derived from a
      get_correspondence_map( # recursive call on shrunk images.
        resize(img1, 0.5), 
        resize(img2, 0.5)), 2)    
    dxy_2 = cnn_predict([ # Calling CNN to predict residual displacements.
        remap(img1,  dxy_1),  # Each image is displaced
        remap(img2, -dxy_1)]) # into the other’s direction.
    dxy = 2 * dxy_1 + dxy_2  # Return the sum of coarse and fine shifts
  else:
    dxy = zeros((img1.height, img1.width, 2)) # Too small: assume alignment
  return dxy

This Colab sheet shows the full algorithm with all border cases in Python, using OpenCV and a pretrained network that recognizes displacements of up to 3 pixels from a 15×11 window with 64,741 parameters.

After reconstructing the displacement map, we apply half of the effect to each image. Now the two images meet in the middle and can be compared with standard methods. Obviously, the distortion introduces additional aliasing and the match will not be pixel-perfect. The neural network from the previous section solves this problem out of the box. All relevant differences are detected, with only two false positives in the top right. Needless to say, none of these test images were used in the training phase.
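As a sketch of this step (assuming dxy is the displacement tensor returned by get_correspondence_map above and the images are grayscale arrays of identical shape), the half-way warp could be done with OpenCV’s remap:

import cv2
import numpy as np

def warp_towards_middle(img1, img2, dxy):
    # Move each image half-way along the reconstructed displacement map so the
    # pair "meets in the middle" and can be compared with standard methods.
    h, w = img1.shape[:2]
    grid_x, grid_y = np.meshgrid(np.arange(w, dtype=np.float32),
                                 np.arange(h, dtype=np.float32))
    half_x, half_y = dxy[..., 0] / 2.0, dxy[..., 1] / 2.0
    warped1 = cv2.remap(img1, (grid_x + half_x).astype(np.float32),
                        (grid_y + half_y).astype(np.float32), cv2.INTER_LINEAR)
    warped2 = cv2.remap(img2, (grid_x - half_x).astype(np.float32),
                        (grid_y - half_y).astype(np.float32), cv2.INTER_LINEAR)
    return warped1, warped2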

Figure 7: Simple difference algorithms can be applied after compensating for the distortion.

This suggested solution helps to find relevant differences even after the layout has changed significantly. It can detect when items move together, but it cannot track individual items that swap places or jump to completely new areas. A human or an AI still needs to examine and explain what actually changed. However, by narrowing down where to look, it makes the job much easier.

Conclusion

Comparing two graphical outputs is the main challenge in visual regression testing. While it is easy to detect that two images differ at some pixel level, it is much harder to summarize the findings as, for example, “logo X has moved by n pixels”. Unfortunately, current generative multi-modal AI models are not up to the task. They are very good at finding differences that they can name, such as text values or the order of buttons, but totally fail when things cannot be named, such as aesthetic alignments, connections on an irregular map, or any objects that they have not been trained to tokenize.

To facilitate image comparisons, two hand-made solutions were suggested. Training a CNN to compare image segments with tolerance to minor displacements relieves the need for perfect pixel alignment. To detect and compensate for large distortions, an algorithm was shown that operates on multiple image scales. This is similar to how the human visual cortex probably handles such a task: it operates on multiple scales, derives assumptions, and compensates for their effect. Eventually, the residual change can be focused on, separating shift from non-shift differences. Since comparing still images is just the tip of the iceberg of visual processing, I expect AI technologies to continue struggling with simple image-related tasks for quite some time into the future.
