A new Apple-backed study, in collaboration with Aalto University in Finland, introduces ILuvUI: a vision-language model trained to understand mobile app interfaces from screenshots and from natural language conversations. Here’s what that means, and how they did it.
ILuvUI: an AI that outperformed the model it was based on
In the paper, ILuvUI: Instruction-tuned LangUage-Vision modeling of UIs from Machine Conversations, the team tackles a long-standing challenge in human-computer interaction, or HCI: teaching AI models to reason about user interfaces like humans do, which in practice means visually, as well as semantically.
“Understanding and automating actions on UIs is a challenging task since the UI elements in a screen, such as list items, checkboxes, and text fields, encode many layers of information beyond their affordances for interactivity alone. (…) LLMs in particular have demonstrated remarkable abilities to comprehend task instructions in natural language in many domains, however using text descriptions of UIs alone with LLMs leaves out the rich visual information of the UI.”
Currently, as the researchers explain, most vision-language models are trained on natural images, like dogs or street signs, so they don’t perform as well when asked to interpret more structured environments, like app UIs:
“Fusing visual with textual information is important to understanding UIs as it mirrors how many humans engage with the world. One approach that has sought to bridge this gap when applied to natural images are Vision-Language Models (VLMs), which accept multimodal inputs of both images and text, typically output only text, and allow for general-purpose question answering, visual reasoning, scene descriptions, and conversations with image inputs. However, the performance of these models on UI tasks fall short compared to natural images because of the lack of UI examples in their training data.”
With that in mind, the researchers fine-tuned the open-source VLM LLaVA, and they also adapted its training method to specialize in the UI domain.
They trained it on text-image pairs that were synthetically generated from a few “golden examples.” The final dataset included Q&A-style interactions, detailed screen descriptions, predicted action outcomes, and even multi-step plans (like “how to listen to the latest episode of a podcast” or “how to change brightness settings”).
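To give a sense of what those synthetic pairs might look like, here’s a rough sketch of a single training record. The field names and schema are illustrative guesses on our part; the paper describes what the dataset contains, not its exact file format:

```python
# Hypothetical sketch of one synthetically generated training record.
# The structure and field names are assumptions for illustration only --
# the article states the dataset pairs UI screenshots with Q&A, screen
# descriptions, action outcomes, and multi-step plans, but not the schema.
import json

record = {
    "image": "screenshots/podcast_app_home.png",  # hypothetical screenshot path
    "conversations": [
        # Q&A-style interaction grounded in the visible UI
        {"role": "user", "content": "What happens if I tap the play button next to the top episode?"},
        {"role": "assistant", "content": "Playback starts for the most recent episode shown at the top of the list."},
        # Multi-step plan for a task on the same screen
        {"role": "user", "content": "How do I listen to the latest episode of this podcast?"},
        {"role": "assistant", "content": "1. Find the newest episode at the top of the episode list. "
                                         "2. Tap the play button beside its title."},
    ],
}

print(json.dumps(record, indent=2))
```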
Once trained on this dataset, the resulting model, ILuvUI, was able to outperform the original LLaVA in both machine benchmarks and human preference tests.

What’s more, it doesn’t require a user to specify a region of interest in the interface. Instead, the model understands the entire screen contextually from a simple prompt:
“ILuvUI (…) does not require a region of interest, and accepts a text prompt as input in addition to the UI image, which enables it to provide answers for use cases such as visual question answering.”
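In practice, that workflow is standard vision-language question answering over a full screenshot. Here’s a minimal sketch using the open-source LLaVA pipeline that ILuvUI is fine-tuned from; the checkpoint and prompt template come from the public llava-hf release and stand in for the UI-tuned model, whose weights and exact prompt format we’re not assuming here:

```python
# Minimal sketch of whole-screen UI question answering with a LLaVA-style VLM.
# Uses the public base LLaVA checkpoint as a stand-in for ILuvUI.
from PIL import Image
from transformers import AutoProcessor, LlavaForConditionalGeneration

model_id = "llava-hf/llava-1.5-7b-hf"  # base VLM, not the UI-tuned model
processor = AutoProcessor.from_pretrained(model_id)
model = LlavaForConditionalGeneration.from_pretrained(model_id)

# The entire screenshot goes in -- no bounding box or region of interest needed.
screenshot = Image.open("settings_screen.png")  # hypothetical UI screenshot
prompt = "USER: <image>\nHow do I change the brightness from this screen? ASSISTANT:"

inputs = processor(text=prompt, images=screenshot, return_tensors="pt")
output = model.generate(**inputs, max_new_tokens=128)
print(processor.decode(output[0], skip_special_tokens=True))
```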

How will users benefit from this?
Apple’s researchers say that their approach might prove useful for accessibility, as well as for automated UI testing. They also note that while ILuvUI is still based on open components, future work could involve larger image encoders, better resolution handling, and output formats that work seamlessly with existing UI frameworks, like JSON.
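To make that last point concrete, here’s a purely hypothetical example of what a JSON-style output could look like and how an automated UI test harness might consume it. The schema is our own illustration of the idea; the paper only floats structured outputs as future work:

```python
# Illustrative only: this action-plan schema is an assumption, not something
# ILuvUI emits today. It sketches how a test harness could consume a
# JSON-formatted answer instead of free-form text.
import json

model_output = """
{
  "goal": "change brightness settings",
  "steps": [
    {"action": "tap", "target": "Settings"},
    {"action": "tap", "target": "Display & Brightness"},
    {"action": "drag", "target": "Brightness slider", "value": 0.7}
  ]
}
"""

plan = json.loads(model_output)
for step in plan["steps"]:
    # A real harness would dispatch each step to a UI automation backend.
    print(f"{step['action']} -> {step['target']}")
```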
And if you’ve been keeping up to date with Apple’s AI research papers, you might be thinking of a recent investigation of whether AI models could not just understand, but also anticipate the consequences of in-app actions.
Put the two together, and things start to get… interesting, especially if you rely on accessibility to navigate your devices, or just wish the OS could autonomously handle the more fiddly parts of your in-app workflows.