Visually impaired iPhone users may get more out of Look Around in the future
Apple engineers have detailed an AI agent that accurately describes Street View scenes. If the research pans out, it could become a tool to help visually impaired people virtually explore a location in advance.
Blind and visually impaired people already have tools at their disposal to navigate their devices and their local environment. However, Apple believes it could be beneficial for the same people to know about a place’s physical features before visiting it.
A paper released through Apple Machine Learning Research on Monday details SceneScout, an AI agent driven by a multimodal large language model. The agent's key capability is that it can view Street View imagery, analyze what it sees, and describe it to the user.
The paper is authored by Leah Findlater and Cole Gleason of Apple, as well as Gaurav Jain of Columbia University.
The paper explains that people with low vision may hesitate to travel independently in unfamiliar environments, since they don't know in advance what physical landscape they will encounter.
There are tools available to describe the local environment, such as Microsoft's Soundscape app from 2018. However, they are all designed to work in situ, not in advance.
At the moment, pre-travel advice consists of details like landmarks and turn-by-turn navigation, which offer little landscape context for visually impaired users. Street View-style imagery, such as Apple Maps Look Around, gives sighted users far more contextual cues, but those cues are lost on people who cannot see the imagery.
SceneScout
This is where SceneScout steps in, as an AI agent providing accessible interactions with Street View imagery.
SceneScout has two modes. The first, Route Preview, describes elements it observes along a route, such as trees at a turning and other tactile landmarks the user can expect.
The second mode, Virtual Exploration, lets the user move freely within Street View imagery, with the agent describing elements as they virtually move around. A rough sketch of the underlying idea appears below.
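The paper doesn't say how Apple connects the model to its imagery, but on iOS the public building blocks already exist: MapKit can return a Look Around snapshot for a coordinate, and that image could be handed to a multimodal model for a pedestrian-focused description. The Swift sketch below is purely illustrative; the SceneDescriber protocol and the prompt are assumptions, not part of the published work. Both Route Preview and Virtual Exploration could be imagined as loops over a primitive like this, stepping along a route or following the user's virtual movement.

```swift
import MapKit
import UIKit

// Hypothetical multimodal-model client; the paper does not describe Apple's implementation.
protocol SceneDescriber {
    func describe(image: UIImage, prompt: String) async throws -> String
}

// Fetch a Look Around snapshot for a coordinate and ask a multimodal model
// to describe it from a pedestrian's point of view.
func describeLocation(_ coordinate: CLLocationCoordinate2D,
                      using describer: SceneDescriber) async throws -> String? {
    // Request the Look Around scene for this point, if coverage exists.
    let request = MKLookAroundSceneRequest(coordinate: coordinate)
    guard let scene = try await request.scene else { return nil }

    // Render the scene to a still image.
    let options = MKLookAroundSnapshotter.Options()
    options.size = CGSize(width: 1024, height: 768)
    let snapshotter = MKLookAroundSnapshotter(scene: scene, options: options)
    let snapshot = try await snapshotter.snapshot

    // Ask the model for a street-level description aimed at a blind or
    // low-vision user (sidewalk condition, crossings, tactile landmarks).
    return try await describer.describe(
        image: snapshot.image,
        prompt: "Describe this street scene for a blind pedestrian, focusing on sidewalks, crossings, and landmarks."
    )
}
```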
In a user study, the team found that SceneScout helps visually impaired people uncover information they would not otherwise be able to access using existing methods.
Most of the descriptions are deemed accurate, at 72% overall, and descriptions of stable visual elements are accurate 95% of the time. However, occasional "subtle and plausible errors" make the descriptions difficult to verify without sight.
As for ways to improve the system, the test participants proposed that SceneScout could provide personalized descriptions that adapt over multiple sessions. For example, the system could learn the types of information a user prefers to hear about.
Shifting the descriptions' perspective from the viewpoint of a camera on top of a car to where a pedestrian would normally stand could also improve the information.
Another proposed improvement would work in situ: participants said they would like the Street View-style descriptions to be provided in real time, matching where they are walking.
Participants suggested this could take the form of an app that delivers the visual information through bone conduction headphones or a transparency mode as they move around. They also said they would rather point a device in a general direction, using its gyroscope and compass, to request environmental details than hope they have lined up a camera correctly for computer vision, as sketched below.
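Reading which way a device is pointed is already straightforward on iOS via Core Location's heading updates, which fuse the magnetometer with other sensors. The Swift sketch below is an illustration of the participants' idea rather than anything from the paper: it reads the compass heading and maps it to a rough sector that a description request could be scoped to.

```swift
import CoreLocation

// Minimal heading reader: reports which way the device is pointing so a
// description request can be scoped to that direction. Illustrative only.
final class HeadingReader: NSObject, CLLocationManagerDelegate {
    private let manager = CLLocationManager()
    private(set) var currentHeading: CLLocationDirection = -1  // degrees from north, -1 = unknown

    override init() {
        super.init()
        manager.delegate = self
        if CLLocationManager.headingAvailable() {
            manager.startUpdatingHeading()
        }
    }

    func locationManager(_ manager: CLLocationManager, didUpdateHeading newHeading: CLHeading) {
        // trueHeading is negative when invalid; fall back to the magnetic heading.
        currentHeading = newHeading.trueHeading >= 0 ? newHeading.trueHeading : newHeading.magneticHeading
    }

    // Rough compass sector, e.g. to ask "what is to the north-east of me?"
    func compassSector() -> String {
        let sectors = ["north", "north-east", "east", "south-east",
                       "south", "south-west", "west", "north-west"]
        guard currentHeading >= 0 else { return "unknown" }
        let index = Int((currentHeading + 22.5).truncatingRemainder(dividingBy: 360) / 45)
        return sectors[index]
    }
}
```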
Future uses
Much like a patent filing, a paper detailing the use of AI in new ways does not guarantee that it will be available in a future product or service. However, it does provide a glimpse into applications Apple has considered for the technology.
While not using Street View imagery, a similar approach could take advantage of a few rumored inbound Apple products.
Apple is thought to be creating AirPods with built-in cameras, as well as Apple Glass smart glasses with their own cameras. In both cases, the cameras could give Apple Intelligence a view of the world, which could then be used to help answer queries for the user.
It's not much of a stretch to imagine a similar system describing the local environment to a user, using live data instead of potentially dated Street View images.