Apple researchers have published a study that looks into how LLMs can analyze audio and motion data to build a more accurate picture of what a user is doing. Here are the details.
They’re good at it, but not in a creepy way
A new paper titled “Using LLMs for Late Multimodal Sensor Fusion for Activity Recognition” offers insight into how Apple may be considering incorporating LLM analysis alongside traditional sensor data to gain a more precise understanding of user activity.
This, they argue, has great potential to make activity analysis more precise, even in situations where there isn’t enough sensor data.
From the researchers:
“Sensor data streams provide valuable information around activities and context for downstream applications, though integrating complementary information can be challenging. We show that large language models (LLMs) can be used for late fusion for activity classification from audio and motion time series data. We curated a subset of data for diverse activity recognition across contexts (e.g., household activities, sports) from the Ego4D dataset. Evaluated LLMs achieved 12-class zero- and one-shot classification F1-scores significantly above chance, with no task-specific training. Zero-shot classification via LLM-based fusion from modality-specific models can enable multimodal temporal applications where there is limited aligned training data for learning a shared embedding space. Additionally, LLM-based fusion can enable model deployment without requiring additional memory and computation for targeted application-specific multimodal models.”
In other words, LLMs are actually pretty good at inferring what a user is doing from basic audio and motion signals, even when they’re not specifically trained for that. Moreover, when given just a single example, their accuracy improves even further.
One important distinction is that in this study, the LLM wasn’t fed the actual audio recording. Instead, it received short text descriptions generated by audio models and by an IMU-based motion model (which tracks movement through accelerometer and gyroscope data), roughly along the lines of the sketch below.
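To make that concrete, here’s a minimal sketch of what LLM-based late fusion over text outputs could look like. The function name, field wording, and example values are illustrative assumptions, not the paper’s actual prompts (those are published in the supplemental materials).

```python
# Minimal sketch of LLM-based late fusion: the LLM never sees raw audio or IMU
# signals, only short text produced by modality-specific models.
# Field wording and example values below are illustrative assumptions.

def build_fusion_prompt(audio_caption: str, audio_label: str, imu_prediction: str) -> str:
    """Combine per-modality text outputs into a single prompt for an LLM."""
    return (
        "Below are outputs from an audio model and a motion (IMU) model "
        "describing a 20-second segment of a person's day.\n"
        f"- Audio caption: {audio_caption}\n"
        f"- Audio event label: {audio_label}\n"
        f"- IMU activity prediction: {imu_prediction}\n"
        "What activity is the person most likely doing?"
    )

# Hypothetical stand-ins for real model outputs on a dishwashing clip.
prompt = build_fusion_prompt(
    audio_caption="running water and clinking dishes",
    audio_label="water tap; dishes",
    imu_prediction="standing with repetitive arm motion",
)
print(prompt)  # This text, not the raw sensor streams, is what the LLM receives.
```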
Diving a bit deeper
In the paper, the researchers explain that they used Ego4D, a massive dataset of media shot in first-person perspective. The data contains thousands of hours of real-world environments and situations, from household tasks to outdoor activities.
From the study:
“We curated a dataset of day-to-day activities from the Ego4D dataset by searching for activities of daily living within the provided narrative descriptions. The curated dataset includes 20 second samples from twelve high-level activities: vacuum cleaning, cooking, doing laundry, eating, playing basketball, playing soccer, playing with pets, reading a book, using a computer, washing dishes, watching TV, workout/weightlifting. These activities were selected to span a range of household and fitness tasks, and based on their prevalence in the larger dataset.”
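Those twelve labels are the entire vocabulary the LLMs later have to choose from in the closed-set test. Written out as a simple config (the variable names are ours, not the paper’s), it amounts to:

```python
# The twelve high-level activities in the curated Ego4D subset, as listed in the
# paper; each sample is a 20-second clip. Variable names here are illustrative.
ACTIVITIES = [
    "vacuum cleaning", "cooking", "doing laundry", "eating",
    "playing basketball", "playing soccer", "playing with pets",
    "reading a book", "using a computer", "washing dishes",
    "watching TV", "workout/weightlifting",
]
SEGMENT_SECONDS = 20
```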
The researchers ran the audio and motion data through smaller models that generated text captions and class predictions, then fed those outputs into different LLMs (Gemini-2.5-pro and Qwen-32B) to see how well they could identify the activity.
Then, Apple compared the performance of these models in two different situations: one in which they were given the list of the 12 possible activities to choose from (closed-set), and another where they weren’t given any options (open-ended).
For each test, the models were given different combinations of audio captions, audio labels, IMU activity predictions, and extra context. The sketch below illustrates how the closed-set and open-ended setups differ.
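As a rough illustration of the two settings (the prompt wording below is ours; the actual prompts and one-shot examples ship with the paper’s supplemental materials):

```python
# Sketch of the closed-set vs. open-ended settings, plus the one-shot variant.
# Prompt wording is an illustrative assumption, not the paper's actual prompts.

def closed_set_prompt(fused_description: str, activities: list[str]) -> str:
    """Closed-set: the LLM must pick one of the twelve known activities."""
    return (
        f"{fused_description}\n"
        f"Answer with exactly one activity from this list: {'; '.join(activities)}."
    )

def open_ended_prompt(fused_description: str) -> str:
    """Open-ended: no options are given; the LLM answers freely."""
    return f"{fused_description}\nDescribe the activity in a short phrase."

def one_shot(example_description: str, example_answer: str, test_prompt: str) -> str:
    """One-shot: prepend a single worked example before the test segment."""
    return f"Example:\n{example_description}\nAnswer: {example_answer}\n\n{test_prompt}"

# E.g., pass the fused description from the earlier sketch together with the
# ACTIVITIES list above, then send the resulting prompt to Gemini-2.5-pro or Qwen-32B.
```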
In the end, the researchers note that the results offer interesting insights into how combining multiple models can improve activity and health analysis, especially in cases where raw sensor data alone is insufficient to provide a clear picture of the user’s activity.
Perhaps more importantly, Apple published supplemental materials alongside the study, including the Ego4D segment IDs, timestamps, prompts, and one-shot examples used in the experiments, to assist researchers interested in reproducing the results.
