World Models Could Unlock The Next Revolution In Artificial Intelligence

You’ve probably seen an artificial intelligence system go off the rails. You ask for a video of a dog, and as the dog runs behind the loveseat, its collar disappears. Then, as the camera pans back, the loveseat becomes a sofa.

Part of the problem lies in the predictive nature of many AI models. Like the models that power ChatGPT, which are trained to predict text, video generation models predict whatever is statistically most likely to look right next. In neither case does the AI maintain a clearly defined world model that it continuously updates to make more informed decisions.

But that is starting to change as researchers in many AI domains work to create “world models,” with implications that extend beyond video generation and chatbot use to augmented reality, robotics, autonomous vehicles, and even human-level intelligence, or artificial general intelligence (AGI).

A simple way to understand world modeling is through four-dimensional, or 4D, models (three dimensions plus time). For this, let’s think back to 2012, when Titanic, 15 years after its theatrical release, was painstakingly converted into stereoscopic 3D. If you were to freeze a frame, you would get a sense of the distance between characters and objects on the ship. But if Leonardo DiCaprio stood with his back to the camera, you wouldn’t be able to walk around him to see his face. The cinema illusion of 3D is created using stereoscopy: two slightly different images, often projected in rapid alternation, one for the left eye and one for the right eye. Everyone in the cinema sees the same pair of images and therefore a similar perspective.
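To see why two slightly different images are enough to convey distance, here is a minimal sketch of the standard pinhole-stereo relation, in which the horizontal shift (disparity) between the left-eye and right-eye images shrinks as an object gets farther away. The function name, the focal length of 1,000 pixels and the 6.5-centimeter camera spacing are illustrative assumptions, not figures from the Titanic conversion.

```python
def disparity_px(depth_m, focal_px=1000.0, baseline_m=0.065):
    """Horizontal shift, in pixels, between the left and right images.

    Uses the textbook relation disparity = focal * baseline / depth for two
    parallel pinhole cameras. The default values are assumptions chosen only
    for illustration.
    """
    if depth_m <= 0:
        raise ValueError("depth must be positive")
    return focal_px * baseline_m / depth_m


# A nearby actor shifts far more between the two images than a distant railing,
# and that difference is what the viewer's brain reads as depth.
near = disparity_px(2.0)    # object 2 metres from the cameras
far = disparity_px(10.0)    # object 10 metres from the cameras
print(near > far)
```

Because the two camera positions are fixed when the images are made, this depth cue is baked in: it tells you how far away DiCaprio is, but it never reveals his face if he has his back to you.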

Thanks to the past decade of research, however, multiple perspectives are increasingly possible. Imagine realizing you should have taken a photo from a different angle and then having AI make that adjustment, re-creating the same scene from a new perspective. Starting in 2020, NeRF (neural radiance field) algorithms provided a path to creating “photorealistic new renderings,” but they required combining many photos so that an AI system could generate a 3D rendering. Other 3D approaches use AI to predictively fill in missing information, which makes them deviate more from reality.

Now imagine that every frame of Titanic were rendered in 3D, so the film existed in 4D. You could scroll through time to see different moments, or scroll through space to see it from different perspectives. You could also generate new versions of it. For example, a recent preprint, “NeoVerse: Enhancing 4D World Model with in-the-Wild Monocular Videos,” describes a way to convert videos into 4D models to generate new videos from different perspectives.
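The idea of scrolling through time and through space can be made concrete with a toy sketch: a scene stored as 3D positions per time step, and a virtual camera that can be placed anywhere to project those positions onto an image. Everything here (the `Camera` class, the `scene_4d` dictionary, the coordinates) is hypothetical and is not taken from the NeoVerse preprint; real systems store far richer geometry and appearance.

```python
import math
from dataclasses import dataclass


@dataclass(frozen=True)
class Camera:
    """Hypothetical pinhole camera; yaw=0 looks along +z, yaw turns it about the vertical axis."""
    position: tuple      # (x, y, z) in metres
    yaw: float = 0.0     # radians
    focal: float = 1.0   # focal length in arbitrary image units


def project(point, cam):
    """Project a 3D world point into the camera's 2D image, or None if it lies behind the camera."""
    # Shift the point so the camera sits at the origin.
    x = point[0] - cam.position[0]
    y = point[1] - cam.position[1]
    z = point[2] - cam.position[2]
    # Rotate about the vertical axis so the camera's viewing direction becomes +z.
    c, s = math.cos(cam.yaw), math.sin(cam.yaw)
    xc = c * x - s * z
    zc = s * x + c * z
    if zc <= 0:
        return None
    return (cam.focal * xc / zc, cam.focal * y / zc)


# Toy 4D scene: time step -> named 3D positions. The dog runs behind the loveseat.
scene_4d = {
    0: {"dog": (-1.0, 0.0, 4.0), "loveseat": (0.0, 0.0, 5.0)},
    1: {"dog": (0.0, 0.0, 6.0), "loveseat": (0.0, 0.0, 5.0)},
}


def render(scene, t, cam):
    """Pick a moment in time and a viewpoint, then return where each object lands in the image."""
    return {name: project(p, cam) for name, p in scene[t].items()}


front = render(scene_4d, 1, Camera(position=(0.0, 0.0, 0.0)))
side = render(scene_4d, 1, Camera(position=(5.0, 0.0, 5.0), yaw=-math.pi / 2))
print(front)  # dog and loveseat land on the same image point: the dog is hidden
print(side)   # from the side, the dog appears next to the loveseat
```

The same stored scene answers both questions. From the front at time step 1 the dog sits directly behind the loveseat, while a camera moved to the side sees the two apart, which is the kind of new perspective a flat video cannot provide.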

But 4D techniques can also help generate new video content. Another recent preprint, “TeleWorld: Towards Dynamic Multimodal Synthesis with a 4D World Model,” applies to the scenario we started with: the dog running behind the loveseat. The authors claim that the stability of AI video systems improves when a continuously updated 4D world model guides generation. The system’s 4D model is said to help prevent the loveseat from becoming a couch and the dog from losing its collar.

These are early results, but they point to a broader trend: models that update an internal scene map as they generate. Yet 4D modeling has applications far beyond video generation. For augmented reality (AR) – think Meta’s Orion prototype glasses – a 4D world model is an evolving map of the user’s world over time. It allows AR systems to keep virtual objects stable, make lighting and perspective believable, and retain a spatial memory of what has happened recently. It also enables occlusion, in which digital objects disappear behind real ones. A 2023 article states the requirement bluntly: “Achieving occlusion requires a 3D model of the physical environment.”
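Why occlusion needs a 3D model of the surroundings can be shown with a minimal per-pixel depth test: a virtual object is drawn only where it is closer to the viewer than the real surface at that pixel. The function name, the one-row “image” and the distances below are assumptions made for illustration, not the method of any particular AR system.

```python
def composite(real_depth, virtual_depth, real_char=".", virtual_char="V"):
    """Overlay a virtual object on a real scene using a per-pixel depth test.

    real_depth: rows of distances (metres) to the nearest real surface.
    virtual_depth: rows of distances to the virtual object, or None where
    the object does not cover that pixel.
    """
    rows = []
    for real_row, virtual_row in zip(real_depth, virtual_depth):
        row = ""
        for real, virtual in zip(real_row, virtual_row):
            # Draw the virtual pixel only if it lies in front of the real surface.
            visible = virtual is not None and virtual < real
            row += virtual_char if visible else real_char
        rows.append(row)
    return rows


# One image row: a wall 5 m away, with a loveseat 2 m away in the middle two pixels.
real = [[5.0, 5.0, 2.0, 2.0, 5.0, 5.0]]
# A virtual dog placed 3 m away, spanning the middle four pixels.
dog = [[None, 3.0, 3.0, 3.0, 3.0, None]]

print(composite(real, dog)[0])  # prints ".V..V."
```

Without the real-world depth values, the system would have no way to know that the loveseat should hide the middle of the dog, and the virtual animal would appear pasted on top of the furniture.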

The ability to quickly convert videos to 4D also provides rich data for training robots and autonomous vehicles on how the real world works. And by generating 4D models of the space they’re in, robots can better navigate it and predict what might happen next. Current general-purpose AI models – which understand images and text but do not generate clearly defined world models – often make mistakes; a benchmark paper presented at a conference in 2025 reports “striking limitations” in their fundamental world modeling skills, including “nearly arbitrary accuracy in distinguishing motion trajectories.”

Here’s the catch: “world model” means a lot more to those pursuing AGI. For example, today’s leading large language models (LLMs), such as those powering ChatGPT, have an implicit view of the world based on their training data. “In a sense, I would say that the LLM already has a very good model of the world; we just don’t really understand how it does that,” says Angjoo Kanazawa, an assistant professor of electrical engineering and computer science at the University of California, Berkeley. However, these conceptual models do not provide real-time physical insight into the world because LLMs cannot update their training data in real time. Even OpenAI’s technical report notes that the GPT-4 model, once deployed, “does not learn from experience.”

“How do you develop an intelligent LLM vision system that can actually take streaming input and update its understanding of the world and act accordingly?” Kanazawa says. “That’s a big open problem. I don’t think AGI is possible without actually solving this problem.”

Although researchers debate whether LLMs could ever achieve AGI, many see LLMs as part of future AI systems. The LLM would act as the layer for “language and common sense to communicate,” says Kanazawa; it would serve as an “interface,” while a more clearly defined underlying world model would provide the necessary “spatial temporal memory” that current LLMs lack.

In recent years, a number of leading AI researchers have turned to world models. In 2024, Fei-Fei Li founded World Labs, which recently launched its Marble software to create 3D worlds from “text, images, video or rough 3D layouts,” according to the startup’s promotional materials. And last November, AI researcher Yann LeCun announced on LinkedIn that he was leaving Meta to launch a startup, now called Advanced Machine Intelligence (AMI Labs), to “build systems that understand the physical world, have persistent memory, can reason, and plan complex action sequences.” He seeded these ideas in a 2022 position paper in which he asked why people can act well in situations they have never encountered, and argued that the answer “may lie in the ability… to learn world models, internal models of how the world works.” Research increasingly shows the benefits of internal models. An April 2025 Nature paper reported results on DreamerV3, an AI agent that, by learning a world model, can improve its behavior by “imagining” future scenarios.

So while in the context of AGI “world model” refers more narrowly to an internal model of how reality works, and not just 4D reconstructions, advances in 4D modeling could yield components that aid in viewpoint understanding, memory and even short-term prediction. And in the meantime, on the path to AGI, 4D models can provide rich simulations of reality in which we can test AIs to ensure that if we let them operate in the real world, they will know how to exist in it.