The field of Artificial Intelligence is experiencing an unprecedented surge of innovation, yet the public discourse often remains fixated on Large Language Models (LLMs). At the recent NVIDIA GTC 2025, a conversation between Bill Dally, NVIDIA’s chief scientist, and Yann LeCun, Chief AI Scientist at Meta, peeled back the layers of current AI advancements, revealing a vision that extends far beyond token prediction. LeCun’s insights challenge conventional wisdom, emphasizing a shift towards systems that genuinely understand, reason, and interact with our complex physical world.
Moving Beyond the Language Frontier
Yann LeCun openly admits that he’s “not so interested in LLMs anymore”. While they continue to improve at the margins through more data, compute, and synthetic data generation, LeCun views them as a “simplistic way of viewing reasoning”. He posits that the truly exciting questions in AI lie in four critical areas that will define the next wave of advanced machine intelligence (AMI):
- Understanding the Physical World: How can machines grasp the nuances of real-world physics and interaction?
- Persistent Memory: Developing AI systems with the capacity for long-term, accessible memory.
- Reasoning: Moving beyond current, often rudimentary, forms of reasoning in LLMs to more sophisticated, intuitive methods.
- Planning: Enabling AI to plan sequences of actions to achieve specific goals, similar to human cognitive processes.
LeCun suggests that the ideas the tech community will be excited about in five years’ time are, for now, found mostly in “obscure academic papers” while attention stays on LLMs.
The Challenge of the Real World: Why Tokens Fall Short
The fundamental limitation of current LLMs, according to LeCun, lies in their token-based approach. Tokens, typically representing a finite set of possibilities (around 100,000 for LLMs), are well-suited for discrete data like language. However, the physical world is “high-dimensional and continuous”.
Humans acquire “world models” in the first few months of life, allowing us to understand cause and effect – for instance, how pushing a bottle from the top might flip it, while pushing it from the bottom might make it slide. This intuitive understanding of physics is profoundly difficult to replicate with systems designed to predict discrete tokens.
Attempts to train systems to understand the world by predicting high-dimensional, continuous data like video at a pixel level have largely failed. Such systems exhaust their resources trying to invent unpredictable details, leading to a “complete waste of resources”. Even self-supervised learning techniques that work by reconstructing images from corrupted versions haven’t performed as well as alternative architectures. This is because many aspects of reality are inherently unpredictable at a granular level, such as the exact appearance of every person in a video continuation.
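One way to see why pixel-level prediction wastes capacity is a toy calculation, sketched here under simple assumptions that are not from the talk: a single pixel whose future value is equally likely to be dark (0.0) or bright (1.0), and a predictor scored by squared error. The names `expected_squared_error` and `candidates` are illustrative only.

```python
# Toy illustration: one unpredictable pixel, two equally likely futures.
# Assumption: the predictor must commit to a single value and is scored
# by expected squared error.
possible_futures = [0.0, 1.0]  # dark or bright, each with probability 0.5


def expected_squared_error(prediction: float) -> float:
    """Average squared error of one prediction over all possible futures."""
    errors = [(prediction - future) ** 2 for future in possible_futures]
    return sum(errors) / len(errors)


# Try predictions 0.0, 0.1, ..., 1.0 and keep the one with the lowest error.
candidates = [i / 10 for i in range(11)]
best_prediction = min(candidates, key=expected_squared_error)

print(best_prediction)                          # the grey midpoint
print(expected_squared_error(best_prediction))  # error that never goes away
```

The lowest-error answer is the midpoint, a grey value that matches neither real outcome. Spread over millions of pixels, this is the blur that a pixel-level predictor produces when details cannot be known in advance.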
Joint Embedding Predictive Architectures (JEPA): The Future of World Models
The answer to this challenge, LeCun argues, lies in Joint Embedding Predictive Architectures (JEPA). Unlike generative models that attempt pixel-level reconstruction, JEPA focuses on learning “abstract representations” of data.
How JEPA Works:
- A piece of input (e.g., a chunk of video or an image) is run through an encoder to produce an abstract representation.
- A continuation or transformed version of the input is also run through an encoder.
- The system then attempts to make predictions within this “representation space” (latent space), rather than in the raw input space. This is akin to “filling in the blank” in a more abstract, semantic way.
The main difficulty with this approach is the collapse problem, where the system ignores its input and produces constant, uninformative representations. Finding training methods that prevent collapse was a hurdle that took years to overcome.
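The steps above can be sketched in a few lines. This is a minimal illustration, not Meta’s implementation: the encoder is a fixed hand-written linear map instead of a trained deep network, the same encoder is used for both inputs, and the toy vectors stand in for video frames. All names (`linear_encoder`, `latent_prediction_loss`, and so on) are hypothetical.

```python
from typing import Callable, List

Vector = List[float]


def linear_encoder(weights: List[Vector]) -> Callable[[Vector], Vector]:
    """Build an encoder mapping a raw input to a smaller representation."""
    def encode(x: Vector) -> Vector:
        return [sum(w * xi for w, xi in zip(row, x)) for row in weights]
    return encode


def squared_error(a: Vector, b: Vector) -> float:
    return sum((ai - bi) ** 2 for ai, bi in zip(a, b))


def latent_prediction_loss(encode, predict, context: Vector, target: Vector) -> float:
    """Compare prediction and target in representation space, not input space."""
    z_context = encode(context)   # step 1: encode the observed input
    z_target = encode(target)     # step 2: encode its continuation
    return squared_error(predict(z_context), z_target)  # step 3


# Hypothetical toy data: a 4-value "frame" and the frame that follows it.
context = [1.0, 2.0, 3.0, 4.0]
target = [2.0, 3.0, 4.0, 5.0]

# Each representation dimension averages two neighbouring input values.
encode = linear_encoder([[0.5, 0.5, 0.0, 0.0], [0.0, 0.0, 0.5, 0.5]])

loss_identity = latent_prediction_loss(encode, lambda z: list(z), context, target)
loss_shift = latent_prediction_loss(encode, lambda z: [zi + 1.0 for zi in z], context, target)

# Collapse: an encoder that ignores its input also reaches zero loss.
collapsed_encoder = lambda x: [0.0, 0.0]
loss_collapsed = latent_prediction_loss(collapsed_encoder, lambda z: list(z), context, target)

print(loss_identity, loss_shift, loss_collapsed)
```

The last case shows why collapse is a real hazard: a constant representation makes the prediction trivially perfect while carrying no information. The sketch includes no mechanism to prevent this, so the loss alone cannot tell a useful encoder from a collapsed one.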
JEPA for Reasoning and Planning: For agentic systems that can reason and plan, JEPA offers a powerful mechanism. Imagine a predictor that, upon observing the current state of the world, can anticipate the “next state of the world given that I might take an action that I’m imagining taking”. This allows for planning a sequence of actions to achieve a desired outcome, mirroring how humans inherently reason and plan.
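As a rough sketch of planning with such a predictor, the toy below uses a hand-written rule in place of a learned world model: the state is a position on a line and an action moves it by -1, 0, or +1. The planner imagines each action sequence, rolls the predictor forward, and keeps the sequence that ends closest to the goal. Every name here (`predict_next`, `rollout`, `plan`) is an assumption made for illustration.

```python
from itertools import product
from typing import Sequence, Tuple

ACTIONS = (-1, 0, 1)  # hypothetical action set: step left, stay, step right


def predict_next(state: int, action: int) -> int:
    """Stand-in for a learned predictor: next state given an imagined action."""
    return state + action


def rollout(state: int, actions: Sequence[int]) -> int:
    """Imagine a whole action sequence without executing it in the world."""
    for action in actions:
        state = predict_next(state, action)
    return state


def plan(state: int, goal: int, horizon: int) -> Tuple[int, ...]:
    """Pick the sequence whose predicted end state is closest to the goal.

    Ties are broken in favour of less total movement.
    """
    def cost(actions: Tuple[int, ...]) -> Tuple[int, int]:
        distance = abs(rollout(state, actions) - goal)
        effort = sum(abs(a) for a in actions)
        return (distance, effort)

    return min(product(ACTIONS, repeat=horizon), key=cost)


best = plan(state=0, goal=2, horizon=3)
print(best, "->", rollout(0, best))
```

Enumerating every sequence only works for a toy this small. With a learned predictor operating on abstract representations, the search would need to be replaced by some form of optimization, which this sketch does not attempt.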
LeCun strongly contrasts this with current “agentic reasoning systems” that generate vast numbers of token sequences and then use a second neural network to select the best one. He likens this to “writing a program without knowing how to write a program” – a “completely hopeless” method for anything beyond short sequences, as it scales exponentially with length. Instead, true reasoning occurs in an abstract mental state, not “kicking tokens around”. A cat, for example, plans complex jump trajectories without using language or tokens.
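The exponential growth is easy to check with the vocabulary size quoted earlier (around 100,000 tokens). The short calculation below counts how many distinct token sequences exist at each length; the function name is illustrative.

```python
VOCAB_SIZE = 100_000  # approximate LLM vocabulary size cited in the text


def count_token_sequences(vocab_size: int, length: int) -> int:
    """Number of distinct token sequences of exactly the given length."""
    return vocab_size ** length


for length in (1, 2, 5, 10):
    total = count_token_sequences(VOCAB_SIZE, length)
    exponent = len(str(total)) - 1  # total is an exact power of ten here
    print(f"length {length}: 10^{exponent} possible sequences")
```

Each extra token multiplies the space by the vocabulary size, so sampling even thousands of candidates covers a vanishing fraction of it once sequences get long.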
A practical example of JEPA’s potential is the V-JEPA (Video Joint Embedding Predictive Architecture) project, currently in development at Meta. The V-JEPA system, trained on short video segments to predict representations of full videos from masked versions, is demonstrating an ability to detect whether a video is “physically possible or not”. By measuring prediction error, it can flag “unusual” events such as objects spontaneously appearing, disappearing, or defying physics. This mirrors how human infants learn intuitive physics: a 9-month-old baby is surprised if an object appears to float, indicating a violation of their internal world model.
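The idea of using prediction error as a “surprise” signal can be illustrated with a toy stand-in. The video system described above measures error in a learned representation space. The sketch below instead tracks a single object position per frame and uses a hand-written constant-velocity predictor, both of which are assumptions made for illustration, as are the function names and the threshold.

```python
from typing import List


def predict_position(previous: float, current: float) -> float:
    """Constant-velocity guess: the object keeps moving as it just did."""
    return 2 * current - previous


def surprise_scores(track: List[float]) -> List[float]:
    """Prediction error for every frame that has two frames of history."""
    return [
        abs(track[t] - predict_position(track[t - 2], track[t - 1]))
        for t in range(2, len(track))
    ]


def flag_implausible(track: List[float], threshold: float) -> List[int]:
    """Frame indices whose prediction error exceeds the threshold."""
    scores = surprise_scores(track)
    return [t for t, score in zip(range(2, len(track)), scores) if score > threshold]


smooth_track = [0.0, 1.0, 2.0, 3.0, 4.0, 5.0]     # steady motion
teleport_track = [0.0, 1.0, 2.0, 3.0, 9.0, 10.0]  # object jumps at frame 4

print(flag_implausible(smooth_track, threshold=1.0))
print(flag_implausible(teleport_track, threshold=1.0))
```

The steady track produces no flags, while the jump produces large errors at the frame where it happens and at the next frame, whose prediction was built from the corrupted history.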
The Road to Advanced Machine Intelligence (AMI)
LeCun prefers the term Advanced Machine Intelligence (AMI) over Artificial General Intelligence (AGI), citing the highly specialized nature of human intelligence. He estimates that we could have a “good handle on getting this [AMI] to work at least at a small scale within three to five years”, with human-level AI potentially arriving within a decade or so.
However, he cautions against the historical pattern of over-optimism in AI, where each new paradigm is proclaimed as the path to human-level intelligence within a decade. He dismisses the idea that merely scaling up LLMs or generating thousands of token sequences will lead to human-level intelligence as “nonsense”.
A major bottleneck is data. LLMs are trained on vast amounts of text (e.g., 30 trillion tokens, equivalent to 400,000 years of reading). In contrast, a 4-year-old child processes an equivalent amount of data through vision in just 16,000 hours, demonstrating the immense efficiency of visual learning. This disparity underscores that we “are never going to get to AGI… by just training from text”.
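The comparison can be reproduced as back-of-the-envelope arithmetic. The token count, reading time, and hour count come from the text above. The bytes-per-token figure and the optic nerve bandwidth are assumptions added here to make the two sides comparable, so the result should be read as an order-of-magnitude estimate only.

```python
# Figures from the text.
TEXT_TOKENS = 30 * 10**12      # 30 trillion training tokens
READING_YEARS = 400_000        # time a person would need to read them
CHILD_WAKING_HOURS = 16_000    # visual experience of a 4-year-old

# Assumptions (not from the text): rough conversion factors.
BYTES_PER_TOKEN = 3                  # assumed average size of a token
OPTIC_NERVE_BYTES_PER_SEC = 2 * 10**6  # assumed ~2 MB/s of visual input

text_bytes = TEXT_TOKENS * BYTES_PER_TOKEN
vision_bytes = CHILD_WAKING_HOURS * 3600 * OPTIC_NERVE_BYTES_PER_SEC
tokens_per_reading_year = TEXT_TOKENS // READING_YEARS

print(f"text:   {text_bytes:.2e} bytes")
print(f"vision: {vision_bytes:.2e} bytes")
print(f"implied reading rate: {tokens_per_reading_year:,} tokens per year")
```

Under these assumptions both totals land around 10^14 bytes, which is the sense in which four years of seeing is “equivalent” to the entire text corpus.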
The key to unlocking AMI, according to LeCun, is discovering the “good recipe” for training JEPA architectures at scale. Just as it took time to figure out the right combination of engineering tricks, non-linearities, and innovations like ResNet (the most cited paper in science over the last decade) to effectively train deep neural networks and transformers, a similar breakthrough is needed for JEPA.
AI’s Impact: From Life-Saving to Productivity Tools
Despite the focus on future paradigms, LeCun highlights the immense positive impact AI is already having:
- Science and Medicine: AI is transforming drug design, protein folding, and understanding life mechanisms. In medical imaging, deep learning systems pre-screen mammograms for tumors, and AI reduces MRI scan times by a factor of four by recovering high-resolution images from less data.
- Automotive: Driving assistance and automatic emergency braking systems, now mandatory in Europe, reduce collisions by 40%, saving lives.
- Productivity and Creativity: AI is not replacing people but serving as “power tools” that make individuals more productive and creative, whether as coding assistants, in medicine, or in artistic endeavors.
However, the path to widespread deployment isn’t always smooth. The need for “accuracy and reliability” in applications like autonomous driving (where mistakes can be deadly) makes fielding and deploying AI systems “more difficult than most people had thought”. This is where AI often fails – not in the basic technique or demos, but in integrating reliably into existing systems. Yet, for many applications where consequences of error are not disastrous (e.g., entertainment, education, or doctor-checked medical uses), AI that is “right most of the time” is already highly beneficial.
Regarding the “dark side” of AI, such as deepfakes and false news, LeCun expresses surprising optimism. Meta’s experience suggests that, despite the availability of LLMs, they haven’t seen a “big increase in generative content being posted on social networks, or at least not in a nefarious way”. He recounts the “Galactica” episode, where Meta’s open-source LLM for scientific literature was met with “vitriol” and taken down due to fear-mongering, only for ChatGPT to be celebrated weeks later. LeCun believes that the “countermeasure against misuse is just better AI” – systems with common sense, reasoning capacity, and the ability to assess their own reliability. He dismisses catastrophic scenarios, believing “people adapt” and that AI is “mostly for good”.
The Indispensable Role of Open Source and Global Collaboration
A core tenet of LeCun’s philosophy is the absolute necessity of open-source AI platforms. He emphasizes that “good ideas come from the interaction of a lot of people and the exchange of ideas”. No single entity has a monopoly on innovation, as demonstrated by the groundbreaking ResNet architecture, which came from Chinese scientists at Microsoft Research Beijing.
Meta’s commitment to open-source, exemplified by PyTorch and LLaMA, is driven by the belief that it fosters a thriving ecosystem of startups and allows the largest number of smart people to contribute to building essential functionalities. LLaMA, a state-of-the-art LLM offered with open weights, has seen over a billion downloads, sparking a revolution in the AI landscape.
Why Open Source AI is Crucial for the Future:
- Diversity of AI Assistants: In a future where AI mediates nearly every digital interaction (e.g., smart glasses), a handful of companies cannot provide the diversity of assistants needed. We require assistants that understand “all the world’s languages, all the world’s cultures, all the value systems,” and can embody diverse biases and opinions, much like a diverse press is vital for democracy.
- Distributed Training: No single entity will collect all the world’s data in all languages. The future model involves open-source foundation models trained in a distributed fashion, with data centers globally accessing subsets of data to train a “consensus model”.
- Fine-Tuning on Proprietary Data: Open-source models like LLaMA allow companies to download and fine-tune them on their own proprietary data without having to upload it, supporting specialized vertical applications and startup business models.
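The talk does not say how a “consensus model” would be formed in the distributed-training scenario above. One common approach to combining models trained on separate data is federated averaging, and the sketch below uses it purely as an assumed stand-in: each site trains on its own data, shares only parameters, and the parameters are averaged with weights proportional to local dataset size. The site values and names are hypothetical.

```python
from typing import List, Tuple

# Each site reports (model parameters, number of local training examples).
SiteUpdate = Tuple[List[float], int]


def consensus_parameters(updates: List[SiteUpdate]) -> List[float]:
    """Average parameters across sites, weighted by local dataset size."""
    total_examples = sum(count for _, count in updates)
    num_params = len(updates[0][0])
    return [
        sum(params[i] * count for params, count in updates) / total_examples
        for i in range(num_params)
    ]


# Hypothetical updates from two data centers holding different data.
site_updates = [
    ([1.0, 2.0], 100),  # smaller regional dataset
    ([3.0, 4.0], 300),  # larger regional dataset
]

consensus = consensus_parameters(site_updates)
print(consensus)
```

The raw data never leaves each site in this scheme, only the parameters do, which is also what makes the approach compatible with keeping proprietary or regional data local.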
LeCun highlights that companies whose revenue isn’t solely tied to AI services (like Meta’s advertising model) have less to lose and more to gain from open-sourcing their models, contrasting this with companies like Google that might see it as a threat to their core search business.
Hardware: Fueling the Next AI Revolution
The journey towards AMI and sophisticated world models will demand ever-increasing computational power. While GPUs have seen incredible advancements (5,000 to 10,000 times increase in capability from Kepler to Blackwell), the computational expense of reasoning in abstract space means “we’re going to need all the competition we can get” in hardware.
LeCun is largely skeptical of neuromorphic hardware, optical computing, and quantum computing for general AI tasks in the near future. He points out that the digital semiconductor industry is in such a “deep local minimum” that alternative technologies face a monumental challenge to catch up. While the brain communicates digitally via spikes, neuromorphic approaches often struggle with hardware reuse and efficient multi-chip communication.
However, he sees promise in Processor-in-Memory (PIM) or analog/digital processor and memory technologies for specific “edge computation” scenarios, such as low-power visual processing in smart glasses. The biological retina offers an analogy: it processes immense visual data on the sensor to compress it before sending it to the visual cortex, demonstrating that shuffling data, not computation itself, often consumes the most energy. This is a promising direction for energy-efficient, always-on AI.
The Future: A Staff of Super-Intelligent Virtual People
Ultimately, LeCun envisions a future where AI systems are “power tools” that augment human capabilities, not replace them. Our relationship with future AI will be one of command; we will be their “boss,” with a “staff of super-intelligent virtual people working for us”. This collaborative future, driven by open research and open-source platforms, will leverage contributions from everyone around the world, leading to a diverse array of AI assistants that enhance our daily lives.
In essence, the future of AI isn’t a monolithic, black-box entity that suddenly appears. Instead, it’s a collaborative, iterative process, much like constructing a grand, intricate city where each builder, architect, and engineer contributes their unique expertise to a shared blueprint, leading to a vibrant and diverse metropolis of advanced machine intelligence.