Most video generation models are trained to look realistic, not to understand cause and effect. This paper argues that embodied AI needs action-conditioned world models and new benchmarks that measure whether predictions actually help robots learn to act in the real world.
