They need well-factored, accurate multimodal world models. Architectures built on transformers and diffusion models look promising here, for example the 3D video Stable Diffusion paper and DeepMind's multimodal transformer work.
One thing that has held back progress is that putting knowledge directly into the system has become taboo. So much so that researchers often fail to even guide training toward the truly core aspects of the world model, or deliberately assume that everything, start to finish, must be learned from the barest input data such as raw pixels. They are then surprised when the system learns arbitrary, inaccurate, overfit models that miss the underlying hierarchical structure.
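One way to put knowledge directly into the system, rather than forcing everything to be rediscovered from pixels, is to hard-code the core dynamics as a prior and let learning handle only the residual. The sketch below is purely illustrative (all names and the toy drag-on-velocity dynamics are assumptions, not from any specific paper): a hand-written constant-velocity physics prior supplies the structure, and a small learned linear correction fills in what the prior misses.

```python
import numpy as np

# Illustrative sketch: a world model with a built-in physics prior.
# The prior encodes core structure (position advances by velocity);
# learning is reserved for the residual the prior cannot explain.

def physics_prior(state, dt=0.1):
    """Hand-coded core dynamics: state = [x, y, vx, vy]."""
    pos, vel = state[:2], state[2:]
    return np.concatenate([pos + vel * dt, vel])

def learned_residual(state, weights):
    """Tiny 'learned' linear correction on top of the prior."""
    return weights @ state

def world_model_step(state, weights, dt=0.1):
    # The prior carries the structure; learning only fills the gap.
    return physics_prior(state, dt) + learned_residual(state, weights)

# Ground-truth dynamics add a drag effect the prior does not capture.
def true_step(state, dt=0.1):
    nxt = physics_prior(state, dt)
    nxt[2:] += -0.05 * state[2:] * dt  # drag acting on velocity
    return nxt

# Fit only the residual, by least squares over sampled transitions.
rng = np.random.default_rng(0)
states = rng.normal(size=(256, 4))
targets = np.stack([true_step(s) for s in states])
priors = np.stack([physics_prior(s) for s in states])
W, *_ = np.linalg.lstsq(states, targets - priors, rcond=None)
weights = W.T

s = np.array([0.0, 0.0, 1.0, 0.5])
err = np.abs(world_model_step(s, weights) - true_step(s)).max()
print(err)  # near zero: the residual learner recovers the drag term
```

Because the prior already supplies the hierarchical structure, the learnable part is tiny and fits with a handful of samples, whereas a from-pixels learner would have to recover both the structure and the correction at once.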