
MIT and NVIDIA’s AI Tool Generates High-Quality Images 9x Faster
Researchers at MIT and NVIDIA have developed a new artificial intelligence tool called HART (Hybrid Autoregressive Transformer) that generates high-quality images significantly faster than existing state-of-the-art approaches. This innovation addresses a critical need for rapid image generation, particularly in applications like training self-driving cars, where realistic simulated environments are essential.
The current landscape of AI image generation is dominated by two main types of models: diffusion models and autoregressive models. Diffusion models, known for producing stunningly realistic images, are computationally intensive and slow. Autoregressive models, which power large language models (LLMs) like ChatGPT, are much faster but often produce lower-quality images with noticeable errors.
HART combines the strengths of both approaches. It uses an autoregressive model to quickly capture the overall structure of an image and then employs a small diffusion model to refine the finer details. This hybrid approach allows HART to generate images that match or exceed the quality of state-of-the-art diffusion models but at approximately nine times the speed.
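The two-stage flow described above can be sketched in a few lines of Python. Everything here is a hypothetical illustration of the control flow, not the authors' actual code: the function names, token counts, and the placeholder "denoiser" are all assumptions, with random sampling standing in for the real neural networks.

```python
import numpy as np

# Illustrative sketch of a HART-style two-stage pipeline (hypothetical
# names and shapes; the real models are large neural networks).

def autoregressive_stage(prompt_embedding, num_tokens=256, vocab_size=1024):
    """Predict discrete image tokens one at a time, each conditioned on the
    tokens generated so far. A real transformer would compute logits from
    the prompt and previous tokens; random sampling stands in here just to
    show the sequential loop."""
    rng = np.random.default_rng(0)
    tokens = []
    for _ in range(num_tokens):
        tokens.append(int(rng.integers(vocab_size)))
    return tokens

def diffusion_refinement(coarse_image, num_steps=8):
    """Refine fine detail with a small number of denoising steps -- far
    fewer than a full diffusion model needs, because the coarse structure
    already exists. The 'predicted noise' below is a toy placeholder."""
    rng = np.random.default_rng(1)
    x = coarse_image + rng.normal(scale=0.1, size=coarse_image.shape)
    for step in range(num_steps):
        predicted_noise = x - coarse_image          # placeholder denoiser
        x = x - predicted_noise / (num_steps - step)
    return x

def generate(prompt_embedding):
    tokens = autoregressive_stage(prompt_embedding)         # global structure
    coarse = np.array(tokens, dtype=float).reshape(16, 16)  # toy token decode
    coarse /= coarse.max()                                  # normalize to [0, 1]
    return diffusion_refinement(coarse)                     # local detail
```

The key design point the article describes is visible in the loop sizes: the autoregressive stage does one cheap prediction per token, and the diffusion stage runs only a handful of steps because it refines rather than generates from scratch.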
According to the research paper, HART’s efficiency stems from its reduced computational demands, enabling it to run locally on commercial laptops or smartphones. Users can simply input a natural language prompt into the HART interface to generate an image.
The potential applications of HART are vast. It can aid researchers in training robots for complex real-world tasks and assist designers in creating visually striking scenes for video games. Haotian Tang SM ’22, PhD ’25, co-lead author of the paper, explains the core concept with an analogy: “If you are painting a landscape, and you just paint the entire canvas once, it might not look very good. But if you paint the big picture and then refine the image with smaller brush strokes, your painting could look a lot better. That is the basic idea with HART.”
The development team, led by Tang and Yecheng Wu (Tsinghua University), also included senior author Song Han, an associate professor at MIT, and researchers from NVIDIA and Tsinghua University. Their work will be presented at the International Conference on Learning Representations.
Diffusion models like Stable Diffusion and DALL-E generate high-quality images by learning to reverse a noise-adding process: starting from random noise, they iteratively denoise every pixel, often over many steps. While effective, this process is slow and resource-intensive. Autoregressive models instead predict an image as a sequence of discrete tokens, making them faster but prone to errors due to information loss when the image is compressed into those tokens.

HART’s hybrid design addresses this by using an autoregressive model to predict compressed image tokens and then employing a small diffusion model to predict residual tokens. These residual tokens compensate for information loss by capturing fine details that discrete tokens might miss, such as the edges of objects or the details of a person’s face.
Tang notes, “We can achieve a huge boost in terms of reconstruction quality. Our residual tokens learn high-frequency details, like edges of an object, or a person’s hair, eyes, or mouth. These are places where discrete tokens can make mistakes.” The diffusion model only needs to refine the remaining details, allowing it to complete the task in fewer steps and retain the speed advantage of the autoregressive model.
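The residual-token idea from the two paragraphs above can be shown with a toy example. This is a simplified sketch under stated assumptions: real discrete tokens come from a learned neural tokenizer, not pixel rounding, and HART's diffusion model *predicts* the residuals rather than computing them exactly. The point is only to show what "residual" means: the difference between the original signal and its coarse discrete reconstruction.

```python
import numpy as np

# Toy illustration of residual tokens (hypothetical simplification:
# a learned tokenizer is replaced by per-pixel rounding).

def quantize(image, levels=8):
    """Coarse discrete 'tokenization': snap each value to one of a few
    levels. This discards high-frequency detail such as sharp edges."""
    return np.round(image * (levels - 1)) / (levels - 1)

def residual(image, discrete):
    """What the discrete tokens missed -- the fine detail that HART's
    small diffusion model is trained to predict."""
    return image - discrete

image = np.linspace(0.0, 1.0, 64).reshape(8, 8)  # toy "image" gradient
discrete = quantize(image)                       # autoregressive model's output space
res = residual(image, discrete)                  # high-frequency detail
reconstructed = discrete + res                   # residuals restore the lost detail
```

Adding the residuals back to the discrete reconstruction recovers the original exactly in this toy case; in HART the recovery is approximate, since a small diffusion model predicts the residuals rather than reading them off.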
During development, the researchers discovered that applying the diffusion model to predict only residual tokens in the final step significantly improved the quality of the generated images. The resulting model combines an autoregressive transformer with 700 million parameters and a lightweight diffusion model with 37 million parameters, yet generates images of comparable quality to a diffusion model with 2 billion parameters, nine times faster and with 31% less computation.
The autoregressive component of HART also makes it more compatible with unified vision-language generative models, paving the way for future applications where users can interact with the model through natural language to guide the image generation process. The researchers plan to explore building vision-language models on top of the HART architecture and applying it to video generation and audio prediction tasks.
The research was supported by the MIT-IBM Watson AI Lab, the MIT and Amazon Science Hub, the MIT AI Hardware Program, and the U.S. National Science Foundation, with GPU infrastructure donated by NVIDIA.