
A new way to edit or generate images
The landscape of artificial intelligence is continually evolving, and nowhere is this more evident than in AI image generation. Projected to become a multi-billion-dollar industry this decade, the technology can turn a simple text prompt into an intricate, fanciful image in seconds. Training these generative models, however, has historically been a major hurdle, demanding vast computational resources and months of dedicated effort. But what if it were possible to generate and manipulate AI images without a traditional generator at all?
That possibility is now a reality, thanks to research presented at the International Conference on Machine Learning (ICML 2025). The paper, authored by Lukas Lao Beyer, Tianhong Li, Xinlei Chen, Sertac Karaman, and Kaiming He of MIT and Facebook AI Research, introduces techniques that could fundamentally change how AI images are created.
The research traces back to a graduate seminar at MIT, where Lukas Lao Beyer's class project grew into a full-fledged scientific endeavor. The work builds on a June 2024 paper that introduced a one-dimensional tokenizer: a neural network that compresses a 256×256-pixel image into just 32 numbers, or "tokens." Whereas earlier tokenizers produce a 16×16 array of tokens, each representing a small patch of the image, the 1D tokenizer encodes information about the entire image into far fewer tokens. Each token is a 12-digit binary number, so it can take one of 2^12 = 4,096 possible values; each acts as a "word" in an abstract, hidden language, a vocabulary unique to the computer.
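To make the difference in scale concrete, the sketch below contrasts the two token layouts. It shows shapes only; the random tensors are illustrative stand-ins, not the paper's actual tokenizer.

```python
import torch

image = torch.randn(1, 3, 256, 256)  # one 256x256 RGB image (random stand-in)

# Conventional 2D tokenizer: a 16x16 grid, one token per small image patch.
grid_tokens = torch.randint(0, 4096, (1, 16, 16))  # 256 tokens

# 1D tokenizer: the whole image is compressed into just 32 tokens, each an
# index into the 4,096-entry (2^12) codebook, with no token tied to a patch.
tokens_1d = torch.randint(0, 4096, (1, 32))        # 32 tokens

print(grid_tokens.numel(), "vs", tokens_1d.numel())  # 256 vs 32
```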
Lao Beyer's initial exploration involved manipulating these tokens to understand what each one controls. Remarkably, replacing a single token could turn a low-resolution image into a high-resolution one, alter the background blur, or adjust the brightness. Even more surprisingly, one token could shift the "pose" within an image, for instance changing the orientation of a robin's head. The discovery was unprecedented: no one had previously observed such visually identifiable changes from manipulating individual tokens. The finding suggests a streamlined, automated approach to image editing, replacing the manual, token-by-token trial and error of these first experiments.
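A minimal sketch of that experiment, assuming a pretrained 1D tokenizer/detokenizer pair with the hypothetical interfaces named below (the paper's actual models are trained transformers):

```python
import torch

def edit_one_token(image, tokenizer, detokenizer, position, new_value):
    """Replace a single token and decode, to see what that token controls."""
    tokens = tokenizer(image)        # (1, 32) codebook indices for the image
    edited = tokens.clone()
    edited[0, position] = new_value  # swap exactly one of the 32 tokens
    return detokenizer(edited)       # reconstruct the image from edited tokens

# Sweeping new_value over all 4,096 codebook entries at a fixed position
# reveals that single token's effect (blur, brightness, pose, ...).
```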
The MIT group pushed the concept further, achieving a truly consequential result: image generation without a dedicated generator. Their approach combines the 1D tokenizer with a detokenizer (decoder), guided by an off-the-shelf neural network called CLIP. CLIP cannot generate images itself, but it can measure how well an image matches a text prompt. Leveraging this, the team transformed an image of a red panda into a tiger, and even generated entirely new images from scratch, by iteratively tweaking randomly initialized tokens until the reconstructed image matched the desired text prompt.
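The loop below sketches that idea under loud assumptions: the detokenizer and CLIP scorer are untrained stand-ins that only mimic the shapes and the interface, and gradient ascent on continuous token embeddings is one plausible way to "tweak" tokens; the paper's exact optimization procedure may differ.

```python
import torch

torch.manual_seed(0)

# Stand-in for a pretrained, differentiable detokenizer. The real decoder
# outputs 256x256 images; this one decodes to 64x64 just to stay small.
detokenizer = torch.nn.Sequential(
    torch.nn.Flatten(1),                    # (1, 32, 16) -> (1, 512)
    torch.nn.Linear(32 * 16, 3 * 64 * 64),
    torch.nn.Unflatten(1, (3, 64, 64)),
)

def clip_score(image, prompt_embedding):
    # Stand-in for CLIP: the real model embeds the image and returns its
    # cosine similarity with the embedded prompt (e.g. "a tiger").
    return -(image.mean() - prompt_embedding).pow(2)

tokens = torch.randn(1, 32, 16, requires_grad=True)  # randomly assigned tokens
prompt_embedding = torch.tensor(0.5)                 # stand-in encoded prompt
opt = torch.optim.Adam([tokens], lr=0.05)

for step in range(100):
    image = detokenizer(tokens)
    loss = -clip_score(image, prompt_embedding)  # maximize the CLIP match
    opt.zero_grad()
    loss.backward()
    opt.step()
# detokenizer(tokens) is now the "generated" image: no generator was trained.
```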
The same setup also proved effective for "inpainting," filling in missing parts of images. Bypassing traditional generators for these tasks promises significant reductions in computational cost, since training large generative models is notoriously resource-intensive. Kaiming He, a co-author, highlights the team's ingenuity: "We didn't invent anything new. We didn't invent a 1D tokenizer, and we didn't invent the CLIP model, either. But we did discover that new capabilities can arise when you put all these pieces together."
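Inpainting fits the same token-optimization loop; only the objective changes. The sketch below assumes a differentiable detokenizer like the stand-in above (hypothetical) and anchors the decoded image to the undamaged pixels:

```python
import torch

def inpaint(damaged, mask, detokenizer, steps=200, lr=0.05):
    """Fill in missing pixels; mask is 1 where pixels are known, 0 where lost."""
    tokens = torch.randn(1, 32, 16, requires_grad=True)
    opt = torch.optim.Adam([tokens], lr=lr)
    for _ in range(steps):
        recon = detokenizer(tokens)
        # Penalize disagreement only on the known pixels; the decoder's
        # learned image prior fills the masked region with plausible content.
        loss = ((recon - damaged).pow(2) * mask).mean()
        opt.zero_grad()
        loss.backward()
        opt.step()
    return recon.detach()
```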
Experts in the field are equally impressed. Saining Xie, a computer scientist at New York University, notes that the work "redefines the role of tokenizers," demonstrating a capacity beyond mere image compression. Zhuang Liu of Princeton University echoes this sentiment, stating that it "demonstrates that image generation can be a byproduct of a very effective image compressor, potentially reducing the cost of generating images several-fold." The implications reach beyond computer vision: Sertac Karaman suggests tokenizing the actions of robots or self-driving cars, and Lao Beyer envisions applying the same high compression to represent vehicle routes in autonomous systems. This research heralds a more efficient, versatile future for AI image technology and beyond.



