
Making AI-generated code more accurate in any language
The rapid advancement of artificial intelligence has produced large language models (LLMs) capable of generating complex computer code. While this promises to significantly accelerate programming, the usefulness of AI-generated code hinges on its accuracy, its adherence to the rules of the programming language, and its ability to run without errors. Ensuring these qualities has been a significant hurdle: existing methods often yield code that drifts from the intended meaning or demands extensive, time-consuming correction.
A new method developed by researchers at MIT and other institutions addresses this challenge. The approach automatically guides an LLM to produce text, including code, that conforms to the rules of the target language while remaining virtually error-free. Crucially, the probabilistic method lets the LLM focus its computational effort on outputs that are most likely to be valid and accurate, discarding unpromising candidates early in the process. This dynamic allocation of resources substantially improves computational efficiency.
These efficiency gains are substantial: the architecture has enabled smaller LLMs to outperform much larger models at generating accurate, properly structured outputs across diverse real-world applications, including tasks in molecular biology, robotics, and several programming environments.
Beyond immediate programming benefits, this architecture holds profound implications for broader AI control, potentially empowering non-experts to interact with complex systems. For instance, business professionals could soon formulate intricate SQL database queries using only natural language prompts, bypassing the need for specialized coding knowledge.
“This work has implications beyond research. It could improve programming assistants, AI-powered data analysis, and scientific discovery tools by ensuring that AI-generated outputs remain both useful and correct,” states João Loula, an MIT graduate student and co-lead author of the paper detailing this framework.
Loula collaborated with co-lead authors Benjamin LeBrun of the Mila-Quebec Artificial Intelligence Institute and Li Du, a graduate student at Johns Hopkins University. The co-senior authors are Vikash Mansinghka ’05, MEng ’09, PhD ’09, a principal research scientist and leader of the Probabilistic Computing Project in the MIT Department of Brain and Cognitive Sciences; Alexander K. Lew SM ’20, an assistant professor at Yale University; Tim Vieira, a postdoc at ETH Zurich; and Timothy J. O’Donnell, an associate professor at McGill University and a Canada CIFAR AI Chair at Mila, along with other collaborators. The research will be presented at the International Conference on Learning Representations.
Traditionally, controlling the structured text generated by LLMs has involved one of two strategies. The first checks an entire output, such as a block of code, for validity and error-free execution; this is computationally expensive, since any error forces the process to start over. The second corrects the output incrementally as it is generated, which guarantees structural validity but often causes the code to drift from the user’s original intent, hurting accuracy in the long run.
“It is much easier to enforce structure than meaning. We can quickly check whether something is in the right programming language, but to check its meaning you have to execute the code. Our work is also about dealing with these different types of information,” Loula explains, highlighting the nuanced challenge their method addresses.
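To make this distinction concrete, here is a minimal sketch in Python (not code from the paper; the function names and the assumed solve function are illustrative placeholders) of how a cheap structural check differs from a costly semantic one for generated code:

```python
import ast

def is_structurally_valid(code: str) -> bool:
    """Cheap structural check: does the candidate even parse as Python?"""
    try:
        ast.parse(code)
        return True
    except SyntaxError:
        return False

def is_semantically_valid(code: str, test_input, expected) -> bool:
    """Costly semantic check: execute the candidate and compare its behavior."""
    namespace = {}
    try:
        exec(code, namespace)  # run the generated definition
        return namespace["solve"](test_input) == expected
    except Exception:
        return False

candidate = "def solve(x):\n    return x * 2\n"
print(is_structurally_valid(candidate))        # True: the code parses
print(is_semantically_valid(candidate, 3, 6))  # True: it also does what was intended
```

A syntactically valid candidate can still fail the semantic check, which is why verifying meaning is so much more expensive than verifying structure.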
The core of the researchers’ approach involves seamlessly integrating expert knowledge into the LLM’s generative process, guiding it towards the most promising outputs. This ensures that the generated content not only adheres to user-defined structural constraints but also accurately reflects the user’s intended meaning.
“We are not trying to train an LLM to do this. Instead, we are engineering some knowledge that an expert would have and combining it with the LLM’s knowledge, which offers a very different approach to scaling than you see in deep learning,” adds Vikash Mansinghka.
This is achieved through a sophisticated technique called sequential Monte Carlo. This method enables parallel generation streams from an LLM to compete, with the model dynamically allocating resources to different computational threads based on their output’s apparent promise. Each output is assigned a weight, signifying its likelihood of being both structurally valid and semantically accurate. At every step, the model prioritizes higher-weighted outputs, effectively pruning unviable paths. This process is akin to an expert continuously overseeing the LLM’s choices, ensuring it stays on target while making optimal decisions.
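The following is a minimal sketch of that sequential Monte Carlo idea, with toy stand-ins for the LLM and the expert checks (propose_token and weight are hypothetical placeholders, not the researchers’ implementation): several partial outputs are extended in parallel, each is weighted by how valid and promising it looks, and resampling prunes the low-weight candidates.

```python
import random

def propose_token(prefix: str) -> str:
    """Toy stand-in for one LLM generation step."""
    return random.choice(["a", "b", "c", ";"])

def weight(prefix: str) -> float:
    """Toy stand-in for the expert checks: score how valid/promising a prefix looks."""
    # e.g., high weight while the prefix still satisfies the structural constraint
    return 1.0 if prefix.count(";") <= 1 else 0.1

def smc_generate(num_particles: int = 8, num_steps: int = 5) -> str:
    particles = [""] * num_particles
    for _ in range(num_steps):
        # 1. Extend every partial output in parallel.
        particles = [p + propose_token(p) for p in particles]
        # 2. Weight each candidate by its apparent validity and promise.
        weights = [weight(p) for p in particles]
        # 3. Resample: promising candidates are duplicated, weak ones are pruned.
        particles = random.choices(particles, weights=weights, k=num_particles)
    # Return the highest-weighted surviving candidate.
    return max(particles, key=weight)

print(smc_generate())
```

In the researchers’ system the proposals come from the LLM itself and the weights reflect both structural validity and intended meaning, but this extend, weight, and resample loop mirrors the process the article describes.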
“We’ve worked out the hard math so that, for any kinds of constraints you’d like to incorporate, you are going to get the proper weights. In the end, you get the right answer,” Loula affirms, underscoring the mathematical rigor behind their success.
To validate their approach, the framework was applied to LLMs tasked with generating four distinct types of outputs: Python code, SQL database queries, molecular structures, and robotic action plans. The results were compelling: their method consistently outperformed existing approaches in accuracy while demanding significantly less computation.
In one notable example, for Python code generation, the researchers’ architecture enabled a small, open-source model to outperform a specialized, commercial closed-source model more than double its size. “We are very excited that we can allow these small models to punch way above their weight,” Loula notes.
Looking ahead, the team aims to extend their technique to control larger segments of generated text and integrate learning mechanisms, allowing the model to become even more accurate over time. The potential applications are vast, ranging from enhanced machine-assisted data analysis systems capable of understanding user queries with greater semantic precision to improving generative models of databases and automated data modeling.
“One of the fundamental questions of linguistics is how the meaning of words, phrases, and sentences can be grounded in models of the world, accounting for uncertainty and vagueness in meaning and reference. LLMs, predicting likely token sequences, don’t address this problem. Our paper shows that, in narrow symbolic domains, it is technically possible to map from words to distributions on grounded meanings. It’s a small step towards deeper questions in cognitive science, linguistics, and artificial intelligence needed to understand how machines can communicate about the world like we do,” concludes O’Donnell, reflecting on the broader scientific impact.
The research is funded and supported, in part, by the Canada CIFAR AI Chairs Program, the MIT Quest for Intelligence, and Convergent Research.



