MIT Researchers Develop New Method for More Accurate AI-Generated Code

In a significant stride towards enhancing the reliability of AI-generated code, researchers at MIT, in collaboration with other institutions, have introduced a novel approach to ensure the accuracy and validity of code produced by large language models (LLMs). This innovation addresses a critical challenge in the field: while LLMs speed up code generation, they often fail to adhere to the strict syntactic and semantic rules of programming languages, leading to potential errors and system crashes.

The new method guides LLMs to generate code that not only conforms to the rules of the specific programming language but is also free of errors. By enabling the LLM to focus on outputs that are most likely to be valid and accurate, and discarding less promising options early on, the approach significantly boosts computational efficiency.

This efficiency gain allows smaller LLMs to rival the performance of much larger models in generating accurate, well-structured outputs across diverse real-world applications, including molecular biology and robotics. The potential impact of this architecture extends to empowering non-experts to control AI-generated content, such as enabling business professionals to formulate complex SQL queries using natural language prompts.

According to João Loula, an MIT graduate student and co-lead author of the research paper, this work has implications beyond academic research, potentially improving programming assistants, AI-powered data analysis, and scientific discovery tools by ensuring AI-generated outputs are both useful and correct.

The research team, led by Vikash Mansinghka of MIT and Timothy J. O’Donnell of McGill University, employed a technique called sequential Monte Carlo to enable parallel generation from an LLM. This allows multiple outputs to compete with each other, dynamically allocating resources based on the promise of each output. Each output is assigned a weight reflecting its likelihood of being structurally valid and semantically accurate. The model then focuses on those with higher weights, discarding the rest.
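At a high level, the sequential Monte Carlo process described above can be pictured as a branch-weight-prune loop: candidate outputs are extended in parallel, each is weighted by how likely it is to be valid, and low-weight candidates are discarded so compute flows to the promising ones. The snippet below is a deliberately simplified sketch, not the researchers' actual system: a balanced-parentheses check stands in for a real programming-language grammar, and a hard 0/1 weight stands in for the probabilistic scores an LLM would supply.

```python
def is_valid_prefix(s):
    # Structural check standing in for a language grammar:
    # a prefix is valid if it never closes more parens than it opened.
    depth = 0
    for ch in s:
        depth += 1 if ch == "(" else -1
        if depth < 0:
            return False
    return True

def weight(prefix):
    # Toy weight: 1.0 for structurally valid prefixes, 0.0 otherwise.
    # A real system would use probabilistic scores from the model.
    return 1.0 if is_valid_prefix(prefix) else 0.0

def smc_generate(num_particles=4, steps=6):
    # Each particle is a (partial output, weight) pair.
    particles = [("", 1.0)]
    for _ in range(steps):
        # Branch: every candidate proposes each possible next token.
        expanded = []
        for prefix, w in particles:
            for tok in "()":
                new = prefix + tok
                expanded.append((new, w * weight(new)))
        # Prune: keep only the highest-weight candidates, discarding
        # the rest so computation is focused on promising outputs.
        expanded.sort(key=lambda pw: pw[1], reverse=True)
        particles = [pw for pw in expanded if pw[1] > 0][:num_particles]
    return [p for p, _ in particles]

result = smc_generate()
```

Every surviving candidate is guaranteed to satisfy the structural constraint at each step, which is the essence of discarding less promising options early rather than generating a full output and validating it afterwards.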

In essence, the LLM benefits from an “expert” that ensures correct choices at each step while maintaining focus on the overall goal. The user specifies the desired structure and meaning, and the architecture guides the LLM accordingly.

The researchers tested their framework on LLMs tasked with generating Python code, SQL database queries, molecular structures, and robot plans. The results showed their method achieved greater accuracy with less computation compared to existing approaches. For example, in Python code generation, their architecture allowed a small, open-source model to outperform a larger, commercial closed-source model.

Looking ahead, the researchers aim to extend their technique to control larger segments of generated text and combine it with learning mechanisms to further improve accuracy. This project holds promise for broader applications, including automated data modeling and querying generative models of databases, potentially transforming machine-assisted data analysis systems.

This research is funded and supported, in part, by the Canada CIFAR AI Chairs Program, the MIT Quest for Intelligence, and Convergent Research.
