This “smart coach” helps LLMs switch between text and code

Large language models (LLMs) have revolutionized how we interact with information, showcasing strong textual reasoning abilities. However, they often falter on seemingly simple computational or algorithmic tasks, such as solving basic math problems. While some LLMs can generate code, they frequently lack the judgment to discern when code is appropriate for a given query, or what type of code to write.

Recognizing this critical gap, researchers at MIT have unveiled CodeSteer, an innovative “smart assistant” designed to guide LLMs in dynamically switching between text and code generation. CodeSteer, itself a compact LLM, acts as a dedicated coach, generating a series of iterative prompts that steer a larger LLM towards the optimal approach until a correct answer is achieved.

The impact of CodeSteer is significant: researchers observed that augmenting a larger LLM with CodeSteer boosted its accuracy on symbolic tasks by more than 30 percent. These tasks ranged from multiplying numbers and playing Sudoku to stacking blocks. CodeSteer also enabled less sophisticated LLMs to outperform more advanced models known for their enhanced reasoning skills.

This breakthrough holds immense potential for addressing complex real-world challenges that are difficult to tackle with textual reasoning alone. Applications include generating optimal paths for robots in unpredictable environments or streamlining scheduling processes within intricate international supply chains.

“There is a race to develop better and better models that are capable of doing everything, but we’ve taken a complementary approach,” explains Chuchu Fan, an associate professor of aeronautics and astronautics (AeroAstro) and principal investigator in the MIT Laboratory for Information and Decision Systems (LIDS). Fan, the senior author of the study, emphasizes, “We want to enable LLMs to select the right tools and methods, and make use of others’ expertise to enhance their own capabilities.”

The research, which will be presented at the International Conference on Machine Learning, includes contributions from LIDS graduate student Yongchao Chen, AeroAstro graduate student Yilun Hao, University of Illinois at Urbana-Champaign graduate student Yueying Liu, and MIT-IBM Watson AI Lab Research Scientist Yang Zhang.

Inspired by human trainers who guide athletes without necessarily outperforming them, CodeSteer does not retrain powerful LLMs like GPT-4 or Claude. Instead, the researchers fine-tune a smaller, lightweight LLM to provide targeted guidance, an approach that leaves the larger model’s existing capabilities intact.

Yongchao Chen notes, “We were also inspired by humans. In sports, a trainer may not be better than the star athlete on the team, but the trainer can still give helpful suggestions to guide the athlete. This steering method works for LLMs, too.”

CodeSteer begins by reviewing a query to determine the most suitable approach: coding or textual reasoning. It then generates a corresponding prompt for the larger LLM and reviews the model’s response. If the answer is incorrect, CodeSteer keeps prompting the LLM, suggesting different strategies to refine the solution, such as incorporating specific search algorithms or constraints into its Python code.
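
To make this loop concrete, here is a minimal, runnable sketch of that kind of iterative steering. It is an illustration only: the stand-in large_llm stub, the toy correctness check, and the example query are all assumptions, not the authors’ implementation (in the real system, a small fine-tuned LLM plays the coach’s role rather than hardcoded rules).

```python
# A minimal sketch of the steering loop described above.
# The "large LLM" is a hypothetical stub and the correctness check is a toy;
# in the actual system, a small fine-tuned LLM handles both decisions.

def large_llm(prompt: str) -> str:
    """Hypothetical stand-in for a powerful model such as GPT-4."""
    # Pretend the model only gets this right when nudged toward code.
    return "9.9 > 9.11" if "write Python" in prompt else "9.11 > 9.9"

def looks_correct(answer: str) -> bool:
    """Toy stand-in for CodeSteer's review of the larger model's answer."""
    return answer == "9.9 > 9.11"

def steer(query: str, max_rounds: int = 5) -> str:
    """Re-prompt the larger LLM with new strategies until the answer checks out."""
    prompt = query  # round 1: let the model try plain textual reasoning
    for _ in range(max_rounds):
        answer = large_llm(prompt)
        if looks_correct(answer):
            return answer
        # Steer toward a different strategy, e.g. code generation.
        prompt = query + "\nThis time, write Python code to compute the answer."
    return answer  # best effort once the round budget is exhausted

print(steer("Which is larger, 9.9 or 9.11?"))  # -> 9.9 > 9.11
```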

To prevent the larger LLM from opting for lazy or inefficient code, CodeSteer incorporates a symbolic checker that evaluates code complexity. An additional self-answer checker prompts the LLM to generate code to verify its own answer, further enhancing accuracy.
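
Those two safeguards can be sketched in the same hypothetical style. The AST-based laziness heuristic and the verification-prompt template below are illustrative assumptions; the article does not specify how the paper’s checkers are implemented.

```python
import ast

# Toy illustrations of the two safeguards described above -- assumptions
# made for demonstration, not the paper's actual checkers.

def too_simple(code: str) -> bool:
    """Symbolic-checker idea: flag generated code that contains no loops,
    branches, or arithmetic -- a hint that the model hardcoded the answer
    rather than actually computing it."""
    substantive = (ast.For, ast.While, ast.If, ast.BinOp, ast.ListComp)
    return not any(isinstance(n, substantive) for n in ast.walk(ast.parse(code)))

def self_check_prompt(query: str, answer: str) -> str:
    """Self-answer-checker idea: ask the LLM to write code that verifies
    its own answer instead of trusting the answer outright."""
    return (f"Question: {query}\nYour answer: {answer}\n"
            "Write Python code that independently checks this answer "
            "and prints True or False.")

print(too_simple("print(362880)"))                         # True: hardcoded constant
print(too_simple("n = 1\nfor i in range(2, 10): n *= i"))  # False: real computation
```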

Facing a lack of suitable symbolic datasets for fine-tuning and testing, the researchers curated their own corpus of 37 complex symbolic tasks, spanning spatial reasoning, mathematics, order reasoning, and optimization. This new dataset, named SymBench, was used to fine-tune and evaluate CodeSteer.

In extensive experiments, CodeSteer consistently outperformed all nine baseline methods evaluated, significantly boosting average accuracy from 53.3 percent to 86.4 percent. This robust performance was maintained even on unseen tasks and across a variety of LLMs. Crucially, a general-purpose model augmented with CodeSteer achieved higher accuracy than state-of-the-art models specifically designed for complex reasoning and planning, all while requiring significantly less computation.

Jinsung Yoon, a staff research scientist at Google Cloud AI, lauded the work, stating, “The authors present an elegant solution to the critical challenge of tool utilization in LLMs. This simple yet impactful method enables state-of-the-art LLMs to achieve significant performance improvements without requiring direct fine-tuning.” Chi Wang, a senior staff scientist at Google DeepMind, added, “Their success in training a smaller, specialized model to strategically guide larger, advanced models is particularly impactful. This intelligent collaboration among diverse AI ‘agents’ paves the way for more robust and versatile applications in complex real-world scenarios.”

Looking ahead, the researchers aim to streamline CodeSteer’s iterative prompting process to enhance its speed. They are also exploring the possibility of fine-tuning a unified model that can seamlessly switch between textual reasoning and code generation without relying on a separate assistant. This pioneering research was supported, in part, by the U.S. Office of Naval Research and the MIT-IBM Watson AI Lab.
