AI Learns to Sketch Like Humans: MIT and Stanford’s SketchAgent Bridges the Gap

Words often fall short when conveying complex ideas. A quick sketch can be the most efficient way to communicate or understand a concept, like diagramming a circuit to grasp how a system works. But what if AI could assist in creating and exploring these visualizations? While AI excels at realistic paintings and cartoonish drawings, it often misses the essence of sketching: the iterative, stroke-by-stroke process that fuels brainstorming and idea refinement.

Researchers at MIT’s Computer Science and Artificial Intelligence Laboratory (CSAIL) and Stanford University have developed a novel drawing system called “SketchAgent” that aims to bridge this gap. SketchAgent leverages a multimodal language model, such as Anthropic’s Claude 3.5 Sonnet, that is trained on both text and images, allowing the system to turn natural language prompts into sketches in seconds. For example, it can sketch a house either on its own or in collaboration, drawing alongside a human or following text-based input that specifies each part separately.
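To picture how such a system might drive a model, here is a minimal sketch using Anthropic’s Python SDK: it asks the model to reply with numbered, labeled strokes on a grid rather than with an image. The prompt wording, grid size, and output format are illustrative assumptions, not SketchAgent’s actual protocol.

```python
# A minimal sketch of prompting a multimodal language model for stroke-based
# drawing, in the spirit of SketchAgent. The system prompt and the expected
# reply format below are assumptions made for illustration.
import anthropic

client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment

SYSTEM = (
    "You draw on a 50x50 grid. Respond only with numbered strokes, one per "
    "line, in the form: <n>. <label>: (x1,y1) (x2,y2) ..."
)

response = client.messages.create(
    model="claude-3-5-sonnet-20241022",
    max_tokens=512,
    system=SYSTEM,
    messages=[{"role": "user", "content": "Sketch a simple house."}],
)

# The reply is plain text that a renderer can parse into polylines,
# e.g. "1. roof: (10,30) (25,45) (40,30)"
print(response.content[0].text)
```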

The researchers have demonstrated SketchAgent’s ability to create abstract drawings of diverse concepts, ranging from robots and butterflies to DNA helices, flowcharts, and even the Sydney Opera House. This tool has the potential to evolve into an interactive art game, aiding teachers and researchers in diagramming complex concepts or providing users with quick drawing lessons.

A video demonstration of SketchAgent showcases its collaborative capabilities and how it enables AI models to sketch more like humans. The video is available on YouTube at https://www.youtube.com/watch?v=F8WClut-eec.

Yael Vinker, a CSAIL postdoc and lead author of the paper introducing SketchAgent, emphasizes that the system offers a more natural way for humans to interact with AI. “Not everyone is aware of how much they draw in their daily life. We may draw our thoughts or workshop ideas with sketches,” she says. “Our tool aims to emulate that process, making multimodal language models more useful in helping us visually express ideas.”

SketchAgent teaches these models to draw stroke by stroke without requiring any training data. Instead, the researchers developed a “sketching language” that translates a sketch into a numbered sequence of strokes on a grid. The system is given examples of how to draw objects like a house, with each stroke numbered and labeled by what it depicts (for example, the seventh stroke being a rectangle labeled “front door”), enabling the model to generalize to new concepts.
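To make the idea concrete, here is a toy version of such a sketching language in Python: each stroke is a numbered, labeled polyline over a coarse grid, and a sketch serializes to the kind of text a language model can read and write. The grid size, field names, and example strokes are assumptions for illustration, not the paper’s exact format.

```python
# A toy "sketching language": a sketch is an ordered list of numbered,
# labeled strokes, each a polyline over a coarse grid of points.
from dataclasses import dataclass

GRID = 50  # assumed canvas: a GRID x GRID lattice of addressable points

@dataclass
class Stroke:
    index: int                     # position in the drawing sequence
    label: str                     # semantic part, e.g. "front door"
    points: list[tuple[int, int]]  # grid coordinates the stroke passes through

# A house as an ordered stroke sequence, mirroring the kind of in-context
# example the model is shown (each stroke labeled by what it represents).
house = [
    Stroke(1, "left wall",  [(10, 10), (10, 30)]),
    Stroke(2, "right wall", [(40, 10), (40, 30)]),
    Stroke(3, "floor",      [(10, 10), (40, 10)]),
    Stroke(4, "roof",       [(10, 30), (25, 45), (40, 30)]),
    Stroke(5, "front door", [(22, 10), (22, 20), (28, 20), (28, 10)]),
]

def to_prompt_lines(strokes: list[Stroke]) -> str:
    """Serialize strokes into the textual form fed to the language model."""
    return "\n".join(
        f"{s.index}. {s.label}: " + " ".join(f"({x},{y})" for x, y in s.points)
        for s in strokes
    )

print(to_prompt_lines(house))
```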

Vinker collaborated on the paper with CSAIL affiliates Tamar Rott Shaham, Alex Zhao, and Antonio Torralba, as well as Stanford University’s Kristine Zheng and Judith Ellen Fan. Their work will be presented at the 2025 Conference on Computer Vision and Pattern Recognition (CVPR) this month.

While text-to-image models like DALL-E 3 can create compelling drawings, they often lack the spontaneous, creative process inherent in sketching, where each stroke influences the overall design. SketchAgent models drawings as a sequence of strokes, resulting in a more natural and fluid appearance reminiscent of human sketches.

Previous attempts to mimic this process relied on training models on human-drawn datasets, which are limited in scale and diversity. SketchAgent leverages pre-trained language models, which possess knowledge of many concepts but lack sketching skills. By teaching these language models the sketching process, SketchAgent can sketch diverse concepts without explicit training.

To assess SketchAgent’s collaborative capabilities, the team tested the system in collaboration mode, where a human and the language model work together to draw a specific concept. Removing SketchAgent’s contributions revealed that its strokes were essential to the final drawing. For instance, in a drawing of a sailboat, removing the artificial strokes representing the mast rendered the sketch unrecognizable.
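The ablation can be pictured as a simple filter over authored strokes, as in this hypothetical sketch; the AuthoredStroke format and the example sailboat strokes are assumptions, not the team’s actual code.

```python
# A hypothetical rendering of the ablation: drop the agent's strokes from a
# collaborative sketch and see what survives.
from dataclasses import dataclass

@dataclass
class AuthoredStroke:
    author: str                    # "human" or "agent"
    label: str                     # semantic part, e.g. "mast"
    points: list[tuple[int, int]]  # grid coordinates along the stroke

def without_agent(strokes: list[AuthoredStroke]) -> list[AuthoredStroke]:
    """Keep only the human-drawn strokes, as in the recognizability test."""
    return [s for s in strokes if s.author == "human"]

sailboat = [
    AuthoredStroke("human", "hull", [(5, 10), (45, 10), (40, 5), (10, 5)]),
    AuthoredStroke("agent", "mast", [(25, 10), (25, 40)]),
    AuthoredStroke("agent", "sail", [(25, 40), (25, 15), (40, 15)]),
]

# With the agent's mast and sail removed, only the hull remains, and the
# drawing is no longer recognizable as a sailboat.
print([s.label for s in without_agent(sailboat)])  # ['hull']
```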

In another experiment, the researchers evaluated different multimodal language models within SketchAgent to determine which could create the most recognizable sketches. Claude 3.5 Sonnet, the default backbone model, generated the most human-like vector graphics, outperforming models such as GPT-4o and Claude 3 Opus.

“The fact that Claude 3.5 Sonnet outperformed other models like GPT-4o and Claude 3 Opus suggests that this model processes and generates visual-related information differently,” notes co-author Tamar Rott Shaham.

She adds that SketchAgent could become a valuable interface for collaborating with AI models beyond text-based communication. “As models advance in understanding and generating other modalities, like sketches, they open up new ways for users to express ideas and receive responses that feel more intuitive and human-like,” says Rott Shaham. “This could significantly enrich interactions, making AI more accessible and versatile.”

Despite its promise, SketchAgent is not yet capable of producing professional-quality sketches. It creates simple representations of concepts using stick figures and doodles but struggles with complex elements such as logos, sentences, intricate creatures like unicorns and cows, and specific human figures.

The model can also sometimes misinterpret users’ intentions in collaborative drawings, such as drawing a bunny with two heads. According to Vinker, this may be due to the model breaking down tasks into smaller steps using “Chain of Thought” reasoning. When working with humans, the model creates a drawing plan and might misinterpret which part of the outline a human is contributing to. Refining these skills could involve training on synthetic data from diffusion models.

Currently, SketchAgent often requires several rounds of prompting to generate human-like doodles. The team plans to improve the interface and streamline interaction and sketching with multimodal language models in the future.

Overall, SketchAgent demonstrates the potential for AI to draw diverse concepts in a manner similar to humans, facilitating step-by-step human-AI collaboration that leads to more aligned final designs.

This work was supported by the U.S. National Science Foundation, a Hoffman-Yee Grant from the Stanford Institute for Human-Centered AI, the Hyundai Motor Co., the U.S. Army Research Laboratory, the Zuckerman STEM Leadership Program, and a Viterbi Fellowship.
