
MIT and Stanford AI Learns to Sketch Like Humans: SketchAgent System
Words sometimes fall short when we try to communicate or understand complex ideas. A quick sketch can often be more efficient, whether we are diagramming a circuit or brainstorming a concept. Now, researchers are exploring how artificial intelligence can enhance these visualizations.
While AI excels at generating realistic paintings and cartoon-like drawings, it often misses the essence of sketching: the iterative, stroke-by-stroke process that humans use to brainstorm and refine ideas. To address this, researchers at MIT’s Computer Science and Artificial Intelligence Laboratory (CSAIL) and Stanford University have developed “SketchAgent,” a new drawing system that mimics human sketching.
SketchAgent is built on a multimodal language model, such as Anthropic's Claude 3.5 Sonnet, that is trained on both text and images. This lets the system turn natural language prompts into sketches within seconds. For example, it can sketch a house on its own or collaboratively, drawing alongside a human partner or incorporating text input to sketch individual components.
The researchers demonstrated SketchAgent’s ability to create abstract drawings of diverse concepts, including robots, butterflies, DNA helices, flowcharts, and even the Sydney Opera House. This tool has the potential to evolve into an interactive art game, aiding teachers and researchers in diagramming complex concepts or providing users with quick drawing lessons.
Yael Vinker, a CSAIL postdoc and lead author of the paper introducing SketchAgent, emphasizes that the system offers a more natural way for humans to communicate with AI. “Not everyone is aware of how much they draw in their daily life. We may draw our thoughts or workshop ideas with sketches,” she says. “Our tool aims to emulate that process, making multimodal language models more useful in helping us visually express ideas.”
SketchAgent teaches these models to draw stroke by stroke without requiring any additional training data. Instead, the researchers developed a "sketching language" that translates a sketch into a numbered sequence of strokes on a grid. The system is shown examples of how to draw objects such as a house, with each stroke labeled according to what it represents, enabling the model to generalize to new concepts.
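To make the stroke-sequence idea concrete, here is a minimal sketch of what such a representation could look like: labeled, numbered strokes laid out on a coarse grid and rendered one polyline at a time. The grid size, stroke format, and example coordinates below are illustrative assumptions, not SketchAgent's actual specification.

```python
# Illustrative take on a "sketching language": a sketch is an ordered list of
# numbered, labeled strokes, each a sequence of points on a coarse grid.
# Grid size, stroke format, and the example house are assumptions.

GRID = 50  # hypothetical coarse canvas resolution (50 x 50 cells)

# Each stroke: (label, [(x, y), ...]) with coordinates in grid units.
house_strokes = [
    ("base", [(10, 40), (40, 40), (40, 20), (10, 20), (10, 40)]),
    ("roof", [(10, 20), (25, 8), (40, 20)]),
    ("door", [(22, 40), (22, 30), (28, 30), (28, 40)]),
]

def strokes_to_svg(strokes, grid=GRID, scale=10):
    """Render numbered, labeled strokes as SVG polylines, stroke by stroke."""
    size = grid * scale
    parts = [f'<svg xmlns="http://www.w3.org/2000/svg" width="{size}" height="{size}">']
    for i, (label, points) in enumerate(strokes, start=1):
        pts = " ".join(f"{x * scale},{y * scale}" for x, y in points)
        parts.append(
            f'<polyline points="{pts}" fill="none" stroke="black" '
            f'stroke-width="3"><title>stroke {i}: {label}</title></polyline>'
        )
    parts.append("</svg>")
    return "\n".join(parts)

if __name__ == "__main__":
    print(strokes_to_svg(house_strokes))  # writes an SVG string for the toy house
```

Because each stroke is discrete, labeled, and ordered, a language model can emit a sketch the same way it emits text, one token-like stroke at a time, which is what makes the iterative, human-like drawing process possible.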
The research team, including CSAIL affiliates Tamar Rott Shaham, Alex Zhao, and Antonio Torralba, along with Stanford University’s Kristine Zheng and Judith Ellen Fan, will present their work at the 2025 Conference on Computer Vision and Pattern Recognition (CVPR) this month.
While text-to-image models like DALL-E 3 excel at creating detailed drawings, they often lack the spontaneous, creative process inherent in sketching. SketchAgent, on the other hand, models drawings as a sequence of strokes, resulting in a more natural and fluid appearance akin to human sketches.
Prior attempts to mimic this process have relied on training models with human-drawn datasets, which are often limited in scale and diversity. SketchAgent overcomes this limitation by utilizing pre-trained language models that possess extensive knowledge of various concepts but lack sketching skills. By teaching these models the sketching process, SketchAgent can sketch diverse concepts without explicit training.
To evaluate SketchAgent’s collaborative capabilities, the team tested the system in collaboration mode, where a human and the AI model work together to draw a specific concept. Removing SketchAgent’s contributions revealed that its strokes were crucial to the final drawing. For example, in a drawing of a sailboat, removing the AI-generated mast made the sketch unrecognizable.
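As a rough illustration of this ablation, the toy code below tags each stroke with its author and drops the agent's strokes to see what remains. The data structures and the recognizability hook are hypothetical stand-ins, not the paper's evaluation pipeline.

```python
# Toy ablation: remove one collaborator's strokes and inspect what is left.
# The Stroke structure, the example sailboat, and the scoring hook are
# assumptions for illustration only.

from dataclasses import dataclass

@dataclass
class Stroke:
    author: str   # "human" or "agent"
    label: str    # e.g. "hull", "mast"
    points: list  # [(x, y), ...] on the shared grid

def ablate(strokes, drop_author="agent"):
    """Return the sketch with one collaborator's strokes removed."""
    return [s for s in strokes if s.author != drop_author]

def recognizability(strokes, concept):
    """Hypothetical hook for a recognizability score in [0, 1]
    (e.g. a classifier or human rating); not the paper's metric."""
    raise NotImplementedError

sailboat = [
    Stroke("human", "hull", [(5, 30), (35, 30), (30, 36), (10, 36), (5, 30)]),
    Stroke("agent", "mast", [(20, 30), (20, 8)]),
    Stroke("agent", "sail", [(20, 10), (34, 28), (20, 28)]),
]

without_agent = ablate(sailboat)  # hull only: often no longer reads as a sailboat
```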
Further experiments involved integrating different multimodal language models into SketchAgent to identify which could produce the most recognizable sketches. The default model, Claude 3.5 Sonnet, outperformed models like GPT-4o and Claude 3 Opus, generating the most human-like vector graphics.
“The fact that Claude 3.5 Sonnet outperformed other models like GPT-4o and Claude 3 Opus suggests that this model processes and generates visual-related information differently,” notes co-author Tamar Rott Shaham.
Rott Shaham also suggests that SketchAgent could serve as a valuable interface for collaborating with AI models beyond traditional text-based communication. “As models advance in understanding and generating other modalities, like sketches, they open up new ways for users to express ideas and receive responses that feel more intuitive and human-like,” she says. “This could significantly enrich interactions, making AI more accessible and versatile.”
Despite its promise, SketchAgent cannot yet produce professional-quality sketches. It renders simple representations with stick figures and doodles, but struggles with more complex content such as logos, sentences, and specific human figures. In collaborative drawings, the model sometimes misinterprets the user's intent, for instance producing a bunny with two heads. This may stem from the model breaking the task into smaller steps and misreading the human partner's contribution to the overall plan.
Future improvements could include refining these drawing skills by training on synthetic data from diffusion models and streamlining the interaction process. Ultimately, SketchAgent represents a significant step toward human-AI collaboration in visual design, enabling more aligned and intuitive creative outcomes.
This work was supported, in part, by the U.S. National Science Foundation, a Hoffman-Yee Grant from the Stanford Institute for Human-Centered AI, the Hyundai Motor Co., the U.S. Army Research Laboratory, the Zuckerman STEM Leadership Program, and a Viterbi Fellowship.



