
MIT and Stanford AI System Learns to Sketch Like Humans with ‘SketchAgent’
In a world increasingly reliant on visual communication, a simple sketch can often convey ideas more efficiently than words. Recognizing this, researchers at MIT’s Computer Science and Artificial Intelligence Laboratory (CSAIL) and Stanford University have developed “SketchAgent,” a novel AI system designed to sketch in a manner more aligned with human thought processes.
Unlike many AI models that excel at generating realistic or cartoonish images but struggle with the iterative, stroke-by-stroke nature of sketching, SketchAgent captures the essence of brainstorming and visual editing. The new system leverages a multimodal language model, such as Anthropic’s Claude 3.5 Sonnet, that is trained on both text and images, allowing it to translate natural language prompts into sketches within seconds.
SketchAgent can autonomously doodle a concept, such as a house, or collaborate with a human user, incorporating text-based input to sketch each part separately. The researchers demonstrated the system’s ability to create abstract drawings of diverse concepts ranging from robots and butterflies to DNA helices and the Sydney Opera House. This suggests its potential use in interactive art games, educational tools for diagramming complex concepts, and personalized drawing lessons.
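To make that concrete, here is a minimal sketch of how such a request could look in code, assuming Claude 3.5 Sonnet is queried through Anthropic’s Python SDK and asked to reply in a toy stroke notation. The prompt, grid size, and notation are illustrative assumptions, not the researchers’ actual setup.

```python
# Illustrative only: ask a multimodal LLM to "draw" a house as numbered
# strokes on a grid. The notation here is a toy stand-in, not the
# paper's actual sketching language.
import anthropic

client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment
response = client.messages.create(
    model="claude-3-5-sonnet-20241022",
    max_tokens=512,
    system=(
        "You sketch on a 10x10 grid. Reply only with numbered strokes, "
        "each a space-separated sequence of cells such as 'x3y7'."
    ),
    messages=[{"role": "user", "content": "Draw a simple house, stroke by stroke."}],
)
print(response.content[0].text)  # e.g. "1. x2y1 x8y1 x8y6 x2y6 x2y1 ..."
```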
A video demonstration of SketchAgent is available, showcasing its collaborative sketching capabilities.
Yael Vinker, CSAIL postdoc and lead author of the paper introducing SketchAgent, emphasizes the system’s potential to provide a more natural communication pathway between humans and AI. “Not everyone is aware of how much they draw in their daily life. We may draw our thoughts or workshop ideas with sketches,” she says. “Our tool aims to emulate that process, making multimodal language models more useful in helping us visually express ideas.”
SketchAgent teaches AI models to draw stroke by stroke without relying on extensive pre-existing datasets. Instead, the researchers developed a “sketching language” that translates a sketch into a numbered sequence of strokes on a grid. Given a few examples, such as a house drawn with each stroke labeled, the model can generalize to new concepts.
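A toy version of that representation fits in a few lines. The snippet below assumes the same hypothetical cell notation as the earlier example and simply connects each stroke’s grid cells with line segments; the paper’s actual grammar and rendering are more elaborate.

```python
# Toy illustration of a grid-based "sketching language".
# Assumptions (not the paper's exact format): each stroke is a numbered
# sequence of grid cells like "x3y7", rendered in order as line segments.
import re
import matplotlib.pyplot as plt

GRID_SIZE = 10  # hypothetical 10x10 sketching grid

def parse_stroke(stroke: str) -> list[tuple[int, int]]:
    """Turn 'x1y1 x1y5 x4y5' into [(1, 1), (1, 5), (4, 5)]."""
    return [(int(x), int(y)) for x, y in re.findall(r"x(\d+)y(\d+)", stroke)]

# A hand-labeled "house" example of the kind that could be shown to the
# model in context: stroke 1 is the body, 2 the roof, 3 the door.
house = {
    1: "x2y1 x8y1 x8y6 x2y6 x2y1",   # square body
    2: "x2y6 x5y9 x8y6",             # triangular roof
    3: "x4y1 x4y4 x6y4 x6y1",        # door
}

fig, ax = plt.subplots(figsize=(4, 4))
for number, stroke in sorted(house.items()):
    xs, ys = zip(*parse_stroke(stroke))
    ax.plot(xs, ys, marker="o", label=f"stroke {number}")
ax.set_xlim(0, GRID_SIZE)
ax.set_ylim(0, GRID_SIZE)
ax.legend()
plt.show()
```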
The system’s capabilities were assessed by comparing it to existing text-to-image models like DALL-E 3. While DALL-E 3 can generate detailed drawings, it lacks the spontaneous, iterative process inherent in human sketching. SketchAgent, on the other hand, models drawings as a sequence of strokes, resulting in a more natural and fluid appearance.
To determine whether SketchAgent genuinely collaborates with human users or simply draws on its own, the team tested it in collaboration mode and then removed its contributions from the finished drawings. The agent’s strokes proved crucial: in a sailboat sketch, for example, removing the mast it had drawn left the image unrecognizable.
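The spirit of that test is easy to reproduce in miniature: tag each stroke with its author, drop the agent’s strokes, and render both versions side by side. The sailboat data below is a made-up illustration, not the study’s evaluation code.

```python
# Illustrative ablation: remove one collaborator's strokes and re-render.
# The stroke data and author labels are assumptions for illustration.
import matplotlib.pyplot as plt

# Each stroke: (author, [(x, y), ...]) -- a toy collaborative sailboat.
strokes = [
    ("human", [(2, 1), (8, 1), (7, 0), (3, 0), (2, 1)]),  # hull
    ("agent", [(5, 1), (5, 8)]),                          # mast
    ("agent", [(5, 8), (8, 3), (5, 3)]),                  # sail
]

def render(stroke_list, ax, title):
    """Draw each stroke as a connected polyline on the given axes."""
    for _, points in stroke_list:
        xs, ys = zip(*points)
        ax.plot(xs, ys)
    ax.set_title(title)
    ax.set_xlim(0, 10)
    ax.set_ylim(-1, 10)

fig, (ax_full, ax_ablated) = plt.subplots(1, 2, figsize=(8, 4))
render(strokes, ax_full, "full collaboration")
render([s for s in strokes if s[0] != "agent"], ax_ablated, "agent strokes removed")
plt.show()
```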
Different multimodal language models were also tested within SketchAgent to identify the most effective backbone model. Claude 3.5 Sonnet outperformed models like GPT-4o and Claude 3 Opus, generating the most human-like vector graphics.
According to co-author Tamar Rott Shaham, this suggests that Claude 3.5 Sonnet processes and generates visual information differently from the other models tested. She envisions SketchAgent as a valuable interface for collaborating with AI models beyond text-based communication, making AI more accessible and versatile.
Despite its promise, SketchAgent is not yet capable of producing professional-quality sketches. It currently renders simple representations using stick figures and doodles and can struggle with complex images. The model can also occasionally misinterpret user intentions in collaborative drawings, sometimes requiring multiple prompts to generate desired results. The team plans to refine the interface and potentially train it on synthetic data from diffusion models.
In conclusion, SketchAgent demonstrates the potential for AI to draw diverse concepts in a manner similar to humans, facilitating step-by-step human-AI collaboration and resulting in more aligned final designs.



