
AI Learns to Sketch Like Humans: MIT and Stanford’s SketchAgent System
Words often fall short when we try to communicate complex ideas. A simple sketch can sometimes be the most effective way to convey a concept, like diagramming a circuit to understand how a system works. Now, researchers are exploring how artificial intelligence can help create these visualizations.
While AI systems excel at generating realistic paintings and cartoon-like drawings, they often struggle to capture the essence of sketching: the iterative, stroke-by-stroke process that allows humans to brainstorm and refine their ideas. To bridge this gap, researchers from MIT’s Computer Science and Artificial Intelligence Laboratory (CSAIL) and Stanford University have developed a new drawing system called “SketchAgent” that sketches more like humans do.
SketchAgent is built on a multimodal language model, such as Anthropic’s Claude 3.5 Sonnet, that is trained on both text and images. This allows the system to turn natural-language prompts into sketches in just a few seconds. For example, SketchAgent can doodle a house on its own or collaborate with a human, drawing each part of the concept separately in response to text-based input.
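In practice, a call to such a model might look like the minimal sketch below, which asks a multimodal model to emit a drawing as a textual stroke sequence. The Anthropic Python SDK call is real, but the system prompt, grid convention, and output format shown here are illustrative assumptions, not the paper’s actual prompting setup.

```python
# Minimal sketch (not the authors' code): asking a multimodal model to
# emit a drawing as a numbered stroke sequence via the Anthropic SDK.
# The prompt wording, grid size, and output format are assumptions.
import anthropic

client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment

SYSTEM = (
    "You are a sketching assistant. Draw on a 50x50 grid. "
    "Reply only with numbered strokes, one per line, in the form: "
    "N. label: (x1,y1) -> (x2,y2) -> ..."
)

response = client.messages.create(
    model="claude-3-5-sonnet-20241022",
    max_tokens=512,
    system=SYSTEM,
    messages=[{"role": "user", "content": "Sketch a simple house."}],
)
print(response.content[0].text)  # e.g. "1. roof: (10,30) -> (25,45) -> ..."
```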
The researchers demonstrated SketchAgent’s ability to create abstract drawings of various concepts, including robots, butterflies, DNA helices, flowcharts, and even the Sydney Opera House. This tool has the potential to evolve into an interactive art game that assists teachers and researchers in diagramming complex concepts or provides users with quick drawing lessons.
According to CSAIL postdoc Yael Vinker, the lead author of the paper introducing SketchAgent, this system offers a more natural way for humans to communicate with AI. Vinker notes, “Not everyone is aware of how much they draw in their daily life. We may draw our thoughts or workshop ideas with sketches. Our tool aims to emulate that process, making multimodal language models more useful in helping us visually express ideas.”
SketchAgent gets AI models to draw stroke by stroke without training them on any sketch-specific data. Instead, the researchers developed a “sketching language” that translates a sketch into a numbered sequence of strokes placed on a grid. The system is given in-context examples of how concepts such as a house are drawn, with each stroke labeled for what it represents (the seventh stroke might be a rectangle labeled “front door,” for instance). This approach helps the model generalize to new concepts.
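As a rough illustration of that idea, the toy sketch below parses numbered, labeled strokes on a small grid and renders them as vector graphics. The string format, grid size, and function names here are assumptions for illustration; the paper’s actual sketching language differs in its details.

```python
# Toy version of the "sketching language" idea: a sketch is a numbered
# sequence of labeled strokes, each a polyline on a small grid.
# Format, grid size, and names are illustrative assumptions.
import re

GRID = 50  # side length of the drawing grid

def parse_stroke(line: str):
    """Parse 'N. label: (x1,y1) -> (x2,y2) -> ...' into (label, points)."""
    head, coords = line.split(":", 1)
    label = head.split(".", 1)[1].strip()
    points = [(int(x), int(y)) for x, y in re.findall(r"\((\d+),\s*(\d+)\)", coords)]
    return label, points

def strokes_to_svg(stroke_lines, scale=10):
    """Render the stroke sequence as an SVG string, one polyline per stroke."""
    polys = []
    for line in stroke_lines:
        label, pts = parse_stroke(line)
        # Flip the y-axis so larger y values point up, as on a drawing grid.
        pts_attr = " ".join(f"{x * scale},{(GRID - y) * scale}" for x, y in pts)
        polys.append(
            f'<polyline points="{pts_attr}" fill="none" stroke="black">'
            f"<title>{label}</title></polyline>"
        )
    size = GRID * scale
    return (f'<svg xmlns="http://www.w3.org/2000/svg" width="{size}" height="{size}">'
            + "".join(polys) + "</svg>")

house = [
    "1. left wall: (10,10) -> (10,30)",
    "2. right wall: (40,10) -> (40,30)",
    "3. roof: (10,30) -> (25,45) -> (40,30)",
    "4. floor: (10,10) -> (40,10)",
    "5. front door: (22,10) -> (22,20) -> (28,20) -> (28,10)",
]
print(strokes_to_svg(house))
```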
In collaboration mode, where a human and the AI draw together, SketchAgent’s contributions proved essential to the final drawing: removing the strokes representing the mast in a sailboat drawing, for example, made the overall sketch unrecognizable. When the researchers tested different multimodal language models as the system’s backbone, Claude 3.5 Sonnet generated the most human-like vector graphics, outperforming models like GPT-4o and Claude 3 Opus.
Co-author Tamar Rott Shaham suggests that SketchAgent could become a helpful interface for collaborating with AI models beyond standard text-based communication. “As models advance in understanding and generating other modalities, like sketches, they open up new ways for users to express ideas and receive responses that feel more intuitive and human-like,” says Rott Shaham. “This could significantly enrich interactions, making AI more accessible and versatile.”
While SketchAgent shows promise, it cannot yet produce professional-quality sketches. It renders concepts as simple stick figures and doodles, and it struggles with logos, sentences, complex creatures, and specific human figures. The model also sometimes misreads a user’s intentions in collaborative drawings, at one point drawing a bunny with two heads. This may be because the model breaks each task into smaller steps, potentially misinterpreting the human’s contribution to the outline.
In the future, the team aims to make it easier to interact and sketch with multimodal language models, including refining their interface. Despite its current limitations, SketchAgent suggests that AI can draw diverse concepts in a human-like manner, with step-by-step human-AI collaboration leading to more aligned final designs.



