MIT Researchers Develop AI That Self-Detoxifies Language, Ensuring Safer and More Ethical AI Communication

Large language models (LLMs), trained on vast datasets, often inherit biases and toxic language. Researchers at MIT, the MIT-IBM Watson AI Lab, and IBM Research have developed a novel method called self-disciplined autoregressive sampling (SASA) to combat this issue.

SASA enables LLMs to detoxify their own outputs without sacrificing fluency. The decoding algorithm learns the boundary between toxic and non-toxic language within the LLM’s internal representation, and it does so without altering the model’s parameters, retraining it, or relying on an external reward model.

During inference, SASA assesses the toxicity of partially generated phrases. It selects words that place the phrase in a non-toxic space. This provides a fast and efficient way to generate less-toxic language.

“We wanted to find a way with any existing language model [that], during the generation process, the decoding can be subject to some human values; the example here we are taking is toxicity,” says Ching-Yun “Irene” Ko PhD ’24, the study’s lead author and a research scientist at IBM’s Thomas J. Watson Research Center.

The training data for LLMs often includes content collected from the internet and other public datasets, which can contain curse words and bullying language. As a result, LLMs can produce dangerous or biased content even from innocuous prompts.

Unlike methods that require LLM retraining or external reward models, SASA leverages the autoregressive nature of LLMs. It uses a decoding-based strategy during inference to steer the generation away from undesirable outputs.

The research group built a linear classifier that operates on a learned subspace of the LLM’s embedding space, trained to distinguish toxic from non-toxic language. During sampling, SASA re-weights the probabilities of candidate tokens based on their distance from this classifier’s boundary, encouraging non-toxic outputs.
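
The paper’s exact formulation is not reproduced here, but the minimal sketch below illustrates the general idea of re-weighting next-token probabilities by their signed distance to a toxic/non-toxic boundary. The logits, token embeddings, and the linear classifier (`w`, `b`) are all synthetic stand-ins, not SASA’s actual components.

```python
# A minimal, illustrative sketch of SASA-style re-weighted sampling (not the authors' code).
# Assumptions: `logits` are next-token scores from a language model, `token_embeddings`
# stand in for the model's token representations, and (w, b) define a hypothetical
# linear toxicity classifier learned on that embedding space (margin > 0 ~ non-toxic).

import numpy as np

rng = np.random.default_rng(0)

vocab_size, embed_dim = 1000, 64
token_embeddings = rng.normal(size=(vocab_size, embed_dim))  # stand-in for LLM embeddings
logits = rng.normal(size=vocab_size)                         # stand-in for next-token logits

w = rng.normal(size=embed_dim)   # hypothetical classifier weights
b = 0.0                          # hypothetical classifier bias
beta = 2.0                       # strength of the detoxification steering


def sasa_like_sample(logits, token_embeddings, context_embedding, w, b, beta):
    """Re-weight next-token probabilities by each candidate's signed distance
    to a linear toxic/non-toxic boundary, then sample one token."""
    # Approximate the context *after* appending each candidate token.
    # (The real method uses the LLM's own internal representation; a simple
    # average is used here only to keep the sketch self-contained.)
    candidate_context = 0.5 * (context_embedding[None, :] + token_embeddings)
    margins = candidate_context @ w + b  # > 0 means the non-toxic side of the boundary

    # Shift logits toward tokens that keep the phrase on the non-toxic side.
    adjusted = logits + beta * margins
    probs = np.exp(adjusted - adjusted.max())
    probs /= probs.sum()
    return rng.choice(len(probs), p=probs)


context_embedding = rng.normal(size=embed_dim)  # stand-in for the current prefix state
next_token = sasa_like_sample(logits, token_embeddings, context_embedding, w, b, beta)
print("sampled token id:", next_token)
```

The key design point the sketch tries to capture is that detoxification happens purely at decoding time: a margin term nudges sampling toward the non-toxic side of a learned boundary, with no change to the model’s weights and no external reward model in the loop.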

The researchers evaluated SASA against several baseline interventions with GPT2-Large, Llama2-7b, and Llama 3.1-8b-Instruct. The results showed that SASA significantly reduced toxic language generation. It performed on par with state-of-the-art external reward model techniques. However, stronger detoxification was accompanied by a decrease in fluency.

“If we think about how human beings think and react in the world, we do see bad things, so it’s not about allowing the language model to see only the good things. It’s about understanding the full spectrum — both good and bad,” says Ko, “and choosing to uphold our values when we speak and act.”

SASA holds promise for future applications with multiple attributes. It could check a generation’s position in multiple subspaces, leading to more positive, fair, and principle-aligned language.
