MIT’s SASA: Teaching LLMs to Self-Detoxify Language Without Retraining

Large language models (LLMs), trained on vast public datasets, often inherit biases and toxic language. Researchers at MIT, the MIT-IBM Watson AI Lab, and IBM Research have developed a novel method, Self-disciplined Autoregressive Sampling (SASA), enabling LLMs to moderate their language without sacrificing fluency. This approach detoxifies outputs without retraining or external reward models.

Unlike other methods, SASA learns a boundary between toxic and nontoxic subspaces within the LLM’s internal representation. During inference, the algorithm assesses the toxicity of partially generated phrases. It selects new tokens (words) that place the phrase in a nontoxic space, offering a fast and efficient way to generate less-toxic language.

“We wanted to find a way with any existing language model [that], during the generation process, the decoding can be subject to some human values; the example here we are taking is toxicity,” says Ching-Yun “Irene” Ko PhD ’24, the study’s lead author and a research scientist at IBM’s Thomas J. Watson Research Center.

The training data for LLMs often includes content scraped from the internet, including curse words and bullying language. As a result, LLMs can produce, or be tricked into producing, dangerous or biased content, even from innocuous prompts, so mitigation strategies are needed.

SASA leverages the autoregressive nature of LLMs, steering generation away from unsavory outputs token by token. The research group built a linear classifier that operates on the learned subspace of the LLM’s embedding space. Words with similar meanings are placed close together in vector space, and the researchers hypothesized that this embedding would capture the contextual information needed for detoxification.

The researchers used datasets containing prompts, responses, and human-attributed annotations (toxic or nontoxic). A Bayes-optimal classifier was then applied to learn a linear boundary between the toxic and nontoxic subspaces within the sentence embeddings.
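
To make the classifier step concrete, here is a minimal sketch of learning such a boundary from annotated data. The `embed`, `texts`, and `labels` names are hypothetical placeholders, and logistic regression stands in for the paper’s Bayes-optimal classifier; this illustrates the idea, not the authors’ implementation.

```python
# Minimal sketch of learning a toxic/nontoxic boundary in embedding space.
# Assumptions: embed(text) returns the LLM's sentence embedding as a NumPy
# vector, and texts/labels come from a human-annotated dataset. Logistic
# regression is a simple linear stand-in for the paper's Bayes-optimal
# classifier.
import numpy as np
from sklearn.linear_model import LogisticRegression

X = np.stack([embed(t) for t in texts])   # sentence embeddings
y = np.array(labels)                      # 1 = toxic, 0 = nontoxic

clf = LogisticRegression(max_iter=1000).fit(X, y)

def toxicity_margin(embedding: np.ndarray) -> float:
    # Signed distance to the decision boundary: positive means "toxic side".
    return float(clf.decision_function(embedding.reshape(1, -1))[0])
```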

The SASA system re-weights the sampling probability of each candidate next token based on its original probability and the extended phrase’s distance to the classifier boundary, while remaining close to the original sampling distribution. The goal is to alter the autoregressive sampling process by boosting the probability of good tokens and reducing the probability of tokens that would push the phrase into the toxic subspace.
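
A rough sketch of what one such decoding step could look like, assuming a Hugging Face-style causal LM, the classifier above, and a hypothetical `margin` helper that embeds the extended phrase and returns its signed distance to the boundary. The exponential tilt with strength `beta` is a simple stand-in for the paper’s actual re-weighting scheme.

```python
import torch

def sasa_step(model, context_ids, margin, beta=5.0, top_k=50):
    """Sample one token, tilting the distribution toward the nontoxic side.

    model:       assumed Hugging Face-style causal LM
    context_ids: tensor of shape (1, seq_len) with the tokens so far
    margin:      hypothetical helper returning the signed distance of the
                 extended phrase to the boundary (positive = toxic side)
    """
    with torch.no_grad():
        logits = model(context_ids).logits[0, -1]       # next-token logits
    probs = torch.softmax(logits, dim=-1)
    top_p, top_ids = probs.topk(top_k)                  # restrict to top-k candidates

    # Distance of each candidate continuation to the toxic/nontoxic boundary.
    margins = torch.tensor([margin(context_ids, t.item()) for t in top_ids])

    reweighted = top_p * torch.exp(-beta * margins)     # down-weight toxic candidates
    reweighted = reweighted / reweighted.sum()          # renormalize to a distribution

    return top_ids[torch.multinomial(reweighted, 1)]    # sampled token id
```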

The researchers evaluated their method against several baseline interventions with three LLMs: GPT2-Large, Llama2-7b, and Llama 3.1-8b-Instruct. For each prompt, the LLM completed the sentence 25 times, and Perspective API scored the completions for toxicity. The team looked at the average maximum toxicity score and the toxic rate. SASA was tested on the RealToxicityPrompts (RTP), BOLD, and AttaQ datasets, which contain naturally occurring English sentence prompts.
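
As an illustration, the two reported metrics could be computed roughly as follows, assuming a hypothetical `scores` mapping from each prompt to the Perspective API toxicity scores of its 25 completions, and the conventional 0.5 toxicity threshold (the paper’s exact thresholds may differ).

```python
def toxicity_metrics(scores: dict[str, list[float]], threshold: float = 0.5):
    """Compute average maximum toxicity and toxic rate over all prompts."""
    max_per_prompt = [max(s) for s in scores.values()]
    avg_max_toxicity = sum(max_per_prompt) / len(max_per_prompt)
    # Toxic rate: fraction of prompts with at least one toxic completion.
    toxic_rate = sum(m >= threshold for m in max_per_prompt) / len(max_per_prompt)
    return avg_max_toxicity, toxic_rate
```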

SASA achieved significant reductions in toxic language generation, performing on par with RAD, a state-of-the-art technique that relies on an external reward model. However, stronger detoxification came at the cost of some fluency. Before intervention, the LLMs produced more toxic responses for prompts labeled female than for those labeled male; SASA also significantly cut harmful responses, making the rates across genders more equal.

“If we think about how human beings think and react in the world, we do see bad things, so it’s not about allowing the language model to see only the good things. It’s about understanding the full spectrum — both good and bad,” says Ko, “and choosing to uphold our values when we speak and act.”

Ko says SASA could work well for multiple attributes in the future: “For human beings, we have multiple human values. We don’t want to say toxic things, but we also want to be truthful, helpful, and loyal … If you were to fine-tune a model for all of these values, it would require more computational resources and, of course, additional training.”
