MIT Researchers Develop AI That Self-Detoxifies Language Using SASA

Large language models (LLMs), while powerful, often inherit biases and toxic language from the vast public datasets they are trained on. Now, researchers from MIT, the MIT-IBM Watson AI Lab, and IBM Research have developed a novel method called self-disciplined autoregressive sampling (SASA) that enables LLMs to detoxify their own outputs without compromising fluency.

Unlike existing detoxification methods, SASA learns a boundary between toxic and non-toxic language within the LLM’s own internal representation. It does so without altering the model’s parameters, without retraining, and without relying on an external reward model. During inference, the algorithm evaluates the toxicity of the partially generated phrase and selects words that steer the output toward non-toxic space, offering a fast and efficient way to generate cleaner language.

“We wanted to find a way with any existing language model [that], during the generation process, the decoding can be subject to some human values; the example here we are taking is toxicity,” explains Ching-Yun “Irene” Ko PhD ’24, lead author of the study and a research scientist at IBM’s Thomas J. Watson Research Center.

The training data of LLMs invariably includes content scraped from the internet, including curse words and inappropriate language. This can lead LLMs to generate harmful or biased content, even from seemingly innocuous prompts. SASA addresses this by building a linear classifier that operates on a learned subspace of the LLM’s embedding space. This classifier distinguishes between toxic and non-toxic language based on contextual information captured within the LLM’s vector space.
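To make the idea concrete, the sketch below shows one way such a linear boundary could be fit over sentence embeddings taken from a frozen LLM. This is an illustration only, not the authors’ implementation: the random arrays stand in for real LLM hidden states and toxicity labels, and scikit-learn’s LogisticRegression is used as a generic linear probe.

```python
# Illustrative sketch (not the SASA authors' code): fit a linear boundary
# between toxic and non-toxic sentence embeddings from a frozen LLM.
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
emb_dim = 768
X = rng.normal(size=(200, emb_dim))   # stand-in for real LLM embeddings
y = rng.integers(0, 2, size=200)      # stand-in toxicity labels (1 = toxic)

clf = LogisticRegression(max_iter=1000).fit(X, y)

# The learned weights w and bias b define a hyperplane in embedding space;
# the signed distance w.x + b acts as a toxicity score for a new context.
w, b = clf.coef_[0], clf.intercept_[0]

def toxicity_margin(embedding: np.ndarray) -> float:
    """Signed distance of an embedding from the toxic/non-toxic boundary."""
    return float(embedding @ w + b)
```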

The SASA system re-weights the sampling probabilities of candidate next tokens based on each token’s value and the partially generated phrase’s distance from the classifier boundary. The aim is to stay close to the original sampling distribution while steering away from toxic outputs. At each step, the model surveys its vocabulary for reasonable words given the preceding context; SASA then evaluates each candidate by how close the resulting partial sentence lies to the classifier boundary. Tokens that keep the sentence in the positive (non-toxic) space are encouraged, while those that push it into the negative (toxic) space are penalized.
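A minimal sketch of this re-weighting step is shown below, again as an assumption-laden illustration rather than the paper’s exact formulation: the combination of logits and classifier margins, and the strength parameter beta, are simplified stand-ins for whatever weighting SASA actually uses.

```python
# Illustrative sketch (not the exact SASA formulation): re-weight the LLM's
# next-token distribution by each candidate's margin to the toxicity boundary.
import numpy as np

def reweight_sampling(logits: np.ndarray,
                      margins: np.ndarray,
                      beta: float = 5.0) -> np.ndarray:
    """Combine next-token logits with per-token classifier margins.

    logits  -- original LLM scores for each candidate token
    margins -- signed distance to the non-toxic side of the boundary for the
               partial sentence extended by each candidate token
    beta    -- assumed knob trading off fluency against detoxification
    """
    adjusted = logits + beta * margins         # favor tokens in non-toxic space
    probs = np.exp(adjusted - adjusted.max())  # numerically stable softmax
    return probs / probs.sum()

# Example: three candidates; the second steers into the toxic half-space.
logits = np.array([2.0, 2.1, 0.5])
margins = np.array([0.8, -1.2, 0.3])
print(reweight_sampling(logits, margins))      # toxic candidate is down-weighted
```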

Researchers evaluated SASA against several baseline interventions using GPT-2 Large, Llama 2-7B, and Llama 3.1-8B-Instruct. The results showed that SASA significantly reduced toxic language generation, performing on par with state-of-the-art external reward model techniques. Stronger detoxification sometimes came at a cost to fluency, and SASA also proved effective in mitigating gender bias in language generation.

According to Ko, SASA presents a well-defined optimization problem, allowing for a balance between natural-sounding language generation and the reduction of unwanted language. She also suggests that SASA could be extended to incorporate multiple attributes in the future, such as truthfulness, helpfulness, and loyalty, leading to more positive, fair, and principle-aligned language.

The research was supported, in part, by the MIT-IBM Watson AI Lab and the National Science Foundation.
