New Method Efficiently Safeguards Sensitive AI Training Data

Data privacy often comes at a cost, especially when it comes to AI. Security techniques designed to protect sensitive user data from being extracted from AI models can sometimes reduce the accuracy of those models. However, researchers at MIT have made significant strides in addressing this challenge.

They’ve developed a framework based on a new privacy metric called PAC Privacy. This approach aims to maintain the performance of an AI model while ensuring that sensitive data, such as medical images or financial records, remains safe from potential attackers. Building on their initial work, the MIT team has now enhanced the technique to make it more computationally efficient, improve the accuracy-privacy tradeoff, and provide a formal template that can be applied to virtually any algorithm without needing access to its internal operations.

The team successfully utilized their refined version of PAC Privacy to privatize several classic algorithms used in data analysis and machine-learning tasks. This demonstrates the versatility and broad applicability of their framework.

A key finding from their research is that more “stable” algorithms are easier to privatize using their method. A stable algorithm is one whose predictions remain consistent even when slight modifications are made to its training data. This stability is crucial, as it helps the algorithm make more accurate predictions on previously unseen data. The researchers argue that the increased efficiency of the new PAC Privacy framework, coupled with the four-step template they’ve created, would make the technique more accessible and easier to implement in real-world scenarios.

According to Mayuri Sridhar, an MIT graduate student and lead author of the paper on this privacy framework, there is often a perceived conflict between robustness, privacy, and high-performance algorithms. Sridhar suggests that making an algorithm perform better in diverse settings can essentially provide privacy as a byproduct.

The research team also includes Hanshen Xiao, who will be joining Purdue University as an assistant professor, and Srini Devadas, the Edwin Sibley Webster Professor of Electrical Engineering at MIT. Their findings were presented at the IEEE Symposium on Security and Privacy.

One of the core strategies for protecting sensitive data used to train AI models is adding noise, or generic randomness, to the model. This makes it more difficult for adversaries to infer the original training data. However, this noise can also reduce the model’s accuracy. PAC Privacy addresses this by automatically estimating the minimum amount of noise needed to achieve a desired level of privacy.
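
To make that tradeoff concrete, here is a rough illustration, not the MIT implementation: releasing a simple statistic with increasing amounts of Gaussian noise makes the underlying records harder to infer but also pushes the released answer further from the true value.

```python
# Illustrative sketch of output perturbation on a toy statistic (an
# assumption for exposition, not the researchers' method): more noise
# hides the data better but costs accuracy.
import numpy as np

rng = np.random.default_rng(0)
sensitive_values = rng.normal(loc=50.0, scale=5.0, size=1_000)  # toy "training data"
true_answer = sensitive_values.mean()

for noise_scale in (0.0, 0.1, 1.0, 5.0):
    noisy_answer = true_answer + rng.normal(0.0, noise_scale)
    print(f"noise scale {noise_scale:>4}: released {noisy_answer:.3f} "
          f"(error {abs(noisy_answer - true_answer):.3f})")
```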

The original PAC Privacy algorithm operates by running an AI model multiple times on different samples of a dataset. By measuring the variance and correlations among the outputs, it estimates the necessary amount of noise. The new variant of PAC Privacy simplifies this process by only requiring the output variances, significantly reducing computational demands and enabling scaling to larger datasets.
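
The sketch below illustrates the general recipe described above, under the assumption of a simple subsampling scheme and a placeholder calibration constant (`noise_multiplier`); it is not the published PAC Privacy estimator or its formal guarantee.

```python
# Hedged sketch: run the algorithm on many random subsamples, measure how
# much its output varies per coordinate (variances only, no full covariance),
# and scale the added Gaussian noise to that observed spread.
import numpy as np

def estimate_output_std(algorithm, data, n_trials=100, subsample_frac=0.5, seed=0):
    """Empirical per-coordinate std of the algorithm's output across subsamples."""
    rng = np.random.default_rng(seed)
    outputs = []
    for _ in range(n_trials):
        idx = rng.choice(len(data), size=int(subsample_frac * len(data)), replace=False)
        outputs.append(algorithm(data[idx]))
    return np.std(np.stack(outputs), axis=0)

def privatize(algorithm, data, noise_multiplier=3.0, seed=1):
    """Release the algorithm's output with noise scaled to its observed instability."""
    rng = np.random.default_rng(seed)
    sigma = estimate_output_std(algorithm, data)
    output = algorithm(data)
    return output + rng.normal(0.0, noise_multiplier * sigma, size=output.shape)

data = np.random.default_rng(42).normal(size=(2_000, 5))
print(privatize(lambda d: d.mean(axis=0), data))  # privatized column means
```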

Sridhar explains that because the estimation is much smaller than the entire covariance matrix, the process is much faster. Furthermore, while the original algorithm was limited to adding isotropic noise, the new variant estimates anisotropic noise, which is tailored to specific characteristics of the training data. This allows users to add less overall noise while achieving the same level of privacy, thereby boosting the accuracy of the privatized algorithm.
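
A small, hedged illustration of why anisotropic noise can help: when the output's coordinates have very different spreads, isotropic noise must be sized for the worst coordinate, whereas per-coordinate noise can be much smaller in total. The scaling rule below is purely illustrative, not the paper's optimization.

```python
# Isotropic vs. anisotropic noise on a toy output with uneven spread.
# The per-coordinate stds are made-up numbers for illustration only.
import numpy as np

per_coordinate_std = np.array([0.01, 0.02, 0.05, 1.50])  # uneven output spread

isotropic_sigma = np.full_like(per_coordinate_std, per_coordinate_std.max())
anisotropic_sigma = per_coordinate_std.copy()

print("isotropic total noise   :", np.linalg.norm(isotropic_sigma))
print("anisotropic total noise :", np.linalg.norm(anisotropic_sigma))
```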

Sridhar hypothesized that more stable algorithms would be easier to privatize with this technique, and she used PAC Privacy to test the idea on several classical algorithms. The results indicated that algorithms with less variance in their outputs when their training data changes slightly were indeed easier to privatize. By employing stability techniques to decrease variance, the amount of noise needed for privatization can be reduced, potentially leading to “win-win” scenarios where both privacy and performance are enhanced.
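
The stability intuition can be shown with a toy experiment (an assumption-laden sketch, not the paper's evaluation): across random subsamples, a stable statistic such as the median varies far less than an unstable one such as the maximum, so a variance-calibrated scheme would need less noise to release it.

```python
# Compare the output variability of a stable vs. an unstable statistic
# across random subsamples of the same dataset (illustrative only).
import numpy as np

rng = np.random.default_rng(7)
data = rng.lognormal(mean=0.0, sigma=1.0, size=5_000)

def output_std(statistic, n_trials=200, subsample=2_500):
    outs = [statistic(rng.choice(data, size=subsample, replace=False))
            for _ in range(n_trials)]
    return np.std(outs)

print("median (stable)   output std:", output_std(np.median))
print("max    (unstable) output std:", output_std(np.max))
```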

The team demonstrated that their privacy guarantees remained robust across different algorithms and that the new PAC Privacy variant required significantly fewer trials to estimate the noise. Attack simulations further validated the method’s ability to withstand state-of-the-art attacks.

Devadas emphasizes the importance of co-designing algorithms with PAC Privacy to enhance stability, security, and robustness from the outset. The researchers plan to extend their method to more complex algorithms and further investigate the privacy-utility tradeoff.

Xiangyao Yu, an assistant professor at the University of Wisconsin at Madison, who was not involved in the study, highlights PAC Privacy’s key advantage: its black-box nature, which automates the privatization process without requiring manual analysis of individual queries.

This research is supported by Cisco Systems, Capital One, the U.S. Department of Defense, and a MathWorks Fellowship.
