A team of computer scientists at UC Riverside has developed a method to erase private and copyrighted data from artificial intelligence models without requiring access to the original training data.
This advance, detailed in a paper posted on the arXiv preprint server* and presented at the 42nd International Conference on Machine Learning (ICML) in Vancouver, Canada, addresses a growing global concern: personal and copyrighted material can remain inside AI models indefinitely, and thus stay accessible to model users, despite efforts by the original creators to delete their information or guard it with paywalls and passwords.
Forgetting Without Retraining
The UCR innovation compels AI models to "forget" selected information while preserving their functionality on the remaining data. It is a significant advance because it can amend models without retraining them from scratch on the voluminous original training data, a process that is costly and energy-intensive. The approach also enables the removal of private information from AI models even when the original training data is no longer available.
"In real-world situations, you can't always go back and get the original data," said Ümit Yiğit Başaran, a UCR electrical and computer engineering doctoral student and lead author of the study. "We've created a certified framework that works even when that data is no longer available."
Privacy and Copyright Concerns
The need is pressing. Tech companies face new privacy laws, such as the European Union's General Data Protection Regulation and California's Consumer Privacy Act, which govern the security of personal data embedded in large-scale machine learning systems.
Moreover, The New York Times is suing OpenAI and Microsoft over the use of its copyrighted articles to train Generative Pre-trained Transformer, or GPT, models.
AI models "learn" the patterns of words from vast amounts of text scraped from the internet. When queried, the models predict the most likely word combinations, generating natural-language responses to user prompts. Sometimes they produce near-verbatim reproductions of the training texts, allowing users to bypass the content creators' paywalls.
Certified Unlearning with Surrogate Data
The UC Riverside research team, comprising Başaran, professor Amit Roy-Chowdhury, and assistant professor Başak Güler, developed what they call “a certified unlearning method that does not require access to the original training data.” The technique allows AI developers to remove targeted data by using a substitute, or "surrogate," dataset that statistically resembles the original data.
The system adjusts the model's parameters and adds carefully calibrated random noise so that the removal of the targeted information carries an (ε, δ)-certified unlearning guarantee, with the noise level set according to the estimated distance between the source and surrogate data distributions.
Their framework builds on a concept in AI optimization that efficiently approximates how a model would change if it had been retrained from scratch. The UCR team enhanced this approach with a new noise-calibration mechanism that compensates for discrepancies between the original and surrogate datasets.
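For readers who want a concrete picture, the sketch below illustrates the general recipe under stated assumptions; it is not the authors' implementation. It assumes an L2-regularized logistic-regression model, approximates "retraining without the forgotten examples" with a single Newton-style update whose Hessian is estimated on surrogate data, and adds Gaussian noise whose scale depends on the privacy budget (ε, δ) and on a stand-in parameter `dist_bound` representing the assumed mismatch between the surrogate and the unavailable original data. All names and the noise formula are illustrative; the paper derives its own calibration.

```python
# Minimal sketch of certified unlearning with a surrogate dataset (NumPy only).
# Assumptions (not from the paper): L2-regularized logistic regression, a
# Newton-style removal update, and a noise scale driven by `dist_bound`, a
# user-supplied bound on the surrogate/source distribution gap.
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def grad_and_hessian(w, X, y, lam):
    """Gradient and Hessian of the regularized logistic loss on (X, y)."""
    p = sigmoid(X @ w)
    grad = X.T @ (p - y) / len(y) + lam * w
    S = p * (1.0 - p)
    hess = (X.T * S) @ X / len(y) + lam * np.eye(X.shape[1])
    return grad, hess

def unlearn_with_surrogate(w, X_forget, y_forget, X_surr, y_surr,
                           lam=0.1, eps=1.0, delta=1e-5, dist_bound=0.05,
                           rng=None):
    """Approximately remove (X_forget, y_forget) from the trained weights w.

    The Hessian is estimated on surrogate data because the original training
    set is assumed unavailable; Gaussian noise calibrated to (eps, delta) and
    to `dist_bound` masks the residual approximation error.
    """
    rng = np.random.default_rng() if rng is None else rng
    g_forget, _ = grad_and_hessian(w, X_forget, y_forget, lam)
    _, H_surr = grad_and_hessian(w, X_surr, y_surr, lam)
    # One Newton-style step approximating retraining without the forget set.
    w_new = w + np.linalg.solve(H_surr, g_forget) * (len(y_forget) / len(y_surr))
    # Stand-in noise calibration combining the privacy budget with the assumed
    # surrogate/source mismatch (the paper's formula differs).
    sigma = dist_bound * np.sqrt(2.0 * np.log(1.25 / delta)) / eps
    return w_new + rng.normal(0.0, sigma, size=w.shape)
```

In this toy version, larger values of `dist_bound` (a poorer surrogate) force more noise, trading model utility for a stronger removal guarantee, which mirrors the trade-off the researchers describe.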
Validation and Results
The researchers validated their method using both synthetic and real-world datasets. They found it provided utility and unlearning metrics comparable to those achieved with full retraining—while requiring far fewer resources.
The current work applies to classification and mixed-linear models, which are still widely used in practice. The team hopes that future refinements will extend to more complex systems.
Implications and Next Steps
Beyond regulatory compliance, the technique holds promise for media organizations, medical institutions, and others handling sensitive data embedded in AI models, the researchers said. It could also empower people to demand the removal of personal or copyrighted content from AI systems.
"People deserve to know their data can be erased from machine learning models, not just in theory, but in provable, practical ways," Güler said.
The team's next steps involve refining the method to work with more complex model types and datasets, as well as building tools to make the technology accessible to AI developers worldwide.
Study Details
The title of the paper is "A Certified Unlearning Approach without Access to Source Data." It was written in collaboration with Sk Miraj Ahmed, a computational science research associate at Brookhaven National Laboratory in Upton, New York. Both Roy-Chowdhury and Güler are faculty members in the Department of Electrical and Computer Engineering.

*Important notice: arXiv publishes preliminary scientific reports that are not peer-reviewed and, therefore, should not be regarded as definitive, used to guide development decisions, or treated as established information in the field of artificial intelligence research.
Journal reference:
Başaran, U. Y., Ahmed, S. M., & Güler, B. (2025). A Certified Unlearning Approach without Access to Source Data. arXiv. https://arxiv.org/abs/2506.06486