A team of computer scientists at UC Riverside has developed a method to erase private and copyrighted data from artificial intelligence models without requiring access to the original training data.
This advance, detailed in a paper posted on the arXiv preprint server* and presented at the 42nd International Conference on Machine Learning (ICML) in Vancouver, Canada, addresses a growing global concern: personal and copyrighted material can remain inside AI models indefinitely, and thus stay accessible to model users, despite efforts by the original creators to delete their information or guard it with paywalls and passwords.
Forgetting Without Retraining
The UCR innovation compels AI models to "forget" selected information while preserving their functionality on the remaining data. It is a significant advance because it can amend models without retraining them from scratch on the voluminous original training data, a process that is costly and energy-intensive. The approach also enables the removal of private information from AI models even when the original training data is no longer available.
"In real-world situations, you can't always go back and get the original data," said Ümit Yiğit Başaran, a UCR electrical and computer engineering doctoral student and lead author of the study. "We've created a certified framework that works even when that data is no longer available."
Privacy and Copyright Concerns
The need is pressing. Tech companies face new privacy laws, such as the European Union's General Data Protection Regulation and California's Consumer Privacy Act, which govern the security of personal data embedded in large-scale machine learning systems.
Moreover, The New York Times is suing OpenAI and Microsoft over the use of its copyrighted articles to train Generative Pre-trained Transformer, or GPT, models.
AI models "learn" the patterns of words from vast amounts of text scraped from the internet. When queried, the models predict the most likely word combinations, generating natural-language responses to user prompts. Sometimes they produce near-verbatim reproductions of the training texts, allowing users to bypass the content creators' paywalls.
Certified Unlearning with Surrogate Data
The UC Riverside research team, comprising Başaran, professor Amit Roy-Chowdhury, and assistant professor Başak Güler, developed what they call “a certified unlearning method that does not require access to the original training data.” The technique allows AI developers to remove targeted data by using a substitute, or "surrogate," dataset that statistically resembles the original data.
The system adjusts the model's parameters and adds carefully calibrated random noise so that the removal of the targeted information carries an (ε, δ)-certified unlearning guarantee, with the noise level set according to the estimated distance between the source and surrogate data distributions.
Their framework builds on a concept in AI optimization that efficiently approximates how a model would change if it had been retrained from scratch. The UCR team enhanced this approach with a new noise-calibration mechanism that compensates for discrepancies between the original and surrogate datasets.
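For readers who want a concrete picture, the sketch below illustrates the general recipe under stated assumptions; it is not the authors' implementation. It assumes an L2-regularized logistic-regression model, approximates "retraining without the forgotten examples" with a single Newton-style update whose Hessian is estimated on surrogate data, and adds Gaussian noise whose scale depends on the privacy budget (ε, δ) and on a stand-in parameter `dist_bound` representing the assumed mismatch between the surrogate and the unavailable original data. All names and the noise formula are illustrative; the paper derives its own calibration.

```python
# Minimal sketch of certified unlearning with a surrogate dataset (NumPy only).
# Assumptions (not from the paper): L2-regularized logistic regression, a
# Newton-style removal update, and a noise scale driven by `dist_bound`, a
# user-supplied bound on the surrogate/source distribution gap.
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def grad_and_hessian(w, X, y, lam):
    """Gradient and Hessian of the regularized logistic loss on (X, y)."""
    p = sigmoid(X @ w)
    grad = X.T @ (p - y) / len(y) + lam * w
    S = p * (1.0 - p)
    hess = (X.T * S) @ X / len(y) + lam * np.eye(X.shape[1])
    return grad, hess

def unlearn_with_surrogate(w, X_forget, y_forget, X_surr, y_surr,
                           lam=0.1, eps=1.0, delta=1e-5, dist_bound=0.05,
                           rng=None):
    """Approximately remove (X_forget, y_forget) from the trained weights w.

    The Hessian is estimated on surrogate data because the original training
    set is assumed unavailable; Gaussian noise calibrated to (eps, delta) and
    to `dist_bound` masks the residual approximation error.
    """
    rng = np.random.default_rng() if rng is None else rng
    g_forget, _ = grad_and_hessian(w, X_forget, y_forget, lam)
    _, H_surr = grad_and_hessian(w, X_surr, y_surr, lam)
    # One Newton-style step approximating retraining without the forget set.
    w_new = w + np.linalg.solve(H_surr, g_forget) * (len(y_forget) / len(y_surr))
    # Stand-in noise calibration combining the privacy budget with the assumed
    # surrogate/source mismatch (the paper's formula differs).
    sigma = dist_bound * np.sqrt(2.0 * np.log(1.25 / delta)) / eps
    return w_new + rng.normal(0.0, sigma, size=w.shape)
```

In this toy version, larger values of `dist_bound` (a poorer surrogate) force more noise, trading model utility for a stronger removal guarantee, which mirrors the trade-off the researchers describe.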
Validation and Results
The researchers validated their method using both synthetic and real-world datasets. They found it provided utility and unlearning metrics comparable to those achieved with full retraining—while requiring far fewer resources.
The current work applies to classification and mixed-linear models, which are still widely used in practice. The team hopes that future refinements will extend to more complex systems.
Implications and Next Steps
Beyond regulatory compliance, the technique holds promise for media organizations, medical institutions, and others handling sensitive data embedded in AI models, the researchers said. It could also empower people to demand the removal of personal or copyrighted content from AI systems.
"People deserve to know their data can be erased from machine learning models, not just in theory, but in provable, practical ways," Güler said.
The team's next steps involve refining the method to work with more complex model types and datasets, as well as building tools to make the technology accessible to AI developers worldwide.
Study Details
The title of the paper is "A Certified Unlearning Approach without Access to Source Data." It was written in collaboration with Sk Miraj Ahmed, a computational science research associate at Brookhaven National Laboratory in Upton, New York. Both Roy-Chowdhury and Güler are faculty members in the Department of Electrical and Computer Engineering.

*Important notice: arXiv publishes preliminary scientific reports that are not peer-reviewed and, therefore, should not be regarded as definitive, used to guide development decisions, or treated as established information in the field of artificial intelligence research.
Journal reference:
Başaran, U. Y., Ahmed, S. M., & Güler, B. (2025). A Certified Unlearning Approach without Access to Source Data. arXiv. https://arxiv.org/abs/2506.06486