
Can General-Purpose AI Coexist with the Current Data Protection Regulations?


Author:

Tainá Baylao - Lawyer specializing in Data Protection & AI Governance


Abstract


ChatGPT, Gemini, Claude, Llama. For the past three years, general-purpose AI models have taken society by storm.

These technologies are now integrated into almost every business model and are even being used as a new type of “Google” for information search. Despite their rapid evolution and growing relevance, they are not exempt from complying with regulations. This article focuses on the challenges that arise at the intersection of AI technology and data protection regulations.


These will be divided into the following categories:


●       Purpose limitation and data minimization;

●       Transparency;

●       Accuracy;

●       Storage limitation.

 

Each of these will be analyzed and evaluated primarily under the scope of the European Union’s General Data Protection Regulation (GDPR), as it is a pioneering regulation in the field and one of the strictest. The GDPR is widely considered the gold standard for data protection worldwide.


Purpose Limitation and Data Minimization


It is no news that artificial intelligence (AI) models require vast amounts of text, images, and videos to be trained; this is even recognized in Recital 105 of the European Union Artificial Intelligence Act (EU AI Act). To predict well, a model needs to have seen enough examples and variations of context. If the dataset happens to be too small, the model may overfit (memorize the data instead of generalizing from it) (1), which leads to problems when it faces new situations, as it cannot adapt properly.
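
To make the overfitting point concrete, here is a minimal Python sketch (assuming numpy and scikit-learn are available; the toy dataset and polynomial degrees are invented for illustration) showing how a high-capacity model memorizes a tiny training set but performs poorly on unseen data:

import numpy as np
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error

rng = np.random.default_rng(0)

# A tiny training set: a simple underlying trend plus noise.
X_train = rng.uniform(0, 1, size=(8, 1))
y_train = np.sin(2 * np.pi * X_train).ravel() + rng.normal(0, 0.1, 8)

# Fresh, unseen data drawn from the same distribution.
X_test = rng.uniform(0, 1, size=(100, 1))
y_test = np.sin(2 * np.pi * X_test).ravel() + rng.normal(0, 0.1, 100)

for degree in (3, 15):
    model = make_pipeline(PolynomialFeatures(degree), LinearRegression())
    model.fit(X_train, y_train)
    train_err = mean_squared_error(y_train, model.predict(X_train))
    test_err = mean_squared_error(y_test, model.predict(X_test))
    # The degree-15 model typically drives training error to near zero by
    # memorizing the 8 points, while its error on new data is much larger.
    print(f"degree={degree:>2}  train MSE={train_err:.4f}  test MSE={test_err:.4f}")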


This is a real issue when it comes to data protection regulations. The GDPR stipulates that personal data is only to be collected for specific and explicit purposes (purpose limitation).

This means that there needs to be a clear intention and reason behind the collection of data, which is not the case when data is collected indiscriminately from all over the web to train general-purpose models. The developers of those models might have certain use cases in mind, but since the idea is that these models serve very broad use cases and could answer questions about virtually anything, how can one define the purpose of such processing? (2)


On the same note, the GDPR states that one can only process personal data if said data is adequate, relevant, and necessary for the defined purpose (data minimization). Now, how can one define what is necessary without even being able to define the reason for having that data? Where does one draw the line? Collecting data simply as a “nice-to-have” or because “maybe someone will ask that question” is not a good enough reason to satisfy this principle.


To mitigate this risk, CNIL (3) (the French data protection authority) recommends some general boundary measures to avoid the collection of unnecessary data (a code sketch illustrating two of these measures follows the list):


●       Define, upstream, the precise collection criteria;

●       Exclude certain categories of data from the collection (e.g. bank details, geolocation, health data, children’s data, pseudonyms);

●       Filter out websites that contain content that is not relevant to the defined purposes (e.g. pornographic websites, patient discussion forums, websites in languages not supported by the AI model);

●       Exclude from the scope sites that clearly oppose the harvesting of their content (e.g. by respecting CAPTCHAs and exclusion protocols like robots.txt or ai.txt (4));

●       Exclude private profiles on social media networks and content under a paywall;

●       Use anonymized, pseudonymized or synthetic data as an alternative if possible.
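
As a rough illustration, the following Python sketch (standard library only) shows how a crawler could apply two of these measures upstream: respecting the robots.txt exclusion protocol and excluding predefined data categories. The bot name, excluded domains, and redaction patterns are hypothetical placeholders, not values prescribed by CNIL.

import re
import urllib.robotparser
from urllib.parse import urlparse

# Hypothetical block list of sources outside the defined purposes
# (e.g. patient discussion forums, pornographic websites).
EXCLUDED_DOMAINS = {"patients-forum.example.org", "adult.example.com"}

# Illustrative patterns for data categories to exclude from collection.
EXCLUDED_DATA_PATTERNS = [
    re.compile(r"\b\d{16}\b"),                              # naive bank card number
    re.compile(r"-?\d{1,3}\.\d{3,},\s*-?\d{1,3}\.\d{3,}"),  # naive geolocation pair
]

def may_collect(url: str, user_agent: str = "example-trainer-bot") -> bool:
    """Collect only if the site is in scope and does not oppose harvesting."""
    parsed = urlparse(url)
    if parsed.hostname in EXCLUDED_DOMAINS:
        return False
    robots = urllib.robotparser.RobotFileParser()
    robots.set_url(f"{parsed.scheme}://{parsed.netloc}/robots.txt")
    try:
        robots.read()
    except OSError:
        return False  # exclusion preferences cannot be verified: skip the site
    return robots.can_fetch(user_agent, url)

def strip_excluded_data(page_text: str) -> str:
    """Remove excluded data categories before a page enters the corpus."""
    for pattern in EXCLUDED_DATA_PATTERNS:
        page_text = pattern.sub("[EXCLUDED]", page_text)
    return page_text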


Transparency


On the topic of transparency, controllers are required to inform individuals about how their data is being collected. This is often done via a Privacy Notice on a website or an application (app), and it can also be done through direct interaction, such as checkboxes and pop-ups. However, the situation is considerably more complicated when there is no direct relationship between the data subject and the controller and the personal data has not been obtained directly from those individuals, as in the case of web scraping. (5)


One may argue that obtaining such a vast pool of personal data from various sources renders such transparency impossible, or a disproportionate effort for the LLM provider. It could even be that the information collected does not include the contact details necessary to reach the data subjects. Nevertheless, this argument does not hold up, as one must be careful not to interpret the GDPR in a manner that distorts its intended protective purpose.


If the model developers cannot contact individuals directly to inform them about how their data is being processed, they should make efforts to make this information publicly available. This includes:

●       Publishing a transparent Privacy Notice in plain language, including legal design elements (e.g. icons, explanatory videos, storytelling) and Frequently Asked Questions (FAQ);

●       Communicating in mass media about the vast data collection;

●       Creating a trust center containing the main sources used for data collection.


Accuracy


On the topic of accuracy, we also run into a hurdle.

The GDPR expects all personal data processed to be accurate, and where it is not, individuals can even request its correction. However, since AI works on a probabilistic basis, with the model essentially “guessing” the next best word to complete a sentence, it often gets things wrong: it “hallucinates”.

Several incidents of this kind have been reported in the media over the past few years, including individuals being accused of crimes they did not commit or falsely reported as deceased. (6)


It could also be that the AI generates wrong results simply because it was trained on incorrect data: “garbage in, garbage out”. When collecting data from so many different sources and on such a large scale, data sanitization can demand considerable time and effort.


It is also common for data sources to update their data after a while, but such updates will not be reflected in the AI model unless it is retrained.

It could even be that attempts to correct a specific data point fail. For example, one may alter the answer to the prompt “The last name of the president of the USA is?” from “Trump” to “Macron”. However, if someone asks through a different prompt, “Americans elected a president with the last name”, the model might still revert to the original answer, “Trump”. (7)
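
Why such point-fixes fail becomes clearer with a toy Python sketch of next-token prediction; the prompts mirror the example above, and all probabilities are invented purely for illustration:

# The model stores no single editable "fact"; it ranks continuations by
# learned probability, separately for each phrasing. Numbers are invented.
continuations = {
    "The last name of the president of the USA is":
        {"Macron": 0.90, "Trump": 0.08},  # this phrasing was patched
    "Americans elected a president with the last name":
        {"Trump": 0.85, "Biden": 0.10},   # a rephrasing still follows the training data
}

for prompt, token_probs in continuations.items():
    answer = max(token_probs, key=token_probs.get)  # greedy decoding
    print(f"{prompt} -> {answer}")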


Therefore, for this specific issue, there are no good alternatives available yet, as scientists and scholars struggle to refine and improve the technology. The models’ probabilistic nature, however, cannot be fully overcome.


Storage Limitation


The GDPR stipulates that personal data shall be kept only as long as necessary to fulfill the defined purpose of processing. Once that retention period is reached, personal data must be deleted or kept in a manner in which it is no longer possible to identify an individual (anonymization). It could also be that an individual no longer wants their personal data to be processed and requests its deletion.

However, when complying with this principle and fulfilling such requests, the same issue seen with accuracy arises: the way the technology is structured makes it difficult, if not impossible, to remove data from an already trained model.


The ideal approach would be to re-train the model, fine-tune it, and apply machine unlearning techniques. Still, the computational costs and effort involved are prohibitive to implement for every single request.

That is why machine unlearning techniques must be further developed and refined.


CNIL also issued guidelines emphasizing that the development of technical solutions is encouraged, not only to fulfill deletion and correction requests, but also to provide the possibility of exercising the right to object upstream of data collection. “Push list” mechanisms could be implemented, making it possible to respect individuals’ objections by refraining from collecting their data in the first place. (8)
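
In code, such a mechanism could be as simple as the following Python sketch; the registry contents and the use of domains as identifiers are hypothetical assumptions, not part of CNIL’s guidance:

from urllib.parse import urlparse

# Hypothetical "push list": identifiers registered by individuals or site
# owners who object to the collection of their data.
OPT_OUT_REGISTRY = {"jane-doe.example.org", "blog.example.net"}

def collection_permitted(source_domain: str) -> bool:
    """Consult the opt-out registry before any data is collected."""
    return source_domain not in OPT_OUT_REGISTRY

urls = ["https://blog.example.net/post-1", "https://news.example.com/item-9"]
in_scope = [u for u in urls if collection_permitted(urlparse(u).hostname)]
print(in_scope)  # only sources that did not register an objection remain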


In the meantime, companies have been adopting an alternative approach: filtering personal data to prevent it from being displayed as an output. While this is a valid and necessary mitigation measure, it is not enough to comply with the law, because at the end of the day the personal data remains inside the model, and a rephrased question could still trigger its appearance again.
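
In practice, such an output filter is often just pattern matching over the generated text, as in this minimal Python sketch (the regular expressions are simplistic, illustrative stand-ins for production-grade detectors):

import re

# Illustrative detectors for personal data in model output.
OUTPUT_FILTERS = {
    "email": re.compile(r"[\w.+-]+@[\w-]+\.[\w.-]+"),
    "phone": re.compile(r"\+?\d[\d\s().-]{7,}\d"),
}

def redact(generated_text: str) -> str:
    """Mask matches before the answer is shown; the model weights that
    produced them remain unchanged, so a rephrased prompt may still
    surface the same data in a form these patterns do not catch."""
    for label, pattern in OUTPUT_FILTERS.items():
        generated_text = pattern.sub(f"[{label} removed]", generated_text)
    return generated_text

print(redact("Reach Jane at jane.doe@example.com or +49 151 2345678."))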


Conclusion


The widespread adoption of general-purpose AI models indeed presents a significant challenge to existing data protection frameworks, particularly the GDPR. The core friction points revolve around the inherent design of these models, which often conflicts with data protection principles. While some efforts have been made to mitigate them, through guidelines issued by data protection authorities and alternative approaches taken by AI developers, the fundamental probabilistic nature of AI and the sheer scale of data involved make full compliance difficult.


The ongoing struggle to reconcile AI’s operational demands with robust data protection underscores the critical need for continued innovation in areas like machine unlearning, the development of pragmatic interpretations, and effective practical measures.


References


(1) European Cloud Federation, "Overfitting: A Conversation Starter for the Next Summer Party"; European Commission, Deep Dive into Artificial Intelligence and Data Ecosystems – Fundamental Rights, Ethics and Data Protection.


(2) Processing means any operation or set of operations performed on personal data, such as collection, recording, organization, structuring, storage, adaptation, retrieval, consultation, use, disclosure, dissemination, combination, or erasure.


(3) CNIL, 2025. "La base légale de l’intérêt légitime : fiche focus sur les mesures à prendre en cas de collecte des données par moissonnage (web scraping)" [The legal basis of legitimate interest: focus sheet on the measures to take when collecting data by web scraping].


(4) Transparency Coalition, 2025. "TCAI Urges Adoption of “Do Not Train” Data and Training Data Request Prompts"


(5) Web scraping refers to the automated, often indiscriminate collection of large volumes of online data, including personal data, using bots.


(6) Jason Nelson, 2023. "ChatGPT wrongly accuses law professor", Yahoo! Finance.


(7) Henrik Nolte, Michèle Finck and Kristof Meding, 2025. "Machine Learners Should Acknowledge the Legal Implications of Large Language Models as Personal Data" Available at: https://arxiv.org/abs/2503.01630.


(8) CNIL (n 3).


Biography of the Guest Expert


Tainá is a Brazilian- and Portuguese-qualified lawyer, specialized in the areas of data protection and AI governance. In her constant pursuit of knowledge, she has obtained an LL.M. in IP & IT Law from the University of Göttingen (Germany) and another in Privacy, Cybersecurity and Data Management from Maastricht University (Netherlands).


She also holds several privacy certifications from the IAPP, including FIP, CIPP/E, CIPM, CIPT, AIGP, and CDPO/BR.

