
The Paradox of Synthetic Data: Legal and Ethical Risks

  • Writer: Roser Almenar
  • Oct 21
  • 8 min read

Author:

Roser Almenar - PhD Candidate in AI & Space Law, University of Valencia (Spain)

International Telecommunication Union (ITU) Secretary-General’s Youth Advisory Board Member


Abstract


This article explores the paradoxical role of synthetic data in the context of algorithmic discrimination. While synthetic data is increasingly promoted as a means to mitigate bias, protect privacy, and foster more equitable Artificial Intelligence (AI) systems, its deployment raises complex legal and ethical concerns. Issues of transparency, accountability, and compliance with international Human Rights frameworks and data protection law, including the GDPR, remain unresolved.

This examination argues that synthetic data should not be viewed as a purely technical fix but as a socio-legal challenge that requires robust governance, independent oversight, and alignment with democratic principles. Ultimately, the future of trustworthy AI will depend on how effectively policymakers, technologists, and civil society manage the double-edged potential of synthetic data as both a tool for fairness and a source of new risks.

 

Rethinking Fairness in AI: Why Synthetic Data Matters


As AI increasingly shapes decisions that affect people’s daily lives, concerns about algorithmic bias and its harmful consequences have become impossible to ignore. Whether in automated hiring platforms, credit scoring systems, or predictive policing tools, biased algorithms risk perpetuating inequality and seriously infringing upon Human Rights. These harms often originate in the datasets used to train AI, which reflect historical patterns of discrimination and exclusion.


To counter this, one proposed solution gaining traction is the use of synthetic data, that is, artificially generated datasets designed to replicate the statistical properties of real-world data while reducing the presence of sensitive or biased information. By offering new ways to rebalance datasets and protect privacy, synthetic data is increasingly viewed as a promising tool for promoting fairness in AI systems.


Yet, the use of synthetic data also introduces pressing challenges. Questions about transparency, accountability, and the protection of Human Rights remain unresolved, raising doubts about whether synthetic data can truly deliver on its promise without stronger governance. This article examines the potential benefits of synthetic data in addressing algorithmic bias, and the regulatory challenges that must be considered to ensure that its adoption aligns with democratic values and legal safeguards.


Understanding Synthetic Data in the Context of AI Governance


Synthetic data refers to information that is not collected from real individuals but rather generated through computational methods, including algorithms, simulations, or generative models such as Generative Adversarial Networks (GANs).

As defined by the U.S. National Institute of Standards and Technology (NIST), synthetic data generation can be described as “a process in which seed data are used to create artificial data that have some of the statistical characteristics of the seed data.” 

This characteristic positions synthetic data as a distinctive resource within the broader debate on data governance and rights protection.
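The NIST-style notion quoted above, in which seed data are used to create artificial data sharing some of their statistical characteristics, can be illustrated with a deliberately simple sketch. Here a Gaussian model is fitted to hypothetical seed records and new records are sampled from it; the Gaussian assumption and the invented attributes (income, age) are illustrative only, and real generators such as GANs are far more sophisticated.

```python
# Minimal illustration of synthetic data generation: fit a simple
# statistical model to "seed" records, then sample artificial records
# that approximately reproduce the seed data's statistics.
# The Gaussian model and the attributes are illustrative assumptions.
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical seed data: 200 records with two correlated numeric
# attributes (e.g. income in EUR and age in years).
seed = rng.multivariate_normal(
    [50_000, 40], [[1e8, 3e4], [3e4, 100]], size=200
)

# Fit the statistical characteristics of the seed data.
mean = seed.mean(axis=0)
cov = np.cov(seed, rowvar=False)

# Generate artificial records with (approximately) those characteristics:
# no record corresponds to a real individual, yet aggregate statistics
# resemble the seed data.
synthetic = rng.multivariate_normal(mean, cov, size=1000)

print(np.round(mean, 1), np.round(synthetic.mean(axis=0), 1))
```

The synthetic records carry the seed data's aggregate patterns without being copies of individual seed records, which is precisely the property that motivates both the privacy claims and the representativeness concerns discussed below.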


One of its primary advantages lies in its capacity to enhance privacy protection. Because synthetic data is artificially produced, it significantly reduces the risk of exposing identifiable personal information, thereby mitigating potential infringements upon data protection frameworks. Additionally, synthetic data can contribute to addressing structural inequalities by allowing data engineers to rebalance datasets and ensure a fairer representation of groups historically underrepresented in real-world data. Equally important is its scalability: synthetic datasets can be generated in large volumes, offering valuable opportunities for testing and training AI models without the same legal or ethical constraints associated with the processing of real personal data.
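The rebalancing capacity described above can be sketched as follows: an underrepresented group is topped up with synthetic records until both groups are equally sized. The resample-with-noise generator and the group sizes are illustrative assumptions, not a recommended production method.

```python
# A minimal sketch of dataset rebalancing with synthetic records.
# Group sizes and the noisy-resampling generator are illustrative
# assumptions; real rebalancing pipelines use more principled methods.
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical dataset: 90 records for group A, only 10 for group B.
group_a = rng.normal(0.0, 1.0, size=(90, 3))
group_b = rng.normal(0.5, 1.0, size=(10, 3))

# Generate synthetic group-B records by resampling real ones and
# perturbing them slightly, until both groups are the same size.
needed = len(group_a) - len(group_b)
samples = group_b[rng.integers(0, len(group_b), size=needed)]
synthetic_b = samples + rng.normal(0.0, 0.1, size=samples.shape)

balanced_b = np.vstack([group_b, synthetic_b])
print(len(group_a), len(balanced_b))  # both groups now equally sized
```

Note that the synthetic records inherit whatever patterns the ten real group-B records happened to contain, which foreshadows the representativeness concerns raised in the next paragraph.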


Nevertheless, the nature of synthetic data (constructed rather than observed) raises critical questions concerning its representativeness and authenticity. If not carefully designed and validated, synthetic datasets may inadvertently reproduce existing biases or fail to capture the complexity of social realities.

These limitations highlight the need for rigorous evaluation and legal scrutiny to determine under what conditions synthetic data can legitimately be integrated into decision-making systems that carry significant consequences for individuals and society.

 

Regulatory and Governance Challenges


Although synthetic data presents considerable technical advantages, its integration into AI systems cannot be understood in isolation from the regulatory frameworks that govern data protection, equality, and non-discrimination. The legal implications of synthetic data extend beyond mere technical considerations and touch upon fundamental questions of transparency, accountability, and the protection of Human Rights.


One of the most pressing concerns is transparency. 

As a foundational principle of trustworthy AI, transparency requires that the processes by which datasets are generated and applied are open to scrutiny and verification. Synthetic data, however, is often produced through opaque or proprietary methods that resist external evaluation. Without clear disclosure of how synthetic datasets are created, stakeholders cannot reliably assess their quality, nor identify potential biases embedded in the generation process. While the newly adopted EU AI Act introduces explicit transparency obligations for high-risk AI systems, it remains unclear how these requirements will apply to synthetic datasets, particularly when they are used as substitutes for, or complements to, real-world personal data.


Equally significant is the question of accountability.

Effective governance requires clear lines of responsibility when biased outcomes or harms occur. The involvement of synthetic data complicates this chain of responsibility: liability might rest with the entity generating the dataset, the developer of the AI model, or the final deployer of the system. In the absence of legal clarity, such ambiguity risks producing accountability gaps, which undermine the right to redress for individuals whose fundamental rights are adversely affected.


Moreover, synthetic data must be situated within the existing framework of data protection law, particularly the General Data Protection Regulation (GDPR). Although synthetic data is often described as “privacy-preserving,” its legal status under the GDPR is not straightforward. If synthetic datasets are generated in such a way that individuals can still be re-identified, either directly or indirectly, they fall within the definition of “personal data” under Article 4(1) GDPR, thereby triggering the full scope of compliance obligations. Even when synthetic data appears to be anonymized, the GDPR requires that anonymization be irreversible; any possibility of re-identification places the data in the category of pseudonymized information, subject to strict safeguards.

This illustrates that synthetic data cannot automatically be considered exempt from data protection rules, but must be carefully evaluated on a case-by-case basis.
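One elementary check that such a case-by-case evaluation might include is a nearest-neighbor comparison: if any synthetic record is (near-)identical to a real seed record, the dataset may still reveal information about real individuals and cannot be assumed anonymous. This is a toy illustration only; the distance threshold is an arbitrary assumption, and genuine disclosure-risk assessment under the GDPR is far more involved.

```python
# Toy re-identification check: flag synthetic records that coincide
# (within a tiny threshold) with real seed records. The threshold and
# the random data are illustrative assumptions, not a legal test.
import numpy as np

def closest_match_distances(real, synthetic):
    """For each synthetic record, distance to its nearest real record."""
    diffs = synthetic[:, None, :] - real[None, :, :]
    return np.sqrt((diffs ** 2).sum(axis=2)).min(axis=1)

rng = np.random.default_rng(2)
real = rng.normal(size=(100, 4))
synthetic = rng.normal(size=(100, 4))

# A "leaky" dataset that accidentally copies three real records.
leaky = np.vstack([synthetic, real[:3]])

threshold = 1e-6
print((closest_match_distances(real, synthetic) < threshold).sum())  # 0
print((closest_match_distances(real, leaky) < threshold).sum())      # 3
```

A dataset that fails even this crude check plainly cannot be treated as anonymous; passing it, of course, establishes nothing by itself.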


Finally, rights protection remains paramount. International frameworks, such as the Council of Europe’s recommendations on AI and Human Rights, emphasize that technological innovations must not circumvent established legal safeguards. Even in the absence of direct personal identifiers, synthetic datasets can be used to facilitate discriminatory profiling or expand surveillance practices, with significant implications for privacy, equality, and due process. Regulators must therefore ensure that synthetic data is not deployed as a means to evade compliance with existing legal protections but is instead embedded within governance structures that uphold democratic accountability and the primacy of Human Rights.

 

Moving Forward: Best Practices for Responsible Synthetic Data Use


In order to harness the potential of synthetic data while avoiding its most significant risks, both policymakers and developers must adhere to a set of best practices that integrate legal, ethical, and technical considerations. A first step is to conduct systematic Bias Impact Assessments (BIAs) to evaluate how synthetic datasets alter or reproduce bias dynamics within algorithmic systems. Such assessments allow for the early identification of discriminatory outcomes and facilitate compliance with equality and non-discrimination obligations.
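A sketch of one quantitative indicator such an assessment might report is the demographic parity gap, i.e. the difference in positive-outcome rates between groups, measured before and after a dataset intervention. The group labels and outcome counts below are hypothetical.

```python
# Minimal sketch of one metric a Bias Impact Assessment might report:
# the demographic parity gap between two groups. Groups and outcomes
# are hypothetical illustrations.
import numpy as np

def parity_gap(outcomes, groups):
    """Absolute gap in positive-outcome rate between groups A and B."""
    rate_a = outcomes[groups == "A"].mean()
    rate_b = outcomes[groups == "B"].mean()
    return abs(rate_a - rate_b)

groups = np.array(["A"] * 80 + ["B"] * 20)
# Group A: 60 positive / 20 negative (75%); group B: 5 / 15 (25%).
outcomes = np.array([1] * 60 + [0] * 20 + [1] * 5 + [0] * 15)

print(round(parity_gap(outcomes, groups), 2))  # 0.5
```

A single metric like this is, of course, only one input to a BIA; the legal assessment of whether an observed gap amounts to unlawful discrimination remains a separate, context-dependent question.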


Equally important is the establishment of independent auditing mechanisms. External verification of synthetic data generation and deployment ensures not only methodological rigor but also public accountability, particularly where systems are used in contexts with high stakes for individual rights. Independent audits can serve as a safeguard against opacity and provide regulators with the evidence necessary to enforce compliance with frameworks such as the GDPR or the EU AI Act.


A third practice relates to stakeholder engagement.


The participation of affected communities in the design, testing, and evaluation of synthetic data solutions ensures that the values of inclusivity and democratic oversight are embedded in technological development. Such stakeholder engagement strengthens the legitimacy of AI systems and fosters social trust.


Finally, the principles of proportionality and necessity, well established in European fundamental rights jurisprudence, should guide the use of synthetic data. Policymakers and developers must ensure that synthetic datasets are employed only where their use is genuinely required and proportionate to the specific problem at hand. This prevents the unnecessary expansion of synthetic data applications into domains where traditional safeguards or real-world data may be more appropriate.

 

Conclusion


Synthetic data holds significant transformative potential. Properly designed and deployed, it can contribute to the mitigation of algorithmic bias, strengthen privacy protection, and foster the development of more equitable and socially responsive AI systems. Nonetheless, in the absence of robust governance structures, synthetic data also carries the risk of becoming a tool that perpetuates opacity, undermines accountability, and creates new threats to the protection of Human Rights.


Addressing this ambivalence requires a governance approach that is both comprehensive and multi-stakeholder.

Regulators must provide clear and enforceable legal standards; technologists must ensure that technical design choices align with ethical and rights-based principles; and civil society must be empowered to exercise oversight and advocate for those most vulnerable to algorithmic harms. Only through this collaborative framework can synthetic data be integrated into AI ecosystems in a manner consistent with democratic values and fundamental rights protections.


Ultimately, the trajectory of synthetic data will test our collective capacity to balance innovation with legal and ethical safeguards. The future of fair and trustworthy AI will depend not solely on technological progress, but on the normative choices we make today to ensure that emerging tools serve the public interest rather than erode the principles of transparency, accountability, and rights protection that underpin the rule of law.


Bibliography


Deng, H., 2023. “Exploring Synthetic Data for Artificial Intelligence and Autonomous Systems: A Primer”. Geneva, Switzerland: UNIDIR, 32 pp. Available at: https://unidir.org/wpcontent/uploads/2023/11/UNIDIR_Exploring_Synthetic_Data_for_Artificial_Intelligence_and_Autonomous_Systems_A_Primer.pdf


Fonseca, J., and Bacao, F., 2023. “Tabular and latent space synthetic data generation: a literature review”. Journal of Big Data, 10, article 115, pp. 1-37. Available at: https://doi.org/10.1186/s40537-023-00792-7


Hradec, J., Di Leo, M., and Kotsev, A., 2024. “AI Generated Synthetic Data in Policy Applications”. European Commission, Ispra, JRC138521. Available at: https://publications.jrc.ec.europa.eu/repository/handle/JRC138521


Jordon, J. et al., 2024. “Synthetic Data – what, why and how?”, report commissioned by the Royal Society, 56 pp. Available at: https://royalsociety.org/-/media/policy/projects/privacy-enhancing-technologies/Synthetic_Data_Survey-24.pdf


Lee, P., 2025. “Synthetic Data and the Future of AI”. Cornell Law Review, vol. 110, pp. 1-74.


Shahul Hameed, M. A., Qureshi, A. M., and Kaushik, A., 2024. “Bias Mitigation via Synthetic Data Generation: A Review”. Electronics, 13(19), article 3909, 14 pp. Available at: https://doi.org/10.3390/electronics13193909


Toh, S.-L., and Park, J., 2025. “Fake It Till You Make It: Synthetic Data and Algorithmic Bias”. International Journal of Communication, 19 (Forum), pp. 1852-1858. Available at: https://ijoc.org/index.php/ijoc/article/view/24000


Biography of the Guest Expert


Roser Almenar is a PhD Candidate in AI & Space Law at the University of Valencia (Spain), and serves as a member of the ITU Secretary-General’s Youth Advisory Board, representing Europe. She is the current Co-Lead of the Space Law and Policy Project Group of the Space Generation Advisory Council (SGAC), where she also co-spearheads the “AI and Space Law” Research Group.


In addition, she is a Legal Research Officer at Space Court Foundation (SCF) and has contributed as a co-author to the recently published report on “Balancing Innovation and Responsibility: International Recommendations for AI Regulation in Space,” elaborated under the auspices of the International Institute of Space Law (IISL).


Her research work focuses on the disciplines of Space Law and Policy (dealing with remote sensing technologies, AI, and data protection, among others), Telecommunications Law, and the impact of technological advances on the protection of human rights from a Private Law perspective.

 
 
 
