Generative AI Test-Data Factory for Regulated Data Warehouses

Authors

  • Priya Dharshini Kalyanasundaram, Amazon, USA
  • Vasudevan Ananthakrishnan, Yakshna Solutions Inc, USA
  • Gayathri Salem Selvaraj, Amtech Analytics, USA

Keywords:

synthetic data, diffusion models, data privacy, regulated domains, AI model testing

Abstract

As AI and ML solutions are increasingly applied to regulated data, practitioners need ways to produce synthetic data that is both useful and private. This paper examines how diffusion-based generative models can be used to generate and evaluate synthetic data for regulated data warehouses under strict fidelity and policy-compliance measures, so that the models mimic production data distributions while meeting privacy and governance requirements.

Published

30-06-2022

How to Cite

[1]
Priya Dharshini Kalyanasundaram, Vasudevan Ananthakrishnan, and Gayathri Salem Selvaraj, “Generative AI Test-Data Factory for Regulated Data Warehouses”, American J Data Sci Artif Intell Innov, vol. 2, pp. 510–542, Jun. 2022, Accessed: Mar. 07, 2026. [Online]. Available: https://ajdsai.org/index.php/publication/article/view/82