Data-Centric AI in the Era of Large Volumes: Improving Model Outcomes through Data Quality Engineering
Keywords:
Data-Centric AI, Data Quality Engineering, Big Data
Abstract
The success of an AI system is determined not only by its algorithms but also, and even more importantly, by the quality of the data it uses. The volume, velocity, and variety of data have grown to the point where a purely model-centric approach is no longer sufficient. This article traces the growth of data-centric AI, which treats improvement of the data itself as the main lever for better model outcomes. At the core of this shift lies data quality engineering, a disciplined and often neglected practice that goes well beyond exploring datasets and running formal verification tests: it ensures that datasets are complete, consistent, and appropriate to their context before they are used in any modeling work. In high-volume environments that are prone to failure, minor inconsistencies can escalate into major problems, so preventative data hygiene becomes indispensable. The paper examines how integrating data quality pipelines, anomaly detection, labeling integrity checks, and feedback loops into AI workflows makes models more reliable while reducing retraining costs and shortening deployment cycles. Embedding data quality engineering into the development lifecycle helps organizations move from reacting to failures toward a culture of continuous improvement driven by trusted data. This shift allows teams to focus more on innovation and less on debugging. Ultimately, the paper argues that, as data volumes grow massively, a data-centric approach rooted in engineering discipline is the most effective way to achieve scalable and robust AI.
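As a rough illustration of the checks named in the abstract, the sketch below runs a completeness check, a labeling integrity check, and a simple anomaly check over a small batch of records before any modeling step. It is a minimal sketch under stated assumptions and is not taken from the paper: the function name `quality_report`, the record fields `text`, `label`, and `length`, and the use of a median-absolute-deviation score for anomaly detection are all hypothetical choices made for this example.

```python
from collections import defaultdict
from statistics import median


def quality_report(records, required, allowed_labels, numeric_field, threshold=3.5):
    """Return the indices of records that fail each (hypothetical) quality check."""
    report = {"incomplete": [], "bad_label": [], "conflicting_label": [], "anomalous": []}

    # Completeness: every required field must be present and non-empty.
    for i, rec in enumerate(records):
        if any(rec.get(field) in (None, "") for field in required):
            report["incomplete"].append(i)

    # Labeling integrity: labels must come from the allowed set, and the same
    # input text must not appear with two different labels.
    labels_by_text = defaultdict(set)
    for i, rec in enumerate(records):
        if rec.get("label") not in allowed_labels:
            report["bad_label"].append(i)
        labels_by_text[rec.get("text")].add(rec.get("label"))
    for i, rec in enumerate(records):
        if len(labels_by_text[rec.get("text")]) > 1:
            report["conflicting_label"].append(i)

    # Anomaly detection (assumed technique): modified z-score based on the
    # median absolute deviation (MAD) of one numeric field.
    values = [r[numeric_field] for r in records
              if isinstance(r.get(numeric_field), (int, float))]
    if values:
        med = median(values)
        mad = median(abs(v - med) for v in values)
        for i, rec in enumerate(records):
            v = rec.get(numeric_field)
            if isinstance(v, (int, float)) and mad > 0:
                if 0.6745 * abs(v - med) / mad > threshold:
                    report["anomalous"].append(i)
    return report


# Illustrative data only; the field names and values are made up.
records = [
    {"text": "a", "label": "pos", "length": 10},
    {"text": "b", "label": "neg", "length": 11},
    {"text": "a", "label": "neg", "length": 12},   # same text, different label
    {"text": "c", "label": "", "length": 9},       # empty label
    {"text": "d", "label": "pos", "length": 500},  # outlying length
    {"text": "e", "label": "pos", "length": 10},
]
report = quality_report(records, ("text", "label"), {"pos", "neg"}, "length")
```

In a pipeline of the kind the abstract describes, a report like this could gate training or feed a review queue, so that flagged records are corrected at the source rather than discovered later as model failures.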