The intersection between semantics and data quality
Data quality management starts with explicitly defining stakeholder expectations, based on the intended use of the data, together with the business rules that must be verified to confirm the data is of high quality.
These rules are defined alongside the domain concepts (e.g., “Customer”) and the attributes that characterize them (e.g., “Tax ID”), enriching the knowledge base modeled within the information architecture.
Business rules related to data quality serve as the reference for implementing quality controls, both preventive and corrective, within applications and data products.
Figure: Business rules linked to the conceptual model guide data quality controls.
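As an illustration, a rule such as “every Customer must have a non-empty, well-formed Tax ID” can be captured as an executable expectation and evaluated against a dataset to produce a quality metric. The sketch below is a minimal example in Python; the rule structure, field names, and Tax ID pattern are illustrative assumptions, not part of any specific framework.

```python
import re
from dataclasses import dataclass
from typing import Callable

# A business rule expressed as a named, executable expectation.
# Concept, attribute, and check are illustrative assumptions.
@dataclass
class BusinessRule:
    concept: str                    # domain concept, e.g. "Customer"
    attribute: str                  # attribute it constrains, e.g. "tax_id"
    description: str                # stakeholder-facing statement of the expectation
    check: Callable[[dict], bool]   # True if the record satisfies the rule

TAX_ID_PATTERN = re.compile(r"^[A-Z0-9]{11,16}$")  # assumed format, adjust per jurisdiction

rules = [
    BusinessRule(
        concept="Customer",
        attribute="tax_id",
        description="Every Customer must have a non-empty, well-formed Tax ID",
        check=lambda rec: bool(rec.get("tax_id")) and bool(TAX_ID_PATTERN.match(rec["tax_id"])),
    ),
]

def evaluate(records: list[dict], rule: BusinessRule) -> float:
    """Return the fraction of records that satisfy the rule (a quality metric)."""
    if not records:
        return 1.0
    passed = sum(1 for rec in records if rule.check(rec))
    return passed / len(records)

customers = [{"tax_id": "RSSMRA85T10A562S"}, {"tax_id": ""}]
for rule in rules:
    print(f"{rule.concept}.{rule.attribute}: {evaluate(customers, rule):.0%} compliant")
```

Keeping the rule declarative (concept, attribute, description, check) is what lets it enrich the knowledge base while remaining directly usable as the reference for automated controls.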
Roles and Processes
To improve data quality, it is essential to act first from an organizational perspective, assigning explicit responsibility for quality to the roles involved in data and knowledge management processes.
Figure: The roles involved in data quality management.
The main roles involved are:
• Data Owner and Data Steward: responsible for defining domain semantics and the business rules related to data quality.
• Data Product Owner: responsible for the data products that expose corporate data assets for user consumption. They ensure the implementation of quality controls on the data products they manage, providing users with visibility into the data quality metrics.
• Data Custodian: a member of the Data Product team or the Data Quality team, delegated by the Data Product Owner to perform operational monitoring of quality metrics on the exposed data.
• Data Quality Expert: a member of the Data Governance function who specializes in defining policies, standards, and best practices to ensure effective management of data quality.
• Platform Engineer: a member of the team that develops shared services, including those supporting the data quality framework, offered as part of the platform to developers and users.
The main processes that contribute to measuring, monitoring, and reporting data quality status are:
• Governance rules definition process: defines the protocols, technologies, standards, and common rules for implementing quality controls across the organization.
• Platform engineering process: develops and deploys shared standard services (libraries, technological tools, and other software modules) that facilitate the implementation of controls and the measurement, monitoring, and reporting of quality metrics.
• Knowledge modeling process: defines stakeholder expectations and the business rules that serve as the reference for verifying data quality.
• Data product development process: implements the controls that measure quality metrics on the data assets exposed by the products (see the sketch after this list).
• Issue management process: analyzes and removes the root causes of identified data quality issues.
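To make the data product development process concrete, the sketch below shows a preventive quality gate in a publish step: it measures a metric on the records about to be exposed and blocks publication when an agreed threshold is not met. The check, threshold, field names, and blocking behaviour are illustrative assumptions.

```python
# Minimal, self-contained sketch of a preventive quality gate in a data
# product's publish step. The check, threshold, and field names are
# illustrative assumptions, not a standard.
CRITICAL_THRESHOLD = 0.99  # assumed acceptance level agreed with stakeholders

def completeness(records: list[dict], field: str) -> float:
    """Fraction of records where the field is present and non-empty."""
    return sum(1 for r in records if r.get(field)) / len(records) if records else 1.0

def publish(records: list[dict]) -> dict[str, float]:
    """Measure quality metrics and block publication when a critical rule fails."""
    metrics = {"customer.tax_id.completeness": completeness(records, "tax_id")}
    failed = {name: value for name, value in metrics.items() if value < CRITICAL_THRESHOLD}
    if failed:
        # Preventive control: do not expose data that violates a critical rule.
        raise ValueError(f"Publication blocked, quality below threshold: {failed}")
    return metrics  # exposed alongside the data so consumers can see its quality status
```

A corrective variant of the same gate would publish the data anyway, log the violation, and route it to the issue management process for root-cause analysis.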
These processes are executed in alignment with policies and standards defined at the Data Governance level. This alignment is crucial in modular architectures, as it ensures the interoperability of data quality controls and a consistent interpretation of the monitored metrics.
Technologies
Implementing quality controls and monitoring metrics at scale requires the support of appropriate technological tools.
Data quality metrics are runtime metadata that are part of the broader set of observability signals (which also include application and infrastructure logs, runtime resource usage metrics, and traces of user requests). For this reason, it is advisable to adopt standard observability protocols and libraries to manage the generation and transmission of these signals.
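As an illustration of this approach, the sketch below records a data quality metric as an OpenTelemetry observable gauge, so it flows through the same pipeline as the other observability signals. It assumes the opentelemetry-sdk Python package; the meter name, metric name, attributes, and the hard-coded value are made up for the example.

```python
from opentelemetry import metrics
from opentelemetry.metrics import CallbackOptions, Observation
from opentelemetry.sdk.metrics import MeterProvider
from opentelemetry.sdk.metrics.export import ConsoleMetricExporter, PeriodicExportingMetricReader

# The console exporter keeps the example self-contained; in production the
# reader would export to the organization's observability backend instead.
reader = PeriodicExportingMetricReader(ConsoleMetricExporter(), export_interval_millis=60_000)
metrics.set_meter_provider(MeterProvider(metric_readers=[reader]))
meter = metrics.get_meter("data-product.customer")  # assumed meter name

def observe_completeness(options: CallbackOptions):
    # In a real data product this would read the latest value measured by the
    # quality gate, e.g. the output of the publish step sketched earlier.
    yield Observation(0.98, {"dataset": "customer", "rule": "tax_id.completeness"})

meter.create_observable_gauge(
    "data_quality.completeness",          # assumed metric name
    callbacks=[observe_completeness],
    description="Share of records satisfying the completeness rule",
)
```

Exporting quality metrics through the same reader used for other runtime signals is what allows them to be collected, stored, and visualized with the organization's standard observability tooling.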
The tools required to measure, monitor, and report data quality status in a distributed architecture are as follows:
Strategy & Operating Model
To maximize effectiveness, it is recommended to embed data quality implementation within an overall Data & AI Strategy. This allows the entire portfolio of data and AI activities to be managed coherently and priorities to be set in line with the value the organization aims to deliver.
Furthermore, integrating data quality activities with the rest of the strategic portfolio makes it possible to leverage ongoing work in other programs (for example, building a platform that improves the experience for users and developers) to also support data quality objectives.
In complex organizations, operating models are usually decentralized and federated. Consequently, different working groups are involved in subsets of the activities contributing to data quality implementation. It is therefore essential to maintain both operational and strategic coordination to facilitate the synergistic work of these groups in developing quality solutions that meet user expectations.
Given the complexity of implementing data quality, especially when a large volume of existing data solutions must be remediated, it is advisable to adopt an incremental and iterative approach, prioritizing the remediation of the most critical data quality issues on the assets that pose the highest risk for their intended use.