Issue #15 – The Data Quality Conundrum (Part 1 – Root Causes)
Explaining the root cause sources of data quality problems
Read time: 8 minutes
The number one most talked-about issue in data is… (drumroll please)
…Poor data quality
But it’s not a straightforward problem. Instead, it is a problem that is consistently misconstrued by data leadership and even by the practitioners brought in to fix it.
That is because data quality is an output, not an input.
It is a symptom of numerous underlying root cause issues within the data ecosystem. Tackling just ‘data quality’ will lead to potential short-term gain but continued long-term pain. Instead, it requires a holistic and systemic approach that addresses these root causes.
So let’s start to unpack this complex and convoluted issue.
What is Data Quality and what are its implications?
If you are reading this article, I’m sure you have some idea of what data quality is. At its core, data quality ensures that data represents the real-world construct it refers to and serves its intended purpose within an organisation (usually to do analytics or feed operational processes).
Data quality is a multifaceted concept that is easy to misunderstand. In practice, it is about ensuring that data is:
Accurate – The correctness of the data, ensuring it reflects the real-world scenario it is intended to model and represent. Within this, the data also needs to be precise—the measurements are close to each other with little random error
Consistent – Uniformity across data sources, ensuring that the data is the same between storage and usage and does not conflict within different datasets
Complete – All necessary data is present, with no missing elements that could impact analysis or decision-making
Timely – Data is up-to-date and available when needed, ensuring that decisions are based on the most current information
Unique – No duplicate records exist within the organisation, helping maintain the integrity and accuracy of the data (creating trust)
Valid – The data conforms to the agreed structure, data quality standards or lists of values, enforced through validation rules or cleaning, and remains accurate and consistent throughout its lifecycle
Maintains Conformity – The data follows standard data definitions such as data type, size and format. This avoids mixed formats (dd/mm/yyyy vs. mm/dd/yyyy) within similar categories of data
Has Integrity – The data is recorded exactly as intended, and relationships are maintained throughout its lifecycle and between systems. This relies on data validity; the two work hand in hand
These eight elements set the foundation for what data quality is. Each of these measures plays a part in creating a source of information that business users can use and trust. To confirm the quality, there are different methods and approaches to measuring each of these components, which I will discuss in another article (a small illustrative sketch of a few checks follows below).
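To make these dimensions less abstract, here is a minimal sketch in Python (using pandas) of how a few of them could be checked against a made-up customer table. The column names, rules and patterns are assumptions for illustration only, not a prescribed implementation:

```python
import pandas as pd

# Hypothetical customer extract, used purely for illustration
df = pd.DataFrame({
    "customer_id": [1, 2, 2, 4],
    "email": ["a@example.com", None, "b@example.com", "not-an-email"],
    "signup_date": ["2024-01-31", "31/01/2024", "2024-02-15", "2024-03-01"],
})

# Completeness: share of non-null values per column
completeness = df.notna().mean()

# Uniqueness: duplicate primary keys undermine trust in record counts
duplicate_keys = df["customer_id"].duplicated().sum()

# Validity: a simple rule-based check against an agreed pattern
valid_email_ratio = df["email"].str.match(r"[^@\s]+@[^@\s]+\.[^@\s]+$", na=False).mean()

# Conformity: rows whose dates do not parse in the agreed ISO format (YYYY-MM-DD)
parsed = pd.to_datetime(df["signup_date"], format="%Y-%m-%d", errors="coerce")
nonconforming_dates = parsed.isna().sum()

print(completeness, duplicate_keys, valid_email_ratio, nonconforming_dates, sep="\n")
```

Each check produces a number that can be tracked over time, which is what turns these dimensions from aspirations into something measurable.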
It is worth noting that most sources outline six dimensions, but they never agree on which six to include. After posting this on LinkedIn and getting feedback, I decided eight was the right number to be fully inclusive, even if there is some overlap.
Another benefit of defining data quality by these dimensions, as Piotr Czarnas mentioned to me, is that business users recognise what they mean. Even non-data people can understand what accurate or timely data means, which helps translate terminology from a very technical field into a commonly understood language. The more data professionals can do this, the better!
In reality, most data does not adhere to these eight components. The implications include a loss of trust from business users, worse decision-making, higher costs to complete data work and analysis, an inability to scale data tools and solutions, and negative financial impact. On the last point, some widely cited figures on the impact of poor data quality include:
Gartner stated in 2021 that bad data quality costs organizations an average of $12.9 million per year
IBM conducted a study estimating that data quality issues cost the US economy $3.1 trillion in 2016, a figure that has likely only grown over the past decade
In 2017, Thomas Redman (writing for MIT Sloan Management Review) estimated that bad data might cost companies as much as 15% to 25% of their revenue
In 1999, the $327.6 million Mars Climate Orbiter was lost on the wrong trajectory to Mars because of poor data quality: a mismatch between NASA and Lockheed Martin software over metric versus imperial units
Unfortunately, the hardest part about data quality is producing these figures and securing the investment to address the issues. Everybody theoretically knows that analytics, data science and AI are useless without strong data quality, but getting budget from AI-hungry non-data executives for data quality work is extremely hard without the right numbers and figures.
And this becomes even harder due to the nature of how data quality problems emerge—from multiple root sources.
Root Causes of Data Quality Problems
Solving for these massive data quality costs is not easy, which is why issues are still so pervasive within organisations. From a high-level perspective, it seems like you can put a few dollars and resources against data quality to fix the information and make it more accurate or complete. In reality, that is like putting a band-aid on a much bigger issue.
As mentioned before, data quality is a consequence of root causes that have to be properly addressed. To simplify the biggest of these root causes, I’ve outlined four categories below, each with underlying root cause components:
Business-Related Process Problems
Organizational Complexity – The multiple bureaucratic layers and siloed teams in large organisations often lead to fragmented data management practices, making it difficult to maintain consistent data quality. You can’t necessarily change this reality, but you can foresee it and learn how to handle it
Non-Standardised Key Performance Indicators (KPIs) – Failing to standardise KPIs across the organisation causes inconsistent metrics and reporting, with different departments even measuring the same KPI differently. This means data cannot be reliably compared or aggregated
Changing Business/Data Requirements – As business needs evolve, data requirements change. Without a flexible data management strategy, these changes can introduce inconsistencies and gaps in data, affecting its reliability and usability
Poor Data Entry Processes – Manual data entry will always be prone to human error, leading to inaccuracies. Without data entry standards and validation checks, these errors accumulate, degrading overall data quality (see the validation sketch after this list)
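As an illustration of the entry-time validation the last point refers to, here is a minimal sketch. The field names, formats and allowed values are invented for the example; in a real system these rules would be agreed with the business and enforced in the form or application layer:

```python
import re
from datetime import date

# Hypothetical rules for a single order record, illustrative only
def validate_order(record: dict) -> list[str]:
    errors = []
    if not re.fullmatch(r"ORD-\d{6}", record.get("order_id", "")):
        errors.append("order_id must look like ORD-000000")
    if record.get("quantity", 0) <= 0:
        errors.append("quantity must be a positive number")
    if record.get("country") not in {"GB", "US", "DE"}:  # agreed list of values
        errors.append("country must be one of the agreed ISO codes")
    if record.get("order_date", date.max) > date.today():
        errors.append("order_date cannot be in the future")
    return errors

print(validate_order({"order_id": "ORD-123456", "quantity": 3,
                      "country": "FR", "order_date": date(2024, 5, 1)}))
# -> ['country must be one of the agreed ISO codes']
```

Catching the error at the point of entry is far cheaper than reconciling it months later in a downstream report.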
Management of Multiple Data Sources/Systems
Lack of Data Management Approach – A disjointed approach across data teams and individuals means there is no consistency in how data is managed from ingestion to consumption, impacting the validity of data throughout the process
Different Tools/Platforms Across Teams – Teams, functions and regions often use different tools and platforms for different data activities, leading to tool-specific approaches and inconsistent data formats, standards, and output quality
Ingestion from Multiple Systems – The number of data sources in organisations continues to increase, and data collected from various systems and platforms often lacks consistency, creating quality issues when integrating it into a single coherent dataset
No Data Observability Mechanisms for Quality Tracking – Without tools to monitor and track data quality, issues can go unnoticed and data quality improvement becomes reactive rather than proactive (a minimal monitoring sketch follows this list)
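To make the observability point more tangible, here is a minimal sketch of a scheduled freshness-and-volume check. The table shape, column name and thresholds are assumptions; in practice, dedicated observability tools or testing frameworks typically cover this out of the box:

```python
from datetime import datetime, timedelta, timezone

import pandas as pd

def check_table_health(df: pd.DataFrame,
                       loaded_at_col: str = "loaded_at",
                       expected_min_rows: int = 1000,
                       max_staleness: timedelta = timedelta(hours=24)) -> list[str]:
    """Return a list of alerts for a freshly loaded table (illustrative thresholds)."""
    alerts = []

    # Volume: a sudden drop in row count often signals a broken upstream feed
    if len(df) < expected_min_rows:
        alerts.append(f"Row count {len(df)} is below the expected minimum {expected_min_rows}")

    # Freshness: data that quietly stops arriving is a common silent failure mode
    latest_load = pd.to_datetime(df[loaded_at_col], utc=True).max()
    if datetime.now(timezone.utc) - latest_load.to_pydatetime() > max_staleness:
        alerts.append(f"Table looks stale; last load was at {latest_load}")

    return alerts
```

In practice a check like this would run on a schedule after each pipeline load and push alerts to wherever the team already looks, such as Slack or email, so problems surface before business users find them.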
No Basis for Data Quality Improvement
No Data Quality Standards – Everybody wants accurate, complete, timely data, but without standardised data quality benchmarks it becomes impossible to properly measure and improve data quality (an example of codifying such a standard follows this list)
No Method to Fix Data Quality – Even if data quality issues are identified, most organisations lack systematic processes for resolving them (other than manual updates), meaning problems remain pervasive within data operations and output quality is compromised
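As a sketch of what a codified standard might look like, here is an assumed example that turns a few of the dimensions from earlier into measurable, pass/fail rules for a hypothetical customers table. The names and thresholds are illustrative; in reality they would be agreed with data owners and enforced by whatever quality tooling the organisation uses:

```python
import pandas as pd

# Hypothetical data quality standard for a "customers" table, expressed as
# measurable thresholds rather than aspirations. Names and numbers are illustrative.
CUSTOMER_DQ_STANDARD = {
    "completeness": {"column": "email", "min_non_null_ratio": 0.98},
    "uniqueness":   {"column": "customer_id", "max_duplicate_rows": 0},
    "validity":     {"column": "country", "allowed_values": ["GB", "US", "DE"]},
}

def evaluate(df: pd.DataFrame, standard: dict = CUSTOMER_DQ_STANDARD) -> dict:
    """Score each rule as pass/fail so 'data quality' becomes something measurable."""
    return {
        "completeness": df[standard["completeness"]["column"]].notna().mean()
                        >= standard["completeness"]["min_non_null_ratio"],
        "uniqueness": df[standard["uniqueness"]["column"]].duplicated().sum()
                      <= standard["uniqueness"]["max_duplicate_rows"],
        "validity": df[standard["validity"]["column"]]
                    .isin(standard["validity"]["allowed_values"]).all(),
    }
```

Once a standard like this exists, failures can be routed to a named owner or steward for resolution rather than being patched ad hoc, which starts to address the "no method to fix" problem as well.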
Underinvestment in Data Governance
No Data Governance Team – I’ve seen multiple companies forgo a data governance team because their ‘tech is solid.’ However, without a governance team, data management practices and policies often aren’t set or followed, leading to huge long-term issues in enforcing a consistent approach to data management and a disconnect between the technical data side and how the business uses data
Lack of Data Owners & Stewards – Data quality relies on somebody who understands the dataset (and its business implications) owning it. Without data owners and stewards, there aren’t any individuals responsible for maintaining data integrity, usability and validity
The problem with root causes is that they are hard to solve, and even if you tackle one, another will come back to bite you. The poor or misaligned data quality outputs you see on a daily basis are likely due to one or more of these root causes (sometimes together, sometimes separately). Hence, a holistic approach to solving them is crucial to properly addressing data quality issues.
In next week’s article, I’m going to talk more about how you properly address data quality.
Before I end, I want to mention another perspective on business process root causes brought up in a LinkedIn conversation with Malcolm Hawker. A lot of the business-related process problems are actually “a result of business functions being specialized…meaning the applications automating and supporting those functions will capture and manage data differently.” He described this as variable data quality rather than poor data quality, which needs to be taken into account when evaluating data across business functions, teams, and use cases. So before standardising everything to ensure data consistency, make sure any data quality fixes consider the business needs.
Thanks for the read! Comment below and share the newsletter/issue if you think it is relevant! Feel free to also follow me on LinkedIn (very active) or Medium (not so active). See you amazing folks next week!