Issue #9 - Clarifying Data Terminology
Making data jargon a bit more simple, human, and relevant
Read time: 9 minutes (a longer, but important article)
Is it dah-ta or day-ta?
There is no right answer, just like there isn’t a right way to define what data is, as it means so many different things in today’s world.
So before we dive too far into navigating the Data Ecosystem, it is worth exploring and understanding the many things data can mean and what they really suggest.
Difficulties Defining Data
The word data implies a lot of things. To name a few, it can refer to:
Raw data gathered from source
Structured data input into a model
Outputs and results from that model
Curated evidence used to make decisions
Conceptual references to numbers or information
This variety of use leaves the terminology for data ambiguous. I mean, what data professional hasn’t heard the question: “Can you just get me the data?”
This mentality is quite dangerous for organisations. Failing to define what data represents and does in an organisation has downstream impacts on many of the core topics and capabilities that businesses struggle with. Some examples include:
Data Literacy – Poorly defined terminology makes a data literate workforce harder to achieve. And without data literacy, business users will not understand how data should be read, analysed, used and communicated, creating a gap in an organisation’s ability to drive value through data activities.
Data Collection – A lot of data quality issues occur before the data teams get involved, and a reason for this is poor business processes to collect, record, and store data. Without standard definitions, business users often input data incorrectly, leading to countless downstream impacts. Automation and new technologies help with this, but manual tasks will always exist and without the process or standardized definitions behind it, there will be cracks.
Data Classification & Management – Poor data quality is compounded by an inability to classify and manage data. Technologies exist to do these things (e.g., MDM software, data catalogs, etc.), but without an understanding of what the data is for, or clear definitions underpinning those technologies, they are rendered useless in the grand scheme of things. This challenge dovetails nicely with data collection, as the two often go hand in hand.
Data Culture – Culture is built on shared values and perspectives. When teams are speaking different languages about the data, there can be no common foundation to establish a data culture.
Leadership Perception – Executives are usually not the most data literate, so any confusion around terms or definitions that reaches the C-suite has disastrous impacts. Programs will get defunded because executives think you are referring to data they can readily use, but instead they get raw data or complex outputs that they don’t understand. Leadership needs to know what data is being used for, the value it is getting, and how it is defined.
Categories of Data
How do we solve for this confusion?
The first step is to actually define what data means in the organisation.
But, to be clear, this is not explaining the purpose of different datasets or defining variables and KPIs.
This is about breaking down the overwhelming term (DATA) into component parts and actually understanding what it can mean to different audiences and in different settings.
The outcome of doing this is consistency and a foundational understanding of how the word data can be used to describe and reference different things. Essentially, this is an English lesson in making the word data less of a buzzword and more concrete.
The first term we will define is Data Concepts. In my view, Data Concepts are the ideas people invoke when they talk about data at a high level or in strategic conversations. An example of this is Big vs. Small Data. These terms allude to the scale of the data collected and how it is analysed or used. Data Concepts can also extend to high-level categories for how the data is being processed and used (e.g., operational or analytical data). Understanding what these concepts mean within the context of your organisation is important because they make their way into a lot of strategy documents, executive conversations, and presentations.
The second term is Data Domains. Here we are referring to the organisational source of the data. Examples of this are marketing, financial, supply chain, or product data, each of which has a distinct domain where the data originates and where it is primarily used. Defining the data domains is an incredibly useful task because it also establishes a common language between data and business colleagues. Tools like Data Catalogs can supplement this by providing a dictionary, automating the process and allowing for further definitions to be added.
The third term is Metadata. The idea of metadata is becoming more powerful and impactful as the amount of data in organisations increases. It refers to the data that provides information about other data, kind of like a dictionary and glossary. Examples of this include descriptive metadata, structural metadata, administrative metadata, or reference metadata, each of which serves a specific purpose in describing the attributes, format, and context of data assets. The importance of metadata (and of not confusing it with lower level data) is that it plays a crucial role in data management by enhancing discoverability and accessibility, allowing for better tracking, integration, and analysis of data across various systems.
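To make the idea of 'data about data' a bit more tangible, here is a minimal sketch in Python. The dataset, the field names, and the owner are all made up for illustration, and real metadata tools capture far more than this. The point is only that the metadata describes the sales records without being the sales records themselves.

```python
from datetime import date

# A tiny, made-up dataset (the "lower level" data).
sales = [
    {"customer_id": 101, "amount": 25.50},
    {"customer_id": 102, "amount": 40.00},
]

# Structural metadata can often be derived from the data itself:
# here, each column name mapped to the type of its first value.
structural = {name: type(value).__name__ for name, value in sales[0].items()}

# Descriptive and administrative metadata usually come from people or
# systems, so the values below are hypothetical placeholders.
metadata = {
    "descriptive": {"title": "Daily sales", "description": "One row per sale"},
    "structural": {"columns": structural, "row_count": len(sales)},
    "administrative": {
        "owner": "finance-team",
        "created": date(2024, 3, 15).isoformat(),
    },
}

print(metadata)
```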
The fourth term is Sources of Data. There are two levels to this. The first is whether the data is real or synthetic. While most data today is real, a lack of usable data and the rise of AI have led to a lot more synthetic data being created. Noting this distinction will be crucial in the future as it will have implications for quality and trust in the data. The second level is the actual tangible source of the data. Here we refer to how the data is generated and captured. Examples of this might include transactional POS data (e.g., sales records, banking transactions), IoT sensor data from devices or wearables (e.g., temperature, motion sensors), or log data taken by systems or software (e.g., web server logs, status reports). This data feeds into your platform and informs analytical outputs, so these sources need to be defined if you want to trust those insights.
The fifth term is Data Formats. After digging into the source data, you will find it is organized in different ways. What we are alluding to here is whether the data is Structured, Unstructured, or Semi-Structured. Structured Data is highly organized and stored in predefined models such as relational databases (e.g., SQL databases, CRM data), which allows for efficient processing and querying. Unstructured Data makes up most of the data in the world. It does not fit neatly into a row-column database or data model and typically consists of content without a predefined format, like free text, multimedia content, and social media posts. Extracting actionable insights from unstructured data usually requires more sophisticated tools like NLP. Finally, Semi-Structured Data straddles the line between the two, featuring organisational markers like tags, headers, or metadata, but without a rigid schema for the content underneath. An example might be an email message, where the body is unstructured text but structured header fields (sender, recipient, date, subject) describe it. Understanding these three data formats is crucial for effective data management and for decisions about data solutions.
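If it helps to see the three formats side by side, here is a small sketch using only Python's standard library. The CSV text and the email are invented for the example. The CSV stands in for structured data, the email headers show the 'organisational markers' of semi-structured data, and the email body is plain unstructured text.

```python
import csv
import io
from email import message_from_string

# Structured: every row follows the same predefined columns.
structured_text = "customer_id,amount\n101,25.50\n102,40.00\n"
rows = list(csv.DictReader(io.StringIO(structured_text)))

# Semi-structured: an email has well-defined header fields,
# but the body underneath them is free text.
raw_email = (
    "From: customer@example.com\n"
    "To: support@example.com\n"
    "Subject: Late delivery\n"
    "\n"
    "Hi, my order still has not arrived. Can you help?\n"
)
message = message_from_string(raw_email)
headers = dict(message.items())  # the structured part

# Unstructured: the body has no fields to query, only text to interpret.
body = message.get_payload()

print(rows)
print(headers)
print(body)
```

You can ask the structured rows for an exact column, and the headers for an exact field, but getting meaning out of the body takes interpretation, which is where tools like NLP come in.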
The sixth term is Data Processing Stages. In addition to the format of the data, you also need to know its level of processing. The data processing terminology, therefore, describes data as it evolves from its initial capture to its final form. The process begins with Raw Data, which is unrefined and directly collected from various data sources (the quality of this data will depend on the source). The next stage is Sourced Data, which involves aggregating and organizing raw data from multiple origins and preparing it for further processing. The Cleansed Data stage involves scrubbing the data to remove errors, duplicates, and inconsistencies to create a consistent format and ensure the data's accuracy and reliability. The Conformed Data stage then standardizes and aligns this cleansed data across systems, harmonising it with master data and data model standards. This is where the ‘Single Version of the Truth’ is created. Business rules and requirements are then applied to create Derived Data, transforming the conformed data to generate new values or aggregating existing measures to suit specific analytical objectives. The final stage is Curated Data, where data is fully prepared, enriched with context, and stored in a manner that facilitates easy access and analysis, making it ready to drive strategic decisions and insights. We will build more on these stages next week in a further article about the data lifecycle, as each of these ‘types’ of data demonstrates a key point in that lifecycle.
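The stages above can be sketched as a toy pipeline. This is a simplified illustration in plain Python, not how any particular platform does it. The store codes, amounts, and the 'sales by store' business rule are all assumptions made up for the example.

```python
# Raw Data: records exactly as captured from two hypothetical sources.
raw_pos = [
    {"store": "LON-01", "amount": "25.50", "currency": "gbp"},
    {"store": "LON-01", "amount": "25.50", "currency": "gbp"},  # duplicate
    {"store": "lon-02", "amount": "", "currency": "GBP"},  # missing amount
]
raw_web = [{"store": "WEB", "amount": "40.00", "currency": "GBP"}]

# Sourced Data: raw inputs from multiple origins brought together.
sourced = raw_pos + raw_web


def cleanse(records):
    """Cleansed Data: drop rows with missing amounts and exact duplicates."""
    seen, cleansed = set(), []
    for record in records:
        key = tuple(sorted(record.items()))
        if record["amount"] and key not in seen:
            seen.add(key)
            cleansed.append(record)
    return cleansed


def conform(records):
    """Conformed Data: align casing and types to one agreed standard."""
    return [
        {
            "store": r["store"].upper(),
            "amount": float(r["amount"]),
            "currency": r["currency"].upper(),
        }
        for r in records
    ]


def derive(records):
    """Derived Data: apply a business rule (total sales per store)."""
    totals = {}
    for r in records:
        totals[r["store"]] = totals.get(r["store"], 0.0) + r["amount"]
    return totals


cleansed = cleanse(sourced)
conformed = conform(cleansed)
derived = derive(conformed)

# Curated Data: the derived values packaged with context for consumers.
curated = {"metric": "sales_by_store", "currency": "GBP", "values": derived}
print(curated)
```

Notice that the same sale looks different at every stage, which is exactly why "can you just get me the data?" needs a follow-up question about which stage the person actually wants.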
The seventh and final term is Data Types. This last area of data is getting into the weeds, referring to the various categories of data that can be used and manipulated in data analysis. These types are fundamental to understanding how data can be stored, processed, and interpreted. Some examples include numeric (integers and floating-point numbers, essential for calculations and statistical analysis), textual (strings or characters, encompassing textual content that may require parsing or segmentation for analysis), date/time (crucial for time-series analysis, trend detection, and chronological sorting), and Boolean (representing the binary values of true and false used in conditional operations and logic-based programming). Each data type has specific attributes and operations that are optimally supported by various programming languages and database systems. Understanding the difference between them is essential for detailed data activities like data management, designing data models, implementing databases, and performing data analytics.
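A quick sketch in Python shows why the type matters. The order record below is invented for illustration. The same-looking value behaves very differently depending on whether it is stored as a number or as text.

```python
from datetime import date

# One hypothetical order, with one field per data type discussed above.
order = {
    "quantity": 3,                    # numeric (integer)
    "unit_price": 19.99,              # numeric (floating point)
    "customer_name": "Ada",           # textual (string)
    "order_date": date(2024, 3, 15),  # date/time
    "is_gift": False,                 # Boolean
}

# Numeric types support arithmetic.
total = order["quantity"] * order["unit_price"]

# Date types support chronological calculations.
days_to_delivery = (date(2024, 3, 20) - order["order_date"]).days

# Boolean types drive conditional logic.
wrapping = "gift wrap" if order["is_gift"] else "standard"

# The same digit stored as text is repeated, not multiplied.
as_number = 3 * 2
as_text = "3" * 2

print(total, days_to_delivery, wrapping, as_number, as_text)
```

The last two lines are the kind of thing that quietly breaks reports when a numeric column has been loaded as text.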
Okay, that was a lot. Breaking down what data actually means is a thankless exercise (nobody will give you praise) but it can add context you need to better navigate the data ecosystem.
Each of these categories means different things and will have different implications for your work, so figure out how and why to use each ‘type of data term’.
Next week, we will dive into the data lifecycle process at a high level (see image below for a taster), where this knowledge about the different types of data will help you that much more! See you then and leave any comments/feedback you might have!
Thanks for the read! Comment below and share the newsletter/ issue if you think it is relevant! Feel free to also follow me on LinkedIn (very active) or Medium (not so active). See you amazing folks next week!