Issue #10 - The Data Lifecycle
What are the steps you need to think about when taking data from source to consumption?
Read time: 10 minutes
Lifecycles are an important reality in ecosystems, and luckily, the data lifecycle is already common terminology in the data world. But even I find it extremely confusing!
Why? Because every organisation approaches its data in slightly different ways, with slightly different terminology, leading to slightly different lifecycle definitions.
So as we traverse the data ecosystem, define key terms and link them back to the business, let's do the same for this VERY IMPORTANT concept: the data lifecycle.
Quick Data Lifecycle Definition
Ecosystems are full of lifecycles.
Some are short (bugs, flowers, worms) and some are longer (mammals, trees, fish). Each lifecycle marks the beginning-to-end journey of that living being: from its birth or conception, through the programming that dictates how it functions in life, to how it dies or is consumed.
Data isn’t all that different.
The data lifecycle encapsulates the stages through which data flows, from its initial creation/acquisition through various phases of processing and utilisation, all the way to its eventual consumption, archiving and deletion.
By better understanding this lifecycle, data practitioners and leaders can plan more effectively, from both a strategic and a technical perspective: what data to use, how to use it, and when it stops being useful.
Different Versions of the Data Lifecycle
While the data lifecycle is a well-known term, there are many different definitions of it. A common one is the journey from generation & collection to storage & processing, through to management, analysis and interpretation, before the data is finally removed or destroyed. This is the most simplistic view, and there are quite a few versions of it.
Below I’ve included three fairly standard visuals of this from Scilife, Harvard Business School and TechTarget. I won’t go into detail about any of them here, however I will expand on some of these terms in my Ecosystem-view of the Data Lifecycle later in this article.
Another one of my favourite versions of the data lifecycle comes from Joe Reis & Matt Housley’s Fundamentals of Data Engineering. This version approaches it more from the data engineering perspective and focuses a lot on the ETL processes within what is usually termed the processing and storage stages.
What I love about Joe and Matt's interpretation, however, is the undercurrents of the data lifecycle: the requirement to think about security, architecture, ops, orchestration, etc. These might not be full stages, but they are hugely impactful within data's journey from generation to consumption.
During my research, I also found this simple and visually appealing view from Paul Singman. He breaks it down into four stages:
Data Ingestion — Bringing raw data into the data environment
Data Transformation — Logic applied to landed data to produce clean datasets
Testing & Deployment — Quality/validation tests applied during data publication
Monitoring & Debugging — Tracking data health and finding cause of errors
While I love the visual and some of the underlying elements, this perspective is really an analytical data lifecycle: all of its activities are about data travelling from source to consumption, and some key elements are missing.
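To make those four stages concrete, here is a minimal sketch of them in Python. Everything in it is a hypothetical stand-in: the ingested records, the cleaning logic, the row-level tests and the logged health metrics are placeholders for whatever your actual stack does at each stage.

```python
# Illustrative sketch of the four stages: ingest, transform, test, monitor.
# All names and data are hypothetical stand-ins.
import logging

logging.basicConfig(level=logging.INFO)
log = logging.getLogger("pipeline")

def ingest() -> list[dict]:
    # Data Ingestion: bring raw data into the environment
    # (in practice: an API pull, file drop or CDC stream).
    return [
        {"user_id": 1, "amount": "19.99"},
        {"user_id": 2, "amount": None},  # a dirty record
    ]

def transform(raw: list[dict]) -> list[dict]:
    # Data Transformation: apply logic to produce a clean dataset.
    return [
        {"user_id": r["user_id"], "amount": float(r["amount"])}
        for r in raw
        if r["amount"] is not None
    ]

def test(clean: list[dict]) -> None:
    # Testing & Deployment: validate before publishing downstream.
    assert clean, "no rows survived transformation"
    assert all(row["amount"] >= 0 for row in clean), "negative amounts found"

def monitor(raw: list[dict], clean: list[dict]) -> None:
    # Monitoring & Debugging: track data health over time.
    dropped = len(raw) - len(clean)
    log.info("ingested=%d published=%d dropped=%d", len(raw), len(clean), dropped)

raw = ingest()
clean = transform(raw)
test(clean)
monitor(raw, clean)
```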
The Data Ecosystem Data Lifecycle
Unlike the examples above, this newsletter has a larger scope.
We aren’t talking about data in a vacuum!
Within the Data Ecosystem, the Data Lifecycle needs to encapsulate how data lives, provides value, is kept healthy, interacts with different teams, and considers new tools or methodology across the whole organisation.
This makes the lifecycle a lot more complex, but our goal here is to simplify that complexity while capturing the nuances that professionals experience in their daily tasks. We capture it in six distinct stages, each of which includes multiple data topics that need to be considered. I have visualised it in the graphic below:
The first stage is Data Generation. This stage is about bringing the data into the organisation’s data lifecycle, which includes sourcing, prioritising, and integrating the data.
Data Sourcing – Identifying and acquiring data from various internal and external sources. Most of this data will be pulled from technologies, applications, devices and other tools, but it is often also collected manually from users. This step forms the foundation for all subsequent data activities.
Data Prioritisation – Determining the most valuable data based on predetermined use cases and business goals. This step is essential to focus resources on the data that will drive the most impact and support strategic objectives, while not wasting time on irrelevant data/information.
Data Integration – Bringing together prioritised data from multiple source systems. This might include designing and managing/maintaining an integration layer to reduce complexity when sourcing data. Effective integration is vital for creating a unified view of data and improving access for end consumers.
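As a toy illustration of that integration layer, here is a minimal Python sketch that unifies customer records from two hypothetical source systems (a CRM and a billing tool) keyed on a shared customer ID. Real integration involves far more, but the principle of producing one unified view is the same.

```python
# A minimal integration sketch: unify customer records from two
# hypothetical source systems keyed on a shared customer ID.
crm = {
    "C001": {"name": "Acme Ltd", "segment": "enterprise"},
    "C002": {"name": "Beta GmbH", "segment": "smb"},
}
billing = {
    "C001": {"mrr": 4200.0},
    "C002": {"mrr": 350.0},
}

# The "integration layer": one unified view per customer,
# so downstream consumers never query the sources directly.
unified = {
    cid: {**crm.get(cid, {}), **billing.get(cid, {})}
    for cid in crm.keys() | billing.keys()
}

print(unified["C001"])
# {'name': 'Acme Ltd', 'segment': 'enterprise', 'mrr': 4200.0}
```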
The second stage is Data Storage & Processing. After integrating the data, this stage addresses how data is stored in your organisation according to your data models/architecture, and how it is processed, transformed and encrypted to get it ready for teams to use.
Data Storage – Getting your data into a secure storage area, usually through cloud services. This storage might be accessible to users, or it might be a data lake used only by the data team. Either way, newly ingested data must be stored properly so that it is readily available and protected against loss or breaches.
Data Modelling – Within the database, data modelling structures the ingested data to represent real-world business entities and relationships effectively. This enables better data organisation and accessibility, improving the relevancy and usability of data before it is transformed.
Data Processing & Transformation – The classic step of cleaning, transforming, and preparing data for analysis. This converts raw data into a usable format, often done by engineering teams through pipelines. Where and how this is done varies significantly based on the data philosophy, design and maturity of an organisation.
Data Encryption & Compliance – Often done in parallel with processing (or ignored completely), this step is about protecting data through encryption and adhering to data privacy and security regulations. Its importance is only increasing with AI, regulation and data privacy concerns; it ensures sensitive information is safeguarded throughout the data lifecycle.
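To ground this stage, here is a small Python sketch covering the last two steps: cleaning raw records, then encrypting a sensitive field before storage. It uses the third-party cryptography package for illustration; the field names, data and the choice of Fernet encryption are all assumptions, not a prescription.

```python
# Sketch: clean raw records, then encrypt a sensitive field before storage.
# Requires the third-party `cryptography` package (pip install cryptography).
from cryptography.fernet import Fernet

key = Fernet.generate_key()   # in practice, pull from a key-management service
fernet = Fernet(key)

raw_rows = [
    {"order_id": " 1001 ", "email": "anna@example.com", "total": "25.00"},
    {"order_id": "1002",   "email": "ben@example.com",  "total": "bad"},
]

processed = []
for row in raw_rows:
    try:
        total = float(row["total"])        # transformation: enforce types
    except ValueError:
        continue                           # drop records that fail cleaning
    processed.append({
        "order_id": row["order_id"].strip(),
        # encryption/compliance: never store the raw PII value
        "email_enc": fernet.encrypt(row["email"].encode()),
        "total": total,
    })

print(len(processed), "rows ready for storage")
```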
The third stage is Data Management. Treat this stage as an underlying current that becomes more prevalent as data is stored, served and analysed (which is why it is brought up now). Within this stage, we have to think about how metadata, master data and data lineage are managed and used.
Metadata Management – This step is about managing the data that provides information about other data (like a dictionary and glossary) to ensure consistency and context when using data within the organisation. Proper metadata management provides clarity and improves data discoverability and usability.
Master Data Management – Setting the processes, policies and tools to create a single, trusted source of master data to ensure data quality. By focusing on ensuring consistency and accuracy across key business entities, this step is crucial to maintaining data integrity across the organisation’s data lifecycle.
Data Lineage & Observability – Monitoring data pipelines and infrastructure to ensure data health and performance, while tracking its flow and transformations from its original format to its final use. This step is becoming more popular given the amount of data that organisations are dealing with, and helps ensure transparency and traceability in data processes/ uses.
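To make the management stage a little more tangible, here is a toy sketch of what a combined metadata and lineage record might look like in Python. The structure, field names and dataset are all invented for illustration; real catalogues and observability tools are far richer.

```python
# A toy metadata/lineage record: every dataset carries a description
# (metadata) and the chain of steps that produced it (lineage).
from dataclasses import dataclass, field

@dataclass
class DatasetRecord:
    name: str
    description: str              # metadata: what this data means
    owner: str                    # metadata: who to ask about it
    lineage: list[str] = field(default_factory=list)  # how it was produced

orders_clean = DatasetRecord(
    name="orders_clean",
    description="Deduplicated orders with validated totals",
    owner="data-platform-team",
    lineage=["crm.orders_raw", "dedupe_by_order_id", "validate_totals"],
)

# Observability becomes a matter of walking the lineage back to the source.
print(" -> ".join(orders_clean.lineage + [orders_clean.name]))
```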
The fourth stage is Serving. Other definitions often skip this step or lump it into storage & processing. I think it is vital to call it out separately, as users now have far more ways to access data than they did 5-10 years ago, and enabling that access through data residency, well-defined contracts and solution-architected design is crucial to success.
Data Residency & Access – Ensuring data is stored in appropriate locations, like an Operational Data Store (ODS) or Analytical Data Store (ADS), based on its use case, and accessible to authorised users. This simplifies access and security processes, and clarifies why the data is being used rather than having everything stuck in a messy data swamp.
Data Contracts – A new tool that acts as an agreement between data producers & consumers. These define the expectations, responsibilities, and standards for data sharing between systems and help facilitate smooth data exchanges and consistent data quality.
Data Design – Another more common term for this step is solution architecture, which is about preparing data for analytical platforms and consumption. This step differentiates between how and where the data will be used in analysis (e.g., BI, AI, ML, etc.), ensuring it is properly prepared.
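Data contracts are the easiest of these to make concrete. Below is a minimal Python sketch of what enforcing one might look like; the schema, field names and payloads are all hypothetical, and real implementations typically lean on tools like JSON Schema or a schema registry.

```python
# A minimal data-contract sketch: producer and consumer agree on a schema,
# and every payload is checked against it before being served.
CONTRACT = {
    "customer_id": str,
    "signup_date": str,   # ISO 8601 by agreement
    "lifetime_value": float,
}

def validate(payload: dict, contract: dict) -> list[str]:
    """Return a list of contract violations (empty means compliant)."""
    errors = []
    for field_name, expected_type in contract.items():
        if field_name not in payload:
            errors.append(f"missing field: {field_name}")
        elif not isinstance(payload[field_name], expected_type):
            errors.append(f"{field_name}: expected {expected_type.__name__}")
    return errors

good = {"customer_id": "C001", "signup_date": "2024-01-15", "lifetime_value": 1250.0}
bad = {"customer_id": "C002", "lifetime_value": "unknown"}

print(validate(good, CONTRACT))  # []
print(validate(bad, CONTRACT))   # ['missing field: signup_date', 'lifetime_value: expected float']
```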
The fifth stage is Analysis. This is where the rubber meets the road with data being used for analytics, AI/ ML, visualisation and actual business decision-making.
Analytics – Deriving insights from the data through various analytical methods (e.g., descriptive, diagnostic, predictive, prescriptive). This is the core step of the whole lifecycle for transforming data into actionable intelligence that gives your business a competitive edge.
AI Usage – With the growth of AI, it is worth calling out how data may need to travel separately to AI systems and machine learning models to feed automated decisioning tools. This is a separate step from Analytics because these AI applications (1) would likely require differently curated data, (2) aren't usually linked to analytical use cases or teams and (3) might be fully ingrained in business operations.
Visualisation – As a next step from analytics, visualisation presents data to facilitate understanding and insights, making complex data more accessible and comprehensible for stakeholders. This would be the front-end component of analytics, usually taking the form of dashboards, graphs, charts and applications.
Decisioning – Finally, the data insights are used to make informed business decisions. This is not really a tangible step in the data lifecycle but an outcome or output, allowing business users to guide strategies and actions and improve overall business performance.
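As a toy example of analytics flowing into decisioning, here is a short Python sketch. The sales figures and the 10% threshold are invented purely for illustration.

```python
# Sketch: descriptive analytics feeding a simple decision rule.
monthly_sales = {"Jan": 120_000, "Feb": 115_000, "Mar": 98_000}

months = list(monthly_sales)
latest, previous = monthly_sales[months[-1]], monthly_sales[months[-2]]
change = (latest - previous) / previous  # descriptive analytics output

# Decisioning: the insight becomes an action for the business.
if change < -0.10:
    print(f"Sales fell {abs(change):.1%} month-on-month: trigger retention campaign")
else:
    print(f"Sales change {change:.1%}: no action needed")
```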
The sixth and last stage is Reintroduction or Disposal. Often forgotten or deprioritised, this stage is where teams decide what to do with their data after it has been through the rest of the lifecycle, especially as it ages.
Reverse ETL – Setting up pipelines to move processed data back into operational systems to inform day-to-day operations. Because the data is pulled directly from analytical outputs deemed helpful by business users, it enriches those operational systems with insight. This step requires strong oversight and references back to the processing, management and serving phases.
Disposal – If the data is old, expired or irrelevant, teams should securely delete it. This ensures compliance with data retention policies and regulations; it also prevents data overload, reduces storage costs and mitigates security risks.
Archiving – If not reintroducing or disposing of the data, the other option is to archive it: moving it from primary storage to secondary, archival storage, which is usually less expensive but still searchable if needed for future compliance reasons.
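To tie the stage together, here is a sketch of that end-of-lifecycle decision in Python: recent, useful records are routed to reverse ETL, records within the retention window are archived, and the rest are disposed of. The retention windows, record fields and routing rules are all hypothetical.

```python
# Sketch of the end-of-lifecycle decision: reintroduce useful records via
# reverse ETL, archive what must be retained, and delete the rest.
from datetime import date, timedelta

RETENTION = timedelta(days=365 * 7)   # e.g. a 7-year compliance window
ACTIVE = timedelta(days=90)           # recent data still feeds operations

def route(record: dict, today: date) -> str:
    age = today - record["created"]
    if age <= ACTIVE and record["useful_to_business"]:
        return "reverse_etl"   # push back into operational systems
    if age <= RETENTION:
        return "archive"       # cheaper secondary storage, still searchable
    return "dispose"           # securely delete past the retention window

today = date(2025, 1, 1)
records = [
    {"id": 1, "created": date(2024, 12, 1), "useful_to_business": True},
    {"id": 2, "created": date(2021, 6, 1), "useful_to_business": False},
    {"id": 3, "created": date(2015, 1, 1), "useful_to_business": False},
]
for r in records:
    print(r["id"], "->", route(r, today))
# 1 -> reverse_etl, 2 -> archive, 3 -> dispose
```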
The core thing to understand from this adapted framework is that the data lifecycle is not just the progression of data's tangible qualities from start to finish; it is a dynamic process of how data evolves in the complex world of business.
Without thinking about it in this way, the lifecycle remains too technically focused and theoretical, obscuring for practitioners where and how data actually journeys through its lifecycle.
Next week, we will jump into a topic I am compelled to talk about—AI and the hype around it.
Subscribe and stay tuned for some candid material on what AI is in the Data Ecosystem, the danger of thinking about it in a vacuum, and what you need to do to approach it properly!
Thanks for the read! Comment below and share the newsletter/issue if you think it is relevant! Feel free to also follow me on LinkedIn (very active) or Medium (not so active). See you amazing folks next week!