“When Data Lakes Become Data Swamps”: Avoiding the Pitfalls of Data Regret by Implementing a Connected Data Ecosystem
Data regret refers to instances where data can no longer be retrieved because it was destroyed, whether accidentally or intentionally. Storage strategies such as data lakes or data lakehouses are centralised repositories that provide a means of retaining data at scale.
However, data retention, or more precisely excessive data retention, comes with its own challenges. Data lakes were largely invented to avoid data regret, but they have since fallen out of favour: because these repositories hold such large volumes of data, retrieval can be laborious and analysis complex.
Fausto Artico, Global R&D Tech Head and Director of Innovation and Data Science at GlaxoSmithKline, says, “when data lakes become data swamps, you know you have a problem.” Artico spoke at Oxford Global’s Pharma Data & Smartlabs UK: In-Person conference in 2022.
During a panel discussion called Strategies to Transform Healthcare Through a Connected Data Ecosystem, he discussed the need to move away from reliance on data lakes and implement a connected data ecosystem instead.
Understanding the Quagmire of Excessive Data Retention
“Trying to ‘purify’ the unorganised and generally messy data held in a data swamp is a long and arduous process,” Artico explained. It often requires the expertise of data scientists and engineers to locate and interpret the information. The need for specialist professional handling seriously limits the accessibility and practicality of the storage system.
The pitfalls of data storage tie into the FAIR principles: data should be findable, accessible, interoperable, and reusable. Depositing large amounts of information into data lakes can, in fact, become a bottleneck to the findability and accessibility that FAIR demands. Although implemented as a strategy to avoid data regret, such measures can unintentionally impede efficient data retrieval and comprehension.
Implementing a Connected Data Ecosystem
So how is it possible to avoid the pitfalls of strategies intended to prevent data regret? The answer: data integration.
Data integration refers to collecting data from multiple sources across an organisation or business to provide an authoritative and complete data set for data analysis, bioinformatics, and other related processes.
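In practice, integration means mapping each source onto a shared schema before combining records. The following is a minimal, illustrative sketch in Python, assuming two hypothetical extracts (a LIMS export and a clinical database) with mismatched column names and units; all names and values here are invented for illustration:

```python
import pandas as pd

# Hypothetical extracts from two heterogeneous sources: a LIMS export
# and a clinical database, each with its own column names and units.
lims = pd.DataFrame({
    "sample_id": ["S-001", "S-002"],
    "assay_result_mg_ml": [1.2, 0.8],
})
clinical = pd.DataFrame({
    "SampleID": ["S-003"],
    "result": [0.0011],  # recorded in g/ml
})

# Harmonise each source to a shared schema before combining,
# keeping a 'source' column so every record stays traceable.
lims_std = lims.rename(columns={"assay_result_mg_ml": "result_mg_ml"})
lims_std["source"] = "lims"

clinical_std = clinical.rename(columns={"SampleID": "sample_id"})
clinical_std["result_mg_ml"] = clinical_std.pop("result") * 1000  # g/ml -> mg/ml
clinical_std["source"] = "clinical_db"

# One authoritative, analysis-ready data set.
integrated = pd.concat([lims_std, clinical_std], ignore_index=True)
print(integrated)
```

The point of the `source` column is that, even after records are merged, every row can still be traced back to the system it came from.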
Julie Pouget, R&D Global Data Governance Lead at Sanofi, who joined Artico on the panel, commented that “integration is key.” She explained how “fragmented, or more generally speaking, just plain difficult to use data requires harmonisation, which can be achieved through introducing data lineage capabilities and key data quality indicators.”
Data lineage allows for the “understanding of where data comes from and indicates the transformations applied to the data in order to retrieve its history.” This is an especially helpful strategy for data harmonisation across a large organisation with multiple heterogeneous data sources. Data lineage capability facilitates enriched analysis and thorough investigation of source information.
Moreover, data lineage approaches offer a holistic view of essential relationships and interactions that are hidden or implicit in the data and would otherwise go uncovered. They also give end users a more meaningful way to obtain and consume data.
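To make the idea concrete, here is a minimal sketch of lineage tracking: a hypothetical `LineageTracker` wrapper that logs a human-readable entry for every transformation applied to a table. The class, file name, and transformation steps are illustrative assumptions, not a specific product's API:

```python
import pandas as pd

class LineageTracker:
    """Wraps a DataFrame and records every transformation applied to it,
    so the data's history can be retrieved and audited later."""

    def __init__(self, df: pd.DataFrame, origin: str):
        self.df = df
        self.history = [f"loaded from {origin}"]

    def apply(self, description: str, fn):
        """Apply a transformation and log a human-readable description."""
        self.df = fn(self.df)
        self.history.append(description)
        return self

# Hypothetical usage: each step leaves a lineage entry behind.
raw = pd.DataFrame({"assay": ["ELISA", "elisa", None], "value": [1.0, 2.0, 3.0]})
tracked = (
    LineageTracker(raw, origin="lims_export_2022.csv")
    .apply("dropped rows with missing assay names",
           lambda d: d.dropna(subset=["assay"]))
    .apply("normalised assay names to lower case",
           lambda d: d.assign(assay=d["assay"].str.lower()))
)
print(tracked.history)  # where the data came from and what was done to it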
Enhancing Data Security Within a Connected Ecosystem
As Artico pointed out, the question of how to build a connected data ecosystem for the future of healthcare “really comes down to the question of security.” Pouget agreed, saying that “making sure that the right people have the right access to the right data is what it is all about.” Being able to tokenise or anonymise data is integral to meeting regulatory standards and to accelerating integration and harmonisation across data sources.
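As an illustration of tokenisation, the sketch below uses keyed hashing (HMAC) so that the same identifier always maps to the same token across sources without being reversible; the key handling and identifier format are hypothetical assumptions, not a description of any panellist's actual system:

```python
import hmac
import hashlib

# Hypothetical secret key; in practice this would live in a key vault
# so the same identifier tokenises identically across all sources.
SECRET_KEY = b"replace-with-managed-key"

def tokenise(identifier: str) -> str:
    """Replace a direct identifier with a stable, non-reversible token.
    Keyed hashing (HMAC) means the token cannot be recomputed without
    the key, unlike a plain hash of the identifier."""
    return hmac.new(SECRET_KEY, identifier.encode("utf-8"),
                    hashlib.sha256).hexdigest()

# The same patient ID yields the same token in every source system,
# so records can still be joined after anonymisation.
print(tokenise("patient-12345"))
print(tokenise("patient-12345") == tokenise("patient-12345"))  # True
```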
From a data governance perspective, having data security procedures in place provides a degree of protection. “With security programs such as GxP-validated software, you can essentially remove the human element from the loop and say to the regulators, ‘look, there is no human error here,’” Artico explained.
GxP is shorthand for the ‘good practice’ quality guidelines and regulations, where the ‘x’ denotes the field in question, such as manufacturing (GMP) or laboratory (GLP) work. Within the pharmaceutical industry, computerised GxP systems execute and record a range of regulated processes and activities to enable a modern, legal, and secure data ecosystem. “To be able to accurately demonstrate to the regulators during inspection just how you have implemented a secured data ecosystem is going to be beyond helpful,” Artico continued.
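The kind of automated, tamper-evident record-keeping such systems rely on can be sketched as a hash-chained audit log, where each entry is bound to the one before it so after-the-fact edits are detectable. This is an illustrative toy, not GxP-validated software, and the entry fields and actions are assumptions:

```python
import hashlib
import json
from datetime import datetime, timezone

# A minimal, illustrative audit trail: each entry is chained to the
# previous one by hash, so any after-the-fact edit is detectable.
audit_log: list[dict] = []

def record_action(user: str, action: str) -> None:
    prev_hash = audit_log[-1]["hash"] if audit_log else "genesis"
    entry = {
        "timestamp": datetime.now(timezone.utc).isoformat(),
        "user": user,
        "action": action,
        "prev_hash": prev_hash,
    }
    payload = json.dumps(entry, sort_keys=True).encode("utf-8")
    entry["hash"] = hashlib.sha256(payload).hexdigest()
    audit_log.append(entry)

# Hypothetical automated pipeline actions, recorded with no human in the loop.
record_action("pipeline-bot", "ingested batch 42 from lims_export.csv")
record_action("pipeline-bot", "applied anonymisation step v1.3")
for e in audit_log:
    print(e["timestamp"], e["action"], e["hash"][:12])
```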
Embracing AI in the Data Ecosystem
Looking towards the future of the integrated data ecosystem, the use of Artificial Intelligence (AI) and Machine Learning (ML) will become an industry priority. This is because building a connected data ecosystem is a complex problem, requiring sophisticated process execution.
Humans can only consume and identify patterns in limited amounts of data, whereas ML can process data at scale and generate insights that inform how humans interact with that data. “We need to look at how AI and ML with robotic process automation can simplify the job of accomplishing a connected data ecosystem,” Artico forecast.