3 Requirements for an Enterprise Data Lake
Many articles and talks over the last several years have argued for abandoning the Enterprise Data Warehouse (EDW) in favor of an Enterprise Data Lake, with some passionately promoting the idea and others just as passionately denying that it is achievable. In this article, I take a more pragmatic approach and lay out what enterprises should consider when designing a data management architecture.
The focus is on data lakes for enterprises, referred to here as Enterprise Data Lakes to distinguish them from data lakes created by internet, ad-tech or other technology companies, which have different types of data and access requirements.
The Enterprise Data Warehouse
The much-reviled and beleaguered Data Warehouse has been the mainstay of enterprises for over 20 years, supporting business reports and dashboards and allowing analysts to understand how the business is functioning. When built right, Data Warehouses provide robust security, audit and governance, which are critical, especially given the growing number of cyber-attacks today.
Alas, many data warehouse projects are so complex that they are never finished! Further, the strict, hierarchical governance that many IT departments created around the warehouse has caused a lot of frustration, because business analysts and researchers cannot explore the data freely.
The Hadoop Phenomenon
When Hadoop entered the mainstream, the big attraction for business analysts and data scientists was the ability to store and access data outside the restrictive bounds of IT! This raised the exciting possibility of finding new insights into business operations, optimizing spend and finding new revenue streams.
3 Requirements for the Enterprise Data Lake
James Dixon coined the term Data Lake in 2010 to mean data flowing from a single source, with the data being stored in its natural state. We have come a long way from that definition. Today, a Data Lake most commonly means a data repository for many different types and sources of data, whether structured or unstructured, internal or external, that facilitates different ways of accessing and analyzing the data. The Data Lake is built on Hadoop with the data stored in HDFS across a cluster of systems.
The 3 requirements for the Enterprise Data Lake are:
- It must collect and store data from one or more sources in its original, raw form and optionally, its various processed forms.
- It must allow flexible access to the data from different applications; for example, structured access to tables and columns as well as unstructured access to files.
- Entity and transaction data must have strong governance defined to prevent the lake from becoming a swamp.
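As a rough illustration of how the three requirements fit together, here is a minimal in-memory sketch in Python. It is not a Hadoop or HDFS API: `MiniLake`, `DataSet`, the `owner` and `classification` fields and the sample file name are all hypothetical, chosen only to show raw storage, two styles of access to the same bytes, and a governance check at ingest time.

```python
import csv
import io
from dataclasses import dataclass, field


@dataclass
class DataSet:
    """One ingested object plus the governance metadata attached to it."""
    name: str
    raw: bytes            # requirement 1: original bytes, never modified
    owner: str            # requirement 3: an accountable owner (hypothetical field)
    classification: str   # requirement 3: e.g. "public" or "restricted" (hypothetical)


@dataclass
class MiniLake:
    """Hypothetical in-memory stand-in for a data lake (not a real Hadoop API)."""
    datasets: dict = field(default_factory=dict)

    def ingest(self, name, raw, owner, classification):
        # Refuse data that arrives without governance metadata.
        if not owner or not classification:
            raise ValueError(f"{name}: owner and classification are required")
        self.datasets[name] = DataSet(name, raw, owner, classification)

    def read_file(self, name):
        # Requirement 2, unstructured access: return the raw bytes as stored.
        return self.datasets[name].raw

    def read_table(self, name):
        # Requirement 2, structured access: interpret the same bytes as CSV rows.
        text = self.datasets[name].raw.decode("utf-8")
        return list(csv.DictReader(io.StringIO(text)))


lake = MiniLake()
lake.ingest("crm_accounts.csv", b"id,region\n1,EMEA\n2,APAC\n",
            owner="sales-ops", classification="restricted")

print(lake.read_file("crm_accounts.csv"))    # file-style access to the raw bytes
print(lake.read_table("crm_accounts.csv"))   # table-style access with named columns
```

In a real lake the catalog and access rules would live in dedicated tools rather than in application code; the point of the sketch is only that the raw copy is kept intact while both access styles read from it.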
Enterprise Data Lake Architecture
The diagram below shows an Enterprise Data Lake that ingests data from many typical systems such as CRM, ERP and other transactional systems. In addition, it is fed unstructured data from web logs, social media, IoT devices and third-party sites (such as DMP, D&B), creating a single data repository. This rich data ecosystem can now support combining multiple sources of data for more accurate analytics and previously impossible insights into business operations.
With technologies such as BigForce SNAP, it is possible to run existing enterprise Business Intelligence (BI) tools as well as perform exploratory analysis with visualization tools such as Tableau.
Enterprise Data Lake Governance
More importantly, the Hadoop ecosystem now supports data governance through technologies like Ranger, Knox and Sentry. In combination with Kerberos and enterprise identity management systems such as Active Directory (AD) or other LDAP frameworks, it is possible to implement strong security and governance rules. See “Implementing Hadoop Security” for details.
The Modern Enterprise Data Architecture
But what if you already have an existing EDW with hundreds of applications, some of which use complex analytics functions? How best can you leverage the EDW while also moving to a modern data architecture that allows new data sources to be integrated and empowers your data scientists to integrate, enrich and analyze lots of data without the restrictions of the EDW?
A happy compromise between the data lake and the data warehouse does exist. Data architects and businesses have realized that it IS possible to build on the strengths of each system.
In this architecture, the data lake serves as the repository for all raw data, ingested from all the relevant data sources of an organization. Optionally, the data lake can also store cleansed and integrated data which is then also fed into the data warehouse. This way, newer BI applications can be built directly on the enterprise data lake while existing applications can continue to run on the EDW.
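The flow described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the zone names, the `cleanse` function, the record fields and the cleansing rules are hypothetical, and a real pipeline would run on the lake's own processing engines rather than on Python lists.

```python
def cleanse(raw_records):
    """Hypothetical cleansing step: trim ids, drop rows without one, type the amount."""
    cleansed = []
    for rec in raw_records:
        customer_id = rec.get("customer_id", "").strip()
        if not customer_id:
            continue  # incomplete rows remain available in the raw zone only
        cleansed.append({
            "customer_id": customer_id,
            "amount": float(rec["amount"]),
        })
    return cleansed


# Raw zone: everything is kept exactly as it arrived from the source systems.
raw_zone = [
    {"customer_id": " C1 ", "amount": "10.50"},
    {"customer_id": "", "amount": "3.00"},
    {"customer_id": "C2", "amount": "7"},
]

# Cleansed zone: optional processed copy stored alongside the raw data in the lake.
cleansed_zone = cleanse(raw_zone)

# The same cleansed data is also fed to the EDW for existing applications.
warehouse_feed = list(cleansed_zone)

print(len(raw_zone), "raw records,", len(warehouse_feed), "records fed to the EDW")
```

Because the raw zone is never overwritten, newer applications can go back to the original records, while the EDW continues to receive only the cleansed subset.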
Data Governance in the Enterprise Data Lake
In most cases, the Data Governance policies that apply to enterprise data in the EDW should also apply to the same data within the Enterprise Data Lake. Otherwise, security holes and data inconsistencies may open up between the two systems. If careful consideration is not given to governance, the data lake will turn into a data swamp!
However, since the data lake consists of all the raw data from operational systems as well as new data sources, it is possible to now provide data scientists and other analysts access to these data sets for new exploratory analytics.
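One way to keep governance aligned across the two systems is to compare their access policies periodically. The sketch below assumes, purely for illustration, that each system's policies can be exported as a mapping from data set name to the set of roles allowed to read it; the `policy_gaps` function, the role names and the data set names are hypothetical and do not correspond to any Ranger or Sentry API.

```python
def policy_gaps(edw_policies, lake_policies):
    """Return roles that can read a data set in the lake but not in the EDW."""
    gaps = {}
    for dataset, edw_roles in edw_policies.items():
        lake_roles = lake_policies.get(dataset)
        if lake_roles is None:
            continue  # data set has no lake copy, so there is nothing to compare
        extra = lake_roles - edw_roles
        if extra:
            gaps[dataset] = extra
    return gaps


# Hypothetical exported policies: data set name -> roles with read access.
edw_policies = {
    "customers": {"finance", "sales"},
    "payroll": {"hr"},
}
lake_policies = {
    "customers": {"finance", "sales"},
    "payroll": {"hr", "data-science"},   # wider than the EDW policy
    "weblogs": {"data-science"},         # lake-only source, outside the comparison
}

print(policy_gaps(edw_policies, lake_policies))
```

Lake-only sources such as the web logs are deliberately left out of the comparison, since they are the data sets opened up for exploratory analysis.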
Designing a modern data architecture requires a thorough understanding of the requirements, the existing applications, and the future needs and goals of the enterprise. Especially important to consider are master data and metadata management, governance and security, and the choice of the right technologies.
At Orzota, we have built data lakes for a variety of businesses and have a methodology in place to ensure success. Contact us for more information.