What is a data lake?

Raw data fidelity and long-term storage in the cloud.

Data lakes and data warehouses are both design patterns, but they sit at opposite ends of the data management spectrum. Data warehouses structure and package data for quality, consistency, reuse, and performance under high concurrency. Data lakes complement warehouses with a design pattern that focuses on original raw data fidelity and long-term storage at low cost while providing a new form of analytical agility.

The Value in Data Lakes

Data lake solutions meet the need to economically harness and derive value from exploding data volumes. Much of this is "dark" data from new sources such as web, mobile, and connected devices; it was often discarded in the past, yet it contains valuable insight. Massive volumes, plus new forms of analytics, demand a new way to manage and derive value from data.

A data lake is a collection of long-term data containers that capture, refine, and explore any form of raw data at scale. It is enabled by low-cost technologies and serves as a source that multiple downstream facilities can draw upon, including data marts, data warehouses, and recommendation engines.
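To make the capture-and-refine idea concrete, here is a minimal Python sketch, assuming a local directory stands in for object storage and pandas with a Parquet engine (such as pyarrow) is installed. The zone names, paths, and sample events are illustrative, not part of any prescribed layout.

```python
# A minimal sketch of the capture/refine flow a data lake supports.
# Zone names and paths are illustrative, not a prescribed layout.
import json
import pathlib

import pandas as pd

LAKE = pathlib.Path("datalake")             # hypothetical lake root (local stand-in for object storage)
RAW = LAKE / "raw" / "clickstream"          # raw zone: data lands exactly as received
REFINED = LAKE / "refined" / "clickstream"  # refined zone: cleaned, columnar copies for downstream use

RAW.mkdir(parents=True, exist_ok=True)
REFINED.mkdir(parents=True, exist_ok=True)

# Capture: persist incoming events verbatim, preserving raw fidelity.
events = [
    {"user": "u1", "page": "/home", "ts": "2024-01-01T10:00:00"},
    {"user": "u2", "page": "/pricing", "ts": "2024-01-01T10:01:00"},
]
(RAW / "events_0001.json").write_text("\n".join(json.dumps(e) for e in events))

# Refine: parse and type the raw records, then store a columnar copy that
# data marts, warehouses, or recommendation engines can draw on.
df = pd.read_json(RAW / "events_0001.json", lines=True)
df["ts"] = pd.to_datetime(df["ts"])
df.to_parquet(REFINED / "events_0001.parquet", index=False)
```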

Insights from noncurated data
Prior to the big data trend, data integration normalized information in some form of persistence, such as a database, and that normalization created the value. This alone is no longer enough to manage all data in the enterprise, and attempting to structure it all undermines the value. That is why dark data is rarely captured in a database, yet data scientists often dig through it to find the few facts worth repeating.

New forms of analytics
The cloud era has given rise to new forms of analytics. Technologies such as Apache Hadoop and Spark parallelize procedural programming languages, enabling an entirely new breed of analytics that can be processed efficiently at scale: graph, text, and machine learning algorithms that compute an answer, compare it against the next piece of data, and iterate until a final output is reached.
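As a toy illustration of that iterative pattern (not any particular framework's API), the sketch below keeps a running answer and refines it against each new data point; in practice, engines such as Spark distribute this kind of loop across a cluster.

```python
# Toy illustration of the iterative pattern: maintain an answer, compare it
# with each new observation, and update until the data is exhausted.
def running_mean(values):
    estimate, count = 0.0, 0
    for x in values:
        count += 1
        # Compare the current answer with the next piece of data and adjust.
        estimate += (x - estimate) / count
    return estimate

session_lengths = [12.0, 7.5, 30.2, 4.1, 18.8]  # hypothetical raw measurements
print(running_mean(session_lengths))            # -> 14.52 (the mean)
```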

Corporate memory retention
Archiving data that has not been used in a long time saves storage space in the data warehouse. Until the data lake design pattern came along, there was nowhere to put colder data for occasional access other than the high-performance data warehouse or offline tape backup. With virtual query tools, users can easily access cold data in the lake alongside the warm and hot data in the data warehouse through a single query.
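The sketch below shows what such a single query can look like, using DuckDB in Python purely as a stand-in for a virtual query layer (Teradata's own tooling differs). The table names, columns, and archive paths are hypothetical.

```python
# A minimal sketch of one query spanning hot warehouse data and cold lake data.
import duckdb
import pandas as pd

# Hot data: recent orders, imagined as already pulled from the warehouse.
recent_orders = pd.DataFrame({
    "customer_id": [1, 2],
    "amount": [120.0, 75.5],
})

con = duckdb.connect()
con.register("recent_orders", recent_orders)

# Cold data: multi-year order history parked in the lake as Parquet files.
result = con.execute("""
    SELECT customer_id, SUM(amount) AS lifetime_spend
    FROM (
        SELECT customer_id, amount FROM recent_orders
        UNION ALL
        SELECT customer_id, amount
        FROM read_parquet('datalake/archive/orders_*.parquet')
    )
    GROUP BY customer_id
""").df()
```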

New approach to data integration
The industry has come full circle on how best to squeeze cost out of data transformation. Data lake solutions offer greater scalability than traditional ETL (extract, transform, load) servers at a lower cost. Organizations following best practices rebalance hundreds of data integration jobs across the data lake, data warehouse, and ETL servers, since each has its own capabilities and economics.

Common Pitfalls of Data Lakes

On the surface, data lakes appear straightforward: a way to manage and exploit massive volumes of structured and unstructured data. But they are not as simple as they seem, and failed data lake projects are not uncommon across many industries and organizations. Early projects faced challenges because best practices had yet to emerge; today, a lack of solid design is the primary reason data lakes fail to deliver their full value.

Data silo and cluster proliferation
There is a notion that data lakes have a low barrier to entry and can be stood up ad hoc in the cloud. The result is redundant data, inconsistency between lakes that never reconcile, and synchronization problems.

Lack of end user adoption
Users have the perception, right or wrong, that getting answers from a data lake is too complicated: it requires advanced coding skills, or they simply cannot find the needles they need in the data haystacks.

Limited commercial off-the-shelf tools
Many vendors claim to connect to Hadoop or cloud object stores, but the offerings lack deep integration and most of these products were built for data warehouses, not data lakes.

Conflicting objectives for data access
There is a balancing act between how strict security measures should be and how agile access needs to be. Plans and procedures must be in place to align all stakeholders.

The data lake design pattern

The design pattern offers a set of workloads and expectations that guide a successful implementation. As the technology and experience matured, an architecture and corresponding requirements evolved, and leading vendors now agree on best practices for implementation. Technologies are critical, but the design pattern, which is independent of technology, is paramount. A data lake can be built on multiple technologies; while the Hadoop Distributed File System (HDFS) is what most people think of first, it is not required.
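One way to see that technology independence is that the same access code can point at HDFS, cloud object storage, or a local directory simply by swapping the storage URI. The Python sketch below assumes pandas with pyarrow and the matching fsspec backends (s3fs for S3, an HDFS client for hdfs://) are installed; the hosts, buckets, and paths are hypothetical.

```python
# Minimal sketch: one loader, multiple storage technologies behind the lake.
import pandas as pd

LAKE_URIS = {
    "hdfs":  "hdfs://namenode:8020/lake/refined/events/",
    "s3":    "s3://example-bucket/lake/refined/events/",
    "local": "datalake/refined/events/",
}

def load_events(storage: str) -> pd.DataFrame:
    """Read the refined events dataset from whichever storage backs the lake."""
    return pd.read_parquet(LAKE_URIS[storage])

df = load_events("local")
print(df.head())
```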

Teradata data lake solutions

Teradata Vantage, the platform for pervasive data intelligence, is designed to tap into the nuggets of information within customers’ data. The Teradata services team is well-versed in leveraging the many benefits of data lakes and related technologies such as Hadoop, Cassandra, and object stores like Amazon S3 and Azure Blob.

Cloud Analytics - AWS: Use AWS infrastructure with Teradata Vantage

Cloud Analytics - Microsoft Azure: Combine Azure resources with Teradata Vantage

Cloud Analytics - Google Cloud: Leverage Google Cloud with Teradata Vantage

Rise above needless bottlenecks and complexity: take analytics to the cloud.