
Data lakes and warehouses part 2: Databricks and Snowflake

It is time to move data analytics to the cloud. We compare Databricks and Snowflake to assess the differences between data lake based and data warehouse based solutions.

Timo Aho / September 07, 2021

In this post we go through the distinction between data warehouse based and data lake based cloud big data solutions. We do this by comparing two popular technologies available in multiple cloud environments: Databricks and Snowflake.

For background on the topic, please read my previous blog post about the data lake and data warehouse paradigms. The other posts in the data lakes and warehouses series are:

Part 1: Intro to paradigms

Part 2: Databricks and Snowflake

Part 3: Azure Synapse point of view

Part 4: Challenges

Part 5: Hands on solutions

Part 6: Microsoft Fabric as a data lakehouse technology

 

Databricks and Snowflake

As we learnt in the previous post, a data analytics platform can be divided into multiple stages. The picture above gives a general understanding of the roles of Snowflake and Databricks in the pipelines. Here we can categorize the tools as either processing (green) or storage (blue). Databricks is a processing tool, while Snowflake covers both processing and storage. Delta Lake, on the other hand, is a storage solution related to Databricks. We will cover it later.

According to the definitions given in the previous article, we can roughly say that Databricks is a data lake based tool and Snowflake is a data warehouse based tool. Let us now dig a bit deeper into these tools.

Databricks is a data lake tool with data warehouse features

Databricks is an Apache Spark based processing tool that provides a programming environment with highly and automatically scalable computing capacity. Apache Spark is the de facto standard programming framework for code-based big data processing.

Databricks billing is essentially usage based: you pay for the computational resources you use and nothing else. Databricks is particularly suitable for processing data in the early stages of a pipeline, especially between the bronze and silver layers. It can also be used for preparing gold layer data, but it is not at its best in serving data to, say, reporting tools.

Recently, Databricks has significantly extended its capabilities in the direction of a traditional data warehouse. It provides a ready-made SQL query interface and a lightweight visualization layer. In addition, Databricks offers a database-style table structure, developed specifically around the Delta file format.

The Delta file format is an approach for bringing database strengths into the data lake world. Among other things, it provides data schema versioning and database-style ACID transactions. In accordance with the data lake paradigm, the file format itself is open and free for anyone to use.
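To make this concrete, here is a small sketch of what working with a Delta table looks like in Databricks SQL. The table and column names are made up for illustration:

```sql
-- Create a managed table in the Delta format (USING DELTA makes the
-- format explicit; it is the default in recent Databricks runtimes).
CREATE TABLE sales_silver (
    order_id   BIGINT,
    amount     DECIMAL(10, 2),
    updated_at TIMESTAMP
) USING DELTA;

-- ACID guarantees: this MERGE either fully succeeds or fully rolls back.
MERGE INTO sales_silver AS target
USING sales_bronze AS source
ON target.order_id = source.order_id
WHEN MATCHED THEN UPDATE SET *
WHEN NOT MATCHED THEN INSERT *;

-- Versioning: inspect the change history and query an older snapshot.
DESCRIBE HISTORY sales_silver;
SELECT * FROM sales_silver VERSION AS OF 1;
```

The last two statements illustrate the database-like versioning: every write produces a new table version, and older versions remain queryable.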

Based on the Delta format and the Databricks tool, the company is promoting the notion of a novel “data lakehouse” paradigm: a hybrid of the data lake and data warehouse approaches.

Snowflake is a scalable data warehouse drawing from the data lake paradigm

Snowflake is a scalable data warehouse solution developed specifically for cloud environments. It stores data in cloud storage in a proprietary file format; in accordance with the data warehouse paradigm, the data is therefore only accessible through Snowflake. In addition to computational resources, you also pay for data storage in the Snowflake file format. In return, you get the typical data warehouse features, such as granular permission management.
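As an example of the granular permission management, access in Snowflake is controlled with standard SQL grants on a role hierarchy. A minimal sketch, with made-up role, database, and schema names:

```sql
-- Create a role and grant it read-only access to one schema.
CREATE ROLE analyst;
GRANT USAGE ON DATABASE sales_db TO ROLE analyst;
GRANT USAGE ON SCHEMA sales_db.reporting TO ROLE analyst;
GRANT SELECT ON ALL TABLES IN SCHEMA sales_db.reporting TO ROLE analyst;

-- Attach the role to a user.
GRANT ROLE analyst TO USER timo;
```

This kind of fine-grained, role-based control is exactly what the data lake side has traditionally lacked.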

Snowflake disrupted the data warehouse market a few years ago by offering highly distributed and scalable computation capacity. It does this by completely separating the storage and processing layers of the data warehouse architecture. Traditionally, the coupling of the two has been a major obstacle for data warehouse solutions in the big data world. This separation is one of the ways Snowflake is expanding its solution in the direction of the data lake paradigm. Nowadays it also offers, among other things, efficient tools for real-time data ingestion.
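The separation of storage and compute shows up directly in the SQL interface: compute is provisioned as independently sized virtual warehouses on top of shared storage, and continuous ingestion is handled by the Snowpipe feature. A rough sketch, with hypothetical object names:

```sql
-- Two independently scaled compute clusters over the same stored data.
CREATE WAREHOUSE etl_wh    WITH WAREHOUSE_SIZE = 'LARGE'  AUTO_SUSPEND = 60;
CREATE WAREHOUSE report_wh WITH WAREHOUSE_SIZE = 'XSMALL' AUTO_SUSPEND = 60;

-- Near real-time ingestion: a pipe that loads new files from a stage
-- (e.g. an S3 or Azure Blob location) as they arrive.
CREATE PIPE sales_pipe AUTO_INGEST = TRUE AS
  COPY INTO raw.sales
  FROM @raw.sales_stage
  FILE_FORMAT = (TYPE = 'JSON');
```

Because each warehouse is billed only while running, a heavy ETL cluster and a small reporting cluster can coexist over the same data without paying for idle capacity.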

It is probably not an overstatement to say that the success of Snowflake caused a crisis in Amazon Redshift and Azure Data Warehouse development. The scalability of the latter two data warehouse solutions was significantly more restricted: if you wanted to avoid high expenses, you had to choose between small storage capacity and slow processing, and a suitable combination was often difficult to find. Thus, you usually paid a significant amount of money for reserved resources you did not actually use. Nevertheless, both products have since taken steps towards solving this issue.

Conclusions: Databricks and Snowflake

In this post we discussed two very popular multi-cloud data analytics products: Databricks and Snowflake. We specifically studied them from the viewpoint of their background paradigms as discussed in the previous blog post.

We noted that Snowflake has its basis in the data warehouse world, while Databricks is more data lake oriented. However, both have extended their reach beyond the typical limits of their paradigms.

Either tool can certainly be used alone to fulfill the needs of a data analytics platform. Databricks can serve data directly from storage or export it into data marts, with no need for a separate data warehouse. On the other hand, data can be ingested directly into Snowflake for processing, modeling, and serving. In my experience, pure Snowflake solutions are more common, perhaps because Databricks has not been around for as long.

However, as brought up in the previous post, it might be a good idea to use both products in a single platform. The breakdown of this kind of solution is depicted in the picture, with Databricks reading and processing the raw data and Snowflake taking care of the publishing end of the pipeline. It is also worth noting that Databricks and Snowflake are collaborating on better integration between their products.

All in all, the future seems even brighter for hybrid solutions.



Timo Aho
Cloud Data Expert, Tietoevry Create

Timo is a cloud data expert (PhD) with over a decade of experience in modern data solutions. He enjoys trying out new technologies and is particularly interested in technologies of storing, organizing and querying data efficiently in cloud environments. He has worked in data roles both as a consultant and in-house.
