Databricks vs. Snowflake
Databricks and Snowflake are solutions for processing big data workloads and tend to be deployed at larger enterprises. Databricks handles data ingestion, data pipeline engineering, and ML/data science through its collaborative notebook environment for writing in R, Python, and other languages. Snowflake is a cloud-based SQL data warehouse that emphasizes analytics acceleration, data access, and BI collaboration through two offerings: the Snowflake Data Exchange, which allows data to flow across departments, and the Snowflake Data Marketplace, a data-as-a-service (DaaS) platform that provides data from third parties or vendor partners and lets users monetize their own data.
Through a strategic partnership, a connector between the two solutions enables users to leverage Databricks' processing layer to ingest and prepare data for storage in Snowflake, and to act on data stored in Snowflake from a Databricks notebook (e.g., to train an ML model). Documentation on implementing the connection is published by Snowflake, Databricks, and Azure support.
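As a rough sketch of how that connection is configured, the Spark–Snowflake connector used in Databricks notebooks takes a map of Snowflake connection options. The account URL, credentials, and table names below are hypothetical placeholders, not real values:

```python
# Minimal sketch: assembling the option map the Spark-Snowflake connector
# expects when reading or writing Snowflake data from a Databricks notebook.
# All connection values here are hypothetical placeholders.

def snowflake_options(user, password, account_url, database, schema, warehouse):
    """Build the standard option keys for the Spark-Snowflake connector."""
    return {
        "sfUrl": account_url,    # e.g. "<account>.snowflakecomputing.com"
        "sfUser": user,
        "sfPassword": password,
        "sfDatabase": database,
        "sfSchema": schema,
        "sfWarehouse": warehouse,
    }

# Inside a Databricks notebook (where `spark` is predefined), usage would
# look roughly like:
#
#   opts = snowflake_options("user", "pw", "acct.snowflakecomputing.com",
#                            "MY_DB", "PUBLIC", "MY_WH")
#   df = (spark.read
#         .format("snowflake")
#         .options(**opts)
#         .option("dbtable", "SOURCE_TABLE")
#         .load())
#
#   # ...train a model or transform df, then write results back:
#   df.write.format("snowflake").options(**opts) \
#       .option("dbtable", "RESULT_TABLE").save()
```

The read/write calls are shown as comments because they require a live Databricks cluster and Snowflake account; consult the vendors' documentation for the authoritative setup steps.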
Features
For entities processing large quantities of data, Snowflake and Databricks both provide unique advantages over competitors.
Users appreciate Snowflake for its power, the ease of use of its SQL query engine, and the speed of its data warehouse when querying data. Snowflake is flexible, running on Azure, Amazon S3, or Google Cloud. It is also described as fast to set up and operable with a low footprint.
Similarly, Databricks is praised for its core competencies: its data science notebook is considered better than alternatives (e.g., Jupyter Notebook) for enabling flexible, fast analysis of massive amounts of data while switching between SQL, R, Scala, and Python. Its open-source community documentation, available to all, is well regarded.
Limitations
Databricks and Snowflake also have some key limitations that are important to consider.
Databricks is costly, as is its certification. Additionally, Databricks can be hard for non-technical users, who report that its in-app help is unclear and difficult to navigate. A few also say Databricks lacks good visualizations for displaying work.
Snowflake lacks a desktop tool, which opens the door to competitors with on-prem options (Vertica, Teradata Vantage, SAP BW/4HANA). Snowflake's performance depends entirely on the underlying cloud service provider, which may present an issue. Some complain that Snowflake's UI is difficult to use, and that its table expression support and SQL editor lack expected features (e.g., debugging, auto-fill).
Pricing
Snowflake’s pricing is based on storage and data loading usage, as well as service tier. The Standard tier starts at $2 per compute hour and is a complete SQL Data Warehouse with always-on encryption and 24-hour support. Snowflake Enterprise adds data masking, tokenization, search optimization for lookups, as well as extended “time travel” (i.e. historical data access) for $3 per compute hour. The Business Critical edition is $4 per compute hour and includes data compliance, Azure / AWS PrivateLink support, database failover protection, and continuity service.
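Using the per-compute-hour rates quoted above, a back-of-the-envelope compute estimate is simple tier rate times hours (storage is billed separately, and actual rates vary by region and contract):

```python
# Illustrative Snowflake compute-cost estimate, using the per-compute-hour
# rates quoted in the text. Real pricing varies by region and contract.

SNOWFLAKE_RATES = {          # USD per compute hour
    "standard": 2.00,
    "enterprise": 3.00,
    "business_critical": 4.00,
}

def snowflake_compute_cost(tier, compute_hours):
    """Estimate compute spend for a tier; storage is billed separately."""
    return SNOWFLAKE_RATES[tier] * compute_hours

# e.g. 200 compute hours on the Enterprise tier:
# snowflake_compute_cost("enterprise", 200) -> 600.0
```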
Azure Databricks bills for the virtual machines provisioned in clusters and for Databricks Units (DBUs, units of processing capability billed per second), which vary by VM instance. VM costs follow Azure's rates. DBUs are priced by workload type (Data Engineering, Data Engineering Light, or Data Analytics) and service tier (Standard or Premium). The Premium tier adds authentication and access-control features and an audit log. The Data Analytics workload is $0.40 per DBU-hour ($0.55 Premium) and includes data preparation and the data science notebook. The Data Engineering workload includes only the data pipeline and workload processing, at $0.15 per DBU-hour ($0.30 Premium). Data Engineering Light, at $0.07 per DBU-hour ($0.22 Premium), only allows users to run jobs.
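Because the Azure bill combines a separately metered VM charge with the DBU rate, an estimate looks like the sketch below (rates are the illustrative figures from the text):

```python
# Rough Azure Databricks cost estimate: DBU charge by (workload, tier),
# plus the VM cost that Azure meters separately. Rates are the
# illustrative figures quoted in the text.

AZURE_DBU_RATES = {  # USD per DBU-hour
    ("data_analytics", "standard"): 0.40,
    ("data_analytics", "premium"): 0.55,
    ("data_engineering", "standard"): 0.15,
    ("data_engineering", "premium"): 0.30,
    ("data_engineering_light", "standard"): 0.07,
    ("data_engineering_light", "premium"): 0.22,
}

def azure_databricks_cost(workload, tier, dbu_hours, vm_cost=0.0):
    """Estimated total: DBU charge plus the separately metered VM cost."""
    return AZURE_DBU_RATES[(workload, tier)] * dbu_hours + vm_cost
```

For example, 100 DBU-hours of Data Analytics on Premium plus $120 of VM time would come to roughly $0.55 × 100 + $120 = $175.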
Databricks on AWS pricing depends on service tier (Standard, Premium, or Enterprise) and workload. The service tier determines security and privacy features: higher tiers add Optimized Autoscaling, with Premium adding role-based access control, federated IAM, and more, and Enterprise adding HIPAA-compliant storage, audit access lists, and customer-managed keys. The Jobs Compute workload lets users run data engineering pipelines and manage and clean data lakes ($0.07, $0.10, and $0.13 per DBU-hour across the tiers), while the All-Purpose Compute workload ($0.40, $0.55, $0.65) is fully featured.
