Azure HDInsight vs. Databricks
Databricks and Azure HDInsight are platforms for processing big data workloads and tend to be deployed at larger enterprises. Databricks handles data ingestion, data pipeline engineering, and ML/data science with its collaborative notebook, which supports R, Python, and other languages. HDInsight is a managed cloud service that makes it easier to run open-source frameworks such as Apache Hadoop, Spark, and Kafka.
Features
Databricks and HDInsight are both generally well liked for big data processing. Users single out the following strengths:
Databricks is praised for its core competencies: users rate its data science notebook above alternatives (e.g. Jupyter Notebook) for flexible, fast analysis of massive amounts of data while switching between SQL, R, Scala, and Python. Its open-source community documentation, available to all, is well regarded.
HDInsight benefits from features of Azure: it is highly available with a satisfactory SLA, and the service is regarded as a cost-effective way to process and retrieve data stored in Hadoop or Azure Data Lake.
Limitations
A few limitations exist that might cause one to look elsewhere for big data processing needs.
Azure HDInsight is generally cost-effective, but some users say its cost can balloon when it is used for long-term warehousing of frequently queried data, a use case where on-prem solutions may be superior. Additionally, some users report glitches and performance issues when loading or processing very large volumes of data.
Databricks is costly, and so is its certification. It can also be hard for non-technical users, who say its in-app help is unclear and hard to use. A few users say Databricks lacks good visualizations for displaying work.
Pricing
Databricks is available free through its Community Edition, or through its Enterprise Cloud editions on Azure or AWS. Pricing can be complex.
Azure Databricks “Databricks Units” (DBUs) are priced by workload type (Data Engineering, Data Engineering Light, or Data Analytics) and service tier (Standard or Premium). Premium adds authentication, access features, and audit logs. The Data Analytics workload is $0.40 per DBU hour ($0.55 on the Premium tier) and includes data prep and the data science notebook. The Data Engineering workload includes data pipeline and workload processing, for $0.15 per DBU hour ($0.30 on the Premium tier). Data Engineering Light is $0.07 per DBU hour ($0.22 on the Premium tier) and only allows users to run jobs.
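The Azure rates above reduce to a simple lookup and multiplication. The following is a minimal sketch, not an official calculator: the function name `azure_dbu_charge` and its parameters are hypothetical, the rates are copied from the paragraph above, and the result covers DBU charges only, not any other cost.

```python
# USD per DBU-hour, keyed by (workload, tier); figures copied from the text above.
AZURE_DBU_RATES = {
    ("data_analytics", "standard"): 0.40,
    ("data_analytics", "premium"): 0.55,
    ("data_engineering", "standard"): 0.15,
    ("data_engineering", "premium"): 0.30,
    ("data_engineering_light", "standard"): 0.07,
    ("data_engineering_light", "premium"): 0.22,
}


def azure_dbu_charge(workload: str, tier: str, dbus_per_hour: float, hours: float) -> float:
    """Return the DBU charge in USD (hypothetical helper; DBU fees only)."""
    rate = AZURE_DBU_RATES[(workload, tier)]
    return round(rate * dbus_per_hour * hours, 2)


if __name__ == "__main__":
    # Assumed workload for illustration: 10 DBUs per hour for 5 hours.
    for tier in ("standard", "premium"):
        print(tier, azure_dbu_charge("data_engineering", tier, 10, 5))
```

Keying the table by workload and tier keeps each published rate in one place, so a price change means editing one entry.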
Databricks on AWS is also priced by service tier (Standard, Premium, or Enterprise) and workload type. Higher service tiers add Optimized Autoscaling, role-based access, federated IAM, HIPAA-compliant storage, access lists for audit, and customer-managed keys. The Jobs Compute workload allows users to run data engineering pipelines and manage and clean data lakes (priced at $0.07, $0.10, and $0.13 by service tier). The All-Purpose Compute workload ($0.40, $0.55, and $0.65) is fully featured.
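To see what a tier upgrade costs for the same amount of work, the AWS figures can be compared side by side. This sketch makes two assumptions that the text does not state outright: that the three prices map in order to Standard, Premium, and Enterprise, and that they are charged per DBU hour as on Azure. The name `compare_aws_tiers` is hypothetical.

```python
AWS_TIERS = ("standard", "premium", "enterprise")

# Assumed to be USD per DBU-hour, listed in the same order as AWS_TIERS.
AWS_DBU_RATES = {
    "jobs_compute": (0.07, 0.10, 0.13),
    "all_purpose_compute": (0.40, 0.55, 0.65),
}


def compare_aws_tiers(workload: str, dbu_hours: float) -> dict:
    """Return the charge in USD per tier for a given number of DBU-hours."""
    rates = AWS_DBU_RATES[workload]
    return {tier: round(rate * dbu_hours, 2) for tier, rate in zip(AWS_TIERS, rates)}


if __name__ == "__main__":
    # Assumed usage for illustration: 100 DBU-hours of each workload.
    for workload in AWS_DBU_RATES:
        print(workload, compare_aws_tiers(workload, 100))
```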
Azure HDInsight clusters are billed on a per-minute basis; each cluster runs a group of nodes that varies by component. Processing with Hadoop, Spark, Interactive Query, Kafka, Storm, and HBase does not incur a component cost (Kafka requires managed disks, however), while HDInsight Machine Learning Services incurs a cost of $0.016 per core-hour, and adding the Enterprise Security Package incurs a cost of $0.01 per core-hour. On Azure Virtual Machines, general-purpose nodes for HDInsight cost from $0.06 per hour (1 CPU, 2 GB RAM) to $0.631 per hour (8 CPUs, 64 GB RAM). Memory-optimized nodes are available at an incrementally higher rate ($0.184/hour to $5.415/hour for an instance with 64 vCPUs and 432 GB RAM), as are compute-optimized nodes ($0.295/hour to $1.179/hour for an instance with 16 CPUs and 32 GB RAM). Dev/test discounts are available for users on a Visual Studio subscription plan.
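The HDInsight figures combine an hourly node rate with optional per-core-hour surcharges. The sketch below estimates a cluster bill under two labeled assumptions: that per-minute billing prorates the hourly rate linearly, and that a surcharge such as the Enterprise Security Package applies to every core in the cluster. The function `hdinsight_cluster_cost` and its parameters are hypothetical.

```python
def hdinsight_cluster_cost(
    nodes: int,
    node_rate_per_hour: float,
    cores_per_node: int,
    minutes: float,
    surcharge_per_core_hour: float = 0.0,
) -> float:
    """Estimate a cluster bill in USD (hypothetical helper).

    Assumes linear per-minute proration of the hourly node rate and a
    surcharge charged on every core of every node.
    """
    hours = minutes / 60
    node_cost = nodes * node_rate_per_hour * hours
    surcharge = nodes * cores_per_node * surcharge_per_core_hour * hours
    return node_cost + surcharge


if __name__ == "__main__":
    # Assumed cluster for illustration: four 8-CPU general-purpose nodes at
    # $0.631/hour, running 90 minutes with the $0.01 per core-hour
    # Enterprise Security Package.
    total = hdinsight_cluster_cost(4, 0.631, 8, 90, surcharge_per_core_hour=0.01)
    print(f"${total:.2f}")
```

Storage, managed disks, and dev/test discounts are left out of this estimate.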
