
Amazon EMR vs. Databricks

October 30th, 2020

Databricks and Amazon EMR (Elastic MapReduce) are solutions for processing big data workloads. Both tend to be deployed at larger enterprises. Databricks handles data ingestion, data pipeline engineering, and ML/data science with its collaborative notebook for writing in R, Python, etc. Amazon EMR allows users to rely on multiple open-source tools, such as Apache Spark, Apache Hive, HBase, or Presto, to integrate and process big data workloads more simply.

Features

Databricks and Amazon EMR boast distinct advantages for processing big data workloads. 

Amazon EMR/Elastic MapReduce is described as ideal when managing big data housed in multiple open-source tools such as Apache Hadoop or Spark. Users state that relative to other big data processing tools it is simple to use, and AWS pricing is very simple and appealing compared to competitors. It is secure, scalable, and highly available for a cloud service.

Databricks is praised for its core competencies: users consider its data science notebook better than alternatives (e.g. Jupyter Notebook) for flexible, fast analysis of massive amounts of data while switching between SQL, R, Scala, and Python. Its open-source community documentation, available to all, is well regarded. And because the Databricks Community Edition is free and open-source, it is one of the relatively few options that can cost less than Amazon EMR, at least for the right users and use cases.

Limitations

Users remark on similar limitations when considering Databricks and Amazon EMR for big data.

Amazon EMR is not a fast processor and shines primarily where users need a simplified framework for managing data from multiple tools. Also, particularly when compared to Databricks, the Amazon workbook and its machine learning capabilities are not as mature.

The licensed edition of Databricks is costly, as is its certification cost. Additionally, Databricks can be hard to use for non-technical users, who say its in-app help is unclear and hard to use. And a few say Databricks lacks good visualizations for displaying work.

Pricing

Databricks is available open-source and free via its community edition, or through its Enterprise Cloud editions, on Azure or AWS. Pricing can be complex.

Azure Databricks is billed in “Databricks Units” (DBUs), priced by workload type (Data Engineering, Data Engineering Light, or Data Analytics) and service tier (Standard or Premium). Premium adds authentication, access features, and audit logs. The Data Analytics workload is $0.40 per DBU hour ($0.55 on the Premium tier) and includes data prep and the data science notebook. The Data Engineering workload includes data pipeline and workload processing for $0.15 per DBU hour ($0.30 on the Premium tier). Data Engineering Light is $0.07 per DBU hour ($0.22 on the Premium tier) and only allows users to run jobs.

Databricks on AWS is also priced by service tier (Standard, Premium, Enterprise) and workload type. Higher service tiers add Optimized Autoscaling, role-based access, federated IAM, HIPAA-compliant storage, access lists for audit, and customer-managed keys. The Jobs Compute workload allows users to run data engineering pipelines and manage and clean data lakes (priced at $0.07, $0.10, and $0.13 for the Standard, Premium, and Enterprise tiers, respectively). The All-Purpose Compute workload ($0.40, $0.55, $0.65) is fully featured.
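To make the tier and workload pricing concrete, here is a minimal sketch of the arithmetic in Python. The rates are the ones quoted above and may be out of date. The function name `dbu_cost` and the `dbus_per_hour` parameter are hypothetical, and the number of DBUs a cluster actually consumes per hour is an assumption you would need to supply. The sketch covers the Databricks charge only, not the underlying cloud compute.

```python
# Rates in USD per DBU hour, as quoted in this article (2020); treat as illustrative.
AWS_DBU_RATES = {
    "jobs_compute": {"standard": 0.07, "premium": 0.10, "enterprise": 0.13},
    "all_purpose_compute": {"standard": 0.40, "premium": 0.55, "enterprise": 0.65},
}


def dbu_cost(workload, tier, dbus_per_hour, hours):
    """Estimate the Databricks charge for one cluster.

    dbus_per_hour is a hypothetical input: how many DBUs the cluster
    consumes each hour, which depends on the instances it runs on.
    """
    rate = AWS_DBU_RATES[workload][tier]
    return round(rate * dbus_per_hour * hours, 2)


if __name__ == "__main__":
    # A nightly pipeline consuming an assumed 10 DBUs/hour for 8 hours, Premium tier.
    print(dbu_cost("jobs_compute", "premium", 10, 8))
    # The same usage on All-Purpose Compute, for comparison.
    print(dbu_cost("all_purpose_compute", "premium", 10, 8))
```

Running the same usage through both workloads shows why scheduled jobs are usually placed on Jobs Compute rather than All-Purpose Compute.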

Amazon EMR is available from AWS and is billed at a per-second rate, with a one-minute minimum. Its hourly rate depends on instance type (e.g. standard, high CPU, high memory, high storage), with prices at the time of writing ranging from $0.011/hour to $0.27/hour. Amazon EMR is also available as an add-on service for Amazon EC2, and is available reserved, on-demand, or on lower-cost Spot Instances (i.e. AWS’s discounted service using EC2’s unused capacity). In each case, EMR pricing still falls within the range of $0.011 to $0.27 per hour.
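The per-second billing with a one-minute minimum can be sketched as follows. This is an illustration under stated assumptions rather than AWS's billing logic: the function name `emr_charge` is hypothetical, the hourly rates are the endpoints of the range quoted above, and the result covers the EMR rate only, not the EC2 instance it runs on.

```python
def emr_charge(hourly_rate, seconds_used, minimum_seconds=60):
    """Estimate the EMR charge for one instance, billed per second.

    Any usage shorter than minimum_seconds is billed as minimum_seconds.
    """
    if seconds_used <= 0:
        return 0.0  # assumption: an instance that never ran incurs no charge
    billable_seconds = max(seconds_used, minimum_seconds)
    return hourly_rate / 3600 * billable_seconds


if __name__ == "__main__":
    # 30 seconds at the top quoted rate is billed as a full minute.
    print(f"{emr_charge(0.27, 30):.6f}")
    # Two hours at the lowest quoted rate.
    print(f"{emr_charge(0.011, 2 * 3600):.6f}")
```

The minimum only matters for very short-lived clusters; beyond the first minute the charge scales linearly with seconds used.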
