What you need to know about Cloudera vs. AWS for big data

Enterprises in need of a big data platform must run some analytics of their own to choose a vendor. AWS' integration between services can't be beat, but is Cloudera a better choice?

When it merged with fellow big data management vendor Hortonworks in January 2019, Cloudera Inc. gained a better chance to compete with cloud providers' Hadoop offerings -- setting up an AWS faceoff.

The upcoming Cloudera Data Platform (CDP) will be an open source, cloud-hosted big data offering meant to challenge Amazon Elastic MapReduce (EMR) -- AWS' Hadoop service -- and other cloud-oriented big data analytics applications also built on Hadoop. CDP does not have a release date yet.

Cloudera also partnered with IBM in June 2019 to collaborate on big data and AI offerings and resell each other's services: Cloudera Enterprise Data Hub and DataFlow as well as IBM Watson Studio and Big SQL.

Let's take a look at what this Cloudera and IBM partnership might mean for users with big data workloads on the cloud and how CDP changes the contest of Cloudera vs. Amazon EMR.

What IBM brings to Cloudera

The Cloudera and IBM partnership is first a reaffirmation of the Hortonworks-IBM partnership prior to the Cloudera merger, said Dave Mariani, founder and chief strategy officer at data warehouse virtualization provider AtScale.

Before they merged, Cloudera and Hortonworks focused on the Hadoop file system and tools for large data lakes. With these capabilities, enterprises could save all their data in one place and repurpose it for various analytics and AI purposes. In practice, though, enterprises have struggled with Hadoop performance problems, and as a result, many enterprises have turned to cloud providers to outfit their data management fabric.

Post-merger, Cloudera's partnership with IBM could help enterprise customers address Hadoop performance problems through IBM's extensive service and support organization and partnerships. In contrast, AWS provides a comprehensive set of tools for automating many aspects of big data deployments and is an attractive choice for companies with AWS development and deployment skills.

Cloudera vs. Amazon EMR

The Cloudera and IBM partnership and CDP offering should be most attractive to companies entering the early stages of a big data analytics strategy that have data and applications spread across on-premises and cloud environments. It is not likely to draw companies with a substantial AWS presence and skill set.

In partnering with IBM, Cloudera has tied itself to IBM's hybrid and multi-cloud agenda. Therefore, Cloudera and IBM should be the best fit for enterprises with a hybrid cloud data strategy, Mariani said. IBM asserts that a hybrid or multi-cloud approach is more realistic than locking in to one provider, he said.

IBM's approach to supporting modern app development is to use Kubernetes and containers so that workloads can run anywhere: on premises, private cloud or public cloud. AWS, on the other hand, wants all workloads to run only on its cloud.

While multi-cloud may be a viable approach, Mariani does not expect many enterprises to go that route soon. The cloud users he speaks with are all-in on their chosen public cloud vendor and contract a secondary vendor for backup only. The main benefit customers see in AWS and other vendors isn't easy access to servers, but the tightly integrated services and tools that take enterprise IT out of the systems integration business, he said.

For example, Amazon EMR uses S3 for storage and integrates with the AWS Glue data catalog and the Redshift data warehouse. AWS' strengths come from API integrations, availability and scale in terms of geographic regions, and interoperability across its range of services. These native tie-ins put third-party technologies such as Cloudera at a disadvantage to EMR, especially if data platform buyers are trained and certified on AWS operations and management.


Cloudera wins vs. AWS, though, when organizations seek high-end service, support, implementation, security and compliance for the data platform, said Marty Puranik, president and CEO of Atlantic.net, a hosting provider.

Cloudera Data Platform will have security, governance and metadata baked into the exchange fabric between data sources and analytics workloads when it launches. Cloudera has created the Shared Data Experience connection fabric, called SDX, that manages and automates these processes. To build security into Amazon EMR, developers must set up the encryption between their apps.
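As a rough illustration of the encryption setup left to EMR users, the sketch below builds an EMR security configuration document that turns on encryption in transit and at rest. The function name, the S3 certificate location and the KMS key ARN are hypothetical placeholders, and the field names follow the EMR security configuration JSON format as commonly documented; verify them against current AWS documentation before relying on them.

```python
import json


def build_emr_security_configuration(cert_s3_uri, kms_key_arn):
    """Return an EMR security configuration as a JSON string.

    cert_s3_uri and kms_key_arn are caller-supplied placeholders;
    nothing here contacts AWS.
    """
    config = {
        "EncryptionConfiguration": {
            # Encrypt traffic between cluster nodes with TLS certificates.
            "EnableInTransitEncryption": True,
            "InTransitEncryptionConfiguration": {
                "TLSCertificateConfiguration": {
                    "CertificateProviderType": "PEM",
                    "S3Object": cert_s3_uri,
                }
            },
            # Encrypt data written to S3 and to local cluster disks.
            "EnableAtRestEncryption": True,
            "AtRestEncryptionConfiguration": {
                "S3EncryptionConfiguration": {"EncryptionMode": "SSE-S3"},
                "LocalDiskEncryptionConfiguration": {
                    "EncryptionKeyProviderType": "AwsKms",
                    "AwsKmsKey": kms_key_arn,
                },
            },
        }
    }
    return json.dumps(config, indent=2)


if __name__ == "__main__":
    # Hypothetical values for illustration only.
    print(
        build_emr_security_configuration(
            "s3://example-bucket/certs/my-certs.zip",
            "arn:aws:kms:us-east-1:111122223333:key/example-key-id",
        )
    )
```

The resulting string could then be registered with the boto3 EMR client's create_security_configuration call (passed as the SecurityConfiguration argument) and referenced by name when a cluster is launched.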

One valuable capability on the AWS side is its support for Jupyter-based EMR notebooks, which work easily across AWS products such as S3, DynamoDB and Redshift. CDP often involves more work connecting Jupyter-based notebooks to these services. Jupyter notebooks are useful for data visualization, cleaning, modeling and other tasks. The sharable documents can contain live code, equations, visualizations and narrative text.

Implementation and cost

The ultimate costs of Cloudera vs. AWS for big data management come down to implementation, compliance, security and performance. AWS caters to enterprises with in-house expertise and cloud centers of excellence, whereas Cloudera and IBM offer more guidance through professional services.

"AWS will have a lower sticker price, but could end up being much more if you don't know what you're doing," Puranik said.

For example, developers can incur significant egress charges if they send out more data from the cloud than required for a particular workload. Another big problem could come from a misconfiguration issue, such as the misconfigured web application firewall that exposed data stored in S3 in the 2019 Capital One breach.
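The egress point can be made concrete with a back-of-the-envelope estimate. The sketch below is illustrative only: the function name and the default rate of $0.09 per GB are assumptions, not quoted prices, and actual data transfer pricing varies by provider, region and volume tier.

```python
def estimate_egress_cost(gb_transferred, rate_per_gb=0.09, free_gb=0.0):
    """Estimate a data egress charge in dollars.

    rate_per_gb and free_gb are assumed placeholder values; substitute
    the figures from the provider's current price list.
    """
    if gb_transferred < 0:
        raise ValueError("gb_transferred must be non-negative")
    # Only the volume above any free allowance is billable.
    billable_gb = max(gb_transferred - free_gb, 0.0)
    return round(billable_gb * rate_per_gb, 2)


# Compare a workload that exports only what it needs with one that
# exports ten times as much data (hypothetical volumes).
needed_cost = estimate_egress_cost(500)
actual_cost = estimate_egress_cost(5000)
overspend = round(actual_cost - needed_cost, 2)

if __name__ == "__main__":
    print(f"needed: ${needed_cost}, actual: ${actual_cost}, overspend: ${overspend}")
```

Even at a flat assumed rate, the charge scales linearly with the volume sent out, so trimming exports to what the workload requires is the main lever.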

If users aren't all-in on one particular public cloud yet, or are unsure of what they need, they should look at the Cloudera and IBM offering first, even if the upfront cost is higher than Amazon EMR. To truly understand whether one data platform or the other is a fit, use trial workloads.

"Start with smaller projects, if possible, and see which one fits your organization best," Puranik said.


What will decide your Cloudera vs. AWS big data management decision?
There are so many mistakes in this article, I actually don't know where to start:
- Why are you putting Cloudera's technology so close to IBM? There are actually no ties at all and IBM is just another reseller.
- Clients are using another cloud provider for backup only? How should that work? If you're all into AWS, none of your services run on another cloud while Cloudera's services do.
- Tight integration of AWS services? They are not even natively integrated into S3, as long as services like Redshift are creating their own copies of data stored in S3. Glue is useful, but you have to do most integration work by yourself.
- IBM providing services to overcome performance problems? Not sure how you got the impression of IBM providing deep Hadoop knowledge. I would rather mention other SIs here.
- AWS support for Jupyter notebooks is not a big deal. Cloudera's Data Science Workbench supports Jupyter as well as other DS tools as external editors ... on all platforms.
- Cloudera and Horton focused on HDFS pre-merger? Not true. Both companies supported S3, ADLS and GCP already natively before the merger with CDH and HDP.
CapitalOne did not leave an S3 bucket unsecured.