Definition

Amazon EMR (Elastic MapReduce)

By TechTarget Contributor

What is Amazon EMR?

Amazon EMR (previously known as Amazon Elastic MapReduce) is an Amazon Web Services (AWS) tool for big data processing and analysis. Amazon markets EMR as an expandable, low-configuration service that provides an alternative to running on-premises cluster computing.

Amazon EMR is based on Apache Hadoop, a Java-based programming framework that supports the processing of large data sets in a distributed computing environment. Using MapReduce, a core component of the Hadoop software framework, developers can write programs that process massive amounts of unstructured data in parallel across a distributed cluster of processors or standalone computers. The MapReduce model was developed by Google for indexing webpages and replaced the company's original indexing algorithms and heuristics in 2004.
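The MapReduce model described above can be illustrated with a small word count. The following is a minimal, single-machine sketch in plain Python, not Hadoop or EMR code; the function names (`map_phase`, `shuffle_phase`, `reduce_phase`) are illustrative assumptions. On a real cluster, the map and reduce steps would run in parallel on many nodes.

```python
from collections import defaultdict


def map_phase(document):
    """Map: emit a (word, 1) pair for every word in one document."""
    return [(word.lower(), 1) for word in document.split()]


def shuffle_phase(pairs):
    """Shuffle: group all emitted values by their key."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(grouped):
    """Reduce: collapse each key's values into a single total."""
    return {key: sum(values) for key, values in grouped.items()}


def word_count(documents):
    """Run map, shuffle and reduce in sequence over all documents."""
    pairs = []
    for document in documents:
        pairs.extend(map_phase(document))
    return reduce_phase(shuffle_phase(pairs))


counts = word_count(["big data on EMR", "big clusters process big data"])
print(counts["big"], counts["data"])
```

Because each document is mapped independently and each key is reduced independently, both phases can be spread across many machines, which is the property Hadoop relies on.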

Amazon EMR processes big data across a Hadoop cluster of virtual servers running on Amazon Elastic Compute Cloud (EC2), with data stored in Amazon Simple Storage Service (S3). The Elastic in EMR's name refers to its dynamic resizing ability, which enables administrators to increase or reduce resources, depending on their current needs.
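To make the idea of dynamic resizing concrete, here is a toy sizing rule in Python. It is a sketch under stated assumptions, not how EMR decides cluster size: the function name `target_node_count`, the notion of "pending tasks" and the per-node capacity are all hypothetical values chosen for illustration.

```python
import math


def target_node_count(pending_tasks, tasks_per_node, min_nodes, max_nodes):
    """Return a node count sized to the workload, clamped to the allowed range.

    All parameters are hypothetical; a real cluster would derive them from
    its own metrics and scaling limits.
    """
    if tasks_per_node <= 0:
        raise ValueError("tasks_per_node must be positive")
    needed = math.ceil(pending_tasks / tasks_per_node)
    return max(min_nodes, min(max_nodes, needed))


# Scale up for a busy period, back down when the queue is empty.
busy = target_node_count(pending_tasks=95, tasks_per_node=10, min_nodes=2, max_nodes=20)
idle = target_node_count(pending_tasks=0, tasks_per_node=10, min_nodes=2, max_nodes=20)
print(busy, idle)
```

The lower and upper bounds keep a minimum cluster available while capping spend during demand spikes.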

Amazon EMR is used for log analysis, web indexing, data warehousing, machine learning (ML), financial analysis, scientific simulation and bioinformatics. It also supports workloads based on Apache Spark, Apache Hive, Presto and Apache HBase -- the last of which integrates with Hive and Pig, which are open source data warehouse tools for Hadoop. Hive queries and analyzes data, and Pig offers a high-level mechanism for programming MapReduce jobs to be executed in Hadoop.

Amazon EMR use cases

There are several ways enterprises can use Amazon EMR, including:

Machine learning. EMR's built-in ML tools use the Hadoop framework to run a variety of algorithms that support decision-making, including decision trees, random forests, support-vector machines and logistic regression.
Extract, transform and load. ETL is the process of moving data from one or more data stores to another. Data transformations -- such as sorting, aggregating and joining -- can be done using EMR.
Clickstream analysis. Clickstream data from Amazon S3 can be analyzed with Apache Spark and Apache Hive. Apache Spark is an open source data processing tool that can help make data easy to manage and analyze. Spark uses a framework that enables jobs to run across large clusters of computers and can process data in parallel. Apache Hive is a data warehouse infrastructure built on top of Hadoop that provides tools for working with data that Spark can analyze. Clickstream analysis can help organizations understand customer behaviors, find ways to improve a website layout, discover which keywords people are using in search engines and see which word combinations lead to sales.
Real-time streaming. Users can analyze events using streaming data sources in real time with Apache Spark Streaming and Apache Flink. This enables streaming data pipelines to be created on EMR.
Interactive analytics. EMR Notebooks is a managed service that provides a secure, scalable and reliable environment for data analytics. Using Jupyter Notebook -- an open source web application data scientists can use to create and share live code and equations -- data can be prepared and visualized to perform interactive analytics.
Genomics. Organizations can use EMR to process genomic data to make data processing and analysis scalable for industries including medicine and telecommunications.
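The ETL transformations mentioned above (joining, aggregating and sorting) can be sketched on a tiny clickstream sample. This is a plain-Python illustration with made-up records and hypothetical field names (`user`, `page_id`); on EMR, the same steps would more typically be expressed in Spark or Hive over data held in S3.

```python
from collections import Counter

# Hypothetical sample data: raw click events and a page lookup table.
clicks = [
    {"user": "u1", "page_id": 1},
    {"user": "u2", "page_id": 2},
    {"user": "u1", "page_id": 2},
    {"user": "u3", "page_id": 2},
    {"user": "u2", "page_id": 3},
]
pages = {1: "home", 2: "pricing", 3: "docs"}


def join_clicks_with_pages(clicks, pages):
    """Join: attach the page name to each click, dropping unknown page IDs."""
    return [
        {**click, "page": pages[click["page_id"]]}
        for click in clicks
        if click["page_id"] in pages
    ]


def aggregate_views(joined):
    """Aggregate: count views per page name."""
    return Counter(row["page"] for row in joined)


def sort_by_views(view_counts):
    """Sort: most-viewed pages first, ties broken alphabetically."""
    return sorted(view_counts.items(), key=lambda item: (-item[1], item[0]))


report = sort_by_views(aggregate_views(join_clicks_with_pages(clicks, pages)))
print(report)
```

Each step takes the previous step's output as input, which mirrors how transformation stages are chained in a larger pipeline.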

Amazon EMR deployment options

As a cloud service, Amazon EMR can be deployed in a variety of settings, such as:

Amazon EMR on Amazon EC2. Amazon EMR can quickly process large amounts of data using Amazon EC2. Users can configure Amazon EMR to take advantage of On-Demand, Reserved and Spot Instances.
Amazon EMR on Amazon Elastic Kubernetes Service (EKS). The Amazon EMR console enables users to run Apache Spark applications with other applications on the same EKS cluster. Organizations can share compute and memory resources across all applications and use a Kubernetes tool to monitor and manage the infrastructure.
Amazon EMR on AWS Outposts. AWS Outposts enables organizations to run EMR in their own data centers. This makes it easier to set up, deploy, manage and scale EMR in on-premises environments.

Amazon EMR features

Amazon EMR's features are designed to make cluster computing easier and more convenient for administrators and developers. They include the following:

EMR Studio. This integrated development environment helps developers write code and is designed to be an efficient, easy way to build and test applications. EMR Studio consists of a source-code editor, build-automation tools and a debugger.
Cost. A 10-node EMR cluster can be launched for as little as $0.15 per hour. Organizations pay only for the time their cluster runs. They can further control costs by setting up EMR clusters with Spot Instances, which run on spare EC2 capacity at a discounted price.
Elasticity. EMR separates compute and storage for individual scaling and to benefit from the tiered storage of Amazon S3. Instances can process data at any scale and are automatically provisioned, managed and monitored. With AWS Auto Scaling, users can increase or decrease the number of instances depending on use.
Reliability. Amazon EMR monitors clusters to ensure optimal resource use. It uses the Amazon CloudWatch service to collect and interpret metrics. Amazon EMR can monitor the health of a cluster, as well as its utilization and performance, and help to identify problematic nodes or jobs. It also offers a load balancer service, which helps direct traffic automatically to healthy nodes.
Security. Amazon EMR includes security features, such as automatically configuring EC2 firewalls to allow only necessary network traffic to the instances. Clusters are launched in an Amazon Virtual Private Cloud. Data can be protected with server-side or client-side encryption. AWS Lake Formation or Apache Ranger can be used to manage data access controls for databases.
Flexibility. Amazon EMR enables users to customize clusters and install third-party software packages using scripts. Users can also reconfigure applications without relaunching the clusters.
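As a rough illustration of the pay-for-runtime model described under Cost, the arithmetic can be sketched as follows. The $0.15 hourly figure is the one quoted above for a 10-node cluster; the function name `cluster_cost` and the two workload scenarios are assumptions for illustration, and actual charges depend on instance types, Region and pricing model.

```python
def cluster_cost(hourly_rate_usd, hours_running):
    """Estimate cost as rate x runtime, rounded to whole cents."""
    if hourly_rate_usd < 0 or hours_running < 0:
        raise ValueError("rate and hours must be non-negative")
    return round(hourly_rate_usd * hours_running, 2)


# Hypothetical scenarios using the quoted $0.15 per hour rate.
nightly_job = cluster_cost(0.15, 3)       # one three-hour run
always_on = cluster_cost(0.15, 24 * 30)   # left running for 30 days
print(nightly_job, always_on)
```

The gap between the two figures shows why shutting a cluster down when a job finishes matters under time-based billing.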

This was last updated in August 2021
