Amazon Web Services' Elastic MapReduce is a Hadoop implementation that allows you to run large pre-processing jobs, such as format transformation and data aggregation. And while a number of programming language choices are available to code these jobs, time-strapped developers want a programming framework that minimizes coding overhead. Mrjob, Dumbo and PyDoop are three Python-based Elastic MapReduce frameworks that fit the bill.
Why won't popular languages like Java or Apache Pig work? Amazon Elastic MapReduce (EMR) jobs are often written in Java, but even simple programs can require many more lines of code than a comparable script written in Python. Pig is a high-level data processing language designed for loading and transforming data, but it's not a general-purpose language.
Developers who prefer to work at a higher level than Java but need more than Pig's data management features should look to Python. Three Python-based EMR frameworks are available: mrjob, Dumbo and PyDoop.
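To see how little code a Python MapReduce job needs, here is a minimal word-count sketch in the spirit of Hadoop streaming. The function names and the local driver are illustrative stand-ins, not part of any framework; a real streaming job would read from standard input.

```python
from itertools import groupby
from operator import itemgetter

def mapper(line):
    """Emit a (word, 1) pair for every word on an input line."""
    for word in line.split():
        yield word.lower(), 1

def reducer(word, counts):
    """Sum the counts emitted for a single word."""
    yield word, sum(counts)

def run_local(lines):
    """Illustrative driver: simulate the shuffle-and-sort step Hadoop
    performs between the map and reduce phases."""
    pairs = sorted(kv for line in lines for kv in mapper(line))
    return [out
            for word, group in groupby(pairs, key=itemgetter(0))
            for out in reducer(word, (c for _, c in group))]

print(run_local(["to be or not to be"]))
# [('be', 2), ('not', 1), ('or', 1), ('to', 2)]
```

The mapper and reducer together are under ten lines; the comparable Java job needs a class per phase plus driver boilerplate.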
Going open source with mrjob
Mrjob is an open-source package for running jobs on Amazon EMR or locally on your own machine. An Elastic MapReduce job is defined in a single Python class that contains methods for the mapper, reducer and combiner. Many of Hadoop's low-level details are hidden behind mrjob abstractions, which can be beneficial: the simplified model lets developers focus on the logic of the map and reduce functionality. It also means, however, that you're limited to a subset of the Hadoop API. Dumbo or PyDoop may be a better option if you need fuller access to that API.
A key advantage of mrjob is that it does not require a Hadoop installation. Developers can write, test and debug Elastic MapReduce programs on a single machine with nothing more than Python, mrjob and their program's other Python dependencies. Once a program is ready, it can be ported to EMR, where the same code runs on a Hadoop cluster without changes. Yelp, which hosts 57 million reviews and draws more than 130 million visitors per month, developed mrjob and still uses it, so the framework has been proven at a scale that should meet the needs of most Hadoop users.
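As a rough sketch of the single-class structure mrjob encourages, the following stand-alone Python mimics a job with mapper, combiner and reducer methods. The class and the tiny driver are hypothetical stand-ins so the example runs without the library; they are not the real MRJob base class or its local runner.

```python
from collections import defaultdict

class WordCount:
    """Sketch of the mrjob style: one class holding the mapper,
    combiner and reducer (not the real MRJob base class)."""

    def mapper(self, _, line):
        for word in line.split():
            yield word.lower(), 1

    def combiner(self, word, counts):
        # Runs on map output before the shuffle to cut data volume.
        yield word, sum(counts)

    def reducer(self, word, counts):
        yield word, sum(counts)

def run_inline(job, lines):
    """Illustrative driver playing the role of mrjob's local runner."""
    grouped = defaultdict(list)
    for line in lines:
        for word, count in job.mapper(None, line):
            for w, c in job.combiner(word, [count]):
                grouped[w].append(c)
    return {w: next(job.reducer(w, cs))[1] for w, cs in sorted(grouped.items())}

print(run_inline(WordCount(), ["a b a", "b a"]))  # {'a': 3, 'b': 2}
```

In mrjob itself, the same class runs unchanged against the inline runner for local testing or against an EMR cluster, selected at launch time.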
Job processing with Dumbo
Dumbo is another Python framework that supports EMR. As with mrjob, you implement Elastic MapReduce jobs by writing mappers and reducers. Beyond the basics found in mrjob, Dumbo offers more job-processing options: its job class lets developers define multiple sets of map and reduce operations that run with a single command, which is useful when a program makes multiple passes over a data set.
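The multi-pass idea can be sketched in plain Python: a driver feeds each pass's output into the next, which is roughly what Dumbo arranges when you register several map/reduce pairs on one job. The helper names below are illustrative, not Dumbo's API.

```python
from itertools import groupby
from operator import itemgetter

def run_pass(mapper, reducer, records):
    """Run one map/shuffle/reduce pass over (key, value) records."""
    pairs = sorted((kv for rec in records for kv in mapper(*rec)),
                   key=itemgetter(0))
    return [out for k, grp in groupby(pairs, key=itemgetter(0))
            for out in reducer(k, [v for _, v in grp])]

def run_job(iterations, records):
    """Chain several passes: each pass consumes the previous pass's
    output, as a multi-iteration Dumbo job does."""
    for mapper, reducer in iterations:
        records = run_pass(mapper, reducer, records)
    return records

# Pass 1: word count. Pass 2: regroup counts under one key to find the max.
count = (lambda _, line: ((w, 1) for w in line.split()),
         lambda w, ones: [(w, sum(ones))])
top   = (lambda w, n: [("max", (n, w))],
         lambda _, pairs: [max(pairs)])

print(run_job([count, top], [(None, "a b a b a")]))  # [(3, 'a')]
```

Without this chaining, each pass would be a separate job submission with intermediate results written out and read back by hand.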
Dumbo also supports text and sequence file formats as well as custom formats using a user-specified Java class. On the downside, documentation for Dumbo is sparse, especially compared to mrjob's documentation.
Dumbo follows the MapReduce paradigm, so coding core components in this framework is similar to how it's done in mrjob and PyDoop. Dumbo also lets you implement combiners, which work like reducers except that they run locally on a map task's output. Because they fold records together before the shuffle, they can reduce the volume of data transferred between the map and reduce phases.
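A plain-Python sketch shows the effect, assuming a simple word count: folding the map output locally shrinks seven raw records to four distinct-word totals before anything would cross the network.

```python
from collections import Counter

def map_task(lines):
    """Raw map output: one (word, 1) record per word occurrence."""
    return [(w, 1) for line in lines for w in line.split()]

def combine(map_output):
    """Locally fold the map output before it crosses the network,
    the way a combiner (local reducer) would."""
    totals = Counter()
    for word, count in map_output:
        totals[word] += count
    return sorted(totals.items())

raw = map_task(["the cat and the hat", "the cat"])
combined = combine(raw)
print(len(raw), len(combined))  # 7 4
```

On real data sets the savings scale with key repetition: the more often a key recurs within one map task, the fewer records reach the shuffle.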
With Dumbo, developers can also control Hadoop parameters from the command line when launching jobs. Hadoop uses plain text by default, but you can process other formats by specifying a custom RecordReader class. This class should include initialization, next, close and getProgress methods.
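As a sketch of the shape such a reader takes, the following plain-Python class implements the four listed methods over an in-memory list of lines. A production RecordReader would be a Java class wired into Hadoop's input pipeline, so this is only an illustration of the interface.

```python
class LineRecordReader:
    """Illustrative record reader with the four methods named above;
    a real custom reader would be a Java class supplied to Hadoop."""

    def initialize(self, lines):
        self._lines = lines
        self._pos = 0

    def next(self):
        """Return the next (key, value) pair, or None at end of input."""
        if self._pos >= len(self._lines):
            return None
        record = (self._pos, self._lines[self._pos])  # key = line number
        self._pos += 1
        return record

    def getProgress(self):
        """Fraction of the input consumed, between 0.0 and 1.0."""
        return self._pos / max(len(self._lines), 1)

    def close(self):
        self._lines = []

reader = LineRecordReader()
reader.initialize(["first", "second"])
print(reader.next())         # (0, 'first')
print(reader.getProgress())  # 0.5
reader.close()
```

Hadoop calls these methods in the same order: initialize once per input split, next repeatedly until it signals end of input, getProgress for status reporting, and close at the end.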
Dumbo also supports the Hadoop file system API, which connects to an HDFS installation to read and write files and to retrieve metadata about files, directories and the file system itself. In cases where you need low-level access to the file system, Dumbo can help because it exposes the same set of functions as the HDFS API.
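To illustrate the kinds of operations that file system API covers, this sketch performs the equivalent calls against the local file system as a stand-in for HDFS; the comments note the analogous open, list and stat-style operations the HDFS API provides.

```python
import os
import tempfile

# Local file system standing in for HDFS; each call has an HDFS analogue.
workdir = tempfile.mkdtemp()
path = os.path.join(workdir, "part-00000")

with open(path, "w") as f:          # HDFS analogue: open a file for writing
    f.write("key\t42\n")

with open(path) as f:               # HDFS analogue: open a file for reading
    data = f.read()

listing = os.listdir(workdir)       # HDFS analogue: list a directory
size = os.stat(path).st_size        # HDFS analogue: file metadata (stat)

print(listing, size)
```

The point is that the same handful of primitives (open, read, write, list, stat) is all most jobs need for low-level file system work.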
Research package access with PyDoop
Python developers who need access to third-party libraries and packages might want to consider PyDoop. The CRS4 interdisciplinary research center developed this framework and also maintains it. One advantage of this is that you can count on access to popular Python research packages, such as SciPy.
Despite the benefits of the mrjob, Dumbo and PyDoop frameworks, they add execution overhead, so your jobs will likely run longer than if they were written in Java or run directly through Hadoop streaming. If keeping EMR costs low is a key consideration, benchmark a MapReduce job written for plain Python streaming against the same job written with one of these frameworks to get a sense of the additional run time.
Python frameworks for Hadoop are useful when you're writing several EMR jobs. These three frameworks work on Elastic MapReduce and can help avoid unnecessary and tedious Java coding. When you need more access to Hadoop internals, consider Dumbo or PyDoop.
About the author:
Dan Sullivan holds a Master of Science degree and is an author, systems architect and consultant with more than 20 years of IT experience. He has had engagements in advanced analytics, systems architecture, database design, enterprise security and business intelligence. He has worked in a broad range of industries, including financial services, manufacturing, pharmaceuticals, software development, government, retail and education. Dan has written extensively about topics that range from data warehousing, cloud computing and advanced analytics to security management, collaboration and text mining.