Manage cloud workflows with AWS Data Pipeline

AWS Data Pipeline streamlines the movement of data between AWS data stores, enabling admins to define, schedule and debug workflows in the AWS cloud.

Amazon Web Services is synonymous with interoperability, at least among its own products. Its cloud services were designed to be used with one another in a workflow. An application running on Elastic Compute Cloud instances, for example, writes data to Amazon S3; the data is then copied to Elastic MapReduce for analysis, and the results are stored and analyzed in Amazon Redshift. Even though specific services and tasks vary, all cloud workflows have common requirements.

Workflows have a specific sequence of steps and dependencies between those steps; a report cannot be run until its data is available, for example. Workflows typically run on a schedule and may need to process varying amounts of data, so the resources they use must scale up or down to match. Workflows should also be resilient to failure and allow failed jobs to restart.

AWS Data Pipeline meets these criteria. The service runs pipeline definitions that you create either in the AWS Management Console or through a command-line interface. The console provides a drag-and-drop interface similar to what you will find in many extract, transform and load tools (Figure 1).

Pipeline definitions are declarative specifications of a workflow. They include input and output specifications, tasks to perform, conditions to check before running a task, and scheduling information that explains when to run the pipeline definition. The AWS Data Pipeline service assigns tasks to Task Runners, which are responsible for reporting on task progress and reattempting failed tasks.

Pipeline inputs and outputs are specified as data nodes in a workflow. Currently, the Data Pipeline service supports S3, Redshift, MySQL and DynamoDB data nodes. The service includes several predefined activities, including copying data from one location to another, running an Elastic MapReduce (EMR) cluster, running Hive queries or Pig scripts on an EMR cluster, executing a SQL query on a database, and running a Linux shell command. If the predefined activities do not meet your needs, you can always write a custom script.
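To make the idea of a declarative definition concrete, here is a minimal sketch in Python that builds a definition with two S3 data nodes and a copy activity and prints it as JSON. The bucket paths and object ids are hypothetical, and the sketch deliberately omits the schedule, compute resource and role settings that a real pipeline would need, so treat it as an illustration of the shape rather than a definition you can activate as-is.

```python
import json


def build_copy_pipeline(input_path, output_path):
    """Return a definition that copies one S3 file to another location."""
    return {
        "objects": [
            # Data nodes describe where data lives (hypothetical paths).
            {"id": "InputData", "type": "S3DataNode", "filePath": input_path},
            {"id": "OutputData", "type": "S3DataNode", "filePath": output_path},
            # The activity refers to its input and output nodes by id.
            {
                "id": "CopyStep",
                "type": "CopyActivity",
                "input": {"ref": "InputData"},
                "output": {"ref": "OutputData"},
            },
        ]
    }


definition = build_copy_pipeline(
    "s3://example-bucket/in/data.csv",
    "s3://example-bucket/out/data.csv",
)
print(json.dumps(definition, indent=2))
```

Note that the definition says what should happen, not how: the activity only names its input and output, and the service decides when and where to run it.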

Figure 1. The AWS Data Pipeline interface lets users build workflows by dragging and dropping component icons and setting properties in forms.

Preconditions, scheduling and notifications in AWS Data Pipeline

The AWS Data Pipeline service provides four system-managed preconditions to help implement control flow logic in workflows. Three check for the existence of DynamoDB data (DynamoDBDataExists), DynamoDB tables (DynamoDBTableExists) and S3 keys (S3KeyExists). The fourth, S3PrefixNotEmpty, checks that an S3 prefix is not empty. Users of the service can also run their own script-based checks via the ShellCommandPrecondition.
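As a rough illustration of how a precondition gates a step, the sketch below attaches an S3 prefix check to a data node so that work depending on the node waits until the prefix holds data. The helper `with_precondition`, the ids and the bucket path are hypothetical, and the fragment leaves out other settings (such as roles) that a complete definition would carry.

```python
def with_precondition(component, precondition):
    """Return a copy of component that references the precondition by id."""
    gated = dict(component)
    gated["precondition"] = {"ref": precondition["id"]}
    return gated


# System-managed check: passes only when the prefix contains objects.
input_ready = {
    "id": "InputReady",
    "type": "S3PrefixNotEmpty",
    "s3Prefix": "s3://example-bucket/in/",  # hypothetical location
}

input_node = {
    "id": "InputData",
    "type": "S3DataNode",
    "directoryPath": "s3://example-bucket/in/",
}

gated_node = with_precondition(input_node, input_ready)
print(gated_node["precondition"])
```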

The scheduling component of the Data Pipeline service is simple to configure. You specify whether a pipeline runs once or on a regular schedule. If it runs on a particular schedule, you also specify the interval between runs (e.g., every hour), as well as a start and end date and time. To run the pipeline indefinitely, use "never" as an end date and time.
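The scheduling settings can be pictured with a small model in Python. This is a simplified sketch, not the service's actual scheduling logic: it assumes the first run happens at the start time and treats the end time as exclusive, and it uses `None` to stand in for a schedule that never ends.

```python
from datetime import datetime, timedelta
from itertools import islice


def run_times(start, period, end=None):
    """Yield run times from start, one per period, until end (exclusive).

    With end=None the generator never stops, modeling an open-ended schedule.
    """
    current = start
    while end is None or current < end:
        yield current
        current += period


start = datetime(2024, 1, 1, 0, 0)
hourly = timedelta(hours=1)

# Bounded schedule: three hourly runs before the 03:00 end time.
bounded = list(run_times(start, hourly, end=datetime(2024, 1, 1, 3, 0)))

# Open-ended schedule: take only the first three runs.
first_three = list(islice(run_times(start, hourly), 3))

print(bounded == first_three)
```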

You can configure pipeline components to issue notifications using Amazon Simple Notification Service (SNS) when component executions succeed, fail or are late.
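The sketch below shows one way such notifications might look in a definition: an alarm object that names an SNS topic, and a component that points at the alarm for each of the three events. The helper `attach_alarms`, the ids, the message text and the topic ARN are all hypothetical placeholders, and fields a full definition would need (such as a role) are omitted.

```python
def attach_alarms(component, on_success=None, on_fail=None, on_late=None):
    """Return a copy of component with alarm references for each event."""
    wired = dict(component)
    events = {"onSuccess": on_success, "onFail": on_fail, "onLateAction": on_late}
    for field, alarm in events.items():
        if alarm is not None:
            wired[field] = {"ref": alarm["id"]}
    return wired


failure_alarm = {
    "id": "FailureAlarm",
    "type": "SnsAlarm",
    # Placeholder ARN; substitute a topic from your own account.
    "topicArn": "arn:aws:sns:us-east-1:111122223333:example-topic",
    "subject": "Pipeline step failed",
    "message": "The copy step did not complete.",
}

copy_step = {"id": "CopyStep", "type": "CopyActivity"}
wired_step = attach_alarms(copy_step, on_fail=failure_alarm, on_late=failure_alarm)
print(sorted(wired_step))
```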

Data Pipeline snafus

Pipelines, like any program, may not always work as expected. When you need to troubleshoot a pipeline, the first place to start is the Data Pipeline service management console. From there you can drill down into a list of pipelines, select an instance of a pipeline execution and see the results of each attempted execution.

It's important to understand the different statuses reported for pipeline components. Some, such as CREATING, VALIDATING and RUNNING, are fairly obvious. The WAITING_ON_DEPENDENCIES status indicates that a precondition has not been met. TIMEDOUT means the component exceeded the time threshold specified for its execution. FAILED indicates a component did not work, while CASCADE_FAILED means a dependency of the component failed.
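The distinction between FAILED and CASCADE_FAILED matters when debugging, because a cascade failure points away from the component that reports it. As a sketch, assuming you have copied each component's status and dependencies into plain dictionaries (the component names here are hypothetical), the function below walks back from a CASCADE_FAILED component to the components that actually failed.

```python
def root_failures(component, depends_on, status):
    """Follow dependencies of a component to find those with status FAILED."""
    roots, seen, stack = set(), set(), [component]
    while stack:
        current = stack.pop()
        if current in seen:
            continue
        seen.add(current)
        if status.get(current) == "FAILED":
            roots.add(current)
        elif status.get(current) == "CASCADE_FAILED":
            # The real problem is upstream, so keep walking.
            stack.extend(depends_on.get(current, []))
    return roots


depends_on = {"Report": ["Load"], "Load": ["Extract"], "Extract": []}
status = {"Report": "CASCADE_FAILED", "Load": "CASCADE_FAILED", "Extract": "FAILED"}

print(root_failures("Report", depends_on, status))
```

Here the report and load steps never ran on their own merits; the logs worth reading first are those of the extract step.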

Log files are another important source of debugging information, so be sure to specify a log location when you create a pipeline.

Problems can occur when you haven't configured pipelines properly or assigned roles appropriately. Amazon's Data Pipeline service documentation offers tips on identifying and resolving common problems.

AWS places limits on pipeline resources. A single AWS account is limited to 100 pipelines and 100 components per pipeline. The minimum scheduling interval is 15 minutes, and the minimum time between attempts at retrying failed components is two minutes.
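A quick local check can catch a design that exceeds these limits before you try to create the pipeline. The sketch below hard-codes the figures quoted above; the function name is hypothetical, and the limits themselves may differ from what your account currently allows, so verify them against AWS documentation.

```python
from datetime import timedelta

# Limits as quoted in this article; confirm current values before relying on them.
MAX_COMPONENTS_PER_PIPELINE = 100
MIN_SCHEDULE_PERIOD = timedelta(minutes=15)
MIN_RETRY_DELAY = timedelta(minutes=2)


def limit_violations(component_count, period, retry_delay):
    """Return a list of human-readable problems; empty if within limits."""
    problems = []
    if component_count > MAX_COMPONENTS_PER_PIPELINE:
        problems.append(f"{component_count} components exceeds {MAX_COMPONENTS_PER_PIPELINE}")
    if period < MIN_SCHEDULE_PERIOD:
        problems.append(f"schedule period {period} is shorter than {MIN_SCHEDULE_PERIOD}")
    if retry_delay < MIN_RETRY_DELAY:
        problems.append(f"retry delay {retry_delay} is shorter than {MIN_RETRY_DELAY}")
    return problems


ok = limit_violations(40, timedelta(hours=1), timedelta(minutes=10))
bad = limit_violations(120, timedelta(minutes=5), timedelta(minutes=1))
print(ok, len(bad))
```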

There are also limits on the number of API calls you can make. For example, you can call "ActivatePipeline" once per second and "ReportTaskProgress" 10 times per second. AWS also allows for some degree of bursting: you can call "ActivatePipeline" five times in one second after a period of five seconds without calling that function.
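A token bucket is a common way to model this kind of steady rate plus burst allowance on the client side. The sketch below is an assumption about how you might pace your own calls, not a description of how AWS enforces its limits: the bucket refills at one token per second, holds at most five, and each call spends one token. Time is passed in explicitly so the behavior is easy to follow.

```python
class TokenBucket:
    """Client-side pacing: steady refill rate with a capped burst."""

    def __init__(self, rate_per_second, burst, start=0.0):
        self.rate = rate_per_second
        self.burst = burst
        self.tokens = float(burst)  # start full, as if idle for a while
        self.last = start

    def try_call(self, now):
        """Return True and spend a token if a call is allowed at time now."""
        elapsed = now - self.last
        self.tokens = min(self.burst, self.tokens + elapsed * self.rate)
        self.last = now
        if self.tokens >= 1.0:
            self.tokens -= 1.0
            return True
        return False


bucket = TokenBucket(rate_per_second=1, burst=5)
burst_results = [bucket.try_call(0.0) for _ in range(6)]  # sixth call is refused
after_one_second = bucket.try_call(1.0)                   # one token has refilled
after_idle = [bucket.try_call(6.0) for _ in range(5)]     # five idle seconds refill the burst
print(burst_results, after_one_second, after_idle)
```

When `try_call` returns False, the caller would wait and try again rather than send a request that is likely to be throttled.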

About the author:
Dan Sullivan holds a master of science degree and is an author, systems architect and consultant with more than 20 years of IT experience with engagements in advanced analytics, systems architecture, database design, enterprise security and business intelligence. He has worked in a broad range of industries, including financial services, manufacturing, pharmaceuticals, software development, government, retail and education, among others. Dan has written extensively about topics ranging from data warehousing, cloud computing and advanced analytics to security management, collaboration and text mining.
