Definition

AWS Data Pipeline (Amazon Data Pipeline)

Contributor(s): David Carty

AWS Data Pipeline is an Amazon Web Services (AWS) tool that enables an IT professional to process and move data between AWS compute and storage services in the public cloud and on-premises resources.

AWS Data Pipeline manages and streamlines data-driven workflows, including scheduling data movement and processing. The service is useful for customers who want to move data along a defined pipeline of sources, destinations and data-processing activities.

Using a Data Pipeline template, an IT pro can access information from a data source, process it and then automatically transfer results to another system or service. Access to the Data Pipeline is available through the AWS Management Console, the command-line interface or service APIs.
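Under the service APIs, a pipeline definition is a collection of objects, each identified by an ID and a list of key/value fields. The following is a minimal sketch of building such objects and submitting them through boto3's `datapipeline` client; the helper name and pipeline name are illustrative, not part of the AWS API.

```python
# Sketch of creating and activating a pipeline through the service API.
# Assumes AWS credentials and a region are configured for boto3; the
# pipeline name and uniqueId below are hypothetical.

def to_pipeline_object(obj_id, name, fields):
    """Convert a plain dict of fields into the wire format that
    put_pipeline_definition expects: a list of key/value entries.
    Dict values of the form {"ref": "..."} become object references."""
    wire_fields = []
    for key, value in fields.items():
        if isinstance(value, dict) and "ref" in value:
            wire_fields.append({"key": key, "refValue": value["ref"]})
        else:
            wire_fields.append({"key": key, "stringValue": str(value)})
    return {"id": obj_id, "name": name, "fields": wire_fields}

def create_and_activate(definition_objects):
    import boto3  # imported here so the builder above works offline
    client = boto3.client("datapipeline")
    pipeline_id = client.create_pipeline(
        name="demo-pipeline", uniqueId="demo-pipeline-1")["pipelineId"]
    client.put_pipeline_definition(
        pipelineId=pipeline_id, pipelineObjects=definition_objects)
    client.activate_pipeline(pipelineId=pipeline_id)
    return pipeline_id
```

The same definition objects can equally be written as JSON and submitted with the `aws datapipeline put-pipeline-definition` CLI command.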

An activity is an action that AWS Data Pipeline performs, such as a SQL query or command-line script. A developer can associate an optional precondition to a data source or activity, which ensures that it meets specified conditions before running an activity. AWS Data Pipeline includes several standard activities and preconditions for services like Amazon DynamoDB and Amazon Simple Storage Service (S3).
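As a concrete sketch of the activity/precondition relationship, the objects below pair an S3KeyExists precondition with a ShellCommandActivity, so the command runs only once the input file is confirmed to exist. Field names follow the Data Pipeline object model; the bucket, key and command are hypothetical.

```python
# Illustrative pipeline objects: a ShellCommandActivity gated by an
# S3KeyExists precondition. The bucket/key and command are made up.

def build_precondition_example():
    precondition = {
        "id": "InputReady",
        "name": "InputReady",
        "fields": [
            {"key": "type", "stringValue": "S3KeyExists"},
            {"key": "s3Key", "stringValue": "s3://example-bucket/input/data.csv"},
        ],
    }
    activity = {
        "id": "ProcessData",
        "name": "ProcessData",
        "fields": [
            {"key": "type", "stringValue": "ShellCommandActivity"},
            {"key": "command", "stringValue": "echo processing"},
            # The activity does not run until the precondition is satisfied.
            {"key": "precondition", "refValue": "InputReady"},
        ],
    }
    return [precondition, activity]
```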

[Video: How the AWS Data Pipeline service helps to better manage data-driven workloads, with examples of setting up and provisioning a pipeline.]

A developer can manage resources or let AWS Data Pipeline manage them. AWS-Data-Pipeline-managed resource options include Amazon EC2 instances and Amazon Elastic MapReduce (EMR) clusters. The service provisions an instance type or EMR cluster, as needed, and terminates compute resources when the activity finishes.
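A service-managed resource is itself just another pipeline object. The sketch below defines an Ec2Resource that Data Pipeline would provision for an activity and terminate when the work finishes; the instance type and `terminateAfter` cap are illustrative values.

```python
# Sketch of a service-managed compute resource. An activity points at this
# object via a runsOn reference; Data Pipeline provisions the instance when
# the activity starts and terminates it afterward. Values are illustrative.

def build_managed_resource():
    return {
        "id": "WorkerInstance",
        "name": "WorkerInstance",
        "fields": [
            {"key": "type", "stringValue": "Ec2Resource"},
            {"key": "instanceType", "stringValue": "m5.large"},
            # Upper bound on the resource's lifetime as a safety net.
            {"key": "terminateAfter", "stringValue": "2 Hours"},
        ],
    }
```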

Examples

For example, a data scientist could assign AWS Data Pipeline a job that accesses log data from Amazon S3 every hour and then transfers that data to a relational database or a NoSQL database for later analysis. AWS Data Pipeline can also transform data to a SQL format, make copies of distributed data, send data to Amazon EMR applications, or run scripts that send data to Amazon S3, Amazon Relational Database Service or Amazon DynamoDB.
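The hourly log-transfer example above can be sketched as a Schedule object driving a CopyActivity between an S3 data node and a SQL data node. All identifiers, paths and table names here are hypothetical.

```python
# Sketch of an hourly S3-to-database copy: a Schedule drives a CopyActivity
# from an S3DataNode to a SqlDataNode. Paths and names are made up.

def build_hourly_copy():
    schedule = {"id": "Hourly", "name": "Hourly", "fields": [
        {"key": "type", "stringValue": "Schedule"},
        {"key": "period", "stringValue": "1 Hour"},
        {"key": "startAt", "stringValue": "FIRST_ACTIVATION_DATE_TIME"},
    ]}
    source = {"id": "LogFiles", "name": "LogFiles", "fields": [
        {"key": "type", "stringValue": "S3DataNode"},
        {"key": "directoryPath", "stringValue": "s3://example-bucket/logs/"},
        {"key": "schedule", "refValue": "Hourly"},
    ]}
    destination = {"id": "LogTable", "name": "LogTable", "fields": [
        {"key": "type", "stringValue": "SqlDataNode"},
        {"key": "table", "stringValue": "access_logs"},
        {"key": "schedule", "refValue": "Hourly"},
    ]}
    copy = {"id": "HourlyCopy", "name": "HourlyCopy", "fields": [
        {"key": "type", "stringValue": "CopyActivity"},
        {"key": "input", "refValue": "LogFiles"},
        {"key": "output", "refValue": "LogTable"},
        {"key": "schedule", "refValue": "Hourly"},
    ]}
    return [schedule, source, destination, copy]
```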

The AWS Data Pipeline service is best suited to workflows already optimized for AWS, but it can also connect to on-premises and third-party data sources. To work with on-premises resources, a developer installs the Java-based Task Runner package on local servers; it continuously polls AWS Data Pipeline for tasks to run.

Pricing

AWS Data Pipeline charges vary according to the region in which customers use the service, whether they run on premises or in the cloud, and the number of preconditions and activities they use each month.

AWS provides a free tier of service for AWS Data Pipeline. New customers receive three free low-frequency preconditions and five free low-frequency activities each month for one year. These low-frequency activities and preconditions run no more than once a day.

This was last updated in May 2017
