AWS Glue simplifies messy data transfers

We need to perform extract, transform and load processes on data, but the work is labor-intensive and error-prone. Is there an AWS tool to simplify this task?

Developers use ETL tools only sparingly because they involve error-prone manual coding. Dev teams must discover the data, convert it to the desired format, map it on the cluster, schedule jobs and then test them. But a new AWS tool simplifies this once-manual process.

AWS Glue is a managed extract, transform, load (ETL) service that moves data among various data stores. The service generates ETL jobs for the data and handles potential errors; it creates Python code to move data from source to destination. This removes a common problem with hand-coded ETL tasks: subsequent changes to data format, volume and target schemas require frequent manual revisions to the code. The AWS Glue service has three components:

  • Data cataloging: Builds a Hive-compatible metastore that holds information about data types and partition formats for tables.
  • Job authoring: AWS Glue generates the code that moves data from source to destination; developers can push that code to Git for version control.
  • Job execution: Runs the job; developers don't need to deploy, configure or provision servers for AWS Glue. Jobs run automatically in a Spark environment.
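To make the job authoring step concrete, here is a minimal sketch in plain Python of what a source-to-target column mapping does. It is an illustration only: the `apply_mapping` helper, the tuple format and the column names are hypothetical, and this is not the code AWS Glue itself generates.

```python
from typing import Any, Callable, Dict, List, Tuple

# Hypothetical mapping format: (source column, target column, cast function).
Mapping = Tuple[str, str, Callable[[Any], Any]]


def apply_mapping(records: List[Dict[str, Any]],
                  mappings: List[Mapping]) -> List[Dict[str, Any]]:
    """Rename and cast the columns of each record according to the mappings."""
    output = []
    for record in records:
        row = {}
        for source, target, cast in mappings:
            value = record.get(source)
            # Missing or null source values stay null in the target row.
            row[target] = cast(value) if value is not None else None
        output.append(row)
    return output


# Example data, invented for this sketch.
source_rows = [{"id": "1", "amt": "9.50"}, {"id": "2", "amt": None}]
mappings = [("id", "order_id", int), ("amt", "amount", float)]
target_rows = apply_mapping(source_rows, mappings)
```

When the target schema changes, only the `mappings` list needs to change, which is the kind of revision that generated code is meant to spare developers from making by hand.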

The ETL scripts that Glue generates can handle both structured and semi-structured data. If AWS Glue encounters bad data, it places the error rows in separate Amazon Simple Storage Service (S3) buckets instead of allowing the job to crash. If a crash does occur, the job resumes from the point where it stopped, so no data is duplicated.
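The two behaviors described above, setting bad rows aside and resuming from a checkpoint, can be sketched in plain Python. This is a conceptual illustration under stated assumptions, not Glue's internal mechanism: `run_batch`, `to_cents` and the sample rows are hypothetical, and in a real job the bad rows would be written to a separate S3 location rather than kept in a list.

```python
from typing import Any, Callable, Dict, List, Tuple

Row = Dict[str, Any]


def run_batch(rows: List[Row],
              transform: Callable[[Row], Row],
              start_at: int = 0) -> Tuple[List[Row], List[Row], int]:
    """Transform rows from start_at onward; collect failures instead of raising.

    Returns the good rows, the error rows and the next checkpoint index.
    """
    good, bad = [], []
    for index in range(start_at, len(rows)):
        row = rows[index]
        try:
            good.append(transform(row))
        except (KeyError, TypeError, ValueError) as exc:
            # A bad row is recorded with its position and error, not fatal.
            bad.append({"index": index, "row": row, "error": repr(exc)})
    return good, bad, len(rows)


def to_cents(row: Row) -> Row:
    """Hypothetical transform: convert a price string to integer cents."""
    return {"id": row["id"], "cents": round(float(row["price"]) * 100)}


rows = [{"id": 1, "price": "2.50"},
        {"id": 2, "price": "n/a"},   # malformed on purpose
        {"id": 3, "price": "4"}]
good, bad, checkpoint = run_batch(rows, to_cents)

# A later run starts at the saved checkpoint, so earlier rows are not redone.
rows.append({"id": 4, "price": "1.25"})
more_good, more_bad, checkpoint = run_batch(rows, to_cents, start_at=checkpoint)
```

Because the second run begins at the stored index, rows 1 through 3 are never processed twice, which is the property the article describes as avoiding duplicates.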

Developers can use these job-triggering techniques:

  • Schedule-based, which start a job at a particular time.
  • Event-based, which wait for a signal from another job.
  • External sources, which start a job in response to an external trigger. For example, developers can code job triggers from AWS Lambda.
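The external-trigger option can be sketched as a small AWS Lambda handler that starts a Glue job through boto3's `start_job_run` call. Treat this as a minimal sketch: the job name `example-etl-job`, the `GLUE_JOB_NAME` environment variable and the optional `glue_client` parameter (added so the handler can be exercised without AWS credentials) are assumptions made for illustration.

```python
import os


def lambda_handler(event, context, glue_client=None):
    """Start a Glue job run when the Lambda function is invoked."""
    if glue_client is None:
        # Imported lazily so the module loads even where boto3 is absent.
        import boto3
        glue_client = boto3.client("glue")

    # Hypothetical job name; override it with an environment variable.
    job_name = os.environ.get("GLUE_JOB_NAME", "example-etl-job")

    # start_job_run returns a dict that includes the new run's JobRunId.
    response = glue_client.start_job_run(JobName=job_name)
    return {"job_name": job_name, "job_run_id": response["JobRunId"]}
```

Wiring this function to an event source, such as a notification that a new object has landed in an S3 bucket, would then start the ETL job whenever that event fires. The function's execution role would also need permission to start Glue job runs.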

AWS Glue is fault-tolerant, meaning it retries any failed jobs. Developers can also review logs to diagnose errors. The ETL service integrates with other AWS tools and services, such as Amazon S3, Amazon Relational Database Service and Amazon Redshift.

This was last published in May 2017
