Developers use ETL tools sparingly because the work involves error-prone manual coding. Dev teams must discover the data, convert it to the desired format, map it on the cluster, schedule jobs and then test them. A newer AWS service aims to automate much of this once-manual process.
AWS Glue is a managed extract, transform, load (ETL) service that moves data among various data stores. The service generates ETL jobs for the data and handles potential errors; it creates Python code to move data from source to destination. This reduces the problems of hand-coded ETL tasks, in which later changes to data format, volume and target schemas force frequent manual revisions to code. The AWS Glue service has three components:
- Data cataloging: Builds an Apache Hive-compatible metastore that contains information regarding data types and partition formats for tables.
- Job authoring: Enables AWS Glue to generate code to move data from source to destination; developers can push that code to Git for version control.
- Job execution: Completes the task; developers don't need to deploy, configure or provision servers for AWS Glue. Jobs automatically run in a Spark environment.
The ETL scripts that Glue generates can handle both semi-structured and structured data. If AWS Glue encounters bad data, it writes the error rows to a separate Amazon Simple Storage Service (S3) bucket instead of allowing the job to crash. If a job does crash, it resumes from the point where it stopped, so no data is duplicated.
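The error-row and resume behavior described above can be illustrated with a minimal plain-Python sketch. It is not Glue's generated code or API: the function `run_etl`, the helper `to_record`, the sample rows and the integer `bookmark` are all hypothetical names chosen for illustration, and in-memory lists stand in for the S3 locations.

```python
def run_etl(rows, transform, bookmark=0):
    """Transform rows from position `bookmark` onward.

    Rows that fail are set aside with their error message instead of
    stopping the run. Returns (loaded, rejected, new_bookmark).
    """
    loaded, rejected = [], []
    position = bookmark
    for index, row in enumerate(rows):
        if index < bookmark:
            continue  # processed by an earlier run, so skip to avoid duplicates
        try:
            loaded.append(transform(row))
        except (KeyError, ValueError, TypeError) as exc:
            # Stand-in for writing the bad row to a separate error location.
            rejected.append({"row": row, "error": str(exc)})
        position = index + 1
    return loaded, rejected, position


def to_record(row):
    """Hypothetical transform: cast string fields to typed values."""
    return {"id": int(row["id"]), "amount": float(row["amount"])}


rows = [
    {"id": "1", "amount": "9.50"},
    {"id": "x", "amount": "3.00"},  # bad row: id is not an integer
    {"id": "3", "amount": "4.25"},
]

loaded, rejected, bookmark = run_etl(rows, to_record)
print(len(loaded), "rows loaded,", len(rejected), "rejected, bookmark at", bookmark)
```

Passing the returned bookmark into the next call makes a rerun pick up where the last one ended, which is the property that prevents duplicate loads.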
Developers can use these job-triggering techniques:
- Schedule-based, which is time-dependent and triggers at a particular time.
- Event-based, which waits for a signal from another job.
- External sources, which run code in response to external triggers. For example, developers can code job triggers from AWS Lambda.
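As a rough sketch of the three techniques, the Python below builds request parameters in the shape that the boto3 Glue client's `create_trigger` call accepts, and starts a job from a Lambda handler with `start_job_run`. The trigger names, job names and cron expression are hypothetical, and the handler assumes boto3 and suitable IAM permissions are available in the Lambda runtime.

```python
def scheduled_trigger(name, job_name, cron):
    """Schedule-based: run `job_name` on a cron schedule."""
    return {
        "Name": name,
        "Type": "SCHEDULED",
        "Schedule": f"cron({cron})",
        "Actions": [{"JobName": job_name}],
        "StartOnCreation": True,
    }


def conditional_trigger(name, job_name, upstream_job):
    """Event-based: run `job_name` after `upstream_job` succeeds."""
    return {
        "Name": name,
        "Type": "CONDITIONAL",
        "Predicate": {
            "Conditions": [
                {
                    "LogicalOperator": "EQUALS",
                    "JobName": upstream_job,
                    "State": "SUCCEEDED",
                }
            ]
        },
        "Actions": [{"JobName": job_name}],
        "StartOnCreation": True,
    }


def start_job(glue_client, job_name):
    """External source: start a job run on demand and return its run ID."""
    response = glue_client.start_job_run(JobName=job_name)
    return response["JobRunId"]


def lambda_handler(event, context):
    """Hypothetical AWS Lambda entry point that kicks off a Glue job."""
    import boto3  # imported here so the helpers above work without boto3

    return start_job(boto3.client("glue"), "orders-etl")


# Hypothetical names; each dict would be passed as
# boto3.client("glue").create_trigger(**params).
nightly = scheduled_trigger("nightly-orders", "orders-etl", "0 2 * * ? *")
after_load = conditional_trigger("after-orders", "orders-report", "orders-etl")
print(nightly["Schedule"], after_load["Type"])
```

Keeping the parameter builders separate from the client call makes the trigger definitions easy to review and unit test before anything is created in an AWS account.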
AWS Glue is fault-tolerant, meaning it retries failed jobs. Developers can also review the job logs to debug errors. The ETL service integrates with other AWS tools and services, such as Amazon S3, Amazon Relational Database Service and Amazon Redshift.
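The retry-and-log behavior can be pictured with a small plain-Python sketch. This is an illustration of the general pattern, not Glue's internal implementation; `run_with_retries`, `max_retries` and the logger name are hypothetical.

```python
import logging

logging.basicConfig(level=logging.INFO)
log = logging.getLogger("etl-sketch")


def run_with_retries(job, max_retries=2):
    """Call `job` until it succeeds or the retries are used up.

    Returns (result, attempts). Each failure is logged so it can be
    reviewed later; the last error is re-raised if every attempt fails.
    """
    attempts = 0
    while True:
        attempts += 1
        try:
            return job(), attempts
        except Exception as exc:
            log.warning("attempt %d failed: %s", attempts, exc)
            if attempts > max_retries:
                raise
```

With `max_retries=2`, a job is attempted at most three times: the first run plus two retries.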