AWS Glue

Contributor(s): David Carty

AWS Glue is a cloud service that prepares data for analysis through automated extract, transform and load (ETL) processes.

The service can automatically discover an enterprise's structured or unstructured data when it is stored in data lakes in Amazon Simple Storage Service (S3), data warehouses in Amazon Redshift and databases that are part of the Amazon Relational Database Service (RDS). Glue also supports MySQL, Oracle, Microsoft SQL Server and PostgreSQL databases that run on Amazon Elastic Compute Cloud (EC2) instances in an Amazon Virtual Private Cloud.

The service then profiles data in its Glue Data Catalog, a metadata repository for all data assets that holds details such as table definitions, locations and other attributes. A team can also use the Glue Data Catalog as an alternative to Apache Hive Metastore for Amazon Elastic MapReduce (EMR) applications.

To pull metadata into the Data Catalog, the service uses Glue crawlers, which scan data stores and extract schema and other attributes. An IT professional can customize crawlers as needed.
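As a rough illustration of how a crawler might be defined programmatically, the sketch below assembles the arguments for the `create_crawler` call in boto3, the AWS SDK for Python. The crawler name, IAM role, database name and S3 path are hypothetical placeholders. The AWS call itself sits in a separate function, so the argument-building step works without the SDK or credentials.

```python
def build_crawler_request(name, role_arn, database, s3_path):
    """Assemble keyword arguments for boto3's Glue create_crawler call."""
    return {
        "Name": name,
        "Role": role_arn,            # IAM role the crawler assumes
        "DatabaseName": database,    # Data Catalog database that receives the tables
        "Targets": {"S3Targets": [{"Path": s3_path}]},
    }


def create_crawler(request):
    """Send the request to AWS Glue (requires boto3 and AWS credentials)."""
    import boto3  # imported lazily so the builder above works without the SDK

    glue = boto3.client("glue")
    return glue.create_crawler(**request)


# Hypothetical values, for illustration only.
crawler_request = build_crawler_request(
    name="sales-crawler",
    role_arn="arn:aws:iam::123456789012:role/ExampleGlueRole",
    database="sales_catalog",
    s3_path="s3://example-bucket/sales/",
)
```

Calling `create_crawler(crawler_request)` would register the crawler; it would still need to be started or scheduled before it scans the data store.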

ETL engine

After data is cataloged, it is searchable and ready for ETL jobs. AWS Glue includes an ETL script recommendation system to create Python and Spark (PySpark) code, as well as an ETL library to execute jobs. A developer can write ETL code via the Glue custom library, or write PySpark code via the AWS Glue Console script editor.

A developer can also import custom PySpark code or libraries, or upload code for existing ETL jobs to an S3 bucket and then create a new Glue job that runs it. AWS also provides sample code for Glue in a GitHub repository.
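The upload-then-create-job flow could look something like the following sketch, which builds the arguments for boto3's `create_job` call and points the job at a script already stored in S3. The job name, role and script path are hypothetical, and `glueetl` is the command name boto3 uses for Spark-based ETL jobs.

```python
def build_job_request(name, role_arn, script_location):
    """Assemble keyword arguments for boto3's Glue create_job call."""
    return {
        "Name": name,
        "Role": role_arn,
        "Command": {
            "Name": "glueetl",                  # Spark ETL job type
            "ScriptLocation": script_location,  # S3 path of the uploaded script
        },
    }


def create_job(request):
    """Register the job with AWS Glue (requires boto3 and AWS credentials)."""
    import boto3  # imported lazily so the builder above works without the SDK

    return boto3.client("glue").create_job(**request)


# Hypothetical values, for illustration only.
job_request = build_job_request(
    name="nightly-sales-etl",
    role_arn="arn:aws:iam::123456789012:role/ExampleGlueRole",
    script_location="s3://example-bucket/scripts/sales_etl.py",
)
```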

Schedule, orchestrate ETL jobs

AWS Glue jobs can execute on a schedule, and a developer can schedule ETL jobs to run as often as every five minutes. As of this writing, AWS Glue cannot handle streaming data.
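A schedule of this kind is typically written as a cron expression. The helper below is a minimal sketch, assuming the six-field `cron(...)` syntax AWS uses for scheduled Glue triggers; the function name and the divisor check are illustrative choices, and the five-minute floor mirrors the limit described above.

```python
MIN_INTERVAL_MINUTES = 5  # shortest interval described in the text


def every_n_minutes(n):
    """Return an AWS-style cron expression that fires every n minutes."""
    if n < MIN_INTERVAL_MINUTES:
        raise ValueError(f"interval must be at least {MIN_INTERVAL_MINUTES} minutes")
    if n >= 60 or 60 % n != 0:
        # Keep intervals even across the hour boundary.
        raise ValueError("interval must be under 60 and divide 60 evenly")
    # Fields: minutes hours day-of-month month day-of-week year
    return f"cron(0/{n} * * * ? *)"


five_minute_schedule = every_n_minutes(5)
```

The resulting string could then be supplied as the schedule of a crawler or a scheduled trigger.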

If a dev team prefers to orchestrate its workloads, the service supports scheduled, on-demand and job completion triggers. A scheduled trigger executes jobs at specified intervals, while an on-demand trigger executes when prompted by the user. With a job completion trigger, one or more jobs can execute when another job finishes. These jobs can start at the same time or sequentially, and they can also be triggered from an outside service, such as AWS Lambda.
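To make the job completion trigger more concrete, the sketch below builds the arguments for boto3's `create_trigger` call with the `CONDITIONAL` trigger type, so that one or more downstream jobs start once a watched job succeeds. All job and trigger names are hypothetical.

```python
def build_completion_trigger(name, watched_job, jobs_to_start):
    """Assemble create_trigger arguments for a job completion trigger."""
    return {
        "Name": name,
        "Type": "CONDITIONAL",  # other types: "SCHEDULED", "ON_DEMAND"
        "Predicate": {
            "Conditions": [
                {
                    "LogicalOperator": "EQUALS",
                    "JobName": watched_job,
                    "State": "SUCCEEDED",  # fire only after a successful run
                }
            ]
        },
        # Every job listed here starts when the condition is met.
        "Actions": [{"JobName": job} for job in jobs_to_start],
    }


# Hypothetical names, for illustration only.
trigger_request = build_completion_trigger(
    name="after-sales-etl",
    watched_job="nightly-sales-etl",
    jobs_to_start=["build-report", "refresh-dashboard"],
)
```

With boto3 installed and credentials configured, the request could be sent with `boto3.client("glue").create_trigger(**trigger_request)`.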

AWS Glue pricing

AWS charges users a monthly fee to store and access metadata in the Glue Data Catalog. ETL job and crawler execution is billed per second, with a 10-minute minimum. AWS also applies a per-second charge for connecting to a development environment for interactive development.
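The effect of the 10-minute minimum on per-second billing can be worked through with a short calculation. This is only a sketch: the hourly rate is a hypothetical placeholder rather than an actual AWS price, and real charges also depend on factors not covered here.

```python
MINIMUM_BILLED_SECONDS = 10 * 60  # 10-minute minimum described in the text


def billed_seconds(runtime_seconds):
    """Seconds actually billed for one job or crawler run."""
    return max(runtime_seconds, MINIMUM_BILLED_SECONDS)


def estimate_charge(runtime_seconds, hourly_rate):
    """Estimated charge for one run at a hypothetical hourly rate."""
    return billed_seconds(runtime_seconds) * hourly_rate / 3600


short_run = billed_seconds(90)     # 90-second run is billed as 600 seconds
long_run = billed_seconds(1500)    # 25-minute run is billed as 1,500 seconds
```

In other words, a run that finishes in 90 seconds costs the same as one that takes the full 10 minutes, while longer runs are billed for exactly the time they use.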

This was last updated in November 2017
