AWS Glue is a cloud service that prepares data for analysis through automated extract, transform and load (ETL) processes.
The service can automatically find an enterprise's structured or unstructured data when it is stored within data lakes in Amazon Simple Storage Service (S3), data warehouses in Amazon Redshift and other databases that are part of the Amazon Relational Database Service. Glue also supports MySQL, Oracle, Microsoft SQL Server and PostgreSQL databases that run on Amazon Elastic Compute Cloud (EC2) instances in an Amazon Virtual Private Cloud.
The service then profiles data in its Glue Data Catalog, a metadata repository for all data assets that holds details such as table definitions, locations and other attributes. A team can also use the Glue Data Catalog as an alternative to the Apache Hive Metastore for Amazon EMR (formerly Amazon Elastic MapReduce) applications.
To pull metadata into the Data Catalog, the service uses Glue crawlers, which scan data stores and extract schema and other attributes. An IT professional can customize crawlers as needed.
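As a rough illustration of how a crawler might be defined programmatically, the sketch below assembles the arguments for the boto3 Glue client's create_crawler call. The crawler name, IAM role ARN, database name and S3 path are hypothetical placeholders. The request is only built here; the actual call appears in a comment because it needs AWS credentials.

```python
def build_crawler_request(name, role_arn, database, s3_path, schedule=None):
    """Assemble keyword arguments for the boto3 Glue create_crawler call."""
    if not s3_path.startswith("s3://"):
        raise ValueError("s3_path must start with s3://")
    request = {
        "Name": name,
        "Role": role_arn,
        "DatabaseName": database,  # Data Catalog database that receives the tables
        "Targets": {"S3Targets": [{"Path": s3_path}]},
    }
    if schedule is not None:
        request["Schedule"] = schedule  # optional cron expression
    return request


# All values below are hypothetical placeholders.
crawler_request = build_crawler_request(
    name="sales-crawler",
    role_arn="arn:aws:iam::123456789012:role/ExampleGlueRole",
    database="sales_db",
    s3_path="s3://example-bucket/sales/",
)

# With credentials configured, the request could be sent like this:
#   import boto3
#   glue = boto3.client("glue")
#   glue.create_crawler(**crawler_request)
#   glue.start_crawler(Name=crawler_request["Name"])
```

Keeping the request as a plain dictionary makes it easy to review or version the crawler definition before anything is created in an account.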
After data is cataloged, it is searchable and ready for ETL jobs. AWS Glue includes an ETL script recommendation system to create Python and Spark (PySpark) code, as well as an ETL library to execute jobs. A developer can write ETL code via the Glue custom library, or write PySpark code via the AWS Glue Console script editor.
A developer can also import custom PySpark code or libraries, or upload the code for existing ETL jobs to an S3 bucket and then create a new Glue job that runs it. AWS also provides sample code for Glue in a GitHub repository.
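The upload-then-create flow described above could look roughly like the following sketch, which builds the arguments for the boto3 Glue client's create_job call. The job name, role ARN and script location are hypothetical, and the script is assumed to have been uploaded to S3 already.

```python
def build_job_request(name, role_arn, script_location, python_version="3"):
    """Assemble keyword arguments for the boto3 Glue create_job call."""
    if not script_location.startswith("s3://"):
        raise ValueError("script_location must be an S3 URI")
    return {
        "Name": name,
        "Role": role_arn,
        "Command": {
            "Name": "glueetl",                  # Spark-based ETL job type
            "ScriptLocation": script_location,  # where the uploaded script lives
            "PythonVersion": python_version,
        },
    }


# Hypothetical names and paths, for illustration only.
job_request = build_job_request(
    name="sales-etl",
    role_arn="arn:aws:iam::123456789012:role/ExampleGlueRole",
    script_location="s3://example-bucket/scripts/sales_etl.py",
)

# With credentials configured:
#   import boto3
#   glue = boto3.client("glue")
#   glue.create_job(**job_request)
#   glue.start_job_run(JobName=job_request["Name"])
```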
Schedule, orchestrate ETL jobs
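AWS Glue jobs can execute on a schedule. A developer can schedule ETL jobs at intervals as short as five minutes. Scheduled jobs of this kind run as batches and do not process streaming data; AWS later added a separate streaming ETL job type for sources such as Amazon Kinesis Data Streams and Apache Kafka.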
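Glue schedules are written as six-field cron expressions (minutes, hours, day of month, month, day of week, year). The helper below is a minimal sketch, with a hypothetical function name, that produces an every-n-minutes expression and rejects intervals under the five-minute minimum. It also restricts n to values that divide 60 evenly, an extra assumption made here so that runs stay evenly spaced across the hour boundary.

```python
def every_n_minutes(n):
    """Return a Glue-style cron expression that fires every n minutes."""
    if n < 5:
        raise ValueError("interval is below the five-minute minimum")
    if n > 30 or 60 % n != 0:
        raise ValueError("n must divide 60 evenly and be at most 30")
    # "0/n" in the minutes field means: start at minute 0, then every n minutes.
    return f"cron(0/{n} * * * ? *)"


five_minute_schedule = every_n_minutes(5)
quarter_hour_schedule = every_n_minutes(15)
```

The resulting string can be passed wherever a schedule expression is expected, for example as the schedule of a crawler or a scheduled trigger.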
If a dev team prefers to orchestrate its workloads, the service allows scheduled, on-demand and job completion triggers. A scheduled trigger executes jobs at specified intervals, while an on-demand trigger executes when prompted by the user. With a job completion trigger, one or more jobs can execute when another job finishes. The triggered jobs can start at the same time or sequentially, and jobs can also be started from an outside service, such as AWS Lambda.
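The three trigger types can be sketched as argument dictionaries for the boto3 Glue client's create_trigger call, where a job completion trigger corresponds to the CONDITIONAL type. All trigger and job names below are hypothetical, and the helper functions are illustrative rather than part of any AWS library.

```python
def _actions(job_names):
    # Every job listed in one trigger's actions is started when the trigger fires.
    return [{"JobName": job} for job in job_names]


def scheduled_trigger(name, cron_expression, job_names):
    return {"Name": name, "Type": "SCHEDULED", "Schedule": cron_expression,
            "Actions": _actions(job_names), "StartOnCreation": True}


def on_demand_trigger(name, job_names):
    return {"Name": name, "Type": "ON_DEMAND", "Actions": _actions(job_names)}


def job_completion_trigger(name, watched_job, job_names):
    """Fire when watched_job finishes in the SUCCEEDED state."""
    return {
        "Name": name,
        "Type": "CONDITIONAL",
        "Predicate": {"Conditions": [{
            "LogicalOperator": "EQUALS",
            "JobName": watched_job,
            "State": "SUCCEEDED",
        }]},
        "Actions": _actions(job_names),
        "StartOnCreation": True,
    }


# Sequential chain: extract runs on a schedule, then transform, then two loads together.
triggers = [
    scheduled_trigger("nightly", "cron(0 2 * * ? *)", ["extract"]),
    job_completion_trigger("after-extract", "extract", ["transform"]),
    job_completion_trigger("after-transform", "transform", ["load-a", "load-b"]),
]

# With credentials configured:
#   import boto3
#   glue = boto3.client("glue")
#   for trigger in triggers:
#       glue.create_trigger(**trigger)
# An outside service such as a Lambda function could instead start a job directly:
#   glue.start_job_run(JobName="extract")
```

Listing several jobs in one trigger starts them together, while chaining job completion triggers yields sequential execution.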
AWS Glue pricing
AWS charges users a monthly fee to store and access metadata in the Glue Data Catalog. ETL job and crawler execution is billed per second, with a minimum of 10 minutes per run. AWS also bills per second for connecting to a development environment for interactive development.
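The effect of per-second billing with a 10-minute minimum can be worked through with a small calculation. The hourly rate used below is a made-up placeholder, not an AWS price, and the function names are hypothetical.

```python
def billed_seconds(run_seconds, minimum_seconds=600):
    """Per-second billing with a floor: short runs are billed as the minimum."""
    return max(run_seconds, minimum_seconds)


def run_cost(run_seconds, hourly_rate, minimum_seconds=600):
    """Cost of one run at the given hourly rate."""
    return billed_seconds(run_seconds, minimum_seconds) * hourly_rate / 3600


HYPOTHETICAL_RATE = 0.75  # placeholder hourly rate, not a real AWS price

short_run = run_cost(4 * 60, HYPOTHETICAL_RATE)   # 4-minute run, billed as 10 minutes
long_run = run_cost(20 * 60, HYPOTHETICAL_RATE)   # 20-minute run, billed as 20 minutes
```

Under these assumptions a 4-minute run costs the same as a 10-minute one, so many very short runs can cost more than a few longer ones that do the same work.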