Amazon Athena

Amazon Athena is an interactive query service in the Amazon Web Services (AWS) public cloud that lets a data analyst run SQL queries directly on data stored in Amazon Simple Storage Service (S3). Because Athena is serverless, an analyst doesn't need to manage any underlying compute infrastructure to use it.

There is also no need to load S3 data into Amazon Athena or transform it for analysis, which makes it easier and faster for an analyst to gain insight. A data analyst accesses Athena through the AWS Management Console, an application programming interface (API) or a Java Database Connectivity (JDBC) driver. The analyst then defines a schema and can start to execute SQL queries on S3 data.
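As a rough illustration of the define-a-schema-then-query flow described above, the following Python sketch builds a table definition for comma-delimited files and submits SQL through the Athena API. It assumes a client object such as the one returned by boto3.client("athena"). The helper names, table name, columns and S3 paths are hypothetical and chosen only for illustration.

```python
import time


def build_create_table_ddl(table, columns, s3_location):
    """Return a CREATE EXTERNAL TABLE statement for comma-delimited files.

    `columns` is a list of (name, type) pairs. All names here are
    illustrative assumptions, not values taken from the article.
    """
    cols = ",\n  ".join(f"`{name}` {dtype}" for name, dtype in columns)
    return (
        f"CREATE EXTERNAL TABLE IF NOT EXISTS {table} (\n  {cols}\n)\n"
        "ROW FORMAT DELIMITED FIELDS TERMINATED BY ','\n"
        f"LOCATION '{s3_location}'"
    )


def run_query(client, sql, database, output_location, poll_seconds=1.0):
    """Submit `sql` and poll until Athena reports a terminal state.

    `client` is assumed to behave like boto3.client("athena").
    Returns (query_execution_id, final_state).
    """
    start = client.start_query_execution(
        QueryString=sql,
        QueryExecutionContext={"Database": database},
        ResultConfiguration={"OutputLocation": output_location},
    )
    query_id = start["QueryExecutionId"]
    while True:
        status = client.get_query_execution(QueryExecutionId=query_id)
        state = status["QueryExecution"]["Status"]["State"]
        if state in ("SUCCEEDED", "FAILED", "CANCELLED"):
            return query_id, state
        time.sleep(poll_seconds)
```

With real credentials, usage might look like run_query(boto3.client("athena"), ddl, "default", "s3://example-bucket/results/"), followed by a second call with a SELECT statement. The data itself stays in S3; only the table definition is created.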

An administrator can manage access to Athena via AWS Identity and Access Management (IAM) policies, access control lists and Amazon S3 bucket policies. An Athena user can query encrypted data with keys managed by AWS Key Management Service (KMS), and can also encrypt query results. Athena also supports cross-account access to S3 buckets owned by another AWS account.
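The access controls and result encryption described above can be sketched in Python as plain data structures. The first helper assembles a minimal IAM policy document, and the second builds a result configuration that asks Athena to encrypt query output with a KMS key. The bucket names, key ARN and function names are hypothetical, and the policy is deliberately incomplete: a working setup would typically also need data catalog and KMS permissions, which are omitted here.

```python
import json


def athena_read_policy(data_bucket, results_bucket):
    """Return a minimal IAM policy document as a dict (illustrative only)."""
    return {
        "Version": "2012-10-17",
        "Statement": [
            {   # Allow submitting queries and reading their status and results.
                "Effect": "Allow",
                "Action": [
                    "athena:StartQueryExecution",
                    "athena:GetQueryExecution",
                    "athena:GetQueryResults",
                ],
                "Resource": "*",
            },
            {   # Allow reading the source data that queries scan.
                "Effect": "Allow",
                "Action": ["s3:GetObject", "s3:ListBucket", "s3:GetBucketLocation"],
                "Resource": [
                    f"arn:aws:s3:::{data_bucket}",
                    f"arn:aws:s3:::{data_bucket}/*",
                ],
            },
            {   # Allow writing and reading query results.
                "Effect": "Allow",
                "Action": ["s3:GetObject", "s3:PutObject", "s3:GetBucketLocation"],
                "Resource": [
                    f"arn:aws:s3:::{results_bucket}",
                    f"arn:aws:s3:::{results_bucket}/*",
                ],
            },
        ],
    }


def encrypted_result_config(output_location, kms_key_arn):
    """Build a ResultConfiguration that requests SSE-KMS encrypted results."""
    return {
        "OutputLocation": output_location,
        "EncryptionConfiguration": {
            "EncryptionOption": "SSE_KMS",
            "KmsKey": kms_key_arn,
        },
    }


if __name__ == "__main__":
    # Print the policy as JSON, ready to paste into an IAM policy editor.
    print(json.dumps(athena_read_policy("example-data", "example-results"), indent=2))
```

The dictionary returned by encrypted_result_config is meant to be passed as the ResultConfiguration argument when a query is started through the API.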

In addition, Athena uses a managed data catalog to store information and schemas related to your queries on Amazon S3 data.

Supported data types and integration

Amazon Athena relies on the open source Presto distributed SQL query engine to enable both quick ad-hoc analysis and more complex requests, including window functions, large joins and aggregations. Athena can process both unstructured and structured data types, including formats like CSV, JSON, ORC, Parquet and Avro. Athena also supports compressed data in Snappy, Zlib, LZO and GZIP formats.
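To make the mention of window functions concrete, the Python sketch below composes a Presto-style query that keeps the top rows per group using rank() with an OVER clause. The function name and the table and column names in the usage note are hypothetical. The identifiers are interpolated without escaping, so this assumes trusted input.

```python
def top_n_per_group_sql(table, group_col, order_col, n):
    """Return SQL selecting the top `n` rows per `group_col`.

    Rows are ranked within each group by `order_col`, highest first.
    Identifiers are inserted as-is, so callers must pass trusted names.
    """
    if n < 1:
        raise ValueError("n must be at least 1")
    return (
        "SELECT * FROM (\n"
        f"  SELECT *, rank() OVER (PARTITION BY {group_col} "
        f"ORDER BY {order_col} DESC) AS rnk\n"
        f"  FROM {table}\n"
        ") AS ranked\n"
        f"WHERE rnk <= {n}"
    )
```

For example, top_n_per_group_sql("logs", "status", "bytes", 3) yields a query for the three largest responses per status code, which could then be submitted to Athena like any other SQL statement.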

Athena integrates with other services in the AWS portfolio. For example, you can use it with Amazon QuickSight to visualize data, or with AWS Glue to enable more sophisticated data catalog features, such as a metadata repository, automated schema and partition recognition, and data pipelines based on Python. Athena itself uses Amazon S3 as an underlying data store, which provides data redundancy.

Amazon Athena vs. Redshift, other services

Amazon Redshift, AWS' data warehouse service, addresses different needs than Athena. Redshift handles more complex, multi-part SQL queries, and is a better fit for an organization that needs to combine data from disparate sources into a common format. Redshift fits with business intelligence workloads and enterprise reporting, while Athena is better suited for simpler, ad-hoc queries on S3 data.

Amazon Elastic MapReduce (EMR) enables teams to run distributed data processing frameworks, like Hadoop, Spark and Presto. EMR goes beyond data queries, and it is better suited for projects that require custom code, specific cluster configurations or extremely large data sets. However, you can use Athena to query data processed by EMR without impacting ongoing EMR jobs.

This was last updated in January 2018
