Improve query performance, cut costs with Amazon Athena

Amazon Athena can query more data formats than its competitor Google BigQuery, but developers must choose easy-to-process formats to enjoy Athena's benefits.

Data is more valuable than ever, and modern IT teams must ingest, manage and analyze data quickly. Enterprises understand the value of data analytics -- particularly in the cloud -- but many still run into common financial and performance barriers.

Not all data is good data. Improperly structured data can result in high costs for AWS customers. Several AWS tools can optimize data to improve query performance and reduce costs -- and pair well with Amazon Athena, AWS' interactive SQL-based query service.

Data analysts use Amazon Athena to query large amounts of data stored in Amazon Simple Storage Service (S3) with a simple SQL interface. Athena, which is entirely serverless, requires no provisioning and can quickly handle petabyte-scale searches. Athena is Amazon's answer to Google BigQuery, which has grown in popularity since its launch in 2011.

Unlike BigQuery, Amazon Athena doesn't require companies to store data in a specific format, nor does it require developers to write custom code to upload data into it. Instead, Athena executes queries directly against files stored in S3. Developers can optimize data for Athena to improve query performance and reduce costs, but the service also supports traditional formats, such as CSV files. To get started, upload files in any supported format into an S3 bucket and point Athena at that bucket.


IT teams should plan queries ahead of time and avoid running them repeatedly against large data sets. Because AWS charges per terabyte of data scanned -- $5 per TB as of this writing -- costs can add up quickly if an application accidentally queries large amounts of data every few minutes. For example, a job that scans 1 TB of logs every five minutes costs $5 per run, or more than $1,400 per day. To prevent this, compress and store data in a columnar format, such as Apache Parquet, before uploading to S3, which cuts the amount of data each query has to scan.

Format data in S3

Amazon Athena uses standard SQL, and developers often use big data SQL back ends to track usage analytics, as they can handle and manipulate large volumes of data to form useful reports. For example, an application could log every action a user takes in a given session and then create a log file in Amazon S3. A typical analytics log might track user ID, action name, action value, browser, location, IP address, start time and end time.


The application could store a log from a single session in S3 as a CSV or plain text file, but Athena works more efficiently with data stored in Parquet format. Apache provides a C++ library for Parquet, as well as adapters for other languages, such as Node.js.

The node-parquet module is simple to use and shows how to compress data into Parquet format for S3 and Athena. The following code writes session data out in Parquet format and then uploads the file to S3:

const fs = require('fs');
const AWS = require('aws-sdk');
const parquet = require('node-parquet');

const s3 = new AWS.S3();

// Placeholder: in a real application, session_id would come from the
// user's session rather than being hardcoded.
const session_id = 'example-session';

// Every field must be declared in the Parquet schema up front;
// only action_value is allowed to be null.
var schema = {
  user_id: {type: 'int32'},
  action_name: {type: 'byte_array'},
  action_value: {type: 'byte_array', optional: true},
  browser: {type: 'byte_array'},
  location: {type: 'byte_array'},
  ip_address: {type: 'byte_array'},
  start_time: {type: 'int32'},
  end_time: {type: 'int32'},
};

var data = [
  [ 1234, 'Login', null, 'Chrome', 'Ohio', '10.0.0.1', 1496941960, 1496942080],
  [ 1234, 'Search', 'Chris Moyer', 'Chrome', 'Ohio', '10.0.0.1', 1496941960, 1496942080],
  [ 1234, 'View', 'Book: Building Applications in the Cloud', 'Chrome', 'Ohio', '10.0.0.1', 1496941960, 1496942080],
  [ 1234, 'Add to Cart', 'Building Applications in the Cloud', 'Chrome', 'Ohio', '10.0.0.1', 1496941960, 1496942080],
  [ 1234, 'Checkout', '1 Item', 'Chrome', 'Ohio', '10.0.0.1', 1496951960, 1496952080],
];

// node-parquet writes synchronously; once close() returns, the file is
// complete and ready to upload.
var writer = new parquet.ParquetWriter(`${session_id}.parquet`, schema);
writer.write(data);
writer.close();

s3.upload({
  Bucket: 'my-parquet-data',
  Key: `${session_id}.parquet`,
  Body: fs.createReadStream(`${session_id}.parquet`),
}).promise().catch(e => {
  // Retry logic if there's an issue, or alert
  // if nothing can be retried.
});

Parquet files have a strict format, so define the schema before any data is written. This ensures that all files conform to the same schema and enables Athena to process the data correctly and efficiently. Developers can also use compression, such as GZIP, to further reduce the amount of data scanned and improve query performance.
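For instance, here is a minimal sketch of enabling compression with node-parquet, assuming the codec name is passed through to the underlying parquet-cpp bindings (the module's own examples use 'snappy'):

const parquet = require('node-parquet');

// Reuses the schema, data and session_id from the example above; the
// optional third argument selects the compression codec ('gzip' is
// assumed to be supported by the underlying library).
var writer = new parquet.ParquetWriter(`${session_id}.parquet`, schema, 'gzip');
writer.write(data);
writer.close();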

Query data with Athena

After uploading data to S3, create a virtual table in Athena to query the data. The table definition tells Athena how to parse S3 data into an SQL-compatible format: for text formats, use regular expressions; for Parquet, simply specify Parquet as the data format, and Athena will map files into the table schema based on how they're stored. Be sure to specify which columns to load into Athena, using the same names and types created when saving the data to S3.
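As an illustrative sketch -- the table, database and bucket names here are hypothetical -- the table can be created programmatically through the Athena API in the AWS SDK for JavaScript, with a standard CREATE EXTERNAL TABLE statement whose columns mirror the Parquet schema above:

const AWS = require('aws-sdk');
const athena = new AWS.Athena({region: 'us-east-1'});

// Columns match the names and types written into the Parquet files.
const ddl = `
  CREATE EXTERNAL TABLE IF NOT EXISTS user_actions (
    user_id int,
    action_name string,
    action_value string,
    browser string,
    location string,
    ip_address string,
    start_time int,
    end_time int
  )
  STORED AS PARQUET
  LOCATION 's3://my-parquet-data/'`;

athena.startQueryExecution({
  QueryString: ddl,
  QueryExecutionContext: {Database: 'default'},
  // Athena writes query results and metadata to this bucket.
  ResultConfiguration: {OutputLocation: 's3://my-athena-results/'},
}).promise().catch(console.error);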

Partitions, which behave somewhat like indexes, can improve query performance and reduce the amount of data scanned. For example, partition on user ID and action name to quickly group all activities from a particular user or action type.
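To make the mechanics concrete -- the S3 layout and table name below are hypothetical -- Athena derives partition values from the S3 key structure, so partition columns are declared separately from the data columns, and new partitions must be loaded after files arrive:

// Partition columns map to S3 prefixes such as:
//   s3://my-parquet-data/user_id=1234/action_name=Login/session.parquet
const partitionedDdl = `
  CREATE EXTERNAL TABLE IF NOT EXISTS user_actions_by_user (
    action_value string,
    browser string,
    location string,
    ip_address string,
    start_time int,
    end_time int
  )
  PARTITIONED BY (user_id int, action_name string)
  STORED AS PARQUET
  LOCATION 's3://my-parquet-data/'`;

// After files land under new prefixes, register the partitions; both
// statements run through athena.startQueryExecution() as shown above.
const loadPartitions = 'MSCK REPAIR TABLE user_actions_by_user';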

After completing the Athena configuration, any SQL query that runs -- whether through the AWS Management Console or the Athena API -- searches S3 directly. There's no caching or synchronization process; results always reflect the current data in S3, and the IT team pays only when a query runs.
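Here's a rough sketch of that flow through the API, reusing the athena client from the earlier sketch. Queries run asynchronously, so the code starts the query, polls until it completes and then fetches the results (the polling interval and query are illustrative):

async function runQuery(sql) {
  const {QueryExecutionId} = await athena.startQueryExecution({
    QueryString: sql,
    QueryExecutionContext: {Database: 'default'},
    ResultConfiguration: {OutputLocation: 's3://my-athena-results/'},
  }).promise();

  // Poll until the query finishes; Athena queries are asynchronous.
  for (;;) {
    const {QueryExecution} = await athena.getQueryExecution({QueryExecutionId}).promise();
    const state = QueryExecution.Status.State;
    if (state === 'SUCCEEDED') break;
    if (state === 'FAILED' || state === 'CANCELLED') {
      throw new Error(`Query ${state}: ${QueryExecution.Status.StateChangeReason}`);
    }
    await new Promise(resolve => setTimeout(resolve, 1000));
  }

  return athena.getQueryResults({QueryExecutionId}).promise();
}

// Example: count how many times each user checked out.
runQuery(`SELECT user_id, count(*) AS checkouts
          FROM user_actions
          WHERE action_name = 'Checkout'
          GROUP BY user_id`)
  .then(results => console.log(JSON.stringify(results.ResultSet.Rows)))
  .catch(console.error);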

Amazon Athena is helpful for developers who don't want to maintain a running SQL server, but it's only cost-efficient for infrequent queries. Applications that constantly run queries against the same data should instead load that data into a full database, such as Amazon Aurora, or a data warehouse, such as Amazon Redshift.

