
Enabling real-time data processing with Amazon Kinesis

IT teams in search of real-time data processing can use Kinesis, Amazon's cloud-based managed messaging service. Here's a look at its pros and cons.

Public cloud encourages the use of distributed applications, enabling companies to take advantage of a large number of servers to run multiple systems -- from large-scale enterprise applications to targeted microservices. But how can administrators easily move all that data to the appropriate servers? One way is through a message bus. Admins who need to ingest large volumes of data in near real time from multiple sources should consider Amazon Kinesis.

Amazon Kinesis is a managed messaging service from Amazon Web Services (AWS) that offers high performance and low administrative overhead for real-time data processing. The service is designed to accept messages from a large number of sources and distribute them to a variety of consuming applications. Kinesis is modeled along the lines of Apache Kafka, which provides a publish-and-subscribe messaging service.

Amazon Kinesis setup

The first step to set up an Amazon Kinesis publish-and-subscribe platform is to define a Kinesis stream. This is typically done in the AWS Management Console.
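A stream can also be defined from code rather than the console. The following is a minimal sketch, assuming the `client` argument is a Kinesis client from the AWS SDK for Python (for example `boto3.client("kinesis")`, which the article does not name). The helper name, stream name and polling settings are illustrative.

```python
import time


def create_stream_and_wait(client, stream_name, shard_count,
                           poll_seconds=1.0, max_polls=60):
    """Create a stream, then poll until Kinesis reports it as ACTIVE.

    `client` is assumed to expose the create_stream and describe_stream
    calls of a boto3 Kinesis client.
    """
    client.create_stream(StreamName=stream_name, ShardCount=shard_count)
    for _ in range(max_polls):
        description = client.describe_stream(StreamName=stream_name)
        status = description["StreamDescription"]["StreamStatus"]
        if status == "ACTIVE":
            return status
        time.sleep(poll_seconds)  # stream is still being created
    raise TimeoutError(f"{stream_name} did not become ACTIVE in time")
```

A newly created stream is not usable immediately, which is why the sketch waits for the ACTIVE status before returning.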

A stream is a set of resources that receive, store and transfer messages. High-volume data streams can be divided across multiple shards, much like the process of scaling server clusters by using multiple servers. The number of shards you need depends on the average size of messages, the rate at which records are written and the number of consumer applications. The AWS Management Console features a tool to help admins estimate the number of shards needed to meet their requirements.
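The shard estimate can be approximated by hand. The sketch below assumes three per-shard limits: 1,000 records per second and roughly 1 MB per second for writes, and roughly 2 MB per second for reads. The 1 MB write figure comes from AWS documentation rather than this article, and the function name and inputs are illustrative, so treat the result as a starting point rather than a replacement for the console tool.

```python
import math


def estimate_shards(avg_record_kb, records_per_second, consumer_count):
    """Rough shard count from message size, write rate and consumer count."""
    write_kb_per_s = avg_record_kb * records_per_second
    # Every consumer application reads the full stream.
    read_kb_per_s = write_kb_per_s * consumer_count

    needed = max(
        write_kb_per_s / 1000,      # assumed ~1 MB/s write limit per shard
        records_per_second / 1000,  # 1,000 records/s write limit per shard
        read_kb_per_s / 2000,       # ~2 MB/s read limit per shard
    )
    return max(1, math.ceil(needed))


# 5 KB records at 600 records/s, read by 3 consumer applications:
# writes need 3 shards, reads need 4.5, so the estimate rounds up to 5.
print(estimate_shards(5, 600, 3))  # 5
```

In this example the read side, not the write side, drives the shard count, which is why the number of consumer applications matters.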

As with any AWS resource, you need to define access controls. Kinesis privileges allow you to specify the users and roles that can place messages in the stream, get status and details about the stream, and read from the stream.
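As an illustration, the three kinds of access described above map onto IAM policy actions. This sketch builds a policy document as a Python dictionary. The action names are real Kinesis IAM actions, but the role labels, the grouping and the example ARN (account ID and stream name) are hypothetical.

```python
import json

# Hypothetical role labels mapped to the Kinesis actions each one needs.
ACTIONS_BY_ROLE = {
    "producer": ["kinesis:PutRecord", "kinesis:PutRecords"],
    "monitor": ["kinesis:DescribeStream"],
    "consumer": [
        "kinesis:DescribeStream",
        "kinesis:GetShardIterator",
        "kinesis:GetRecords",
    ],
}


def stream_policy(stream_arn, role):
    """Return an IAM policy document granting one role access to one stream."""
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Effect": "Allow",
                "Action": ACTIONS_BY_ROLE[role],
                "Resource": stream_arn,
            }
        ],
    }


# Placeholder ARN; substitute your own region, account ID and stream name.
arn = "arn:aws:kinesis:us-east-1:123456789012:stream/example-stream"
print(json.dumps(stream_policy(arn, "producer"), indent=2))
```

Scoping each policy to a single stream ARN keeps producers and consumers from touching streams they do not own.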

Producer applications are granted permission to write messages to a Kinesis stream. A message consists of a value or payload that will be delivered, such as a JavaScript Object Notation (JSON) structure with key-value pairs, and a partition key, which Kinesis uses to assign the message to a shard. A shard can accept up to 1,000 messages per second; partition keys can be up to 256 characters long. The Kinesis API provides a PutRecord function, which adds one message at a time to a stream; a PutRecords function is available for adding batches of messages.
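A batch write might look like the following sketch. It assumes `client` is a boto3 Kinesis client (an assumption, since the article does not name an SDK) and that each event is a dictionary with a field that can serve as the key routing the message to a shard, which the API calls `PartitionKey`. The helper names and the 500-record batch size, which is the PutRecords per-request maximum, are not from the article.

```python
import json

MAX_BATCH = 500  # PutRecords accepts at most 500 records per request


def to_record(event, key_field):
    """Serialize one event as a JSON payload plus its partition key."""
    return {
        "Data": json.dumps(event).encode("utf-8"),
        "PartitionKey": str(event[key_field]),
    }


def put_batches(client, stream_name, events, key_field):
    """Write events in batches; return how many records Kinesis rejected."""
    failed = 0
    for start in range(0, len(events), MAX_BATCH):
        batch = [to_record(e, key_field) for e in events[start:start + MAX_BATCH]]
        response = client.put_records(StreamName=stream_name, Records=batch)
        # PutRecords can partially succeed, so check the failure count.
        failed += response.get("FailedRecordCount", 0)
    return failed
```

A non-zero return value means some records were not stored, and the caller would need to resend them. This sketch only counts the failures.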

Consumer applications invoke the GetRecords API function. Typically, the consuming application runs this function continually in a loop. Each record can be up to 50 KB in size. A single shard can deliver up to 2 MB of data per second to consumers. If you need additional throughput, add more shards to your stream. The GetRecords function also supports a Limit parameter to specify the maximum number of records to get in a single invocation. Administrators can use this to pace the volume of data accepted by the consumer application, especially during periods of peak writing to the stream.
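The polling loop can be sketched as a generator over a single shard. It assumes `client` is a boto3 Kinesis client (not named in the article). The function name, the `max_polls` cutoff and the pause between calls are illustrative. Reading starts from the oldest stored record (`TRIM_HORIZON`).

```python
import time


def read_shard(client, stream_name, shard_id, limit=100,
               pause_seconds=1.0, max_polls=None):
    """Yield records from one shard, fetching at most `limit` per call."""
    iterator = client.get_shard_iterator(
        StreamName=stream_name,
        ShardId=shard_id,
        ShardIteratorType="TRIM_HORIZON",  # start at the oldest record
    )["ShardIterator"]

    polls = 0
    # NextShardIterator is absent or None once a closed shard is exhausted.
    while iterator and (max_polls is None or polls < max_polls):
        response = client.get_records(ShardIterator=iterator, Limit=limit)
        for record in response["Records"]:
            yield record
        iterator = response.get("NextShardIterator")
        polls += 1
        time.sleep(pause_seconds)  # pace reads to stay under shard limits
```

Lowering `limit` or raising `pause_seconds` slows the consumer down, which is the pacing lever the Limit parameter provides.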

Where Amazon Kinesis falls short

When it comes to real-time data processing, Kinesis has a few limitations. The service keeps messages for up to 24 hours. This is different from Kafka, which can be configured to store messages for much longer time periods. IT teams should allocate sufficient resources to consuming applications to read all messages within 24 hours.

Amazon CloudWatch can help monitor the load on a stream and the throughput of consuming applications. Elastic Load Balancing or Auto Scaling can help ensure there are sufficient compute resources to keep up with the message stream.

If you exceed Kinesis' limits on accepting messages, the message is rejected and you receive a ProvisionedThroughputExceededException error. Exceeding the limits during read operations produces the same error. To add capacity, you can add more shards to your stream.
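Before adding shards, a client can often ride out short bursts by retrying with exponential backoff. This is a common pattern rather than something the article prescribes. The sketch assumes the SDK raises an exception carrying a `response` dictionary with an error code, as boto3's `ClientError` does, and the helper names and delay values are illustrative.

```python
import random
import time

THROTTLE_CODE = "ProvisionedThroughputExceededException"


def is_throttle(error):
    """True if the exception carries the Kinesis throttling error code."""
    response = getattr(error, "response", None) or {}
    return response.get("Error", {}).get("Code") == THROTTLE_CODE


def call_with_backoff(operation, max_attempts=5, base_delay=0.1,
                      sleep=time.sleep):
    """Run `operation`, retrying throttled calls with growing delays."""
    for attempt in range(max_attempts):
        try:
            return operation()
        except Exception as error:
            # Re-raise anything that is not throttling, or the final failure.
            if not is_throttle(error) or attempt == max_attempts - 1:
                raise
            delay = base_delay * (2 ** attempt)
            sleep(delay + random.uniform(0, delay))  # jitter spreads retries
```

A caller would wrap a single request, for example `call_with_backoff(lambda: client.put_record(...))`. If throttling persists across retries, that is the signal to add shards.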

To add shards to a stream, split existing shards. Each split operation only takes a few seconds; however, you can only split one shard at a time. This becomes an issue if you have hundreds of shards, as it will take some time to significantly increase the relative capacity of the stream.
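A split needs a hash key that marks where the new shard begins. One simple choice, sketched below, is the midpoint of the parent shard's hash key range. The sketch assumes `client` is a boto3 Kinesis client and that `shard` is a shard entry as returned by its `describe_stream` call. The helper names are illustrative, and an even split is only one possible strategy.

```python
def split_point(shard):
    """Midpoint of a shard's hash key range, as the decimal string the API expects."""
    key_range = shard["HashKeyRange"]
    start = int(key_range["StartingHashKey"])
    end = int(key_range["EndingHashKey"])
    return str((start + end) // 2)


def split_evenly(client, stream_name, shard):
    """Split one shard into two children of roughly equal hash key range."""
    client.split_shard(
        StreamName=stream_name,
        ShardToSplit=shard["ShardId"],
        NewStartingHashKey=split_point(shard),
    )
```

Because each call splits a single shard, doubling a stream of N shards takes N such calls made one after another, which is the scaling delay described above.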

Development and integration

Kinesis provides a REST API so you can use almost any programming language to write to and read from a stream. The AWS software development kits also provide language-specific bindings for Kinesis functions.

AWS offers several connectors to streamline integration with other services, including DynamoDB, Redshift, Simple Storage Service and Elasticsearch. Kinesis is billed by the shard hour, at $0.015 per shard per hour, and by the number of PUT operations, at $0.028 per 1,000,000 PUTs.
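The two pricing components combine into a simple estimate. The sketch below uses the rates quoted above. The function name and the example workload (4 shards, a 720-hour month, 250 million PUTs) are made up for illustration, and current AWS prices may differ.

```python
SHARD_HOUR_USD = 0.015        # per shard per hour, as quoted above
PUTS_PER_MILLION_USD = 0.028  # per 1,000,000 PUT operations, as quoted above


def stream_cost(shards, hours, puts):
    """Estimated cost in USD for a number of shards, hours and PUT operations."""
    shard_cost = shards * hours * SHARD_HOUR_USD
    put_cost = (puts / 1_000_000) * PUTS_PER_MILLION_USD
    return shard_cost + put_cost


# 4 shards for 720 hours is $43.20; 250 million PUTs add $7.00.
print(f"{stream_cost(4, 720, 250_000_000):.2f}")  # 50.20
```

At these rates the shard hours dominate the bill unless the PUT volume is very high, so shard count is usually the main cost lever.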
