One drawback to the fast pace of AWS' evolution is that it can be a challenge to keep up with new product announcements and maintain a clear sense of what each service does. One such case is the difference between Amazon Kinesis Streams, formerly called Kinesis, and Kinesis Firehose, a pair of AWS big data analytics products.
Amazon Kinesis was introduced a little over two years ago as a service for ingesting and processing large volumes of data from a potentially large number of sources. With Kinesis Streams, an administrator sets up a publisher to create records, a stream -- with multiple shards -- to receive the records, a program to read the records from the stream and, typically, a DynamoDB table to maintain information about which records have been read. There are a fair number of components to track. The Kinesis Streams sample application, for example, has 317 lines of code in its AWS CloudFormation template.
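To make the role of shards more concrete, here is a minimal Python sketch of how a record's partition key decides which shard receives it. Kinesis Streams hashes the partition key with MD5 into a 128-bit integer and routes the record to the shard whose hash key range contains that value. The function names and the evenly split ranges are assumptions made for illustration; this models the routing idea only and does not call AWS.

```python
import hashlib

HASH_SPACE = 2 ** 128  # an MD5 digest read as an integer is below this value


def shard_ranges(shard_count):
    """Split the 128-bit hash key space into equal, contiguous ranges.

    Assumption: ranges are evenly sized, as in a freshly created stream.
    """
    size = HASH_SPACE // shard_count
    ranges = []
    for i in range(shard_count):
        start = i * size
        # The last shard absorbs any remainder so the whole space is covered.
        end = HASH_SPACE - 1 if i == shard_count - 1 else start + size - 1
        ranges.append((start, end))
    return ranges


def shard_for_key(partition_key, ranges):
    """Return the index of the shard whose range contains the key's MD5 hash."""
    hashed = int(hashlib.md5(partition_key.encode("utf-8")).hexdigest(), 16)
    for index, (start, end) in enumerate(ranges):
        if start <= hashed <= end:
            return index
    raise ValueError("hash key outside every shard range")


ranges = shard_ranges(4)
print(shard_for_key("sensor-17", ranges))  # same key always maps to the same shard
```

Because the mapping is deterministic, records that share a partition key stay in order within one shard, which is one reason the choice of key matters when a stream has many publishers.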
At first glance, Kinesis Firehose might appear to be a simplified version of the Streams system. In fact, when AWS unveiled Kinesis Firehose at re:Invent 2015, one of the prominent selling points of the AWS big data analytics tool was the ability to spend more time on an application and less time on infrastructure -- which is one of AWS' ongoing messages. While it's true that Kinesis Firehose is simpler to set up than Streams, that view misses the main distinction between the two AWS big data analytics products.
Kinesis Streams is designed to process large volumes of incoming data, more or less in real time. The sample application, for example, receives records and builds a real-time graph. The raw data itself is not necessarily retained. On the other hand, Kinesis Firehose is designed to import large volumes of data into Amazon Simple Storage Service or Amazon Redshift; raw data retention is the whole point of the service. Generally, it's assumed that an administrator will run various forms of data analysis on the data; however, that's outside the scope of Firehose.
In some ways, this mimics the distinction between a data mart and data warehouse -- or data lake. A data mart is a collection of data that usually has been preprocessed and often is about a particular subject. A data warehouse tends to hold a wider array of unprocessed data. And a data lake takes that even further -- holding data that may or may not be valuable based on later business decisions. Kinesis Streams and Kinesis Firehose address somewhat opposite ends of that spectrum.
Because Kinesis Firehose does not process incoming data, it has a simpler configuration. In Kinesis Streams, the admin must specify how many shards are required to form the distributed set of record receivers. A Kinesis Streams application needs to maintain a persistent store, usually in DynamoDB, to keep track of which records have been read and from which shards. While not especially difficult, it does represent another moving part that an admin could get wrong; careful monitoring and management are necessary.
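The bookkeeping that a Streams application has to do can be sketched in a few lines of Python. This is a simplified illustration, not the way any AWS library implements it: `CheckpointStore` and `process_shard` are hypothetical names, an in-memory dictionary stands in for the DynamoDB table, and sequence numbers are modeled as plain integers.

```python
class CheckpointStore:
    """In-memory stand-in for the persistent store (hypothetical class)."""

    def __init__(self):
        self._last_read = {}  # shard id -> last processed sequence number

    def get(self, shard_id):
        return self._last_read.get(shard_id)

    def save(self, shard_id, sequence_number):
        self._last_read[shard_id] = sequence_number


def process_shard(shard_id, records, store, handler):
    """Handle only records newer than the saved checkpoint, then advance it.

    `records` is a list of (sequence_number, data) pairs in ascending order.
    Returns the number of records handled on this call.
    """
    checkpoint = store.get(shard_id)
    handled = 0
    for sequence_number, data in records:
        if checkpoint is not None and sequence_number <= checkpoint:
            continue  # already processed on an earlier run
        handler(data)
        store.save(shard_id, sequence_number)
        handled += 1
    return handled


store = CheckpointStore()
seen = []
process_shard("shard-0", [(1, "a"), (2, "b")], store, seen.append)
# After a restart, the reader skips what it has already handled.
process_shard("shard-0", [(1, "a"), (2, "b"), (3, "c")], store, seen.append)
print(seen)  # ['a', 'b', 'c']
```

If the store is lost or falls out of step with the stream, records are either reprocessed or skipped, which is why this piece needs the monitoring described above.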
In Kinesis Firehose, an administrator still creates or configures record producers but then only specifies a delivery stream to receive the records. The stream itself is a managed service; IT teams can monitor it with a variety of CloudWatch metrics, but AWS maintains the service. Firehose also introduces a Kinesis Agent, which is a Java-based application that can act as a record source. This can provide a simple way to send log records directly to a stream. The Kinesis Agent can also work with Kinesis Streams. In fact, some applications may choose to configure the agent to send data records to both Streams and Firehose if they need both real-time analysis and long-term storage of raw data.
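As a rough illustration of the dual setup, the Python sketch below builds an agent configuration with one flow bound for a Kinesis stream and one bound for a Firehose delivery stream. The `flows`, `filePattern`, `kinesisStream` and `deliveryStream` keys follow the agent's JSON configuration format as we understand it; the helper name, file paths and stream names are placeholders invented for this example, so check the result against the agent documentation before using it.

```python
import json

# Destination keys the agent is assumed to accept in each flow.
DESTINATION_KEYS = {"kinesisStream", "deliveryStream"}


def build_agent_config(flows):
    """Build an agent configuration dictionary.

    `flows` is a list of (file_pattern, destination_key, destination_name).
    """
    config = {"flows": []}
    for file_pattern, destination_key, destination_name in flows:
        if destination_key not in DESTINATION_KEYS:
            raise ValueError("unknown destination: " + destination_key)
        config["flows"].append(
            {"filePattern": file_pattern, destination_key: destination_name}
        )
    return config


config = build_agent_config([
    # Placeholder paths and names, not taken from any real deployment.
    ("/var/log/app/realtime.log*", "kinesisStream", "example-stream"),
    ("/var/log/app/archive.log*", "deliveryStream", "example-delivery-stream"),
])
print(json.dumps(config, indent=2))
```

The printed JSON is the kind of content that would go in the agent's configuration file. Whether one file pattern can feed both destinations at once is something to confirm in the agent documentation, so the sketch uses two separate patterns.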