

Kappa Architecture pushes database to next level

Amazon Aurora allows a database to scale, but Kappa Architecture takes the lid off database limitations -- allowing admins to create materialized views of data across server clusters.

Traditional databases have many advanced features, the most important of which for developers could be the materialized view feature. This type of saved query automatically caches the results of a query that is otherwise time-consuming to construct and writes that data to disk. But traditional databases don't scale well with regard to the amount of data they can store and the throughput of read requests they can accept. While Amazon Aurora can improve a database's ability to scale, there's still a limit to what a single database can handle. Kappa Architecture can help.

Kappa Architecture is a software architecture pattern with an append-only immutable log. A Kappa Architecture is similar to a Lambda architecture system, but without batch processing. Within a database, there is a detail called a transaction log, which is written each time a change is made to the database. The transaction log is used to rebuild parts of the database and restore it to a specific point in time if something goes wrong. The log includes every action that was taken on the underlying data in a very raw way. For example, if a customer were to purchase an item, the transaction log may indicate something like this:

at 1463086280008 row 123456 old value = [ "cust-123", 50 ] new value = [ "cust-123", 25]

This value shows that at unix timestamp 1463086280008, row 123456 was updated from "cust-123", 50 to "cust-123", 25.

This is what is known as a fact, and it indicates that, at a specific time, a value was changed from one value to another. This fact is true -- no matter what happens after this. For instance, if the customer then adds 25 more credits, they still removed 25 at this given time.
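The fact-replay idea can be sketched in a few lines of Python. This is a minimal, illustrative model -- the function and field names are assumptions, not part of any AWS API -- showing how replaying every fact in order reconstructs the current state of the data:

```python
import time

# A minimal sketch of an append-only fact log. Each fact records that, at a
# timestamp, a row changed from one value to another -- mirroring the
# transaction-log entry shown above. All names here are illustrative.

def append_fact(log, row_id, old_value, new_value, ts=None):
    """Facts are only ever appended, never updated or deleted."""
    log.append({
        "ts": ts if ts is not None else int(time.time() * 1000),
        "row": row_id,
        "old": old_value,
        "new": new_value,
    })

def rebuild_view(log):
    """Replay every fact in timestamp order to materialize current state."""
    view = {}
    for fact in sorted(log, key=lambda f: f["ts"]):
        view[fact["row"]] = fact["new"]
    return view

log = []
append_fact(log, 123456, ["cust-123", 50], ["cust-123", 25], ts=1463086280008)
append_fact(log, 123456, ["cust-123", 25], ["cust-123", 50], ts=1463086290000)

print(rebuild_view(log))  # row 123456 reflects only the latest fact
```

Note that the customer's later purchase of 25 more credits does not erase the earlier fact; both remain in the log, and the view simply reflects the most recent one.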

Databases have handled this small detail for years, but it's mostly been a small implementation detail. The Kappa Architecture proposes to use this as the primary source of record, making everything else a materialized view of the data. While traditional databases would perform this type of activity on a single server, this method allows administrators to scale indefinitely and create materialized views of data across clusters of different servers.

Admins can use log streams to view data across clusters of servers.

This log stream becomes an append-only system that accepts only write operations, attaches a timestamp to each write and then writes that information to disk. Because each write operation happens in sequence, a system can start from the beginning and replay every event in the log, in order, to rebuild the entire current view of the database.

This allows developers to keep a back-end system fully operational while adding additional views. Each view of the data can live on a completely isolated system, or on multiple independent systems, as long as each has full access to the entire log stream. The log stream notifies each view of every operation along with the exact timestamp it occurred, so all operations can be applied in order, and each view can use that information to ensure it is properly formed.

How Kappa works on AWS

The most obvious tool for creating a Kappa-like architecture is Amazon Kinesis, which is specifically built to handle streams of data. When a process attaches to a Kinesis stream, it can automatically read every event in sequence to build its own view of the database. This could be used to replicate changes to a database into other data stores, such as Amazon DynamoDB, or even cross regionally to other databases for regional proximity improvements.
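The read-every-event-in-sequence loop can be sketched as follows. This is a simplified model, not production code: the client is injected so the loop can be exercised with a stub, but in practice it would be `boto3.client("kinesis")`, and the stream and shard names are assumptions.

```python
# A sketch of reading a Kinesis shard from the beginning (TRIM_HORIZON) to
# rebuild a view. The client is injected so the loop can be exercised with
# the stub below; in practice it would be boto3.client("kinesis").

def read_all_events(client, stream_name, shard_id):
    """Read every record in one shard, in order, until caught up."""
    iterator = client.get_shard_iterator(
        StreamName=stream_name,
        ShardId=shard_id,
        ShardIteratorType="TRIM_HORIZON",  # start from the oldest record
    )["ShardIterator"]
    events = []
    while iterator:
        resp = client.get_records(ShardIterator=iterator, Limit=100)
        events.extend(resp["Records"])
        if resp.get("MillisBehindLatest", 0) == 0:
            break  # caught up with the tip of the stream
        iterator = resp.get("NextShardIterator")
    return events

# Stub client demonstrating the call shape without touching AWS.
class StubKinesis:
    def get_shard_iterator(self, **kwargs):
        return {"ShardIterator": "iter-0"}

    def get_records(self, ShardIterator, Limit):
        return {"Records": [{"Data": b"fact-1"}], "MillisBehindLatest": 0}

events = read_all_events(StubKinesis(), "orders-stream", "shardId-000000000000")
print(len(events))
```

A real consumer would also iterate over all shards in the stream and checkpoint its position; the Kinesis Client Library handles both concerns.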

An alternative to Kinesis is a DynamoDB stream, which is easy to use if a developer already uses a DynamoDB database. Think of this as an automatic Kinesis stream specifically for DynamoDB. Unfortunately, streams from both Kinesis and DynamoDB only retain records for seven days. Therefore, to bootstrap a new view, developers must first read everything from a master data store that holds the end result of all modifications.

To get a deeper look at how to create a Kappa-style view, let's assume we're working with two DynamoDB tables -- Customer and Orders. The marketing department of a company may commonly ask, "How much money did customer X spend this month?" In a traditional database, finding this data would involve joining Customer tables and Orders tables and then summing up the amount that customer spent.

Taking the Kappa approach -- and because the data is in DynamoDB instead of a relational database -- this type of query would either be known ahead of time or would be something we build a new view for and keep updated. The bulk of the data is in the Orders database, but the marketing department likely doesn't want to know a Customer ID and instead would look for a customer name or email address. In this case, developers create a view that contains rows for Customer ID, Customer Name, Customer Email, Month and Total Spend. You can index this view according to name and email. DynamoDB can handle this automatically, but if we're not using DynamoDB, developers could also create two separate views -- one that indexes according to name and one that indexes according to email.
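The shape of that view can be sketched in plain Python. The field names and lookup structure here are illustrative assumptions, not a DynamoDB schema, but they show how the view answers the marketing question by name or email without a join:

```python
from collections import defaultdict

# A sketch of the materialized view described above: per-customer monthly
# spend, indexed so marketing can look it up by name or email rather than
# by customer ID. All field names are illustrative.

spend_view = defaultdict(float)  # (customer_id, "YYYY-MM") -> total spend
by_email = {}                    # email -> customer_id
by_name = {}                     # name  -> customer_id

def apply_order(order):
    """Fold one order event into the monthly-spend view."""
    key = (order["customer_id"], order["date"][:7])  # "YYYY-MM"
    spend_view[key] += order["amount"]

def apply_customer(customer):
    """Keep the name and email indexes in sync with customer events."""
    by_email[customer["email"]] = customer["customer_id"]
    by_name[customer["name"]] = customer["customer_id"]

def monthly_spend(email, month):
    """Answer 'how much did this customer spend this month?' by email."""
    return spend_view.get((by_email[email], month), 0.0)

apply_customer({"customer_id": "cust-123", "name": "Pat",
                "email": "pat@example.com"})
apply_order({"customer_id": "cust-123", "date": "2016-05-12", "amount": 25.0})
apply_order({"customer_id": "cust-123", "date": "2016-05-20", "amount": 10.0})
print(monthly_spend("pat@example.com", "2016-05"))  # 35.0
```

In DynamoDB, the name and email lookups would map to a table keyed on customer ID with global secondary indexes, or to two separate view tables as described above.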

Linking Kappa Architecture with AWS Lambda

To get started, the view creation script would first attach to the stream and begin to buffer all changes on the DynamoDB stream. These changes come through as OldValue/NewValue pairs, so the script can identify the new value. Once the script begins to buffer the data, it can dump out all of the current data on the table and write it to a new table. After all of the data has been copied, the script starts processing the buffered events. This two-phased approach prevents concurrency issues, ensuring no changes are lost between when the data is copied and when the script begins processing the change events.
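The two-phase bootstrap can be sketched as below. The dicts stand in for DynamoDB tables, and the change format is an assumption; the point is the ordering -- changes buffered during the copy are replayed afterward, so nothing is lost:

```python
# A sketch of the two-phase bootstrap: copy the existing table first while
# stream changes accumulate in a buffer, then replay the buffer so any
# change that arrived mid-copy still lands in the view. Plain dicts stand
# in for DynamoDB tables here.

def bootstrap_view(source_table, change_buffer):
    view = {}
    # Phase 1: copy current table contents (changes keep accumulating
    # in change_buffer while this runs).
    for key, value in source_table.items():
        view[key] = value
    # Phase 2: replay buffered changes in order; the new value wins,
    # and None marks a deletion.
    for change in change_buffer:
        if change["new"] is None:
            view.pop(change["key"], None)
        else:
            view[change["key"]] = change["new"]
    return view

table = {"order-1": {"amount": 25}}
buffered = [{"key": "order-1", "new": {"amount": 30}},   # updated mid-copy
            {"key": "order-2", "new": {"amount": 10}}]   # created mid-copy
print(bootstrap_view(table, buffered))
```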

AWS Lambda is the most efficient way to process change events. After creating a Lambda function that will be used to create this view, go to the Event Sources tab and choose Add event source.

Select 'Add event source' from the Event Sources tab.

In the pop-up, choose DynamoDB.

Select DynamoDB as the event source type.

Next, fill out the details, making sure to choose the right DynamoDB table (Orders), and keep Starting Position set to Trim Horizon. The batch size is the maximum number of records that will be sent to a single invocation of the AWS Lambda function at a time, which means the Lambda function will need to handle multiple events in a single execution.

Configure the DynamoDB table, batch size and starting position for the Lambda function.
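A handler for this event source might look like the following sketch. The record shape (`Records`, `eventName`, `dynamodb.NewImage`) is the standard DynamoDB stream event format; the view-writing helpers are hypothetical placeholders:

```python
# A sketch of a Lambda handler for a DynamoDB stream event source. Because
# the batch size controls how many records arrive per invocation, the
# handler must loop over event["Records"]. update_view and remove_from_view
# are hypothetical stand-ins for writes to the view table.

def handler(event, context):
    processed = 0
    for record in event["Records"]:
        name = record["eventName"]  # INSERT, MODIFY or REMOVE
        image = record["dynamodb"].get("NewImage")
        if name in ("INSERT", "MODIFY") and image:
            update_view(image)
        elif name == "REMOVE":
            remove_from_view(record["dynamodb"]["Keys"])
        processed += 1
    return {"processed": processed}

def update_view(image):
    pass  # fold the new image into the view table here

def remove_from_view(keys):
    pass  # delete the corresponding view rows here
```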

Next, add Amazon DynamoDB permissions to the role so the Lambda function can access the DynamoDB stream. Developers should also add AWS Identity and Access Management permissions to the role so it can write to the new DynamoDB table that will hold the newly created view. Test the Lambda function first, leaving Enable event source on until the function is verified to work. After it's working, temporarily disable the stream using the same Event Sources console until the new view is fully initialized, and be sure to reenable the source afterward.

The last step before reenabling the stream is to copy all existing records to the new view. This involves an outside script reading every record from the Orders database and submitting it to the Lambda function. Developers may want to temporarily increase read throughput so the new view can be generated quickly. Remember that the DynamoDB stream will only keep up to seven days of events in the log.
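One way to sketch that copy step is to wrap each scanned item as a synthetic INSERT record in the same shape as a DynamoDB stream event, so the very same Lambda view-builder processes historical and live changes alike. The item contents are illustrative; in practice the items would come from a boto3 table scan:

```python
# A sketch of the bootstrap copy: package scanned Orders items as a
# DynamoDB-stream-style event payload so they can be submitted to the same
# Lambda function that handles live changes. Item contents are illustrative.

def wrap_as_stream_event(items):
    """Wrap scanned items as synthetic INSERT stream records."""
    return {"Records": [
        {"eventName": "INSERT", "dynamodb": {"NewImage": item, "Keys": {}}}
        for item in items
    ]}

scanned = [{"order_id": {"S": "order-1"}, "amount": {"N": "25"}}]
event = wrap_as_stream_event(scanned)
print(event["Records"][0]["eventName"])
```

The script would then invoke the Lambda function with each wrapped batch, for example via the boto3 Lambda client's `invoke` call.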

Finally, it's a good idea to write an additional handler to process changes to the Customer DynamoDB table, which updates any changes in the new view related to the customer, such as changes to an email address or name.

Kappa Architecture is a nice way to keep a cache of commonly requested queries available to quickly and efficiently answer questions. Unfortunately, it's not intuitive to do with AWS. AWS users can run an Apache Kafka instance and use that as the primary write store of data, which would make it simpler to build new views; DynamoDB could then serve as the view-only source for clients. The downside is that developers would have to manage Kafka and any issues it has with scaling -- compared to Kinesis, which Amazon scales automatically.

Still, using Kafka allows companies to avoid a complete vendor lock-in, as all of the data would be available in a non-proprietary method that could be replicated to alternative data stores if the database had to be rebuilt on a different cloud platform.

Next Steps

Manage and monitor AWS Lambda code

Microservices and Lambda can work together

Pair open source dev tools with AWS Lambda
