There is little debate that Hadoop has brought big data analytics to the masses. Given the need to crunch a massive amount of data and potentially leverage massive processing power spread out across cheap commodity hardware, few available options are as popular and as egalitarian as Hadoop's MapReduce model. But no matter how much one might admire Hadoop's effectiveness, there's no getting around the fact that MapReduce is, at its heart, a batch process. It has not provided fast, dependable and, most importantly, predictable response times. When data is fed to Hadoop, the data is crunched and results are generated, but the amount of time required to process a given request is indeterminate, and as a result, organizations have had a very difficult time building scalable, real-time solutions around Hadoop.
Amazon has always supported the running of Hadoop applications on its cloud, giving clients the ability to take advantage of Amazon's massive processing power. But even with Amazon's Elastic MapReduce service, the aspiration of achieving real-time processing has remained elusive. Amazon has decided to change that. Never content to let a big data problem go unsolved, Amazon has introduced an ambitious new project named Kinesis that promises to bring real-time responsiveness to big data processing in the cloud.
The third phase of big data processing
Kinesis addresses this new, third phase of big data processing that the industry is currently going through. The first phase was simply figuring out how organizations could store and manage the massive amounts of data they were generating. Phase one produced a wide variety of NoSQL-based systems that managed to store massive amounts of data cheaply, quickly and effectively.
The second phase brought in technologies like MapReduce, as organizations wanted to process all of the data that phase one had them successfully storing. The third phase of big data processing is now upon us. Organizations want not only to process massive amounts of data while it is being created, but also to respond to clients with real-time results. "A pattern began to emerge as we talked with customers," said Ryan Waite, GM of AWS Data Services, at the AWS Summit in San Francisco. "First, collect all the data. Then start to run some sort of dashboard on top of it, and then start to respond in real time to the data set."
Dealing with data in a new way
So what does Kinesis do differently? First off, all incoming data is stored immediately across several availability zones, a serious departure from other big data solutions that embrace the philosophy of eventual consistency. And once stored, that data is available for a 24-hour window in which it can be "read, re-read, backfilled and analyzed." The difficult job of load balancing, fault tolerance and managing distributed systems is handled by the framework, allowing developers to concentrate on building applications that can solve big data problems in real time. The solution can not only manage massive streams of incoming data, but also make that data available for immediate processing and subsequent sorting and storage.
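To make the 24-hour window a little more concrete, here is a minimal Python sketch of a stream that retains records for a fixed period and lets a reader start, or start over, from any position. It is a toy in-memory model built on assumptions, not Amazon's API: the names ToyStream, Record, put and read_from are hypothetical, and the sketch leaves out replication across availability zones, load balancing and everything else the real service handles.

```python
from collections import deque
from dataclasses import dataclass, field

RETENTION_SECONDS = 24 * 60 * 60  # the 24-hour window described in the article


@dataclass
class Record:
    sequence: int      # position in the stream, assigned on write
    timestamp: float   # arrival time, in seconds
    payload: str


@dataclass
class ToyStream:
    """Hypothetical in-memory stand-in for a retained stream (not an AWS API)."""
    records: deque = field(default_factory=deque)
    next_sequence: int = 0

    def put(self, payload: str, now: float) -> int:
        """Append a record and return the sequence number it was given."""
        self._expire(now)
        record = Record(self.next_sequence, now, payload)
        self.records.append(record)
        self.next_sequence += 1
        return record.sequence

    def read_from(self, sequence: int, now: float) -> list:
        """Return payloads of every retained record at or after `sequence`."""
        self._expire(now)
        return [r.payload for r in self.records if r.sequence >= sequence]

    def _expire(self, now: float) -> None:
        """Drop records that have aged out of the retention window."""
        while self.records and now - self.records[0].timestamp > RETENTION_SECONDS:
            self.records.popleft()


stream = ToyStream()
stream.put("click:home", now=0.0)
stream.put("click:cart", now=3600.0)

first_read = stream.read_from(0, now=7200.0)
second_read = stream.read_from(0, now=7200.0)  # re-reading returns the same data
late_read = stream.read_from(0, now=RETENTION_SECONDS + 1800.0)  # oldest record has expired
```

Reading does not consume anything in this model, which is what makes re-reading and backfilling possible: a second application, or the same one after a crash, can start again from an earlier sequence number as long as the records are still inside the window.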
The industry is moving beyond simply storing big data and processing it in batch mode. With Kinesis, Amazon is bringing real-time analytics to big data.