"Big data" is, well, big. The massive amount of structured and unstructured information -- typically more than a petabyte -- will bog down most traditional approaches to data management. The on-premises costs would break the budgets of most Forbes Global 2000 companies as well as government agencies.
That's where the cloud comes in. Cloud providers such as Amazon Web Services are offering powerful, cost-effective approaches to support and analyze big data. Typically priced on a per-use basis, these cloud services are set to revolutionize the ways we understand our own businesses.
This is not just data formatted and restructured to drive useful reporting; it's operational data that provides a real-time look at the business. We can also link this analytics functionality to dynamic business processes that can make organizations "self-healing" or "self-optimizing." That is where the true value exists.
AWS's big data analytics offerings are somewhat confusing. I spend a great deal of my time explaining my understanding of the company's lineup to my clients: what's there and why.
Go with the flow
Data integration is the first problem you need to consider when doing big data analytics in the public cloud, whether it's with AWS or another provider. Your data needs to flow from operational data stores in your organization to your big data systems, most likely in the cloud.
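AWS supports data transfer services, such as AWS Direct Connect, a dedicated network link between your data center and AWS that can move big data into and out of the cloud. It isn't free (AWS bills for port hours and outbound data transfer), and moving very large data sets over it still takes time, but it's fine when you don't need real time.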
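To get a feel for how long a bulk transfer takes over a dedicated link, here is a rough back-of-the-envelope sketch. The link speeds (1 Gbps and 10 Gbps) and the 80% sustained-efficiency figure are illustrative assumptions, not numbers from AWS or from this article, and the function name is hypothetical.

```python
def transfer_days(data_bytes: float, link_bits_per_sec: float,
                  efficiency: float = 0.8) -> float:
    """Estimate days needed to move data_bytes over a network link.

    efficiency is the assumed fraction of the nominal link speed that
    is actually sustained (an illustrative guess, not a measured value).
    """
    if not 0 < efficiency <= 1:
        raise ValueError("efficiency must be in (0, 1]")
    seconds = (data_bytes * 8) / (link_bits_per_sec * efficiency)
    return seconds / 86_400  # seconds per day


PETABYTE = 10**15  # decimal petabyte, in bytes
GBPS = 10**9       # bits per second

for label, speed in (("1 Gbps", 1 * GBPS), ("10 Gbps", 10 * GBPS)):
    print(f"{label}: {transfer_days(PETABYTE, speed):.1f} days")
```

Under these assumptions, a full petabyte needs roughly 116 days at 1 Gbps and about 12 days at 10 Gbps, which is why bulk loads are best planned as batch jobs rather than anything close to real time.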
Another middleware-type service is Amazon Kinesis. This is a cloud service for real-time processing of streaming big data. It supports data throughput from megabytes to gigabytes of data per second, and it can handle streams from hundreds of thousands of different data sources. Think of running several data streams from multiple data sources in your organization to your database of choice on AWS.
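Kinesis scales by splitting a stream into shards, so sizing a stream is mostly arithmetic. The sketch below assumes the commonly published per-shard write limits for provisioned streams (1 MB per second and 1,000 records per second); treat those constants as assumptions and check them against current AWS documentation. The function name and the sample workload are hypothetical.

```python
import math

# Assumed per-shard write limits; verify against current AWS documentation.
SHARD_WRITE_BYTES_PER_SEC = 1_000_000   # 1 MB/s, treated here as decimal MB
SHARD_WRITE_RECORDS_PER_SEC = 1_000


def shards_needed(records_per_sec: int, avg_record_bytes: int) -> int:
    """Return the minimum shard count that satisfies both write limits."""
    by_bytes = math.ceil(
        records_per_sec * avg_record_bytes / SHARD_WRITE_BYTES_PER_SEC
    )
    by_records = math.ceil(records_per_sec / SHARD_WRITE_RECORDS_PER_SEC)
    return max(1, by_bytes, by_records)  # a stream always has at least one shard


# Hypothetical workload: 50,000 sources, each sending one 2,000-byte record per second.
print(shards_needed(records_per_sec=50_000, avg_record_bytes=2_000))
```

Whichever limit is hit first, bytes or record count, determines the shard count, so many small records can cost as much capacity as fewer large ones.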
Moving from the middleware to the actual databases, the AWS services catalog has a mix of SQL and NoSQL database technology. Amazon DynamoDB is a managed NoSQL database service that many organizations have found valuable. Its guaranteed throughput and single-digit millisecond latency make it a good fit for big data projects where quick interaction with the data is a must-have, such as mobile computing support.
Places for big data
If you're looking for simplicity, then Amazon Relational Database Service (RDS) is a well-designed managed relational database service that can scale in the AWS cloud. RDS is a good fit for big data systems that need to stay in the relational model and won't reach petabyte scale -- most won't. For the systems that will, you need Amazon Redshift (take that, Oracle), which is a petabyte-scale database designed and built specifically to support big data analytics and traditional data warehousing.
Redshift uses columnar storage and distributed queries, both of which should be familiar to people who manage on-premises data warehouses. The difference is the price: Redshift costs less than $1,000 per terabyte per year.
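The per-terabyte figure is easy to turn into an annual budget ceiling. This is a minimal sketch that takes the "less than $1,000 per terabyte per year" number at face value as an upper bound. The function name and the sample warehouse sizes are hypothetical, and actual pricing depends on node type, region and commitment term.

```python
def redshift_cost_ceiling(terabytes: float,
                          price_per_tb_year: float = 1_000) -> float:
    """Upper-bound annual cost, using the article's per-terabyte figure."""
    return terabytes * price_per_tb_year


# Hypothetical warehouse sizes, from modest up to a full petabyte (1,000 TB).
for size_tb in (10, 100, 1_000):
    print(f"{size_tb:>5} TB: under ${redshift_cost_ceiling(size_tb):,.0f} per year")
```

On this reading, even a full petabyte stays under $1 million per year.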
AWS provides several options for big data analytics in the public cloud. Most requirements can be met with AWS technology, but AWS is not the only cloud that provides big data technology on demand. Google and Microsoft have competitive systems, and some of the smaller players have interesting offerings as well. But AWS offers one-stop shopping for the architects and developers who build big data systems -- and the catalog of database services and middleware is compelling.
About the author:
David "Dave" S. Linthicum is senior vice president of Cloud Technology Partners and an internationally recognized cloud industry expert and thought leader. He is the author or co-author of 13 books on computing, including the best-selling Enterprise Application Integration. Linthicum keynotes at many leading technology conferences on cloud computing, SOA, enterprise application integration and enterprise architecture. His latest book is Cloud Computing and SOA Convergence in Your Enterprise: A Step-by-Step Guide.