Cloud data warehouse guide: Using Redshift, rival platforms
Data warehouses back in the day were hugely expensive. No matter which database you deployed to, you could count on a cool million-dollar minimum investment to get things up and running.
A data warehouse, while often defined differently, is just a database that contains abstractions of transactional data used for making business decisions. Data analysis tools run against the data, and the output is different views of that data, including reports and visualizations.
In the past, Oracle and other larger enterprise players ruled the data warehousing world. With the emergence of cloud-based solutions, such as data warehouse storage that runs on public cloud providers, the cost of building and deploying a data warehouse has been significantly reduced.
The Amazon Redshift service is an obvious jab at Oracle. However, it's changing the game in terms of what's possible when building data warehouses in the cloud. Redshift can provide fast query performance by leveraging columnar storage approaches and technology, much of which is taken from enterprise database technology.
A columnar database is a concept or architecture. Many columnar databases build upon traditional, row-oriented database management systems. They simply store the information in tables with one or two columns, and add the necessary layers to access the columnar data. Redshift is just an instance of that technology, but what's unique about it is the way that we consume Redshift as a public cloud service.
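To make the row-versus-column idea concrete, here is a minimal Python sketch. It is only an illustration of the concept, not how Redshift is implemented, and the sample orders and the `to_columns` helper are made up for this example. It shows why totaling a single field touches fewer values when the data is laid out by column.

```python
# Hypothetical sample data: three orders with three fields each.
rows = [
    {"order_id": 1, "region": "east", "amount": 120.0},
    {"order_id": 2, "region": "west", "amount": 75.5},
    {"order_id": 3, "region": "east", "amount": 210.0},
]


def to_columns(records):
    """Pivot a list of row dicts into one list per column."""
    columns = {}
    for record in records:
        for name, value in record.items():
            columns.setdefault(name, []).append(value)
    return columns


columns = to_columns(rows)

# A row store reads whole records, so count every field as touched
# even though only "amount" is needed for the total.
row_values_touched = sum(len(record) for record in rows)
row_total = sum(record["amount"] for record in rows)

# A column store reads only the "amount" list.
column_values_touched = len(columns["amount"])
column_total = sum(columns["amount"])

print(row_total, row_values_touched)        # same total, 9 values touched
print(column_total, column_values_touched)  # same total, 3 values touched
```

Both layouts give the same answer, but the columnar one reads a third of the values here. On a wide table with billions of rows, that difference is where the I/O savings come from.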
The use of columnar storage, as in Redshift, improves I/O efficiency and makes it easier to parallelize queries across many server instances. Because the service instances are expandable and on-demand in the world of Amazon Web Services, it's a simple matter of auto- or self-provisioning the service instances needed to support the query, and then returning them once the operation is complete.
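The parallel part can be sketched in a few lines of Python. This is a toy model under stated assumptions: worker threads stand in for server instances, and `split_into_slices` and `parallel_sum` are hypothetical helpers, not anything Redshift exposes. Each worker totals its own share of a column, and the partial results are combined at the end.

```python
from concurrent.futures import ThreadPoolExecutor


def split_into_slices(values, slice_count):
    """Deal values round-robin into slice_count partitions."""
    return [values[i::slice_count] for i in range(slice_count)]


def parallel_sum(values, slice_count=4):
    """Sum each slice on its own worker, then combine the partial sums."""
    slices = split_into_slices(values, slice_count)
    with ThreadPoolExecutor(max_workers=slice_count) as pool:
        partial_sums = list(pool.map(sum, slices))
    return sum(partial_sums)


# Hypothetical column of 1,000 values spread over four workers.
amounts = list(range(1, 1001))
total = parallel_sum(amounts, slice_count=4)
print(total)
```

Adding more workers means each one handles a smaller slice, which is the same reasoning behind adding instances to a cluster for a heavy query.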
Redshift leverages standard PostgreSQL, JDBC and ODBC drivers, so existing clients that speak SQL (Structured Query Language) can connect to it. Data load speed scales linearly with cluster size, with integrations to Amazon S3, Amazon DynamoDB, Amazon Elastic MapReduce, Amazon Kinesis or any SSH-enabled host. So, in other words, Redshift is a big honking columnar database that's highly scalable and highly cost-effective.
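As a rough idea of what a load from Amazon S3 looks like, the Python sketch below builds a Redshift COPY statement as a string. The table name, bucket path and IAM role ARN are placeholders invented for this example, and `build_copy_statement` is a hypothetical helper. The resulting SQL would be sent through whichever PostgreSQL, JDBC or ODBC client you already use.

```python
def build_copy_statement(table, s3_uri, iam_role_arn):
    """Return a Redshift COPY statement that loads CSV files from S3.

    Illustration only: inputs are not validated or escaped, so do not
    pass untrusted values to this helper.
    """
    return (
        f"COPY {table} "
        f"FROM '{s3_uri}' "
        f"IAM_ROLE '{iam_role_arn}' "
        "FORMAT AS CSV;"
    )


# All three values below are hypothetical placeholders.
statement = build_copy_statement(
    table="sales",
    s3_uri="s3://example-bucket/sales/",
    iam_role_arn="arn:aws:iam::123456789012:role/ExampleRedshiftRole",
)
print(statement)
```

Check the COPY options against the current Amazon documentation before using it; file format, compression and authorization settings vary by setup.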
The core consideration is cost-to-value. Clearly, when leveraging cloud-delivered services versus your own hardware and software, the cloud will win out most of the time. Redshift is no different if you're looking for a solid data warehouse that's both high-performing and cost-effective. It's tough to find both. A couple of things to consider are integration and the data itself.
Keep in mind that most, if not all, of your data resides on-premises, and thus must be moved to Redshift at some point. There are easy ways to do this; however, large amounts of data that must move daily or weekly could be more cumbersome and problematic than you think.
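A quick back-of-the-envelope calculation shows how fast this adds up. The Python sketch below estimates upload time from data volume and link speed. The 5 TB batch, the 1 Gbps link and the 80% usable-bandwidth figure are assumptions chosen for illustration, and `transfer_hours` is a hypothetical helper.

```python
def transfer_hours(data_terabytes, link_gbps, utilization=0.8):
    """Estimate hours needed to move data over a network link.

    Uses decimal units (1 TB = 1e12 bytes, 1 Gbps = 1e9 bits/second)
    and assumes only a fraction of the link is usable.
    """
    bits = data_terabytes * 1e12 * 8
    effective_bits_per_second = link_gbps * 1e9 * utilization
    return bits / effective_bits_per_second / 3600


# Assumed scenario: a 5 TB batch over a 1 Gbps link.
hours = transfer_hours(data_terabytes=5, link_gbps=1)
print(round(hours, 1))  # roughly 13.9 hours
```

Under those assumptions a single 5 TB load takes more than half a day, which is manageable weekly but tight as a daily job. On a 100 Mbps link the same batch would take about ten times as long.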
The data could prove to be a problem as well, if there are regulations that force you to handle the data in certain ways, including issues with placing data on public clouds. These data-compliance issues seem to be falling by the wayside now that cloud computing is more mainstream. Check into regulations specific to your business before making the move.
So, to Redshift or not to Redshift? Most of the time you should Redshift, or make sure it's a core consideration. The money and time saved will be significant, and data warehousing won't just be for rich enterprises anymore. Some say that's a good thing.
About the author:
David Linthicum is the chief technology officer (CTO) of Blue Mountain Labs, the author or coauthor of at least 13 books on computing and an internationally known distributed computing and application integration expert. He has more than twenty years of experience in the integration technology industry, most recently as CTO at Grand Central Communications. He has consulted for hundreds of major corporations engaged in systems analysis, design and development, with a concentration in complex distributed systems. You can reach him via email at email@example.com, follow him on Twitter @DavidLinthicum or view his profile on LinkedIn.