The state of the enterprise cloud and prepping for AWS re:Invent 2013
A comprehensive collection of articles, videos and more, hand-picked by our editors
Amazon Redshift, a data warehouse service by Amazon Web Services (AWS), might appeal to some, but keep in mind
how it differs from an on-premises version. There is much in Redshift's favor as an alternative to on-premises data warehouses, especially if you understand distinctive characteristics of the service and use those to your advantage.
Amazon Redshift brings data warehousing to the realm of Platform as a Service offerings. The service is based on a customized version of PostgreSQL, an open source relational database with features that rival commercial relational database management systems. Relational databases have supported clusters of servers for many years, but early versions were difficult to implement and manage.
Redshift was designed to alleviate the frustrating, time-consuming challenges database clusters have imposed on database administrators (DBAs) in the past. DBAs use the Amazon control panel to create clusters with up to 16 compute nodes, each configured to persistently store either 2 terabytes or 16 TB.
Lower storage costs for Redshift customers
Redshift is a columnar data store, so data is organized by columns rather than rows when stored on disk. This can reduce the number of I/O operations required when selecting data based on some column value -- such as selecting all products with sales greater than 10,000 units last month -- and allows for more effective data compression, which ultimately leads to lower storage costs for customers.
Redshift was designed to alleviate the frustrating, time-consuming challenges database clusters have imposed on database administrators in the past.
Like any good sales offer, Amazon Redshift comes with the attention-grabbing promise to run your data warehouse for $1,000 per terabyte per year. Not surprisingly, it's beneficial to study the offer. That $1,000 per terabyte per year is the cost of running on a 2 TB node with reserved pricing. If you work with a small data warehouse, you might be able to get by with a single-node data warehouse. Only the 2 TB instance, known as the dw.hs1.xlarge server, is available in a single-node configuration; the 16 TB instance, dw.hs1.8xlarge, is reserved for clusters.
In addition to the compute nodes, which store data and execute queries, you will need a leader node. The leader node receives queries from clients, creates execution plans, sends queries to compute nodes and aggregates the results. Amazon only charges for the compute nodes; there is no charge for leader nodes.
Pricing is based on virtual machine size. The on-demand rate of the 2 TB node is $0.85 per hour, while the 16 TB node will cost you $6.80 per hour. Amazon Redshift is available in the U.S. East, U.S. West and E.U. (Ireland) availability zones.
Reserved instances can cut your costs, but they do require a one- or three-year commitment when purchased directly from AWS. Instances can also be purchased through another customer in the Amazon Marketplace. Customers selling their instances set the prices and determine the types of machine instances offered in the marketplace.
Since Amazon Redshift is fairly new, you may not find many data warehouse instances right now. Alternatively, you may decide to buy a reserved instance with the assumption that you can always try to sell any unused and unnecessary time on your contract.
The price of the data warehouse nodes includes the cost of storage on the compute nodes, as well as equal amounts of Amazon Simple Storage Service (S3) storage for backup. If you store more than the amount of storage in your data warehouse, you will incur standard S3 charges for the excess storage.
Database maintenance through Amazon Redshift
Redshift's performance coincides with the ability to parallelize operations. Queries are distributed across nodes in the cluster so each node does a fraction of the overall workload. Do not skew the workloads with the distribution of data across nodes. By default, Redshift will use a round-robin algorithm to distribute data across all nodes in the cluster. If you choose to instead distribute data based on a key in your table, then select the key carefully to avoid bottlenecks in a subset of nodes.
All databases need some level of maintenance, and Redshift performs some of the most common tasks, including performing backups and patching software. DBAs will still need to monitor performance related to database design and data loading. RedShift performs update operations by deleting existing rows and appending new rows to a table. This can help improve runtime performance but also lead to fragmentation in storage. DBAs will need to run the VACUUM command periodically to reclaim unused space. DBAs should also become familiar with the ANALYZE command for examining query-execution plans, a key technique to understanding slow-running queries.
About the author
Dan Sullivan, M.Sc., is an author, systems architect and consultant with more than 20 years of IT experience. He has had engagements in advanced analytics, systems architecture, database design, enterprise security and business intelligence. He has worked in a broad range of industries, including financial services, manufacturing, pharmaceuticals, software development, government, retail and education. Dan has written extensively about topics that range from data warehousing, cloud computing and advanced analytics to security management, collaboration and text mining.