BACKGROUND IMAGE: iSTOCK/GETTY IMAGES
Data analysis tools continue to evolve from simple decision support systems to contemporary, cloud-based database services. There currently are two Amazon data warehouse services adept at handling data analysis workflows: Amazon Redshift and Amazon Relational Database Service.
Although Amazon Redshift and Relational Database Service (RDS) both target analytics, they have specific use cases. Here are four questions to ask when trying to decide which is best for your enterprise.
1. Can you load data the way Amazon Redshift does?
If you have an existing data warehouse or data mart, you probably have invested time and effort in creating custom ETL scripts to load the data store. If you plan to use Redshift, you'll have to adapt your scripts to how Redshift loads data.
The preferred way to load data into Redshift is through remote hosts, Simple Storage Service or Amazon Web Services Elastic Map Reduce using the COPY command. The COPY command executes loads in parallel and has the option to compress data during the load process.
If you have complicated business logic in ETL scripts that had loaded directly to your existing database, you can redirect the final output to local files or S3 and use COPY to load from there.
Redshift might require some changes to your ETL process, but if you can reuse the ETL scripts and then implement business logic, it's worth the effort to use Redshift, especially if you are working with large volumes of data.
2. Are you willing to adapt to a PostgreSQL-based database?
RDS supports several types of databases, including MySQL, Oracle, SQL Server and PostgreSQL. If your analytics database uses RDBMS-specific features, such as Oracle text for querying free-form text attributes, you will need to find a Redshift alternative. For example, you might need to replace your RDBMS-specific text search code with calls to Amazon CloudSearch. Of course, you also will need to configure and manage CloudSearch services along with Redshift.
Alternatively, you could use the same relational database you deployed for your existing analysis data store in RDS and continue to use RDBMS-specific features.
In addition to high-level services such as search, you can use finer-grained features related to indexing and constraint enforcement. Redshift uses the UTF-8 character, for example, while newer versions of SQL Server support UTF-16, including supplementary characters. Mapping your data to UTF-8 will present challenges if you've been using UTF-16 or related encodings.
If you're willing to work with a PostgreSQL-based database and you can take advantage of other Redshift features, Redshift should be the platform for your data warehouse.
3. Does your data warehouse require scalability and high availability?
Scaling a relational database is an administrative challenge. You can always scale vertically by deploying your Amazon data warehouse to a larger server -- until you use the largest server and still cannot meet your performance requirements. Vertical scaling does not help to improve availability. A single server can still be a single point of failure.
If you are well-versed in maintaining hot standbys for high availability and read replicas for improving query performance, then you are probably ready to manage a scalable RDS setup. You are responsible for managing many of the database administrator aspects of an RDS database; implementing high availability and scalability will add to that workload.
Amazon Redshift adds to servers to a cluster to scale easily. Adding nodes to a Redshift cluster provides linear or near-linear performance improvements. If you are using two nodes and add two more, you can expect two times the performance. Redshift also manages recovery in the event of a failed node. The disabled node is replaced with a new node and data is restored from S3 automatically.
If scalability and availability are important, Redshift is likely a better option.
4. Does your application enforce integrity constraints?
It may be hard to imagine a relational database management system that does not enforce basic integrity constraints, but Redshift doesn't. This is for performance reasons. It is easy to imagine latency issues that would be introduced if constraints had to be verified by querying multiple nodes and waiting for a response before a write operation completes.
This leaves load programs with the responsibility to perform checks for unique, foreign key and primary key constraints. If your application is resilient to at least some violations in constraints, you could assume the risk associated with not checking. If constraint enforcement is important to your application, consider an RDS-based deployment.
Redshift -- an Amazon data warehouse service -- is ideal for many cloud-based data warehouses. If you can work within the constraints of Redshift, it is good option. And if you are building a data warehouse with large volumes of data, then it may be your only reasonable choice. RDS is a better choice for small data marts that need the database to enforce constraints and support RDBMS-specific features.
About the author:
Dan Sullivan, M.Sc., is an author, systems architect and consultant with more than 20 years of IT experience. He has had engagements in advanced analytics, systems architecture, database design, enterprise security and business intelligence. He has worked in a broad range of industries, including financial services, manufacturing, pharmaceuticals, software development, government, retail and education. Dan has written extensively about topics that range from data warehousing, cloud computing and advanced analytics to security management, collaboration and text mining.