When an IT team replicates data, whether in files or in databases, the main objective is to improve availability and operational performance. AWS data replication involves moving data to a separate repository, which both consolidates data from across the business and avoids placing unnecessary demand on the underlying infrastructure of disparate systems.
Common best practices stress that IT teams should use separate dedicated infrastructure for high-performance analytics operations and business intelligence. But they can also replicate databases to a dedicated repository to allow parallelism, which has the potential to greatly accelerate queries.
AWS provides several data repository options that support enterprise data management and analytics requirements, including Amazon DynamoDB and Amazon Redshift. This tip covers AWS data replication, including common scenarios, challenges and tools.
Challenges with AWS data replication
If an application is query-intensive, admins can use replication to implement basic load balancing and improve performance. A developer can distribute workloads by sending some queries to one or more servers. But one of the main challenges with this is maintaining replica reliability with a consistent process.
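As a minimal sketch of that idea, the router below sends writes to a primary endpoint and rotates reads across replicas in round-robin order. The class name, the endpoint hostnames and the "anything that is not a SELECT is a write" rule are illustrative assumptions, not part of any AWS API.

```python
import itertools


class ReadWriteRouter:
    """Route writes to the primary and spread reads across replicas."""

    def __init__(self, primary, replicas):
        if not replicas:
            raise ValueError("at least one replica is required")
        self.primary = primary
        # cycle() hands out replicas in round-robin order, forever.
        self._replica_cycle = itertools.cycle(replicas)

    def route(self, sql):
        """Return the endpoint that should run this statement."""
        # Simplifying assumption: only plain SELECT statements are reads.
        is_read = sql.lstrip().upper().startswith("SELECT")
        return next(self._replica_cycle) if is_read else self.primary


# Hypothetical endpoints, for illustration only.
router = ReadWriteRouter(
    "primary.example.internal",
    ["replica-1.example.internal", "replica-2.example.internal"],
)
for statement in ["SELECT * FROM orders", "UPDATE orders SET paid = 1", "SELECT 1"]:
    print(statement, "->", router.route(statement))
```

Round-robin is the simplest policy. It does not account for replica lag, so a read that must see a just-committed write would still need to go to the primary.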
This approach requires careful attention to replication design, taking into account the use case of the business. First, a cloud administrator needs to consider the size of the data, the frequency of reads and writes, and the performance expectations of the application. Next, examine the enterprise's infrastructure capabilities, such as data processing throughput, network bandwidth and latency. Finally, look at the new architecture, taking into consideration factors such as instance sizing and database structure -- typically NoSQL or relational.
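A back-of-the-envelope calculation can help relate data size to network bandwidth at this stage. The sketch below is only an estimate: the function name and the default 70% link utilization are assumptions, and it ignores latency, protocol overhead and compression.

```python
def estimated_transfer_hours(data_gb, bandwidth_mbps, utilization=0.7):
    """Estimate the hours needed to copy data_gb over a bandwidth_mbps link.

    utilization is the assumed fraction of the link the transfer can use.
    """
    if data_gb < 0 or bandwidth_mbps <= 0:
        raise ValueError("data_gb must be >= 0 and bandwidth_mbps must be > 0")
    if not 0 < utilization <= 1:
        raise ValueError("utilization must be in the range (0, 1]")
    megabits = data_gb * 1000 * 8  # decimal gigabytes -> megabits
    seconds = megabits / (bandwidth_mbps * utilization)
    return seconds / 3600


# Example: 500 GB over a 1 Gbps link at the default utilization.
print(f"{estimated_transfer_hours(500, 1000):.1f} hours")
```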
AWS offers customers a global presence, scalability and the flexibility to quickly replicate instances, databases, files and object stores. IT teams can also use snapshots to replicate resources across availability zones (AZs) and regions. Other AWS features, such as Multi-AZ deployments with the Amazon Relational Database Service, provide similar replication capabilities. However, use one of the following services if the primary reason for data replication is to improve database query performance.
Multipart upload to S3
A parallel multipart upload to Amazon Simple Storage Service (S3) breaks the database upload into multiple parts. This method provides a fail-safe mechanism: if some parts fail to upload, only those parts need to be sent again, not the entire file. To improve upload speed, IT teams can use multiple threads to process the parts in parallel. Finally, the team can automate the entire process and schedule automatic transfer to its desired database service.
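The split, parallel upload and retry-only-failed-parts pattern can be sketched without any AWS dependency. In the example below, `upload_part` is a hypothetical callable standing in for a real S3 part upload, and `flaky_upload` simulates a single network failure; the part size and retry count are arbitrary illustrative values.

```python
from concurrent.futures import ThreadPoolExecutor


def split_into_parts(data, part_size):
    """Return a dict mapping 1-based part numbers to byte chunks."""
    return {
        index + 1: data[offset:offset + part_size]
        for index, offset in enumerate(range(0, len(data), part_size))
    }


def upload_all_parts(parts, upload_part, max_workers=4, max_attempts=3):
    """Upload parts in parallel, retrying only the parts that failed.

    upload_part(part_number, chunk) is a hypothetical callable that returns
    a receipt for the part or raises an exception on failure.
    """
    receipts = {}
    pending = dict(parts)
    for _ in range(max_attempts):
        if not pending:
            break
        with ThreadPoolExecutor(max_workers=max_workers) as pool:
            futures = {
                number: pool.submit(upload_part, number, chunk)
                for number, chunk in pending.items()
            }
        # Leaving the with-block waits for every submitted upload to finish.
        failed = {}
        for number, future in futures.items():
            if future.exception() is None:
                receipts[number] = future.result()
            else:
                failed[number] = pending[number]
        pending = failed  # the next round retries only these parts
    if pending:
        raise RuntimeError(f"parts still failing: {sorted(pending)}")
    return receipts


attempts = {}


def flaky_upload(number, chunk):
    """Stand-in uploader: part 2 fails on its first attempt."""
    attempts[number] = attempts.get(number, 0) + 1
    if number == 2 and attempts[number] == 1:
        raise IOError("simulated network error")
    return len(chunk)


parts = split_into_parts(b"x" * 25, part_size=10)
receipts = upload_all_parts(parts, flaky_upload)
print(receipts, attempts)
```

In practice, an AWS SDK would normally handle the part bookkeeping; the sketch only illustrates why a failed part does not force the whole upload to restart.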
Schedule migration with AWS Data Pipeline
AWS Data Pipeline makes it easy to schedule regular data movement, such as moving daily log files from a log server to DynamoDB. AWS Data Pipeline can also perform scheduled data migration from on-premises storage to the cloud to support data replication and consolidation from multiple sources. AWS Data Pipeline holds the data stream topology, including different sources, targets and data processing methods. While replicating to S3 stores data as-is, AWS Data Pipeline can transform and enrich in-the-pipe data based on a company's ETL process requirements.
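To make "transform and enrich" concrete, here is a small sketch of the kind of step a pipeline activity might run on log lines before loading them. It does not use the AWS Data Pipeline API; the log format, field names and `source` label are all assumptions made for illustration.

```python
def transform(line):
    """Parse a 'timestamp LEVEL message' log line into a record."""
    timestamp, level, message = line.strip().split(" ", 2)
    return {"timestamp": timestamp, "level": level.upper(), "message": message}


def enrich(record, source):
    """Add fields that were not present in the raw line."""
    return {**record, "source": source, "is_error": record["level"] == "ERROR"}


def run_pipeline(lines, source):
    """Transform and enrich every non-blank line."""
    return [enrich(transform(line), source) for line in lines if line.strip()]


# Hypothetical daily log excerpt.
raw_lines = [
    "2016-05-01T12:00:00 info service started",
    "",
    "2016-05-01T12:00:05 error disk quota exceeded",
]
for record in run_pipeline(raw_lines, source="log-server-1"):
    print(record)
```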
Use cases for high-performance querying
DynamoDB is a proprietary NoSQL database service designed for cost efficiency, high performance and availability at scale. AWS recommends uploading data to DynamoDB with AWS Data Pipeline using a service such as Hive to transform the raw data.
In DynamoDB, the parameters for a query operation and the number of matching keys determine the performance of the query. DynamoDB also supports parallel scanning, but scanning millions of records can take up to 12 seconds. With global secondary indexes, the average response time for the operation ranges from three to five seconds.
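A parallel scan works by dividing the table into segments and letting each worker read one segment. The sketch below shows the pattern using the `Segment`, `TotalSegments` and `ExclusiveStartKey` request parameters and the `Items` and `LastEvaluatedKey` response fields of the DynamoDB Scan operation. The `scan` argument is any callable that accepts those parameters; `fake_scan` is an in-memory stand-in invented here so the example runs without AWS credentials.

```python
from concurrent.futures import ThreadPoolExecutor


def scan_segment(scan, segment, total_segments):
    """Read one segment to the end, following pagination."""
    items = []
    kwargs = {"Segment": segment, "TotalSegments": total_segments}
    while True:
        page = scan(**kwargs)
        items.extend(page.get("Items", []))
        last_key = page.get("LastEvaluatedKey")
        if last_key is None:
            return items
        # Resume the same segment where the previous page stopped.
        kwargs["ExclusiveStartKey"] = last_key


def parallel_scan(scan, total_segments=4):
    """Scan every segment on its own thread and merge the results."""
    with ThreadPoolExecutor(max_workers=total_segments) as pool:
        per_segment = pool.map(
            lambda segment: scan_segment(scan, segment, total_segments),
            range(total_segments),
        )
        return [item for items in per_segment for item in items]


def fake_scan(Segment, TotalSegments, ExclusiveStartKey=None):
    """In-memory stand-in for a table scan: 10 items, 2 items per page."""
    owned = [n for n in range(10) if n % TotalSegments == Segment]
    start = 0
    if ExclusiveStartKey is not None:
        start = owned.index(ExclusiveStartKey["id"]) + 1
    page = owned[start:start + 2]
    response = {"Items": [{"id": n} for n in page]}
    if start + 2 < len(owned):
        response["LastEvaluatedKey"] = {"id": page[-1]}
    return response


all_items = parallel_scan(fake_scan, total_segments=4)
print(sorted(item["id"] for item in all_items))
```

With the AWS SDK for Python, the `scan` method of a boto3 DynamoDB `Table` resource could presumably be passed in place of `fake_scan`, since it accepts the same keyword arguments. The number of segments is a tuning choice that depends on table size and provisioned throughput.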
Data warehouse service Amazon Redshift offers a dedicated analytics environment for high query performance. The simplest way to replicate data to Redshift is through S3 using the COPY command. For more advanced ETL cases, admins can use a service like Data Pipeline to extract the data from S3, process it using Amazon Elastic MapReduce, and push it to Redshift for further analysis. After loading the relational data into Redshift, a developer will need to choose an underlying configuration that fits the database performance requirements.
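For the simple S3 path, a COPY statement might look like the one generated below. This is a sketch: the table name, bucket path and IAM role ARN are placeholders, and the helper assumes CSV files and trusted inputs (it does no SQL escaping).

```python
def build_copy_statement(table, s3_uri, iam_role_arn, ignore_header_rows=0):
    """Build a Redshift COPY statement that loads CSV files from S3.

    All arguments are assumed to come from trusted configuration.
    """
    if not s3_uri.startswith("s3://"):
        raise ValueError("s3_uri must start with s3://")
    lines = [
        f"COPY {table}",
        f"FROM '{s3_uri}'",
        f"IAM_ROLE '{iam_role_arn}'",
        "FORMAT AS CSV",
    ]
    if ignore_header_rows:
        lines.append(f"IGNOREHEADER {ignore_header_rows}")
    return "\n".join(lines) + ";"


# Placeholder identifiers, for illustration only.
statement = build_copy_statement(
    table="sales",
    s3_uri="s3://example-bucket/exports/sales/",
    iam_role_arn="arn:aws:iam::123456789012:role/ExampleRedshiftCopyRole",
    ignore_header_rows=1,
)
print(statement)
```

The resulting statement would then be run against the cluster with whatever SQL client the team already uses.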
Redshift offers two underlying node types: DS1 and DC1. DS1, which was previously named DW1, is a hard disk drive-based instance. By contrast, the DC1 instance -- previously DW2 -- is solid-state drive-based, which allows analysts to run simple queries on terabytes of data with sub-second response times. For more complex queries on the same scale, results typically return in less than 10 seconds. But Redshift, unlike DynamoDB, requires an IT team to manage the underlying resources.
Third-party replication tools
Several AWS partners offer automated replication, migration and processing services. These tools are designed to solve the headaches involved with integration and to enable admins to use AWS BI and analytics tools without the need for complex coding.
Vendor services include Panoply.io and ironSource Atom, which have partnered to help Redshift users automate AWS data replication in real time at massive scale. Panoply also enables infrastructure capacity management and data processing. Other similar tools include Alooma and Snowflake.