
Improve querying with Redshift, MySQL database replication options

Database replication can speed query performance. But to get it right, developers must understand all the components involved with Amazon Redshift and MySQL.

Data plays a huge role in many applications, especially when it comes to analytics. To speed up and better organize queries, it's common to keep multiple replicas of data in several types of data stores. One of the biggest advantages of replicating data across stores -- such as with MySQL database replication -- is the ability to optimize system performance. But choosing the right database depends on several factors, such as the type of data a developer works with and its volume.

Amazon Redshift is a petabyte-scale, fully managed, cloud-based data warehouse service that integrates with other Amazon services. Its basic function is to provide a SQL interface for querying data. To avoid manual data export and import, Redshift supports replication workflows that move data in real time from MySQL into the Redshift environment through Amazon Simple Storage Service (S3).

Organizations typically choose to replicate MySQL databases into Redshift to gain improved analytical query speed. MySQL is optimized to handle transactions that touch a small number of records at a time. Redshift, however, delivers a massively parallel processing architecture and uses a columnar data store to aggregate and analyze large amounts of data. This is particularly beneficial when it comes to real-time data analytics.

Three MySQL database replication approaches

There are many approaches to MySQL database replication, including these common methods:


Full dump and load: This is the simplest approach to MySQL replication. MySQL periodically dumps complete tables, which are then loaded into Redshift; the corresponding Redshift table is dropped and recreated during each load. While this is a straightforward approach, it carries some risks. The dump can lock the source table for its duration, and loading a particularly large table incurs high latency. Because of that locking risk, it's best to dump tables from a read replica instead of the production database.

Incremental dump and load: Unlike the full dump, a query runs periodically to check for new and updated rows. Redshift then loads only the added and updated data, and a consolidation (merge) query reconstructs the original table. Because only a small portion of rows typically changes between two consecutive loads, these periodic queries keep the replication load on the database low. A minimal sketch of such a periodic query appears after this list.

This approach also has several disadvantages. Deleted rows are not captured, so deletions in MySQL are not reflected in Redshift. Similarly, table alterations are not picked up unless the periodic query is adjusted in each iteration. A consolidation query must reconstruct the original table during replication to Amazon Redshift, and the replicated tables need an updated_at column for the periodic query to key on. The consolidation steps add complexity to the implementation, and here, too, the best approach is to query a read replica instead of the production database.

Binlog replication: The MySQL binlog keeps an ordered log of every operation the database performs. After an initial dump of the MySQL database, the binlog is continuously streamed and loaded into Redshift. With this approach, database tables aren't locked, and binlog replication can also handle deleted rows and table alterations.
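For the incremental approach, the periodic query might look like the following minimal sketch. It assumes the source table (here, test.tbl_product) has an updated_at timestamp column and that the replication job tracks the time of its previous load; the cutoff value below is only an example:

mysql> SELECT * FROM test.tbl_product
WHERE updated_at > '2016-06-05 00:00:00';

The rows returned are exported and loaded into Redshift the same way as a full dump, and the consolidation query then merges them into the existing table. Each run records the highest updated_at value it loaded and uses that as the cutoff for the next run.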

Dump to S3, copy to Redshift

One common approach to replicating data is to dump the MySQL tables to S3 and use the Redshift COPY command to load them into the AWS data warehouse. This approach works best with a small database.

To start this process, log in to the AWS Management Console to create and configure a Redshift cluster.

Figure 1. Enter Redshift cluster details via the AWS Management Console.

Next, define the node information.

Figure 2. Determine the type and number of nodes for the cluster.

Create the Redshift cluster and then configure security groups to allow access from your host. After this step, export the MySQL tables so they can be loaded into Redshift.

In the following example, we run a statement to select all product transactions created on or after January 1, 2016 (and before June 6, 2016), and split the results into files of up to one million rows each. The first file is exported directly:

mysql> SELECT * FROM test.tbl_product
WHERE DATE(created) < '2016-06-06'
AND DATE(created) >= '2016-01-01'
LIMIT 0,1000000
INTO OUTFILE '/tmp/tbl_product_1.csv'
FIELDS TERMINATED BY ','
ENCLOSED BY '"'
LINES TERMINATED BY '\n';

Then, call the export procedure to repeat the statement for the remaining chunks (export_csv_split is assumed here to be a custom stored procedure that reruns the SELECT ... INTO OUTFILE statement with increasing offsets):

mysql> call export_csv_split('tbl_product', 1000000);
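For illustration, the second chunk could also be exported manually with the same pattern; only the offset and the output file name change. In practice, add an ORDER BY clause so consecutive chunks don't overlap:

mysql> SELECT * FROM test.tbl_product
WHERE DATE(created) < '2016-06-06'
AND DATE(created) >= '2016-01-01'
ORDER BY id
LIMIT 1000000,1000000
INTO OUTFILE '/tmp/tbl_product_2.csv'
FIELDS TERMINATED BY ','
ENCLOSED BY '"'
LINES TERMINATED BY '\n';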

After creating the CSV files, we can move them into a staging directory before uploading them to the S3 bucket. To do so, use the following commands:

$ mkdir ~/s3-Redshift
$ mv /tmp/*.csv ~/s3-Redshift/

You can install S3cmd, a command-line tool for managing Amazon S3 remotely. If you use S3cmd, configure it with your AWS access key and secret key, then upload the files.
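A minimal sketch of that upload with S3cmd, assuming the bucket name s3-rshift and the s3-Redshift prefix referenced by the COPY command below:

$ s3cmd --configure  [prompts for the AWS access key and secret key]
$ s3cmd mb s3://s3-rshift  [create the bucket if it doesn't already exist]
$ s3cmd put ~/s3-Redshift/*.csv s3://s3-rshift/s3-Redshift/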

Next, load the data into Redshift. Use the standard PostgreSQL client, psql, to access the Redshift cluster with the provided endpoint and credentials.

$ sudo apt-get install -y postgresql

Then use the psql program to access the Redshift cluster.

$ psql --host=mysql-Redshift.xxxx.xxx.Redshift.amazonaws.com --port=5439 --username=Redshift --dbname=analytical

Next, create a table in Redshift similar to the source table in MySQL.

analytical=# create table tbl_product (
id integer not null,
name varchar(64) not null,
price integer not null,
primary key(id)
)
distkey(name)
compound sortkey(name, price);

Run the COPY command from the psql session to read the CSV files from S3 and load them into Redshift. The S3 path acts as a prefix, so every file whose name begins with tbl_product is loaded.

analytical=# copy tbl_product from 's3://s3-rshift/s3-Redshift/tbl_product'
credentials 'aws_access_key_id=your_aws_access_key;aws_secret_access_key=your_aws_secret_key'
delimiter ',' removequotes;

After the initial data load from the S3 bucket, run the VACUUM command to reorganize the data and the ANALYZE command to update the table statistics.

analytical=# vacuum tbl_product;
analytical=# analyze tbl_product;

Data is now available in the Redshift cluster and ready for query processing. To verify the data, use a few commands from the psql client console, such as:

analytical=# \d tbl_product

And, to verify whether the table loaded correctly:

analytical=# SELECT COUNT(*) FROM tbl_product;
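To confirm the load, run the matching count against the source table in MySQL, using the same date range as the export, and compare the two results:

mysql> SELECT COUNT(*) FROM test.tbl_product
WHERE DATE(created) < '2016-06-06'
AND DATE(created) >= '2016-01-01';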

There are also open source tools, such as Tungsten and Embulk, that help move large data sets between systems. It is important to determine whether you need high throughput, low latency or support for frequent schema changes. If so, consider a third-party option.

Using MySQL with Redshift replication

Follow these steps to implement MySQL-to-Amazon Redshift replication. First, create a MySQL table named product with three fields: id, name and price.

mysql> create table product(
    -> id INT NOT NULL AUTO_INCREMENT,
    -> name VARCHAR(100) NOT NULL,
    -> price INT(40) NOT NULL,
    -> PRIMARY KEY ( id )
    -> );

This command creates a table named product in the following format:

id   name      price
1    book      100
2    toys      150
3    clothes   200

Next, create the same target table in Redshift before pulling data from MySQL. This is important because some properties, such as column types and table keys, can only be defined one time during table creation.

In Redshift, IT teams must create two tables. The first table holds the latest reconstructed state of the original MySQL table, and the second table holds all of the updates from the original MySQL table since its last consolidation. Both of these tables have the same structure, so they can be merged together for the consolidation step.

Remember, column types and table keys cannot be changed later; they can only be defined at table creation. This matters because there are two types of discrepancies between MySQL and Redshift tables.

Column and key type discrepancies in Redshift and MySQL

MySQL and Redshift have slightly different column types. For example, INT and FLOAT in a MySQL table map directly to INTEGER and REAL in a Redshift table. But MySQL's TIME and UNSIGNED INT types have no equivalents in Redshift, so developers must map those columns to a supported type when loading them into the Redshift table.

With key types, the primary key constraint is enforced in MySQL but not in Redshift. Primary keys must be unique and valid; during consolidation in Redshift, there might be multiple rows with an identical primary key, which can return incorrect values when queried. Therefore, primary keys should be enforced during the ETL process, which extracts data from the source system -- MySQL -- and brings it into the data warehouse -- Redshift.
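One way to spot violations in Redshift is to check for duplicate key values after each load; here's a minimal sketch against the product table created below:

$Redshift> select id, count(*)
from product
group by id
having count(*) > 1;

Any rows returned are duplicates that the ETL process should resolve, typically by keeping only the most recent version of each row.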

Redshift relies on two types of keys. The distribution key matters when data loads into a table; a good distribution key spreads data evenly across all nodes and minimizes data movement during query execution. Sort keys enable optimization of query performance.

Create an identical table in Redshift while defining distribution and sort keys, as such:

$Redshift> create table product (
id integer not null,
name varchar(64) not null,
price integer not null,
primary key(id)
)
distkey(name)
compound sortkey(name, price);
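The second table, which accumulates the binlog updates between consolidations, can be created with the same columns. The sketch below assumes the hypothetical name product_binlog and adds a hypothetical binlog_pos column to record the order of binlog events; the consolidation step uses it to pick the latest version of each row. The primary key is omitted because this table can legitimately hold several versions of the same row:

$Redshift> create table product_binlog (
id integer not null,
name varchar(64) not null,
price integer not null,
binlog_pos bigint not null
)
distkey(name)
compound sortkey(id, binlog_pos);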

After creating the destination tables in Redshift, perform the initial MySQL dump and load the data into the Redshift destination table. After that, all updates to the table arrive through the binlog stream. IT teams must define the binlog format before the initial dump to avoid data loss or inconsistency.

To set up the binlog format, add the following lines to the MySQL server my.cnf file:

[mysqld]
log-bin=mysql-bin
binlog_format = ROW
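After restarting MySQL so the settings take effect, you can verify them from the MySQL client:

mysql> SHOW VARIABLES LIKE 'log_bin';
mysql> SHOW VARIABLES LIKE 'binlog_format';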

Understanding MySQL binlog formats and tables

MySQL offers three binlog formats.

Statement: This format logs the SQL statements themselves, so a single query that updates many rows of a table appears as one entry in the binlog.

Row: This format logs changes at the row level, so a single query that updates many rows creates one binlog entry per affected row. Developers use this format to track the state of individual rows instead of the state of the table as a whole.

Mixed: This option combines the statement and row formats.

Before performing the MySQL dump, it is important to back up the relevant tables and record the binlog position, because binlog replication sends events from a specific point in time. To perform these steps, use a user account that has the SELECT and LOCK TABLES privileges.

mysql> use test;  [test is the name of the database]

mysql> FLUSH TABLES product WITH READ LOCK;  [flushes the table and takes a read lock so the dump is consistent]

$ mysqldump test product > sql.dump  [run from a separate shell while the lock is held; this dumps table 'product' into 'sql.dump' on the local disk]

mysql> SHOW MASTER STATUS;

The output lists the current binlog file name and position. Record these values, because binlog replication will start sending events from that point.

mysql> UNLOCK TABLES;

Once the table has been dumped, load it into Redshift -- for example, by converting the dump to CSV, uploading it to S3 and running the COPY command, as shown earlier.

Now it's time to consolidate the table. The consolidated table stores the latest reconstructed state of the MySQL table, and the binlog table stores all of the updates since the last consolidation. The consolidation process merges the two and extracts the latest state of each row. This command creates a view named id from the table product, selecting its name and price columns:

$Redshift> create view id as select name, price from product;
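For the consolidation step itself, here is a minimal sketch that merges the latest version of each changed row from the product_binlog staging table (described earlier) into product, using a delete-and-insert pattern ordered by the hypothetical binlog_pos column:

$Redshift> begin;

delete from product
using product_binlog
where product.id = product_binlog.id;

insert into product (id, name, price)
select id, name, price
from (
    select id, name, price,
           row_number() over (partition by id order by binlog_pos desc) as rn
    from product_binlog
) latest
where rn = 1;

delete from product_binlog;

commit;

This sketch treats every binlog row as an insert or update; handling delete events would require an additional flag on the staging table.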

There are a few constraints here. For example, errors can occur if a table's structure is modified or if binlog events arrive out of order. So design the data pipeline carefully, monitor the MySQL logs and set alarms to catch problems early.

