Change Data Capture using DMS

Shantanu Oak
4 min readOct 6, 2020

We can backup the mysql data in csv format using the “tab” parameter of mysqldump command.

# mysqldump -udba -pXXX test newcustomer — tab=/var/lib/mysql-files/

You will get the output that will look something like this…

# cat /var/lib/mysql-files/newcustomer.txt
1 ashok
2 akbar
3 anthony

The file can be easily uploaded to S3 using a command like this…

aws s3 cp /var/lib/mysql-files/newcustomer.txt s3://testme123/mysqlbackup/

Please note:
If you get an error while using mysqldump command, change the path. The correct path will be shown if you run this command at mysql prompt:

mysql> SHOW VARIABLES LIKE “secure_file_priv”;

Let’s assume I update the mysql table using a command like this…

update newcustomer set name = ‘ankit’ where id = 1;

Is there any way to know that original value for primary key ID 1 was “ashok” and it was changed to “ankit”? This information may be very important if the data is important and sensitive. This is almost impossible to know unless you have tried “Data Migration Service” by AWS.
_____

Here is a cloudformation template that will initiate the instance required for the task. It will also create source and destination endpoints.

It will create a new S3 bucket to store all CSV files. You can select specific tables or skip binary data for quicker migration. The only step that is required from your side is to start the task from “Actions” drop down menu as shown in this image…

Start Task from Actions drop-down

When you enable Migration, you have 2 options. 1) Full-load 2) CDC
The first one is like using the mysqldump command shown above. It simply saves the data as .csv files.
The second CDC (Change Data Capture) option is like mysql replication. It continuously monitors your data and records new values to S3. Since the old values are already written to one of the csv files, you get old as well as new records. This feature is very important for security audit and general trouble-shooting.

Just like replication, we need to enable binary logs for CDC to work. Use the following parameters:

server-id = 1
log_bin = /var/log/mysql/mysql-bin.log
expire_logs_days = 10
max_binlog_size = 100M
binlog_format = ROW
binlog_checksum = NONE

The first 4 parameters are the same as you use for replication. But the last 2 parameters i.e. binlog format and checksum are important for CDC to work.

Let’s create a table in “test” database.

create table newcustomer (id int, name varchar(100), primary key (id));

insert into newcustomer values (1, ‘amar’);
insert into newcustomer values (2, ‘akbar’);
insert into newcustomer values (3, ‘anthony’);

update newcustomer set name = ‘ankit’ where id = 1;

As you can see, I have inserted 3 records and updated one of them. How do I save both, old and updated values? This feature is built-in if you enable CDC (Change data capture) while creating the stack using the cloudformation template mentioned above. Your table will look like this in Athena…

Primary Key ID 1 has logged both (old and new) values

“Inserted” and “Updated” records are marked separately.

This is something that the programmers try to achieve using python, Java or php code. In most of the cases the application source code becomes too complicated and database very large because they use the “log table” on the same server to store old and new records. In case of CDC, the data is saved in S3 bucket, you can easily create Athena table:

CREATE EXTERNAL TABLE IF NOT EXISTS newcustomer (
`type` string,
`id` int,
`name` string )
ROW FORMAT SERDE ‘org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe’
WITH SERDEPROPERTIES (
‘serialization.format’ = ‘,’,
‘field.delim’ = ‘,’
) LOCATION ‘s3://mysqltos3-oktrailbucket-npn/test/newcustomer/’
TBLPROPERTIES (‘has_encrypted_data’=’false’);

You will need to change the columns and S3 location to match your data.

While using S3 End point, you can use extra connection attributes and convert the data to parquet.
You will have to add 1 line in the properties section of the endpoint and the template will look like this…

S3Endpoint:
Type: AWS::DMS::Endpoint
Properties:
ExtraConnectionAttributes: “addColumnName=true;compressionType=GZIP;dataFormat=parquet;”

CDC or “Change Data Capture” provides you the information that traditional replication does not. A slave can be promoted if master goes down. DMS may not be useful in that case. But it serves another purpose that is important for security and audit.

--

--