Tags: Amazon Redshift, AWS Schema Conversion Tool, Improving the performance, optimization
Database administrators often see query performance in Amazon Redshift degrade as data volumes grow. The problem becomes even more significant for regularly executed queries. Design-time decisions that were made based on the statistics available at the start of a project may no longer make sense for the current data and query patterns. At that point, the Amazon Redshift cluster requires some optimization. See how these Amazon Redshift performance tuning tips can improve query execution times when users start noticing performance issues.
Amazon introduced a feature called Redshift Optimization for the AWS Schema Conversion Tool (SCT) with the November 17, 2016 release. You can now use the Schema Conversion Tool for the optimization of existing Amazon Redshift deployments! This dramatically reduces the manual effort of addressing performance issues with your Redshift deployments.
Here’s why you should consider taking advantage of the Redshift Optimization option in the AWS Schema Conversion Tool.
Preparing the environment
We decided to use HammerDB to run the TPC-H data warehouse test to see how well the Redshift Optimization option can improve query performance.
NOTE: We are only using HammerDB for comparison purposes to illustrate how changes in the configuration of Redshift impact performance. This cannot be considered an official TPC-H benchmark run that you can use to compare with other database platforms.
To run the test, we installed HammerDB on a Windows Server 2012 R2 m4.xlarge instance on Amazon EC2 to generate the data and run the queries. We created a Redshift cluster using the ds2.xlarge node type.
In HammerDB, we selected the TPC-H configuration, which creates 8 tables, for a modified run against Amazon Redshift using 13 of the standard 22 queries from an official TPC-H run.
To build the schema in HammerDB as shown below, we initially created all tables with the 'even' distribution style and without sort keys.
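For reference, here is a minimal sketch of what one of these initial table definitions looks like, using the nation table (the column definitions match the generated script shown later in this post); DISTSTYLE EVEN and the absence of a SORTKEY reflect our starting point:

-- Initial, unoptimized definition of one TPC-H table:
-- rows are spread round-robin across node slices and no sort key is defined.
CREATE TABLE IF NOT EXISTS public.nation
(
    n_nationkey NUMERIC(18,0) NOT NULL,
    n_name      CHARACTER(25),
    n_regionkey NUMERIC(18,0),
    n_comment   CHARACTER VARYING(152)
)
DISTSTYLE EVEN;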
After creating the empty database schema, we used the application's Datagen option to generate the data files with a scale factor of 1,000. The result was a set of flat data files for each table, approximately 1.1 TB in total.
Next, we compressed the generated data files using the following command:
gzip.exe -r D:\HammerDB\*.tbl
This reduced the size of the initial data set to around 320 GB.
We used S3 Browser to copy these compressed files to an Amazon S3 bucket located in the same region as the Redshift cluster to minimize network latency. We then created scripts to load the data into Redshift using the COPY command. It was nice that Redshift applied LZO compression to the loaded data, which resulted in a stored data set of just 200 GB.
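For illustration, here is a sketch of the kind of COPY statement we used per table. The bucket path, IAM role, and region below are placeholders rather than our actual values; GZIP tells Redshift to decompress the archives on load, and the pipe delimiter assumes the .tbl format HammerDB generates:

-- Load one table's gzipped data files from S3 (paths and role ARN are placeholders).
-- DELIMITER '|' assumes pipe-delimited .tbl files; adjust if your files differ.
COPY public.nation
FROM 's3://your-bucket/tpch/nation.tbl.gz'
IAM_ROLE 'arn:aws:iam::123456789012:role/RedshiftLoadRole'
GZIP
DELIMITER '|'
REGION 'us-east-1';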
Now everything was set to start the HammerDB benchmark.
What happens if your initial optimization strategy results in poor performance
In the first run, we configured the TPC-H driver to run with just one user and one query set for the 13 queries as shown below.
The first run took a little over 27 hours to complete! Here are the results for each of the queries in the first run. Note, there was no warm up run prior to running the benchmark.
Hammerdb Log @ Sun May 28 18:15:41 -0000 2017
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-
Hammerdb Log @ Sun May 28 18:16:23 -0000 2017
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-
Vuser 1:Executing Query (1 OF 13)
Vuser 1:10 ROWS returned IN 355.67 seconds
Vuser 1:Executing Query (2 OF 13)
Vuser 1:1 ROWS returned IN 229.854 seconds
Vuser 1:Executing Query (3 OF 13)
Vuser 1:1 ROWS returned IN 253.078 seconds
Vuser 1:Executing Query (4 OF 13)
Vuser 1:1 ROWS returned IN 4925.949 seconds
Vuser 1:Executing Query (5 OF 13)
Vuser 1:1 ROWS returned IN 29.741 seconds
Vuser 1:Executing Query (6 OF 13)
Vuser 1:100 ROWS returned IN 1104.394 seconds
Vuser 1:Executing Query (7 OF 13)
Vuser 1:100 ROWS returned IN 31989.634 seconds
Vuser 1:Executing Query (8 OF 13)
Vuser 1:0 ROWS returned IN 0.829 seconds
Vuser 1:Executing Query (9 OF 13)
Vuser 1:0 ROWS returned IN 1.075 seconds
Vuser 1:Executing Query (10 OF 13)
Vuser 1:0 ROWS returned IN 26868.327 seconds
Vuser 1:Executing Query (11 OF 13)
Vuser 1:0 ROWS returned IN 26372.198 seconds
Vuser 1:Executing Query (12 OF 13)
Vuser 1:100 ROWS returned IN 6070.607 seconds
Vuser 1:Executing Query (13 OF 13)
Vuser 1:3480 ROWS returned IN 343.742 seconds
Vuser 1:Completed query SET IN 98545 seconds
Vuser 1:Geometric mean OF query times returning ROWS IS 846.62011
Based on our experience, this benchmark usually runs much faster on hardware of this class once appropriate distribution styles and sort keys are chosen. To see what the Redshift optimizer suggests for each of the problematic queries, you can run the following query:
SELECT DISTINCT query,
       SUBSTRING(event,0,200)    AS event,
       SUBSTRING(solution,0,200) AS solution
FROM stl_alert_event_log
WHERE query IN (3990653,3990668,3990683,3990699,3990907,3990909,3990961,
                3992326,3992327,3992328,3993487,3994626,3994986)
ORDER BY query;
Here’s what we discovered:
The Amazon Redshift optimizer looks for ways to minimize network latency between compute nodes and to minimize file I/O latency when reading data.
The disk storage in Amazon Redshift for a compute node is divided into a number of slices. The number of slices per node depends on the node size of the cluster. When you load data into a table, Amazon Redshift distributes the rows of the table to each of the node slices according to the table's distribution style, which can be 'even', 'key', or 'all'.
With the 'even' distribution style, rows are spread across the node slices evenly, regardless of the values in any particular column. The 'all' distribution style places a full copy of each table on every node. The 'key' distribution style distributes rows according to the values in one specified column. The idea is to pick a distribution key column that keeps commonly accessed rows, such as those filtered in a WHERE clause or joined together, on the same node slices.
When you create a table, you can define one or more of its columns as sort keys. When data is initially loaded into the empty table, the rows are stored on disk in sorted order.
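As a side note, if you want to see how an existing table is currently laid out, Redshift's svv_table_info system view reports each table's distribution style, first sort key, row skew, and the percentage of unsorted rows. A quick check over our schema could look like this:

-- Review distribution style, sort key, skew, and unsorted share per table.
-- Note: svv_table_info only lists tables that contain data.
SELECT "table",
       diststyle,
       sortkey1,
       skew_rows,
       unsorted,
       tbl_rows
FROM svv_table_info
WHERE "schema" = 'public'
ORDER BY size DESC;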
Amazon Redshift performance tuning tips to optimize cluster storage
Applying these Amazon Redshift performance tuning tips with Redshift Optimization requires several steps to optimize the sort keys and distribution strategy for your cluster storage.
Duplicating the original cluster. We created a snapshot of the original Redshift cluster in the AWS console. We then created a new Redshift cluster and used the default option in the AWS console to restore it from the snapshot. This gave us the original and the newly created clusters as the source and the target, respectively, in the AWS SCT project. Moreover, we didn't need to migrate the data because it had already been restored from the snapshot.
Collecting statistics from the original database. We used offline statistics collection and then uploaded the resulting files to AWS SCT. Alternatively, SCT provides an option to collect statistics online, directly from the tool.
Running the optimization. Amazon recommends using different clusters as the source and target for the optimization project. In addition, SCT allows you to select the migration strategy when you launch the optimization: you can opt for 'Migration to a copy' or 'Migration to a clean slate'. In our case, we selected the first option because we had already created a copy of the original cluster.
To launch the optimization, we selected the 'Run optimization' option in SCT's pop-up menu and used the default optimization strategy.
According to Amazon, the default optimization parameters are based on best practices, so we had a chance to check the optimization results using the default values.
Our main goal for the optimization procedure was to speed up the 'expensive' or 'heavy' queries.
The results of optimization included SQL scripts to recreate tables. Here’s an example of the script for one of the tables.
CREATE TABLE IF NOT EXISTS public.nation
(
    n_nationkey NUMERIC(18,0) NOT NULL,
    n_name      CHARACTER(25),
    n_regionkey NUMERIC(18,0),
    n_comment   CHARACTER VARYING(152)
)
DISTSTYLE KEY
DISTKEY (n_nationkey)
SORTKEY (n_name);
As you can see, the generated SQL code now includes the sort keys and the distribution keys.
The results show more than an 8 times performance increase!
In the AWS Schema Conversion Tool, you can save the SQL script for the new tables or apply it directly to the target cluster. We selected the second option and applied the code to the target. This operation took around 2 hours, after which we could run the HammerDB benchmark once again, using the same parameters as in the first run on the non-optimized database.
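We did not dig into exactly how SCT applies the changes under the hood, but on a cluster that already holds data the work amounts to recreating each table with the new keys and reloading its rows, which explains why the operation takes a while. A manual sketch of that pattern (a so-called deep copy) for the nation table might look like this:

-- Illustrative "deep copy": create the table with new keys, reload it, and swap names.
-- SCT handled this for us; the sketch only shows the kind of work involved.
CREATE TABLE public.nation_new
(
    n_nationkey NUMERIC(18,0) NOT NULL,
    n_name      CHARACTER(25),
    n_regionkey NUMERIC(18,0),
    n_comment   CHARACTER VARYING(152)
)
DISTSTYLE KEY
DISTKEY (n_nationkey)
SORTKEY (n_name);

INSERT INTO public.nation_new SELECT * FROM public.nation;

DROP TABLE public.nation;

ALTER TABLE public.nation_new RENAME TO nation;

-- Refresh planner statistics after the reload.
ANALYZE public.nation;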
Here are the results of the HammerDB benchmark compared to the first run.
As you can see, SCT dramatically improved the performance of the 'expensive' or 'heavy' queries in the new cluster, where the tables now include sort and distribution keys. The HammerDB benchmark completed in about 3 hours. This was an 8.2 times performance gain!
It was great to see that SCT with the new Redshift Optimization feature made good decisions on distribution styles and sort keys for the queries that performed poorly. The fact that three of the queries ran somewhat slower was expected.
NOTE: You can't always expect an 8 times performance increase using these Amazon Redshift performance tuning tips with Redshift Optimization. These results are based on a specific benchmark test and won't reflect your actual database design, size, and queries. However, Redshift Optimization eliminated just about all of the labor-intensive manual steps outlined in the Amazon Redshift tutorial for Tuning Table Design.
Summary
As you can see, using these Amazon Redshift performance tuning tips with the AWS Schema Conversion Tool's Redshift Optimization feature greatly reduces the time needed to tune query performance, with good results. In our case, the Redshift Optimization feature improved query performance by 8.3 times by recommending distribution styles and sort keys based on historical query execution data.
It’s well worth the minimal effort to run the Redshift Optimization against your existing Amazon Redshift databases. If you do, please share your results with us!
If you have any questions about this great feature in AWS SCT, feel free to contact DB Best at info@dbbest.com. DB Best has migrated more databases using SCT than any other partner in the world. So, if you want to learn more about how DB Best can help you with migration projects, check out our Jumpstart program at https://landing.dbbest.com/aws-migration.