AWS Schema Conversion Tool: General Approach to Migration of Data Warehouses to Amazon Redshift

This post continues our video blog series on the AWS Schema Conversion Tool (SCT). In our previous blog posts, we talked about using AWS SCT for transactional database migration projects. The AWS Schema Conversion Tool also supports migrating data warehouse (DW) workloads to Amazon Redshift.

AWS Schema Conversion Tool uses a different approach to DW migration projects compared to the transactional database migration workflow. With transactional databases, you typically have stored procedures, triggers, and other database objects that implement business logic in the database. With a data warehouse, you typically don't have these types of objects. Instead, a data warehouse holds huge volumes of historical, pre-aggregated data, often with 10-15 years of storage depth. To improve query performance, you would typically have partitioned tables, materialized views, and columnar tables that work with a star schema dimensional model.

Amazon Redshift is different from other data warehouse solutions in that there is no CREATE INDEX command. Instead, Redshift uses features such as sort and distribution keys to optimize query performance. In the following video, we give you an overview of how to migrate your Oracle, Teradata, Greenplum, and Netezza data warehouses to Amazon Redshift using the AWS Schema Conversion Tool. The video also covers how AWS SCT provides guidance on creating appropriate sort and distribution keys for you, based on query and runtime statistics from the source database.

Be sure to check out our new AWS Schema Conversion Tool Jumpstart offer to get you up and running fast for your migration to Amazon Redshift.

Background

Amazon Redshift uses a massively parallel processing (MPP) architecture to distribute data across multiple compute nodes, as shown in the diagram below.
[Diagram: Amazon Redshift architecture and node relationships]
A key consideration for optimizing storage is to minimize network latency between compute nodes and minimize file I/O latency when reading data. Let’s look at the Amazon Redshift features that you can use to optimize query performance.

  • Distribution key. The disk storage for a compute node is divided into a number of slices. The number of slices per node depends on the node size of the cluster. When you load data into a table, Amazon Redshift distributes the rows of the table to the node slices according to the table's distribution style. The idea is to pick a distribution key column that keeps commonly accessed data, such as the columns used in a WHERE clause, together within node slices. A common example would be an event date. (See the DDL sketch following this list.)
  • Distribution style. There are three distribution styles you can choose from: EVEN, KEY or ALL.
    • Even distribution (default). The leader node distributes the rows across the slices in a round-robin fashion, regardless of the values in any particular column. EVEN distribution is appropriate when a table does not participate in joins or when there is not a clear choice between KEY distribution and ALL distribution.
    • Key distribution. The rows are distributed according to the values in one column. The leader node attempts to place matching values on the same node slice. If you distribute a pair of tables on their joining keys, the leader node collocates the rows on the slices according to the values in the joining columns, so that matching values from the common columns are physically stored together. You will typically use this strategy for a large fact table paired with a commonly joined large dimension table.
    • ALL distribution. A copy of the entire table is distributed to every node. ALL distribution is appropriate only for relatively slow-moving tables; that is, tables that are not updated frequently or extensively.
  • Sort keys. When you create a table, you can define one or more of its columns as sort keys. When data is initially loaded into the empty table, the rows are stored on disk in sorted order. You can specify either a compound or interleaved sort key; the DDL sketch following this list shows a compound sort key. NOTE: SCT does not provide support for an interleaved sort key in the user interface. If you decide that an interleaved sort key is right for your table based on the topic Comparing Sort Styles, you can use SCT to modify the generated code.
  • Partitioned tables. Other data warehouse platforms provide partitioned tables to optimize both data access and data storage based on a partitioning function. Amazon Redshift does not have partitioned tables, but it can emulate them using a covering view over smaller tables merged together with a UNION ALL clause (see the view sketch following this list). SCT provides the option to migrate partitioned tables using this approach. The advantage of using smaller tables is easier maintenance, because the older tables rarely change; only the current table with the most recent data does. Amazon Redshift can typically cache the most recent, frequently used table data in memory, reducing overall file I/O and network latency. You will need to routinely update the view as you add a new table for new values of the partition key. For a great example of how this works, check out the blog post from Periscope Data – Changing Dist and Sort Keys on Giant Tables in Redshift.
  • Column compression. Compression is a column-level operation that reduces the size of data when it is stored. Compression conserves storage space and reduces the amount of data read from storage, which lowers disk I/O and therefore improves query performance. NOTE: SCT does not provide a user interface for defining compression options for columns. You can modify the generated table definition using the various compression encoding options. Amazon recommends that you use Automatic Compression when loading empty tables. The AWS Database Migration Service (DMS) with Redshift as a target does not support automatic compression at this time. To take advantage of Automatic Compression, you will need to manually use the COPY command with the COMPUPDATE option set to ON for at least 100,000 rows per slice (see the COPY sketch following this list). Internally, Redshift drops and recreates the original table definition with the recommended column compression encodings. You can then truncate the table and use AWS DMS to load data into your Redshift tables.
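To make the distribution-style and sort-key options above concrete, here is a minimal DDL sketch for a hypothetical star-schema pair. The table and column names are illustrative, not SCT output:

```sql
-- Hypothetical fact/dimension pair; all names are illustrative.
-- KEY distribution on the joining column collocates matching rows on
-- the same slices, and a compound sort key on event_date speeds up
-- range-restricted scans such as WHERE event_date >= ...
CREATE TABLE dim_customer (
    customer_id   INTEGER      NOT NULL,
    customer_name VARCHAR(100)
)
DISTSTYLE KEY
DISTKEY (customer_id);

CREATE TABLE fact_sales (
    sale_id     BIGINT        NOT NULL,
    customer_id INTEGER       NOT NULL,  -- same distribution key as dim_customer
    event_date  DATE          NOT NULL,
    amount      DECIMAL(12,2)
)
DISTSTYLE KEY
DISTKEY (customer_id)
COMPOUND SORTKEY (event_date);

-- A small, slow-moving lookup table is a good candidate for ALL
-- distribution: every node gets a full copy.
CREATE TABLE dim_region (
    region_id   SMALLINT    NOT NULL,
    region_name VARCHAR(50)
)
DISTSTYLE ALL;
```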
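Similarly, the partitioned-table emulation described above boils down to a UNION ALL view over per-partition tables. Here is a minimal sketch, with hypothetical per-year sales tables and a hypothetical sales_template table supplying the column definitions:

```sql
-- Hypothetical per-"partition" tables, one per year of sales data,
-- each created from a common template table.
CREATE TABLE sales_2016 (LIKE sales_template);
CREATE TABLE sales_2017 (LIKE sales_template);

-- The covering view merges the smaller tables into one logical table.
-- Recreate the view whenever a table for a new year is added.
CREATE OR REPLACE VIEW sales AS
          SELECT * FROM sales_2016
UNION ALL SELECT * FROM sales_2017;
```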
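And the Automatic Compression workflow from the compression note might look like the following sketch. The table name, S3 path, and IAM role are placeholders:

```sql
-- Load a representative sample (at least 100,000 rows per slice) with
-- COMPUPDATE ON so Redshift analyzes the data and recreates the table
-- with the recommended column encodings.
COPY fact_sales
FROM 's3://my-bucket/sample-data/'                        -- placeholder path
IAM_ROLE 'arn:aws:iam::123456789012:role/MyRedshiftRole'  -- placeholder role
FORMAT AS CSV
COMPUPDATE ON;

-- Inspect the encodings Redshift chose, then truncate and reload the
-- full data set with AWS DMS.
SELECT "column", type, encoding
FROM pg_table_def
WHERE tablename = 'fact_sales';

TRUNCATE fact_sales;
```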

How SCT helps in choosing optimization strategies

AWS recognized one of the challenges with defining the distribution key and style, along with the sort keys: you can't alter these settings after the table is created. You need to drop and recreate the table to make changes. If you have already loaded the table with production data, dropping and recreating the table can be time-consuming because you must reload the data.

NOTE: SCT does a great job handling the drop and recreate process, making sure all of the dependent objects are consistent. However, it assumes there is no data in the tables. You will need to reload the table with data after using SCT to make your changes.
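If you do need to change keys on a table that already holds data, the change typically follows a deep-copy pattern like the sketch below, which is also the approach described in the Periscope Data post referenced earlier. Table and column names are illustrative:

```sql
BEGIN;

-- Recreate the table with the revised distribution and sort keys.
CREATE TABLE fact_sales_new (
    sale_id     BIGINT        NOT NULL,
    customer_id INTEGER       NOT NULL,
    event_date  DATE          NOT NULL,
    amount      DECIMAL(12,2)
)
DISTKEY (customer_id)
SORTKEY (event_date);

-- Deep-copy the data, then swap the tables.
INSERT INTO fact_sales_new SELECT * FROM fact_sales;
DROP TABLE fact_sales;
ALTER TABLE fact_sales_new RENAME TO fact_sales;

COMMIT;
```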

To reduce the effort of having to reload data, AWS decided to invest in SCT so that it uses all the information it can gather from the source database to make reasonable predictions for the distribution and sort keys in the initial table migration.

Based on our usage of SCT, we recommend choosing the metadata and statistical information optimization strategy in the project settings. When we traced the queries SCT used to retrieve metadata and statistics from the source, we noticed that it also looks at the source database's query log and query execution statistics to identify the tables and columns used in the most frequent queries.

The Strategy Details options under AWS SCT Optimization Strategies, shown in the video, provide various weighting factors for determining the resulting distribution key and sort key columns for tables. We found that the default settings work very well and seldom need modification to produce good suggestions in the Key Management dialog. Likewise, choosing the first suggestion with the highest confidence value provided good results.

Before doing a full load of the data warehouse, we recommend doing a partial load of data with your preferred loading tools.

Then, run the most frequently used queries against your Amazon Redshift database to examine query performance. The Amazon Redshift topic Evaluating the Query Plan provides great tips on how to identify issues with queries, so that you can modify your tables with better distribution and sort keys before doing a full data load.
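As a quick check during this phase, run EXPLAIN on a frequent query and look at the join attributes in the plan: DS_DIST_NONE indicates a collocated join, while DS_BCAST_INNER or DS_DIST_BOTH indicates data being moved between nodes. The query below is illustrative, reusing the hypothetical tables from the earlier sketches:

```sql
-- Illustrative frequent query; check the plan's join attributes.
EXPLAIN
SELECT d.customer_name, SUM(f.amount) AS total_amount
FROM fact_sales f
JOIN dim_customer d ON d.customer_id = f.customer_id
WHERE f.event_date >= '2017-01-01'
GROUP BY d.customer_name;
```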

Stay tuned to our blog for more tips and insights around migrating databases and data warehouses using the AWS Schema Conversion Tool.
