Loading Amazon Redshift Data Utilizing the AWS Glue ETL Service

Your CIO has made a strategic decision to move to AWS, and you are wondering how to migrate your Teradata-based Enterprise Data Warehouse (EDW): should you run Teradata Software Tiers from AWS Marketplace on EC2 instances, or migrate to Amazon Redshift? The latter option looks attractive, but there is one big issue: Redshift does not support stored procedures. You have plenty of procedural code in your Extract-Transform-Load (ETL) processes, and it is not obvious how to move it to Redshift. In this article, we look at how recurring data loading jobs can be moved to Redshift using the AWS Glue ETL service. Our specific focus is on migrating the stored procedure code of a Teradata ETL process to AWS Glue scripts.

AWS Glue

For the sake of this exercise, we chose the following typical scenario from our Teradata ETL:

  • Source data: a file is dropped onto an FTP server;
  • Transformations: filter, join, calculate, and aggregate data from the source datasets;
  • Target: insert the resulting dataset into a Redshift table.

We’ll try to build the same scenario with the AWS Glue ETL service to see whether it is a workable solution.

Continue reading to see how we approached loading data into Amazon Redshift utilizing AWS Glue and converted the source ETL scripts to the new format.

Overview of the source ETL script

Our source Teradata ETL script loads data from a file located on the FTP server into the staging area. After that, stored procedures transform the data and ingest it into the data mart.

You can see the Teradata ETL workflow at the top of the following diagram.

AWS Glue ETL service

Let’s try to reproduce the same operations in AWS Glue.

General approach to converting Teradata ETL script to AWS Glue ETL service

We start with the data extraction phase. The Teradata ETL script began with bulk data loading; in the AWS Glue ETL service, we instead run a crawler to populate a table in the AWS Glue Data Catalog. Naturally, the crawler can only run after the catalog database has been created. Alternatively, you may use the AWS Glue API in your application to register metadata in the AWS Glue Data Catalog.
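
For reference, here is a minimal sketch of how a crawler could be created and started programmatically through the Glue API with boto3; the crawler name, IAM role, catalog database, and S3 path are assumptions for illustration, not values from the original project.

import boto3

# A minimal sketch of driving a Glue crawler through the AWS API (boto3).
glue = boto3.client("glue", region_name="us-east-1")

glue.create_crawler(
    Name="td-glue-test-crawler",
    Role="GlueServiceRole",                       # IAM role with access to the S3 bucket (assumption)
    DatabaseName="td_glue_test",                  # target Glue Data Catalog database (assumption)
    Targets={"S3Targets": [{"Path": "s3://my-etl-bucket/salesorderdetail/"}]},
)

glue.start_crawler(Name="td-glue-test-crawler")   # populates the Data Catalog table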

After that, the data stored in the Amazon S3 bucket is registered as a table in the Glue Data Catalog. Then we use a Glue job, which leverages the Apache Spark Python API (pySpark), to transform the cataloged data. This Glue job moves the transformed data to the Amazon Redshift data warehouse.

Specific solutions for moving Teradata ETL to AWS Glue

Before running the crawler, we need to configure the connection to the Amazon S3 bucket. Place the source file into a bucket located in the same region where you created the Redshift cluster and the Glue resources. The crawler loads metadata into the Data Catalog, replacing the Basic Teradata Query (BTEQ) bulk-load script.
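
Placing the source file into the bucket can be done through the console or scripted with boto3, as in this sketch; the bucket name and key are illustrative assumptions.

import boto3

# A minimal sketch of uploading the source file to the S3 bucket with boto3.
s3 = boto3.client("s3", region_name="us-east-1")
s3.upload_file(
    Filename="SalesOrderDetail.csv",             # local extract pulled from the FTP server (assumption)
    Bucket="my-etl-bucket",                      # bucket in the same region as Glue and Redshift (assumption)
    Key="salesorderdetail/SalesOrderDetail.csv") # prefix the crawler points to (assumption)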

The Glue job replicates the logic of the stored procedure. Generally, the script of the Glue job has the following structure (a minimal sketch of steps 1 and 2 follows the list).

  1. Initialize pySpark modules and the Glue job.
  2. Create a dynamic frame from the staging table in the Glue Data Catalog.
  3. Transform the data from the Glue Data Catalog.
  4. Move the data into the Redshift tables.
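
Steps 1 and 2 are not shown in the snippets below, so here is a minimal sketch of what they could look like in a Glue job script. The imports and job initialization are standard Glue boilerplate; the catalog database and table names are assumptions for illustration.

import sys
from awsglue.transforms import ApplyMapping, ResolveChoice, DropNullFields
from awsglue.utils import getResolvedOptions
from awsglue.context import GlueContext
from awsglue.dynamicframe import DynamicFrame
from awsglue.job import Job
from pyspark.context import SparkContext
from pyspark.sql.functions import unix_timestamp

# 1. Initialize pySpark modules and the Glue job.
args = getResolvedOptions(sys.argv, ["JOB_NAME", "TempDir"])
sc = SparkContext()
glueContext = GlueContext(sc)
spark = glueContext.spark_session
job = Job(glueContext)
job.init(args["JOB_NAME"], args)

# 2. Create a dynamic frame from the staging table in the Glue Data Catalog.
#    The database and table names below are illustrative assumptions.
datasource0 = glueContext.create_dynamic_frame.from_catalog(
    database = "td_glue_test",
    table_name = "salesorderdetail",
    transformation_ctx = "datasource0")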

The original Teradata ETL scripts include the following code:

DELETE td_glue_test.ProductPrice;
INSERT INTO td_glue_test.ProductPrice
SELECT ProductID, SUM(UnitPrice-UnitPriceDiscount)/COUNT(ProductID) ProductAvgPrice
FROM td_glue_test.SalesOrderDetail
GROUP BY 1;

DELETE td_glue_test.SalesOrderTtl;
INSERT INTO td_glue_test.SalesOrderTtl
SELECT SalesOrderID, ModifiedDate,
SUM((UnitPrice-UnitPriceDiscount) * OrderQty) OrderPrice
FROM td_glue_test.SalesOrderDetail
GROUP BY 1,2;

We used the AWS Glue DynamicFrame class, which is similar to the Apache Spark DataFrame class, to transform the data from the Glue Data Catalog into a shape suitable for Amazon Redshift. The code below shows how we did that.

# 3. Transform the data: convert the string date to a timestamp and
#    reproduce the two aggregations from the Teradata stored procedure.
dataframe0 = DynamicFrame.toDF(datasource0)
dataframe0 = dataframe0.withColumn("modifieddate",
    unix_timestamp(dataframe0["modifieddate"], "dd.MM.yyyy hh:mm:ss.SSS")
    .cast("double")
    .cast("timestamp"))
dataframe0.createOrReplaceTempView("TempTable")
dataframe0_sql_df_dim = spark.sql("""SELECT productid,
    SUM(unitprice-unitpricediscount)/COUNT(productid) productavgprice
    FROM TempTable GROUP BY productid""")
dataframe0_sql_df_fact = spark.sql("""SELECT salesorderid, modifieddate,
    SUM((unitprice-unitpricediscount) * orderqty) orderprice
    FROM TempTable GROUP BY salesorderid, modifieddate""")

In this way, AWS Glue gives us an environment to express the procedural transformation logic as Spark SQL before loading the results into Redshift.

Moving data to Redshift

The following fragment of the AWS Glue job ingests the data into the Redshift tables.

# 4. Move the data into the Redshift tables: convert the Spark DataFrame back
#    to a DynamicFrame, map the columns, and write through the Glue connection.
datasource1 = DynamicFrame.fromDF(dataframe0_sql_df_dim, glueContext, "datasource1")
applymapping2 = ApplyMapping.apply(frame = datasource1, mappings =
    [("productid", "int", "productid", "int"), ("productavgprice", "double",
    "productavgprice", "double")], transformation_ctx = "applymapping2")
resolvechoice3 = ResolveChoice.apply(frame = applymapping2, choice = "make_cols",
    transformation_ctx = "resolvechoice3")
dropnullfields4 = DropNullFields.apply(frame = resolvechoice3,
    transformation_ctx = "dropnullfields4")
datasink5 = glueContext.write_dynamic_frame.from_jdbc_conf(frame = dropnullfields4,
    catalog_connection = "Redshift-dba_dev-rsdbb02", connection_options =
    {"schema": "dba_dev", "dbtable": "productprice", "database": "dev"},
    redshift_tmp_dir = args["TempDir"], transformation_ctx = "datasink5")
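
The snippet above loads only the dimension dataset (dataframe0_sql_df_dim); the fact dataset can be written the same way. Here is a hedged sketch, assuming a target table named salesorderttl in the same schema (the name is inferred from the SalesOrderTtl table in the Teradata script) and finishing with the usual job commit.

# Loading the fact dataset analogously; the target table name is an assumption.
datasource6 = DynamicFrame.fromDF(dataframe0_sql_df_fact, glueContext, "datasource6")
applymapping7 = ApplyMapping.apply(frame = datasource6, mappings =
    [("salesorderid", "int", "salesorderid", "int"),
     ("modifieddate", "timestamp", "modifieddate", "timestamp"),
     ("orderprice", "double", "orderprice", "double")],
    transformation_ctx = "applymapping7")
dropnullfields8 = DropNullFields.apply(frame = applymapping7,
    transformation_ctx = "dropnullfields8")
datasink9 = glueContext.write_dynamic_frame.from_jdbc_conf(frame = dropnullfields8,
    catalog_connection = "Redshift-dba_dev-rsdbb02", connection_options =
    {"schema": "dba_dev", "dbtable": "salesorderttl", "database": "dev"},
    redshift_tmp_dir = args["TempDir"], transformation_ctx = "datasink9")
job.commit()  # standard end of a Glue job script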

The following video shows how our conversion approach for this Teradata ETL script works.

Validating the data transfer

After performing all these operations, we need to validate the data transfer. We can do this with our in-house Database Compare Suite tool, which simplifies schema and data manipulation tasks in both homogeneous and heterogeneous database environments. After connecting to the source Teradata and target Amazon Redshift databases, we select the matching tables and run the Data Comparison operation. Database Compare Suite then checks the data in the selected tables and reports any differences between them.

When migrating your workloads to the Amazon cloud, consider leveraging the fully managed AWS Glue ETL service to prepare and load your data into the data warehouse. We can help you craft the right ETL solution for your analytics system by migrating your existing ETL scripts to AWS Glue. Contact DB Best to learn more about our exclusive cloud offers.