Blog: Extract and Copy Millions of Zip Files on Azure Data Lake Storage Gen2 using Azure Data Factory

A customer of ours presented us with an interesting problem: how do you take a hard drive holding four terabytes of historical data, millions of zipped data files nested by year, quarter, and month, and move them into an Azure Data Lake Gen2 storage account? Getting the data from the hard drive to Azure was easy using the Azure Data Box service. Our challenge at the time of this blog post was that PowerShell had limited functionality for accessing Gen2 storage accounts. The good news: Azure Data Factory fully supports them. In this blog post, I’ll show you how I created an Azure Data Factory pipeline to recursively unzip the files on an Azure storage account and load them into the desired location on an Azure Data Lake Gen2 storage account.

To simplify the blog post, I’m assuming all the zipped files are already in an Azure Data Lake Gen2 storage account. To explore the data visually, I’m also using version 1.8.1 of Azure Storage Explorer.

Step 1: Recursively extracting zipped data files into a flat directory structure

In this step, I’ll show you how to extract the zipped data files into a flat directory structure.

First, log in to the Azure Portal, create an Azure Data Factory instance, and open the Data Factory portal.

Azure Data Factory

Click ‘Copy Data’ in the Azure Data Factory portal and provide an appropriate name for your Data Factory pipeline.

You can run this pipeline just once or schedule it as required. In this example, we run it only once.

Create Copy Data task

From there, you need to create a new Linked Service to serve as the source for the pipeline. Search for Data Lake Gen2 in the search box and fill in the details to create the source Linked Service.

New Linked Service
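For reference, an ADLS Gen2 Linked Service is an AzureBlobFS connection under the hood. A minimal sketch of the JSON the wizard produces, assuming a hypothetical storage account named archivedemostore and key-based authentication (both placeholders, not values from this example), looks roughly like this:

{
    "name": "ADLSGen2LinkedService",
    "properties": {
        "type": "AzureBlobFS",
        "typeProperties": {
            "url": "https://archivedemostore.dfs.core.windows.net",
            "accountKey": {
                "type": "SecureString",
                "value": "<storage-account-key>"
            }
        }
    }
}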

Now, select the Linked Service you just created to set the source connection. Browse to the path where the zip files are stored in the Data Lake. In this example, all files are kept under the Zipped folder in the archivedemo container.

Linked Service

Select ZipDeflate as the compression type to unzip all zipped files in the source folder. You can leave the other options at their defaults.

Select a source folder
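Behind the scenes, these choices are captured in the source dataset definition. A rough sketch of that dataset as a Binary dataset with ZipDeflate compression, using the hypothetical names SourceZippedFiles and ADLSGen2LinkedService, might look like this:

{
    "name": "SourceZippedFiles",
    "properties": {
        "type": "Binary",
        "linkedServiceName": {
            "referenceName": "ADLSGen2LinkedService",
            "type": "LinkedServiceReference"
        },
        "typeProperties": {
            "location": {
                "type": "AzureBlobFSLocation",
                "fileSystem": "archivedemo",
                "folderPath": "Zipped"
            },
            "compression": {
                "type": "ZipDeflate"
            }
        }
    }
}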

Next, we need a Linked Service for the destination, but in our example the source and destination use the same Data Lake account, so we can reuse the existing Linked Service for the destination connection. Click the Next button.

Specify the destination data store

Browse to the destination path where your unzipped files should be stored after a successful pipeline run. Click the Next button to finish creating the pipeline.

Create pipeline

Click the Monitor button to monitor the pipeline execution.

Monitoring

You can see the results in Storage Explorer under the Unzipped folder in the archivedemo container.

Results

At this point, all the zipped files have been extracted into a flat directory structure. In the next step, we take this further and route each extracted file into a folder based on its file name.

Step 2: Extracting zipped data files into a dynamic directory structure based on file names

We have kept two files in the Zipped folder as a data source.

Data source

As before, we log in to the Azure Portal, create an Azure Data Factory instance, and go to the Data Factory portal.

Azure Data Factory

We click the Create Pipeline button in the Azure Data Factory portal, then click the ‘+’ symbol to create a new pipeline. We give the pipeline an appropriate name and add a Get Metadata activity to it.

We create a new dataset pointing to the path where the zipped files are stored; if you already have an existing dataset, you can simply select it from the drop-down list. Select Child Items in the argument (Field list) section of the Get Metadata activity.

Pipeline datasets
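If you look at the pipeline JSON, the Get Metadata activity with the Child Items argument corresponds roughly to the sketch below; the activity and dataset names are simply the ones used in this walkthrough.

{
    "name": "GetZipMetaData",
    "type": "GetMetadata",
    "typeProperties": {
        "dataset": {
            "referenceName": "SourceZippedFiles",
            "type": "DatasetReference"
        },
        "fieldList": [ "childItems" ]
    }
}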

Execute this pipeline to validate that the Get Metadata activity is working correctly. You can view the resulting data under Actions, in the Output section.

Get Metadata1 results
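The childItems field of the output is an array of objects, each carrying a file’s name and type, which is what the ForEach activity in the next step iterates over. With two hypothetical zip files it would look something like:

{
    "childItems": [
        { "name": "data_2016_q1.zip", "type": "File" },
        { "name": "data_2016_q2.zip", "type": "File" }
    ]
}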

Adding a ForEach activity to the pipeline

From there, we add a ForEach activity to the pipeline to iterate through all the zip files and add the following expression in the Items field. GetZipMetaData is the name of the Get Metadata activity we added to the pipeline; adjust the expression below to match your own activity name.

@activity('GetZipMetaData').output.childItems

Adding Foreach activity
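In the pipeline JSON, the ForEach activity looks roughly like the sketch below (the name ForEachZipFile is a placeholder); the Copy Data activity we add next goes inside its activities array, which is left empty here:

{
    "name": "ForEachZipFile",
    "type": "ForEach",
    "dependsOn": [
        {
            "activity": "GetZipMetaData",
            "dependencyConditions": [ "Succeeded" ]
        }
    ],
    "typeProperties": {
        "items": {
            "value": "@activity('GetZipMetaData').output.childItems",
            "type": "Expression"
        },
        "activities": []
    }
}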

Click on Activities (beside Settings) to add a Copy Data activity within the ForEach activity. Then click +New in the Source section to create a new dataset.

Creating a new Dataset

Using the Linked Service

Select the Linked Service if it already exists. Otherwise, create a new Linked Service pointing to the data store where your zip files reside. In this example, the zip files are stored in the Zipped folder under the archivedemo container, so ‘archivedemo/Zipped’ is entered as the File Path.

You need to add the @item().name expression to the file name so that each iteration picks up the file name returned by the Get Metadata activity. Select ZipDeflate as the compression type to decompress all the zip files. Leave the other fields unchanged.

Testing connection
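The steps above type @item().name directly into the dataset’s file name field. An equivalent, slightly more explicit way to wire the per-file name through is to give the source dataset a parameter and pass @item().name from the Copy activity; a rough sketch, using the hypothetical names ZipFilePerIteration and zipFileName:

{
    "name": "ZipFilePerIteration",
    "properties": {
        "type": "Binary",
        "linkedServiceName": {
            "referenceName": "ADLSGen2LinkedService",
            "type": "LinkedServiceReference"
        },
        "parameters": {
            "zipFileName": { "type": "string" }
        },
        "typeProperties": {
            "location": {
                "type": "AzureBlobFSLocation",
                "fileSystem": "archivedemo",
                "folderPath": "Zipped",
                "fileName": {
                    "value": "@dataset().zipFileName",
                    "type": "Expression"
                }
            },
            "compression": { "type": "ZipDeflate" }
        }
    }
}

The Copy activity inside the ForEach then references this dataset with "parameters": { "zipFileName": "@item().name" } on its input.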

Click +New in the Sink section to create a new dataset. Select the Linked Service if it already exists; otherwise, create a new Linked Service pointing to the storage account where the unzipped files should be written (in this example, the same Data Lake account).

In this example, the decompressed files should be stored in the Unzipped folder under the archivedemo container, so we would normally enter ‘archivedemo/Unzipped’ as the File Path. Instead, we make this path dynamic based on each file’s name: if the file name contains 2016, the file must go to the archivedemo/Unzipped/Year=2016/ path, and the same applies for other files.

We used the expression below to evaluate the path for each file based on its name. This logic only covers the year 2016; you can add the same logic for other years or months in a similar way.

@{if(contains(item().name,'2016'),concat('Year=2016/',item().name),'false')}
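For example, extending the same pattern to also handle 2017, and falling back to the plain file name for anything that doesn’t match (rather than the literal 'false'), could look like the sketch below. The function names are written in the lowercase form used in the ADF expression documentation:

@{if(contains(item().name,'2016'),concat('Year=2016/',item().name),if(contains(item().name,'2017'),concat('Year=2017/',item().name),item().name))}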

We leave the other fields unchanged.

Evaluate the path for file based on its name

Then we select Flatten Hierarchy from the Copy Behavior drop-down list in the Sink section.

Flatten Hierarchy
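In the Copy activity JSON, that choice shows up as a copyBehavior setting on the sink’s store settings. A minimal sketch, assuming a Binary sink over ADLS Gen2 to match the datasets above:

"sink": {
    "type": "BinarySink",
    "storeSettings": {
        "type": "AzureBlobFSWriteSettings",
        "copyBehavior": "FlattenHierarchy"
    }
}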

Now we publish all the changes to this pipeline, then run it and go to the Output section to monitor the execution.

Pipeline execution

You can validate the pipeline run by checking for the unzipped files under the archivedemo/Unzipped/Year=2016 path in Azure Data Lake Gen2.

Results

Conclusion

You can use Azure Data Factory to extract zip files in the cloud and move the extracted files to another cloud location or a different storage account without ever bringing those files offline.

This saves the manual effort of extracting the files and placing them under the correct hierarchy based on their file names.

In this example, we used only a couple of files for demonstration, but we have used this same method to move millions of files to the cloud, so you can trust this solution to save you time.