Tags: Data Privacy, GDPR
One of our customers, a global digital media company, came to us with an interesting problem. As part of their General Data Protection Regulation (GDPR) compliance program, our client identified two tables that contained hundreds of millions of rows of historical user login data. GDPR allows you to process personal data while a contractual obligation with your customer is in place. However, once that contract ends and you have no legitimate interest in the data, deleting the historical records protects you against potential fines from the European Union (EU).
The first table held 800 million records of login data going back to when the company started, but the client had never really processed much of it. The fix was simple: we truncated the table, dropped it from the database, and removed the code that copied login records into it from the transaction table.
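For reference, that cleanup came down to two statements. This is only a sketch; the table name below is illustrative, since the first table is not named in this story:

-- Hypothetical table name standing in for the legacy login-history table.
-- TRUNCATE deallocates the data pages with minimal logging; DROP removes the table definition.
TRUNCATE TABLE SiteActivity.UserLoginsHistory;
DROP TABLE SiteActivity.UserLoginsHistory;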
The second table, called SiteActivity.UserLogins, presented an interesting challenge. It contained 560 million rows, of which 500 million were older than the 90-day limit in the client's privacy statement. Since GDPR dictates that you must store data for the shortest time possible, those 500 million records were in violation of GDPR and our client was exposed to significant fines. To make matters worse, every time a user logged into the system, it wrote a record to the UserLogins table.
Problems deleting historical data from a large highly concurrent database
So, what could possibly go wrong with the following simple SQL statement?
DELETE FROM SiteActivity.UserLogins WHERE LoginDate <= GETDATE() - 90
Essentially, with a delete of this size, SQL Server escalates to an exclusive lock on the table and holds it until the transaction completes. This means that no one can log into the system until SQL Server finishes deleting 500 million rows. In addition, SQL Server fully logs the operation in the transaction log. To make the problem even more interesting, the database participated in a cross-region Always On Availability Group on AWS EC2 instances. In essence, the application would die a certain death as the EBS volumes backing the primary and secondary replicas exceeded their IOPS thresholds!
As part of their research on the topic, our developers came across a blog post titled Deleting Historical Data from a Large Highly Concurrent SQL Server Database Table that looked promising. The author pointed out that deleting row by row, using a WHERE clause on the primary key, avoids blocking issues: SQL Server processes each single-row DELETE very quickly, so the long blocking problem goes away. The author's proposed solution generates a large SQL script, executed by SQLCMD, that contains one DELETE statement per row.
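As a rough sketch of that approach, the generator is essentially a SELECT that emits one DELETE per qualifying row, which you then save to a file and replay with SQLCMD. The UserLoginID key column below is hypothetical; it is not part of our client's table and stands in for whatever primary key the blog post assumed:

-- Sketch only: emit one single-row DELETE per historical record, keyed on a
-- hypothetical UserLoginID primary key, then run the saved output with SQLCMD.
SET NOCOUNT ON;
SELECT 'DELETE FROM SiteActivity.UserLogins WHERE UserLoginID = '
       + CAST(UserLoginID AS varchar(20)) + ';'
FROM SiteActivity.UserLogins
WHERE LoginDate <= GETDATE() - 90;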
Since we had a pre-production environment, our development team gave the approach a shot. After an hour, it became clear that the author had never tested this approach with 500 million DELETE statements. Our team estimated it would take almost two days to complete the operation. To make matters worse, the team still needed to create a new non-clustered index on the remaining 60 million rows, which could have caused blocking issues for new transactions all over again.
Another approach for deleting historical data
One of our solution architects proposed a novel solution that could potentially avoid the blocking problem, limit extra disk activity, and reduce network traffic between the primary and secondary replica servers.
The solution goes something like this:
- Create a clone _Temp table with the exact same structure, including the column names, data types, and foreign key constraints.
- On this empty _Temp table, create indexes identical to the original table's, and then add the brand new non-clustered indexes required by the client.
- Run an INSERT ... SELECT statement to copy the last 90 days' worth of data from the current table into the new _Temp table.
- Rename the tables to swap their places: the current production table becomes the _Old table and the _Temp table becomes the active production table.
- Take the delta data that landed in the _Old table while the rename was in progress and insert it into the new production table.
Using this approach, inserting the last 90 days of data (roughly 60 million records) into the _Temp table took us 30 minutes. Since we had created all of the indexes on the empty table up front, we were in business as soon as the insert completed. We experienced a brief hiccup during the rename sequence: for about one second, a single failure occurred, but the application's retry logic made sure all of the transactions made it into the system. As soon as we renamed the _Temp table to the original table name, it became active and started accepting new inserts from the application. The only thing left was the data that had found its way into the production table after the initial insert completed but before the tables were renamed. We inserted that data into the new active table and we were done.
A closer look at the procedure
If you want to dive deeper into the technical process, we prepared a step-by-step guide with the code samples for the second approach, which proved the most efficient for us. So follow the White Rabbit.
1. Create the stored procedures that copy data to the _Temp table and from the _Old table.
CREATE PROCEDURE [dbo].[PreRenameMigration_UserLogins]
    @StartDateTime datetime
    ,@EndDateTime datetime
AS
BEGIN
/*------------------------------------------------------------------------------
PROCEDURE: PreRenameMigration_UserLogins
PURPOSE:   To insert existing records from SiteActivity.UserLogins to what will
           be the new table, SiteActivity.UserLogins_Temp
EXAMPLE:
    DECLARE @EndDate datetime = GETDATE()
    EXEC PreRenameMigration_UserLogins @StartDateTime = '1900-01-01 00:00:1.000', @EndDateTime = @EndDate
-----------------------------------------------------------------------------*/
    SELECT @StartDateTime AS 'Start Date', @EndDateTime AS 'End Date\Start Date for Post inserts'

    INSERT INTO SiteActivity.UserLogins_Temp
    SELECT *
    FROM SiteActivity.UserLogins
    WHERE LoginDate BETWEEN @StartDateTime AND @EndDateTime
END
GO

CREATE PROCEDURE [dbo].[PostRenameMigration_UserLogins]
    @StartDateTime datetime
    ,@EndDateTime datetime
AS
BEGIN
/*------------------------------------------------------------------------------
PROCEDURE: PostRenameMigration_UserLogins
PURPOSE:   To insert existing records from the now renamed SiteActivity.UserLogins_Old
           to the newly renamed table, SiteActivity.UserLogins
EXAMPLE:
    EXEC PostRenameMigration_UserLogins @StartDateTime = '2018-09-26 13:00:00.000', @EndDateTime = '2018-09-26 13:32:00.000'
-----------------------------------------------------------------------------*/
    SELECT @StartDateTime AS 'Start Date', @EndDateTime AS 'End Date'

    INSERT INTO SiteActivity.UserLogins
    SELECT *
    FROM SiteActivity.UserLogins_Old
    WHERE LoginDate BETWEEN @StartDateTime AND @EndDateTime
END
GO
2. Create a SiteActivity.UserLogins_Temp table with the right constraints and indexes.
CREATE TABLE [SiteActivity].[UserLogins_Temp](
    [LoginDate] [datetime] NOT NULL,
    [SiteID] [dbo].[Identifier_ty] NOT NULL,
    [UserName] [nvarchar](40) NOT NULL,
    [IpAddress] [nvarchar](50) NOT NULL,
    [Success] [bit] NOT NULL,
    [SessionID] [nvarchar](64) NULL,
    [ResultStatusID] [int] NULL,
    [AdminId] [int] NULL,
    [AdminReason] [nvarchar](200) NULL
)
GO

ALTER TABLE [SiteActivity].[UserLogins_Temp] WITH CHECK
    ADD CONSTRAINT [SiteActivity.UserLogins_Temp_ResultStatusID_FK]
    FOREIGN KEY([ResultStatusID]) REFERENCES [dbo].[STATUS] ([iStatusID])
GO
ALTER TABLE [SiteActivity].[UserLogins_Temp]
    CHECK CONSTRAINT [SiteActivity.UserLogins_Temp_ResultStatusID_FK]
GO

/* Clustered index */
/****** Object: Index [IX_LoginDate] Script Date: 9/25/2018 4:51:45 PM ******/
CREATE CLUSTERED INDEX [IX_LoginDate] ON [SiteActivity].[UserLogins_Temp]
(
    [LoginDate] ASC
) WITH (PAD_INDEX = OFF, STATISTICS_NORECOMPUTE = OFF, SORT_IN_TEMPDB = OFF,
        DROP_EXISTING = OFF, ONLINE = OFF, ALLOW_ROW_LOCKS = ON, ALLOW_PAGE_LOCKS = ON)
GO

/* First non-clustered index */
/****** Object: Index [IX_LoginDate_Site_UserName] Script Date: 9/25/2018 4:53:13 PM ******/
CREATE NONCLUSTERED INDEX [IX_LoginDate_Site_UserName] ON [SiteActivity].[UserLogins_Temp]
(
    [LoginDate] ASC,
    [SiteID] ASC,
    [UserName] ASC
) WITH (PAD_INDEX = OFF, STATISTICS_NORECOMPUTE = OFF, SORT_IN_TEMPDB = OFF,
        DROP_EXISTING = OFF, ONLINE = OFF, ALLOW_ROW_LOCKS = ON, ALLOW_PAGE_LOCKS = ON)
GO

/* Second non-clustered index */
/****** Object: Index [IX_LoginDate_UserName] Script Date: 9/25/2018 4:53:39 PM ******/
CREATE NONCLUSTERED INDEX [IX_LoginDate_UserName] ON [SiteActivity].[UserLogins_Temp]
(
    [LoginDate] ASC,
    [UserName] ASC
) WITH (PAD_INDEX = ON, STATISTICS_NORECOMPUTE = OFF, SORT_IN_TEMPDB = OFF,
        DROP_EXISTING = OFF, ONLINE = OFF, ALLOW_ROW_LOCKS = ON, ALLOW_PAGE_LOCKS = ON, FILLFACTOR = 80)
GO

/* Third non-clustered index */
/****** Object: Index [IX_UserName_LoginDate] Script Date: 9/25/2018 4:53:39 PM ******/
CREATE NONCLUSTERED INDEX [IX_UserName_LoginDate] ON [SiteActivity].[UserLogins_Temp]
(
    [UserName] ASC,
    [LoginDate] ASC
) WITH (PAD_INDEX = ON, STATISTICS_NORECOMPUTE = OFF, SORT_IN_TEMPDB = OFF,
        DROP_EXISTING = OFF, ONLINE = OFF, ALLOW_ROW_LOCKS = ON, ALLOW_PAGE_LOCKS = ON, FILLFACTOR = 80)
GO
3. Grant permissions to SiteActivity.UserLogins_Temp.
GRANT INSERT     ON OBJECT::SiteActivity.UserLogins_Temp TO role_oneuser
GRANT REFERENCES ON OBJECT::SiteActivity.UserLogins_Temp TO role_oneuser
GRANT SELECT     ON OBJECT::SiteActivity.UserLogins_Temp TO role_oneuser
GRANT UPDATE     ON OBJECT::SiteActivity.UserLogins_Temp TO role_oneuser
GO
4. Run the INSERT ... SELECT via the stored procedure to load the last 90 days of data into SiteActivity.UserLogins_Temp.
DECLARE @StartDate datetime = GETUTCDATE() - 90
    ,@EndDate datetime = GETUTCDATE()

EXEC PreRenameMigration_UserLogins @StartDateTime = @StartDate, @EndDateTime = @EndDate
5. Monitor the Availability Group replication traffic, the size of the WCDS transaction log, and the IOPS usage in Splunk.
Monitoring was important to make sure we didn't fill the transaction log and that we stayed within the IOPS limits of the io1 storage.
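The exact dashboards are client-specific, but as a rough sketch, two standard queries cover the database-side checks (log fullness and how far the Availability Group replicas are lagging); the thresholds you act on are your own call:

-- How full is each transaction log?
DBCC SQLPERF(LOGSPACE);

-- How far behind are the Availability Group replicas?
SELECT ag.name AS ag_name,
       ar.replica_server_name,
       drs.log_send_queue_size,   -- KB waiting to be sent to the secondary
       drs.redo_queue_size        -- KB waiting to be redone on the secondary
FROM sys.dm_hadr_database_replica_states AS drs
JOIN sys.availability_replicas AS ar ON ar.replica_id = drs.replica_id
JOIN sys.availability_groups   AS ag ON ag.group_id   = drs.group_id;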
6. Rename the UserLogins tables. Run the following statements one at a time.
EXEC sp_rename 'SiteActivity.UserLogins', 'UserLogins_Old'
GO
EXEC sp_rename 'SiteActivity.UserLogins_Temp', 'UserLogins'
GO
7. Verify that the new records are getting inserted into SiteActivity.UserLogins via the service.
-- This will provide the end date for proc PostRenameMigration_UserLogins
SELECT LoginDate, COUNT(*)
FROM SiteActivity.UserLogins
WHERE LoginDate > ''   -- Coming back from the output of proc PreRenameMigration_UserLogins.
GROUP BY LoginDate
--HAVING COUNT(*) > 10
ORDER BY LoginDate
8. Run the INSERT ... SELECT via the stored procedure to copy into SiteActivity.UserLogins any delta data that might have been missed during the table renames.
DECLARE @StartDate datetime = ''   -- Coming back from the PreEndDate from the output of proc PreRenameMigration_UserLogins.
    ,@EndDate datetime = ''        -- Coming from the first date from script Check_Service_Integrity.sql.

EXEC PostRenameMigration_UserLogins @StartDateTime = @StartDate, @EndDateTime = @EndDate
9. Drop all the stored procedures from the database now that the archive is complete
Like any good DBA, you want to remove the stored procedures you no longer need.
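In our case, that meant dropping the two one-off migration procedures created in step 1:

-- Remove the migration procedures now that the archive is complete.
DROP PROCEDURE [dbo].[PreRenameMigration_UserLogins];
DROP PROCEDURE [dbo].[PostRenameMigration_UserLogins];
GO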
Rolling the solution out to production
Based on the smooth experience in pre-production, we went ahead with this approach in production. Just in case, while importing the data into the _Temp table, we ran the transaction log cleanup SQL Agent job every 2 minutes instead of every 15 minutes, so we could be sure the database transaction log would not fill up.
To make sure the problem does not recur, the client now runs a SQL Agent job every day during their maintenance cycle that uses the simple DELETE statement to remove the much smaller amount of expired data.
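The job body is essentially the same one-liner shown at the start of this post, which is now harmless because only about a day's worth of rows crosses the 90-day boundary on each run (the exact schedule and any batching are the client's choice):

-- Daily cleanup run by the SQL Agent job during the maintenance window.
-- With only ~24 hours of rows expiring per run, the lock footprint and
-- transaction log volume stay small.
DELETE FROM SiteActivity.UserLogins
WHERE LoginDate <= GETDATE() - 90;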
This is how the DB Best team achieved SQL Server database GDPR compliance by deleting millions of rows of customer data, with zero downtime and no disruption to business processes.
And this is the story of just one of hundreds of our successful collaborations. Stay tuned to hear about the rest!