Monday, 25 February 2013

Delete duplicate records from MS SQL table using CTE and window function

I work for data company which receives data from many third party data providers. These data often comes with duplicate records. so while I import them to our database, I ofter has to run de-duplication on those data. I have tried many de-deuplicate methods. Here I would like to share those methods.

To see those method lets take some example data. It is City data. I have simplified this data to contain only few columns.  Use below script for sample data.

CREATE TABLE City(CityName VARCHAR(100), CityCode VARCHAR(100), County VARCHAR(100), Country VARCHAR(100), CreateDate DATETIME)
       
INSERT INTO City(CityName, CityCode, County, Country)
      VALUES ('London', 'LN', 'n/a', 'England', '2013-02-25 15:08:00')
                  ,('London', 'LN', 'n/a', 'England', '2013-02-25 15:09:00')
                  ,('Liverpool', 'LP', 'Mersyside', 'England', '2013-02-25 15:10:00')
                  ,('Liverpool', 'LP', 'Mersyside', 'England', '2013-02-25 15:11:00')
                  ,('Liverpool', 'LP', 'Mersyside', 'England', '2013-02-25 15:12:00')
                  ,('Oxford', 'OX', 'Oxfordshire', 'England', '2013-02-25 15:13:00')
                  ,('Oxford', 'OX', 'Oxfordshire', 'England', '2013-02-25 15:14:00')
                  ,('Oxford', 'OX', 'Oxfordshire', 'England', '2013-02-25 15:15:00')
                  ,('Oxford', 'OX', 'Oxfordshire', 'England', '2013-02-25 15:16:00')
                  ,('Glasgow', 'GL', 'n/a', 'Scotland', '2013-02-25 15:17:00')
                  ,('Glasgow', 'GL', 'n/a', 'Scotland', '2013-02-25 15:18:00')


Now lets see the methods to de-duplicate these data.


1) Most simple method is to load data in staging table and use DISTINCT to select the unique records.
CREATE TABLE City_temp(CityName VARCHAR(100), CityCode VARCHAR(100), County VARCHAR(100), Country VARCHAR(100), CreateDate DATETIME)

INSERT INTO City_temp(CityName, CityCode, County, Country)
SELECT DISTINCT CityName, CityCode, County, Country FROM City

TRUNCATE TABLE City

INSERT INTO City(CityName, CityCode, County, Country, CreateDate)
SELECT CityName, CityCode, County, Country, GETDATE() FROM City_temp
  
 These will work without CreateDate field, but if we have to select the createDate also then you have to use aggregate function like MIN/MAX.

INSERT INTO City_temp(CityName, CityCode, County, Country, CreateDate)
SELECT CityName, CityCode, County, Country, MAX(CreateDate) AS CreateDate FROM City GROUP BY CityName, CityCode, County, Country



2) Use temp table with Id column with IDENTITY and select records with MIN/MAX value of Id

     
CREATE TABLE City_temp(Id INT IDENTITY(1,21), CityName VARCHAR(100), CityCode VARCHAR(100), County VARCHAR(100), Country VARCHAR(100), CreateDate DATETIME)

INSERT INTO City_temp(CityName, CityCode, County, Country, CreateDate)
SELECT CityName, CityCode, County, Country, CreateDate FROM City

TRUNCATE TABLE City
INSERT INTO City(CityName, CityCode, County, Country, CreateDate)
SELECT CityName, CityCode, County, Country, CreateDate
FROM City_temp  AS A
      INNER JOIN (SELECT MAX(Id) AS Id FROM City_temp GROUP BY CityName, CityCode, County, Country) AS B
            ON A.Id = B.Id



3) This is the most effective method so far I found, using ETC and ROW_NUMBER() window function. Use below query to delete the duplicate records.

;WITH T
AS (SELECT *, ROW_NUMBER() OVER (PARTITION BY CityName, CityCode, County, Country ORDER BY CityName, CityCode, County, Country) AS RowIndex
FROM City )

DELETE FROM T WHERE RowIndex > 1




This method is fast of above 2 methods becuase it does all processing in memory there is no need to wirte data to temp table.  Please not here that processing in memory only happen if SQL server has enough memory to process these data, otherwise internally it wirtes data to tempdb.

 I hope this will be useful. Please feel free to comment if you find any other more effective method to remove duplicate records in SQL server.


No comments:

Post a Comment