To see those method lets take some example data. It is City data. I have simplified this data to contain only few columns. Use below script for sample data.
CREATE TABLE City(CityName VARCHAR(100), CityCode VARCHAR(100), County VARCHAR(100), Country VARCHAR(100), CreateDate DATETIME)
INSERT INTO City(CityName, CityCode, County, Country)
VALUES ('London', 'LN', 'n/a', 'England', '2013-02-25 15:08:00')
,('London', 'LN', 'n/a', 'England', '2013-02-25 15:09:00')
,('Liverpool', 'LP', 'Mersyside', 'England', '2013-02-25 15:10:00')
,('Liverpool', 'LP', 'Mersyside', 'England', '2013-02-25 15:11:00')
,('Liverpool', 'LP', 'Mersyside', 'England', '2013-02-25 15:12:00')
,('Oxford', 'OX', 'Oxfordshire', 'England', '2013-02-25 15:13:00')
,('Oxford', 'OX', 'Oxfordshire', 'England', '2013-02-25 15:14:00')
,('Oxford', 'OX', 'Oxfordshire', 'England', '2013-02-25 15:15:00')
,('Oxford', 'OX', 'Oxfordshire', 'England', '2013-02-25 15:16:00')
,('Glasgow', 'GL', 'n/a', 'Scotland', '2013-02-25 15:17:00')
,('Glasgow', 'GL', 'n/a', 'Scotland', '2013-02-25 15:18:00')
Now lets see the methods to de-duplicate these data.
1) Most simple method is to load data in staging table and use DISTINCT to select the unique records.
CREATE TABLE City_temp(CityName VARCHAR(100), CityCode VARCHAR(100), County VARCHAR(100), Country VARCHAR(100), CreateDate DATETIME)
INSERT INTO City_temp(CityName, CityCode, County, Country)
SELECT DISTINCT CityName, CityCode, County, Country FROM City
TRUNCATE TABLE City
INSERT INTO City(CityName, CityCode, County, Country, CreateDate)
SELECT CityName, CityCode, County, Country, GETDATE() FROM City_temp
INSERT INTO City_temp(CityName, CityCode, County, Country, CreateDate)
SELECT CityName, CityCode, County, Country, MAX(CreateDate) AS CreateDate FROM City GROUP BY CityName, CityCode, County, Country
2) Use temp table with Id column with IDENTITY and select records with MIN/MAX value of Id
CREATE TABLE City_temp(Id INT IDENTITY(1,21), CityName VARCHAR(100), CityCode VARCHAR(100), County VARCHAR(100), Country VARCHAR(100), CreateDate DATETIME)
INSERT INTO City_temp(CityName, CityCode, County, Country, CreateDate)
SELECT CityName, CityCode, County, Country, CreateDate FROM City
SELECT CityName, CityCode, County, Country, CreateDate
FROM City_temp AS A
INNER JOIN (SELECT MAX(Id) AS Id FROM City_temp GROUP BY CityName, CityCode, County, Country) AS B
ON A.Id = B.Id
;WITH T
AS (SELECT *, ROW_NUMBER() OVER (PARTITION BY CityName, CityCode, County, Country ORDER BY CityName, CityCode, County, Country) AS RowIndex
FROM City )
DELETE FROM T WHERE RowIndex > 1
This method is fast of above 2 methods becuase it does all processing in memory there is no need to wirte data to temp table. Please not here that processing in memory only happen if SQL server has enough memory to process these data, otherwise internally it wirtes data to tempdb.
I hope this will be useful. Please feel free to comment if you find any other more effective method to remove duplicate records in SQL server.
No comments:
Post a Comment