It’s not uncommon that databases are filled with duplicate data sets. And it’s even more common that you want to clear it out so you don’t have to bothered with duplicates. This is obviously the sort of thing you want to fix at the entrance, on the way in to the database and not collect duplicates, but this solution works well if that is too late.
To get things started let’s write up a file that we can play with. I am using MySQL and I don’t know the SQL standard well enough to say but I am sure this example can be modified to work on other database servers. I generally just keep a file open and then retrieve it on the mysql console. Download the example file.
mysql> \. dup_tutorial.sql
Creating a simple table to play with. And stuffing it with some data.
CREATE TABLE peoples (
id int(11) NOT NULL auto_increment,
f_name varchar(128) NOT NULL default '',
l_name varchar(128) NOT NULL default '',
age text NOT NULL,
updated_on timestamp NOT NULL default
CURRENT_TIMESTAMP on update CURRENT_TIMESTAMP,
PRIMARY KEY (id)
) ENGINE=InnoDB DEFAULT CHARSET=latin1;
INSERT INTO peoples (f_name, l_name, age)
VALUES ('baldur', 'gudbjornsson', 28);
INSERT INTO peoples (f_name, l_name, age)
VALUES ('baldur', 'gudbjornsson', 28);
INSERT INTO peoples (f_name, l_name, age)
VALUES ('baldur', 'gudbjornsson', 28);
INSERT INTO peoples (f_name, l_name, age)
VALUES ('baldur', 'gudbjornsson', 31);
In this sample we have 4 rows and but there are only 2 rows unique if updated_on and id fields are not taken into the account. Now the way to query this to only receive the unique set group by can be used or SELECT DISTINCT.
SELECT *
FROM peoples
GROUP BY f_name, l_name, age
And here is the trick to collect the others to dispose of them. First we’ll join the peoples table with a unique rows from it self in a sub query to retrieve the ones that don’t match.
SELECT o_peoples.id
FROM peoples o_peoples
LEFT OUTER JOIN
(SELECT id
FROM peoples
GROUP BY f_name, l_name, age) u_peoples
ON o_peoples.id = u_peoples.id
WHERE u_peoples.id IS NULL;
Then we’ll delete those extra rows from the database. Do this at your own risk and obviously back up your database five times and have a quick recovery plan it things go south.
DELETE o_peoples
FROM peoples o_peoples
LEFT OUTER JOIN
(SELECT Id FROM peoples
GROUP BY f_name, l_name, age) u_peoples
ON o_peoples.id = u_peoples.id
WHERE u_peoples.id IS NULL;
Leave a Reply