all things web and not web

allskonar

It’s not uncommon that databases are filled with duplicate data sets. And it’s even more common that you want to clear it out so you don’t have to bothered with duplicates. This is obviously the sort of thing you want to fix at the entrance, on the way in to the database and not collect duplicates, but this solution works well if that is too late.

To get things started let’s write up a file that we can play with. I am using MySQL and I don’t know the SQL standard well enough to say but I am sure this example can be modified to work on other database servers. I generally just keep a file open and then retrieve it on the mysql console. Download the example file.


mysql>  \. dup_tutorial.sql

Creating a simple table to play with. And stuffing it with some data.

CREATE TABLE peoples (
  id int(11) NOT NULL auto_increment,
  f_name varchar(128) NOT NULL default '',
  l_name varchar(128) NOT NULL default '',
  age text NOT NULL,
  updated_on timestamp NOT NULL default
       CURRENT_TIMESTAMP on update CURRENT_TIMESTAMP,
  PRIMARY KEY  (id)
) ENGINE=InnoDB DEFAULT CHARSET=latin1;

INSERT INTO peoples (f_name, l_name, age) 
   VALUES ('baldur', 'gudbjornsson', 28);
INSERT INTO peoples (f_name, l_name, age) 
   VALUES ('baldur', 'gudbjornsson', 28);
INSERT INTO peoples (f_name, l_name, age) 
   VALUES ('baldur', 'gudbjornsson', 28);
INSERT INTO peoples (f_name, l_name, age) 
   VALUES ('baldur', 'gudbjornsson', 31);

In this sample we have 4 rows and but there are only 2 rows unique if updated_on and id fields are not taken into the account. Now the way to query this to only receive the unique set group by can be used or SELECT DISTINCT.


SELECT *
FROM peoples 
GROUP BY f_name, l_name, age

And here is the trick to collect the others to dispose of them. First we’ll join the peoples table with a unique rows from it self in a sub query to retrieve the ones that don’t match.


SELECT o_peoples.id 
FROM peoples o_peoples
LEFT OUTER JOIN 
  (SELECT id 
   FROM peoples 
   GROUP BY f_name, l_name, age) u_peoples 
ON o_peoples.id = u_peoples.id
WHERE u_peoples.id IS NULL;

Then we’ll delete those extra rows from the database. Do this at your own risk and obviously back up your database five times and have a quick recovery plan it things go south.


DELETE o_peoples 
FROM peoples o_peoples
LEFT OUTER JOIN 
  (SELECT Id FROM peoples 
   GROUP BY f_name, l_name, age) u_peoples
ON o_peoples.id = u_peoples.id
WHERE u_peoples.id IS NULL;

Leave a Reply

 
allskonar Powered by Mephisto