This became an issue again recently when the UNIX admin where I work was creating a database to store system statistics. He had daily flat files that he wanted to load, and he knew the basics of creating a database. As he was loading the data, it occurred to him that he should not be able to load the same file twice, so he decided to add a unique index. The funny thing was that what he was trying to prevent had already happened, and he received an error when he tried to create the unique index. His next stop was my desk, to ask why the error was happening and how to fix it.
The first thing we needed to do was to figure out which rows were duplicated. The following SQL gave us the date that was loaded twice:
select collection_date, server, count(*)
from system_stats
group by collection_date, server
having count(*) > 1;
This gave us a list of machines, each with two rows for a single date. Now we knew which date had been loaded twice, but since the duplicate rows were identical, how do we get rid of exactly one row per key? The answer was to create a holding table for the records we didn't want to delete. In the statements below, &duplicate_date is a SQL*Plus substitution variable standing in for the date the query above returned:
create table hold_records as
select distinct * from system_stats
where collection_date = &duplicate_date;  -- the duplicated date found above
Checking the hold_records table showed that we had exactly half the number of records for that date, and that each server had only one record for it. A couple of quick queries along these lines will confirm that (a sketch using the same &duplicate_date placeholder):
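-- Row counts: hold_records should hold exactly half as many rows
select count(*) from system_stats where collection_date = &duplicate_date;
select count(*) from hold_records;

-- Each server should appear only once in the holding table
select server, count(*)
from hold_records
group by server
having count(*) > 1;  -- should return no rows

Now we clean up the data for that date in the main table: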
delete from system_stats
where collection_date = &duplicate_date;
Now we put the data back:
insert into system_stats (select * from hold_records);
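Before calling it done, it doesn't hurt to re-run the duplicate check from the top of the post and drop the holding table (optional housekeeping, not part of the original fix):

-- The duplicate check should now return no rows
select collection_date, server, count(*)
from system_stats
group by collection_date, server
having count(*) > 1;

-- The holding table is no longer needed
drop table hold_records;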
Voila! One record per server for that date, and the unique index could be created. For completeness, the index looked something like this (the index name is my own invention; the columns are the key we grouped by above):
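create unique index system_stats_uk
on system_stats (collection_date, server);

Again, this is not exactly rocket science, but I am surprised at how many times I have been asked to do something similar. Hopefully this will save someone else some time in the future.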