Since I’ve had to look this up twice in the last year, I should blog about it so I’ll remember. This is the quickest solution I’ve found.
You have duplicate rows in your database. Even an AUTO_INCREMENT primary key doesn’t necessarily prevent you from having rows with otherwise identical information.
Find a column that will uniquely identify a record. If there isn’t one column, find a combination of columns. If no combination of columns uniquely identifies a row, you probably have database design issues.
CREATE TEMPORARY TABLE nonduplicates LIKE table_with_duplicates;
INSERT INTO nonduplicates SELECT * FROM table_with_duplicates GROUP BY unique_column;
To use a combination of columns, do:
INSERT INTO nonduplicates SELECT * FROM table_with_duplicates GROUP BY unique_column_1, unique_column_2, etc...;
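To see the whole round trip in a runnable form, here is a minimal sketch using Python’s `sqlite3` module (table and column names are made up for illustration). SQLite, like MySQL, lets you select non-aggregated columns in a `GROUP BY` query, keeping one arbitrary row per group; SQLite has no `CREATE TEMPORARY TABLE ... LIKE`, so the sketch uses `CREATE TEMPORARY TABLE ... AS SELECT` instead, and then copies the deduplicated rows back over the original table:

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# A hypothetical table with duplicate rows.
conn.executescript("""
    CREATE TABLE table_with_duplicates (id INTEGER PRIMARY KEY, email TEXT, name TEXT);
    INSERT INTO table_with_duplicates (email, name) VALUES
        ('a@example.com', 'Alice'),
        ('a@example.com', 'Alice'),
        ('b@example.com', 'Bob');
""")

# Group on the column(s) that should be unique, then replace the
# original table's contents with the deduplicated copy.
conn.executescript("""
    CREATE TEMPORARY TABLE nonduplicates AS
        SELECT * FROM table_with_duplicates GROUP BY email;
    DELETE FROM table_with_duplicates;
    INSERT INTO table_with_duplicates SELECT * FROM nonduplicates;
""")

rows = conn.execute(
    "SELECT email FROM table_with_duplicates ORDER BY email"
).fetchall()
print(rows)  # one row per distinct email
```

The copy-back step at the end is one way to finish the job; you could equally drop the original table and rename the new one.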
If you have a very large data set, this technique is going to take a lot of time and resources; it may not even be possible due to memory constraints. In that case, here is an article on analyzing duplicate rows without using temporary tables. In essence, you JOIN the table to itself in a subquery that matches all of your unique_columns against themselves, with a HAVING clause that compares each id against the MIN(id) of its group (or the MAX(id), depending on whether you want to keep the older or newer records). You could also use a timestamp column for this if there is one. This gives you much finer control over which records you keep. And in the time it takes you to understand my explanation, you could have read the (commented!) SQL code in the link above and understood it much better.
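The article’s version uses a self-join; the same keep-the-MIN(id) idea can be sketched more compactly with a grouped subquery, again in `sqlite3` with made-up names. (Note that MySQL won’t let a DELETE subquery read from the table being deleted from, so there you’d need the article’s join form or a derived table.)

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE users (id INTEGER PRIMARY KEY, email TEXT);
    INSERT INTO users (email) VALUES
        ('a@example.com'), ('a@example.com'),
        ('b@example.com'), ('a@example.com');
""")

# Keep the oldest record of each group (the MIN(id)); swap in MAX(id)
# to keep the newest instead.
conn.execute("""
    DELETE FROM users
    WHERE id NOT IN (
        SELECT MIN(id) FROM users GROUP BY email
    )
""")

kept = conn.execute("SELECT id, email FROM users ORDER BY id").fetchall()
print(kept)  # → [(1, 'a@example.com'), (3, 'b@example.com')]
```

Because the subquery picks MIN(id), rows 2 and 4 (the later duplicates of a@example.com) are the ones deleted.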
Of the many languages I work with, SQL is the most unintuitive to me. I ran across the above link while idly searching for links on SQL and set theory. I can think of functional languages in terms of mathematical functions, but I don’t have that mapping in my brain with SQL. I think it’s the syntax. It’s always felt so awkward and COBOL-ish to me. I guess I’m spoiled by Python.