- 1. Introduction
- 2. Basics: Detecting Duplicates Using Only Key Columns
- 3. Extracting All Records with Duplicate Keys
- 4. Duplicate Detection with Multiple Columns
- 5. Removing Duplicate Records (Deletion)
- 6. Performance Considerations and Index Strategies
- 7. Advanced Examples: Handling Complex Cases
- 8. Summary
- 9. FAQ: Common Questions About Extracting and Deleting Duplicate Data in MySQL
  - 9.1 Q1. Why use GROUP BY + HAVING instead of DISTINCT?
  - 9.2 Q2. Which should I use, IN or EXISTS?
  - 9.3 Q3. How do I check for duplicates across multiple columns?
  - 9.4 Q4. Error 1093 occurs in the DELETE statement. How do I handle it?
  - 9.5 Q5. How can I safely delete duplicate data?
  - 9.6 Q6. What are the measures when queries are slow due to large data volumes?
  - 9.7 Q7. How can I fundamentally prevent duplicate registrations?
  - 9.8 Q8. Can the same methods be used in MariaDB or other RDBMS?
1. Introduction
When operating a database, it’s not uncommon to encounter issues such as “the same data being registered multiple times” or “information that should be unique being duplicated.” Especially in environments using relational databases like MySQL, extracting and managing duplicate data is an essential task to maintain data accuracy and quality.
For example, in core tables of business systems such as member information, product data, and order history, there is a risk of the same data being registered twice due to user operation errors or system malfunctions. If left as is, it can lead to decreased accuracy in aggregation and analysis, or become the cause of unexpected bugs and troubles.
To solve such “duplicate data problems,” first extract which records are duplicated, and then organize and delete the duplicate records as appropriate. However, with only standard SELECT statements in MySQL, it’s often difficult to efficiently find “duplicated data,” requiring more advanced SQL writing techniques and practical tips.
In this article, taking “Methods to Extract Duplicate Data in MySQL” as the theme, we will explain in detail from basic SQL statements to practical application examples, considerations for performance, and handling common errors. We aim to provide information as practical as possible, useful not only for database beginners but also for engineers who write SQL daily on the job.
2. Basics: Detecting Duplicates Using Only Key Columns
The most basic method for extracting duplicate data in MySQL is to identify cases where multiple records share the same value in specific columns (key columns). This section explains representative SQL statements for extracting key values with duplicates, along with their execution examples.
2-1. Duplicate Extraction Using GROUP BY and HAVING Clauses
The basics of duplicate detection involve grouping rows by the columns specified in the `GROUP BY` clause and then filtering with the `HAVING` clause for groups that contain two or more records. The following is a typical example.
SELECT key_column, COUNT(*) AS duplicate_count
FROM table_name
GROUP BY key_column
HAVING COUNT(*) > 1;
Example: Extracting Duplicates in Member Email Addresses
SELECT email, COUNT(*) AS count
FROM users
GROUP BY email
HAVING COUNT(*) > 1;
Running this query displays each email address that is registered more than once, along with its duplicate count (count).
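As a runnable sanity check, the same pattern can be tried with Python's built-in sqlite3 module, since `GROUP BY`/`HAVING` behave the same way there as in MySQL (the users table and sample rows below are invented for illustration):

```python
import sqlite3

# In-memory database with a minimal, made-up "users" table
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (id INTEGER PRIMARY KEY, email TEXT)")
conn.executemany(
    "INSERT INTO users (email) VALUES (?)",
    [("a@example.com",), ("b@example.com",),
     ("a@example.com",), ("c@example.com",), ("a@example.com",)],
)

# Each email registered more than once, with its count
rows = conn.execute("""
    SELECT email, COUNT(*) AS count
    FROM users
    GROUP BY email
    HAVING COUNT(*) > 1
""").fetchall()
print(rows)  # [('a@example.com', 3)]
```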
2-2. Duplicate Extraction for Multiple Columns Using GROUP BY
If you want to detect duplicates based on combinations of two or more columns, you can specify multiple columns in GROUP BY using the same approach.
SELECT col1, col2, COUNT(*) AS duplicate_count
FROM table_name
GROUP BY col1, col2
HAVING COUNT(*) > 1;
With this method, you can detect duplicate data where multiple conditions match exactly, such as “same name and date of birth” or “same product ID and order date.”
2-3. Grasping the Total Number of Duplicate Records
If you want to know the overall picture of how many duplicates there are, you can use a subquery to calculate the total number of duplicates.
SELECT SUM(duplicate_count) AS total_duplicates
FROM (
SELECT COUNT(*) AS duplicate_count
FROM table_name
GROUP BY key_column
HAVING COUNT(*) > 1
) AS duplicates;
This query totals the number of rows that belong to duplicated groups. Note that it counts every row in each group, not just the surplus; to count only the rows that would be removed by deduplication, sum `COUNT(*) - 1` instead.
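The difference between "all rows in duplicated groups" and "only the surplus rows" is easy to see in a small runnable sketch (SQLite via Python's sqlite3 as a stand-in for MySQL; table and data invented):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE t (key_column TEXT)")
conn.executemany("INSERT INTO t VALUES (?)",
                 [("x",), ("x",), ("x",), ("y",), ("y",), ("z",)])

# All rows belonging to duplicated groups: x(3) + y(2) = 5
total = conn.execute("""
    SELECT SUM(duplicate_count) FROM (
        SELECT COUNT(*) AS duplicate_count
        FROM t GROUP BY key_column
        HAVING COUNT(*) > 1) AS duplicates
""").fetchone()[0]

# Only the surplus rows that deduplication would remove: (3-1) + (2-1) = 3
excess = conn.execute("""
    SELECT SUM(duplicate_count - 1) FROM (
        SELECT COUNT(*) AS duplicate_count
        FROM t GROUP BY key_column
        HAVING COUNT(*) > 1) AS duplicates
""").fetchone()[0]
print(total, excess)  # 5 3
```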
In this way, by combining `GROUP BY` and `HAVING`, you can extract duplicate data simply and efficiently in MySQL.
3. Extracting All Records with Duplicate Keys
In the previous chapter, we introduced a method to list only the duplicated key values. In real-world scenarios, however, you often want to examine the full contents of the duplicated records, for example to scrutinize the details of duplicate user registrations or individual product records.
In this chapter, we will explain practical SQL patterns for extracting all records with duplicate keys.
3-1. Extracting Duplicate Records Using Subqueries
The most basic method is to obtain a list of duplicate key values using a subquery and then retrieve all matching records.
SELECT *
FROM table_name
WHERE key_column IN (
SELECT key_column
FROM table_name
GROUP BY key_column
HAVING COUNT(*) > 1
);
Example: Extracting All Records with Duplicate Email Addresses
SELECT *
FROM users
WHERE email IN (
SELECT email
FROM users
GROUP BY email
HAVING COUNT(*) > 1
);
Executing this query will extract all rows in the “users” table where email addresses are duplicated (including ID, registration date, etc.).
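A minimal runnable version of this `IN` pattern, again using SQLite through Python's sqlite3 as a stand-in for MySQL (table and rows invented):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (id INTEGER PRIMARY KEY, email TEXT)")
conn.executemany("INSERT INTO users (email) VALUES (?)",
                 [("a@example.com",), ("b@example.com",), ("a@example.com",)])

# Every full row whose email occurs more than once
rows = conn.execute("""
    SELECT id, email FROM users
    WHERE email IN (
        SELECT email FROM users
        GROUP BY email HAVING COUNT(*) > 1)
    ORDER BY id
""").fetchall()
print(rows)  # [(1, 'a@example.com'), (3, 'a@example.com')]
```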
3-2. Efficient Extraction Using the EXISTS Clause
For large datasets or when focusing on performance, using the EXISTS
clause is also effective. IN
and EXISTS
are similar, but depending on the data volume and index conditions, one may perform faster in certain cases.
SELECT *
FROM table_name t1
WHERE EXISTS (
SELECT 1
FROM table_name t2
WHERE t1.key_column = t2.key_column
GROUP BY t2.key_column
HAVING COUNT(*) > 1
);
Example: Duplicate Email Records (EXISTS Clause)
SELECT *
FROM users u1
WHERE EXISTS (
SELECT 1
FROM users u2
WHERE u1.email = u2.email
GROUP BY u2.email
HAVING COUNT(*) > 1
);
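The `EXISTS` variant can be sketched the same way (SQLite through Python's sqlite3, invented data):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (id INTEGER PRIMARY KEY, email TEXT)")
conn.executemany("INSERT INTO users (email) VALUES (?)",
                 [("a@example.com",), ("b@example.com",), ("a@example.com",)])

# EXISTS is true only when u1's email forms a group of 2+ rows
rows = conn.execute("""
    SELECT id, email FROM users u1
    WHERE EXISTS (
        SELECT 1 FROM users u2
        WHERE u1.email = u2.email
        GROUP BY u2.email
        HAVING COUNT(*) > 1)
    ORDER BY id
""").fetchall()
print(rows)  # [(1, 'a@example.com'), (3, 'a@example.com')]
```

An equivalent correlated form that avoids `GROUP BY` inside the subquery is `WHERE EXISTS (SELECT 1 FROM users u2 WHERE u2.email = u1.email AND u2.id <> u1.id)`, which can be answered directly from an index on email.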
3-3. Notes and Performance
- Subquery performance can degrade significantly with large amounts of data. If indexes are properly set, both `IN` and `EXISTS` will operate at a practical level.
- However, when determining duplicates with complex conditions or across multiple columns, the SQL can become heavy, so verify behavior in a test environment beforehand.
In this way, extracting all records that match duplicate keys can be achieved using subqueries or the `EXISTS` clause.
4. Duplicate Detection with Multiple Columns
Duplicate data determination conditions are not necessarily limited to a single column. In practical work, there are frequent scenarios where you want to ensure uniqueness by combining multiple columns. For example, cases where you consider duplicates when multiple items such as “name + birthdate” or “product ID + color + size” are all the same.
In this chapter, we will explain in detail how to extract duplicates using multiple columns.
4-1. Specifying Multiple Columns in GROUP BY for Duplicate Detection
If you want to detect duplicates across multiple columns, list them separated by commas in the `GROUP BY` clause. With `HAVING COUNT(*) > 1`, you extract only the combinations that occur two or more times.
SELECT col1, col2, COUNT(*) AS duplicate_count
FROM table_name
GROUP BY col1, col2
HAVING COUNT(*) > 1;
Example: Extracting Duplicates by “first_name” and “birthday”
SELECT first_name, birthday, COUNT(*) AS count
FROM users
GROUP BY first_name, birthday
HAVING COUNT(*) > 1;
Using this query, you can identify cases where combinations of the same name and same birthdate are registered multiple times.
4-2. Extracting All Duplicate Records for Multiple Columns
If you need all record details that have duplicate key combinations, use a subquery to extract the duplicate pairs and then extract all rows matching those combinations.
SELECT *
FROM table_name t1
WHERE (col1, col2) IN (
SELECT col1, col2
FROM table_name
GROUP BY col1, col2
HAVING COUNT(*) > 1
);
Example: All Records with Duplicates in “first_name” and “birthday”
SELECT *
FROM users u1
WHERE (first_name, birthday) IN (
SELECT first_name, birthday
FROM users
GROUP BY first_name, birthday
HAVING COUNT(*) > 1
);
With this query, for example, if the combination “Tarō Tanaka, 1990-01-01” is registered multiple times, all the detailed data can be obtained.
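A runnable sketch of the row-value `IN` pattern, using SQLite (which supports row values since version 3.15) through Python's sqlite3; names and rows are invented:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""CREATE TABLE users
    (id INTEGER PRIMARY KEY, first_name TEXT, birthday TEXT)""")
conn.executemany(
    "INSERT INTO users (first_name, birthday) VALUES (?, ?)",
    [("Taro", "1990-01-01"), ("Hanako", "1985-05-05"),
     ("Taro", "1990-01-01"), ("Taro", "1991-02-02")])

# All rows whose (first_name, birthday) pair occurs more than once;
# the Taro with a different birthday (id 4) is correctly excluded
rows = conn.execute("""
    SELECT id, first_name, birthday FROM users
    WHERE (first_name, birthday) IN (
        SELECT first_name, birthday FROM users
        GROUP BY first_name, birthday HAVING COUNT(*) > 1)
    ORDER BY id
""").fetchall()
print(rows)  # [(1, 'Taro', '1990-01-01'), (3, 'Taro', '1990-01-01')]
```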
4-3. Duplicate Determination for Exact Matches (COUNT DISTINCT)
If you want to know how many rows share exactly the same values across all of several columns, you can also aggregate using `COUNT(DISTINCT ...)`.
SELECT COUNT(*) - COUNT(DISTINCT col1, col2) AS duplicate_count
FROM table_name;
This SQL gives the number of surplus rows, that is, complete duplicates. It is exact as long as neither column contains NULL, since `COUNT(DISTINCT ...)` skips rows containing NULL.
4-4. Precautions
- Even when detecting duplicates with multiple columns, properly setting indexes can greatly improve search speed.
- When there are many columns or when NULLs are included, unintended duplicate extractions may occur, so design the conditions carefully.
This way, duplicate detection and extraction with multiple columns can also be handled flexibly depending on SQL techniques.
5. Removing Duplicate Records (Deletion)
Once you can extract duplicate data, the next necessary step is “deleting unnecessary duplicate records.” In practice, the process of keeping only “one record from the duplicates and deleting the rest” is commonly performed. However, when automatically deleting duplicates in MySQL, to prevent unintended data loss, it is necessary to clearly narrow down the deletion targets.
This chapter explains in detail the representative methods for safely deleting duplicate data and the precautions to take during the process.
5-1. Deleting Duplicates Using Subqueries + DELETE
To keep only the “oldest” or “newest” among duplicate records and delete the others, a DELETE statement using a subquery is useful.
Example: Keep only the record with the smallest ID (oldest) and delete the rest
DELETE FROM users
WHERE id NOT IN (
SELECT MIN(id)
FROM users
GROUP BY email
);
This query keeps only the smallest id for each email (the record registered first) and deletes all other rows with the same email. As written, however, it will trigger MySQL Error 1093, because the subquery reads from the table being deleted from; the next section shows the workaround.
5-2. Avoiding MySQL-Specific Errors (Error 1093)
In MySQL, referencing the same table within a subquery during DELETE can cause Error 1093. In such cases, you can avoid this by wrapping the subquery result as a temporary table (derived table).
DELETE FROM users
WHERE id NOT IN (
SELECT * FROM (
SELECT MIN(id)
FROM users
GROUP BY email
) AS temp_ids
);
By wrapping the subquery in a further `SELECT * FROM (...) AS alias` like this, you can avoid the error and perform the deletion safely.
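The keep-the-smallest-id pattern is easy to verify end to end with SQLite through Python's sqlite3 (invented data). SQLite has no Error 1093, so the subquery may reference the table directly; in MySQL, use the derived-table wrapper shown above:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (id INTEGER PRIMARY KEY, email TEXT)")
conn.executemany("INSERT INTO users (email) VALUES (?)",
                 [("a@example.com",), ("b@example.com",),
                  ("a@example.com",), ("a@example.com",)])

# Keep the row with the smallest id per email, delete the rest
conn.execute("""
    DELETE FROM users
    WHERE id NOT IN (SELECT MIN(id) FROM users GROUP BY email)
""")
rows = conn.execute("SELECT id, email FROM users ORDER BY id").fetchall()
print(rows)  # [(1, 'a@example.com'), (2, 'b@example.com')]
```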
5-3. Deleting Duplicates for Multiple Columns
When you want to delete duplicates based on a combination of multiple columns, similarly use `GROUP BY` on those columns and delete the remaining records.
Example: Delete all but the first record that duplicates on “first_name” and “birthday”
DELETE FROM users
WHERE id NOT IN (
SELECT * FROM (
SELECT MIN(id)
FROM users
GROUP BY first_name, birthday
) AS temp_ids
);

5-4. Safety Measures and Best Practices for Deletion
Duplicate deletion is a high-risk operation that can lead to data loss. Be sure to pay attention to the following points.
- Take a Backup: Always save a backup of the entire table or the target records before deletion.
- Use Transactions: If possible, utilize transaction features to allow immediate rollback of “incorrect deletions.”
- First Confirm the Number of Targets with SELECT: Make it a habit to extract and verify with a SELECT statement beforehand whether “the deletion targets are as expected.”
- Check Indexes: Adding indexes to the columns used for duplicate detection improves performance and accuracy.
In MySQL, you can safely delete duplicate data by utilizing subqueries or temporary tables. Perform deletion operations cautiously, with thorough verification and a solid backup system in place.
6. Performance Considerations and Index Strategies
When extracting or deleting duplicate data in MySQL, the larger the table’s data volume, the more the query execution speed and server load become issues. Especially in large-scale systems or batch processing, performance-conscious SQL design and index optimization are important. In this chapter, we will explain tips for performance improvement and key points for index design in duplicate data processing.
6-1. Using EXISTS, IN, and JOIN Appropriately
The `IN`, `EXISTS`, and `JOIN` constructs used when extracting duplicate data each have different characteristics and performance tendencies.
- `IN`: fast when the subquery's result set is small, but performance tends to degrade as the result grows.
- `EXISTS`: stops searching as soon as a matching record is found, making it effective for large tables or when matches are few.
- `JOIN`: retrieves a lot of information at once, but joining unnecessary data can actually slow things down.
Performance Comparison Example
| Syntax | Small Data | Large Data | Comments |
|---|---|---|---|
| IN | Excellent | Poor | Large result sets are slow |
| EXISTS | Good | Excellent | Advantageous for large-scale DBs |
| JOIN | Good | Good | Appropriate indexes required |
It is important to select the optimal syntax based on the actual system and data volume.
6-2. Importance of Index Design
For columns used in duplicate checks or as deletion targets, always set indexes. Without indexes, a full table scan occurs, making it extremely slow.
Index Setup Example
CREATE INDEX idx_email ON users(email);
Even when determining duplicates with multiple columns, composite indexes are effective.
CREATE INDEX idx_name_birthday ON users(first_name, birthday);
Depending on the index design, read speed and search efficiency can change significantly.
However, creating too many indexes can cause write speed degradation and increased storage, so balance is important.
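A small sketch of creating the index and inspecting the plan, using SQLite through Python's sqlite3 (SQLite's `EXPLAIN QUERY PLAN` is a rough analogue of MySQL's `EXPLAIN`; table name invented):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (id INTEGER PRIMARY KEY, email TEXT)")
conn.execute("CREATE INDEX idx_email ON users(email)")

# The index is now visible in the schema catalog
names = [r[0] for r in conn.execute(
    "SELECT name FROM sqlite_master WHERE type = 'index'")]
print(names)  # ['idx_email']

# Check how the duplicate-detection query will be executed;
# the plan rows mention idx_email when the index is usable
plan = conn.execute("""
    EXPLAIN QUERY PLAN
    SELECT email, COUNT(*) FROM users
    GROUP BY email HAVING COUNT(*) > 1
""").fetchall()
for row in plan:
    print(row)
```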
6-3. Batch Processing for Large Amounts of Data
- When the data volume is on the scale of tens of thousands to millions of records, it is safer to split the work into batches rather than processing everything at once.
- For deletions or updates, limit the number of records processed per run (e.g., `LIMIT 1000`) and execute in multiple passes to prevent lock contention and performance degradation.
DELETE FROM users
WHERE id IN (
-- First 1000 records among the duplicate records extracted by the subquery
)
LIMIT 1000;
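A batched-deletion loop can be sketched as follows with SQLite through Python's sqlite3 (batch size and data invented). Selecting the batch of ids first also sidesteps MySQL's restriction that `LIMIT` cannot appear directly inside an `IN` subquery, as well as SQLite's lack of `DELETE ... LIMIT` in default builds:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (id INTEGER PRIMARY KEY, email TEXT)")
conn.executemany("INSERT INTO users (email) VALUES (?)",
                 [("dup@example.com",)] * 10 + [("keep@example.com",)])

BATCH = 3
deleted = 0
while True:
    # Up to BATCH removable ids (everything but MIN(id) per email)
    ids = [r[0] for r in conn.execute("""
        SELECT id FROM users
        WHERE id NOT IN (SELECT MIN(id) FROM users GROUP BY email)
        LIMIT ?""", (BATCH,))]
    if not ids:
        break
    conn.execute("DELETE FROM users WHERE id IN (%s)"
                 % ",".join("?" * len(ids)), ids)
    conn.commit()  # committing per batch keeps each transaction short
    deleted += len(ids)

rows = conn.execute("SELECT email FROM users ORDER BY id").fetchall()
print(deleted, rows)  # 9 [('dup@example.com',), ('keep@example.com',)]
```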
6-4. Utilizing Execution Plans (EXPLAIN)
Use `EXPLAIN`, which shows the SQL execution plan, to analyze how a query is actually executed. You can check in advance whether indexes are being used effectively and whether full table scans (type = ALL) are occurring.
EXPLAIN SELECT * FROM users WHERE email IN (...);
In this way, by being mindful of performance and index strategies, duplicate processing of large amounts of data can also be done safely and efficiently.
7. Advanced Examples: Handling Complex Cases
In real-world business environments, beyond simple duplicate detection and deletion, advanced needs such as “adding more complex conditions” or “wanting to execute safely in batches” are not uncommon. This chapter introduces applied practical examples, including duplicate processing with multiple conditions and operational techniques emphasizing safety and reliability.
7-1. Duplicate Deletion with Multiple Conditions
When you want to delete only those among duplicate records that meet specific conditions, make effective use of the WHERE clause.
Example: Delete only duplicates with the same email address and “withdrawn” status
DELETE FROM users
WHERE id NOT IN (
SELECT * FROM (
SELECT MIN(id)
FROM users
WHERE status = 'withdrawn'
GROUP BY email
) AS temp_ids
)
AND status = 'withdrawn';
By adding conditions to the `WHERE` clause or `GROUP BY` like this, you can narrow down which records to keep or delete.
7-2. Recommendations for Batch Processing and Split Execution
If the data volume is huge or you want to avoid lock contention and performance degradation, make use of batch processing.
- Instead of processing all deletion targets at once, use `LIMIT` for split execution
- Use transaction control to roll back if any errors occur
- Manage risk through backups and log output
DELETE FROM users
WHERE id IN (
SELECT id FROM (
-- Extract duplicate record IDs filtered by conditions
) AS temp_ids
)
LIMIT 500;
By splitting like this, you can significantly reduce the load on the system.
7-3. Handling Complex Duplicate Definitions
When the conditions considered as “duplicates” differ by business, flexibly respond by combining subqueries, CASE statements, and aggregate functions.
Example: Consider as duplicates only when “same product ID, same order date, and matching price”
SELECT product_id, order_date, price, COUNT(*)
FROM orders
GROUP BY product_id, order_date, price
HAVING COUNT(*) > 1;
Furthermore, complex rules such as "keep only the newest among duplicates" can be implemented using subqueries or `ROW_NUMBER()` (MySQL 8.0 and later).
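A sketch of the keep-only-the-newest rule with `ROW_NUMBER()`, runnable in SQLite 3.25+ through Python's sqlite3 (column names such as created_at are invented):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""CREATE TABLE users
    (id INTEGER PRIMARY KEY, email TEXT, created_at TEXT)""")
conn.executemany(
    "INSERT INTO users (email, created_at) VALUES (?, ?)",
    [("a@example.com", "2024-01-01"),
     ("a@example.com", "2024-03-01"),
     ("b@example.com", "2024-02-01")])

# Rank rows per email, newest first; rn = 1 marks the row to keep,
# so rows with rn > 1 are the deletion candidates
rows = conn.execute("""
    SELECT id, email, rn FROM (
        SELECT id, email,
               ROW_NUMBER() OVER (PARTITION BY email
                                  ORDER BY created_at DESC) AS rn
        FROM users) AS ranked
    WHERE rn > 1
""").fetchall()
print(rows)  # [(1, 'a@example.com', 2)]
```

In MySQL 8.0, the same ranked subquery can feed a `DELETE ... WHERE id IN (...)`, keeping the derived-table wrapping to avoid Error 1093.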
7-4. Best Practices for Transactions and Backups
- Always wrap deletions or updates in a transaction so that if a problem occurs midway, you can revert with `ROLLBACK`.
- For important tables or operations touching many records, always take a backup in advance, and perform the deletion only after the backup is complete.
By mastering these applied and advanced techniques, you can flexibly and safely handle duplicate data processing in any environment.
8. Summary
In this article, we have systematically explained the extraction and deletion of duplicate data using MySQL, from basics to advanced topics. Here, let’s review the main points again.
8-1. Review of the Article’s Key Points
- Detecting Duplicate Data: duplicates can be extracted not only from a single column but also from combinations of multiple columns. The combination of `GROUP BY` and `HAVING COUNT(*) > 1` is the basic pattern for all duplicate detection.
- Extracting All Duplicate Records: by using subqueries or the `EXISTS` clause, you can retrieve every record that matches a duplicated key value.
- Deleting Duplicate Records: by keeping a representative row with `MIN(id)` or `MAX(id)` and applying subqueries + DELETE statements, you can safely delete only the unnecessary duplicates. Avoiding MySQL Error 1093 is also an important point.
- Performance and Indexes: for large-scale data or complex conditions, appropriate indexes, batch processing, and checking the execution plan with `EXPLAIN` are essential.
- Practical Application Techniques: keep in mind conditional deletion, split execution, transaction management, and backups to avoid failures in practice.
8-2. Quick Reference Table of Recommended Approaches by Use Case
| Use Case | Recommended Method |
|---|---|
| Single-column duplicate detection | GROUP BY + HAVING |
| Multi-column duplicate detection | GROUP BY on multiple columns + HAVING |
| Retrieving all duplicate records | Subquery (IN/EXISTS) |
| Safe deletion | Subquery + derived table for DELETE |
| High-speed processing of large data | Indexes + batch processing + EXPLAIN |
| Conditional duplicate deletion | Combine with WHERE clause and transactions |
8-3. To Prevent Future Issues
Preventing duplicate data “at the time of registration” is also important.
- During table design, consider utilizing unique constraints (UNIQUE).
- Regular cleanup and auditing of existing data also lead to early detection of operational issues.
MySQL-based extraction and deletion of duplicate data is a field that requires a wide range of knowledge and practical skills from SQL basics to advanced applications. I hope the content of this article proves useful for on-site data maintenance and system operations.
If you have any questions or specific cases, please consult the FAQ or experts.
9. FAQ: Common Questions About Extracting and Deleting Duplicate Data in MySQL
Q1. Why use `GROUP BY` + `HAVING` instead of `DISTINCT`?
`DISTINCT` removes duplicates from the result set, but it cannot tell you which values are duplicated or how many times. By combining `GROUP BY` and `HAVING COUNT(*) > 1`, you can extract exactly that detail: which values appear multiple times, and how many duplicates there are.
Q2. Which should I use, `IN` or `EXISTS`?
For small amounts of data there is little difference between the two, but when tables are large or indexes are usable, `EXISTS` is often faster. It's recommended to try both in your actual environment and compare execution plans with `EXPLAIN`.
Q3. How do I check for duplicates across multiple columns?
By specifying multiple columns in `GROUP BY` and using `HAVING COUNT(*) > 1`, you can detect duplicates where the entire combination of columns matches. Example: `GROUP BY first_name, birthday`
Q4. Error 1093 occurs in the DELETE statement. How do I handle it?
MySQL returns Error 1093 when a subquery within a DELETE statement references the same table being deleted from. To avoid this, wrap the subquery result in a derived table (`SELECT * FROM (...) AS alias`).
Q5. How can I safely delete duplicate data?
Always take a backup before deleting, and first confirm the targets with a `SELECT` statement. If possible, use transactions, and perform incremental deletion or batch processing as needed.
Q6. What are the measures when queries are slow due to large data volumes?
Set indexes on the columns used for duplicate detection, and for large workloads, execute in batches using `LIMIT`. Check the query execution plan with `EXPLAIN` and make sure no unnecessary full scans are occurring.
Q7. How can I fundamentally prevent duplicate registrations?
At table design time, add `UNIQUE` constraints or unique keys so that duplicate values are rejected at registration. After going into operation, it's also good to run periodic duplicate checks and data cleansing.
Q8. Can the same methods be used in MariaDB or other RDBMS?
Basic SQL syntax (`GROUP BY`, `HAVING`, subqueries) works the same in other RDBMSs such as MariaDB and PostgreSQL. However, subquery restrictions in DELETE statements and performance characteristics differ per product, so always verify in advance.