MySQL 从重复条目中清理表并在依赖表中重新链接 FK

2022-01-20 00:00:00 duplicates foreign-keys duplicate-removal mysql

这是我的情况:我有 2 个表，patient 和 study.

Here is my situation: I have 2 tables, patient and study.

每个表都有自己的使用自动增量的 PK.

Each table has its own PK using autoincrement.

就我而言，pat_id 应该是唯一的.它没有在数据库级别声明为唯一，因为它在某些用途中可能是非唯一的(它不是自制系统).我发现了如何配置系统以将 pat_id 视为唯一，但我现在需要清理重复患者的数据库并将研究表中的重复患者重新链接到剩余的唯一患者，然后再删除重复的患者.

In my case, the pat_id should be unique. It's not declared as unique at database level since it could be non unique is some uses (it's not a home made system). I found out how to configure the system to consider the pat_id as unique, but I need now to cleanup the database for duplicated patients AND relink duplicated patients in study table to remaining unique patient, before deleting the duplicated patients.

患者表:

CREATE TABLE `patient` ( `pk` BIGINT(20) NOT NULL AUTO_INCREMENT, `pat_id` VARCHAR(250) COLLATE latin1_bin DEFAULT NULL, ... `pat_name` VARCHAR(250) COLLATE latin1_bin DEFAULT NULL, ... `pat_custom1` VARCHAR(250) COLLATE latin1_bin DEFAULT NULL .... PRIMARY KEY (`pk`) )ENGINE=InnoDB;

学习桌:

CREATE TABLE `study` ( `pk` BIGINT(20) NOT NULL AUTO_INCREMENT, `patient_fk` BIGINT(20) DEFAULT NULL, ... PRIMARY KEY (`pk`), ... CONSTRAINT `patient_fk` FOREIGN KEY (`patient_fk`) REFERENCES `patient` (`pk`) )ENGINE=InnoDB;

我发现了一些类似的问题，但不是完全相同的问题，尤其是缺少与剩余唯一患者的外键链接.

I found some similar questions, but not exactly the same issue, especially it was missing the link of the foreign keys to the remaining unique patient.

重复条目的清理更新

只更新 MySQL 中重复条目的第一条记录

推荐答案

我就是这样做的.

我重用了 patient 表中未使用的字段来标记非重复 (N)、重复的第一个 (X) 和其他重复的患者 (Y).您还可以为此添加一列(并在使用后将其删除).

I reused an unused field in patient table to mark non duplicated (N), 1st of duplicated (X), and other duplicated patients (Y). You could also add a column for this (and drop it after use).

以下是我清理数据库所遵循的步骤:

Here are the steps I followed to cleanup my database:

/*1: List duplicated */ select pk,pat_id, t.`pat_id_issuer`, t.`pat_name`, t.pat_custom1 from patient t where pat_id in ( select pat_id from ( select pat_id, count(*) from patient group by 1 having count(*)>1 ) xxx); /*2: Delete orphan patients */ delete from patient where pk not in (select patient_fk from study); /*3: Reset flag for duplicated (or not) patients*/ update patient t set t.`pat_custom1`='N'; /*4: Mark all duplicated */ update patient t set t.`pat_custom1`='Y' where pat_id in ( select pat_id from ( select pat_id, count(*) from patient group by 1 having count(*)>1 ) xxx) ; /*5: Unmark the 1st of the duplicated*/ update patient t join (select pk from ( select min(pk) as pk, pat_id from patient where pat_custom1='Y' group by pat_id ) xxx ) x on (x.pk=t.pk) set t.`pat_custom1`='X' where pat_custom1='Y' ; /*6: Verify update is correct*/ select pk, pat_id,pat_custom1 from `patient` where pat_custom1!='N' order by pat_id, pat_custom1; /*7: Verify studies linked to duplicated patient */ select p.* from study s join patient p on (p.pk=s.patient_fk) where p.pat_custom1='Y'; /*8: Relink duplicated patients */ update study s join patient p on (p.pk=s.patient_fk) set patient_fk = (select pk from patient pp where pp.pat_id=p.pat_id and pp.pat_custom1='X') where p.pat_custom1='Y'; /*9: Delete newly orphan patients */ delete from patient where pk not in (select patient_fk from study); /* 10: reset flag */ update patient t set t.`pat_custom1`=null; /* 11: Commit changes */ commit;

当然有更短的方法，使用更智能(复杂?)的 SQL，但我个人更喜欢简单的方法.这也让我可以检查每个步骤是否符合我的预期.

There is certainly a shorter way, with a some smarter (complicated?) SQL, but I personally prefer the simple way. This also allows me to check each step is doing what I expect.

相关文章