Using SQLite to find patient IDs present in one table yet missing from another table.
Table of Contents
↑The problem statement
At work today, I had to review two lists of patient identifiers. Each list came from a different clinical system and represented the cohort of patients in those respective systems. One list was extensive, with thousands of identifiers. The other list was smaller. So I needed to find which identifiers in the smaller list weren’t present in the more extensive list.
The smaller list still had too many entries to check by hand, and this verification process needed several attempts as administrators tidied up each list. Therefore, it made sense to automate the verification process.
I realised I could automate this using SQLite. I could write a bash script to:
- Drop and re-create two database tables between each run cycle.
- Import each CSV file containing the identifiers into its relevant table.
- Run an SQL query to report which identifiers from table a can’t be found in table b.
↑The bash script
Here is the bash script I came up with:
1#!/bin/bash
2
3sqlite3 -cmd ".read create_tables.sql"
4 -cmd ".mode csv"
5 -cmd ".import cohort1.csv small_cohort"
6 -cmd ".import cohort2.csv large_cohort"
7 -cmd ".read select_results.sql"
8 dupes.db
The -cmd parameter of the sqlite3 command-line utility allows you to specify which commands need executing. We need to use multiple -cmd sequences to execute various commands in order. In the example above, we start by reading and running the create_tables.sql file, which drops and re-creates the tables.
We then switch the mode of operation to CSV before importing the two CSV files. The .import command has two parameters. The first parameter is the name of the CSV file containing your data. The second parameter instructs SQLite on which table the data should be imported into. Finally, we read and execute the select_results.sql file, which will find and list the missing patient identifiers.
↑The create table SQL
The SQL code below shows the contents of the create_tables.sql file. It starts by dropping the two tables to clear out the data from the previous verification run and then re-creates the table definitions. Each table has one field to hold the patient identifier. On the large cohort table, we also add a primary index to speed up the searching of entries in that table.
1DROP TABLE IF EXISTS small_cohort;
2DROP TABLE IF EXISTS large_cohort;
3
4CREATE TABLE small_cohort (
5 "nhs" TEXT
6);
7
8DROP TABLE IF EXISTS large_cohort;
9CREATE TABLE large_cohort (
10 "nhs" TEXT NOT NULL PRIMARY KEY
11);
↑The SQL select statement to find records not found in the large cohort table
1SELECT *
2 FROM small_cohort
3 WHERE nhs NOT IN (
4 SELECT nhs FROM large_cohort
5 );
The above SQL code is the content of the select_results.sql file, which selects all the records from the small_cohort table, which don’t have a corresponding record in the large_cohort table.