Using MySQL's full-text search to find a document needle in an audit haystack

Written by Paul Bradley

midjourney ai - full-text search, cinematic, 8K, anime

Table of Contents
  1. Problem Statement
  2. Attempts that caused full table scans
  3. Adding a FULLTEXT Index
  4. Using MATCH on the FULLTEXT Index

Problem Statement

A clinical system supplier stores its ATNA audit logs into a MySQL database. The audit log database table contains fields for the date/time and transaction type. It also includes the raw XML (Extensible Markup Language) of the ATNA message. The XML complies with the schema for the RFC3881 specification for Security Audit and Access Accountability.

We needed to search the audit records to determine which clinicians viewed a given document.

Each document in the system has a unique ID in the format of:-

11.2.40.0.13.1.1.2117640092.20181111015704495.46405

The ATNA audit XML message can be large, depending on the transaction type, and the document ID is buried deep within the message. Hence it’s like looking for a needle in a haystack, especially when the audit table can hold 600 million records annually.

Attempts that caused full table scans

The first attempts at finding the audit records which contained the unique document ID are shown below.

While both these SQL queries did find the correct records, they took hours to run. In addition, these queries strained the MySQL server as they performed full table scans, checking every row/record in the database.

 1SELECT AUDIT_TRAIL_ETYPEDESC,
 2       AUDIT_TRAIL_EDATETIME,
 3       AUDIT_TRAIL_XML
 4  FROM audit_trail 
 5 WHERE AUDIT_TRAIL_XML LIKE "%1.2.40.0.13.1.1.2117640092.20181111015704495.46405%";
 6
 7SELECT AUDIT_TRAIL_ETYPEDESC,
 8       AUDIT_TRAIL_EDATETIME,
 9       AUDIT_TRAIL_XML
10  FROM audit_trail 
11 WHERE INSTR(AUDIT_TRAIL_XML, "1.2.40.0.13.1.1.2117640092.20181111015704495.46405") > 0;

We needed to find a quicker way of mining the audit records.

Adding a FULLTEXT Index

So we started by adding a full-text index on the XML field by using the following command:

1CREATE FULLTEXT INDEX idx_audit_trail_xml ON audit_trail(AUDIT_TRAIL_XML);

We then verified that adding the index didn’t affect INSERT speed and throughput as the table is a write-heavy table within the database.

Using MATCH on the FULLTEXT Index

Now that we had the full-text index, we could alter the SQL query to ensure that the index was used.

As you can see from the query below, we use the MATCH keyword to invoke the full-text index. So we’re matching AGAINST the three words separated by spaces. The three words are the last three number segments of the document’s unique ID.

MySQL’s full-text implementation breaks words on the full stop (.), so its index will contain the three number segments as individual words. As we’ve included the words to match within speech marks (“), we’re telling MySQL that the three words must appear next to each other within the message.

1SELECT AUDIT_TRAIL_ETYPEDESC,
2       AUDIT_TRAIL_EDATETIME,
3       AUDIT_TRAIL_XML
4  FROM audit_trail
5 WHERE MATCH(AUDIT_TRAIL_XML)
6AGAINST ('"2117640092 20181111015704495 46405"' IN BOOLEAN MODE)
7GROUP BY AUDIT_TRAIL_PK;

The above SQL query returned the results in milliseconds instead of hours.

You may ask why I’m only using the last three number segments. Why not use the whole number? The MySQL implementation doesn’t index words less than three characters long.

So the numbers highlighted 1.2.40.0.13.1.1.2117640092.20181111015704495.46405 wouldn’t be in the full-text index. Luckily, there is enough uniqueness in the last three number segments to allow us to find that needle in the audit haystack.