Extracting Covid-19 Vaccination Records from an HTML table of results
How to use GoQuery to extract vaccination records from an HTML table of results.
Covid-19 Vaccination Records
I’ve been investigating the possibility of automating the extraction of Covid-19 vaccination records for a cohort of patients. Having the vaccination record would help pharmacy departments plan and prepare treatment for these identified patients.
To achieve this I used the Patient Identifier Cross Referencing (PIX) profile of the regional Health Information Exchange to look up a patient’s NHS number based on the supplied local hospital medical record number. Once I had the NHS number, I could use Healthcare Gateway’s detailed care record API to retrieve the Medications HTML view.
Once I had the medications view, I needed to parse the vaccination records out of all the medications listed in the HTML tables returned. Usually, I wouldn’t attempt to parse information out of an HTML table. Fortunately, the HTML returned by the MIG (Healthcare Gateway’s Medical Interoperability Gateway) has enough structure to extract the information I needed. For example, the table cells carry CSS class names, which allowed me to determine which cells held the date and which held the vaccination type.
I coded the server-side API in the open-source programming language, Go. I also used the goquery package to do the heavy lifting of parsing the HTML. goquery brings a syntax and a set of features similar to jQuery to the Go language, which made this task easy.
Parsing the Medications Records
The Go code below shows how the detailed care record response is passed into a new goquery document. It also demonstrates using goquery to iterate through each HTML table and each row within each table.
// requires "log", "strings" and github.com/PuerkitoBio/goquery
doc, err := goquery.NewDocumentFromReader(strings.NewReader(DCRV2Response))
if err != nil {
    log.Fatalf("error parsing supplied HTML: %v", err)
}

// find each table within the HTML
doc.Find("table").Each(func(index int, tablehtml *goquery.Selection) {

    // loop over each row within the table
    tablehtml.Find("tr").Each(func(indextr int, rowhtml *goquery.Selection) {
        // row processing goes here
    })
})
Next, I needed to make sure that I only processed rows which contained an actual Covid-19 vaccination record.
Dictionary of medicines and devices (dm+d)
The dm+d is a dictionary of descriptions and codes which represent medicines and devices in use across the NHS. The following dm+d AMP labels are used for the different vaccination types currently given within the United Kingdom:
- COVID-19 mRNA (nucleoside modified) Vaccine Moderna 0.1mg/0.5ml dose dispersion for injection multidose vials
- COVID-19 Vaccine AstraZeneca (ChAdOx1 S [recombinant]) 5x10,000,000,000 viral particles/0.5ml dose suspension for injection multidose vials
- COVID-19 mRNA Vaccine Pfizer-BioNTech BNT162b2 30micrograms/0.3ml dose concentrate for dispersion for injection multidose vials
So I encapsulated this into a function that reports whether the supplied record contains the leading part of one of these dm+d labels:
// isVaccinationRecord reports whether record contains the start of
// one of the dm+d labels for a COVID-19 vaccine.
func isVaccinationRecord(record string) bool {
    if strings.Contains(record, "COVID-19 mRNA (nucleoside modified) Vaccine Moderna") {
        return true
    }
    if strings.Contains(record, "COVID-19 Vaccine AstraZeneca") {
        return true
    }
    if strings.Contains(record, "COVID-19 mRNA Vaccine Pfizer-BioNTech") {
        return true
    }
    return false
}
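The three checks above could also be written as a lookup over a list of labels, which makes it easier to add another vaccine later and to see which label matched. This is only a sketch: the names vaccineLabels and matchVaccineLabel are my own illustrations, and the sample row text in main is made up.

```go
package main

import (
	"fmt"
	"strings"
)

// vaccineLabels holds the leading part of each dm+d AMP label listed above.
var vaccineLabels = []string{
	"COVID-19 mRNA (nucleoside modified) Vaccine Moderna",
	"COVID-19 Vaccine AstraZeneca",
	"COVID-19 mRNA Vaccine Pfizer-BioNTech",
}

// matchVaccineLabel is a hypothetical helper. It returns the first label
// found in record, or false if none of the labels is present.
func matchVaccineLabel(record string) (string, bool) {
	for _, label := range vaccineLabels {
		if strings.Contains(record, label) {
			return label, true
		}
	}
	return "", false
}

func main() {
	// Made-up row text, standing in for rowhtml.Text().
	row := "COVID-19 Vaccine AstraZeneca (ChAdOx1 S [recombinant]) suspension for injection"
	if label, ok := matchVaccineLabel(row); ok {
		fmt.Println("Matched:", label)
	}
}
```

Returning the matched label alongside the boolean keeps the yes/no check but also tells the caller which vaccine family the row belongs to.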
Parsing the Vaccination Records
Here I’ve plugged isVaccinationRecord() into the loop, so the date and vaccine are only extracted from rows where it returns true. In the code sample below, the CSS classes on each table cell determine what the cell holds. A cell with the class name onsetdate holds the date the vaccine was given, and a cell with the class name term holds the name of the vaccination given.
// find each table within the HTML
doc.Find("table").Each(func(index int, tablehtml *goquery.Selection) {
    tablehtml.Find("tr").Each(func(indextr int, rowhtml *goquery.Selection) {

        // only process rows which have
        // dm+d COVID-19 vaccine entries
        if isVaccinationRecord(rowhtml.Text()) {

            rowhtml.Find("td").Each(func(indexth int, tablecell *goquery.Selection) {
                if tablecell.HasClass("onsetdate") {
                    fmt.Println("Date Given: " + tablecell.Text())
                }

                if tablecell.HasClass("term") {
                    fmt.Println("Vaccine Given: " + tablecell.Text())
                }
            })
        }
    })
})
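Printing the values is fine for exploring the data, but a server-side API would probably want to hand back structured records. Here is a minimal sketch of how the two cell values might be turned into a typed record. The VaccinationRecord type, the newVaccinationRecord helper and, in particular, the dateLayout value are assumptions of mine: I haven't documented the exact date format of the onsetdate cell here, so adjust the layout to match the HTML you actually receive.

```go
package main

import (
	"fmt"
	"strings"
	"time"
)

// dateLayout is an ASSUMPTION about how the onsetdate cell is formatted.
// Change it to match the real response.
const dateLayout = "02-Jan-2006"

// VaccinationRecord is a hypothetical type holding one extracted row.
type VaccinationRecord struct {
	Given   time.Time
	Vaccine string
}

// newVaccinationRecord builds a record from the raw text of the onsetdate
// and term cells, trimming surrounding whitespace before parsing.
func newVaccinationRecord(dateText, termText string) (VaccinationRecord, error) {
	given, err := time.Parse(dateLayout, strings.TrimSpace(dateText))
	if err != nil {
		return VaccinationRecord{}, fmt.Errorf("parsing onset date %q: %w", dateText, err)
	}
	return VaccinationRecord{Given: given, Vaccine: strings.TrimSpace(termText)}, nil
}

func main() {
	// Made-up cell values, standing in for tablecell.Text().
	rec, err := newVaccinationRecord(" 01-Feb-2021 ", " COVID-19 Vaccine AstraZeneca ")
	if err != nil {
		fmt.Println("skipping row:", err)
		return
	}
	fmt.Println(rec.Given.Format("2006-01-02"), rec.Vaccine)
}
```

Inside the goquery loop, the text of the onsetdate and term cells could be captured in two variables per row and passed to a helper like this once the row has been walked, with rows that fail to parse being logged and skipped.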
Need Help or Advice
If you need any help or advice in implementing something similar then please reach out to me via Twitter. Send me a direct message to @nhsdeveloper.