Convert RTF Radiology reports to plain text to increase interoperability.
Chaining Linux commands together converts RTF files into plain text reports, increasing the number of devices capable of consuming the content.
Introduction
Most Radiology Information Systems (RIS) publish their Radiology reports as plain text. These plain text reports may contain control sequences to indicate which text needs to be bolded. But on the whole, the report is readable on almost every computing device.
A small number of RIS publish their Radiology reports as Rich Text Format (often abbreviated RTF), a proprietary document file format developed by Microsoft. The RTF file format is a verbose text file containing lots of formatting tags. As a result, it's not widely supported by web browsers, especially on mobile devices.
If you want to share Radiology reports in a regional health exchange, it’s much easier when those reports can be rendered and viewed within a standard web browser.
The sample RTF extract below demonstrates how verbose the formatting tags are within an RTF file. It’s a tag soup.
{\rtf1\sstecf22000\ansi\deflang2057\ftnbj\uc1\deff0
{\fonttbl{\f0 \fswiss Arial;}{\f1 \fswiss \fcharset0 Arial;}{\f2 \froman \fcharset2 Symbol;}{\f3 \fmodern \fcharset0 Courier New;}{\f4 \fnil \fcharset2 Wingdings;}{\f5 \fswiss \fcharset0 Calibri;}}{\colortbl;\red0\green0\blue0;\red0\green0\blue0;\red0\green0\blue0;\red0\green0\blue0;}{\stylesheet{\f0\fs24 Normal;}{\cs1 Default Paragraph Font;}{\s2\snext2\f5\fs22\tqc\tx4513\tqr\tx9026\li0\ri0 header;}{\cs3\f0 Header Char;}{\s4\snext4\f5\fs22\tqc\tx4513\tqr\tx9026\li0\ri0 footer;}{\cs5\f0 Footer Char;}{\s6\snext6\f5\fs22\li720\ri0\sb0\sa200\sl276\slmult1\contextualspace List Paragraph;}}
I looked into how the RTF tag soup could be converted into a clean plain text report. I found that by chaining two command line utilities together I was able to produce readable Radiology reports. The two command line utilities are:
- rtf2html, which converts the RTF file into structured HTML.
- html2text, which converts the HTML to plain text.
These two commands can be run together, piping the output of the rtf2html command directly into html2text.
rtf2html < input.rtf | html2text -ascii > output.txt
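A RIS usually exports many reports at once, so the same pipeline can be wrapped in a loop. The sketch below is one possible way to do that, not a tested production script. It assumes rtf2html and html2text are both on the PATH; the function name convert_reports and the source/destination directory arguments are hypothetical names chosen for illustration. It runs the two stages separately rather than through a single pipe, so that a failure in either command is noticed.

```shell
# Convert every .rtf file in SRC_DIR into a .txt file in DEST_DIR.
# Usage: convert_reports SRC_DIR DEST_DIR   (hypothetical helper, not part of either tool)
convert_reports() {
    src_dir=$1
    dest_dir=$2
    mkdir -p "$dest_dir" || return 1
    failures=0

    for rtf in "$src_dir"/*.rtf; do
        # If the glob matched nothing, the literal pattern is left behind; skip it.
        [ -e "$rtf" ] || continue

        out="$dest_dir/$(basename "$rtf" .rtf).txt"

        # Stage 1: RTF -> HTML, captured in a variable so its exit status is checked.
        # Stage 2: HTML -> plain text, using the same -ascii option as the one-liner.
        if html=$(rtf2html < "$rtf") &&
           printf '%s\n' "$html" | html2text -ascii > "$out"; then
            echo "converted: $rtf -> $out"
        else
            echo "failed: $rtf" >&2
            rm -f "$out"              # do not leave a partial report behind
            failures=$((failures + 1))
        fi
    done

    # Succeed only if every file was converted.
    [ "$failures" -eq 0 ]
}
```

Calling `convert_reports ./rtf ./txt` would then write one plain text file per RTF report into ./txt, keeping the original file name with a .txt extension.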
Worth making this into an API?
This type of conversion could be wrapped into a web service API call. Clients would submit the RTF and the API would return the plain text file. Would anybody be interested in such an API? The API would be free and would use AWS serverless technology to deliver a scalable, secure service. If anybody is interested, please send me a direct message on Twitter at @nhsdeveloper.