Project Top Gun


Instrumentation

We need some form of “Application Performance Monitoring” so that we can monitor the capacity of the environment, as well as the effect of any changes made to parts of the System.

We used to use Sensu/Grafana for this. The Internet suggests that Zabbix isn’t suited to this. Sentry may be an option.

This should include monitoring -

  • Queued Requests

  • Requests Processed per Second/Minute

  • Average time taken per Request

Other:
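As a rough illustration of where these request figures could come from, here is a minimal, hypothetical WSGI middleware sketch (not part of our current stack); a real setup would forward the measurements to whichever APM tool we choose rather than printing them:

import time

def timing_middleware(app):
    # Hypothetical sketch: wrap a WSGI app so that every request's duration is
    # recorded. Requests per second and average time per request can then be
    # derived from these measurements.
    def wrapped(environ, start_response):
        start = time.monotonic()
        try:
            return app(environ, start_response)
        finally:
            duration = time.monotonic() - start
            print(f"{environ.get('PATH_INFO', '?')} took {duration:.3f}s")
    return wrapped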

 

NHSBT Importer

This isn’t related to the flow of Renal Unit RDA files, but it is putting an unnecessary load on the system (details in the ticket), so addressing it would be a quick win to free up resources.

TODO:

Data Purge Script

Again, this is not related to the flow of Renal Unit RDA files, but it is taking a long time (days) to clear out data to allow the next round of testing. This will likely be fixed by the addition of Foreign Keys to the database.

See - https://github.com/renalreg/data_extract/blob/master/scripts/ukrdc_purge.py

UKRR Quarterly Extract

In order not to fall behind, a year’s clinical data needs to be loaded into the UKRR database in less than a year. There are 4 quarters and approximately 70 sites, and each site can be assumed to require at least two extracts during the course of its processing. This means that at a minimum we need to be capable of generating two files per day, but in practice it needs to be much quicker than that.
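As a back-of-the-envelope check of that figure (assuming the two extracts are needed for each site in each quarter):

quarters = 4
sites = 70
extracts_per_site = 2                                      # assumed: at least two per site per quarter
extracts_per_year = quarters * sites * extracts_per_site   # 560
extracts_per_day = extracts_per_year / 365                 # ~1.5, so round up to 2 per day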

 

Incoming File Processing / Decryption

We receive the RDA SFTP files on sftp.ukrdc.org (Internet) and sftp.ukrdc.nhs.uk (HSCN).

A script ( https://github.com/renalreg/ukrdc-transfer/blob/master/scripts/gpg_decrypt.py ) monitors a range of incoming folders for new files, decrypts them and moves them to archive/feed directories.

Assuming the logs are correct, it appears to be capable of processing a file in a millisecond.
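For illustration, a minimal sketch of that pattern (hypothetical paths, not the actual gpg_decrypt.py):

import subprocess
from pathlib import Path

INCOMING = Path("/data/incoming")   # assumed locations, for illustration only
DECRYPTED = Path("/data/feed")
ARCHIVE = Path("/data/archive")

for encrypted in INCOMING.glob("*.gpg"):
    output = DECRYPTED / encrypted.stem
    # One gpg subprocess is launched per file, mirroring the current approach.
    subprocess.run(
        ["gpg", "--batch", "--yes", "--output", str(output), "--decrypt", str(encrypted)],
        check=True,
    )
    encrypted.rename(ARCHIVE / encrypted.name)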

Some options to possibly increase speed:

  • Although the code is written with a worker architecture, I think only one worker is actually created.

  • For each file it processes, the code creates a subprocess to launch the GPG executable to do the decryption. There may be a way to do this more efficiently. Some similar scripts of ours use https://gnupg.readthedocs.io/en/latest/ , although from a brief read I think this operates in a similar manner under the wrapper (see the sketch below).
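For comparison, a sketch of the wrapper-based approach (assuming a python-gnupg style API; the paths are hypothetical):

import gnupg

gpg = gnupg.GPG()  # uses the default GnuPG home directory

with open("/data/incoming/example.xml.gpg", "rb") as fh:
    # decrypt_file() still shells out to the gpg binary under the hood,
    # so it may not be materially faster than the subprocess approach.
    result = gpg.decrypt_file(fh, output="/data/feed/example.xml")
if not result.ok:
    raise RuntimeError(result.status)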

 

Each Mirth server then has a crontab which runs two instances of https://github.com/renalreg/rdadownload/blob/master/rdadownload.py , one to download files from each SFTP server.

TODO:

  • The code on the servers appears not to match Git. Additionally, there is a separate copy of the script for each server rather than a single script invoked with different command line parameters.

  • The cron jobs do not use flock etc. to prevent multiple concurrent runs (see the sketch after this list).

  • The cron jobs are scheduled to run every 30 minutes. This long gap could result in processes standing idle when they otherwise might be processing files.
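One way to address the flock point without touching the crontab would be a lock taken inside the script itself; a minimal sketch (hypothetical lock file path):

import fcntl
import sys

lock_file = open("/tmp/rdadownload.lock", "w")
try:
    # Non-blocking exclusive lock: a second run started by cron while the
    # first is still going exits immediately instead of overlapping it.
    fcntl.flock(lock_file, fcntl.LOCK_EX | fcntl.LOCK_NB)
except BlockingIOError:
    sys.exit(0)
# ... normal download logic would run here ...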

Mirth Processing

Mirth General

Each channel in Mirth can be set to have a number of workers. There are also settings which control whether each step is independent, or whether it needs to wait until a message has made its way through all the channels before it starts processing the next one.

This offers considerable potential to increase performance, although it is important to consider any issues about concurrent processing of files - most commonly when there are two files for the same patient being processed at once.

Finally, this shouldn’t be an issue as most processing is done outside of Mirth, but the resources available to Mirth are set within the JVM settings in the Mirth configuration file, so these would need to be increased if the amount of memory on the VM was increased.

The performance of each step within Mirth can be monitored using entries in the MirthDB. (TODO: Provide example SQL).

Python General

QUESTIONS:

  • We have put an NGINX instance in front of the Gunicorn process serving the “WebAPI” services. Is this the most efficient way of doing it?
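For reference, the Gunicorn side of that setup is usually tuned via a gunicorn.conf.py; the values below are illustrative only, not our live settings:

# gunicorn.conf.py (hypothetical values for illustration)
bind = "127.0.0.1:8000"   # NGINX would proxy requests to this address
workers = 5               # a common starting point is (2 * CPU cores) + 1
threads = 2               # threads per worker, useful for I/O-bound requests
timeout = 60              # seconds before an unresponsive worker is recycled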

TODO:

1 - Duplicate Checker - TODO

The “PV XML” files which are sent to pv1.patientview.org (KDA), but not (I think) those which may go to either of the “new” SFTP servers, go via a process called filefilter ( https://github.com/renalreg/pvfeed-scripts/blob/master/pvfeed/filefilter.py ). This checks whether the incoming file is identical to one which has been received previously, and also whether the file should be rejected due to the lack of a “PV” (NOT “PKB”) membership.

There were a number of issues with this process, so Andre has made an improved version which is designed to function as a Mirth Channel: https://github.com/renalreg/duplicate_services/ .

This is yet to be deployed.

There also isn’t a function to perform the equivalent steps for “RDA XML” files.

Note that, as far as I’m aware, we have not checked how much faster normalising and comparing files is compared to just processing them as usual. I think some of the impetus for developing this was to avoid space issues with HealthShare by having fewer files reach it.
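For illustration, a minimal sketch of the normalise-and-compare idea (hypothetical, not the filefilter or duplicate_services code; requires Python 3.8+ for canonicalisation):

import hashlib
import xml.etree.ElementTree as ET

def normalised_hash(path: str) -> str:
    # Canonicalise the XML (C14N) so that formatting-only differences are
    # ignored, then hash the result for a cheap comparison against files
    # received previously.
    canonical = ET.canonicalize(from_file=path)
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()

# A new file whose hash matches one seen before would be treated as a
# duplicate and dropped before it reaches HealthShare.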

2 - RDA Conversion

The files are processed via - https://github.com/renalreg/rda_xml_schema_conversion

The work done by this process should lessen once we’ve moved to CUPID and sites have implemented fixes to some of the things it is handling.

3 - RDA Validation

The files are processed via https://github.com/renalreg/rda_services/blob/main/rda_services/validate_rda_data/validation.py

This carries out a number of checks on the XML, each of which could cause the file to be rejected.
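As an example of the kind of check involved, here is a hypothetical schema-validation sketch using lxml (the real validation.py may perform different or additional checks):

from lxml import etree

def validate_rda(xml_path, xsd_path):
    schema = etree.XMLSchema(etree.parse(xsd_path))
    doc = etree.parse(xml_path)
    if schema.validate(doc):
        return []
    # Each schema violation is a reason the file could be rejected.
    return [str(error) for error in schema.error_log]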

4 - Java EMPI / Repository

The files are processed via https://github.com/renalreg/Data-Repository . This is done by Mirth calling the functions from the Java classes directly, rather than via an HTTP call as with everything else.

As part of the process the patient’s identity is checked against the EMPI, which involves calling https://github.com/renalreg/jtrace .

Both of the above will be replaced by CUPID.