File De-duplication Code

Description

Many sites sending PV files send them even when the content is unchanged. This puts a tremendous load on the UKRDC, which would only increase, and it also has a significant impact on the size of the log data files. To help alleviate this, the file de-duplication routines check whether the received file is essentially unchanged from the file previously received for the same patient from the same sending unit. If it is unchanged, it is not passed on. There is a check in place to deal with files that were rejected due to a lack of PV memberships: once the programme membership has been created on PV, the next received file is allowed through even if it is flagged as a duplicate. There may still be a need to back load older files.

Process

The main.py file contains a function, def data_match(filepath, storage=DB, membership_directory=""), that returns either True or False. Originally it only checked whether the file looked the same as the one previously sent and answered on that basis. It was later expanded to check membership information as well. If a new membership is found, the function reports the file as not matching so that it is processed, regardless of whether it is the same as the file previously sent.

When a request is received, the function strips the ever-changing data from the file, such as the date of report, the results date range, etc. It then calculates the MD5 checksum of the remaining contents and compares it to the last checksum stored for the same patient from the same unit. If the checksums match, the file is the same as the one previously received. A further check then determines whether the patient has been newly registered on PV: if so, the file is reported as not matching; otherwise it is reported as the same.
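The steps above can be sketched roughly as follows. This is a minimal illustration only, not the code in main.py: the helper names (content_checksum, is_duplicate), the element patterns being stripped, the dictionary standing in for the checksum storage, and the set of newly registered patients are all assumptions made for the example.

```python
import hashlib
import re

# Hypothetical patterns for fields that change on every send. The real list
# of stripped elements is not given on this page.
VOLATILE_PATTERNS = [
    re.compile(r"<dateofreport>.*?</dateofreport>", re.DOTALL),
    re.compile(r"<daterange\b[^>]*/?>"),
]


def content_checksum(text):
    """Return the MD5 hex digest of the text with volatile fields removed."""
    for pattern in VOLATILE_PATTERNS:
        text = pattern.sub("", text)
    return hashlib.md5(text.encode("utf-8")).hexdigest()


def is_duplicate(text, key, checksums, new_memberships):
    """Return True when the file matches the last one seen for this key.

    key             -- (sending unit, patient identifier) tuple
    checksums       -- dict standing in for the checksum storage
    new_memberships -- set of keys newly registered on PV
    """
    checksum = content_checksum(text)
    unchanged = checksums.get(key) == checksum
    checksums[key] = checksum  # remember the latest checksum for next time
    if key in new_memberships:
        # A new PV membership forces the file through even if unchanged.
        return False
    return unchanged
```

In this sketch, two files that differ only in their date of report produce the same checksum, so the second is reported as a duplicate unless the patient's key appears in new_memberships. How the real function records that a new membership has already been handled is not covered here.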

Code base 

https://bitbucket.renalregistry.nhs.uk/projects/TNG/repos/filefilter/browse