AIMES issues

Ongoing - Contract addendum hell

  • Every minor change requires:

    • Ticket raised requesting changes

    • Change request goes to account manager

    • Account manager sends a quote

    • Quote agreed

    • Account manager sends contract addendum to Steph

    • Steph signs addendum, adds it to the ever-growing pile

    • Once signed, works are scheduled

    • Some time later, works get completed. Occasionally, if the stars align, this happens without issues.

  • Aside from the extra effort, this stretches lead times for what should be 5-minute jobs into days or, typically, weeks.

  • Cloud credits agreed to be added as a “final” addendum on 15/11/2023

Part 2 - Did it make any difference?

  • Ordering new MS365 licenses before moving to account credit took around a month.

  • Ordering new MS365 licenses after moving to account credit took around a month.

    • Did occur over the holidays, so hopefully better next time. I’m not holding my breath though.

Ongoing - Missing basic features

  • No emergency shell access (e.g. if SSH goes down)

  • No way to Power On/Off/Reboot a VM via a UI.

  • No At-Rest Encryption.

  • No way to view all active firewall rules within our network

  • No way to view a full inventory of our servers

    • There’s the ITGlue documentation but it’s manually written and constantly out of date

Ongoing - Intermittent storage speed alerts

  • Zabbix intermittently alerts us that storage across our entire infrastructure is running problematically slowly

  • Raised mid 2023 (T20230905.0040) but Lisa said their monitoring isn’t showing any problems, so it mustn’t be an issue. Ticket closed.

  • Issue still occurs from time to time, but literally not worth the effort of opening tickets.

Ongoing - That constant feeling that no matter how many things we have going on, we should never ask AIMES to do more than one thing at a time, otherwise something will get missed

Ongoing - Intermittent radio-silence on simple tickets

  • T20231106.0006 raised 06/11/2023

    • Request for info on max RAM availability for each server

    • Assigned to Heather on 06/11/2023

    • No response as of 20/11/2023, chased up in morning

    • 20/11/2023 16:17 ticket closed abruptly with note “Inforamtion sent to Joe Moorhead”

Mid 2022 - NBT migration communication issues

https://renalregistry.atlassian.net/wiki/spaces/~150928124/pages/2260762674/AIMES#NBT-migration-timeline

Mid 2022 - Zabbix firewall rules communication issues

https://renalregistry.atlassian.net/wiki/spaces/~150928124/pages/2260762674/AIMES#Zabbix-timeline

Late 2022 - PatientView outage

https://renalregistry.atlassian.net/wiki/spaces/~150928124/pages/2274394185/Internal+Incident+Post-Mortem

2016 - Mid 2023 - Lost server hardware

  • AIMES kept no record of the serial numbers of hardware ordered by them and delivered directly to them

  • Servers may have been decommissioned without authorization from the UKKA, but their records of what happened to which servers are held against serial numbers, which they never wrote down or sent to us

Mid 2023 - “Data Manager” VDI not built to Requirements etc.

See T20230324.0026.

Mid 2023 - Teams Phone Numbers

  • Request for extra phone numbers, along with the process for these requests in future, raised 24/07

  • Chased 02/08

  • Chased 07/08

  • 22/08 - Email from Lisa, asking if the users exist in 365 (answered yes the same day)

  • Chased 30/08 - informed that AIMES didn’t have an account on our 365, despite them having done the initial configuration. Password reset and sent over

  • Chased 06/09 - Request was then completed for some of the users requested but not all

  • 15/09 - Numbers assigned to all requested users. I (MR) am still none the wiser as to the actual process this is supposed to go through for new users

Mid 2023 - SQL server failure

  • 17/07/2023 10:32 ticket raised

    • “Hi Support,

      Our SQL server (RR-SQL-LIVE - 172.31.158.34) seems to have rebooted itself at 1 am last night and installed an update.

      This update seems to have failed resulting in the error "Cannot recover the master database. SQL Server is unable to run. Restore master from a full backup, repair it, or rebuild it. For more information about how to rebuild the master database, see SQL Server Books Online."

      As a result the SQL service now won't start. Can this please be looked into as a priority?

      We have data backups from 11PM last night (but not the master DB) so reverting the server to yesterday's backup and then restoring the DBs might be an option if there's no quicker way?

      Many Thanks,
      Michael”

  • 17/07/2023 10:51 ticket acknowledged

  • 18/07/2023 15:09 AIMES claim everything is fine due to presence of an unrelated library file

  • 24/07/2023 17:02 call arranged to discuss

  • 27/07/2023 17:49 AIMES raise a ticket with Microsoft

  • 15/08/2023 17:48 ticket escalated

  • 21/08/2023 17:07 AIMES give up, suggest a chargeable rebuild

  • 22/08/2023 16:30 call with Glenn Roberts, AIMES agree to do rebuild for free as they hadn’t realised we first raised the issue so early

  • 18/09/2023 15:55 AIMES chase up as no work done on the rebuild so far

  • 02/10/2023 14:21 works finally begin

  • 17/10/2023 15:58 timeline for data transfer and cutover agreed. Cutover scheduled for 01/11/2023 08:00

Late 2023 - Unlicensed rebuilt SQL server

  • 31/10/2023 18:00 Michael finds rebuilt server is unlicensed, unable to complete transfer, raises this with AIMES

  • 01/11/2023 09:52 Joel chases this up with AIMES.

  • 01/11/2023 12:10 AIMES apologise for the error received, do not acknowledge the root cause or disruption

  • 01/11/2023 15:08 AIMES re-install SQL server, licensing issue apparently caused by them having installed the wrong version

  • 01/11/2023 16:12 AIMES and Michael agree to attempt migration again tonight, with IP cutover tomorrow 8am pending confirmation of successful transfer

January 2024 - VPN Outage due to failed certificate renewal

Screenshot 2024-01-20 215010.png
  • Certificate used for all AnyConnect VPN clients was allowed to expire

      • Common Name: *.aimes.services
        Subject Alternative Names: *.aimes.services, aimes.services
        Organization:
        Organization Unit:
        Locality:
        State:
        Country:
        Valid From: December 19, 2022
        Valid To: January 20, 2024
        Issuer: Go Daddy Secure Certificate Authority - G2, GoDaddy.com, Inc.
        Key Size: 2048 bit
        Serial Number: dbe0653f36391488

  • Noticed around 22:00 Saturday 20th January

  • Reported as P1

  • Call from support shortly after

    • Support agent said this would be billable out-of-hours support

    • We pointed out this cannot possibly be true given the issue is entirely due to ARO failing to renew a critical certificate. This is basic stuff.

    • Support agent suggested they would look at it “first thing on Monday” and that it would be resolved “by 8:30am”

  • Sunday 21st: ticket escalated as UKKA management want it sorted before Monday

    • Updated request not acknowledged

  • George chased up around 6am Monday

    • No acknowledgement

  • 09:24am Monday: “Good morning, the issue is being worked on by Richard Johnston in the network team , we are looking into a potential certificate issue on one of the core Firewalls. Unfortunately I don't have an update on potential fix time at this moment.”

  • 10:45am: “We are just in the process of applying a new certificate, the CA has issued this due to some (as yet) unknown issue with the current certificate.”

    • From George: “I don't think it's as-yet unknown, it's that current SSL certificate expired on Saturday at 1:45pm without having been renewed in advance.”
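
An expiry like this can be caught ahead of time by comparing the certificate's "Valid To" date against a warning window. The following is a minimal sketch only, not something AIMES or the UKKA is known to run: the function names and the 30-day threshold are assumptions made for illustration, and the date string is taken from the certificate details listed above.

```python
from datetime import date, datetime


def days_until_expiry(valid_to: str, today: date) -> int:
    """Return whole days from `today` until the expiry date.

    `valid_to` uses the format shown in the certificate details,
    e.g. "January 20, 2024". A negative result means already expired.
    """
    expiry = datetime.strptime(valid_to, "%B %d, %Y").date()
    return (expiry - today).days


def needs_renewal(valid_to: str, today: date, warn_days: int = 30) -> bool:
    """True once the certificate is within `warn_days` of expiry (hypothetical threshold)."""
    return days_until_expiry(valid_to, today) <= warn_days


if __name__ == "__main__":
    # "Valid To" value from the *.aimes.services certificate above.
    remaining = days_until_expiry("January 20, 2024", date.today())
    print(f"Days until expiry: {remaining}")
```

Run on a daily schedule, a check of this kind would have started warning a month before 20 January 2024 rather than after the VPN had already gone down.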