openSUSE:Post-mortem-20240513

- What: various services intermittently not reachable, some for ~5, some for up to ~25 minutes

- When: 2024-05-13 19:13 to 19:40 UTC

- Why:

The trigger was applying a change to our hypervisor cluster - planned was to apply increased RAM for two virtual machines [1, 2] which is a common and mostly automated procedure.

Unlike usually, this time more things happened than expected - some background:

We use Salt [0] as an infrastructure as code and automation solution. A few days ago, a new version of our Salt formulas was released [3], containing, amongst some new features, bulk refactoring changes which unify the handling of /etc/sysconfig files across all our formulas [4] - including the formula we use for managing the Corosync/Pacemaker HA (high availability) stack on hypervisors [5]. This change does not alter any configuration options, however it changes the formatting of the file headers (for easier identification of Salt managed configuration files by local administrators visiting a machine, we equip all such files with a "Managed by Salt" header comment).

The Salt formula is programmed to reload or restart services if affected configuration files change (and note that Salt cannot differentiate between comment and option changes) - usually, if changes to the Salt code affecting the HA stack are implemented, they are applied in sequence on affected nodes (administrator puts one node at a time in maintenance/standby mode and applies the change) - if the changes were implemented on a formula level, then this is generally done as soon as possible after updating the formula packages, to avoid unapplied changes being involuntarily applied together with minor changes.

This time, this was not done - frankly, I forgot about this bulk change affecting the header comment in the /etc/sysconfig/{pacemaker,sbd} files, did not perform the sequential apply on the hypervisor nodes after the formula update was shipped [6] and deployed on the Salt master servers (through automatic OS updates), and hence left this dangerous change "dangling".

So it happened, that the configuration comment change was included in the operation of what was intended to only be the application of two VM memory patches, causing an unintended and unexpected simultaneous restart of the HA services, in return resulting in a temporary shutdown of the virtual machines managed by them.

As I was immediately noticing the effect of my action I was able to respond quickly, but mostly monitoring of the situation was needed, as the HA stack recovered itself and the virtual machines on two out of three machines automatically.

[0] https://docs.saltproject.io/en/latest/topics/about_salt_project.html#about-salt

[1] https://code.opensuse.org/heroes/salt/c/b65d92cc4e8a54ce8854d753b6efbf6f618bb48b

[2] https://code.opensuse.org/heroes/salt/c/917932294576fcefca3a7105431f1e5be3bb574c

[3] https://github.com/openSUSE/salt-formulas/releases/tag/v2.0

[4] https://github.com/openSUSE/salt-formulas/pull/157

[5] https://github.com/openSUSE/salt-formulas/commit/a4b52824be7a77f46e07bdec6df8025d5eaac99b

[6] https://build.opensuse.org/request/show/1172988