openSUSE:Post-mortem-20240905

  • What/Problem: various openSUSE services were dysfunctional (www, status, wiki, etherpad, idm, progress, events)
  • When: 2024-09-05 00:40 to 06:20 UTC
  • Why: We identified a number of contributing issues:
    • keepalived on atlas1 and atlas2 failed to start because the "d-os-public" interface referenced in keepalived.conf was missing, due to https://bugzilla.opensuse.org/show_bug.cgi?id=1229555 => update pending.
    • automatic maintenance scheduling caused the update windows of both nodes to overlap, so the second node did not have enough time to automatically block its own updates after the failure on the first one => adjustment of the timing formula is under discussion.
    • the outage began at a time when admin availability was low.
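
The scheduling overlap described above can be illustrated with a minimal sketch. It is not the timing formula actually used on the openSUSE infrastructure, which is not documented here. The function names, the window length and the grace period are hypothetical. The idea is only that each node's update window starts after the previous node's window has ended plus a grace period, so a failure on the first node can still block updates on the second.

```python
from datetime import datetime, timedelta


def staggered_windows(nodes, first_start, max_duration, grace):
    """Assign each node a (start, end) maintenance window.

    Hypothetical helper: every window begins only after the previous
    window has ended and an additional grace period has passed, leaving
    time to detect a failure and block updates on the next node.
    """
    windows = {}
    start = first_start
    for node in nodes:
        end = start + max_duration
        windows[node] = (start, end)
        start = end + grace
    return windows


def windows_overlap(a, b):
    """Return True if two (start, end) windows share any time."""
    return a[0] < b[1] and b[0] < a[1]


# Illustrative values only, not the real maintenance schedule.
windows = staggered_windows(
    nodes=["atlas1", "atlas2"],
    first_start=datetime(2024, 1, 1, 0, 30),
    max_duration=timedelta(minutes=30),
    grace=timedelta(minutes=15),
)
for node, (start, end) in windows.items():
    print(node, start.isoformat(), end.isoformat())
```

With such a scheme, `windows_overlap(windows["atlas1"], windows["atlas2"])` is false by construction, and the gap between the two windows equals the chosen grace period.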