On 2022-05-24, NamePros went offline from 7:13 AM to 7:17 AM and again from 7:20 AM to 7:31 AM. All times are EDT (UTC-4).
Beginning around 8:15 PM EDT the previous night, all but one web server began experiencing intermittent networking issues. This isn't normally cause for immediate concern, as the servers are supposed to be replaced automatically if they're unhealthy for any significant duration. However, due to an autoscaling misconfiguration, the servers weren't replaced.
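As a rough illustration of the replacement rule described above, the following sketch selects servers that have been failing health checks for longer than a grace period. It is not NamePros's actual autoscaling configuration: the function name, the server names, and the five-minute threshold are all assumptions made for the example.

```python
from datetime import datetime, timedelta

# Hypothetical grace period; the real threshold is not stated in this report.
UNHEALTHY_GRACE = timedelta(minutes=5)


def servers_to_replace(unhealthy_since, now, grace=UNHEALTHY_GRACE):
    """Return the names of servers that have been unhealthy for at least `grace`.

    `unhealthy_since` maps a server name to the time it first failed a
    health check, or to None if the server is currently healthy.
    """
    return sorted(
        name
        for name, since in unhealthy_since.items()
        if since is not None and now - since >= grace
    )


if __name__ == "__main__":
    now = datetime(2022, 5, 24, 7, 0)
    state = {
        "web-1": None,                         # healthy
        "web-2": now - timedelta(minutes=30),  # unhealthy well past the grace period
        "web-3": now - timedelta(minutes=1),   # briefly unhealthy, left alone for now
    }
    print(servers_to_replace(state, now))
```

In this incident the equivalent of this rule never fired because of the misconfiguration, so the unhealthy servers stayed in place.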
The issue worsened around 4:45 AM EDT. All but one web server became unusable, so all requests were routed to the one remaining server. As the morning progressed and the request rate increased, that server became overloaded and dropped offline at 7:13 AM, recovering on its own at 7:17 AM. At 7:20 AM, seeing a high error rate from a single server, we attempted to manually replace it, expecting the other servers to take over. The other servers were already out of service, so removing it brought the site offline again. During the intervention that followed, the autoscaling configuration was reset to a known-good state, and the site recovered at 7:31 AM as new servers came online.
A subsequent investigation revealed the misconfiguration but was unable to determine the underlying cause of the networking issue.