Postmortem: Status Page Incident - March 19th, 2025
Summary:
On March 19th, 2025, our application experienced a period of degraded performance and intermittent access issues. This incident was triggered by an unexpected surge in new connections, which, combined with suboptimal database connection settings, led to resource exhaustion before the system could scale up. This resulted in a temporary disruption of service for some users on the 25.03 version. We restored full service functionality by cycling our 25.03 containers, and we have since implemented and tested updated DB connection configurations to prevent recurrence.
Timeline (March 19th, 2025):
- 12:56 PM: Internal alarms notified the team of usage anomalies and site slowness, with error rates increasing.
- 1:13 PM: We initiated a recycle of all 25.03 containers to address the suspected resource saturation.
- 1:27 PM: Services began to gradually recover as the rebuilt containers came online and handled the backlog of requests.
- 2:08 PM: Services were confirmed to be fully restored, and the system returned to normal operational status.
Root Cause:
The incident was caused by a confluence of factors:
- Unexpected Connection Surge: A significant and unanticipated increase in new connections placed a substantial load on our database servers.
- DB Connection Protocol Issue: Our application's database connection pooling mechanism had an internal issue. In conjunction with the connection surge, this led to the creation of an excessive number of database connections (a configuration sketch follows this list).
- Resource Exhaustion and Scaling: The excessive database load caused our web servers to experience performance degradation. Simultaneously, the large number of pending connections left web workers idle waiting on the database rather than doing CPU-bound work, which decreased CPU utilization on the web containers. Consequently, our auto-scaling mechanism began to scale down the web containers, exacerbating the access issues.
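
For illustration, the sketch below shows how a capped connection pool can be configured so that a connection surge queues at the application rather than opening an unbounded number of database connections. It assumes a SQLAlchemy-style pool purely as an example; the library choice, DSN, and parameter values are illustrative, not our actual configuration.

```python
# Hypothetical illustration: bounding a connection pool so a traffic surge
# queues at the application instead of opening unbounded DB connections.
# SQLAlchemy is assumed here for illustration only; values are placeholders.
from sqlalchemy import create_engine

engine = create_engine(
    "postgresql+psycopg2://app:secret@db.internal/app",  # placeholder DSN
    pool_size=20,        # steady-state connections per container
    max_overflow=10,     # hard cap on extra connections during bursts
    pool_timeout=5,      # seconds a request waits for a free connection
    pool_recycle=1800,   # recycle connections to avoid stale sockets
    pool_pre_ping=True,  # discard dead connections before handing them out
)
```

With a hard cap like this, a surge surfaces as queued requests and pool timeouts that scale-out can react to, rather than as an ever-growing pile of database connections.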
In essence, the high connection load, combined with the connection pool issues, caused database slowdowns and prompted the web containers to scale down, making the page unavailable for some users.
Resolution:
- Immediate Mitigation: We immediately recycled the 25.03 service containers to clear the existing connection backlog. This action restored service functionality and addressed the immediate access issues.
- Long-Term Solution: We have since corrected the underlying connection library issue. This update has been rigorously tested to ensure it can handle high connection loads without causing resource exhaustion or performance degradation. We are also implementing additional monitoring to catch unexpected spikes in connection requests (see the monitoring sketch below).
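
As a rough illustration of the additional monitoring, the sketch below polls the database's active-connection count and raises an alert when it crosses a threshold. It assumes PostgreSQL and the psycopg2 driver for the sake of the example; the DSN, threshold, polling interval, and alerting hook are all hypothetical placeholders rather than our production monitoring stack.

```python
# Hypothetical monitoring sketch: alert when active DB connections spike.
# Assumes PostgreSQL/psycopg2 for illustration; threshold, interval, and
# alert hook are placeholders.
import time
import psycopg2

DSN = "dbname=app user=monitor host=db.internal"  # placeholder DSN
THRESHOLD = 500          # alert when active connections exceed this
INTERVAL_SECONDS = 30    # polling interval

def active_connections(conn) -> int:
    """Return the number of currently active backend connections."""
    with conn.cursor() as cur:
        cur.execute(
            "SELECT count(*) FROM pg_stat_activity WHERE state = 'active';"
        )
        return cur.fetchone()[0]

def main() -> None:
    conn = psycopg2.connect(DSN)
    conn.autocommit = True
    while True:
        count = active_connections(conn)
        if count > THRESHOLD:
            # In practice this would page on-call or emit a metric,
            # not just print to stdout.
            print(f"ALERT: {count} active DB connections (threshold {THRESHOLD})")
        time.sleep(INTERVAL_SECONDS)

if __name__ == "__main__":
    main()
```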