On August 1, 2025, the DSDLink ECP website became unavailable for approximately 10 minutes. Our engineering team was immediately alerted, and the issue was promptly resolved.
The outage was triggered during a routine automated upgrade of the system. The upload of a new software image to our container registry (AWS ECR) failed, likely due to an internal networking issue at AWS. Despite the failed upload, the deployment automation proceeded to update the live service with an incomplete test image.
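The failure mode described above, a deployment continuing after an image upload did not complete, can be illustrated with a minimal sketch of a pre-deployment check. This is not our actual deployment automation. The function name, the exception, and the parameters are hypothetical, and the sketch assumes the pipeline can supply the exit code of the push step, the digest of the locally built image, and the digest the registry reports (or None if the image is not found).

```python
from typing import Optional


class DeploymentAborted(Exception):
    """Raised when the image in the registry cannot be verified (hypothetical)."""


def verify_image_before_deploy(
    push_exit_code: int,
    expected_digest: str,
    registry_digest: Optional[str],
) -> str:
    """Return the digest that is safe to deploy, or raise DeploymentAborted.

    All three inputs are assumptions about what a pipeline could provide:
    the exit code of the upload step, the digest of the image that was
    built, and the digest the registry reports for the pushed tag.
    """
    # A non-zero exit code means the upload step itself reported failure.
    if push_exit_code != 0:
        raise DeploymentAborted(
            f"image push failed with exit code {push_exit_code}"
        )
    # The push may report success while the image is still missing.
    if registry_digest is None:
        raise DeploymentAborted("image not found in registry after push")
    # A mismatch means the registry holds a different or incomplete image.
    if registry_digest != expected_digest:
        raise DeploymentAborted(
            "registry digest does not match the built image"
        )
    return registry_digest
```

In a check like this, the live service is updated only if the function returns. Any raised DeploymentAborted stops the rollout and leaves the previous version running.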
Our on-call engineer was alerted by our monitoring systems as soon as the issue occurred. They began investigating immediately, manually rolled back the failed deployment, and restored the previous stable version of the website. Service was restored within 10 minutes of the initial alert.
To prevent a recurrence of this issue, we have implemented the following improvements: