Intermittent errors when running reports/processes against the data warehouse

Incident Report for Encompass Technologies

Postmortem

Incident Summary

On March 10, 2025, between 4:21 PM and 5:42 PM MDT, Encompass customers experienced intermittent failures when running reports and processes against our Snowflake data warehouse. Approximately 2% of connections to Snowflake failed with a "Snowflake Internal Error: Unable to connect" message. The issue was fully resolved through coordinated efforts with Snowflake support.

Timeline

  • 4:21 PM MDT: Our monitoring systems detected an increase in connection errors to the Snowflake data warehouse.
  • 4:25 PM MDT: Incident escalated to Snowflake support to investigate potential upstream issues.
  • 5:36 PM MDT: Snowflake support identified a potential resolution path involving connection refresh and server restarts.
  • 5:37 PM - 5:42 PM MDT: Encompass executed a controlled container rotation across all services using the Snowflake .NET connector.
  • 5:42 PM MDT: Error rates returned to normal levels (0%).
  • 6:08 PM MDT: Incident moved to monitoring status.
  • 7:33 PM MDT: Incident officially resolved after more than an hour of stable operation.

Root Cause

The investigation revealed an anomaly in the connection pooling mechanism within the Snowflake .NET driver. Our engineering team, in collaboration with Snowflake support, determined this was not related to any recent changes in Encompass infrastructure or code deployments.

Resolution

The immediate resolution involved two coordinated actions:

  1. Snowflake support refreshed connection pools on their infrastructure side.
  2. Encompass performed a rolling restart of all containerized services that use the Snowflake .NET connector.

This two-pronged approach successfully cleared the affected connection pools and restored normal service operation.
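
For readers unfamiliar with the term, a rolling restart replaces containers a few at a time so that the remaining ones keep serving traffic while each batch comes back with a freshly built connection pool. The following is a minimal, generic sketch of that idea in Python. It is not Encompass's actual tooling: the `Container` class, the `svc-N` names, and the batch size are all hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Container:
    """Hypothetical stand-in for one containerized service instance."""
    name: str
    generation: int = 0  # how many times this instance has been replaced

    def restart(self) -> None:
        # A replaced process starts over with a newly built connection pool.
        self.generation += 1


def rolling_restart(containers: List[Container], batch_size: int = 1) -> List[List[str]]:
    """Restart containers one batch at a time and return the batch order."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    batches = []
    for start in range(0, len(containers), batch_size):
        batch = containers[start:start + batch_size]
        for container in batch:
            container.restart()
        # A real rollout would wait for health checks here before moving on.
        batches.append([container.name for container in batch])
    return batches


# Hypothetical fleet of five instances, restarted two at a time.
fleet = [Container(f"svc-{i}") for i in range(5)]
order = rolling_restart(fleet, batch_size=2)
print(order)
```

Keeping the batch size small relative to the fleet is the design choice that lets the service stay available during the rotation.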

Preventative Measures

Because this incident stemmed from an issue in the Snowflake .NET driver itself, Encompass is collaborating with Snowflake's engineering team on their upcoming driver release, which addresses the underlying connection pooling behavior. Encompass will deploy this update once it becomes available.

Impact

During the 81-minute incident window, approximately 2% of data warehouse operations experienced failures. This primarily affected report generation and certain processes that rely on data warehouse access.
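
The 81-minute figure follows directly from the timeline: the window runs from first detection at 4:21 PM to error rates returning to normal at 5:42 PM, both in local Mountain time. A short Python check of that arithmetic (variable names are illustrative):

```python
from datetime import datetime

# Start and end of the incident window, taken from the timeline (local time).
window_start = datetime(2025, 3, 10, 16, 21)  # 4:21 PM, errors first detected
window_end = datetime(2025, 3, 10, 17, 42)    # 5:42 PM, error rate back to 0%

window_minutes = int((window_end - window_start).total_seconds() // 60)
print(window_minutes)  # 81
```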

We appreciate your patience and understanding during this incident. Our team remains committed to providing reliable and stable service.

Posted Mar 14, 2025 - 13:37 MDT

Resolved

This incident has been resolved. A post-mortem will be posted in 5-7 business days.
Posted Mar 10, 2025 - 19:33 MDT

Monitoring

Working with Snowflake, we have fixed this issue through a full infrastructure refresh.
We are monitoring for 1 hour to confirm.
Posted Mar 10, 2025 - 18:08 MDT

Investigating

Beginning at 4:21 PM MDT, we are seeing a spike in "Snowflake Internal Error: Unable to connect" errors from Snowflake.
This is causing intermittent failures for reports and other processes that use the data warehouse.

The issue has been escalated to Snowflake, and they are investigating.
Posted Mar 10, 2025 - 16:21 MDT
This incident affected: Snowflake Data Warehouse.