Increase in error rates on Inky Dashboard servers
Incident Report for Inky
Postmortem

Post incident report:

Start: 8-March-2022 1323 UTC

End: 9-March-2022 1630 UTC

Duration: 27 hr 7 min 

Summary:

The Inky Dashboard experienced intermittent loading issues for some customers.

Root Cause:

Resource usage on the servers was higher than expected. The systems put in place to clean up used resources became so busy that they themselves began consuming more resources than expected. This cycle continued until some systems failed when users tried to access the dashboard.

Customer Impact:

Some users were intermittently unable to access the Inky Dashboard.

Mitigation Action:

Each portion of the process was separated out to its own server, so that each portion can scale independently and quickly.

Follow-up Items and Preventative Measures:

  1. The entire process has been split, with each step having its own resource pool to draw from.  This should prevent any one step from overwhelming the entire process.

  2. Monitoring has also been broken down to quickly spot an issue with any individual step in the process in addition to monitoring the health of the process as a whole.  This should assist us with spotting intermittent issues like this one.
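
For readers who want a concrete picture of the two items above, the following is a minimal sketch of the general idea, not a description of Inky's actual implementation. It assumes each step can be modeled as a function call, uses one worker pool per step so a busy step cannot take workers from another, and keeps per-step error counters alongside an overall view. The class names, the step names ("cleanup", "render"), and the threshold are all hypothetical.

```python
import threading
from concurrent.futures import ThreadPoolExecutor
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass
class StepStats:
    """Per-step counters, so a failing step is visible on its own."""
    ok: int = 0
    failed: int = 0

    @property
    def error_rate(self) -> float:
        total = self.ok + self.failed
        return self.failed / total if total else 0.0


class IsolatedPipeline:
    """Hypothetical illustration: each step gets its own worker pool."""

    def __init__(self, workers_per_step: Dict[str, int]) -> None:
        # One pool per step: a saturated step queues its own work
        # instead of consuming workers that other steps depend on.
        self.pools = {
            name: ThreadPoolExecutor(max_workers=n)
            for name, n in workers_per_step.items()
        }
        self.stats = {name: StepStats() for name in workers_per_step}
        self._lock = threading.Lock()

    def run(self, step: str, fn: Callable[[], object]) -> object:
        """Run fn in the named step's pool and record the outcome."""
        future = self.pools[step].submit(fn)
        try:
            result = future.result()
        except Exception:
            with self._lock:
                self.stats[step].failed += 1
            raise
        with self._lock:
            self.stats[step].ok += 1
        return result

    def unhealthy_steps(self, threshold: float) -> List[str]:
        """Names of steps whose error rate exceeds the threshold."""
        return [n for n, s in self.stats.items() if s.error_rate > threshold]

    def overall_error_rate(self) -> float:
        """Health of the process as a whole, across all steps."""
        failed = sum(s.failed for s in self.stats.values())
        total = failed + sum(s.ok for s in self.stats.values())
        return failed / total if total else 0.0

    def shutdown(self) -> None:
        for pool in self.pools.values():
            pool.shutdown()
```

In this sketch, a caller would construct something like `IsolatedPipeline({"cleanup": 1, "render": 2})`, route each unit of work through `run`, and check `unhealthy_steps` periodically. In a real deployment the isolation would come from separate servers and the counters from a monitoring system rather than in-process state.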

Posted Mar 10, 2022 - 13:30 UTC

Resolved
This incident has been resolved.
Posted Mar 08, 2022 - 21:24 UTC
Monitoring
Engineers have observed that error rates have returned to normal. We are monitoring to ensure that all systems remain normal.
Posted Mar 08, 2022 - 16:55 UTC
Investigating
Inky engineers are investigating an increase in error rates for accessing the Inky Dashboard.
Posted Mar 08, 2022 - 16:28 UTC
This incident affected: Dashboard Services (Dashboard Services US).