
Postmortem Template

March 20, 2024
ellen@thorntale.com

Pie Day Outage 2024

Date: 2024-03-14
Authors: A. Baker
Status: Resolved

Summary

The Pie Facts service was degraded for 2 hours during a surge of interest in Pies on Pie Day, when a poisoned cache caused queries to fail or return incorrect answers.
 
Additional Context: Pie Facts is a website that presents interesting facts about the requested Pie. Each Pie is responsible for curating responses about themselves; during times of high load, Pie Facts will use cached responses.

Impact

~158k Pie Facts queries failed outright
~5k Pie Facts queries returned incorrect responses
Potential loss of user trust in the accuracy of Pie Facts

Root Cause Analysis

Due to unexpectedly high Pie Facts load on Pie Day, our contracted Pies (Apple, Pecan, and Pumpkin) were unable to answer all questions. Under normal circumstances, this would not have caused an outage, but the Pie Facts cache was mistakenly poisoned, leading to invalid answers.

Timeline

2024-03-14

15:00: Triggering Event
15:17: Incident 5923 opened
16:03: Partial Mitigation
18:13: Incident 5923 Resolved

Pre-incident Steady State

The Pie Facts service consists of a web interface where users can interact with their favorite Pies. During the incident, PieCorp had three contracted Pies online: Apple, Pecan, and Pumpkin. The Pies respond in real time to user questions through the Piefluencer interface.
 
During times of high load, cached responses are served to users, based on keyword matches to previous questions to a particular Pie (linked by unique PieID). A cached response is served when a question has been in the queue for more than 60s.
 
On a normal day, about 30% of questions are served by the cache. These cached responses are not spread evenly through the day; most occur during peak load times.
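
The cache fallback described above can be sketched roughly as follows. This is a minimal illustration, not the Pie Facts implementation: the 60s threshold and the per-PieID keyword matching come from this document, while `FactsCache`, `keywords`, `answer`, and the keyword-extraction rule are hypothetical names and choices made for the example.

```python
import re
from dataclasses import dataclass, field
from typing import Dict, FrozenSet, List, Optional, Tuple

QUEUE_TIMEOUT_S = 60  # from the postmortem: cache is used after 60s in the queue


def keywords(text: str) -> FrozenSet[str]:
    # Hypothetical keyword extraction: lowercase words longer than three letters.
    return frozenset(w for w in re.findall(r"[a-z]+", text.lower()) if len(w) > 3)


@dataclass
class FactsCache:
    # Maps PieID -> list of (question keywords, response) from earlier questions.
    entries: Dict[str, List[Tuple[FrozenSet[str], str]]] = field(default_factory=dict)

    def store(self, pie_id: str, question: str, response: str) -> None:
        self.entries.setdefault(pie_id, []).append((keywords(question), response))

    def lookup(self, pie_id: str, question: str) -> Optional[str]:
        """Return the response with the largest keyword overlap for this PieID."""
        asked = keywords(question)
        best, best_overlap = None, 0
        for cached_keywords, response in self.entries.get(pie_id, []):
            overlap = len(asked & cached_keywords)
            if overlap > best_overlap:
                best, best_overlap = response, overlap
        return best


def answer(cache: FactsCache, pie_id: str, question: str,
           seconds_in_queue: float) -> Optional[str]:
    # Only fall back to the cache once the question has waited past the threshold.
    if seconds_in_queue > QUEUE_TIMEOUT_S:
        return cache.lookup(pie_id, question)
    return None  # still waiting for the Pie to respond in real time
```

In this sketch every lookup is scoped to a single PieID, so the correctness of cached answers depends entirely on that ID being valid and unique.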

Trigger (2024-03-14, all times UTC)

13:04: Influencer DonutEatThat posts on Cakestagram sending their 9M followers to Pie Facts.
13:05-14:30: Pie Facts query load increases rapidly. By 14:30, 95% of questions are served from the cache.
 
 
15:00: Pumpkin (PieID: Pumpkin) attempts to change their PieID to 🎃Pumpkin!🥧 to commemorate Pie Day.
 
The Piefluencer app has been updated to support emojis, but Pie Facts has not, so Pie Facts changes Pumpkin's PieID to an empty string.
Once Pumpkin's PieID became invalid, all requests directed to them were cache misses, and their new answers were stored under the empty-string cache key instead, poisoning the cache.
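
One way to avoid this failure mode is to reject an unsupported PieID outright instead of storing a blank one. The sketch below is an assumption-laden illustration: the allowed character set, the length limit, and the names `VALID_PIE_ID`, `InvalidPieID`, `validate_pie_id`, and `change_pie_id` are all hypothetical, since this document does not specify what Pie Facts accepts.

```python
import re

# Assumed rule (not stated in the postmortem): 1 to 32 ASCII letters,
# digits, underscores, or hyphens.
VALID_PIE_ID = re.compile(r"[A-Za-z0-9_-]{1,32}")


class InvalidPieID(ValueError):
    """Raised when a requested PieID cannot be stored by Pie Facts."""


def validate_pie_id(candidate: str) -> str:
    # fullmatch requires the whole string to match, so "" and emoji are rejected.
    if not VALID_PIE_ID.fullmatch(candidate):
        raise InvalidPieID(f"unsupported PieID: {candidate!r}")
    return candidate


def change_pie_id(current: str, requested: str) -> str:
    """Return the new PieID, keeping the current one if the request is invalid."""
    try:
        return validate_pie_id(requested)
    except InvalidPieID:
        return current
```

With this rule, `change_pie_id("Pumpkin", "🎃Pumpkin!🥧")` leaves the PieID as `"Pumpkin"`, so cache keys stay intact even when the requested value is unsupported.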

Incident (2024-03-14, all times UTC)

15:00: Pumpkin's PieID is now the empty string. Pumpkin continues to handle requests, without noticing any problems, since the Piefluencer app renders the emoji-containing PieID.
15:12: Support receives a complaint from a user that a question they directed towards Pumpkin seems to have been answered by Apple.
15:17: Incident 5923 opened.
15:29: Engineering searches the Pie Facts logs and finds that Pumpkin's PieID is empty. As a result, the cache is searching all answers, from any Pie, to answer questions directed towards Pumpkin. Meanwhile, Pumpkin's answers cached under the empty-string PieID are being served in response to any question, to any Pie.
15:45: Partial Mitigation. Engineering resets Pumpkin's PieID to its original value. It is determined that there's no way to clear the cache specifically for entries with an empty string PieID, so the Pie Facts cache is cleared entirely. The cache begins to repopulate with fresh answers, but queue times spike to unacceptably high levels.
16:14: Engineering manually repopulates the cache by running an experimental synthetic traffic runner using cleaned logs to generate question-response pairs.
16:25: Incident 5923 Resolved. Monitoring confirms that queue times are back to normal.
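
The 15:29 log findings describe two symptoms but not the matching rule behind them. Purely as an illustration, the sketch below uses a hypothetical rule, `loose_match`, in which a blank PieID on either side is treated as "any Pie". That rule reproduces both symptoms. The stricter `strict_match` shows one possible alternative. The cache contents and all function names are invented for the example.

```python
from typing import Callable, List, Tuple


def loose_match(query_id: str, entry_id: str) -> bool:
    # Hypothetical buggy rule: a blank ID on either side matches everything.
    return not query_id or not entry_id or query_id == entry_id


def strict_match(query_id: str, entry_id: str) -> bool:
    # Blank IDs never match anything, including each other.
    return bool(query_id) and query_id == entry_id


# Invented cache contents: (PieID, cached response).
CACHE: List[Tuple[str, str]] = [
    ("Apple", "Apple answer"),
    ("", "Pumpkin answer stored under the empty PieID"),
    ("Pecan", "Pecan answer"),
]


def candidates(query_id: str, match: Callable[[str, str], bool]) -> List[str]:
    """Return every cached response whose PieID matches the queried PieID."""
    return [response for entry_id, response in CACHE if match(query_id, entry_id)]
```

Under `loose_match`, a query for the empty PieID sees all three entries, and a query for `"Apple"` also sees the entry stored under the empty PieID. Under `strict_match`, a query for `"Apple"` sees only Apple's entry and a blank query sees nothing.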
 
 

Action Items

[P0] Update Piefluencer app to disallow PieIDs that are unsupported by Pie Facts
[P0] Separate PieID from Pie Screenname, and disallow editing the former
[P1] Create a utility to purge Pie Facts cache based on certain keys
[P1] Improve Pie onboarding experience to educate Pies about the limitations of the PieCorp platform
[P2] Add emoji support to Pie Screenname
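
As a starting point for the cache purge utility listed above, a key-based purge might look like the sketch below. It assumes, hypothetically, that the cache behaves like a dictionary keyed by `(PieID, keyword)` tuples; the real Pie Facts cache layout is not described in this document, and `purge` is an illustrative name.

```python
from typing import Callable, Dict, Tuple

CacheKey = Tuple[str, str]  # assumed layout: (PieID, keyword)


def purge(cache: Dict[CacheKey, str],
          should_purge: Callable[[CacheKey], bool]) -> int:
    """Delete every entry whose key satisfies should_purge; return the count."""
    # Collect the keys first so the dict is not mutated while iterating over it.
    doomed = [key for key in cache if should_purge(key)]
    for key in doomed:
        del cache[key]
    return len(doomed)
```

For example, `purge(cache, lambda key: key[0] == "")` would remove only the entries stored under an empty PieID and leave the rest of the cache warm, avoiding the full cache clear that caused the queue-time spike at 15:45.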