On 2021-06-02 at 11:20 UTC our engineers identified an issue where our infrastructure was not collecting data for Heatmaps and Recordings for a number of sites.
On 2021-06-01 at 09:00 UTC (the previous day), a number of sites started being over-sampled. As a result, the Hotjar Tracking Code sent less Recordings and Heatmaps data for these sites than it should have.
On 2021-06-02 at 11:00 UTC, this issue was resolved. Sites then became under-sampled, which sent a surge of traffic to our infrastructure. Throughout the day, our engineers manually mitigated the impact of this surge. By 22:30 UTC, the surge had ended and traffic had returned to normal patterns.
Why did this issue occur?
Starting on 2021-06-01 at 09:00 UTC, the part of our infrastructure that calculates how much each site should be sampled failed to run for 24 hours. Because sampling was not recalculated during that time, sites that were being heavily sampled stopped receiving data when their traffic decreased. Our automated failsafe restarted this infrastructure after 24 hours. Once it was running again, previously sampled sites sent more traffic to our servers, which our engineers monitored and manually mitigated throughout the day.
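The effect of a sampling rate that is not recalculated can be illustrated with a small sketch. This is not our actual implementation: the function names, the capacity figure, and the session counts are all hypothetical and chosen only to show the arithmetic.

```python
def sample_rate(expected_sessions: int, capacity: int) -> float:
    """Fraction of sessions to keep so that roughly `capacity` are collected.

    Hypothetical model: a site is only sampled when it exceeds its capacity.
    """
    if expected_sessions <= capacity:
        return 1.0
    return capacity / expected_sessions


def collected(actual_sessions: int, rate: float) -> float:
    """Expected number of sessions collected at a given sampling rate."""
    return actual_sessions * rate


# Illustrative numbers only (assumptions, not real figures).
CAPACITY = 1_000
stale_rate = sample_rate(10_000, CAPACITY)  # calculated during a busy period
fresh_rate = sample_rate(500, CAPACITY)     # what a recalculation would give

# Traffic drops to 500 sessions, but the stale rate is still applied.
quiet_day_with_stale_rate = collected(500, stale_rate)
quiet_day_with_fresh_rate = collected(500, fresh_rate)
```

In this sketch, the stale rate keeps only a small fraction of the quiet day's sessions, while a recalculated rate would keep all of them. When rates are finally recalculated for many sites at once, the amount of data sent rises sharply, which matches the surge described above.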
What will we do to prevent this from happening in the future?
We have updated our failsafe to trigger after 20 minutes instead of 24 hours. We will also be holding a post-mortem to investigate further possible remedies or mitigations.
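A failsafe of this kind can be sketched as a staleness check on the last successful run. This is a minimal illustration under assumptions: the function name and the idea of tracking a "last successful run" timestamp are hypothetical, and only the 20-minute threshold comes from the change described above.

```python
from datetime import datetime, timedelta, timezone

# Threshold taken from the report; everything else here is an assumption.
MAX_STALENESS = timedelta(minutes=20)


def needs_restart(
    last_successful_run: datetime,
    now: datetime,
    max_staleness: timedelta = MAX_STALENESS,
) -> bool:
    """Return True when the sampling calculation has not succeeded recently."""
    return now - last_successful_run > max_staleness


# Example: the calculation last succeeded at 09:00 UTC.
last_run = datetime(2021, 6, 1, 9, 0, tzinfo=timezone.utc)
after_19_min = needs_restart(last_run, last_run + timedelta(minutes=19))
after_21_min = needs_restart(last_run, last_run + timedelta(minutes=21))
```

With a 20-minute threshold, a check 19 minutes after the last success does not trigger a restart, while a check at 21 minutes does. A shorter threshold limits how long sites can run on outdated sampling.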