Recordings and Heatmap data for some customers not being collected

Incident Report for Hotjar

Postmortem

On 2021-06-02 at 11:20 UTC our engineers identified an issue where our infrastructure was not collecting data for Heatmaps and Recordings for a number of sites.

What happened?

On 2021-06-01 at 09:00 UTC (the previous day), a number of sites started being over-sampled. This caused the Hotjar Tracking Code to send less Recordings and Heatmaps data for these sites.

On 2021-06-02 at 11:00 UTC, this issue was resolved, which caused the affected sites to become under-sampled and send a surge of traffic to our infrastructure. Throughout the day, our engineers manually mitigated the impact of this surge. By 22:30 UTC, the surge had ended and traffic had returned to normal patterns.

Why did this issue occur?

On 2021-06-01 at 09:00 UTC, the part of our infrastructure that calculates how much each site should be sampled stopped running and stayed down for 24 hours. Because sampling levels were not being recalculated during that time, sites that were being heavily sampled stopped receiving data when their traffic decreased. Our automated failsafe restarted this infrastructure after 24 hours. This caused previously sampled sites to send more traffic to our servers, which our engineers monitored and manually mitigated throughout the day.

What will we do to prevent this from happening in the future?

We have updated our failsafe to trigger after 20 minutes instead of 24 hours. We will be holding a post-mortem to investigate further possible remedies or mitigations.
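
For readers who want a concrete picture of this kind of failsafe, here is a minimal sketch in Python of a staleness check with a 20-minute threshold. It is an illustration only and not our actual implementation: the names `STALE_AFTER`, `needs_restart`, and `check_and_restart`, as well as the idea of passing in a restart callback, are assumptions made for this example.

```python
from datetime import datetime, timedelta, timezone

# Threshold from the report: act once 20 minutes pass without a successful run.
# The constant name is hypothetical.
STALE_AFTER = timedelta(minutes=20)


def needs_restart(last_successful_run: datetime, now: datetime,
                  stale_after: timedelta = STALE_AFTER) -> bool:
    """Return True when the job has not completed a run within the allowed window."""
    return now - last_successful_run >= stale_after


def check_and_restart(last_successful_run: datetime, now: datetime, restart) -> bool:
    """Call the supplied restart callback if the job looks stalled.

    Returns True if a restart was triggered, False otherwise.
    """
    if needs_restart(last_successful_run, now):
        restart()
        return True
    return False
```

With a check like this running every few minutes, a stalled job would be restarted roughly 20 minutes after its last successful run, so far less deferred traffic would build up than during a 24-hour gap.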

Posted Jun 07, 2021 - 09:53 UTC

Resolved

The issue has been resolved. Thank you for your patience!

Posted Jun 02, 2021 - 18:45 UTC

Update

The issue with Recordings and Heatmaps has been resolved for most customers. We are still monitoring and tackling issues related to Recordings and Heatmap data capture being delayed.

We'll continue to provide updates as we have more information.

Posted Jun 02, 2021 - 15:46 UTC

Monitoring

Our engineers have identified an issue that has caused sites to receive less Recordings and Heatmap data than they should have over the past 24 hours. The issue has been resolved and we are currently monitoring our data processing.

We'll provide an update once we have more information.

Posted Jun 02, 2021 - 10:44 UTC