When an Outage Keeps You from Reporting Said Outage: GitHub Addresses Vexing Problem
GitHub has addressed vexing outages, including recent days-long email delivery problems that had developers raging on social media and blasting the popular Microsoft-owned code repository service because they couldn't log in to their accounts, or even report the problem.
"[W]aiting for my email verification for 2 days now," read one of some 63 comments on a seemingly innocuous GitHub social media post last month titled "What could you achieve with GitHub Copilot?"
The answer to that question, for many, was along the lines of "not much." Many users couldn't even report the problem, because doing so required signing in, which in turn depended on emails that weren't being sent. As one developer put it, "It's ridiculous that I have to login to be able to report a login issue :-?"
A sampling of other comments includes:
- GitHub is broken. Rejecting mail in and not sending mail out.
- I'm experiencing this as well. Verification and password reset emails are not being sent. Edit: The emails finally came through after about 2 hours
- Unable to log in and retrieve password, the emails are full and the app cannot change password, the experience is very bad
- I don't receive any e-mail for verification at all. The funny thing is that to write to the support you need to write the verification code
- No email with password reset, cannot contact support because verification email doesn't work, and when I try to send you an email I get the below answer [same as screenshot above]. Mates, you need to fix it ASAP, people are locked out of their accounts
In its recently published "GitHub Availability Report: April 2024," the company explained the problem and how it was being addressed. The report noted, "In April, we experienced four incidents that resulted in degraded performance across GitHub services." The email issue was the longest-lasting of these and apparently the most concerning to users.
The Problem
April 11 08:18 UTC (lasting 3 days, 4 hours, 23 minutes)
Between April 11 and April 14, GitHub.com experienced significant delays (up to two hours) in delivering emails, particularly for time-sensitive emails like password reset and unrecognized device verification. Users without 2FA attempting to sign in on an unrecognized device were unable to complete device verification, and users attempting to reset their password were unable to complete the reset. The delays were caused by increased usage of a shared resource pool, and a separate internal job queue that became unhealthy and prevented the mailer queue from processing.
The Solution
Immediate improvements have been made to better detect and react to similar situations in the future, including a queue-bypass ability for time-sensitive emails and updated methods of detection for anomalous email delivery. The unhealthy job queue has been paused to prevent impact to other queues using shared resources.
The report also detailed the other three incidents, which were much shorter (ranging from 30 minutes to 2 hours), and how the company responded to them.
GitHub's status page shows how its services are doing at any given time; when this article was first drafted, it reported all systems operational. An incident history page provides information on past problems.
Oops
While this article was being written, new incident messages popped up on the status page.
Just minutes later, all systems returned to operational status. Problems do happen with complicated software systems, as any developer reading this knows well. But GitHub is good at keeping users apprised of what's going on and what's being done about it.
About the Author
David Ramel is an editor and writer at Converge 360.