Not only was the site, with its 500 million users, down for almost 3 hours, but the “Like” buttons all over the Web also stopped working, making this Facebook’s worst outage in 4 years. In fact, the issue was so serious that the entire site had to be shut down in order to fix it.
In a blog post on its site, Facebook began by apologizing to its users and then explained the issue. “An automated system for verifying configuration values ended up causing much more damage than it fixed,” said Robert Johnson in the official statement.
According to Facebook, the big problem was that every time a client got an error while attempting to query one of the databases, it interpreted the error as an invalid value and deleted the corresponding cache key. “This meant that even after the original problem had been fixed, the stream of queries continued. As long as the databases failed to service some of the requests, they were causing even more requests to themselves. We had entered a feedback loop that didn’t allow the databases to recover. The way to stop the feedback cycle was quite painful – we had to stop all traffic to this database cluster, which meant turning off the site. Once the databases had recovered and the root cause had been fixed, we slowly allowed more people back onto the site,” the official statement added.
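For readers who want to see why such a loop does not fix itself, here is a minimal toy sketch in Python. It is not Facebook’s code: the function `simulate`, its parameters, and the single-key model are all hypothetical, and it assumes that failed queries in a given moment undo any successful cache refill made in that same moment. It only illustrates how deleting a cache key on every database error can keep an overloaded database overloaded, and why throttling traffic breaks the cycle.

```python
def simulate(clients_per_tick, db_capacity, delete_on_error):
    """Toy model (hypothetical): return the database queries issued in each tick.

    One cached key is shared by all clients. While the key is cached, clients
    read it from the cache and the database sees no queries. While it is
    missing, every client queries the database, which can serve at most
    `db_capacity` queries per tick; the rest fail.
    """
    cached = False  # the key starts out missing, as after the initial bad value
    queries_per_tick = []
    for clients in clients_per_tick:
        if cached:
            queries_per_tick.append(0)  # every client is served from the cache
            continue
        queries = clients
        failures = max(0, queries - db_capacity)
        successes = queries - failures
        if successes > 0:
            cached = True   # a successful query refills the cache
        if failures > 0 and delete_on_error:
            cached = False  # assumed: a failed client deletes the key again
        queries_per_tick.append(queries)
    return queries_per_tick


if __name__ == "__main__":
    # Deleting on error: the database never gets a quiet moment.
    print(simulate([100] * 5, db_capacity=10, delete_on_error=True))
    # Leaving the cache alone on error: load drops after the first refill.
    print(simulate([100] * 5, db_capacity=10, delete_on_error=False))
    # Deleting on error, but traffic is throttled to 5 clients in the third tick.
    print(simulate([100, 100, 5, 100, 100], db_capacity=10, delete_on_error=True))
```

In this sketch the first run stays at 100 queries per tick indefinitely, while the third run recovers once traffic briefly drops below what the database can serve. That loosely mirrors the decision described in the statement to cut traffic to the database cluster and then let people back in slowly.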
Facebook has already turned off the automated system that caused the problem, and the site is working normally again. This is the second time this week that Facebook has gone down; the first outage was on Wednesday, but Thursday’s was the more serious of the two.