The Grid Is Down While We Bang on Things – used during downtimes many years ago!
On Monday 11th January, April Linden, a member of the Second Life operations team, published a blog post titled “Why Things Were Less Than Optimal This Past Weekend in Second Life”. She attributes the weekend’s problems to a “series of independent failures” that “produced the rough waters Residents experienced inworld”.
The master node of one of the central databases, one of the most heavily used in Second Life, crashed on Saturday. The failure disrupted a lot of Second Life residents over the weekend. By Sunday evening the operations team had managed to re-stabilize the grid.
Here is what happened, in April’s own words.
On Saturday 9th January
Shortly after midnight Pacific time on January 9th (Saturday) we had the master node of one of the central databases crash. The central database that happened to go down was one of the most used databases in Second Life. Without it Residents are unable to log in, or do, well, a lot of important things.
This sort of failure is something my team is good at handling, but it takes time for us to promote a replica up the chain to ultimately become the new master node. While we’re doing this we block logins and close other inworld services to help take the pressure off the newly promoted master node when it starts taking queries. (We reopen the grid slowly, turning on services one at a time, as the database is able to handle it.) The promotion process took about an hour and a half, and the grid returned to normal by 1:30am.
After this promotion took place the grid was stable the rest of the day on Saturday, and that evening.
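For readers curious what that slow, one-service-at-a-time reopening might look like in practice, here is a minimal Python sketch. It is only an illustration of the idea April describes, not Linden Lab's actual tooling: the function name `reopen_in_stages`, the health check callback, and the service names in the usage note are all hypothetical.

```python
from typing import Callable, List


def reopen_in_stages(
    services: List[str],
    db_can_handle_more: Callable[[], bool],
    enable: Callable[[str], None],
) -> List[str]:
    """Turn services back on one at a time after a database promotion.

    Stops as soon as the (hypothetical) health check reports that the
    newly promoted master cannot take more load, and returns the names
    of the services that were reopened.
    """
    reopened: List[str] = []
    for name in services:
        if not db_can_handle_more():
            # Leave the remaining services closed until the database settles.
            break
        enable(name)
        reopened.append(name)
    return reopened
```

Calling it with a made-up list such as `["logins", "inventory", "search"]` and a check that reads a load metric would reopen logins first and hold the rest back if the database started to struggle. The real order and the real checks are not described in April's post.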
On Sunday 10th January
That brings us to Sunday morning.
Around 8:00am Pacific on January 10th (Sunday), one of our providers started experiencing issues, which resulted in very poor performance in loading assets inworld. I very quickly got on the phone with them as they tracked down the source of the issue. With my team and the remote team working together we were able to spot the problem, and get it resolved by early afternoon. All of our metrics looked good, and I and my colleagues were able to rez assets inworld just fine. It was at this point that we posted the first “All Clear” on the blog, because it appeared that things were back to normal.
It didn’t take us long to realize that things were about to get interesting again, however.
Shortly after we declared all clear, Residents rushed to return to the grid. (Sunday afternoon is a very busy time inworld, even under normal circumstances!) The rush of Residents returning to Second Life (a lot of whom now had empty caches that needed to be re-filled) at a time when our concurrency is the highest put many other subsystems under several times their normal load.
Rezzing assets was now fine, but we had other issues to figure out. It took us a few more hours after the first all clear for us to be able to stabilize our other services. As some folks noticed, the system that was under the highest load was the one that does what we call “baking” – it’s what makes the texture you see on your avatar – thus we had a large number of Residents that either appeared gray, or as clouds. (It was still trying to get caught up from the asset loading outage earlier!) By Sunday evening we were able to re-stabilize the grid, and Second Life returned to normal for real.
It’s really interesting to hear April’s perspective on what went on. Near the end of the blog post she writes, “My team takes the stability of the grid extremely seriously, and no one dislikes downtime more than us”.
One of the things I like about my job is that Second Life is a totally unique and fun environment! (The infrastructure of a virtual world is amazing to me!) This is both good and bad. It’s good because we’re often challenged to come up with a solution to a problem that’s new and unique, but the flip side of this is that sometimes things can break in unexpected ways because we’re doing things that no one else does.
I’m really sorry for how rough things were inworld this weekend. My team takes the stability of the grid extremely seriously, and no one dislikes downtime more than us. Either one of these failures happening independently is bad enough, but having them occur in a series like that is fairly miserable.
See you inworld (after I get some sleep!),
I remember the weekly downtimes Second Life had many years ago, which lasted for hours, and the old message “the grid is down while we bang on things”. Grid stability has improved since then, but things can still go wrong at any time and take everyone by surprise, even after 13 years of Second Life being online.
Thanks to April Linden for explaining what happened during the weekend and apologising for the situation.