As promised, the following is a summary of what led up to the crash that kept us down for almost four days last week.
About two weeks ago, I started to notice problems with the What The Christ? server. It was mostly on the administrative side of things at first. After a few days, however, we got a string of random reports from visitors who couldn’t connect. It took a day or two before we realized that they were all seeing the same message: “Error establishing a database connection”.
That’s the error that WordPress throws when your server can’t establish a connection to your MySQL database – the point from which content is delivered to the user. It’s usually caused by one of two things: a problem with your WordPress configuration file or a (much bigger) issue on the MySQL side of things.
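For anyone who wants to rule out the easy half of that diagnosis, here is a minimal sketch in Python. It is not what we ran. It assumes the standard DB_NAME, DB_USER, DB_PASSWORD and DB_HOST constants in wp-config.php, assumes MySQL's default port of 3306 when DB_HOST doesn't name one, and the helper names are made up for illustration. It reads the settings out of the file's text and checks whether anything is listening at the database host.

```python
import re
import socket

# Matches lines such as: define('DB_HOST', 'localhost');
# Simplification: values containing quote characters are not handled.
DEFINE_RE = re.compile(
    r"""define\(\s*['"](DB_NAME|DB_USER|DB_PASSWORD|DB_HOST)['"]\s*,\s*['"]([^'"]*)['"]\s*\)"""
)

REQUIRED = ("DB_NAME", "DB_USER", "DB_PASSWORD", "DB_HOST")


def read_db_settings(config_text):
    """Return the DB_* constants found in the text of a wp-config.php file."""
    return dict(DEFINE_RE.findall(config_text))


def missing_settings(settings):
    """List the required constants that were not defined at all."""
    return [key for key in REQUIRED if key not in settings]


def split_host(db_host, default_port=3306):
    """Split 'host' or 'host:port' into (host, port).

    Assumption: anything after the colon that is not a number (for example
    a socket path) falls back to the default port.
    """
    host, _, port = db_host.partition(":")
    return host, int(port) if port.isdigit() else default_port


def mysql_port_open(host, port, timeout=3.0):
    """Return True if a TCP connection to host:port succeeds."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False
```

If `missing_settings(read_db_settings(text))` comes back empty and `mysql_port_open(*split_host(settings["DB_HOST"]))` is true, the configuration file is probably not the culprit, and the trouble is more likely on the MySQL side. An open port only tells you something is listening, not that the database behind it is healthy.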
Configuration issues are easy to fix. Unfortunately, we weren’t facing a configuration issue. No, we were facing a much bigger problem. In fact, Mr. Blue said it best.
“Dude, the database is fucking hosed.”
Don’t ask me how it happened. It could have been our migration to WordPress 3.0.1 a few weeks back. It could have been a plug-in that we were testing. It could have been one of a handful of configuration changes that we made at the host level. Whatever it was, we were looking at a really ugly problem. Our databases are not small, and we hadn’t done nearly as much maintenance on them as we should have. To be honest, we’d been neglecting them because we were way too busy updating and promoting the site.
Needless to say, we were faced with a tough decision. We could either bring down the site, which would have been bad, or try to troubleshoot the problem live, which could have led to downtime anyway. Obviously, we chose to troubleshoot the problem live. About an hour into the process, the server took a giant shit and dumped our progress. (Hooray, ambiguous Plesk error!) Down we went – and down we stayed. I could detail the laundry list of problems we had trying to bring things back online over the two days that followed, but I’d rather not. It was a nightmare.
Unfortunately, that was only the start of our troubles.
Initially, the crash didn’t seem like a big deal. I thought we’d be down for a day at most. If there’s one thing we’ve done “right” from day one, it’s our backup routine. We had a dump of even the most recent posts – both the MySQL data and the associated assets. I thought we were good to go. In fact, it seemed like we were good to go after a couple of days sitting around trying to build a better mousetrap. We eventually decided that the best plan was to wipe the server. I reinstalled everything from scratch and started the process of uploading assets to the server. Meanwhile, Blue opened a case of beer and a book on MySQL.
But then something happened. Something bad.
Early Saturday, the server simply decided to die. Period. Hardware failure.
I’m pretty sure anyone within a six-block radius of my house heard me scream, “FUCK!” when it happened. Everything shut down. No SSH. No FTP. A moment later, I was on the phone with our host. Thankfully, they understood my frustration and didn’t hold it against me. Instead, they got right to work trying to figure out what the hell had happened.
Ten hours later, the problem was solved. (In case you’re wondering, we didn’t move things to an interim server in the meantime because we figured that our server would be back up and running before we were even able to get everything uploaded on the temporary box.)
From there, the rebuild was simple… but tedious. I managed to tie up four Comcast lines, three of which belong to friends, for close to nine hours while we uploaded assets and ran a much-needed database optimization routine. We also cleaned up the SQL configuration and learned a thing or two about responsible administration. (Thank you, mysqltuner.pl.)
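For the curious, a routine like that can be as simple as running MySQL's OPTIMIZE TABLE against each table. The Python sketch below only builds the statements. It is an illustration, not the routine we actually ran, and the table names in the usage note are just the WordPress defaults with the stock wp_ prefix, used here as an assumption.

```python
def quote_identifier(name):
    """Quote a MySQL identifier with backticks, doubling any embedded backtick."""
    return "`" + name.replace("`", "``") + "`"


def optimize_statements(tables):
    """Build one OPTIMIZE TABLE statement per table name (hypothetical helper)."""
    return ["OPTIMIZE TABLE " + quote_identifier(table) + ";" for table in tables]
```

Calling `optimize_statements(["wp_posts", "wp_comments", "wp_options"])` gives three statements that can be pasted into a MySQL client. On big tables this can take a while and may lock the table, so it is the kind of thing to schedule rather than run on a whim.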
So, there you have it. That’s how it went down – pardon the pun.

Well, I feel like a noob now. Apparently I don’t know jack shit about maintaining a website. I’m glad everything got resolved.
I’ve learned more about computer hardware and best practices of server administration from running this website than I ever could have learned from a book or even a computer science course. There’s nothing like “on the job” experience.
at least it’s back up and kicking ass now. Thanks for the hard work, but i won’t lie and say i didn’t check an obscene amount of times while it was down. wtc is like crack
Dudes, I feel your pain. I quit HP a few months back after being one of their cluster and performance “escalation engineers” for the HP IT internal database team. When stuff goes down and you’ve got *thousands* of people wondering WTF is going on, the pressure is intense. Thanks for the hard work, it’s appreciated. You guys are doing a kick-ass job on quite possibly the best website to come along in a while. And yeah, WTC is total crack for me too.
You should write a book, Pink. It doesn’t matter what it’s about. I would buy it.
I love how you tell stories. They are very engaging.
gg
I too checked this place multiple times per day. I found myself more than a few times sitting in class wondering, “I hope WTC comes back up soon. I hope everything is ok. Where am I gonna get quality jerkin’ material?” Seriously, this website kicks so much ass it is just silly. I hope you guys are proud of what you’ve made, because you should be. Bravo for all you’ve done.
great stuff guys… thanks for bringing the site back to life. i already started to have tremors, tics and spasms.. life without WTC is tough… ughhhh…. addiction….
you got the server at home? how does that work?
personally i wasn’t bothered because i understand how it is when things go pete tong.
i’m glad you didn’t lose your baby man, and welcome back
I was getting the error establishing database connection on and off for a couple weeks, I chalked it up to implementing the new changes. I had no idea things got so fucked, yikes. I agree with Zakath too, you write well and have your own distinct voice. And puns never need to be pardoned, imo, i even love bad ones (which that one wasn’t)
Protip: Don’t test new shit in production.
Born: We own the servers but the host stores them for us and maintains our connection. I can SSH, FTP, etc. into the boxes any time I want to, of course.
wtcfan: We don’t have to anymore, now that we have an instanced backup server. And to be honest, it wasn’t really an issue of testing anything in production. It was a matter of a bad hard drive and slightly mismatched memory pairs, as far as we can tell.