We’re back! The entire OSAF/Chandler world was offline for about 24 hours from April 27th at 2pm to April 28th. This outage has been fixed with no permanent damage.
The root cause was “sparking” (ouch) on the power lines, as reported by PG&E. Our colo hosts, Hosted @ ISC, switched over to their generator as a precaution, but the generator failed. ISC started shutting down machines when they knew the outage would outlast their UPS capacity.
Here’s where our outage stretched out longer than it needed to. ISC didn’t let us know they were powering down the machines, or that power was back up a couple hours later. They responded to my 3pm inquiry to their ticketing system late on the 27th, saying that everything should have been fine hours ago. Unfortunately, I had already gone to bed, so the OSAF/Chandler outage had to wait out the night.
On the morning of the 28th, I asked ISC to hit the power buttons on the machines for me, but nothing happened. I packed up a quick tech kit plus some spare machines and hopped in the car for the hour’s drive to Redwood City. On-site around noon, I confirmed the machines appeared dead. Weird. Hopefully the power issues hadn’t killed all 4 physical machines’ power supplies or motherboards at once, right?
Turns out, our managed power device that lets us turn machines on and off remotely had taken a header. I started moving machines’ power around, finding out by the end which ports on the power switch were dead. I moved the Hub machine off the power switch entirely.
That would all have been relatively simple, except for the extended duration, but other post-shutdown details smacked me around for a while. I found that the tightened DNS configuration we implemented back when DNS security went crazy kept our DNS machines from answering queries from themselves. I kept production machines like the Hub off until I had DNS sorted out.
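A quick way to catch this class of misconfiguration before returning a server to service is to ask it to resolve a name against itself. This is a hypothetical sketch, not our actual procedure; the zone name is a placeholder:

```shell
# Hypothetical smoke test: ask the local DNS server a question from
# localhost and check that the reply status is NOERROR rather than
# REFUSED (ACL lockout) or a timeout. Zone name is a placeholder.
if dig @127.0.0.1 example.org SOA +time=2 +tries=1 \
        | grep -q 'status: NOERROR'; then
    echo "server answers its own queries: OK"
else
    echo "loopback queries failing -- check allow-query/allow-recursion ACLs"
fi
```

Had a check like this been in place, the self-query lockout would have shown up when the configuration was tightened, not during an outage.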
But far worse was a surprise related to the Debian Etch to Lenny upgrade. Since Lenny was released a few months ago, I’ve been upgrading OSAF machines in the background. I don’t know how I missed this, but the upgrade can remove the package used to bring up networking (ifupdown). I had been breaking the cardinal sysadmin rule of always rebooting machines right after upgrading the OS, so I was very, very confused when some machines came back up without any networking. Including our primary DNS server. And the secondary DNS server seemed broken because of the DNS config error I mentioned above.
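One way to guard against this (a sketch of the idea, not the exact commands used at the time) is to verify that ifupdown survived the dist-upgrade before rebooting, while the machine still has working networking:

```shell
# Hypothetical pre-reboot check on a Debian box: make sure the
# package that brings up networking is still installed after a
# dist-upgrade, and reinstall it while the network still works.
if dpkg -s ifupdown >/dev/null 2>&1; then
    echo "ifupdown installed -- safe to reboot"
else
    echo "ifupdown missing! reinstalling before reboot"
    apt-get install --yes ifupdown
fi
```

Running a check like this right after `apt-get dist-upgrade`, and then actually rebooting, surfaces the problem while recovery is still a one-line fix.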
I shot myself in the foot even more because before I realized all this, I had decided that the already-extended outage “seemed a good time” to upgrade the Hub from etch to lenny. After the reboot, no networking! Gah, it was just working fine before the upgrade!
So, after getting a grip on what was going on, tracing through the networking startup scripts, and tracking down the missing ifupdown package, I went bumming through the ISC office for a USB key. I used one of the existing machines to pull the ifupdown package (both i386 for the virtual machines and amd64 for the Hub) with a command-line web browser, trawled through syslog to figure out which device to mount, and got the packages onto the machines that needed them. Luckily, it all worked and every machine came back up.
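The recovery amounted to something like the following. This is a reconstruction: the mirror URL, package version, and USB device name are placeholders, not the actual values from that day:

```shell
# On a machine that still had networking: fetch the ifupdown .deb
# for each needed architecture from a Debian mirror.
# (URL and VERSION are placeholders.)
wget http://ftp.debian.org/debian/pool/main/i/ifupdown/ifupdown_VERSION_i386.deb
wget http://ftp.debian.org/debian/pool/main/i/ifupdown/ifupdown_VERSION_amd64.deb

# Copy the packages onto the USB key; the device name (here a
# placeholder, /dev/sdb1) is what syslog reports when the key is plugged in.
mount /dev/sdb1 /mnt
cp ifupdown_*.deb /mnt
umount /mnt

# Then, on each network-less machine: mount the key, install the
# package for that machine's architecture, and bring the interfaces up.
mount /dev/sdb1 /mnt
dpkg -i "/mnt/ifupdown_VERSION_$(dpkg --print-architecture).deb"
umount /mnt
ifup -a    # bring up all "auto" interfaces from /etc/network/interfaces
```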
All’s well that ends well, I suppose, but this was a tough outage to swallow after the 9-hour outage from the big fiber cut less than a month ago. Natural questions like “is our hosting good enough?” and “should we move?” come up.
My view is that weird things happen in every hosting environment; moving is not a cure-all for reliability issues. Our reliability isn’t as good as we’d like or as good as could be achieved, but I feel it’s still better to sit tight. The main reason is cost: hosting the Hub consumes a good amount of bandwidth (about 8Mbps), and ISC is providing all services (space, power, bandwidth, remote hands) for free for 9 rack units’ worth of equipment. It would be possible to move some services like mailing lists, code repositories, and wikis to other free services, and someday the community may choose to go that route. But it’s nice to not have restrictions on capabilities or capacity, which we get from hosting almost all of our own services; many open source projects would love to have access to the resources and flexibility that OSAF enjoys.
Overall I think we’ve just had a spate of bad luck, and while a 24-hour outage might be unacceptable to a commercial venture, as long as such outages stay very rare, they are acceptable to the OSAF and Chandler communities and well worth the tradeoffs.
There are three items I want to undertake as a result of the outage:
- Place a DNS secondary outside of ISC
- Print out some phone numbers, IP ranges, and other “might need it offline” info
- Talk to ISC about coordinating during outages
As a final note, it turns out I could have determined that “everything should be fine now” by going to status.isc.org where there was some information about the outage, including when it ended. I hadn’t been checking that page during the outage but I certainly will (as well as using the phone as needed) during any future outages.