Worst day of my career

I think I had just about the worst day of my career today. On Saturday morning, during the first of the RAID container rebuilds, the file system on the file server decided to get corrupted. But of course, with very few people accessing the server, we didn't know about it until this morning, when one group of users couldn't get to their directory at all. After we checked the event logs and saw what was happening, we thought we should do an emergency shutdown and chkdsk so that the affected users might be able to do some work that day. However, the server decided it wanted to reset the permissions on every file on the server. All one million of them. After two and a half hours and 150,000 files, we decided we'd risk losing everything and reboot out of it. The server came back up, but all the file permissions were utterly screwed.

So while everything was so fubar, we decided to go back to the Ultra160 EMM card and cable, because we knew that they at least worked. Plus, by dropping back to 160 from 320, we may eliminate some of the flakiness in the RAID containers.

So after replacing the card and rebooting, it was then a matter of resetting all the permissions. Fortunately, I'd only recently done a bunch of permissions scripts and consolidated our documentation, so it wasn't such a difficult thing to reset all the permissions on the box to what they should be. But it did take many hours.

I still have a few major directories to go, and then I have to figure out which files were corrupted and try to restore those from last week's backup tapes.

The irony of all of this is that the backup tonight is running faster than it's ever run before on the new server. And this is on a "slower" EMM card, which suggests there may be a problem with the 320 EMM card or the new cable. Who knows. Dell certainly don't. I talked to them this morning before all of this started, and they were going to escalate it to their second-level support. Well, they called back this afternoon and I kinda let them have it. Well, not badly; I just told them what I'd gone through all day and that we'd gone back to the old card. He offered to send out a tech, and I said that it would be somewhat difficult to "fiddle" with a production server. I want to get the machine settled a bit more, and have some sort of way to recover to another machine if necessary. This of course involves buying another server, and replication software that I haven't had time to research yet :/ Blah.

Urgh. What a horrible day. It's the biggest unplanned outage I've ever had at work. Very distressing.

I was going to go to bed, but I have this urge to watch the backup finish, so I'm watching Peter Pan to recover my mood somewhat.


Dave2 said:

Dell actually has pretty good support... for their expensive servers. If you were working with one of their cheap desktop computers, your day could have been much worse! Have you looked into Xserve? We replaced one of our dead Dells with one and are seriously in love with it. It can hold two mirrored RAIDs, but now we're looking at the external Xserve RAID, which is about the cheapest storage we've found.

October 18, 2004 11:53 PM


