Mother, the coop server, went down unintentionally at 1100 hours this morning, for the third time since she went into the rack about nine months ago. By around 1300 hours it was obvious that this was no network glitch. I suppose I should have gone into the data centre, plugged a monitor and a keyboard into mother and tried to find out what happened, but I had an engagement for the afternoon, so I decided to see if we could get the service back up by rebooting.
It took about half an hour for mother to check and repair her four RAID partitions, then she came up all right. I checked the logs and it seems that mother simply hung. I also checked for any signs of mother getting r00ted, but she seems to be okay.
James raised the possibility of a backup solution in case of failure again. Definitely something to think about.
2 responses so far ↓
1 James Mok // May 30, 2004 at 1:57 am
Does the log keep track of every access with a timestamp? I mean, is it possible that the P-ATA drives on Mother do not like the occasional sudden surge of many simultaneous accesses, if several of us just happened to do something at exactly the same time?
2 tin_the_fatty // May 30, 2004 at 8:34 am
The PATA drives might be cheap, but they should be able to withstand quite a bit of abuse. I checked the logs. There is no sign of heavy access. As I didn’t have packet logging turned on, I couldn’t check whether we were under a DoS attack (http://en.wikipedia.org/wiki/Denial_of_service), although if we were, mother would go down again in no time.
Which reminds me that I should turn on packet logging.
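For anyone who wants to repeat the "no sign of heavy access" check, here is a rough sketch of how it could be done. It assumes the log carries Apache-style timestamps such as [29/May/2004:10:59:45 +0800]; that format and the function names are my assumptions, not necessarily what mother's logs look like.

```python
import re
from collections import Counter
from datetime import datetime

# Assumption: Apache-style timestamps, e.g. [29/May/2004:10:59:45 +0800].
# The capture group keeps everything down to the minute and drops the seconds.
TIMESTAMP = re.compile(r"\[(\d{2}/\w{3}/\d{4}:\d{2}:\d{2}):\d{2} [+-]\d{4}\]")


def requests_per_minute(lines):
    """Count log lines per minute; lines without a timestamp are skipped."""
    counts = Counter()
    for line in lines:
        match = TIMESTAMP.search(line)
        if match:
            minute = datetime.strptime(match.group(1), "%d/%b/%Y:%H:%M")
            counts[minute] += 1
    return counts


def busiest_minutes(lines, top=5):
    """Return the `top` busiest minutes as (minute, count) pairs."""
    return requests_per_minute(lines).most_common(top)
```

Feed it the lines of the log file (an open file object works) and look at whether the busiest minutes cluster just before the hang. A flat profile would back up the view that load was not the cause.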