Geek-speak warning.
Ouch, yesterday was painful. As I mentioned in the last post, the server which runs the FloodRiskNet web site and more besides died. I spent almost all day (from morning coffee time when I rebooted the server until eight in the evening when it next booted successfully from its hard drive) panicking, running up and down stairs with various rescue CDs in the hope that one of them would provide me with some useful tools for tracing and fixing the problem.
The symptom: The machine would start, the boot loader would run, the kernel would load, and then it would freeze just after printing:
Freeing unused kernel memory: 130Kb
With the help of Google, I found out that this is the point at which init starts. Init is the first process started on a Unix box once the kernel is running. It took a while from that discovery until the penny dropped that this probably meant that there was an incompatibility between the kernel, which was old, and init, which was relatively new because I was keeping the system software up to date.
The kernel was old because I am using the XFS filesystem, and have only recently gotten round to working out how to build patched kernel packages in Debian. Since the stock kernels do not include XFS support, I couldn't upgrade the kernel along with the rest of the system using the standard kernel packages, and I was always intending to work out how to build one but not getting round to it.
I might have found out earlier that there was a problem but for the fact that it isn't necessary to reboot Linux for much. It continued running quite happily through several upgrades, but, I presume, although init was upgraded it was never restarted, so the new version wasn't tested. It was therefore only when I rebooted that the problem manifested itself.
Once I'd tracked down the cause of the problem, the fix was relatively easy, although it necessitated a bit more learning of the workings of GNU/Linux. In particular I was very glad that I'd heard of chroot some time ago, because if I hadn't I still wouldn't have a working server. With it I was able to boot using a rescue CD, mount the hard drives from the system in the correct configuration, and then chroot to the system's normal root directory. That allowed me to use the system package management facilities to install the new kernel package that I'd prepared (copied over on a 1680Kb formatted floppy, which took a while too, because I only have old and dodgy floppies lying around).
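For anyone who wants the chroot trick spelled out, here is a rough sketch of the order of operations rather than a record of exactly what I typed. It only builds and prints the commands; it runs nothing. The device names, the /mnt/target mount point, the package path and the use of dpkg -i are all assumptions made up for illustration.

```python
import posixpath
import shlex


def depth(mount_point):
    """Number of path components: "/" is 0, "/var" is 1, "/var/log" is 2."""
    return len([part for part in mount_point.split("/") if part])


def rescue_plan(filesystems, target, package):
    """Return the shell commands for a chroot-based repair, in order.

    filesystems: (device, mount_point) pairs as laid out on the installed
    system. All values passed in here are hypothetical examples.
    """
    commands = []
    # Parents must be mounted before children, so sort shallowest first.
    for device, mount_point in sorted(filesystems, key=lambda fs: depth(fs[1])):
        path = posixpath.normpath(target + "/" + mount_point)
        commands.append(shlex.join(["mount", "-t", "xfs", device, path]))
    # Package tools inside the chroot usually expect /proc to be available.
    commands.append(shlex.join(["mount", "-t", "proc", "proc", target + "/proc"]))
    # The package path is interpreted relative to the new root.
    commands.append(shlex.join(["chroot", target, "dpkg", "-i", package]))
    return commands


# Hypothetical layout: root on hda1, /var on hda3.
for line in rescue_plan(
    [("/dev/hda3", "/var"), ("/dev/hda1", "/")],
    "/mnt/target",
    "/root/kernel-image.deb",
):
    print(line)
```

The point of the sorting is simply that the root filesystem has to be in place before anything that lives underneath it can be mounted.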
That done, it booted right up no problem. And I did a little dance, which was fine because it was 8 o'clock and no one was there to see me.
The moral of the story is probably: don't put off things like kernel maintenance. Hmmm.
A twist in the tale is that I forgot that shutting down a system booted from a rescue disk wouldn't unmount any filesystems which had been manually mounted. This meant that at some point the system XFS filesystems were unceremoniously turned off. Now XFS is a journaling filesystem. The whole point of using it was reliability; if the system is turned off without unmounting the filesystems, then there is a journal on the disk which can be replayed. There is a much lower chance of corruption. Rather upsettingly, I ended up having to run xfs_repair and force it to fix one of the filesystems because it was damaged in a way which meant it wouldn't mount (and therefore couldn't replay the log). Precisely the situation I was trying to avoid by using XFS.
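With hindsight, what I should have done before powering off is unmount everything I had mounted by hand, deepest mount point first. A minimal sketch of that ordering follows; again it only prints the commands, and the /mnt/target path and the list of mount points are hypothetical.

```python
import posixpath


def depth(mount_point):
    """Number of path components: "/" is 0, "/var" is 1."""
    return len([part for part in mount_point.split("/") if part])


def unmount_plan(target, mount_points):
    """Return umount commands for everything mounted under target.

    Children are unmounted before their parents, so the deepest
    mount points come first and the root filesystem comes last.
    """
    ordered = sorted(mount_points, key=depth, reverse=True)
    return ["umount " + posixpath.normpath(target + "/" + mp) for mp in ordered]


# Hypothetical set of manually mounted filesystems.
for line in unmount_plan("/mnt/target", ["/", "/var", "/proc"]):
    print(line)
```

Only once all of those have succeeded is it safe to shut the rescue system down.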