In-Depth

Tales from the Trenches: Critical State

Sometimes you need to troubleshoot the people before the technology.

While at a client site one day, an alpha page came in from a natural gas trading firm we support: “John, our NT server running our Oracle 8.0 database and supporting our trading floor is down. It’s stopped at a blue-screen, and on every reboot, it comes back to the same blue screen. It says something about an inaccessible boot device. Call me.”

I got on the phone immediately, knowing that for every hour down, the company was losing many thousands of dollars. After talking to the customer for a couple of minutes, I determined this wasn’t something we could fix over the phone. I immediately excused myself from the current job and headed over to the trading firm.

The server I was going to work on was a Hewlett-Packard NetServer with dual Pentium Pro processors and 256M of RAM. Additionally, it had four 4G SCSI hard drives in a RAID 5 configuration, with a hardware RAID controller. A powerful advantage of a hardware RAID controller is that it gives you a fault-tolerant system while still letting you work with system partitions from Windows NT’s setup routine. If you use software fault tolerance, such as the mirroring that comes with NT, you can’t work with system partitions in the setup routine without breaking the fault tolerance.

The Sordid Details
The system partition was formatted with NTFS, so I knew we couldn’t boot from a DOS disk to examine the files on the hard disk. The server also had a 12/24G tape drive running Computer Associates’ ARCserve backup software, so I was hoping we had a good backup to restore from, should that become necessary.

The server’s state was critical when I arrived: it was sitting at the “blue screen of death” with a STOP 0x0000007B “Inaccessible Boot Device” error. I decided to do a parallel install of NT Server, which would let me examine the integrity of the startup environment. Before beginning the parallel installation, I gathered the proper driver for the RAID controller. The moment setup started, I began pressing the F6 key, which allowed me to supply the RAID driver so NT would recognize the controller on restart and avoid another instance of the “Inaccessible Boot Device” message. While inspecting the startup environment, I found the boot.ini file pointing to the wrong disk. A normal boot.ini file looks like this:

[boot loader]
timeout=10
default=multi(0)disk(0)rdisk(0)partition(1)\WINNT
[operating systems]
multi(0)disk(0)rdisk(0)partition(1)\WINNT="Windows NT Workstation Version 4.00"
multi(0)disk(0)rdisk(0)partition(1)\WINNT="Windows NT Workstation Version 4.00 [VGA mode]" /basevideo /sos

Ours was like this:

[boot loader]
timeout=10
default=multi(0)disk(4)rdisk(0)partition(1)\WINNT
[operating systems]
multi(0)disk(4)rdisk(0)partition(1)\WINNT="Windows NT Workstation Version 4.00"
multi(0)disk(4)rdisk(0)partition(1)\WINNT="Windows NT Workstation Version 4.00 [VGA mode]" /basevideo /sos

Notice the disk callout difference in the third line? I quickly edited boot.ini to correct the disk callouts (the edit itself is sketched below) and restarted the server. The server immediately rebooted into the existing operating system! We celebrated briefly and continued into the OS. We then had some services fail and found that several shared folders no longer existed on the server. It turned out the customer had tried an emergency repair on the server, restored a registry from an old emergency repair disk, and didn’t tell me!
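
For the record, boot.ini is a hidden, read-only system file on the system partition, so the edit I mentioned takes a couple of extra steps. Here’s a rough sketch of the sequence from a command prompt in the parallel installation, assuming the system partition still shows up as C: (adjust the path for your own layout):

rem unprotect boot.ini so it can be edited
attrib -r -s -h c:\boot.ini
notepad c:\boot.ini
rem change disk(4) back to disk(0) on the default= line and in both [operating systems] entries, then save
rem re-protect the file
attrib +r +s +h c:\boot.ini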

Our only hope at this point was to extract an updated catalog from the backup tape and restore our server, including the registry. It turned out we had a good backup from the previous night.

The restore operations reported success, so we restarted the server and held our breath yet again. The server came back up! We rechecked the integrity of the restoration and the boot.ini file. Everything looked good so far. We then tested the integrity of the Oracle database, and the clients were able to attach to it successfully! We took the clients back off-line, installed the OS service packs, and reapplied the Y2K updates. We restarted the server once more. Everything started successfully, and clients were able to attach to the database.
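
If you want a quick spot check of client connectivity before letting the users back on, the Net8 tools will do it from any workstation. A sketch, assuming a service alias I’m calling TRADE here (hypothetical; on Oracle 8.0 for NT the client utilities carry version suffixes such as tnsping80 and plus80, which later releases drop):

rem confirm the listener answers for the alias
tnsping80 TRADE
rem then attach with SQL*Plus and run a trivial query; you'll be prompted for the password
plus80 system@TRADE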

Time Is of the Essence

Things could have gone differently. If my customer had had an up-to-date emergency repair disk, or hadn’t restored a registry during an emergency repair, we could have repaired the boot.ini file and had the system up faster. Every time you make a change to your disks or partitions, or upgrade the service pack on your server, be sure to update your emergency repair disk.
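
On NT 4.0, refreshing the repair information is a one-line job with rdisk.exe, so it’s easy to fold into your change procedure. A minimal example; the /s switch also saves the SAM and SECURITY hives, which the plain command skips:

rem update %SystemRoot%\repair, then create a fresh emergency repair disk when prompted
rdisk /s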

Additionally, if you do need to do an emergency repair of a server, you should only restore a registry from your emergency repair disk as a last resort. It turned out the problem with the boot.ini file had been created when the customer had moved some partitions around to better use the RAID for the Oracle server.

Next, be sure to communicate all of the details of every repair attempt to everyone on the team attacking the problem. Attempts to repair a downed server often aren’t directed at the real source of the problem. When I’m working with some of our junior network people, I constantly have to remind them that I need all of the data to formulate a plan.

Last, always have a good backup, and monitor and audit it regularly. Take it from me: There’s no worse feeling than not having a good backup as your last line of defense.

About the Author

John T. Kruizenga, MCSE, has worked with computers and networking since 1988. He has designed and managed networks that incorporate VoIP and QoS, remote management, WAN integration, collaborative software, and Web integration.
