Be Prepared

Scout out potential trouble by listening to your servers’ sounds. The best plan, as always, is a current backup.


Many network administrators who have been in the game long enough develop a motherly instinct when it comes to overseeing their family of servers and workstations. They know when their machines are too hot, hungry (for more RAM), or sick. The sound a machine makes when it’s down and out is distinct. Sounds emanating from the server normally lead to a straightforward diagnosis, as there are so few moving parts: hard drive failure. The noise sometimes reminds me of a playing card in the spokes of a bicycle or an index finger in a moving CPU fan (though I shouldn’t admit to having done the latter). The symptoms aren’t always severe at first. Advanced operating systems like Windows NT/2000 can detect bad sectors on the drive and remap the affected data accordingly. But when the worst-case scenario becomes reality, you’d better count on a long night.

Overconfidence
I wasn’t prepared for such an evening when I drove six hours to a client site to upgrade its SQL Server from 6.5 to 7.0. In fact, because I’d done at least 10 upgrades of the same kind, using the upgrade wizard provided in SQL Server 7.0, I was expecting to roll out early and spend a relaxing evening studying in the hotel room for an upcoming exam.

After arriving, I made my introductions to the IT staff members, with whom I’d only had phone contact previously. We shared a few moments of “Hey, that’s what you look like!” before I started getting ready for the upgrade. I think they could perceive my confidence, as we’d been planning this for several weeks. We chose a Thursday evening so I could be on site Friday morning in the event of a disaster.

The Dreaded Clicking Sound
I remember seeing the entrance to the server room from about 15 feet away. As I approached the entrance, I felt something odd, like a premonition of a long night without the opportunity even to stop for a slice of delivery pizza—but I shrugged it off.

The moment I placed my right foot in the doorway, however, I heard it: The repetitive click of a small, incapacitated spindle arm banging against the metal surface of a disk drive platter. I looked at the now inquisitive IT staff members, who’d placed their entire trust in me to make their jobs and lives easier. Their questioning looks said, “I wonder what he’s going to do about this?” I smiled and said something that apparently only I found humorous: “So, you guys did make good backups like I asked, right?”

The sickly machine, naturally, turned out to be the SQL Server system I was there to upgrade. I pulled up to the machine in a painfully uncomfortable rolling office chair, a place I would occupy for the next several hours. I was miraculously able to log in and navigate the directory structure, though the machine was crawling. The first thing I discovered was that no RAID (Redundant Array of Inexpensive Disks) was configured, whether hardware, software or otherwise. They’d gone with the antithesis of RAID: SLED (Single Large Expensive Disk). The drive was partitioned into C: and D: drives. The master database was set up on C: and all the user databases on D:. SQL Server seemed to be running fine, but I knew it was only a matter of time, potentially only minutes, before the machine might never boot again. The more I worked on the machine, the more noise it made and the slower it got. I had to act fast.

If at First You Don’t Succeed…
They’d purchased a new server that was going to be the recipient of the upgraded databases, once the conversion was finished on the now-crashing server. Both were running NT 4.0 with Service Pack 6. They’d already installed SQL 7.0 with SP2 on the new server.

I faced several alternatives. One option would be to attempt a machine-to-machine upgrade by connecting, via the network with the upgrade wizard, to the old SQL Server. Another course of action: Remove SQL 7.0 on the new server, install SQL 6.5, copy over the database files to the new server and perform a single-machine upgrade. Either option would require pulling very large amounts of data—a gig and change—from the ailing system. I decided to walk down the machine-to-machine upgrade path first. About 10 minutes into the process, when I was just starting to believe it would work, the old server hung. I waited for any sign of life, then decided to take the only option left and bounce the old server. Another small miracle occurred when it actually rebooted successfully and the SQL services started! So, scratch plan A and move to plan B, the single-machine upgrade.

I’d learned a trick for moving a full SQL Server directly to a new machine without having to restore databases individually, one I’d employed several times in the past. The procedure’s simple: Install the same version of SQL Server on another machine. Stop the SQL services on both machines. Copy all the SQL database and log files, such as master.dat, from the source server to the same locations on the destination server. If master.dat resided in C:\MSSQL\Data, then that’s where it has to go on the new server. The master database contains information about all the databases, users and logins on the server. With all the files in place, take the old server offline, give the new server the same name as the old box and change the IP address to match. Restart the SQL services on the new machine. If everything is done correctly, the new SQL Server will be identical to the old one. This was my new plan of attack.
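The copy step of that procedure can be sketched in a few lines of Python. This is a minimal illustration, not the tooling used on site: the function names (`copy_with_digest`, `sha256_of`, `mirror_files`) and the example paths are hypothetical, and it assumes the SQL services are already stopped on both machines so that no file is open. It reads each source file only once, which matters when the source disk is failing, and then checks the copy against the digest taken during that read.

```python
import hashlib
from pathlib import Path

CHUNK_SIZE = 1024 * 1024  # read in 1 MiB pieces


def copy_with_digest(src: Path, dst: Path) -> str:
    """Copy src to dst in one pass and return the SHA-256 of the bytes read."""
    digest = hashlib.sha256()
    with src.open("rb") as fin, dst.open("wb") as fout:
        for chunk in iter(lambda: fin.read(CHUNK_SIZE), b""):
            digest.update(chunk)
            fout.write(chunk)
    return digest.hexdigest()


def sha256_of(path: Path) -> str:
    """Return the SHA-256 hex digest of a file on disk."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(CHUNK_SIZE), b""):
            digest.update(chunk)
    return digest.hexdigest()


def mirror_files(src_root: Path, dst_root: Path, relative_paths: list[str]) -> list[str]:
    """Copy each file to the same relative location under dst_root.

    Returns the relative paths whose copy does not match what was read
    from the source. Assumes the database services are stopped (not done here).
    """
    mismatched = []
    for rel in relative_paths:
        src = src_root / rel
        dst = dst_root / rel
        dst.parent.mkdir(parents=True, exist_ok=True)
        if copy_with_digest(src, dst) != sha256_of(dst):
            mismatched.append(rel)
    return mismatched
```

A call such as `mirror_files(Path("//oldserver/C$"), Path("C:/"), ["MSSQL/Data/master.dat"])` (share and server names are placeholders) would place master.dat at the same path on the new machine. An empty return list means every copy matched what was read from the source.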

Crash No. 2
I uninstalled SQL 7.0 and installed SQL 6.5 on the new server. It had been partitioned identically, so all that remained for me to do was move the data and log files. I connected to the dying server by mapping drives to the administrative shares on the C: and D: drives and began copying the files. In hindsight, I could have attempted to zip the files, but that would have required even more hard drive activity. After another 45 minutes of copying, the old server hung again. Arggghhh!

Please Tell Me You Made the Backup
This time I opted to restore from backup tape. I’d made the network manager at the client site promise me he’d make backups of everything, including the raw data files, before I began the upgrade. He’d stopped the SQL services so the files wouldn’t be open and subsequently skipped during the backup process. I was able to pull the backed-up files from the tape and restore them on the new SQL server, which was now a SQL 6.5 machine. I started the services after powering down the old machine for the final time, and everything started successfully. All that remained was to reinstall SQL 7.0 and complete the upgrade process for all the databases. Thankfully, it went off without a hitch.

As it was nearly 3 a.m. before I finished, I headed back to the hotel. I was so geared up, I did actually study for a few minutes. The users showed up the next morning, rested and eager to experience the promised performance gains. I made sure they were all content and then headed home for a relaxing weekend, remembering to thank the real heroes who’d saved the day with one backup tape.

MTBF Means Just That
Every hard drive comes with an MTBF value. MTBF stands for Mean Time Between Failures and is measured in hours. Though today’s hard drives have values in the hundreds of thousands of hours, just knowing the number exists is food for thought.
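To put that number in perspective, an MTBF figure can be turned into a rough yearly failure probability. The sketch below assumes a constant failure rate (the exponential model), which is a simplification, and uses 500,000 hours purely as an illustrative value, not a figure for any particular drive. The function names are likewise made up for this example.

```python
import math

HOURS_PER_YEAR = 8760  # 24 hours * 365 days


def annualized_failure_rate(mtbf_hours: float) -> float:
    """Probability that one drive fails within a year of continuous use,
    assuming a constant failure rate of 1 / MTBF."""
    return 1.0 - math.exp(-HOURS_PER_YEAR / mtbf_hours)


def expected_failures_per_year(mtbf_hours: float, drive_count: int) -> float:
    """Expected number of failures per year across a fleet of drives
    that run around the clock, under the same assumption."""
    return drive_count * HOURS_PER_YEAR / mtbf_hours


if __name__ == "__main__":
    mtbf = 500_000  # illustrative value only
    print(f"Per-drive yearly failure chance: {annualized_failure_rate(mtbf):.2%}")
    print(f"Expected failures among 100 drives: {expected_failures_per_year(mtbf, 100):.2f}")
```

Under these assumptions a single drive has a chance of a little under 2 percent of failing in a given year, and a shop with 100 such drives should expect between one and two failures annually. That is small for one machine but far from negligible across a server room.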

Remember, though, that there are many resources and technologies out there to prevent hard-drive catastrophes, or at least to minimize downtime. Thus, there are really no excuses for not protecting your data. Technologies like IntelliMirror, clustering, Remote Installation Services, disk imaging and the single-disk recovery procedures that come with many backup applications offer varying levels of protection.

There are also companies that provide services to yank data off drives seemingly beyond repair. Many of these resources are expensive, however, and not all companies see the value in investing in them. Having seen my share of crashes, and with the plunging prices of hard drives, I’d recommend at the very least using the software mirroring available with Win2K in conjunction with a solid tape backup plan. In the end, the time and cost involved with rebuilding and restoring a server, especially if there’s significant data loss, would likely pay for the ultimate addition to my server family: a twin pair of quad Xeon, load-balanced, RAID 10, hot-swappable cluster servers with a solid tape backup plan.
