Tales from the Trenches: Screams from the Server Room

Mix a $50,000 server, NetWare, and a nitwit consulting company to get a true recipe for catastrophe.

A few years back I ran a network at a local high school, a school that was heralded as a “lighthouse” project to show how computers could be used in education. It was indeed sophisticated for its time—the network consisted of 350 diskless workstations booting Windows 95 from the server, 25 laser printers, seven file/print servers running Novell NetWare 4.0, and an IBM AIX server for Internet and Web capabilities. All the systems were networked using both copper and fiber token ring, and nearly 80 percent of the library’s resources were online (all this for about 1,000 students).

However, the company that had originally installed this network had done such a terrible job that three years later, when I started, I was still fixing installation and configuration problems. More about this company shortly.

Upgrade Problems

After running the network for about a year, we decided that a new, powerful server was required to replace the aging administration server. Since the school board decided that the only systems we could purchase were from Compaq, we opted for a high-end ProLiant server. Unknown to us, the company that had originally installed our network was the same company (under a new name) as the one selected as Compaq’s official vendor. Because of the way our network was configured, each of the seven main servers had two token ring cards and acted as a router to shuttle information between the different networks. Since all our existing servers were IBM systems with only MicroChannel as their data bus, we had no cards on hand that would fit the Compaq, so we opted to let the “trusted” company decide which token ring cards to purchase. They recommended IBM bus mastering cards, priced at over $1,000 each.

During the summer the server arrived, and I installed it with Novell NetWare 4.11 in place of the existing administration server. The server ran for more than a month—that is, until teachers and administrative staff came back from the summer break. At that point, one of the server’s hard drives failed.

I quickly called in support, and they had someone at our door within 30 minutes informing us that he was here “to replace the failed hard drive on the ProLiant server.” Wow, great service, right? Wrong! He walked into the server room, which contained eight IBM servers and one Compaq server, and asked, “So which one needs the hard drive replaced?”

At this point, I left the room. After putting out a few fires and explaining why our $50,000 server was down, I returned, only to find that he hadn’t figured out how to remove the old hard drive yet. It eventually took him six hours to replace the hard drive.

I stayed through the night to get the OS, applications, and data files installed on the server so that it could be used in the morning. I personally watched the server console the next morning as user after user logged on and started working. Life was good.

Later that morning, my pager beeped as I was getting into my car to go home and get some sleep. The server had crashed!

Back in the server room, the system was so locked up that the soft power switch on the front of the machine no longer worked. Mumbling words that can’t be repeated here, I reached behind the system and rebooted the server. My mouth dropped when I got an “invalid system disk” message. I then attempted to recover the system using all the tools in my arsenal, such as Norton Utilities, but it was gone, dead, a doorstop. At this point, I formatted the hard drive and started over, thankful that I’d had the foresight to back up the system before letting users connect to it.

With the system back up, I watched as user after user connected and started working. When the seventeenth user connected, the system crashed again. Rebooting the server returned the same error as before, and we were back to square one.

Top Gun in Action

At this point, the only idea I could come up with was that this was a hardware problem. The vendor was again contacted and this time made aware of the consequences should they send the moron they’d sent before. This time, the company’s top NetWare/network consultant arrived to fix the problem. His first question: Could we duplicate the problem?

“Just use the system and it will crash,” I replied, adding “I think it’s a hardware problem.”

His response: “No, the hardware is fine. It’s your configuration that’s at fault.”

“Where’s the window in this place?” I mumbled.

“I thought you were running NetWare,” he said.

“Never mind,” I said as I walked out. I was in the process of informing my supervisor that I believed it was a hardware conflict between the system and the IBM token ring cards and asking if installing a window in the server room was in the budget, when the consultant found me. “Your NetWare installation is flawed. I’m going back to the office to get our testing copy. Back in a flash.”

While he was away, I went home and brought over a Pentium PC that was idle at the time. I decided to install Windows NT Server 4.0 on the system, since this was the standard the board of education was recommending. (I’d been on the team that had evaluated and recommended NT.) I installed two Madge.connect token ring cards along with NT, the applications, and the data files. Since there was no way this system was going to remote-boot the 35 admin workstations (NT doesn’t do diskless stations well), I installed Windows 95 on them and had the admins and teachers back on the network within 24 hours.

Software Theft

The following morning, the consultant berated me. We’d apparently “stolen” their copy of NetWare. Unlike the “we trust you” licensing with NT and Windows 2000, NetWare broadcasts its presence on the network. If another copy of NetWare is running with the same serial number, broadcast messages start to appear everywhere on the network. What had happened was that as soon as the consultant had installed his copy of NetWare, the system complained that an exact copy was installed on our CD-ROM system in the library.

“How could we possibly have done that?” I asked.

After a bit of detective work, we realized that this new company was the same one that had originally installed the network. They’d installed a 150-user copy of NetWare on the server and had been told that it wouldn’t be sufficient. A 250-user copy of NetWare had been ordered, and they’d left their own copy installed in the interim, then hadn’t bothered to come back and re-install the newly ordered software. Mystery solved, but problem still present.

Over the next month, the company had specialist after specialist look at the system. Network captures were done and sent to Novell, Compaq, and IBM tier 1 support. No one seemed to know what the problem was. Even though this wasn’t affecting the users (after all, they were working off the Pentium server), spending time working with the consultant was definitely slowing me down. Fed up, the school informed them that if they couldn’t stabilize the system within five days, they were to take the system back and expect a bill for my downtime.

In an apparent bout of desperation, the consultant actually asked me what I thought the problem was. Again, I said, “The token ring cards.”

“That’s not possible,” he said.

“Just humor me and try it,” I said.

The Climax

Two $300 cards were ordered and installed. We brought the system online and allowed users to connect to it. The 20th, 30th, 50th, 100th user connected without a problem. The consultant had nothing to say and left about an hour later. Since the admin/teacher team had come to like the way NT did things, that afternoon I formatted the drive and installed NT on the ProLiant. We had no problems with the system for the duration of my stint there, and the company in question was removed from the approved vendor list.

About a week later, I got an email from the consultant. He’d found out that there were no problems when a single IBM card was installed in a ProLiant, but the system would crash if a second card was installed and routing was turned on. I was also informed that Compaq had known about this problem for about four months.

What did I learn from this experience? I no longer rely on other companies’ suggestions for hardware. Although it tends to be slightly more expensive at the outset, evaluating the hardware beforehand can save a considerable amount of time and money. Also, there are many so-called “experts” in the IT field today. I wouldn’t hesitate to ask for resumes and references before allowing anyone to work on any of my systems.
