Tales from the Trenches: Screams from the Server Room
Mix a $50,000 server, NetWare, and a nitwit consulting company to get a true recipe for catastrophe.
- By Barry Shilmover
- October 01, 2000
A few years back I ran a network at a local high school,
a school that was heralded as a “lighthouse” project to
show how computers could be used in education. It was
indeed sophisticated for its time—the network consisted
of 350 diskless workstations booting Windows 95 from the
server, 25 laser printers, seven file/print servers running
Novell NetWare 4.0, and an IBM AIX server for Internet
and Web capabilities. All the systems were networked using
both copper and fiber token ring, and nearly 80 percent
of the library’s resources were online (all this for about
1,000 students).
However, the company that had originally installed this
network had done such a terrible job that three years
later, when I started, I was still fixing installation
and configuration problems. More about this company shortly.
Upgrade Problems
After running the network for about a year, we decided
that a new, powerful server was required to replace the
aging administration server. Since the school board decided
that the only systems we could purchase were from Compaq,
we opted for a high-end ProLiant server. Unknown to us,
the company that had originally installed our network
was the same company (under a new name) as the one selected
as Compaq’s official vendor. Because of the way our network
was configured, each of the seven main servers had two
token ring cards in them and acted as routers to shuttle
information between the different networks. Since all
our existing servers were IBM systems with only MicroChannel
as their data bus, we let the “trusted” company decide
which token ring cards to purchase for the new server.
They recommended IBM bus mastering cards, priced at over
$1,000 each.
During the summer the server arrived, and I installed
it with Novell NetWare 4.11 in place of the existing administration
server. The server ran for more than a month—that is,
until teachers and administrative staff came back from
the summer break. At that point, one of the server’s hard
drives failed.
I quickly called in support, and they had someone at
our door within 30 minutes informing us that he was here
“to replace the failed hard drive on the ProLiant server.”
Wow, great service, right? Wrong! He walked into the server
room, which contained eight IBM servers and one Compaq
server, and asked, “So which one needs the hard drive
replaced?”
At this point, I left the room. After putting out a few
fires and explaining why our $50,000 server was down,
I returned, only to find that he hadn’t figured out how
to remove the old hard drive yet. It eventually took him
six hours to replace the hard drive.
I stayed through the night to get the OS, applications,
and data files installed on the server so that it could
be used in the morning. I personally watched the server
console the next morning as user after user logged on
and started working. Life was good.
Later that morning, my pager beeped as I was getting
into my car to go home and get some sleep. The server
had crashed!
Back in the server room, the system was so locked up
that the soft power switch on the front of the machine
no longer worked. Mumbling words that can’t be repeated
here, I reached behind the system and rebooted the server.
My mouth dropped when I got an “invalid system disk” message.
I then attempted to recover the system using all the tools
in my arsenal, such as Norton Utilities, but it was gone,
dead, a doorstop. At this point, I formatted the hard
drive and started over, thankful that I had the foresight
to back up the system before letting users connect to
it.
With the system back up, I watched as user after user
connected and started working. When the seventeenth user
connected, the system crashed again. Rebooting the server
returned the same error as before, and we were back to
square one.
Top Gun in Action
At this point, the only idea I could come up with was
that this was a hardware problem. The vendor was again
contacted and this time made aware of the consequences
should they send the moron they’d sent before. This time,
the company’s top NetWare/network consultant arrived to
fix the problem. His first question: Could we duplicate
the problem?
“Just use the system and it will crash,” I replied, adding
“I think it’s a hardware problem.”
His response: “No, the hardware is fine. It’s your configuration
that’s at fault.”
“Where’s the window in this place?” I mumbled.
“I thought you were running NetWare,” he said.
“Never mind,” I said as I walked out. I was in the process
of informing my supervisor that I believed it was a hardware
conflict between the system and the IBM token ring cards
and asking if installing a window in the server room was
in the budget, when the consultant found me. “Your NetWare
installation is flawed. I’m going back to the office to
get our testing copy. Back in a flash.”
While he was away, I went home and brought over a Pentium
PC that was idle at the time. I decided to install Windows
NT Server 4.0 on the system, since this was the standard
the board of education was recommending. (I’d been on
the team that had evaluated and recommended NT.) I installed
two Madge.connect token ring cards along with NT, the
applications, and the data files. Since there was no way
this system was going to remote-boot the 35 admin workstations
(NT doesn’t do diskless stations well), I installed Windows
95 on them and had the admins and teachers back on the
network within 24 hours.
Software Theft
The following morning, the consultant berated me. We’d
apparently “stolen” their copy of NetWare. Unlike the
“we trust you” licensing with NT and Windows 2000, NetWare
broadcasts its presence on the network. If another copy
of NetWare is running with the same serial number, broadcast
messages start to appear everywhere on the network. What
had happened was that as soon as the consultant had installed
his copy of NetWare, the system complained that an exact
copy was installed on our CD-ROM system in the library.
“How could we possibly have done that?” I asked.
After a bit of detective work, we realized that this
new company was the same one that had originally installed
the network. They’d installed a 150-user copy of NetWare
on the server and were told that it wouldn’t be sufficient.
A 250-user copy of NetWare had been ordered, and they’d
left their own copy installed in the interim, then never
bothered to come back and install the newly ordered
software. Mystery solved, but problem still present.
Over the next month, the company had specialist after
specialist look at the system. Network captures were done
and sent to Novell, Compaq, and IBM tier 1 support. No
one seemed to know what the problem was. Even though this
wasn’t affecting the users (after all, they were working
off the Pentium server), spending time working with the
consultant was definitely slowing me down. Fed up, the
school informed them that if they couldn’t stabilize the
system within five days, they were to take the system
back and expect a bill for my downtime.
In an apparent bout of desperation, the consultant actually
asked me what I thought the problem was. Again, I said,
“The token ring cards.”
“That’s not possible,” he said.
“Just humor me and try it,” I said.
The Climax
Two $300 cards were ordered and installed. We brought
the system online and allowed users to connect to it.
The 20th, 30th, 50th, 100th user connected without a problem.
The consultant had nothing to say and left about an hour
later. Since the admin/teacher team had come to like the way
NT did things, that afternoon I formatted the drive and
installed NT on the ProLiant. We had no problems with
the system for the duration of my stint there, and the
company in question was removed from the approved vendor
list.
About a week later, I got an email from the consultant.
He’d found out there were no problems when a single IBM
card was installed in a ProLiant, but the system would
crash if a second card was installed and routing turned
on. I was also informed that Compaq had known about this
problem for about four months.
What did I learn from this experience? I no longer rely
on other companies’ suggestions for hardware. Although
it tends to be slightly more expensive at the outset, evaluating
the hardware beforehand can save a considerable amount
of time and money. Also, there are many so-called “experts”
in the IT field today. I wouldn’t hesitate to ask for
resumes and references before allowing anyone to work
on any of my systems.