9 Troubleshooting Tactics
A best practices guide that'll turn you into a troubleshooting efficiency expert.
By Anil Desai
July 1, 2002
Troubleshooting comes in all forms and sizes, but it's always an important
task for IT. In some cases, you may be trying to solve a driver conflict
for a single user; in others, you may be trying to solve a performance
issue with an enterprise-level application that affects thousands of users.
For many of us, the true fun (as well as angst and frustration) of being
in IT stems from the task of troubleshooting.
Many IT staffers handle troubleshooting in an ad hoc manner. That is,
they may react differently to the same problem each time it occurs, and
there are no established practices for handling difficult issues. Often,
they'll try and retry the same solutions, making even simple tasks seem
mind-numbingly difficult. If you've taken any of Microsoft's certification
exams, you know that your ability to troubleshoot problems is important.
It's no less important in the real world.
Though the problems can vary dramatically, some basic troubleshooting
techniques will help you find the most efficient path to solving a problem.
In this brief article, I'm going to give you my troubleshooting tactics.
Instead of diving into technical details (things like using PING to ensure
that network devices are responding), I'll focus on more general principles
that can be applied to many different types of problems (technical and
non-technical).
With that goal in mind, let's look at some common troubleshooting best
practices.
1. Identify The Desired Solution
OK, maybe I'm starting to sound like a Windows NT 4.0 exam here, but bear
with me. It's important to figure out where you're going if you're planning
eventually to reach a destination. If you ask users and business managers
about optimizing performance, they'll often state that the goal is to
make things run "as fast as possible." Though it might sound like a worthwhile
initiative, it's rarely practical. First, most systems really don't need
to achieve the theoretical maximum performance. Second, the cost to achieve
maximum performance would probably be prohibitive. And how will you ever
know what the actual "maximum" is if things can always be improved? A
better idea is often to settle for the not-so-ambitious but much-more-practical
goal of "good enough." You may state, for example, that Web-based reports
should be returned to the user within 30 seconds. Or, that network logons
should complete within 45 seconds, regardless of network load. With
well-defined goals like these in mind, you'll know when an issue is actually
resolved, and you can close it to everyone's satisfaction.
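To make "good enough" concrete, it can help to write the targets down in a form you can check measurements against. The following is only a minimal sketch: the goal names and the `meets_target` and `unmet_goals` helpers are hypothetical, and the limits simply reuse the 30-second and 45-second examples above.

```python
# Hypothetical "good enough" targets, in seconds, taken from the examples
# in the text. Real targets would come from users and business managers.
TARGETS = {
    "web_report": 30.0,
    "network_logon": 45.0,
}

def meets_target(name, measured_seconds, targets=TARGETS):
    """Return True when a measurement is within its agreed limit."""
    return measured_seconds <= targets[name]

def unmet_goals(measurements, targets=TARGETS):
    """Return the names of goals whose measured time exceeds the target."""
    return sorted(
        name
        for name, seconds in measurements.items()
        if not meets_target(name, seconds, targets)
    )
```

With something like this in place, "are we done?" becomes a question with a yes-or-no answer rather than a matter of opinion.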
2. Fully Understand The Problem
All too often, I've seen IT staff jump into finding the solution for a
problem without fully understanding the issue. If a person complains about
a problem with a soundcard on her desktop machine, you might be inclined
to reinstall various drivers. But what if the issue is that there's too
much bass output when she uses headphones? Or users might complain about
performance issues related to accessing a particular resource (a corporate
database server, for example). Although that's their original complaint,
it's quite possible that the real issue is at the network level or perhaps
even on the client side. By asking simple questions, such as "Who's affected
by the problem?" "Are the problems repeatable or intermittent?" and "When
did the problems begin?" you can gain significant insight into potential
remedies. I can't stress this one enough: Be sure you fully understand
the problem before looking for solutions.
3. Define Metrics and Make Repeatable Measurements
Suppose you're troubleshooting an intermittent issue. These tend
to be particularly frustrating, since it's difficult to know whether or
not you've solved the problem. It's important to establish some kind of
metric. For example, if a user says that logons take "forever" on Monday
mornings, ask the user to quantify this. If the logon takes 30 seconds, then
perhaps it's more of a perception issue. If it takes 10 minutes, you've
got a major problem somewhere. The important thing is that you have a
way to measure the problem. Now, when you make changes, you can go back
to these original measurements to see if you've made a difference. The
approach that you take to resolving the problem should be based on this
information.
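One way to keep measurements repeatable is to script them, so the same test runs the same way before and after a change. Here is a rough sketch in Python. The `measure` and `improved` helpers are hypothetical names, `operation` stands in for whatever you're timing, and the 10 percent threshold is an arbitrary assumption rather than a recommendation.

```python
import statistics
import time

def measure(operation, runs=5):
    """Time `operation` several times and summarize the results in seconds."""
    samples = []
    for _ in range(runs):
        start = time.perf_counter()
        operation()
        samples.append(time.perf_counter() - start)
    return {
        "runs": runs,
        "median": statistics.median(samples),
        "worst": max(samples),
    }

def improved(baseline, after, min_gain=0.10):
    """Return True if the median time dropped by at least `min_gain`."""
    return after["median"] <= baseline["median"] * (1 - min_gain)
```

Taking several samples and comparing medians matters most for intermittent problems, where a single lucky run can make it look as though a change worked.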
4. Document Your Troubleshooting Efforts
You've heard that those who don't remember the past are condemned to repeat
its mistakes. The only thing more embarrassing than spending hours troubleshooting
a simple issue is doing it more than once! OK, perhaps retracing your
steps (running around in circles) is just as bad. To avoid such problems,
be sure to document the steps you've taken to troubleshoot an issue, and
keep that record where you can find it later.
5. Make One Change at a Time
Scientific practices dictate that you should minimize the number of affected
variables when you're trying to measure the effects of a change. Imagine
this: You make several changes to a system at once in an attempt to affect
overall performance. You're happy to find that overall performance has
improved by 25 percent. However, unknown to you, some of the changes improved
performance while others decreased it (see figure). You've improved performance,
overall, but clearly you could have done a better job. The ideal solution
is to make each change independently and then test each one. It's definitely
more time-consuming up-front, but in the end, the results can be considerably
more valuable.
Figure: A simple troubleshooting effort that introduces multiple
changes at one time. Although the overall effect is a performance
boost, some of the changes actually reduced performance.
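The idea of testing each change on its own can be sketched in a few lines of Python. Everything here is illustrative: `evaluate_changes` is a hypothetical helper, the configuration keys are made up, and `toy_throughput` is a stand-in scoring function with invented numbers, not a real benchmark.

```python
def evaluate_changes(baseline, changes, measure):
    """Apply each change alone and report its effect against the baseline.

    `changes` maps a name to a function that returns a modified copy of the
    configuration. `measure` returns a score where higher is better.
    """
    base_score = measure(baseline)
    effects = {}
    for name, apply_change in changes.items():
        candidate = apply_change(dict(baseline))  # fresh copy: changes never stack
        effects[name] = measure(candidate) - base_score
    return effects

# Toy stand-in for a real benchmark (numbers invented for illustration).
def toy_throughput(config):
    penalty = 50 if config["logging"] == "verbose" else 0
    return config["cache_mb"] * 1.0 - penalty

baseline = {"cache_mb": 64, "logging": "normal"}
changes = {
    "bigger_cache": lambda c: {**c, "cache_mb": 128},
    "verbose_logging": lambda c: {**c, "logging": "verbose"},
}

effects = evaluate_changes(baseline, changes, toy_throughput)
# effects == {"bigger_cache": 64.0, "verbose_logging": -50.0}
```

In this toy case, applying both changes together would still show a net gain, which is exactly how a harmful change can hide behind a helpful one.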
6. Prioritize Your Issues
Good troubleshooters are often just waiting for the next challenge to
stump them. In an ideal world, we'd have to focus on only one problem
at a time, but few of us actually have this luxury. More often, you face
multiple problems, all of which are important. In such cases, your first
step should be to prioritize them. It's difficult to work on all of the
issues efficiently at once, and task-switching can take significant resources.
Apart from the frustration and stress, you'll have a hard time focusing
on any one of the problems. Instead, make a "hit list" of items based on
their importance, and start knocking them out one by one. The small victories
along the way will also help you know that you're making progress!
7. Prioritize Potential Solutions
If you're trying to find your way out of the woods, you're likely to have
many different paths to take. Sometimes you'll have several hunches about
what will solve a problem, but you can't try them all at once (remember
the previous rules). Potential solutions differ in many ways, but the
most important factors to consider are the likelihood that a solution
will solve the problem and the amount of effort required to implement
it. In some situations, you may choose to start with the simpler,
easier solutions, even if they're less likely to solve the problem.
Or you may choose to bite the bullet and go for the most likely solution,
regardless of the amount of effort required. Having multiple teams tackle
the possibilities can also be helpful. All of this is doable only if you
first organize and prioritize your potential solutions.
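If you jot down a rough likelihood and effort estimate for each hunch, the two orderings described above are easy to produce. The sketch below is only one way to do it: the `Candidate` class and the function names are hypothetical, and the third ordering, likelihood per hour of effort, is my own assumption about a reasonable middle ground rather than something prescribed here.

```python
from dataclasses import dataclass

@dataclass
class Candidate:
    name: str
    likelihood: float    # estimated chance of fixing the problem, 0.0 to 1.0
    effort_hours: float  # estimated work needed to try it (must be > 0)

def quick_wins_first(candidates):
    """Simplest, cheapest attempts first, regardless of likelihood."""
    return sorted(candidates, key=lambda c: c.effort_hours)

def most_likely_first(candidates):
    """Bite the bullet: most probable fix first, regardless of effort."""
    return sorted(candidates, key=lambda c: c.likelihood, reverse=True)

def by_payoff(candidates):
    """Assumed compromise: highest likelihood per hour of effort first."""
    return sorted(
        candidates,
        key=lambda c: c.likelihood / c.effort_hours,
        reverse=True,
    )
```

The estimates will be guesses, of course, but writing them down forces you to compare your hunches instead of chasing whichever one came to mind first.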
8. Make a Business Case for Troubleshooting Each Issue
Most of us are used to justifying costs related to longer-term projects,
but the same principles should apply to troubleshooting. Here's one that
most of us techies probably don't like: Sometimes the best "solution"
might be to throw in the towel, give up, and live with a problem. I must
admit that it was frustrating, but such was the case for an irritating
intermittent issue I faced a while back. The problem seemed to be related
to memory leaks in a third-party application that eventually led to a
corruption of the network stack on critical servers. After considerable
unsuccessful troubleshooting, we determined that it wasn't worth the additional
effort to continue trying to solve the problem (the problem was rare,
and we had much higher priorities). It was difficult to peel the team
away from the issue, but we had been neglecting other priorities while
we tried to hunt down our Moby Dick. It turned out that the costs related
to solving this issue couldn't justify the potential benefit. The best
"solution" was none at all. This may not be the case often, but you should
always keep track of the amount of effort you're exerting to solve a problem
in order to keep costs under control.
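Keeping track of effort doesn't require anything elaborate. As a rough sketch, assuming you agree on an hour budget for an issue up front, something like the hypothetical `EffortLog` class below would both record what was tried and flag when it's time to revisit the business case.

```python
class EffortLog:
    """Running record of time spent on one issue against an agreed budget."""

    def __init__(self, budget_hours):
        self.budget_hours = budget_hours
        self.entries = []  # list of (hours, note) tuples

    def record(self, hours, note):
        """Add one troubleshooting session and a note on what was tried."""
        self.entries.append((hours, note))

    @property
    def spent(self):
        """Total hours recorded so far."""
        return sum(hours for hours, _ in self.entries)

    def over_budget(self):
        """True once the time spent exceeds the agreed budget."""
        return self.spent > self.budget_hours
```

Going over budget doesn't have to mean giving up. It's simply the point at which someone should ask whether the remaining effort is still worth the potential benefit.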
9. Follow The Rules When You Can, But Make Exceptions When You Must
I'll concede that there are certainly circumstances in which all of my
guidelines might not apply. For example, suppose you find that a critical
production server is misbehaving, and no backup system exists. The goal
is to get the machine up and running as quickly as possible, at any cost.
In this case, you might take some risks. For example, you may make several
changes at the same time in the hopes that the changes are independent
and that one will solve the problem. Or you might be forced to investigate
an urgent issue before you have the complete details. With that said,
however, remember that ignoring the rules is for exceptions, and you should
do it only in a pinch!
I trust that you'll find these nine techniques useful the next time you're
trying to tackle a sticky issue. If you use an organized process that employs
best practices, even the most annoying, frustrating and complex issues can be
reduced to a simple, effective process.
Good troubleshooting!