In-Depth
Duel to the Death
This health system had its migration strategy all mapped out. Then reality intervened and forced a major change of course.
- By Alan Knowles
- October 01, 2001
We have about 25,000 employees, served through
a multi-region WAN with a large number of NT 4.0
domains and NT 4.0 servers for hundreds of different
applications used in our health care system. Each
region has a master account domain and many resource
domains, as shown in Figure 1.
Some of the resource domains maintain trusts
to several of the regional master account domains
(MADs). The regions share a message domain that
contains Exchange 5.5 servers from all regions.
Unfortunately, there are California users in both
the HSOR and the HSCA domains. One of the advantages
of moving to AD will be the MoveTree utility,
which will allow us eventually to move these California
users to the CA domain after it's built.
We're planning this migration for two reasons:
- Outlook Web Access 2000. Many of our top executives
have been to briefings and summits where they've
seen both OWA 2000 and Exchange 2000 demonstrated
at great length. They want these capabilities—especially
the improvements in OWA and benefits for remote
workers.
- We have a huge user base of Windows 95 workstations
that need upgrading. We never switched to Windows
98 or NT Workstation, mainly due to budget cycles.
But we now have the funding to do massive upgrades.
Because we'll be upgrading these machines to
Win2K Pro, we'd like to take advantage of Active
Directory, especially policies and software
installation packages.
Figure 1. The company's old network included a gaggle of domains, with a multitude of trust relationships to maintain. User domains are on top, with resource domains below.
Active Directory Design
Our first Win2K implementation technical meetings
were our AD architecture design meetings. Alongside
these AD design discussions came the DNS discussions,
which tackled questions such as: Should we use
the same namespace on the inside as the outside?
Should we use a "one-off" namespace
or a completely separate one? Which region has
control of the DNS namespace? It's amazing that
the simple design shown in Figure 2 took two full
days to be agreed upon. Our company has strong
regionalized presences in several states with
separate IT departments and NT 4.0 MADs. We used
an outside consultant, who helped us cut through
the political problems and kept the discussion
focused on the technical reasons for eliminating
some designs in favor of the final one we agreed
upon.
Figure 2. Each peer domain in the new infrastructure is a separate tree in the single forest. AD is the "empty" root domain created first.
The design has these advantages:
- Each region keeps separate domain security
policies and administration.
- It uses the empty place-holder domain as
the apolitical domain root.
- We're able to leverage the centrally applied
global catalog, schema changes and Exchange
2000.
- Regions share the DNS namespace as peers, with
subzones delegated to each region's administration.
- Each region can make individual decisions
on issues such as whether to implement an in-place
domain upgrade or a migration to a pristine
domain and how to structure additional forest
elements below the first level.
Each of the domain names corresponds to the state
abbreviations except for the AD name, which is
arbitrary. Unfortunately, AD wasn't our first
choice for the abbreviation. We fell victim to
the well-known warning that an insufficiently
planned forest has to be rebuilt. The
original choice was the slightly more agnostic
DS. We had to rebuild the root domain after we
discovered that ds..org was already
in widespread use by another region as a Web URL.
Lesson learned: Make sure to have all the right
people included in the design meetings and thoroughly
check out the existing namespace prior to building
domains.
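The namespace check recommended above can be partly automated. The following is a minimal sketch, not the procedure we used: it asks the local resolver whether each candidate root-domain name already resolves. The helper names and the example.org zone are hypothetical. A name that fails to resolve from one site is not proof that it is unused elsewhere, so run the check from every region and still ask the people who own the zones.

```python
import socket
from typing import Callable, Dict, Iterable


def resolves(fqdn: str) -> bool:
    """Return True if the name currently resolves through the local resolver."""
    try:
        socket.getaddrinfo(fqdn, None)
        return True
    except socket.gaierror:
        return False


def find_collisions(
    labels: Iterable[str],
    parent_zone: str,
    lookup: Callable[[str], bool] = resolves,
) -> Dict[str, bool]:
    """Map each candidate FQDN to True if it is already in use.

    `lookup` is injectable so the check can be pointed at a different
    resolver per region, or replaced with a stub for testing.
    """
    result = {}
    for label in labels:
        fqdn = f"{label}.{parent_zone}".lower()
        result[fqdn] = lookup(fqdn)
    return result
```

Calling `find_collisions(["ds", "ad"], "example.org")` returns a dictionary keyed by the full names, with True for any candidate that already has a record visible from where the script runs.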
March 26, 2001: Day 1 of
the AD Pilot
Our main tasks: Get a new DC for the HSOR domain
physically set up and start the OS install from
the new server build document; and use the opportunity
to double-check the new server build document.
We met briefly after lunch with the core group
to develop a last-minute plan regarding the migration/administration
tool selection. Some vendors' quotes need reworking.
One came in at $1.9 million for the complete tool
suite! We can't present this to our management
and risk the humiliating laughter.
We agreed to proceed using a NetIQ migration
tool, Domain Migration Administrator, for
the first part of the pilot (since we have the
option to use this tool while working with Microsoft
Consulting Services (MCS)), but we'll keep evaluating
the migration and administration tools in our
lab.
Broken Tools
We tried migration suites from NetIQ, Aelita and
BindView. None worked adequately. The most problematic
area for all the tools we tested was machine
and profile migration. A high percentage
of test users weren't able to use the migrated
profiles, causing Windows to automatically create
a new profile during the login process.
During our early migration attempts, the NetIQ
machine migration wasn't consistent. To avoid errors,
we could migrate only one machine at a time.
Testing also indicated that migrating
machines and machine profiles using NetIQ required
an account in the Local Administrators
group of the workstation that also has the right
to add machines to the new domain.
We encountered an additional problem with NetIQ:
the accounts in both the source and target domains
were disabled after migration. We had
assumed it would disable only the source account.
| Impact (1-10) | Risk (%) | Risk | Mitigation Plan |
| 10 | <> | Upgrade process corrupts the SAM database. | Create offline backup/dress rehearsal. |
| 0 | 5 | SAM already corrupt. | Corrupt data wouldn't be accessible. Not different than now. |
| 7 | 75 | Incompatibility of software currently on DCs. | Inventory. Move software off DC to another machine. |
| 8 | 5 | Resources needed. | May need to send personnel to CA for two DCs there. |
| 8 | 5 | SAM becoming too large. | Upgrade all DCs; problem eliminated. |
| 7 | 90 | Unknown carryover of "garbage" accounts (users, machines, security groups that should be cleaned up). | Clean up now. |
| 5 | 10 | Current DCs' hardware non-Win2K compliant. | Inventory/update. |
| 9 | 10 | Upgrade impacts mission-critical processes. | Test/research/dress rehearsal. |
| 2 | 5 | Clients have hardcoded IP addresses to DCs. | Research; may just have to accept this risk and react when (or if) things break. |
| 10 | 90 | Political pushback. | Prepare and document sound reasons for taking this course; show how advantages outweigh risks. |
| 10 | 100 | Login scripts and replication won't work. | Eliminate by configuring file replication service. |
| 1 | 25 | Collapsing resource domains later. | Need to buy management tools. |
| | | Changing original naming conventions. | Need to communicate and get agreement from other regions. |
| | | Training for admins on new tools (MMC) for account management. | Introduce tools ASAP. |
Table 1. Before switching gears to the in-place upgrade, the IT department prepared this risk assessment table showing the possible impact on the environment.
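One way to read Table 1 is to combine the two numeric columns into a single ranking. The sketch below multiplies impact by probability to get an "exposure" score; that scoring rule is an assumption made here for illustration, not something the table itself prescribes. The values are copied from Table 1, rows without a usable probability are skipped, and the risk names are shortened.

```python
from typing import List, NamedTuple, Optional


class Risk(NamedTuple):
    impact: int                 # 1-10 scale, from Table 1
    probability: Optional[int]  # percent; None where the table has no usable value
    name: str


RISKS = [
    Risk(10, None, "Upgrade process corrupts the SAM database"),
    Risk(0, 5, "SAM already corrupt"),
    Risk(7, 75, "Incompatibility of software currently on DCs"),
    Risk(8, 5, "Resources needed"),
    Risk(8, 5, "SAM becoming too large"),
    Risk(7, 90, "Unknown carryover of garbage accounts"),
    Risk(5, 10, "Current DCs' hardware non-Win2K compliant"),
    Risk(9, 10, "Upgrade impacts mission-critical processes"),
    Risk(2, 5, "Clients have hardcoded IP addresses to DCs"),
    Risk(10, 90, "Political pushback"),
    Risk(10, 100, "Login scripts and replication won't work"),
    Risk(1, 25, "Collapsing resource domains later"),
]


def exposure(risk: Risk) -> float:
    """Assumed scoring rule: impact weighted by probability (as a fraction)."""
    return risk.impact * risk.probability / 100


def ranked(risks: List[Risk]) -> List[Risk]:
    """Return the scorable risks, highest exposure first."""
    scored = [r for r in risks if r.probability is not None]
    return sorted(scored, key=exposure, reverse=True)


if __name__ == "__main__":
    for r in ranked(RISKS):
        print(f"{exposure(r):5.2f}  {r.name}")
```

Under this scoring, the login script and replication issue and political pushback rise to the top, while several high-impact but unlikely items drop well down the list.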
Aelita and BindView also had their share of problems.
The support person from Aelita was reluctant to
provide any help even though we had a "crippled"
version that we were evaluating on a limited pilot
group. We received an e-mail from Aelita suggesting
that we purchase the product for support.
The migration using the tools turned out to be
much, much more time-consuming and problematic
than we ever imagined. We were spending all our
time just tracking down small individual problems
related to inconsistent results with the migration
tools. At times it would be a user with an incomplete
SID history; the next time it would be a machine
that failed to migrate or a profile that didn't
get re-permissioned properly. We were also constantly
talking with the vendors and updating or patching
the software. Our team's resources were consumed
with these activities.
April 16: The Big Decision
After about six weeks of this situation, we made
the decision to change tracks, kill the migration
plan, and do an in-place upgrade instead. Our
reasons for the about-face included these:
- Fewer resources needed, both in labor and
money. Migrating groups of people over time
is a longer and more labor-intensive operation.
After an in-place upgrade, all accounts can
log into an AD domain in one fell swoop. Less
coordination and project time are needed for
this process. It also means we won't have to
purchase migration tools, which would cost in
excess of $175,000.
- Less impact on users: SID history doesn't
need to be migrated to a new domain. We would
eliminate the problems we've been facing in
doing just that. Domain rights and roles wouldn't
change, and the domain name would remain the
same. Users would still log into HSOR as the
domain name.
- Accelerates entry into AD. Our entire domain
could be in AD in six weeks as opposed to six
months. We can more accurately define delivery
dates and focus more energy on security, policies
and other items that need our attention within
AD.
Despite all the time spent on the migration,
the switch to the in-place upgrade method was
one of the better moves our management has ever
approved. We've eliminated all the migration-related
problems, accelerated our Win2K conversion and
reduced the costs and resources tremendously.
Initially, we believed the biggest risks were
with the in-place upgrade. Since then, though,
we've found that migration has much more risk.
The in-place upgrade allows users to continue
to authenticate using BDCs even while we upgrade;
if things don't go smoothly, we can always roll
back using a BDC that we've taken offline. On
the other hand, migration risks a great deal of
instability and unknowns because of the inconsistent
results of using migration tools on a large user
database with an inordinate number of groups.
Large SID histories would have to be maintained
and we would encounter what's sometimes called
"token bloat."
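To make the "token bloat" concern concrete, here is a rough sketch built on the rule-of-thumb estimate Microsoft has published for access token size: 1200 + 40d + 8s bytes, where d counts domain local groups, universal groups from outside the account domain and SID history entries, and s counts global groups plus universal groups inside the account domain. The formula does not come from this article, the function name is hypothetical, and the group counts in the usage note are invented for illustration.

```python
def estimate_token_bytes(
    global_groups: int,
    domain_local_groups: int,
    universal_in_domain: int = 0,
    universal_outside: int = 0,
    sid_history_entries: int = 0,
) -> int:
    """Estimate access token size using the 1200 + 40d + 8s rule of thumb.

    This is an approximation for planning purposes only; real token
    sizes depend on OS version, delegation settings and other factors.
    """
    # d: memberships that cost roughly 40 bytes each
    d = domain_local_groups + universal_outside + sid_history_entries
    # s: memberships that cost roughly 8 bytes each
    s = global_groups + universal_in_domain
    return 1200 + 40 * d + 8 * s
```

For a hypothetical user in 50 global and 20 domain local groups, `estimate_token_bytes(50, 20)` gives 2,400 bytes. If a migration carried over 100 SID history entries for the same user, the estimate rises to 6,400 bytes, which shows how quickly SID history can push a token toward whatever maximum size your OS and service pack level enforce.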
From talking to MCS, I've learned that it initially
saw most customers considering migration
because of the unknowns of the AD upgrade. Now it's
seeing around 70 percent of its customers doing
the in-place upgrade. I believe this proportion
will grow even larger in favor of in-place upgrades.
A migration should only be attempted if there
are absolutely overwhelming reasons for doing
so. Our original reason for wanting to do a migration
was that it's always better to build the "new
house" and then move out of the "old
house," if you can afford it.
Lessons Learned
We learned so many painful
lessons. We came upon many technical
snags and unadvertised features
that come with any complex product
and large-impact project. TechNet
will become your close ally.
It contains a great deal of
useful and timely information
online. Many of your technical
problems will differ, but I
believe these issues have universal
application:
- Before we started, I read
in many places recommendations
that said the majority of
time should be devoted to
planning, to minimize problems.
We did a great deal of planning
up front, but no matter how
much you plan, some things
won't become obvious until
you've tried some of the wrong
things first. Every environment
is layered with complexities
and a unique blend of technologies.
No amount of mental preparation
is going to protect you from
all the snags and pitfalls,
so use these as learning opportunities.
For example, if you've just
upgraded your DHCP server
to Win2K and it's no longer
working, take a few minutes
to learn why before resorting
to the back-out plan, even
though your pager's going
off.
- Keep moving the project
forward. Make dates and deadlines
even if they're arbitrary.
When you get hung up working
on a side-issue you've uncovered,
identify it as a separate
project or job and move on.
Keep the main priority and
critical path in mind. The
scope of this work will turn
over many rocks and reveal
some ugly things. There may
be delays, but there's always
some important step that can
be broken into smaller tasks,
leaving some parts that can
be worked on without delay.
- From the start, and many
times during the project,
you need to make sure management
realizes the resources that
need to be dedicated to this
project. Much of this work
is behind the curtain; depending
on your management, it may
think you're not doing much
more than running Setup.exe.
Sending regular status updates
that emphasize the need for
dedicated personnel is paramount.
- To short-circuit the painful,
steep learning curve, get
access to help from someone
who has experience with Win2K
upgrades. What worked well
for us was to have a consultant/advisor
who came in one day a week
to help address major concerns.
This person kept in contact
via e-mail and attended our
weekly AD project meeting.
- DNS. Always consider DNS
first for problems. Double-check
the servers you're pointing
to for DNS and WINS, and check
to see that the records are
being registered correctly.
Know the new IPCONFIG switches.
- Don't let your group get
entangled in the creation
of an OU structure. This is
a never-ending labyrinth with
more political than technical
traps. Start simple and justify
any additions. Group policies
don't have to be applied using
OUs; they can be centrally
applied to the domain using
different security groups.
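For the DNS lesson above, a small helper can list what to check. This sketch builds the names of the main SRV records that Win2K domain controllers register for the DC locator, along with the nslookup command to query each one, and notes the ipconfig switches that are new in Win2K. The function names and the example domain names are hypothetical, and the record list is a commonly checked subset rather than everything Netlogon registers.

```python
from typing import Dict, List


def dc_locator_records(domain: str, forest_root: str) -> List[str]:
    """Return SRV record names worth verifying for an AD domain."""
    domain = domain.rstrip(".").lower()
    forest_root = forest_root.rstrip(".").lower()
    return [
        f"_ldap._tcp.{domain}",             # any LDAP server in the domain
        f"_ldap._tcp.dc._msdcs.{domain}",   # domain controllers
        f"_ldap._tcp.pdc._msdcs.{domain}",  # PDC emulator
        f"_kerberos._tcp.{domain}",         # Kerberos KDCs
        f"_ldap._tcp.gc._msdcs.{forest_root}",  # global catalog servers
    ]


def nslookup_commands(records: List[str]) -> List[str]:
    """Build one nslookup SRV query per record name."""
    return [f"nslookup -type=SRV {name}" for name in records]


# ipconfig switches introduced with Win2K that help with DNS troubleshooting
IPCONFIG_SWITCHES: Dict[str, str] = {
    "/flushdns": "purge the local DNS resolver cache",
    "/registerdns": "refresh DHCP leases and re-register DNS names",
    "/displaydns": "show the contents of the DNS resolver cache",
}


if __name__ == "__main__":
    # Hypothetical names; substitute your own domain and forest root.
    for cmd in nslookup_commands(
        dc_locator_records("or.example.org", "ad.example.org")
    ):
        print(cmd)
```

Running the printed commands against each DNS server that your clients point to is a quick way to confirm that the records are being registered where you expect.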
July 2001: Where We Are
Now
The upgrade is a continuing project. We currently
have a root place-holder domain in native mode;
we're in mixed mode with our Oregon primary MAD.
We're rolling out Win2K Pro very quickly now using
our standard image and putting these workstations
directly into our new tree. We should have the
Oregon MAD switched to native mode by the end
of August.
It will take more than a year to collapse the
NT 4.0 resource domains. Some of the collapsing
of resource domains will be done quickly after
we switch to native mode, but depending on the
application, some servers will remain in NT 4.0
resource domains until the vendor updates the
application or we move to a replacement system.
Other regions will be adding new domains to our
forest very soon. The adventure continues...
About the Author
Alan Knowles, MCSE, CNE, is a server engineer with a large health care
organization, which has branches in the Pacific Northwest and Western
U.S.