The Honeymoon's Over
This would bring anyone back to reality quickly.
- By Eric Beeks
- October 01, 2003
We were planning to do a major rollout of a new warehouse management system the weekend of Dec. 7. The only catch: I was getting married on Dec. 1, so the rest of the crew would have to start without me. I convinced everyone it would be no problem: I’d stage all the PCs before I left, and the servers were ready to go. The developers would just need to load the data. I figured that should keep everybody busy until I got back the morning of Dec. 8.
My bride and I landed from our Lake Tahoe honeymoon at 6:30 p.m. on Dec. 7. As we pulled into our driveway, I noticed a note taped to the front door. It said, “The network is down. Call as soon as you get in!” Inside the house, my answering machine had two urgent messages from co-workers.
I called in and was told that none of the warehouse PCs could log onto the network, and the servers weren’t accessible. As I drove across town, I kept thinking it couldn’t be the computers themselves; they were used for training the warehouse personnel the week before. I’d checked the network connections before I left. Once in the office, I found out that the rollout team had called a local consulting company to try to bail them out. The consultants had suggested reinstalling Windows 2000 on the domain controller (DC). My co-workers’ response: “No way! He’ll kill us if we let you do that!”
I started by checking the physical connections of a select few computers. They checked out OK. I went back to my desk to use my admin tools and discovered that my computer couldn’t log onto the network, either. I began to think the problem was related to the DCs and not individual workstations or member servers. I also understood why the consultants wanted to reinstall Windows.
I logged onto my workstation using a local account and pinged the PDC emulator, which replied. I checked the emulator’s event log and saw that it had pages upon pages of errors related to the File Replication Service (FRS). The errors essentially said that NTFRS was preventing the computer from becoming a DC. I was now on the right track.
I went back to my computer and pointed my browser to support.microsoft.com for fixes. I eventually uncovered a page explaining the problem. By default, FRS waits for the Sysvol share to become available before allowing the server to be promoted to a DC. There’s a registry entry that, when changed from zero to one, allows FRS to ignore the fact that the Sysvol share doesn’t exist and to allow logons to process.
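For readers who want to see what a change like that looks like in practice, here is a minimal sketch in Python. The article doesn't name the registry entry, so the key path and value name below are assumptions: they are the ones commonly documented for this symptom (the SysvolReady flag under the Netlogon parameters key). The helper names are hypothetical, and the write only works on a Windows DC with administrator rights.

```python
import sys

# Assumption: the article does not name the registry entry. The value commonly
# documented for this symptom is SysvolReady under the Netlogon parameters key.
KEY_PATH = r"SYSTEM\CurrentControlSet\Services\Netlogon\Parameters"
VALUE_NAME = "SysvolReady"


def needs_change(current):
    """Return True when the flag is missing or not yet 1."""
    return current != 1


def read_flag():
    """Read the flag; returns None if it is absent or this is not Windows."""
    if sys.platform != "win32":
        return None
    import winreg  # Windows-only standard library module
    try:
        with winreg.OpenKey(winreg.HKEY_LOCAL_MACHINE, KEY_PATH) as key:
            value, _value_type = winreg.QueryValueEx(key, VALUE_NAME)
            return value
    except FileNotFoundError:
        return None


def set_flag_ready():
    """Change the flag from zero to one (needs administrator rights)."""
    import winreg
    with winreg.OpenKey(
        winreg.HKEY_LOCAL_MACHINE, KEY_PATH, 0, winreg.KEY_SET_VALUE
    ) as key:
        winreg.SetValueEx(key, VALUE_NAME, 0, winreg.REG_DWORD, 1)


if __name__ == "__main__":
    current = read_flag()
    print(f"{VALUE_NAME} is currently {current!r}")
    if sys.platform == "win32" and needs_change(current):
        set_flag_ready()
        print("Flag set to 1; restart the affected services to apply it.")
```

Flipping the flag only tells the server to carry on as if the share were ready. It does nothing to repair whatever stopped replication in the first place.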
After I made the registry change and restarted FRS, it began to work. I went out to the warehouse and tested a few computers. They all started working. I checked connectivity and rebooted a few of them just to make sure they could see the other computers on the network and could still log on.
I felt confident enough to call the project manager and get everyone back in the office to get the new system installed, a job that was a day behind. The database team went to work getting the database and interfaces installed. I checked the other servers to see if they had errors similar to the PDC emulator. Normally, I wouldn’t have let anyone on the system until I knew what had happened and had completely resolved the problem, but we were in a time crunch to get the system installed.
I noticed that event logs on my second DC had errors saying it couldn’t replicate with the PDC emulator. That meant all I’d done was force the server to become a DC, and the problem wasn’t truly fixed. I looked for some support documents on the Web and couldn’t find anything similar. I tried everything I could think of—even building another DC—but the problem replicated to the new server.
I considered giving in and restoring from tape. That’s when I found out my cohorts had forgotten to change the backup tapes all week long. The only backup I had available was from the night before, and it was a backup of the broken system. So I decided to shut down all nonessential systems and demote all my DCs except one.
My next attempt was to fix one DC and replicate the good information to all the servers. While this might have been a goofy idea, it was the best I had. After the event logs were clear of errors, I shut them down. Then I discovered that, when I forced the DC online, it moved every file and folder from the Sysvol folder. I moved them back and voila! All my scripts and group policies started working again. I restarted FRS to determine if it started without errors. Nope! Everything got moved right out of Sysvol again.
I decided to remove all group policies. I only had a few and figured they’d be easy to recreate. Still nothing. Hoping that the contents would be recreated when I started the service, I deleted everything in Sysvol and tried again. Still nothing. I was running out of time, so I picked up the phone to call Microsoft tech support.
Tech support ran “health scripts,” which are used to determine the state of the file server. The scripts provided the information needed to determine the problem. They figured out that the contents of the Sysvol folder were
While I was on my honeymoon, FRS had stopped replicating the Sysvol folder to the other DCs (we never found out what caused it to stop replication). When this happened, FRS refused to allow the server to be a DC. Thus, the DCs stopped authenticating logons and allowing access to network resources. The programmers panicked and rebooted the servers, hoping that rebooting would fix the problem. It didn’t. In fact, it made the problem worse, since we could no longer log onto the DCs. Because the DCs had no local accounts, we couldn’t log on at the console either. Nobody could log onto the network; not even the Domain Administrator account would work. However, all local accounts on the workstations were still working. The only computers that still had access to network resources were the ones that hadn’t been shut off.
The support folks at Microsoft had me manually recreate the contents of the Sysvol folder structure and try restarting the NTFRS service and, subsequently, the file server. I checked the event log. No errors. I rebooted the server just to be sure. After several minutes, I turned it off and booted the DNS and the DC servers together. Checking the DC’s event log revealed no errors. I logged onto the DNS server and it started cleanly. Feeling good about the progress we’d made, I went to the warehouse to see how things were going. This also let the system replicate a few times before performing another event log check.
I promoted the DCs, took a deep breath and checked the event logs. No errors. I went out to the warehouse to see how things were going and talked with the project manager. He asked how the honeymoon was. I just laughed and went home.