Bow down before troubleshooting's greatest. These Compaq pros dispense their Windows 2000 wisdom to make you an expert on network repair.
        
        Windows 2000: Troubleshooting Shock Troops
        
        
			- By Gary Olsen et al.
- August 01, 2001
Nobody knows troubleshooting like Compaq. 
        The company's Global Services operation has 15,000 consultants 
        building and maintaining Microsoft-based enterprise solutions 
        globally. More than 3,200 of them are Windows 2000-certified. 
        They're so good at what they do, they supported Microsoft's 
        beta customers for the OS during those companies' deployments. 
      
      These guys have seen it all in the course 
        of their work, from bone-headed migration moves (read 
        on for details!) to brilliant and elusive technical mysteries. 
        They share what they know with each other, around 
        the world. When a problem arises, chances are, somebody 
        else in the organization has experienced the same dilemma and 
        has derived a solution. 
      And that's why MCP Magazine asked 
        a core group of them to share their best troubleshooting 
        secrets. What they proposed was massive, almost too 
        comprehensive for a single magazine article. So we let 
        them pick out and identify a number of problems, and 
        their solutions, to share with you. These final choices 
        are dilemmas experienced by a large number of people; 
        they're serious enough to warn you about beforehand; or 
        they help resolve a variety of related issues, such as 
        replication problems. We've divided the troubleshooting 
        evils into four categories: setup and installation, Active 
        Directory (AD), networking and clustering. Read and learn. 
      
      Setup and Installation
      Problem: I've 
        implemented about 50 Remote Installation Services (RIS) 
        servers throughout my organization, but we only have one 
        image. Several of these servers are experiencing problems 
        with insufficient disk space. There's a Single Instance 
        Store (SIS) Common Store directory that has copies of 
        all the files in the image, which seems to be used for 
        multiple images so they can share these files. If I have 
        only one image, can I delete the SIS Common Store and 
        recover the disk space?
      Solution: No. 
        Deletion of the SIS Common Store directory will prevent 
        RIS image files and any other application with files that 
        have been converted to reparse points from accessing the 
        backing file containing the data. In short, it'll break 
        RIS and, possibly, other applications installed on that 
        partition.
      The function of the SIS Common Store, included 
        when RIS is installed, is to conserve disk space by eliminating 
        duplicate files on an NTFS volume. The two SIS components 
        that RIS installs are the SIS filter driver and the SIS Groveler. 
        SIS Groveler scans for files that are identical to one 
        or more files on the NTFS volume using signatures and 
        byte-by-byte comparison. It then reports the file to the 
        SIS filter driver that creates the SIS link (NTFS reparse 
        points), copies the file to the SIS Common Store Folder, 
        and renames it with an arbitrary 128-bit globally unique 
        identifier (GUID) with a .SIS extension. The original 
        files are changed to reparse points with a "size 
        on disk" equal to the default cluster size of the 
        disk in most cases. Only files larger than 32KB are processed 
        by SIS Groveler. Therefore, we can now have many instances 
        of a file represented by reparse points link to the actual 
        data for that file stored in the SIS Common Store Folder. 
        The file in the SIS Common Store Folder is also called 
        the "backing file" and contains the data. Figure 
        1 is an example of how ntoskrnl.exe is copied to SIS Common 
        Store and renamed. The lower box shows the location and 
        contents of the SIS Common Store folder, located at the 
        root level of the drive. It contains the actual files 
        where the reparse points are directed.
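
      The matching step described above (a signature check followed by a byte-by-byte comparison, applied only to files larger than 32KB) can be illustrated with a short Python sketch. This is not the SIS Groveler's actual implementation: the SHA-256 signature, the function names and the directory walk are assumptions made for illustration, and the sketch only reports duplicate groups. It creates no reparse points and touches no Common Store.

```python
import filecmp
import hashlib
from collections import defaultdict
from pathlib import Path

# The article states that only files larger than 32KB are processed.
MIN_SIZE = 32 * 1024


def file_signature(path: Path) -> str:
    """Cheap first-pass signature (SHA-256 is an assumption, not what SIS uses)."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(65536), b""):
            digest.update(chunk)
    return digest.hexdigest()


def find_duplicate_groups(root: Path) -> list[list[Path]]:
    """Group files under root that are identical in content."""
    by_signature = defaultdict(list)
    for path in sorted(root.rglob("*")):
        if path.is_file() and path.stat().st_size > MIN_SIZE:
            by_signature[file_signature(path)].append(path)

    groups = []
    for candidates in by_signature.values():
        first = candidates[0]
        # Confirm each signature match with a byte-by-byte comparison.
        group = [first] + [
            other for other in candidates[1:]
            if filecmp.cmp(first, other, shallow=False)
        ]
        if len(group) > 1:
            groups.append(group)
    return groups
```

      Each group returned corresponds to one would-be backing file with several links pointing at it.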
      
         
      Figure 1. How ntoskrnl.exe is copied to SIS Common Store and renamed. The lower 
        box shows the location and contents of the SIS Common 
        Store folder containing the files where the reparse 
        points are directed.
      
      One caveat: The backup/restore software must 
        be SIS link-aware. Ntbackup is SIS link-aware and will 
        call SISbkup.dll to back up and restore properly. Third-party 
        backup solutions have to know how to call SISbkup.dll 
        to work properly.
      For further information on RIS, see the Windows 
        2000 Server Resource Kit, "Distributed Systems 
        Guide," Chapter 24.
      Problem: I just 
        added a new video card and now my system won't boot. How 
        can I recover without reinstalling Win2K?
      Solution: In 
        Windows NT 4.0, there were several answers to this:
      
        -  Boot to Last Known Good (which sometimes 
          works).
-  Use the Emergency Repair Disk (which 
          no one ever has available or updated).
-  Create a "parallel install." 
          Create a new installation of NT on another partition 
          on the disk, boot to that OS and go to the broken configuration 
          and remove the driver. 
      
Fortunately, Win2K gives us some tools to 
        repair this problem without a parallel install.
      If Last Known Good doesn't work and there's 
        no system state backup, this can be corrected with either 
        Safe Mode Boot or the Recovery Console.
      Safe Mode Boot is much like Safe Mode Boot 
        in Windows 95 or Windows 98. You can start in Safe Mode 
        by pressing F8 at the boot loader screen, then selecting 
        Safe Mode. This will enable you to boot the system with 
        a minimum set of drivers and services, which allows you 
        to perform tasks such as disabling a driver or service, 
        including the one causing the problem. Options for Safe 
        Mode are basic Safe Mode, which starts the system with 
        basic drivers; Safe Mode with Networking, which is similar 
        to Safe Mode but includes networking services for connectivity; 
        and Safe Mode with Command Prompt, which doesn't start 
        the GUI. It only starts the command mode.
      The Recovery Console is a new tool that gives 
        you a command-line tool for repairing a system that won't 
        start. You have three options for invoking the Recovery 
        Console: booting from the Win2K CD; booting from the startup 
        floppies; or selecting the Recovery Console from the boot 
        loader screen (assuming it's been installed). Here are 
        the console options: 
      
        -  Copy: Copies files to another 
          location or name.
-  Del: Deletes files.
-  Disable: Disables services 
          or drivers.
-  Fixboot: Writes a new boot 
          sector.
-  Fixmbr: Repairs Master Boot 
          Record, much like FDISK /MBR in DOS.
      
The Recovery Console also can be customized. 
        For example, you can install it as part of a large deployment 
        by using winnt32.exe /cmdcons /unattend.
      
      
Active Directory
      Problem: I've 
        heard that Win2K has a limit of about 250 sites. Our deployment 
        will require more than 1,000 sites. I've read somewhere 
        that if you have that many sites, you should turn off 
        the Knowledge Consistency Checker (KCC), but that seems 
        like a drastic step. What should I do?
      Solution: This 
        is a much advertised and much misunderstood issue. During 
        the Win2K beta, Compaq was one of the first to see the 
        problem. The more sites, DCs, and the like you have, the 
        longer it takes the KCC (which by default runs every 
        15 minutes) to do its job. When it fires up, it takes 
        about 90 percent of the CPU of one processor on every 
        DC (staggered). So the "limit" is whatever you 
        can live with, remembering that you give up 90 percent 
        CPU utilization on the DCs.
      We believe that with proper design and implementation, 
        there's no need to turn off the KCC. Doing so would force 
        you to do all the KCC's work manually, including creating 
        transitive links, routing around trouble spots, creating 
        and cleaning up connections, forming the topology using 
        the spanning tree algorithm, adjusting for failed bridgehead 
        servers, and so on. I don't believe this is practical.
      
         
      Replication Repair Tip
      When it comes to replication repair, we've found that it's important 
        to be patient. After making changes, you can try forcing replication 
        (Replication Monitor has the ability to push the changes out to the 
        enterprise), but it's quite surprising how many issues get resolved 
        by just waiting and letting replication move the changes out naturally.
      
      There are several options you can choose 
        from to get around this limitation. An excellent reference 
        is KB Q244368, "How to Optimize Active Directory 
        Replication in a Large Network," that provides equations 
        to predict the KCC time based on number of sites and domains, 
        as well as good descriptions of workarounds to this problem.
      One is to turn off Auto Site Link Bridging. 
        Using the equations in Q244368, if you have 1,000 sites 
        and five domains, the KCC time is about 45 minutes. That 
        means it takes the KCC 45 minutes to do its job (eating 
        90 percent of the DC's CPU), then goes to sleep for 15 
        minutes, then fires up for 45 minutes. So out of every 
        hour, your DC gets 15 minutes of CPU to do other things. 
        Not good. However, if you turn Site Link Bridging off, 
        this drops the KCC time to about three minutes, eliminating 
        the problem. This eliminates transitive site links, but 
        in a pure hub and spoke configuration, this isn't usually 
        a problem. You can build some "backup" links 
        if you want some redundancy and don't want the KCC to 
        do it.
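
      The duty-cycle arithmetic above is easy to check with a few lines of Python. This sketch follows the article's reading of the schedule (the KCC runs to completion, sleeps for the 15-minute interval, then runs again); the function name is hypothetical, and the run times are the article's figures rather than values computed from the Q244368 equations.

```python
def free_cpu_fraction(kcc_run_minutes: float, interval_minutes: float = 15.0) -> float:
    """Fraction of each run-plus-sleep cycle in which the KCC is idle."""
    cycle = kcc_run_minutes + interval_minutes
    return interval_minutes / cycle


# Site Link Bridging on: the article's 45-minute KCC run.
print(f"bridging on:  {free_cpu_fraction(45):.0%} of the time free")
# Site Link Bridging off: the article's three-minute KCC run.
print(f"bridging off: {free_cpu_fraction(3):.0%} of the time free")
```

      With a 45-minute run the DC is free for 15 minutes out of every 60; with a three-minute run it is free for 15 minutes out of every 18.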
      Another method is to use Super-Sites, which 
        Compaq employs. Rather than having every location defined 
        as a site, collect several locations into a single site. 
        Because this forces replication in those sites to intra-site 
        parameters (no data compression, urgent replication, and 
        so on), Compaq requires at least a 2MB link between these 
        sites. Even though Compaq has a number of physical locations 
        in Canada and Japan, because of the high-speed links between 
        locations, we only needed to define two Active Directory 
        sites in Canada and two in Japan. Using Super-Sites 
        reduced 700 locations to about 80 sites.
      In addition to the design resolutions just 
        noted, there are some technical ways to solve this problem. 
        Schedule the KCC to run at certain times on each DC, thus 
        controlling when the CPU is hit. Load balancing is also 
        an issue when you have more than 100 satellite sites replicating 
        to a single hub site and one Bridgehead Server (BHS). 
        With manual intervention you can configure multiple BHS 
        to share the load. Because both of these issues are more 
        critical in a branch office environment where locations 
        are connected with VPN links, Microsoft recently published 
        an excellent white paper, "Active Directory Branch 
        Office Planning Guide." It includes a set of scripts 
        and procedures aimed at scheduling the KCC and building 
        connections for load balancing. Tools of this nature are 
        critical if you plan on turning off the KCC. You can download 
        the white paper at www.microsoft.com/WINDOWS2000/techinfo/planning/activedirectory/branchoffice/default.asp.
      By the way, Microsoft has promised that Windows 2002 
        will improve the performance of the KCC significantly, 
        so this problem should go away. [See "Sonic 
        Boom! Windows 2002 Smashes the Barriers" in the 
        July 2001 issue of MCP Magazine for more on this. 
        -Ed.]
      Problem: I get 
        Event 1000 and 1001 errors in Application Event Log in 
        five-minute intervals; Group Policy is not taking effect; 
        or %windir%\sysvol\staging and ...\staging areas folders 
        have large quantities of files.
      Solution: This 
        is usually indicative of a File Replication Service (FRS) 
        issue. Note that Event 1000 is associated with a wide 
        variety of descriptions. In this case it's a Userenv event 
        with the error message "The Group Policy client-side 
        extension Security was passed flags (17) and returned 
        a failure status code of (3)." It's also accompanied 
        by Scecli event 1001 with the message "Security policy 
        cannot be propagated. Cannot access the template. Error 
        code = 3."
      FRS Replication is probably not working. 
        FRS is one of the biggest problem areas in the original 
        release of Win2K, but has been improved in Service Pack 
        2. It's responsible for, among other things, replicating 
        Group Policy templates (and changes) to all DCs. When 
        changes are made to a GPO and saved, the changed file 
        is copied to the %systemroot%\sysvol\staging\domain and 
        %systemroot%\sysvol\staging areas\compaq.com directories 
        (note that this isn't the sysvol share). The screens in 
        Figure 2 show the result of making changes to a GPO. The 
        file name is NTFRS_CMP_ and is put in both directories.
      
         
      Figure 2. The two default directories 
        to which changes in GPOs are replicated.
      
      The DC then notifies its partners, which 
        pull it and notify their partners, and so on. These files 
        shouldn't stay in the staging folders longer than about 
        10 minutes. This happens for every change and for DFS 
        changes as well.
      To resolve this problem, back up the group 
        policy files from %systemroot%\sysvol\sysvol\compaq.com\policies. 
        A simple copy to another directory or a network share 
        is fine. You'll be glad you did! Figure 3 shows the Sysvol 
        directory structure. Note that the policies are listed 
        by GUID and exist in the \winnt\sysvol and \winnt\sysvol\sysvol 
        directories. The GPOs in \winnt\sysvol\sysvol\policies 
        are the ones that get edited via the policy editor and 
        are replicated. The gpotool.exe output, gpotool.log, provides 
        a nice mapping of policy name to GUID as shown in Listing 
        1. Note the policy GUID at the top of the section and 
        the "Friendly name" below it.
      
         
      Listing 1. This log, created by gpotool.exe, maps the policy name to the GUID.

        Policy {168F03D2-9E17-443F-9AE5-7BE43A5FA453} Policy OK
        Details:
        DC: mytest.net
        Friendly name: New Group Policy
        Object Created: 11/15/2000 5:47:38 PM
        Changed: 3/16/2001 7:18:45 PM
        DS version: 0(user) 2(machine)
        Sysvol version: 0(user) 2(machine)
        Flags: 0
        User extensions: not found
        Machine extensions: [{C6DC5466-785A-11D2-84D0-00C04FB169F7}
        {942A8E4F-A261-11D1-A760-00C04FB9603F}]
        Functionality version: 2
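
      If you have many GPOs, the GUID-to-name mapping in gpotool.log can be pulled out with a small script. The Python sketch below assumes only the two line shapes visible in Listing 1 (a line starting with "Policy {GUID}" followed later by a "Friendly name:" line); the function name is hypothetical, and other gpotool output variations are not handled.

```python
import re

# A policy section starts with "Policy {GUID}"; "Policy OK" has no brace and is skipped.
POLICY_LINE = re.compile(r"^\s*Policy\s+(\{[0-9A-Fa-f-]+\})")
FRIENDLY_LINE = re.compile(r"^\s*Friendly name:\s*(.+?)\s*$")


def map_guid_to_name(log_text: str) -> dict[str, str]:
    """Return {policy GUID: friendly name} parsed from gpotool.log text."""
    mapping = {}
    current_guid = None
    for line in log_text.splitlines():
        policy = POLICY_LINE.match(line)
        if policy:
            current_guid = policy.group(1)
            continue
        friendly = FRIENDLY_LINE.match(line)
        if friendly and current_guid is not None:
            mapping[current_guid] = friendly.group(1)
            current_guid = None
    return mapping
```

      The resulting dictionary makes it easier to match the GUID-named folders under the policies directory to the policies you recognize by name before you back them up.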
      
      In diagnosing FRS problems, it's critical 
        to install Service Pack 2. If you can't install SP2, install 
        SP1 and hotfix Q272567. If you can't install SP1, just 
        install the hotfix. The hotfix can be installed pre- or 
        post-SP1 and is incorporated in SP2. At a minimum, you must 
        install the hotfix or you may never get to the bottom 
        of your FRS problems.
      Other matters to consider: 
      
        - Resolve any AD replication problems. FRS 
          depends on AD replication, so if AD is broken, FRS won't 
          work either.
-  Stopping and restarting the File Replication 
          Service on each DC may fix the problem (watch the staging 
          areas; there will be a visible reduction in size). 
        
If these tasks don't fix the problem, follow 
        this procedure, which uses information from KB Q257338, 
        "Troubleshooting Missing SYSVOL and NETLOGON Shares 
        on Windows 2000 Domain Controllers," and our experience:
      
        - Stop FRS service on all DCs.
- Navigate to the Registry key HKLM\SYSTEM\CurrentControlSet\Services\NtFrs\Parameters\Backup/Restore\Process at 
          Startup and set the BurFlags value to D4 on a source 
          DC. This is usually the PDC emulator.
          Set the BurFlags value to D2 on all "satellite" 
          DCs in the domain, as shown in Figure 4.
      
        - Start the FRS service on the hub DC and 
          one other DC and wait for FRS to synchronize. Repeat 
          for every DC in the domain. You should see the size 
          of the staging directories change, and maybe even increase 
          as the files are moved. As long as they're changing 
          size, FRS is working. Be patient and let FRS work it 
          out.
- If absolutely necessary, identify the 
          source DC (the one with the most files in the staging 
          directory) and delete the files from the staging areas 
          on the satellite DCs. Then repeat this procedure (turning 
          FRS on each DC, one at a time) until it's synchronized.
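
      Before touching the Registry on a large domain, it can help to write the plan down. The Python sketch below only builds that plan as data: which BurFlags value each DC gets (D4 on the source, D2 everywhere else, per the steps above) and the order in which to start FRS again. It makes no Registry or service calls, and the function names and DC names are hypothetical.

```python
SOURCE_BURFLAGS = 0xD4     # value the procedure sets on the source DC
SATELLITE_BURFLAGS = 0xD2  # value the procedure sets on every other DC


def burflags_plan(source_dc: str, all_dcs: list[str]) -> dict[str, int]:
    """Map each DC name to the BurFlags value it should receive."""
    if source_dc not in all_dcs:
        raise ValueError(f"{source_dc} is not in the list of DCs")
    return {
        dc: SOURCE_BURFLAGS if dc == source_dc else SATELLITE_BURFLAGS
        for dc in all_dcs
    }


def restart_order(source_dc: str, all_dcs: list[str]) -> list[str]:
    """Start FRS on the source DC first, then on the others one at a time."""
    return [source_dc] + [dc for dc in all_dcs if dc != source_dc]


if __name__ == "__main__":
    dcs = ["DC-PDC", "DC-EAST", "DC-WEST"]  # hypothetical names
    for name, value in burflags_plan("DC-PDC", dcs).items():
        print(f"{name}: BurFlags = {value:X}")
    print("Start FRS in this order:", ", ".join(restart_order("DC-PDC", dcs)))
```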
         
      Figure 3. The Sysvol directory 
        structure lists policies by GUID.

      Figure 4. Setting this Registry 
        value to 2 can help you get FRS working again.
      
       
      Problem: 
        I get Event 13557 in the FRS Log: "Duplicate Connection 
        Objects."
      Solution: This 
        event, like many in Win2K, has a standard troubleshooting 
        procedure. However, this is a quick fix and may not solve 
        the real problem. While I'm a big fan of the abilities 
        of the KCC, it doesn't do a great job of cleaning up old 
        connection objects. The easy answer is to go to the Sites 
        and Services snap-in, find the server logging these errors, 
        and open the NTDS Settings object. There should only be 
        one inbound connection object from any single DC.
      Duplicate connection objects will break FRS 
        and AD replication if left unresolved. It's possible that 
        eventually the KCC will clean them up; if not, you'll 
        need to do it manually. KB article Q251250, "NTFRS 
        Event ID 13557 Is Recorded When Duplicate NTDS Connection 
        Objects Exist," is a good reference, but my experience 
        has taught me to create a prioritized list of methods 
        to correct this problem, starting at the top and moving 
        down.
      Remove the duplicates. Simply delete the 
        duplicate objects in the Sites and Services snap-in. If 
        they don't come back, you're done. Figure 5 shows duplicate 
        connection objects on Qtest-MDC1 from Qtest-DC2. In this 
        case, you could simply delete one of them to fix the problem.
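
      The rule of thumb above (only one inbound connection object from any single DC) is simple to check once you have a list of the source DCs for a server's inbound connections. The Python sketch below assumes you have already collected that list by hand or from a tool; it does not query Active Directory, and the function name and the third DC name are hypothetical.

```python
from collections import Counter


def duplicate_inbound_sources(inbound_sources: list[str]) -> dict[str, int]:
    """Return {source DC: count} for every source with more than one connection."""
    counts = Counter(inbound_sources)
    return {source: count for source, count in counts.items() if count > 1}


# Inbound connections on Qtest-MDC1, as in Figure 5 (Qtest-DC3 is made up).
inbound = ["Qtest-DC2", "Qtest-DC2", "Qtest-DC3"]
print(duplicate_inbound_sources(inbound))
```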
      
         
      Figure 5. To remove a duplicate 
        object, simply delete it from the Sites and Services 
        snap-in.

      Figure 6. After deleting the 
        duplicate, make sure you have the KCC recheck the 
        connections.
      
      If you see duplicate connections from several 
        DCs and don't know which ones to delete, you can delete 
        all of the connection objects, then right-click on the 
        NTDS settings object and go to All Tasks | Check Replication 
        Topology. In Figure 6 we deleted the duplicate connections 
        from Qtest-DC2 and are ready to "Check the Replication 
        Topology." This will fire up the KCC and make it 
        re-evaluate the connections for that DC. It will create 
        the connection objects needed.
      If the duplicate connections get re-created, 
        you need to find out why. The "why" is most 
        likely a DNS misconfiguration or failure. In one case 
        in Compaq's Qtest forest, we noticed a DC in Europe with 
        2,100 connection objects, inbound from a single DC. We 
        deleted them, but within a few minutes there were 24 more. 
        We found that a DNS server had its IP address changed, 
        breaking the delegation. We corrected the delegation, 
        deleted all the connections, forced the KCC to check the 
        topology, and the duplicate connections ceased.
      Problem: When 
        attempting to log on to a Win2K member server or Win2K 
        Pro workstation using a domain account, the following 
        error message appears: "Error: Trust Relationship 
        between this workstation and the Domain Controller Failed."
      Solution: This 
        error is usually caused by the secure channel password 
        for the member server or workstation getting out of sync 
        with the DC, but it could be caused by a time-zone shift 
        between the client and the DC. A typical scenario for 
        this problem would be removing a computer from a Win2K 
        domain, A, and joining it to another domain, B, then later 
        moving it back to the original domain, A. Initially, there's 
        a machine account for this client on the A domain. When 
        it's moved to the B domain, it creates a new account on 
        the B domain and synchs the password with the client. 
        When it's moved back into the A domain, the machine account 
        is still there (it doesn't create a new one), but 
        now the passwords don't match, resulting in the error. 
        I've also seen it caused by moving a computer between 
        time zones and not changing the client's time zone information.
      To resolve this problem, delete the client's 
        computer account from domain A and let replication in 
        the site occur, which should take a maximum of five to 
        10 minutes. Then configure the client to join a workgroup 
        and reboot it. This cleans up all the local machine account 
        information. After the reboot, configure the machine into 
        the domain and reboot again. This will create a new account 
        and synch the passwords with the client. The reboot, which 
        is required anyway, will purge the Kerberos tickets so 
        new ones will be created with the new access information.
      If the problem still exists, it could be 
        a timing issue. Go to the client, open a command prompt 
        window, and enter this command:
      net time \\domaincontroller /set
      
      
      
      
      
where "domaincontroller" is a valid 
        DC name that can be used to synchronize time on the client. 
        Remember that Kerberos requires that the time difference 
        between the two systems be less than five minutes.
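
      The five-minute rule, and the way a wrong time zone setting breaks it, can be illustrated in Python. This is a sketch for reasoning about the numbers, not a diagnostic tool: the function name is hypothetical, and the timestamps are made up. It compares two timezone-aware times, so a client showing the right wall-clock time under the wrong zone is correctly seen as hours off.

```python
from datetime import datetime, timedelta, timezone

MAX_SKEW = timedelta(minutes=5)  # the limit quoted in the article


def within_kerberos_skew(client_time: datetime, dc_time: datetime,
                         max_skew: timedelta = MAX_SKEW) -> bool:
    """True if the two timezone-aware times differ by less than max_skew."""
    return abs(client_time - dc_time) < max_skew


dc = datetime(2001, 8, 1, 12, 0, tzinfo=timezone.utc)

# Client clock drifted by three minutes: still acceptable.
print(within_kerberos_skew(dc + timedelta(minutes=3), dc))

# Client shows 12:00 on its clock but is set to a zone five hours behind UTC,
# so in absolute terms it is five hours away from the DC.
wrong_zone = datetime(2001, 8, 1, 12, 0, tzinfo=timezone(timedelta(hours=-5)))
print(within_kerberos_skew(wrong_zone, dc))
```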
      
         
      Be Resourceful
      Microsoft doesn't want you flying blind when troubleshooting. It offers 
        many useful diagnostic and troubleshooting helpmates. Learn them, then 
        use them. They include Support Tools and Resource Kit utilities. Remember 
        to get verbose output; when troubleshooting, more knowledge is better.
      Support Tools is found on the Win2K Server and Advanced Server CDs in 
        \Support\Tools. Just run setup to install them. These tools are lifeblood, 
        so much so that they should be installed on every domain controller (DC).
      For general AD diagnostics, netdiag.exe and dcdiag.exe are two of the best. 
        They'll generate netdiag.log and dcdiag.log files, which give great 
        information concerning trusts, DNS, NetBIOS names, TCP/IP details and more.
      Nltest.exe is a quick way to return network information such as a computer's 
        site, site coverage and a list of DCs in the domain. You can also use it 
        to query the domain trusts.
      When it comes to replication issues, Replication Monitor and repadmin.exe 
        are invaluable tools.
      One of the best of all resources is Microsoft itself, especially TechNet 
        (www.microsoft.com/technet). If you can't afford the CD version, go to 
        the Web and search the Knowledge Base at http://search.support.microsoft.com/kb/c.asp.
      
      Problem: I just 
        upgraded my NT 4.0 domain to all Win2K DCs and everything 
        is broken. How can I recover my NT 4.0 domain? (By the 
        way, I didn't remove a BDC before the upgrade as Microsoft 
        recommends, and I have no backup!)
      Solution: This 
        scenario describes a call I got from a customer. It's 
        absolutely the coolest thing I've done in Win2K troubleshooting. 
        He had a single NT domain with a PDC and two BDCs. He 
        upgraded the BDC first (don't ask me how), then the PDC. 
        In the meantime, the other BDC had a disk crash. The Win2K 
        domain was broken: no user authentication, no replication, 
        no services. He wanted to recover the NT domain, but had 
        no NT 4.0 machines left and no backup. Fortunately, he'd 
        left it in mixed mode, so he still had a copy of the SAM 
        database. In mixed-mode, you should still be able to add 
        an NT 4.0 BDC and get the NT domain back. Since he was 
        "dead" anyway, we had nothing to lose, so we 
        used the following process and it worked! I've never seen 
        this in any Microsoft document or training course. Here's 
        the process:
      
        - Pick the healthiest DC to be used as a 
          source.
-  Transfer all the FSMO roles to this machine 
          if it isn't the FSMO already.
-  Turn the other DC off.
-  Pre-create a computer account for a new 
          NT 4.0 BDC in the AD. This can be done by using Win2K's 
          Server Manager (srvmgr.exe) or with the netdom command. 
          Warning: Don't use NT 4.0's version of srvmgr.exe; it 
          won't work. Win2K's version is built in. To use 
          netdom on a Win2K DC, type: 
netdom add bdcname /domain:domain name /dc
      where bdcname is the name of the new BDC 
        and domain name is the name of the Win2K domain (such 
        as Compaq.com).
      
        - Install a computer (we picked the other 
          Win2K machine we just turned off) as the Windows NT 
          4.0 BDC and join the Win2K domain (using the NetBIOS 
          name, of course). Once this BDC joins the domain, it 
          will sync with the PDC and get the SAM. Now you have 
          the NT 4.0 domain intact on this BDC. Shut down the 
          Win2K DC, leaving only the NT 4.0 BDC.
-  Promote the NT 4.0 BDC to PDC.
-  Reinstall the Win2K DC as an NT 4.0 BDC 
          in the recovered NT 4.0 domain so you're back on solid 
          ground. Add a second BDC for safety, let it sync with 
          the others and pull it offline (which should have been 
          done in the first place).
- Now do the migration right. Upgrade the 
          NT 4.0 PDC and create the Win2K domain.
- Upgrade the BDC to Win2K as a replica 
          DC in the domain.
      
It took the customer the better part of a 
        day to do that, but it worked. He recovered all his accounts 
        and completed the Win2K upgrade. Note: If the 
        original Win2K domain (the broken one) had been changed 
        to Native mode, none of this would have worked.
      
      
         
      Making Active Directory Happy
      The two biggest issues with making sure AD is working properly are DNS 
        and replication. If they work, AD's generally happy. Here are some general 
        replication tips to make sure replication's working:

        -  Get comprehensive replication error listings from all DCs in a domain 
           from Replication Monitor/Action Menu/Domain/Search DCs for Replication Errors.
        -  Get a status report from Replication Monitor. Right click on a server icon 
           and select Generate Status Report.
        -  Run repadmin.exe /showreps to look for errors.
        -  In Sites and Services or Replication Monitor, force replication between 
           two DCs.
        -  Force the KCC to regenerate the topology (Sites and Services or Replication 
           Monitor). Look for failures.
        -  To see if the domain naming context is being replicated, create a test 
           user account on a DC, then force replication to another DC. Look at the 
           Users and Computers snap-in on that DC and see if the test user's there.
        -  To see if the Configuration and Schema naming contexts are being replicated, 
           create a test site, and force replication, then see if the other DC gets 
           the new site.
      
      Networking
      Problem: Why 
        is it when I enter a Route Add command, the route doesn't 
        show up in the RRAS list of static routes?
      Solution: There's 
        been quite a lot of confusion about the different ways 
        to define static routes in Win2K Server. It started with 
        the introduction of RRAS in NT 4.0, but it's still in 
        the product today. This issue must be understood before 
        any network troubleshooting takes place.
      The problem is that Win2K Server allows for 
        two separate ways of adding routes. The best way is to 
        enter the static routes in RRAS. RRAS is a kernel-mode 
        service with sophisticated routing capabilities. The other 
        way, and the result of the ROUTE ADD command, is to enter 
        the routes as a user-mode function. This routing method 
        stems from NT 3.x days and shouldn't be used if you can 
        avoid it. (Microsoft kept it around to avoid breaking 
        existing scripts that customers might have.)
      
         
          Figure 7. The typical output of a ROUTE PRINT command.

          Figure 8. Persistent routes are automatically established when a system comes online.

          Figure 9. The Registry can help confirm the persistent routes in your network.
      
      As Figure 7 shows, there are two interfaces 
        in this system. The default gateway points to 216.82.49.33, 
        and there's an internal card with address 10.0.2.1. The 
        second route states that all 10.0.2.0 traffic is directly 
        available to the internal subnet. For our example of the 
        routing confusion, let's introduce a new internal subnet 
        of 11.11.11.0. The old way of doing this is to issue the 
        following command:
      
      
route add 11.11.11.0 mask 255.255.255.0 10.0.2.1 -p
      
      
The -p option at the end states that this 
        route's persistent and should always exist when the system 
        comes online. Figure 8 shows the result of this command.
      Notice that the persistent route is clearly 
        listed in the routing table near the end. Additionally, 
        it's in the Active Routes list.
      Since the route exists in the routing entries 
        list, the network works as expected. In fact, a peek into 
        the registry shows the persistent routes list (just like 
        in NT 4.0). The route's listed as expected, in Figure 
        9.
      We've established that the backward compatibility 
        still exists and works in Win2K Server routing. Now let's 
        move forward.
      Win2K Server has two new ways to add static 
        routes that allow the RRAS engine to handle the entries. 
        The first way is to simply use the RRAS snap-in (see Figure 
        10). This has the advantage of being fairly obvious, but 
        if you have more than just a few entries, this process 
        would be too time-consuming.
      Win2K also introduces a powerful command 
        shell called NETSH. If you have a number of static routes 
        and you need to create or modify a batch file, use this 
        command. The equivalent command to the ROUTE ADD command 
        we were using is:
      netsh routing ip add persistentroute 11.11.11.0 255.255.255.0 "Private" nhop=10.0.2.1
      Here you're defining a persistent route, 
        but you must also define the interface that's handling 
        this route and the next hop address. On this server, the 
        internal interface is named Private (Network Places | Properties 
        | Interfaces). Because this route is being handled directly 
        by this server instead of passing it off to another router, 
        our next hop is the same interface. Figures 11 and 12 
        show the ROUTE PRINT output and the Registry after this command.
      As you can see, neither of the backward compatibility 
        areas contains the new route that we've just added. The 
        ROUTE PRINT command lists it in the routing entries, but 
        doesn't know that it's a persistent route. RRAS, however, 
        does (see Figure 13).
      As you can imagine, this can cause confusion. 
        If you manage servers performing routing functions and 
        you're using static routes, I'd recommend changing from 
        the user-mode ROUTE ADD command to using RRAS routing. 
        The server will be able to handle more traffic with better 
        performance; all your routing information will be in a 
        unified location; and the router will have more flexibility 
        in the RRAS environment.
      When troubleshooting any network or routing 
        issues, it's important to discover the complete picture 
        of the routes applied to a server to fully understand 
        the network details. Make sure that you look in both RRAS 
        and ROUTE PRINT or the Registry list.
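
      If a server already carries a batch file full of legacy ROUTE ADD lines, the move to RRAS can be scripted. What follows is a minimal sketch in Python, assuming Python is available on an admin workstation. It rewrites one ROUTE ADD line into the NETSH form shown earlier. The helper name route_add_to_netsh is hypothetical, and the interface name has to be supplied by you, because ROUTE ADD doesn't carry one. The function only builds text; it never touches the routing table.

```python
import ipaddress


def route_add_to_netsh(route_cmd: str, interface: str) -> str:
    """Rewrite 'route add DEST mask MASK GATEWAY [-p]' in the NETSH form.

    Hypothetical helper: it only builds the command text. 'interface' is
    the RRAS interface name, which the caller must supply.
    """
    # Drop the persistence switch; "persistentroute" already implies it.
    tokens = [t for t in route_cmd.split() if t.lower() != "-p"]
    lowered = [t.lower() for t in tokens]
    if len(tokens) != 6 or lowered[:2] != ["route", "add"] or lowered[3] != "mask":
        raise ValueError(f"expected 'route add DEST mask MASK GATEWAY': {route_cmd!r}")
    dest, mask, gateway = tokens[2], tokens[4], tokens[5]
    # Sanity-check the addresses; both calls raise ValueError on bad input,
    # including a destination that has host bits set for the given mask.
    ipaddress.IPv4Network(f"{dest}/{mask}")
    ipaddress.IPv4Address(gateway)
    return (f'netsh routing ip add persistentroute {dest} {mask} '
            f'"{interface}" nhop={gateway}')


if __name__ == "__main__":
    print(route_add_to_netsh(
        "route add 11.11.11.0 mask 255.255.255.0 10.0.2.1 -p", "Private"))
```

      Run each generated line by hand on a test server first, and remove the old persistent route afterward so the same destination isn't defined in two places.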
      
         
          Figure 10. You can add static routes through the GUI shown here, but for more than a few entries, using a command-line utility is better.

          Figure 11. This ROUTE PRINT window doesn't show that the just-added 11.11.11.0 route is persistent...

          Figure 12. ...Nor does the Registry.

          Figure 13. The route is listed in RRAS, however.
      
      
      
         
          | 
               
                | 
                     
                      | Windows 2000: Built on the Rock of DNS |   
                      | Keep in mind that DNS is the foundation 
                          for Windows 2000, especially when you're 
                          troubleshooting Win2K. DNS will touch 
                          all aspects of the infrastructure. Make 
                          sure it's working and error-free before 
                          digging any deeper into a problem. An 
                          entire article could be written on DNS 
                          troubleshooting alone, but here are 
                          some basics.
                          - Design the DNS structure. Get help if you don't know how.
                              - Keep it simple. Unless you have some very slow links to sites,
                                we usually recommend three name servers per domain. You may want
                                more at remote (slow link) sites.
                              - Work out interoperability with your corporate root name server.
                                There are a number of options here, and Win2K DNS will play
                                nicely with BIND servers if you do it right.
                          - Make sure the DNS server and zone configurations are correct, with
                            delegations, forwarding and name server lists pointing to the right
                            IP addresses.
                          - Make sure DC names and domain names are resolved correctly.
                          - Make sure client DNS configuration is pointing to the right name servers.
                            Assuming a Win2K DNS name server is hosting the Win2K domain:
                              - DNS servers' TCP/IP properties should point to themselves for
                                preferred DNS and to the other name servers in the domain as
                                "additional" DNS servers.
                              - DNS servers at the Win2K root domain should forward to the name
                                servers registered on the Internet for Internet access. This could
                                be a company-owned or ISP-owned server.
                              - Clients should point to the Win2K DNS servers authoritative for
                                their domain. Order them with the "closest" servers highest in the
                                list.
                              - Watch the DNS event logs for errors, but note that DNS errors
                                will occur in the Directory Services and System logs as well.  |  |    | 
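
      The "make sure DC names and domain names are resolved correctly" check in the sidebar above can be run from any client. Here's a minimal sketch in Python, assuming Python is installed on the machine you're testing from. The function name resolve_names is hypothetical. It asks whatever resolver the client is configured to use, so it tests the client's DNS settings and the name servers together.

```python
import socket


def resolve_names(names):
    """Return {name: sorted list of addresses}; an empty list means the lookup failed."""
    results = {}
    for name in names:
        try:
            infos = socket.getaddrinfo(name, None)
        except socket.gaierror:
            results[name] = []
            continue
        # Each entry is (family, type, proto, canonname, sockaddr);
        # sockaddr[0] is the address as text.
        results[name] = sorted({info[4][0] for info in infos})
    return results


if __name__ == "__main__":
    # "localhost" is only a placeholder; substitute your own DC and domain names.
    for name, addrs in resolve_names(["localhost"]).items():
        print(name, "->", ", ".join(addrs) if addrs else "NOT RESOLVED")
```

      This only confirms that a host name resolves to an address. It doesn't look at the service records DCs register in DNS, so treat a clean result as a first pass rather than proof that DNS is healthy.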
      
      Cluster Troubleshooting
      In troubleshooting cluster problems, a number 
        of fundamental proactive and reactive tasks apply in almost 
        all cases. 
      
      
Proactive Tasks
        Get to know your cluster. It's hard to zero in on a problem 
        when you don't have a feel for how your cluster behaves 
        when healthy. To do this, make sure cluster logging is 
        enabled; you can't troubleshoot a cluster problem 
        otherwise. It's enabled by default in Win2K; but if you're 
        running NT 4.0 Enterprise Edition, refer to KB Q168801, 
        "How to Enable Cluster Logging in Microsoft Cluster 
        Server," to turn on cluster logging.
      Next, get familiar with the content of the 
        log file. Because the content of the log file is verbose 
        and cryptic, it's often hard to determine if a message 
        is benign or malignant. Therefore, it's good practice 
        to periodically save a copy of the cluster log file on 
        all cluster servers. This can be used as a reference to 
        compare against, once you experience a problem. You should 
        also save a copy after you make changes to your cluster 
        configuration. The cluster log will look very different 
        before and after you've clustered SQL Server 2000!
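
      Saving reference copies is easy to forget, so it's worth scripting. This is a minimal sketch in Python, assuming Python is installed on the cluster servers or that the log is reachable over an admin share. The function name snapshot_cluster_log and the label argument are hypothetical; pass in whatever path your cluster log actually lives at.

```python
import shutil
import time
from pathlib import Path


def snapshot_cluster_log(log_path, archive_dir, label="baseline"):
    """Copy the cluster log into archive_dir under a timestamped name.

    Returns the path of the copy. 'label' is free text, for example
    "baseline" or "after-sql-install", so later you can tell why the
    copy was taken.
    """
    source = Path(log_path)
    target_dir = Path(archive_dir)
    target_dir.mkdir(parents=True, exist_ok=True)
    stamp = time.strftime("%Y%m%d-%H%M%S")  # local time of the snapshot
    target = target_dir / f"{source.stem}-{label}-{stamp}{source.suffix}"
    shutil.copy2(source, target)  # copy2 also keeps the file's modification time
    return target
```

      Run it on every server in the cluster, once when things are healthy and again after each configuration change.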
      Then remember the adage "When it rains, 
        it pours." Since there's a good chance that next 
        time you experience a cluster problem you'll also experience 
        other problems, download and print out some good troubleshooting 
        documentation, including: 
      
        -  Windows 2000 Server Resource Kit's 
          chapter 20, "Interpreting the Cluster Log."
-  KB Q286052, "The Meaning of State 
          Codes in the Cluster Log."
-  If you're running Windows NT 4.0 with 
          the Option Pack, I also recommend Microsoft's white 
          paper "Installing the Windows NT Option Pack on 
          MS Cluster Server (MSCS)."
-  KB Q191138, "How to Install the 
          NTOP on Cluster Server."
-  KB Q223258, "How to Install the 
          NTOP on MSCS 1.0 with SQL Server 6.5 or 7.0." 
Something else you can do is upgrade to Win2K. 
        Clustering is a lot more finicky on NT 4.0 than on Win2K, 
        mostly because Windows NT 4.0 has the Option Pack. Then 
        do the same for your cluster-aware BackOffice products. 
        You can cluster SQL 2000 more reliably than SQL 6.5 or 
        SQL 7.0!
      Finally, I can't overemphasize the need for 
        a good backup. Make backups and once in a while test your 
        recovery procedures.
      
      
Reactive Tasks
        Isolate the errors in the cluster log by comparing what's 
        normal from your saved cluster log with the events logged 
        during the problem time. You might need to look at the 
        cluster log on all servers in the cluster. Remember that 
        the cluster log timestamp is GMT, so you need to calculate 
        GMT based on your time zone setting. Once you've identified 
        a problem area, cross-reference with the event log. Remember 
        that those are in local time, not GMT! In Win2K you only 
        need to look at the event log from one server since it's 
        replicated among cluster servers.
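
      Both chores in the paragraph above, comparing against a saved healthy log and shifting GMT timestamps to local time, can be sketched in a few lines of Python. This is an illustration under stated assumptions, not a finished tool. It assumes each log line starts with a process.thread prefix, two colons and a timestamp in the form YYYY/MM/DD-HH:MM:SS.mmm. Check that against your own log before relying on it. The function names are hypothetical, and you supply the UTC offset yourself (including any daylight saving adjustment).

```python
import re
from datetime import datetime, timedelta

# Assumed line layout: "<pid>.<tid>::YYYY/MM/DD-HH:MM:SS.mmm <message>".
LINE_RE = re.compile(
    r"^\s*[0-9A-Fa-f]+\.[0-9A-Fa-f]+::"
    r"(\d{4}/\d{2}/\d{2}-\d{2}:\d{2}:\d{2}\.\d{3})\s+(.*)$"
)


def parse_line(line):
    """Return (gmt_datetime, message), or None if the line doesn't fit the assumed layout."""
    match = LINE_RE.match(line)
    if match is None:
        return None
    stamp = datetime.strptime(match.group(1), "%Y/%m/%d-%H:%M:%S.%f")
    return stamp, match.group(2)


def unfamiliar_entries(baseline_lines, problem_lines, utc_offset_hours):
    """List (local_time, message) for messages that never appear in the healthy baseline."""
    known = set()
    for line in baseline_lines:
        parsed = parse_line(line)
        if parsed:
            known.add(parsed[1])
    offset = timedelta(hours=utc_offset_hours)
    found = []
    for line in problem_lines:
        parsed = parse_line(line)
        if parsed and parsed[1] not in known:
            # Shift GMT to local time so it lines up with the event log.
            found.append((parsed[0] + offset, parsed[1]))
    return found
```

      Messages are compared as exact text, so any entry that embeds a changing value will show up as unfamiliar even when it's routine. Expect to skim the output, not trust it blindly.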
      If you don't understand an error code, use 
        the "Net HelpMsg" command to try to get a better 
        description of the error. Also use the Knowledge Base 
        whenever possible.
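
      One small trap: error codes sometimes show up in hexadecimal, while NET HELPMSG, as far as I know, expects a decimal number. The following Python sketch (the helper name helpmsg_command is hypothetical) builds the command line from either form. It only returns the text of the command and doesn't run it.

```python
def helpmsg_command(code):
    """Build a 'net helpmsg' command line from a decimal or 0x-prefixed error code."""
    text = str(code).strip()
    # Treat a 0x prefix as hexadecimal; everything else is read as decimal.
    number = int(text, 16) if text.lower().startswith("0x") else int(text)
    if number < 0:
        raise ValueError("error codes are non-negative")
    return f"net helpmsg {number}"


if __name__ == "__main__":
    print(helpmsg_command("0x5"))
```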
      This last bit of advice might come as a shock. 
        Most likely, your No. 1 requirement is to get the cluster 
        and application working as fast as possible. Most important 
        is to understand the root cause of the problem so you 
        can prevent it from occurring again. Once you know what 
        caused the problem, consider all the options: you 
        can attempt to fix the problem or you can re-install. 
        I've found that very often it's faster to re-install a 
        server or a cluster than to fix a complex problem. This 
        option is often overlooked until many hours have been 
        spent fighting a complex problem. The solution path you 
        take will depend on the clustered application.
      
      
Troubleshooting Disk Problems
        So, what to do if the problem isn't the cluster, but the 
        disk? If your disk problem occurs right after you installed 
        your cluster, it's probably a misconfiguration. Backtrack 
        and verify the integrity of your shared I/O subsystem 
        without clustering. In general, it's easier to troubleshoot 
        standalone systems than clustered servers. Don't hesitate 
        to stress-test your disks before you cluster; SCSI 
        termination problems can hide when doing casual checks. 
        The simplest stress test might be a full (not fast) format 
        of the disk.
      If, however, your disk problem occurs on 
        a mature cluster, it's most likely caused by hardware 
        failure. Since disk handling is very different in Win2K 
        than NT 4.0, make sure you follow the procedure for the 
        right version of Windows. 
      For NT 4.0 check out: 
      
        -  KB 
          Q217224, "How to Replace a Clustered Disk in 
          Windows NT 4.0 Enterprise."
-  KB 
          Q243195, "Event ID 1034 for MSCS Shared Disk 
          After Disk Replacement" (which explains how to 
          fix the disk signature).
 
For Win2K check:
      
        - KB 
          Q280425, "Recovering from an Event ID 1034 
          on a Server Cluster." (The DumpCFG utility can 
          be found in the Resource Kit.)
-  KB 
          Q217224, "How to Replace a Clustered Disk in 
          Windows NT 4.0 Enterprise."
-  KB 
          Q243195, "Event ID 1034 for MSCS Shared Disk 
          After Disk Replacement." (This explains how to 
          fix the disk signature.)
It's likely you'll need to disable clustering 
        temporarily and access your disk directly. Remember this: 
        Once you disable cluster service and the cluster disk 
        driver, make sure that you never boot more than one 
        server at a time or you will corrupt your shared 
        disks!
      To access disks without cluster software 
        involvement temporarily:
      
        - Shut down and power off Server B.
-  Follow one of these routes:
If you're running NT 4.0:
      
        - On Server A, from Control Panel | Services 
          change the startup of the Cluster Server service from 
          Automatic to Disabled. To do this, highlight the Cluster 
          Server service, and select Startup. Note: 
          Don't stop the Cluster Server service.
- From Control Panel | Devices, change the 
          startup of the Cluster Disk device from System to Disabled. 
          To do this, highlight the Cluster Disk device, and select 
          Startup. Note: Don't stop the Cluster Disk 
          device.
If you're running Win2K:
      
        - On Server A, right-click on My Computer, 
          then select Manage. The Computer Management (Local) 
          snap-in comes up.
-  At the bottom, expand "Services 
          and Applications" and select Services. Right-click 
          on Cluster Service and select Properties. In the Startup 
          type box, click the dropdown arrow and select Disabled. 
          Then select OK to go back to Computer Management. Note: 
          Don't stop the Cluster Service.
-  At the top, select System Tools and highlight 
          Device Manager. The visible devices appear in the results 
          pane. On the toolbar, select View and click on Show 
          Hidden Devices. A Non-Plug and Play Drivers option will 
          appear in the results pane. Expand that. Right-click 
          on Cluster Disk Driver and select Properties, then click 
          on the Driver tab. The Startup box will be at the bottom. 
          Click on the options dropdown arrow and select Disabled. 
          Then select OK to return to Computer Management. Note: 
          Don't attempt to stop the Cluster Disk device.
-  In the results pane, right-click on Cluster 
          Network Driver, select Properties as before and select 
          the Driver tab. Select Disabled and then OK. Note: 
          Don't stop the Cluster Network device. 
        - Finally, reboot Server A.
-  After the reboot, verify via the proper 
          disk administration utility your access to the shared 
          storage devices. The shared disks should show up as 
          available and online. If you still have disk problems, 
          it wasn't a cluster problem. If you need to format a 
          disk, do it now. If you need to set the disk signature 
          (in Win2K), do it now. If you want to run some I/O 
          tests, do it now. If you need to restore some data, you 
          can also do it now.
- When you're finished working with the 
          disks in non-clustered mode, on Server A, follow one 
          of these paths:
For NT 4.0:
      
        -  From Control Panel | Services, change 
          the startup of the cluster service from Disabled to 
          Automatic.
-  From Control Panel | Devices, change 
          the startup of Cluster Disk device from Disabled to 
          System.
For Win2K:
      
        - In Computer Management, under Services, 
          reset the Cluster Service to Automatic.
-  In Computer Management, under Device 
          Manager, display the hidden devices and reset the Cluster 
          Disk device to System.
-  In Computer Management, under Device 
          Manager, reset the Cluster Network to System.
-  Reboot Server A, then restart Server 
          B. 
         
          | 
               
                | 
                     
                      | How To Become an Expert Troubleshooter |   
                      | Let's review some of the basic troubleshooting 
                          steps we use when dealing with a Windows 
                          2000 problem at a client site.
                          - Gather all the information about the problem from the person
                            experiencing it. (This will be easier for you than for us because
                            you know your environment.) To start, ask some probing questions:
                              - "What was the exact error message?" Get the user to e-mail a
                                screenshot, if necessary.
                              - "What were you doing when it happened?" Get the account and
                                computer used, applications running, and so on.
                              - "Have you seen this before?" If so, get exact details of the
                                previous incident.
                              - "Was it working before?"
                              - "Is there anything else that doesn't work?"
                              - "What changed prior to this problem in your environment?"
                                Something had to change if it just "quit working."
                          - Next, remember that logs are your troubleshooting friends. Get event
                            logs from both the client and server. You'd be surprised how many
                            customers call us before ever looking at the event logs or getting
                            exact error messages. Several Registry settings permit you to dump
                            verbose output to the event logs for a variety of things such as
                            replication, name resolution and Group Policy application. See
                            Microsoft Knowledge Base articles Q220940, "How to enable diagnostic
                            event logging for Active Directory services," and Q186454, "How
                            to enable user environment event logging in Windows 2000."
                              - On the same topic, don't get just any logs; get relevant logs
                                such as dcpromo.log, userenv.log, startup.log and netlogon.log.
                                Win2K has provided improved troubleshooting capabilities with
                                these logs, so use them. If you haven't discovered the
                                userenv.log, see Q221833, "How to enable user environment debug
                                logging in retail builds of Windows 2000."
                          - Check out network connectivity. Make sure everyone can talk to
                            everyone else. If not, find out if others on different subnets or
                            remote sites are experiencing the same problem or if it's isolated
                            to a particular site. Determine if it can be reproduced elsewhere.
                          - Next, check Group Policies in Win2K. They're complicated, to put it
                            mildly, and can cause a host of problems. They touch network and
                            domain security, desktop environments, account authentication
                            and software installation.  |  |    |
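
      The connectivity step in the sidebar above is quicker to repeat from several subnets if it's scripted. Here's a minimal sketch in Python, assuming Python is available on the machines you test from. The function names are hypothetical, and the host names in the usage note are placeholders for your own servers.

```python
import socket


def can_connect(host, port, timeout=3.0):
    """Return True if a TCP connection to host:port succeeds within the timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts and names that don't resolve.
        return False


def connectivity_report(targets, timeout=3.0):
    """Map each (host, port) pair to True or False."""
    return {(host, port): can_connect(host, port, timeout) for host, port in targets}
```

      For example, connectivity_report([("dc1.example.com", 389), ("dns1.example.com", 53)]) tries LDAP on a DC and DNS over TCP on a name server. Run the same list from a working site and a failing one and compare the results. This only tests TCP, so a service that answers over UDP alone won't be covered.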