In-Depth

File Replication Service: User-Friendly at Last

The File Replication Service (FRS) has a justifiably bad reputation for bugginess and indecipherable logs. But recent changes from Redmond make it worth another look.

The File Replication Service (FRS) is a mystery to most Windows administrators I’ve talked to. And it’s no surprise why: It’s not properly covered in training courses; it has debug logs that are nearly impossible to interpret; and in the early days of Windows 2000, FRS was unreliable and error-prone. As a workaround, some companies disabled the FRS service and implemented RoboCopy instead.

A couple of years ago, Microsoft provided a series of Perl scripts (Topchk.cmd, Iologsum.cmd, Connstat.cmd and List.exe) that provided much-needed parsing for the logs produced by the NTFRSUTL.exe command-line tool. That helped: the logs could now be formatted. But you still needed a Ph.D. in FRS to interpret them. If you’ve ever tried to fix a null server reference object, figure out why a domain controller is stuck in a vvjoin, or fix a file with an invalid parent GUID, you know what I mean.

How do you interpret information from those logs to solve Event Log errors? The answer, of course, was to call technical support and spend money and time getting help. Recently, though, Microsoft has empowered us to do a higher level of troubleshooting through the release of several powerful tools and an incredible help file.

In the next few pages, we’ll explore some FRS basics, review the top FRS issues identified by Microsoft, and cover how they’re addressed in Win2K Service Packs and Windows Server 2003. Finally, we’ll take a closer look at powerful tools like Sonar, Ultrasound, FRSDiag, and the Ultrasound.chm help file.

A Brief FRS Overview
FRS was implemented in Win2K to replicate the contents of GPOs and scripts. It’s also used by the Distributed File System (DFS) for data synchronization between assigned members in a replica set.

FRS communicates with replication partners to determine when changes are made to the replica set (Sysvol or DFS) and then replicates that data to all downstream partners. It’s a multi-threaded, multi-master replication engine. FRS relies on Active Directory for its replication topology (NTDS connection objects) and specific replica set information, such as partners. FRS is dependent upon AD objects and AD replication, which in turn depends on network connectivity, DNS and Remote Procedure Calls (RPC). This is vital to remember when troubleshooting: a problem at any lower layer can surface as an FRS error.
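Because a failure lower in that chain can look like an FRS problem, it helps to rule out the basics first. The following is a minimal sketch, not a Microsoft tool: it checks name resolution and then a TCP connection to port 135 (the RPC endpoint mapper). The function names and the host name in the usage note are my own illustrations, and a passing result says nothing about AD replication itself.

```python
import socket
from typing import Callable, Dict

RPC_ENDPOINT_MAPPER_PORT = 135  # well-known TCP port for the RPC endpoint mapper


def resolve_name(host: str) -> bool:
    """Return True if the local resolver (normally DNS) can resolve the name."""
    try:
        socket.getaddrinfo(host, None)
        return True
    except socket.gaierror:
        return False


def can_connect(host: str, port: int = RPC_ENDPOINT_MAPPER_PORT,
                timeout: float = 3.0) -> bool:
    """Return True if a TCP connection to host:port succeeds."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


def check_dependencies(host: str,
                       resolver: Callable[[str], bool] = resolve_name,
                       connector: Callable[[str], bool] = can_connect) -> Dict[str, bool]:
    """Check the layers bottom-up; RPC is only attempted if the name resolves."""
    dns_ok = resolver(host)
    rpc_ok = dns_ok and connector(host)
    return {"dns": dns_ok, "rpc_endpoint_mapper": rpc_ok}
```

Calling check_dependencies("dc1.example.com") (a hypothetical host name) from a replication partner returns a small dictionary showing which layer failed first. The resolver and connector are injectable so the logic can be exercised without a network.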

Common FRS Problems and Solutions
Before you start troubleshooting FRS, make sure you have the latest service pack plus any FRS-specific hotfixes. Let’s look at some of the common problems and how Microsoft has solved them, or at least made FRS more tolerant.

Junction Points
Also referred to as reparse points, directory junctions and volume mount points, a junction point is a physical location on a hard disk that points to another location on a disk or storage device. Think of junction points as links in the file system, sort of a tunnel that binds two ends into one; it connects two locations on the disk to each other.

Removal of a junction point will cause FRS replication to fail. Likewise, copying the junction point will create another Sysvol tree.

Morphed Directories
Morphed directories and files are ones that were replicated to a target that already had an exact copy of them. FRS can’t tell which one is most recent, so it creates a duplicate copy, referred to as a “morph.” These duplicate directories or files are renamed by appending _NTFRS_xxxxxxxx to the name, where “xxxxxxxx” is a random eight-digit number. This usually occurs if an Authoritative Restore (discussed later) takes place, forcing an entire Sysvol tree to multiple replica set members at the same time. The administrator must decide which is the newest, most correct version to keep. If it’s the morphed version, delete the original and rename the morphed folder by removing the _NTFRS_xxxxxxxx suffix. If it’s the original, delete the morphed version. The contents of a morphed directory aren’t replicated, so if the morph holds the more recent data, you may lose changes until the conflict is resolved. For more information, see Knowledge Base 328492, “Folder Name Is Changed to ‘FolderName_NTFRS_.’”
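Finding morphs by eye in a large Sysvol or DFS tree is tedious. Here is a minimal sketch that walks a tree and lists candidate morphs. It assumes the naming shown in the KB article title (the original name followed by _NTFRS_) plus eight hexadecimal characters; that pattern and the function name are my assumptions, so adjust them to what you actually see on disk. The script only reports; deciding which copy to keep is still your job.

```python
import os
import re
from typing import List

# Assumed morph naming: <original name>_NTFRS_ followed by eight hex characters.
MORPH_PATTERN = re.compile(r"_NTFRS_[0-9A-Fa-f]{8}")


def find_morphed_entries(root: str) -> List[str]:
    """Return the sorted paths of directories and files whose names look morphed."""
    hits = []
    for dirpath, dirnames, filenames in os.walk(root):
        for name in dirnames + filenames:
            if MORPH_PATTERN.search(name):
                hits.append(os.path.join(dirpath, name))
    return sorted(hits)
```

Point it at the root of the replica tree you are investigating and review each hit against its unsuffixed sibling before deleting or renaming anything.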

Parallel Version Vector Joins
When a new DC joins the domain, a “version vector” is created and distributed from the new DC to each of the other DCs in the domain, to make sure each of the replication partners has the right version of the Sysvol data. In Win2K, this process caused a lot of grief because it pulled the entire Sysvol tree from every DC in the domain at the same time, in parallel. This caused problems not only in network performance but in DC performance, since it has the potential for taking a DC offline during the process. Windows 2003 and Win2K SP3 have corrected this by making it a serialized process. The new DC will do a Version Vector Join (Vvjoin) during promotion; then, after completion, it will contact other DCs in the domain, one at a time, for changes. If the source DC is up to date, the Vvjoin is still done to the others, but no replication takes place.

Staging Area Problems
This is an oldie but a goodie; however, there are still many administrators not aware of this important issue. Changes made to files in Sysvol are copied to temporary files in two staging directories: %systemroot%\sysvol\staging\domain and %systemroot%\sysvol\staging areas\. The files stay there until all downstream partners have pulled them.

But some programs that scan the files, such as anti-virus and defragmenter programs, modify the security descriptors of the files. This forces a change order, causing all files in the Sysvol tree to be copied to the two staging directories. Setting File System Policy in a Group Policy to apply to the Sysvol tree does the same thing. Prior to SP3 and Windows 2003, this resulted in huge numbers of files being dumped into the staging directories, exceeding the 660MB limit and causing FRS replication to stop. There’s a Registry key to increase this limit, but that’s just to give you some breathing room until you can resolve the problem (see KB 264822, “File Replication Service Stops Responding When Staging Area is Full”).

Note: Most antivirus vendors now have FRS-friendly versions of their products. If you ask and they don’t know whether it’s FRS compatible, find another vendor: This is a well-known problem and they should have a solution. For more information, see KB 815263, “Antivirus, Backup, and Disk Optimization Programs That Are Compatible with the File Replication Service”.

Microsoft’s made improvements on this issue in Win2K SP3 and Windows 2003 in two ways.

1. Reduction of excessive FRS replication (see KB 811370, “Issues That Are Fixed in the Post-Service Pack 3 Release of Ntfrs.exe”). FRS detects these unnecessary updates to the files (presumably based on frequency) and suppresses the updates. The administrator is notified with event ID 13567 in the NTFRS event log. This was available as a Win2K post-SP3 hotfix (811370) as well as Windows 2003. It’s described in KB 315045, “FRS Event 13567 Is Recorded in the File Replication Service Event Log After You Install Service Pack 3.”

2. Replication isn’t stopped if the staging directory is filled (see KB 307319, “Changes to the File Replication Service”). In Win2K SP3 and Windows 2003, when the staging directory reaches 90 percent capacity, the oldest files are deleted until it’s reduced to 60 percent, thus preventing replication from stopping and taking the DC offline. Note that this isn’t a fix; the fix is to find out what’s causing the huge volume of files to be dumped into the staging area.
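To make the 90 percent and 60 percent thresholds concrete, here is a toy simulation of that cleanup behavior. It is only an illustration of the policy as described above, not FRS’s actual implementation: the tuple layout, the function name and the use of a timestamp to decide which file is “oldest” are all my assumptions.

```python
from typing import List, Tuple

STAGING_LIMIT_MB = 660   # default staging limit cited in this article
HIGH_WATER = 0.90        # cleanup starts at 90 percent of the limit
LOW_WATER = 0.60         # cleanup stops once usage is down to 60 percent

StagedFile = Tuple[str, float, float]  # (name, size in MB, timestamp)


def cleanup_staging(files: List[StagedFile],
                    limit_mb: float = STAGING_LIMIT_MB) -> List[StagedFile]:
    """Return the files that remain after the simulated cleanup pass."""
    used = sum(size for _, size, _ in files)
    if used < HIGH_WATER * limit_mb:
        return list(files)  # below the high-water mark: nothing is deleted
    kept = sorted(files, key=lambda f: f[2])  # oldest first
    while kept and used > LOW_WATER * limit_mb:
        _, size, _ = kept.pop(0)  # drop the oldest remaining file
        used -= size
    return kept
```

With ten 60MB files (600MB, above the 594MB high-water mark), the four oldest are removed, leaving 360MB, which is under the 396MB low-water mark. The simulation also shows why this isn’t a fix: if something keeps dumping files into staging, the cleanup simply runs again.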

Journal Wrap
The size of the NTFS Change Journal, which FRS uses to identify changes made to Sysvol data, was increased to 128MB in Win2K SP3 and 512MB in SP4, a dramatic increase over the Win2K RTM limit of just 32MB. This should significantly reduce the chance of journal wrap errors and the resulting non-authoritative restore.

Authoritative and Non-Authoritative Restore
Authoritative and Non-Authoritative Restore in FRS aren’t related to authoritative and non-authoritative restore in AD. In FRS-speak, these terms refer to a restore of the Sysvol tree only. They use a Registry key—BurFlags (for backup and restore flags)—to modify FRS behavior. Located at:

HKEY_LOCAL_MACHINE\System\CurrentControlSet\Services\Ntfrs\Parameters\Backup/Restore\Process at Startup

the BurFlags DWORD value is set to trigger FRS replication. Setting it to D2 on an out-of-date DC performs a non-authoritative restore from an up-to-date partner. Setting a source to D4 and all other satellite DCs to D2 forces the satellites to pull from the source, causing a full synchronization among DCs.
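Before touching the Registry, it’s worth writing down exactly which DC gets which value. The sketch below only builds that plan for the D4/D2 scenario just described; it deliberately doesn’t write anything to the Registry. The constant and function names are mine, and the DC names in the usage note are placeholders.

```python
from typing import Dict, Iterable

BURFLAGS_NON_AUTHORITATIVE = 0xD2  # "D2": this DC pulls Sysvol from a partner
BURFLAGS_AUTHORITATIVE = 0xD4      # "D4": this DC is the source for the others


def plan_authoritative_restore(source: str, all_dcs: Iterable[str]) -> Dict[str, int]:
    """Map each DC to the BurFlags value it would need: D4 on the source, D2 elsewhere."""
    dcs = list(all_dcs)
    if source not in dcs:
        raise ValueError("source DC %r is not in the list of DCs" % source)
    return {
        dc: BURFLAGS_AUTHORITATIVE if dc == source else BURFLAGS_NON_AUTHORITATIVE
        for dc in dcs
    }
```

For example, plan_authoritative_restore("DC1", ["DC1", "DC2", "DC3"]) assigns D4 to DC1 and D2 to DC2 and DC3. Having the plan in front of you makes it harder to set D4 on more than one machine by mistake.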

Warning: Be very careful with either type of restore, as improper action will be dangerous to your DCs’ health (and yours too, if you take down the domain). Always find the root cause before proceeding with this process.

 Authoritative Restore. Authoritative restore, sometimes referred to as “D4” because of the BurFlags setting used to enable it, uses a “big hammer” approach to getting Sysvol on all DCs in sync with a single source. Though Microsoft now says D4 was never intended to be a “silver bullet” solution to FRS issues, it was used extensively during the days when anti-virus products were first found to be filling the staging areas. Today there probably aren’t a lot of valid reasons to do an authoritative restore.

Authoritative restore leaves the file structure intact and simply backs up and restores Sysvol data. It assumes that all DCs in the domain hold corrupt or incomplete copies of the Sysvol tree and that the NTFRS database is corrupt. The underlying cause needs to be investigated and resolved to prevent the situation from recurring.

 Non-Authoritative Restore. Sometimes called “D2,” non-authoritative restore is the “little hammer” approach. Unlike the authoritative restore that syncs all DCs to a common source, non-authoritative restore syncs one out-of-date DC with an up-to-date source. Thus, only one source and one satellite are involved. This is less intrusive than the Authoritative Restore because it can only mess up two DCs, rather than all of them.

Unlike Authoritative restore, there are good reasons for using this. When a serious FRS error occurs such as a Journal Wrap error, Win2K behavior is to automatically perform a non-authoritative restore on the DC that experiences the error. Since this takes both DCs offline for a time, Windows 2003 doesn’t do this automatically. Instead, it flags the condition with an event ID 13568 to allow the administrator to perform this at a convenient time.

Diagnosis and Troubleshooting
There are a couple of ways to test the overall health of FRS. A good way to see who’s replicating to whom is to create an empty text file, name it after the DC it’s on (e.g., dc1.txt) and place it in the %systemroot%\sysvol\sysvol directory. Do this on every DC in the domain, then wait for end-to-end replication to occur. Every DC should have a text file from every other DC. For instance, if there are four DCs in the domain, DC1, DC2, DC3, and DC4, you would create dc1.txt on DC1, dc2.txt on DC2, and so on. After replication, each DC should have dc1.txt, dc2.txt, dc3.txt, and dc4.txt. If DC4 is missing dc1.txt, there’s an inbound replication problem from DC1 to DC4.
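With more than a handful of DCs, comparing directory listings by hand gets error-prone. Here is a minimal sketch of the comparison step only: you supply the set of marker files actually seen on each DC (gathered however you like), and it reports which source-to-target paths are missing. The function name and data layout are my own; the file naming follows the dc1.txt convention above.

```python
from typing import Dict, List, Set, Tuple


def find_replication_gaps(observed: Dict[str, Set[str]]) -> List[Tuple[str, str]]:
    """Return (source_dc, target_dc) pairs where the target lacks the source's marker file.

    observed maps each DC name to the set of marker file names found on that DC.
    """
    gaps = []
    for target in sorted(observed):
        present = {name.lower() for name in observed[target]}
        for source in sorted(observed):
            marker = source.lower() + ".txt"
            if source != target and marker not in present:
                gaps.append((source, target))
    return gaps
```

For the four-DC example, if DC4’s listing lacks dc1.txt, the function returns [("DC1", "DC4")], pointing at inbound replication from DC1 to DC4.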

There are a variety of ways to collect logs on suspect DCs: the NTFRS_xxxxxx.log files in %systemroot%\debug, those generated by NTFRSUTL.exe, and the Event Logs. The problem is interpreting them, which takes experience and in-depth knowledge. Microsoft now provides four powerful tools to help the average admin diagnose and troubleshoot FRS problems:

 Sonar. This tool monitors FRS data such as file backlog, errors, missing Sysvol shares, and so on for all DCs in the domain (see Figure 1). Findings are presented in a table format with options for refresh frequency and categories such as replication status.

Figure 1. The Sonar troubleshooting tool monitors FRS data such as file backlog, errors and missing Sysvol shares.

 Ultrasound. This tool goes beyond Sonar. It hooks to a SQL database (Microsoft SQL Server Desktop Engine will work) and provides historical data. It also has a feature that can send e-mail in the event of a failure, and other goodies.

 FRSDiag.exe. As shown in Figure 2, it allows you to click check boxes for the types of data you want, then runs the appropriate utility to get it. It’s like customizable MPS Reports in that regard. It also produces an FRSDiag.txt file, similar to the DCDiag.exe tool used for AD diagnostics.

Figure 2. The FRSDiag.exe tool lets you customize the report data from a variety of sources.

 Ultrasound Help File. Simple, yet perhaps the most powerful of all the tools, this file is powerful because Microsoft’s channeled its experience and knowledge into providing descriptions, causes and solutions to errors and problem conditions. It also contains FRS operation basics, terminology and information about the previously discussed tools.

The Ultrasound Help File thus becomes a desktop reference for all FRS events, errors and problem conditions. It’s extremely powerful in helping resolve FRS issues without involving tech support. Figure 3 shows one of my favorites, the Event ID list. All FRS-related event IDs are in the left pane. In this example I selected Event 13568, the Journal Wrap error. The right pane shows the description and the resolution. No searching the Microsoft Knowledge Base or Google. It’s right there.

Figure 3. The Ultrasound Help File is one of the best new things about FRS. It's comprehensive and easy to understand.

Another powerful feature of the Help File is the FRS Troubleshooting section. Figure 4 shows a table that explains how to interpret the event IDs. Note how it has key phrases like “Servers Missing Inbound Connections” and provides details on how to troubleshoot this error. Thus you can take information gleaned from FRSDiag.exe and look it up here. This file is available as a separate download at www.microsoft.com/downloads. Click on the FRS Monitoring Help File link.

Figure 4. The Ultrasound Help File at work. Here it not only shows the problem ("Servers Missing Inbound Connections") but gives possible causes.

Another helpful document is the “FRS Technical Reference” found at www.microsoft.com/technet, which contains much of the Help File contents.

Give FRS Another Chance
FRS is stable and reliable if you’re running at least Win2K SP3 or Windows 2003. There are fairly sophisticated tools for monitoring and diagnosis. There are also a lot of useful articles in Microsoft’s Knowledge Base. If you’ve been bitten in the past by FRS problems, give it another chance. If you’re using RoboCopy as a substitute, compare it to FRS; you just might go back.

Special thanks to Chris Jaramillo of HP and Dan Boldo of Microsoft for their contributions to this article.

This excerpt is from the forthcoming book, Windows 2003 and ProLiant Servers, by Gary Olsen and Bruce Howard. All rights reserved. Published with permission from Prentice Hall Professional Technical Reference.

About the Author

Gary is a Solution Architect in Hewlett-Packard's Technology Services organization and lives in Roswell, GA. Gary has worked in the IT industry since 1981 and holds an MS in Computer Aided Manufacturing from Brigham Young University. Gary has authored numerous technical articles for TechTarget (http://searchwindowsserver.techtarget.com), Redmond Magazine (www.redmondmag.com) and TechNet magazine, and has presented numerous times at the HP Technology Forum, TechMentors Conference and at Microsoft TechEd 2011. Gary is a Microsoft MVP for Directory Services and is the founder and President of the Atlanta Active Directory Users Group (http://aadug.org).
