I, Robots.txt

In a recent bout of stupidity, the U.S. Department of Energy apparently published confidential Homeland Security Department documents marked "For Official Use Only" by accident, and the documents remain visible via Google's Web cache.

To avoid situations like this, be sure you've created a properly configured robots.txt file on your Web servers. While it won't prevent confidential documents from being placed on a publicly available server, it is at least one way to prevent such documents from being available in Google's Web cache from now until eternity.

The robots.txt file isn't based on any officially recognized standard, but it has been in existence since 1994 and is generally accepted. Full details can be found at www.robotstxt.org.

The robots.txt file is placed on a Web server to provide instructions to well-behaved Web crawlers, or spiders. Anyone can run a crawler, but they're most often used by search engines to collect information about Web sites. The file tells a crawler which directories or files it should not index. There are basically two lines:
User-Agent:
Disallow:

These lines can be repeated within the same file. The "User-Agent:" line indicates which crawler type the subsequent "Disallow:" lines apply to. You can specify a particular crawler by indicating its User-Agent value (found in your Web logs), or simply specify "*" to indicate all crawlers.

Following the "User-Agent:" line are one or more "Disallow:" lines, typically indicating directories. Files can also be specified if desired. Here's a sample robots.txt file:
User-Agent: *
Disallow: /

These two lines, if placed in the robots.txt file at the root of your Web site, tell all crawlers to ignore your entire site.
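
To shut out a single crawler instead, name it on the "User-Agent:" line. Google's crawler, for example, identifies itself as Googlebot, so these two lines would exclude Google alone:
User-Agent: Googlebot
Disallow: /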

It's important to understand that a robots.txt file isn't a security mechanism; it does nothing to prevent crawlers or individuals from searching your site for files to index or view. Only polite crawlers will request the file and honor its contents.
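
To see what that politeness looks like in practice, here's a minimal sketch of the check a well-behaved crawler makes before fetching a page, using Python's standard urllib.robotparser module (the site and page URLs are hypothetical):

from urllib.robotparser import RobotFileParser

# Fetch and parse the site's robots.txt file.
robots = RobotFileParser()
robots.set_url("http://www.yoursite.com/robots.txt")
robots.read()

# A polite crawler asks before fetching; nothing forces it to.
page = "http://www.yoursite.com/somepage.html"
if robots.can_fetch("*", page):
    print("Allowed to crawl", page)
else:
    print("Skipping", page)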

If you want some of your site to be found in search engines but have other files you want kept out, disallow the directories you don't want indexed. For example, suppose you have the following structure under your Web root:
"/": Publicly available information to be put into search engines
"/Dev": Stuff you're working on but don't want published
"/Private": Stuff you definitely don't want published

Your robots.txt file would look like this:
User-Agent: *
Disallow: /Dev/
Disallow: /Private/
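
You can test a policy like this against specific paths without touching a server. Here's a quick sketch using the same urllib.robotparser module (the file names are made up for illustration):

from urllib.robotparser import RobotFileParser

# Parse the example policy directly from a list of lines.
rp = RobotFileParser()
rp.parse([
    "User-Agent: *",
    "Disallow: /Dev/",
    "Disallow: /Private/",
])

print(rp.can_fetch("*", "/index.html"))        # True: nothing disallows it
print(rp.can_fetch("*", "/Dev/test.asp"))      # False: under /Dev/
print(rp.can_fetch("*", "/Private/plan.doc"))  # False: under /Private/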

To be extra secure, you should put some form of authentication on both the /Dev and /Private sub-directories.

Finally, you might have specified that nothing should be crawled, yet you find crawlers still reading pages in directories that should be inaccessible. This means there's still a link to a page on your site somewhere on the Internet.

Using the previous example, let's say you've got a file named FOO.ASP in the /Dev directory. According to your robots.txt file, it shouldn't be crawled. However, there's no defense if some other site offers up a link like this:
"http://www.yoursite.com/Dev/FOO.ASP"

Crawlers can discover your FOO.ASP page through that link, and a search engine may list the URL even though it won't index the disallowed content. More to the point, robots.txt does nothing to stop a rude crawler, or a person, from following the link straight to the page. That's why authentication is a necessary extra step to prevent access.
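
A direct request, for instance, never consults robots.txt at all; in Python it's one line (URL as in the example above):

import urllib.request

# urllib.request doesn't read robots.txt; only server-side
# authentication can actually refuse this request.
html = urllib.request.urlopen("http://www.yoursite.com/Dev/FOO.ASP").read()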


About the Author

Russ Cooper is a senior information security analyst with Verizon Business, Inc. He's also founder and editor of NTBugtraq, www.ntbugtraq.com, one of the industry's most influential mailing lists dedicated to Microsoft security. One of the world's most-recognized security experts, he's often quoted by major media outlets on security issues.
