Understanding Bayesian Analysis -- Redmond Channel Partner

Understanding Bayesian Analysis

From the early days to recent times, see how spam and ham differ, statistically speaking.

By Mike Gunderloy
October 01, 2003

Recognizing spam is not as easy as it might seem. For example, Yahoo! Groups puts ads at the end of every e-mail, but if your users subscribe to such a group, they probably want to get the e-mail anyhow. Most users would just as soon discard any e-mail containing the word “Viagra,” but if you’re a pharmaceutical company, that might not be a wise policy. Press releases look a whole lot like spam, but discarding them would be a real problem for a working journalist.

Early spam-fighting products relied largely on keyword filtering to spot dubious messages, on the theory that words like “Viagra” and “FREE Offer” and “unsubscribe” only appeared in spam. There are two problems with this approach. First, unlikely though it may be, such words do appear in legitimate e-mail as well. Second, spammers quickly caught on and started sending mail with creative spellings such as “V1agra” and “FREEE Offer” and “un$ubscribe.”
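To make both weaknesses concrete, here is a minimal sketch of a keyword filter. The keyword list, the function name keyword_filter and the sample messages are made up for illustration and do not come from any real product.

```python
# Hypothetical block list; real products used their own (much longer) lists.
BLOCKED_KEYWORDS = {"viagra", "free offer", "unsubscribe"}

def keyword_filter(message):
    """Return True if the message contains any blocked keyword."""
    text = message.lower()
    return any(keyword in text for keyword in BLOCKED_KEYWORDS)

# Problem 1: a legitimate mailing-list message trips the filter.
legit = "To unsubscribe from this list, reply to this message."
print(keyword_filter(legit))

# Problem 2: creative spellings slip straight past it.
sneaky = "Buy V1agra today, click here to un$ubscribe"
print(keyword_filter(sneaky))
```

The filter only knows the exact strings it was given, so every new spelling requires someone to update the list by hand.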

The spam-fighting landscape changed dramatically in August 2002, when Paul Graham published his article “A Plan for Spam” on the Internet (www.paulgraham.com/spam.html). Graham proposed a method of detecting spam by what’s known as Bayesian statistical analysis. While you should go read the article for details, the basic idea is surprisingly simple. Start with a large corpus of spam and a large corpus of “ham” (good e-mail), say several thousand messages of each. Now count the individual words that appear in each corpus. What you’re looking for is words that tend to appear more often in spam than in ham, or vice versa. For example, these days the word “Abacha” in my mail occurs exclusively in spam (of the Nigerian swindle variety), while the word “galleys” turns up only in ham (when my editors want me to review galley proofs). By looking at every word in every message, you can build up an extensive list of words and their probabilities of occurring in spam messages. Some words (like “Abacha” and “galleys”) have a very high or very low probability of occurring in spam, while others (like “the” or “home”) are distributed pretty evenly.
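The counting step might be sketched roughly as follows. This is a simplified illustration, not Graham's exact formula: the function names, the tokenizing rule, the 0.01 and 0.99 clamps and the four tiny sample messages are all assumptions chosen to keep the example short. A real corpus would hold thousands of messages.

```python
import re
from collections import Counter

def tokenize(text):
    """Lowercase the text and split it into word-like tokens."""
    return re.findall(r"[a-z0-9$'-]+", text.lower())

def count_words(messages):
    """Count, for each word, how many messages in the corpus contain it."""
    counts = Counter()
    for message in messages:
        counts.update(set(tokenize(message)))
    return counts

def word_spam_probabilities(spam_messages, ham_messages):
    """Estimate, per word, how strongly it points toward spam (0 to 1)."""
    spam_counts = count_words(spam_messages)
    ham_counts = count_words(ham_messages)
    n_spam = max(len(spam_messages), 1)
    n_ham = max(len(ham_messages), 1)
    probabilities = {}
    for word in set(spam_counts) | set(ham_counts):
        spam_rate = spam_counts[word] / n_spam
        ham_rate = ham_counts[word] / n_ham
        p = spam_rate / (spam_rate + ham_rate)
        # Clamp (assumed limits) so no single word counts as absolute proof.
        probabilities[word] = min(0.99, max(0.01, p))
    return probabilities

# Toy corpora, invented for illustration only.
spam = ["Urgent request from the Abacha family",
        "Abacha funds transfer, reply at home"]
ham = ["Please review the galleys at home",
       "The galleys are attached"]

probs = word_spam_probabilities(spam, ham)
for word in ("abacha", "galleys", "home"):
    print(word, probs[word])
```

With these toy corpora, “abacha” lands at the top of the scale, “galleys” at the bottom, and “home” in the middle, which mirrors the pattern described above.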

When a new message arrives, the Bayesian algorithm compares the words in the message to those already in your corpus, looking for the 15 or 20 most interesting words, meaning those with a very high or very low probability of occurring in spam. By combining the probabilities of those individual words, you can come up with a probability that the message containing the words is spam. If that probability is high enough, you can be nearly sure that the message is, in fact, spam.
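One way to sketch the scoring step is shown below, using the naive Bayes combining rule in the spirit of Graham's article. The function name, the hand-written probability table, the 0.4 score for words never seen before and the cutoff of 15 words are assumptions made for this example. The table values are assumed to lie strictly between 0 and 1.

```python
import math
import re

def spam_probability(message, word_probs, n_interesting=15, unknown=0.4):
    """Combine per-word spam probabilities into one score for the message."""
    words = set(re.findall(r"[a-z0-9$'-]+", message.lower()))
    # Words missing from the table get a mildly innocent default (assumed).
    probs = [word_probs.get(word, unknown) for word in words]
    # "Interesting" means far from the neutral value of 0.5.
    probs.sort(key=lambda p: abs(p - 0.5), reverse=True)
    top = probs[:n_interesting]
    spam_product = math.prod(top)
    ham_product = math.prod(1.0 - p for p in top)
    return spam_product / (spam_product + ham_product)

# Hypothetical per-word table, as if learned from earlier mail.
word_probs = {"abacha": 0.99, "galleys": 0.01, "the": 0.5, "home": 0.5}

print(spam_probability("Abacha needs the funds", word_probs))
print(spam_probability("Please review the galleys", word_probs))
```

The first message scores close to 1 and the second close to 0, so a single strongly weighted word can dominate a short message. Longer messages give the filter more evidence to weigh.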

Soon after Graham published his results, Bayesian spam filters started appearing, first on the client and in POP3 proxies, and then later on the server. Bayesian filters now boast a spam recognition rate of 95 percent or better in many settings. The experimental CRM114 implementation (crm114.sourceforge.net) refines the Bayesian approach and reports a recognition rate of over 99 percent.

The nice thing about Bayesian filters is that it doesn’t really matter what the spammers do; as long as their mail is different from real mail, the filter will learn to recognize it. Word substitutions, for example, end up working against the spammer. The likelihood that a message containing “V1agra” is spam is nearly 100 percent, so after the first few such messages go by, a good Bayesian filter will automatically stamp messages containing that word as spam.

Additional Information on Spam

Outrun the Avalanche
http://mcpmag.com/features/article.asp?editorialsid=362

What's New in Exchange 2003
http://mcpmag.com/features/article.asp?editorialsid=363

Two Services for the Enterprise
http://mcpmag.com/features/article.asp?editorialsid=365

Using DNSBLs
http://mcpmag.com/features/article.asp?editorialsid=366

A Thanks to Hormel
http://mcpmag.com/features/article.asp?editorialsid=367

Spam-Fighting Terminology
http://mcpmag.com/features/article.asp?editorialsid=368

If you’re interested in finding a Bayesian filter for your own e-mail, read Graham’s original article and then start with his list of products that implement this strategy at www.paulgraham.com/filters.html.

About the Author

Mike Gunderloy, MCSE, MCSD, MCDBA, is a former MCP columnist and the author of numerous development books.
