TechEncyclopedia

Bayesian Spam Filtering

An anti-spam algorithm that is now widely available in the form of plug-ins for Microsoft Outlook and other proprietary clients.

By Andy Dornan

03/03/2004, 2:00 PM ET

Bayesian filtering is an anti-spam algorithm that reached cult status during 2003. Beginning as a piece of code that could be added to open-source e-mail software, it's now widely available in the form of plug-ins for Microsoft Outlook and other proprietary clients. Users often see their junk e-mail eliminated entirely, prompting vendors of network-level anti-spam technology to take a closer look.

Network managers would be wise to do the same. But be warned: The networking vendors' implementations are unlikely to be as successful as those aimed at e-mail clients. The algorithm is designed to protect an individual inbox and doesn't easily scale to an entire SMTP server.

BLOCKING BY BLACKLIST

Traditional spam filters are based on a series of well-defined rules. The simplest is a blacklist that deletes messages matching certain criteria: for example, a subject line that contains the phrase "e-mail marketing" or an address at a nonexistent domain. Blacklists are often used by individual clients, but their greatest attraction is that they can easily be applied to an entire network. By placing a dedicated anti-spam appliance at the network edge, you can prevent a lot of spam from reaching your SMTP server. A more radical approach is to ignore message content and block by IP address, which has the added benefit of preventing spam from consuming expensive WAN bandwidth.
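As a rough illustration of the rule-based approach described above, the following Python sketch checks a message against a subject-phrase rule and an IP-address rule. The function name, the rule lists, and the blocked network (a reserved documentation range) are assumptions made for this example, not any vendor's actual rule set.

```python
import ipaddress

# Hypothetical rule set; the phrase and the network are illustrative only.
BLOCKED_SUBJECT_PHRASES = ["e-mail marketing"]
BLOCKED_NETWORKS = [ipaddress.ip_network("203.0.113.0/24")]  # reserved documentation range


def is_blacklisted(subject: str, sender_ip: str) -> bool:
    """Return True if the message matches any blacklist rule."""
    ip = ipaddress.ip_address(sender_ip)
    # Blocking by IP address ignores the message content entirely.
    if any(ip in network for network in BLOCKED_NETWORKS):
        return True
    # Otherwise fall back to a case-insensitive subject-line match.
    subject_lower = subject.lower()
    return any(phrase in subject_lower for phrase in BLOCKED_SUBJECT_PHRASES)
```

Every rule here is a yes-or-no test, which is what makes the approach easy to apply network-wide and also what makes it brittle when spam changes.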

The problem with blacklists is that they're mostly reactive. Lists need to be updated to match new spam, so they gradually get longer, potentially excluding an increasing amount of legitimate mail. As the volume of spam grows, some users replace blacklists with whitelists, filters that delete all mail except messages originating from select addresses. Unlike a blacklist, an effective whitelist can't easily be implemented network-wide because different users need to receive e-mail from different people.

Few users can afford to reject all mail from people they don't know, or even risk that the occasional message from a potential customer will accidentally be deleted. To reduce false positives, anti-spam appliances now offer the option of "graylists," filters that direct suspected spam into a specific directory instead of the main inbox. Because it isn't actually deleted until a user glances at the subject line, the spam still consumes bandwidth, storage space, and users' time.

BAYES' BASICS

Bayesian filters are designed to handle these shades of gray, dealing in probabilities rather than definite rules. They're named after Bayes' Theorem, the equation they use to calculate the likelihood that a message is spam. Unlike a rule-based filter, this calculation takes into account both the factors that mark a message as spam and the factors that mark it as legitimate e-mail, thus ensuring that evidence is weighed fairly.
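To make the calculation concrete, here is a minimal Python sketch of Bayes' Theorem applied to a single token. The function name, the parameter names, and the default prior of 0.5 are assumptions chosen for illustration; real filters differ in how they estimate and combine these values.

```python
def spam_probability(p_token_given_spam: float,
                     p_token_given_ham: float,
                     p_spam: float = 0.5) -> float:
    """P(spam | token) by Bayes' Theorem.

    p_token_given_spam: fraction of spam messages containing the token
    p_token_given_ham:  fraction of legitimate messages containing the token
    p_spam:             prior probability that any message is spam (assumed 0.5)
    """
    p_ham = 1.0 - p_spam
    numerator = p_token_given_spam * p_spam
    denominator = numerator + p_token_given_ham * p_ham
    if denominator == 0:
        # The token was never seen in either class, so it carries no evidence.
        return p_spam
    return numerator / denominator
```

With equal priors, a token seen in 20 percent of spam and 1 percent of legitimate mail gives 0.1 / 0.105, or roughly 0.95. Swapping the two rates gives roughly 0.05, which shows how evidence for legitimacy counts just as much as evidence for spam.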

When a Bayesian filter receives an e-mail message, it scans for particular tokens: strings of characters that are likely to be significant in determining whether or not a message is spam. A token is usually a word, but it doesn't have to be: The best Bayesian filters scan message headers as well as bodies, in which case tokens are likely to be IP addresses or domain names.
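One possible way to extract such tokens is sketched below in Python, using the standard library's email parser. The choice of headers, the header-name prefix on header tokens, and the regular expression are all assumptions for this example. The sketch handles only single-part plain-text messages.

```python
import re
from email import message_from_string

# Keeps dotted strings such as IP addresses and domain names together as one token.
TOKEN_RE = re.compile(r"[A-Za-z0-9][A-Za-z0-9.\-']*[A-Za-z0-9]|[A-Za-z0-9]")

# Hypothetical choice of headers to scan.
HEADERS_TO_SCAN = ("From", "Received", "Subject")


def tokenize(raw_message: str) -> list[str]:
    """Split a raw message into lowercase header and body tokens."""
    msg = message_from_string(raw_message)
    tokens = []
    for name in HEADERS_TO_SCAN:
        for value in msg.get_all(name, []):
            # Prefix header tokens so "report" in a subject line is tracked
            # separately from "report" in the body.
            tokens += [f"{name.lower()}:{t.lower()}" for t in TOKEN_RE.findall(value)]
    body = msg.get_payload()
    if isinstance(body, str):  # multipart bodies are skipped in this sketch
        tokens += [t.lower() for t in TOKEN_RE.findall(body)]
    return tokens
```

For a message with the header `Received: from mail.example.com ([192.0.2.10])`, the result includes `received:mail.example.com` and `received:192.0.2.10` alongside the ordinary words from the body.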

For every token, the Bayesian filter's database contains a probability value that measures how likely it is to appear in spam versus regular e-mail. The clever part is that these probabilities are themselves continuously recalculated, based on the e-mail and spam that a particular user receives. This means they quickly adapt to new spammer tricks and are automatically customized to fit each user's definition of spam. Users don't have to understand probability theory or fiddle with filter settings. All they have to do is train the filter by marking which incoming messages they want to receive and which they don't.
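The training loop described above can be sketched in Python as follows. This is a simplified naive Bayes filter written for illustration, not the algorithm of any particular product: the class and method names, the add-one smoothing, and the assumption of equal prior odds for spam and legitimate mail are all choices made for this example.

```python
import math
from collections import Counter


class BayesianFilter:
    """Toy per-user filter that relearns token probabilities from feedback."""

    def __init__(self):
        self.spam_counts = Counter()  # number of spam messages containing each token
        self.ham_counts = Counter()   # number of legitimate messages containing each token
        self.spam_messages = 0
        self.ham_messages = 0

    def train(self, tokens, is_spam):
        """Record one message that the user has marked as spam or legitimate."""
        unique = set(tokens)
        if is_spam:
            self.spam_counts.update(unique)
            self.spam_messages += 1
        else:
            self.ham_counts.update(unique)
            self.ham_messages += 1

    def token_probability(self, token):
        """P(spam | token), with add-one smoothing and equal priors (assumptions)."""
        p_token_spam = (self.spam_counts[token] + 1) / (self.spam_messages + 2)
        p_token_ham = (self.ham_counts[token] + 1) / (self.ham_messages + 2)
        return p_token_spam / (p_token_spam + p_token_ham)

    def spam_score(self, tokens):
        """Combine the per-token probabilities into one score between 0 and 1."""
        log_odds = 0.0
        for token in set(tokens):
            p = self.token_probability(token)
            log_odds += math.log(p) - math.log(1.0 - p)
        log_odds = max(-700.0, min(700.0, log_odds))  # keep math.exp in range
        return 1.0 / (1.0 + math.exp(-log_odds))
```

Because every call to `train` changes the counts, the same token can drift from neutral to strongly spam-like, or back again, as one user's mail changes. A token the filter has never seen scores exactly 0.5 and so has no effect on the result.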

The key to Bayesian filtering's success is that everyone's e-mail is different. While tokens signifying spam don't vary much between users, those signifying useful e-mail do. For example, they may include the names of a user's friends and family members, or technical terms related to a particular profession. To get around a customized Bayesian filter, a spammer must customize a message for every user, and by definition, spam isn't customized.

TOKEN STRING

Despite its advantages, Bayesian filtering isn't practical in every situation. The need for each user to have a customized set of tokens means it must be applied at the client level, not the server level. Bayesian filtering will save time and increase productivity for users, but it won't prevent Internet connections and SMTP servers from being overwhelmed. The algorithm's adherents argue that it could eventually help reduce the volume of e-mail traffic by making spam unprofitable, but others consider this prediction overly optimistic. Spam will be around for as long as idiots think it's profitable, even if the companies whose products it advertises generate more ill will than sales.

Nevertheless, anti-spam appliance vendors are beginning to incorporate Bayesian filters alongside their other techniques. Although their versions can't be customized for every user, they can be customized for a particular enterprise: A university might use the term "college degree" in its legitimate mail, whereas a drugstore would be more likely to use the word "Viagra." Whatever business you're in, your industry probably uses some jargon that spammers don't.

The feedback required for this customization must still come from clients, but it doesn't have to include every user on a network. Appliance manufacturer CipherTrust recommends that its Bayesian client be given only to those users whom you trust to make network-wide decisions. The weakness here is that the trusted users may not be a representative sample: Senior executives don't always discuss the same specialized topics as engineers and accountants.

Another approach to feedback, used by anti-spam vendor GFI, is to base the definition of legitimate mail on an SMTP server's outbound mail. Unless you're a spammer, incoming e-mails that resemble those you send are likely to be legitimate. However, not every network's incoming and outgoing mail are well-matched, so this can lead to false positives. It's a particular problem for users who lurk on e-mail discussion groups such as those operated by Yahoo, or have subscribed to genuine mailing lists. Network Magazine editors never send press releases, but receive thousands of them each week.

Appliance-based Bayesian filters are unlikely to be as effective as those installed on every client, but they could still be a valuable weapon in the arms race against spam. A better solution may be to rely on several lines of defense: Use blacklisting at the server, and Bayesian filters on every client.

The risk is that as Bayesian filters become more commonplace, spammers will learn to adapt. Bayesian filtering is very computationally intensive, which makes it vulnerable to Denial of Service (DoS) attacks. So even though spammers can't customize messages for each user, they could increase the computational load on a filtering system by appending many random tokens to the end of each message. While obfuscating spam in this manner might reduce its effectiveness as a sales tool, the spammers probably won't give up. They already obfuscate messages to get around ordinary blacklists, and respond to reduced effectiveness by sending even more e-mail.


From Spam to Security?

The early success of Bayesian filtering against spam has led some researchers to look at other potential applications. In theory, Bayesian filtering would be a good way to block banner and pop-up ads, but in practice no one has tried this. That's because such ads don't pose as serious a problem as spam, and most can already be blocked using simpler methods. Bayesian filtering is also a poor choice for content censorship because it would have to be trained to fit each reader to be effective.

Bayesian techniques show more promise in other security fields. They may eventually be useful in intrusion detection, helping to distinguish black hats from legitimate users, and DoS attacks from ordinary traffic spikes. And although designed to fight spam, existing Bayesian filters already shield users from the bombardment of e-mails generated by Windows worms and viruses. They can't replace anti-virus software, but they do prevent users from having to delete either multiple copies of a virus or the bounce messages that some gateways send to people whose e-mail addresses have been forged by a virus. Both contain very few tokens representative of legitimate e-mail, so they're filtered out in the same way as spam.


Andy Dornan is senior editor of Network Magazine. He can be reached at adornan@cmp.com.

