"Using Statistics to Cause Spammers Pain"

February 28, 2003

Update September 5, 2006: I was surprised to see this article come up on reddit today. It seems it has been rediscovered, and my failure to put a date on this article (since corrected) made it look new. This was written in 2003 and I presented the idea and my experiences with it at the 2004 Spam Conference. The project itself is no longer active. The spam environment has changed significantly since 2003, strengthening arguments against this approach. Still, it's a fun idea that might work. :) Thanks for your interest. - Marty

Ever since Paul Graham published "A Plan for Spam" in August 2002 (prerequisite reading for this article), a lot of people have spent a great deal of time applying statistical methods to automatically classify email messages as spam. Generally, spam identification is a hard problem to solve given that the definition of spam can differ from person to person. Messages erroneously classified as spam, known as "false positives," are pretty much intolerable, which further compounds the problem. Statistical classifiers show great promise in this area as they are able to automatically adjust to handle personal definitions of spam. The odd false positive shows up from time to time, but these become few and far between as the local statistical model continues to improve.

These classifiers already come in many forms. There are POP3 proxies, IMAP proxies, mail file processors, and even classifiers built directly into mail clients. I use POPFile (a naïve Bayesian classifier in a POP3 proxy) at home with great success. Some work better than others, but with a little training, they all seem to work pretty well. Unfortunately, they have a common shortcoming: They don't cause the spammers any pain.

And we all want to cause spammers pain.

None of these classifiers are capable of causing the spammers any pain because the spammer is long gone by the time the classifier has the opportunity to process the message. What we need is a way to use the classifier against the spammer while the spammer is still connected.

All the classifiers I've seen operate on one message at a time. In essence, an e-mail message goes into the classifier after it has been received by the mail server, and a "spamminess" indicator (a number between 0 and 1 representing the probability that the message is spam) is returned.

I suggest that, with some modification, these holistic classifiers could be converted into incremental classifiers - enabling some pretty neat stuff.

An email message would be fed to the incremental classifier one piece (or token) at a time. After analyzing each piece, the classifier would respond with a classification of the message so far, providing a running probability that the message is spam.
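As a rough illustration of the incremental idea (not TarProxy's actual code), the sketch below keeps running sums so that each new token updates the probability without reprocessing the message. It uses the combining formula from "A Plan for Spam", prod(p) / (prod(p) + prod(1 - p)), but as a simplification it folds in every token seen so far, where Graham's essay uses only the most interesting ones. The class name, the token table, and the neutral 0.5 default for unknown tokens are all assumptions made for the example.

```python
import math


class IncrementalClassifier:
    """Keeps a running spam probability, updated one token at a time."""

    def __init__(self, token_probs, unknown=0.5):
        # token_probs: hypothetical table mapping a lowercase token to
        # P(spam | token). Unknown tokens are treated as neutral (assumption).
        self.token_probs = token_probs
        self.unknown = unknown
        self.log_spam = 0.0  # running sum of log(p_i)
        self.log_ham = 0.0   # running sum of log(1 - p_i)

    def feed(self, token):
        """Fold one token into the running totals; return the new probability."""
        p = self.token_probs.get(token.lower(), self.unknown)
        p = min(0.99, max(0.01, p))  # clamp so log() never sees 0
        self.log_spam += math.log(p)
        self.log_ham += math.log(1.0 - p)
        return self.probability()

    def probability(self):
        """prod(p) / (prod(p) + prod(1 - p)), computed in log space."""
        diff = self.log_ham - self.log_spam
        if diff > 700:  # exp() would overflow; the message is clearly not spam
            return 0.0
        return 1.0 / (1.0 + math.exp(diff))


if __name__ == "__main__":
    table = {"viagra": 0.99, "free": 0.90, "meeting": 0.05}  # made-up values
    clf = IncrementalClassifier(table)
    for token in "Free viagra for the meeting".split():
        print(token, round(clf.feed(token), 3))
```

Working in log space keeps the running products from underflowing on long messages, and each call to `feed` costs the same no matter how much of the message has already arrived.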

Now imagine that your SMTP server has been provided with an incremental classifier. You now have a real-time indicator that a message is likely spam _while the spammer is connected_.

The Idea

Now we have moved identification of spam to the time of its receipt. But how can the SMTP server best use this knowledge? I propose that the running probability from the classifier be used to throttle the connection with the offending server. If an incoming message looks like spam [1], the connection could be slowed dramatically, consuming the spammer's resources and wasting their time [2]. This would transform the server into a sort of dynamic tarpit, in which the spamminess of the incoming message affects the viscosity of the tar [3]. As the spam probability goes up, the socket speed goes down [4]...
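One way to picture "as the spam probability goes up, the socket speed goes down" is a function that turns the running probability into a pause before the next chunk is read from the socket. The sketch below is only an illustration under assumed parameters: the function names, the 0.5 threshold, the linear ramp, and the 10-second cap are all hypothetical, and the cap would need to stay below the sender's timeout so that mail is still delivered eventually.

```python
def delay_seconds(spam_probability, threshold=0.5, max_delay=10.0):
    """Seconds to pause before reading the next chunk from the socket.

    Below the threshold the connection runs at full speed; above it the
    delay ramps linearly up to max_delay at probability 1.0.
    """
    if not 0.0 <= spam_probability <= 1.0:
        raise ValueError("spam_probability must be between 0 and 1")
    if not 0.0 <= threshold < 1.0:
        raise ValueError("threshold must be at least 0 and less than 1")
    if spam_probability <= threshold:
        return 0.0
    ramp = (spam_probability - threshold) / (1.0 - threshold)
    return max_delay * ramp


def total_delay(running_probabilities, threshold=0.5, max_delay=10.0):
    """Total time a sender is held up, given the probability after each chunk."""
    return sum(delay_seconds(p, threshold, max_delay)
               for p in running_probabilities)


if __name__ == "__main__":
    # A message that looks worse and worse as more of it arrives.
    print(total_delay([0.3, 0.6, 0.9, 0.99]))
```

A real proxy would sleep for `delay_seconds(...)` between socket reads. A steeper curve, or one that never returns to full speed once a connection has looked spammy, is equally plausible, which is why the throttle is worth keeping swappable.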

If enough of these dynamic tarpits were in place (or just a handful were placed in the right places), the spammers' mail software would bog down, reducing the rate at which they can send messages, in turn reducing the fees they can charge their customers. If these tarpits were ubiquitous, they could completely change the economics of spam, creating a scarcity of bandwidth experienced only by spammers.

Ouch. :)

Of course, this doesn't eliminate the problem presented by careless or uninformed sysadmins who provide open relays; spammers going through open relays wouldn't feel a thing. But the sysadmins who provide those services certainly would. It isn't too far-fetched to imagine that they might start to notice the effect on their servers when all outbound SMTP connections are slowed to a crawl.

TarProxy

So now I'm implementing such a tarpit as an SMTP proxy, dubbed "TarProxy." As a proxy, it won't require anyone to replace their mail server. My goals for TarProxy are that it must be:

  1. Free: It's no good unless it's everywhere... or at least in lots of places. TarProxy is Open Source Software released under a BSD-style license and available on SourceForge (see project page for details).
  2. Platform Independent: TarProxy is written in Java, so it runs on Linux, Windows, Solaris, OS X, and any other operating system with a Java Virtual Machine available.
  3. Pluggable: There's still a lot of experimentation taking place to determine what statistical functions and token extraction algorithms are most effective. It would be foolish to commit to any single one in TarProxy; indeed, the more such algorithms exist, the harder it will be for spammers to adapt. The classifier and tokenizer are therefore separate pluggable modules. Special adapter modules are available to communicate with existing classifiers that are not written in Java.
  4. Even More Pluggable: More experimentation will be necessary to determine what kind of throttling behavior is appropriate; in a nutshell, given a spam probability, how fast should the mail be accepted? The throttle is also a pluggable module.
  5. Even More Even More Pluggable: Because it is the connection that is tarpitted, and not an individual message, different strategies will be tested to assign a spamminess value to connections. Some possible strategies are: average spamminess/message, spamminess/byte, spamminess of current message only. These are also implemented as pluggable modules.
  6. Tolerant of False Positives: TarProxy by default never prevents legitimate mail from being received. This is achieved through the simple strategy of never preventing any mail from being received. The default configuration throttles connections down to a minimum that will still result in the eventual delivery of all mail. Of course, users are free to change this.
  7. Moderately Scalable: I don't expect TarProxy to be something the engineers at HotMail can just drop into their system, but I do intend for it to perform sufficiently well for production use at small and medium-sized companies.
  8. Smart Enough To Know It Should Back Off: TarProxy should never tarpit so many connections that it can't receive legitimate mail. It is therefore configurable to allow a set number of unthrottled connections. For example, if 100 simultaneous connections are permitted, and 10 unthrottled connections are permitted, then the 90 "spammiest" connections will be tarpitted. If all of its connections are consumed, the spammiest connections (above a threshold value) can be dropped as well.
  9. Verbose in its Logging: Given the pluggable approach described above, quite a few combinations of strategies are possible. TarProxy provides great detail in its logs in order to allow analysis of strategies.
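The back-off rule in goal 8 can be sketched as a small planning function. This is a hedged illustration, not TarProxy's implementation: the function name, the score dictionary, and both threshold values are assumptions. It reads the 100/10 example as a budget, so at most `max_connections - unthrottled` connections are tarpitted at once, spammiest first, and connections are dropped only when every slot is in use.

```python
def plan_backoff(scores, max_connections=100, unthrottled=10,
                 tarpit_threshold=0.5, drop_threshold=0.9):
    """Return (tarpit, drop): connection ids to slow down and ids to drop.

    scores maps a connection id to its current spamminess (0 to 1), however
    that value is assigned to a connection. Both thresholds are hypothetical.
    """
    if not 0 <= unthrottled <= max_connections:
        raise ValueError("unthrottled must be between 0 and max_connections")

    budget = max_connections - unthrottled  # most connections we may tarpit
    spammy = [c for c in scores if scores[c] > tarpit_threshold]
    spammy.sort(key=scores.get, reverse=True)  # spammiest first
    tarpit = spammy[:budget]

    drop = []
    if len(scores) >= max_connections:  # every slot is in use
        drop = [c for c in tarpit if scores[c] >= drop_threshold]
    return tarpit, drop


if __name__ == "__main__":
    # The example from the list: 100 slots, 10 kept unthrottled.
    scores = {"conn%d" % i: 0.8 for i in range(100)}
    tarpit, drop = plan_backoff(scores)
    print(len(tarpit), len(drop))
```

Keeping the decision in one pure function like this also makes it easy to log which connections were chosen and why, which fits the logging goal above.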

An alpha release of TarProxy is expected to be available soon - probably in the third week of March. How can you help?

  • TarProxy can certainly benefit from code review and contributions. There's a skeletal project page up now where updates will be published, and code will be available via SourceForge if you're willing and able to help in those areas.
  • If you've written a classifier, the project page will describe the interfaces expected of incremental classifiers by TarProxy. These interfaces are designed to make the adaptation of existing classifiers as easy as possible; if they present any problems for you, please let me know.
  • If you're not yet using any of the great client-side mail classification tools out there already, start doing so to build your corpus and provide those developers with feedback.

Notes

[1] The corpus used by the classifier should either be the same corpus used on the client side (by applications such as POPFile), perhaps even selected based on the intended recipient of the message, or an aggregate corpus for all users of the mail server. A single corpus shared across servers, on the other hand, would give spammers an easy tool for testing their messages.

[2] Technically, what I'm describing does not require that the classifiers be incremental - a system could simply accept a message in its entirety, determine its probability of being spam, and wait for a computed amount of time before acknowledging the message. I suspect, however, that this approach would be easily detected by spamming software and therefore not waste enough of their time. Their mailers could give up quickly and move on to other victims. The altruism of holding on to connections as long as you can without impacting the flow of "good" mail is critical to this system's success, so any clues that you're running such a server should be kept to a minimum.

[3] The tarpit concept is not new; probably the best-known tarpit program is LaBrea, which uses the technique of providing incredibly slow connections to remote servers with no business connecting to it, such as those that have performed a port scan to find open mail relays or to infect other machines with Code Red. LaBrea is different from what I'm describing here because it relies on the assumption that there's NO legitimate reason to connect to it, and does not base its behavior on the content of the data it receives. Because of its effectiveness in handling open relay searches and slowing down malicious code in general, it's a valuable tool that should be considered a complement to what's described in this document.

[4] A far simpler approach, and one that I suspect would be far less effective, would be to ungracefully terminate connections from mail servers sending messages that appear to be spam. This would create a huge false positive problem - you would never receive any emails that are erroneously marked as spam. And dropped connections don't inflict pain on the spammer.

Acknowledgements

Many thanks to Joe Parks for helping me bounce the concept around, and to John Mutchek for reviewing this document and making several excellent suggestions regarding TarProxy's implementation.