Using Statistics to Cause Spammers Pain
February 28, 2003
Update September 5, 2006: I was surprised to see this article come up
on reddit today. It seems it has been rediscovered,
and my failure to put a date on this article (since corrected) made it look
new. This was written in 2003 and I presented the idea and my experiences
with it at the 2004 Spam Conference. The project
itself is no longer active. The spam environment has changed significantly
since 2003, strengthening arguments against this approach. Still, it's a
fun idea that might work. :) Thanks for your interest. - Marty
Ever since Paul Graham published "A Plan for Spam" in August 2002 (prerequisite reading for this article), a lot of people have spent a great deal of time applying statistical methods to automatically classify email messages as spam. Spam identification is a hard problem because the definition of spam can differ from person to person. Messages erroneously classified as spam, known as "false positives," are pretty much intolerable, which further compounds the problem.

Statistical classifiers show great promise in this area because they automatically adjust to personal definitions of spam. The odd false positive shows up from time to time, but these become few and far between as the local statistical model continues to improve.

These classifiers already come in many forms. There are POP3 proxies, IMAP proxies, mail file processors, and even classifiers built directly into mail clients. I use POPFile (a naïve Bayesian classifier in a POP3 proxy) at home with great success. Some work better than others, but with a little training, they all seem to work pretty well. Unfortunately, they have a common shortcoming: _They don't cause the spammers any pain._ And we all want to cause spammers pain.

None of these classifiers can cause the spammers any pain, because the spammer is long gone by the time the classifier has the opportunity to process the message. What we need is a way to use the classifier against the spammer while the spammer is still connected.

All the classifiers I've seen operate on one message at a time. In essence, an email message goes into the classifier after it has been received by the mail server, and a "spamminess" indicator (a number between 0 and 1 representing the probability that the message is spam) is returned. I suggest that, with some modification, these holistic classifiers could be converted into incremental classifiers - enabling some pretty neat stuff.
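
To make the holistic/incremental distinction concrete, here is a minimal sketch of what an incremental classifier might look like. It is not taken from POPFile or any other existing classifier. It assumes a hypothetical table, `token_probs`, mapping each token to the probability that a message containing it is spam, and it combines every token seen so far in the naive Bayes style of Graham's essay (which uses only the most interesting tokens; this simplification uses all of them). The fallback value for unknown tokens and the clamping bounds are illustrative defaults.

```python
import math


class IncrementalClassifier:
    """Keeps a running spam probability, updated one token at a time."""

    def __init__(self, token_probs, unknown=0.4):
        # token_probs: hypothetical map of token -> P(spam | token).
        # unknown: assumed probability for tokens never seen in training.
        self.token_probs = token_probs
        self.unknown = unknown
        self.log_odds = 0.0  # log(p / (1 - p)); 0.0 means "no evidence yet"

    def feed(self, token):
        """Fold one more token into the estimate; return the new probability."""
        p = self.token_probs.get(token.lower(), self.unknown)
        p = min(max(p, 0.01), 0.99)  # no single token may be decisive
        self.log_odds += math.log(p) - math.log(1.0 - p)
        return self.probability()

    def probability(self):
        """Probability that the message seen so far is spam."""
        x = max(-50.0, min(50.0, self.log_odds))  # avoid overflow in exp()
        return 1.0 / (1.0 + math.exp(-x))


# Usage: feed tokens as they arrive and watch the running estimate move.
classifier = IncrementalClassifier({"viagra": 0.99, "meeting": 0.05})
for token in ["Viagra", "meeting"]:
    running = classifier.feed(token)
```

Summing log-odds is just a numerically convenient way of computing the usual product formula, and it is what makes the classifier incremental: each new token costs one table lookup and one addition, so an estimate is available at any point in the message.
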
An email message would be fed to the incremental classifier one piece (or token) at a time. After analyzing each piece, the classifier would respond with a classification of the message so far, providing a running probability that the message is spam. Now imagine that your SMTP server has been provided with an incremental classifier. You now have a real-time indicator that a message is likely spam while the spammer is connected.

The Idea

Now we have moved identification of spam to the time of its receipt. But how can the SMTP server best use this knowledge? I propose that the running probability from the classifier be used to throttle the connection with the offending server. If an incoming message looks like spam [1], the connection could be slowed dramatically, consuming the spammer's resources and wasting their time [2]. This would transform the server into a sort of dynamic tarpit, in which the spamminess of the incoming message affects the viscosity of the tar [3]. As the spam probability goes up, the socket speed goes down [4]...

If enough of these dynamic tarpits were in place (or just a handful were placed in the right places), the spammers' mail software would bog down, reducing the rate at which they can send messages, in turn reducing the fees they can charge their customers. If these tarpits were ubiquitous, they could completely change the economics of spam, creating a scarcity of bandwidth experienced only by spammers. Ouch. :)

Of course, this doesn't eliminate the problem presented by careless or uninformed sysadmins who provide open relays; spammers going through open relays wouldn't feel a thing. But the sysadmins who provide those services certainly would. It isn't too far-fetched to imagine that they might start to notice the effect on their server when all of its outbound SMTP connections are slowed to a crawl.
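
The throttling idea, in which the running spam probability sets the viscosity of the tar, could be sketched roughly as follows. This is only an illustration under stated assumptions, not TarProxy's code: `delay_for`, `receive_throttled`, the 0.9 threshold, the 30-second ceiling, and the linear ramp between them are all hypothetical choices, and `score` stands in for whatever incremental classifier supplies the running probability.

```python
import time


def delay_for(probability, threshold=0.9, max_delay=30.0):
    """Map a running spam probability to a pause, in seconds.

    Below the (assumed) threshold the connection runs at full speed, so
    good mail is unaffected. Above it, the pause grows linearly until it
    reaches max_delay at probability 1.0.
    """
    if probability <= threshold:
        return 0.0
    return max_delay * (probability - threshold) / (1.0 - threshold)


def receive_throttled(chunks, score, sleep=time.sleep):
    """Accept a message chunk by chunk, pausing after each one.

    chunks: iterable of bytes objects as they arrive from the sender.
    score:  hypothetical callable that takes the newest chunk and returns
            the running spam probability for the message so far.
    sleep:  injectable so the pauses can be observed without waiting.
    """
    received = []
    for chunk in chunks:
        received.append(chunk)
        sleep(delay_for(score(chunk)))  # the spammier, the slower
    return b"".join(received)
```

The threshold is what keeps the flow of good mail intact: a message that never looks spammy is never delayed. Any real ceiling would have to stay below the point at which sending servers time out and move on, since a dropped connection costs the spammer nothing.
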
TarProxy

So now I'm implementing such a tarpit as an SMTP proxy, dubbed "TarProxy." As a proxy, it won't require anyone to replace their mail server. My goals for TarProxy are that it must be:

An alpha release of TarProxy is expected to be available soon - probably in the third week of March.

How can you help?
Notes

[1] The corpus used by the classifier should either be the same corpus used on the client side (by applications such as POPFile), perhaps even based on the intended recipient of the message, or perhaps an aggregate corpus for all users of the mail server. Using a single shared corpus gives spammers an easy tool to use for testing their messages.

[2] Technically, what I'm describing does not require that the classifiers be incremental - a system could simply accept a message in its entirety, determine its probability of being spam, and wait for a computed amount of time before acknowledging the message. I suspect, however, that this approach would be easily detected by spamming software and therefore not waste enough of their time. Their mailers could give up quickly and move on to other victims. The altruism of holding on to connections as long as you can without impacting the flow of "good" mail is critical to this system's success, so any clues that you're running such a server should be kept to a minimum.

[3] The tarpit concept is not new; probably the best-known tarpit program is LaBrea, which provides incredibly slow connections to remote servers that have no business connecting to it, such as those that have performed a port scan to find open mail relays or to infect other machines with Code Red. LaBrea is different from what I'm describing here because it relies on the assumption that there's NO legitimate reason to connect to it, and does not base its behavior on the content of the data it receives. Because of its effectiveness in handling open relay searches and slowing down malicious code in general, it's a valuable tool that should be considered a complement to what's described in this document.

[4] A far simpler approach, and one that I suspect would be far less effective, would be to ungracefully terminate connections from mail servers sending messages that appear to be spam. This would create a huge false positive problem - you would never receive any emails that are erroneously marked as spam. And dropped connections don't inflict pain on the spammer.

Acknowledgements

Many thanks to Joe Parks for helping me bounce the concept around, and to John Mutchek for reviewing this document and making several excellent suggestions regarding TarProxy's implementation.