"Look out honey, 'cause I'm using technology..."

2008-11-01

A better anti-spam method

Recently an idea for combating spam occurred to me that I haven't come across before. It's very low tech, which I find attractive, and, done carefully, it should not result in false positives, which are the most important shortcoming of all the systems I've tried up until now.

What if everyone in the world had an additional email address that they would never ever use or give out, but that would be publicized in places where only address-harvesting bots would encounter it? A bit of handwaving here, but in its simplest form it should be enough to put both your real and your spam email address on your webpage, completely unobfuscated (perhaps even in mailto: links!), while stipulating, in a way that is clear to people but not easily machine readable, that nobody should send mail to the fake/spam one.

Then *any* message that arrives in the spam account is necessarily spam. You can now use that account to filter your real account, by removing messages whose body is identical, or similar enough, to one that arrived there. Again some handwaving on 'similar enough' (do this wrong, and voilà: false positives again), but you get the drift.
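To make the 'similar enough' handwaving a bit more concrete, here is a minimal sketch of the filtering step. It assumes message bodies are already available as plain strings, and it uses Python's standard difflib.SequenceMatcher as one possible similarity measure. The function names and the 0.9 threshold are made up for illustration and would need tuning on real mail.

```python
from difflib import SequenceMatcher


def normalize(body: str) -> str:
    """Lowercase and collapse whitespace so trivial differences are ignored."""
    return " ".join(body.lower().split())


def looks_like_honeypot_spam(body, honeypot_bodies, threshold=0.9):
    """Return True if body is identical or near-identical to any message
    that arrived in the never-used spam account.

    threshold is an assumed cut-off on SequenceMatcher.ratio(), which
    ranges from 0.0 (nothing in common) to 1.0 (identical).
    """
    candidate = normalize(body)
    for spam_body in honeypot_bodies:
        known = normalize(spam_body)
        if candidate == known:
            return True
        if SequenceMatcher(None, candidate, known).ratio() >= threshold:
            return True
    return False


def filter_inbox(real_bodies, honeypot_bodies, threshold=0.9):
    """Keep only the real-account messages that do not match honeypot mail."""
    return [
        body
        for body in real_bodies
        if not looks_like_honeypot_spam(body, honeypot_bodies, threshold)
    ]
```

Calling filter_inbox(real_bodies, honeypot_bodies) returns the messages that survive. Setting the threshold too low is exactly the 'do this wrong' case: legitimate mail that happens to resemble a spam message would be dropped.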

This kind of filter could be implemented by a provider, in which case the user would not have to do anything manually except put the fake address out there for the harvesters to find. Or it could be implemented client side, where the mail client fetches all mail from both accounts and does its thing.

And another thing: I also think the current Bayesian filters could be improved upon by recognizing more/different patterns than just lexical ones. I have an intuition that character-based Markov chains could catch a lot of the spam I get: I built a small script in college which could reliably distinguish quite a number of languages. That would get rid of all the mail addressed to me in languages I cannot read, which I would classify as spam. Taken further, this could also catch intentionally misspelt spam (to the detriment of poor spellers who want to send me legitimate mail), or (specific patterns of) HTML markup in mails, which would get rid of most of the other tricks one could use to show the word 'viagra' without actually writing it.
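As a rough illustration of the language-distinguishing idea (not the original college script, which isn't shown here), a character-level Markov model can be sketched as below. The class name, the choice of two characters of context, and the add-one smoothing are all assumptions made for the example.

```python
import math
from collections import Counter


class CharMarkovModel:
    """Character-level Markov model: predicts each character from the
    previous `order` characters, using add-one smoothing."""

    def __init__(self, order=2):
        self.order = order
        self.context_counts = Counter()  # how often each context was seen
        self.ngram_counts = Counter()    # how often context + next char was seen
        self.alphabet = set()

    def train(self, text):
        padded = " " * self.order + text.lower()
        for i in range(self.order, len(padded)):
            context = padded[i - self.order:i]
            char = padded[i]
            self.context_counts[context] += 1
            self.ngram_counts[context + char] += 1
            self.alphabet.add(char)

    def avg_log_prob(self, text):
        """Average log-probability per character; higher means the text
        looks more like the training material."""
        padded = " " * self.order + text.lower()
        vocab = len(self.alphabet) + 1  # one extra slot for unseen characters
        total, steps = 0.0, 0
        for i in range(self.order, len(padded)):
            context = padded[i - self.order:i]
            seen = self.ngram_counts[context + padded[i]]
            total += math.log((seen + 1) / (self.context_counts[context] + vocab))
            steps += 1
        return total / steps if steps else float("-inf")


def guess_language(text, models):
    """Return the name of the model that finds the text most predictable."""
    return max(models, key=lambda name: models[name].avg_log_prob(text))
```

Each language gets its own model, trained on sample text, and guess_language picks the one that best predicts the message. With realistic amounts of training text, a message in a language with no model should score poorly under all of them.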

I might have a go at this, to see if my intuition that these could be better predictors than word counts is correct. (I would train my Dutch email and my English email into separate 'ham languages', and everything showing entirely unpredicted character sequences would be unsure/spam. Marking things as spam could train a 'spam' language, so there could be both positive and negative indicators.)
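The ham/spam/unsure decision could look something like the sketch below. It is only an illustration under stated assumptions: the character-trigram counting, the VOCAB smoothing constant, the max_unseen cut-off of 0.5 and all function names are invented here, and none of it has been tried on real mail.

```python
import math
from collections import Counter

ORDER = 2    # assumption: two characters of context
VOCAB = 256  # assumption: smoothing constant, roughly "possible characters"


def train(texts):
    """Count character n-grams and their contexts over example messages."""
    ngrams, contexts = Counter(), Counter()
    for text in texts:
        padded = " " * ORDER + text.lower()
        for i in range(ORDER, len(padded)):
            ngrams[padded[i - ORDER:i + 1]] += 1
            contexts[padded[i - ORDER:i]] += 1
    return ngrams, contexts


def evaluate(model, text):
    """Return (average log-probability per character,
    fraction of n-grams never seen in training)."""
    ngrams, contexts = model
    padded = " " * ORDER + text.lower()
    log_total, unseen, steps = 0.0, 0, 0
    for i in range(ORDER, len(padded)):
        count = ngrams[padded[i - ORDER:i + 1]]
        context = contexts[padded[i - ORDER:i]]
        log_total += math.log((count + 1) / (context + VOCAB))
        unseen += count == 0
        steps += 1
    if steps == 0:
        return float("-inf"), 1.0
    return log_total / steps, unseen / steps


def verdict(text, ham_models, spam_model, max_unseen=0.5):
    """Pick the best-fitting 'language'; if even that one has never seen
    most of the character sequences, answer 'unsure' instead."""
    scored = [("ham", *evaluate(model, text)) for model in ham_models.values()]
    scored.append(("spam", *evaluate(spam_model, text)))
    label, _, unseen = max(scored, key=lambda item: item[1])
    return "unsure" if unseen > max_unseen else label
```

Here ham_models would be a dict such as {"dutch": train(dutch_mail), "english": train(english_mail)}, and spam_model would be train(marked_spam). The spam model supplies the negative indicator, the ham models the positive ones, and anything none of them recognizes falls into 'unsure'.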

2 comments:

Vincent said...

Interesting. This idea has already been used to add noise to a spam database (first hit on Google), but not to analyze.

Unknown said...

Yes, I know about this kind of honeypotting (as I think it's called,) but since most of the spamming is done by botnets, I don't think the signal to noise ratio of email address matters much to the spammers: they just hit every address, whether it exists or not, at very little to no extra cost. In a system where we *want* them to send spam to our fake address, this works to our advantage. :)