Mayuresh Kadu's techie blog: Hassled by SPAM? Here's a good solution... and it's open source!

Note: This article has been moved over from my previous blog for historical purposes.

I have several POP mail accounts which I operate both from home and from the office. Of late, most of these accounts have seen a radical growth in the number of spam mails I receive. According to my calculations, a whopping 87% of the mails I receive are spam.
Web-mail services like Yahoo!, Rediff, etc. have been effectively blocking a large part of spam. I have been searching for something equally effective that would help me solve, or at least reduce, my spam problems. On a personal basis, I tried my level best with writing Outlook mail rules, which did reduce the problem by about 5 percent. But I could not find a real solution.
Recently, I ran into this open source software: POPFile. I am happy to say that it has solved a large portion of my problem - about 95%!! Here's how it works (source: the POPFile documentation project):

It works as a proxy which sits between your mail client (Outlook, Eudora, Netscape, etc.) and your POP mail server.
Commands generated by the email client are passed via the software to the server.
As messages are retrieved, the software reads them and classifies them into user-defined categories (e.g. Personal, SPAM, Office, From my Mom, etc.).
The messages are marked accordingly and passed on to your email client.
The email client reads these marks (a custom mail header, in techie terms) and can be configured to act accordingly (see how to configure your email client).
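
To make the last step concrete, here is a minimal sketch of what a mail rule does with that custom header. It is only an illustration, not POPFile code: the header name X-Text-Classification is an assumption (check the headers of your own classified mail for the exact name), and the bucket-to-folder mapping and the folder_for function are hypothetical.

```python
from email import message_from_string

# Assumption: the proxy adds a header with this name whose value is the
# bucket name. Verify the exact name in your own messages.
CLASSIFICATION_HEADER = "X-Text-Classification"

# Hypothetical bucket-to-folder mapping, for illustration only.
FOLDER_FOR_BUCKET = {
    "spam": "Trash",
    "personal": "Personal",
    "office": "Office",
}


def folder_for(raw_message: str, default: str = "Inbox") -> str:
    """Return the folder a message should be filed in, based on its header."""
    msg = message_from_string(raw_message)
    bucket = msg.get(CLASSIFICATION_HEADER)
    if bucket is None:
        # No mark on the message: leave it in the default folder.
        return default
    return FOLDER_FOR_BUCKET.get(bucket.strip().lower(), default)


sample = (
    "From: someone@example.com\n"
    "Subject: Cheap watches\n"
    "X-Text-Classification: spam\n"
    "\n"
    "Buy now!\n"
)
print(folder_for(sample))
```

A real mail client does the same thing through its rules dialog: match on the header value, then move the message to a folder.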

For example, I have configured my mail client to deliver all mails except those categorized as SPAM to various folders. Mails marked SPAM are sent straight to the trash can. However, just to be safe, I scan through the trash can quickly to see if something got in there by mistake.
The POPFile manual says it uses an old mathematical theorem called Bayes' Theorem to sort out the mails accurately. However, the sorting does not come automatically: the software needs to be trained.

So the question "Trained!? How do you train a software!" comes next, right? Well, it's quite simple: when initially installed, it classifies all mails as regular mails. You need to spend some time using the POPFile application to tell the software to "re-classify" these mails as, say, spam, personal, office, etc. POPFile analyzes these mails by breaking them down into words, parsing attachments and filtering out HTML, and then uses the above-mentioned algorithm to sort the mails into what it calls "buckets" (you may think of them as logical slots).
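
For the technically inclined, the training idea can be sketched as a tiny naive Bayes classifier. This is a simplified illustration under my own assumptions, not POPFile's actual code: the BucketClassifier class, its crude tokenizer and the add-one smoothing are all hypothetical choices, and attachment parsing is left out entirely.

```python
import math
import re
from collections import Counter, defaultdict


class BucketClassifier:
    """Toy naive Bayes classifier that sorts text into named buckets."""

    def __init__(self):
        self.word_counts = defaultdict(Counter)  # bucket -> word -> count
        self.message_counts = Counter()          # bucket -> messages trained

    @staticmethod
    def tokenize(text):
        text = re.sub(r"<[^>]+>", " ", text)     # crude HTML tag stripping
        return re.findall(r"[a-z0-9']+", text.lower())

    def train(self, text, bucket):
        """Teach the classifier that this text belongs in this bucket."""
        self.word_counts[bucket].update(self.tokenize(text))
        self.message_counts[bucket] += 1

    def classify(self, text, default="unclassified"):
        """Return the most probable bucket, or default if untrained."""
        if not self.message_counts:
            return default
        vocabulary = set()
        for counts in self.word_counts.values():
            vocabulary.update(counts)
        total_messages = sum(self.message_counts.values())
        words = self.tokenize(text)
        best_bucket, best_score = default, -math.inf
        for bucket, n_messages in self.message_counts.items():
            counts = self.word_counts[bucket]
            # Add-one smoothing, with one extra slot for unseen words.
            denominator = sum(counts.values()) + len(vocabulary) + 1
            # Bayes' theorem in log form: log prior + sum of log likelihoods.
            score = math.log(n_messages / total_messages)
            for word in words:
                score += math.log((counts[word] + 1) / denominator)
            if score > best_score:
                best_bucket, best_score = bucket, score
        return best_bucket


clf = BucketClassifier()
clf.train("Win a free prize now, click here", "spam")
clf.train("Cheap pills, free offer, click now", "spam")
clf.train("Meeting agenda for the project review on Monday", "office")
clf.train("Please send the project report before the meeting", "office")
print(clf.classify("free prize offer"))           # spam
print(clf.classify("project meeting on Monday"))  # office
```

Each call to train is the equivalent of one "re-classify" click: the more examples a bucket has seen, the better the word statistics get.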

So the next time a similar mail comes through, POPFile automatically classifies it. Whatever it misses, you can point out. With me, this took a few days. The first day took quite a bit of time. But as time passed, I had to (re)classify fewer and fewer mails, and POPFile got more and more accurate. The application shows how accurate it is, based on how many mails you re-classify. It also shows a percentage breakdown of classifications (that's where the percentages I mentioned at the start come from).
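
As a rough sketch of how such figures can be worked out, assume accuracy is simply the share of classified mails that never had to be re-classified. That formula is my assumption, not something taken from the POPFile manual, and the function names and numbers below are made up for illustration.

```python
def accuracy_percent(total_classified: int, reclassified: int) -> float:
    """Share of messages that did not need re-classifying, in percent."""
    if total_classified == 0:
        return 0.0
    return 100.0 * (total_classified - reclassified) / total_classified


def bucket_breakdown(counts: dict) -> dict:
    """Percentage of all classified messages that landed in each bucket."""
    total = sum(counts.values())
    if total == 0:
        return {bucket: 0.0 for bucket in counts}
    return {bucket: 100.0 * n / total for bucket, n in counts.items()}


# Made-up numbers: 200 mails classified, 10 of them corrected by hand.
print(accuracy_percent(200, 10))  # 95.0
print(bucket_breakdown({"spam": 150, "personal": 30, "office": 20}))
```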

As of today, I have stopped teaching POPFile and it functions entirely without additional input. I do find a stray mail here and there, so it's almost like a competition between me and the software :)
So would I recommend that you use the software? A definite YES!! And did I tell you it's an open source project?! My congrats to the POPFile team on a job well done! You can visit the POPFile home at http://popfile.sourceforge.net/

Here is some more reading material for the technically inclined:

A Plan for Spam - This article describes the spam-filtering techniques used in the spamproof web-based mail reader that its author built to exercise Arc. An improved algorithm is described in Better Bayesian Filtering.
Better Bayesian Filtering - This article was given as a talk at the 2003 Spam Conference. It describes the work done to improve the performance of the algorithm described in A Plan for Spam, and plans for the future.

11 Dec 2003
