11 Dec 2003

Hassled by spam? Here's a good solution, and it's open source!

Note: This article has been moved over from my previous blog for historical purposes.

I have several POP mail accounts that I use from both home and office. Of late, most of these accounts have seen a radical growth in the amount of spam I receive. According to my calculations, a whopping 87% of the mails I receive are spam.
Web-mail services like Yahoo!, Rediff, etc. have been effectively blocking a large part of spam. I have been searching for something equally effective that would help me solve, or at least reduce, my spam problem. On a personal basis, I tried my level best with Outlook mail rules, which did reduce the problem by about 5 percent. But I could not find a real solution.
Recently, I ran into this open source software - POPFile. I am happy to say that it has solved a large portion of my problem - about 95%! Here's how it works (source: the POPFile documentation project):
  1. It works as a proxy which sits between the mail client (Outlook, Eudora, Netscape, etc.) and your POP mail server.
  2. Commands generated by the email client are passed via the software to the server.
  3. As messages are retrieved, the software reads them and classifies them into user-defined categories (e.g. Personal, SPAM, Office, From my Mom, etc.).
  4. The messages are marked accordingly and passed on to your email client.
  5. The email client reads these marks (a custom mail header, in techie terms) and can be configured to act accordingly (see how to configure your email client).
For example, I have configured my mail client to deliver all mails except those categorized as SPAM to various folders. Mails marked SPAM are sent to the trash can. However, just to be safe, I scan through the trash quickly to see if something got in there by mistake.
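To make the header-based routing concrete, here is a minimal sketch in Python of what such a mail rule boils down to. It assumes the mark is carried in a header named X-Text-Classification (to my knowledge POPFile's default, but treat the name as an assumption), and the bucket and folder names are purely hypothetical examples.

```python
from email import message_from_string

# Assumed header name; check what your POPFile installation actually adds.
HEADER = "X-Text-Classification"

# Hypothetical mapping from bucket name to mail folder.
FOLDER_FOR_BUCKET = {
    "spam": "Trash",
    "personal": "Personal",
    "office": "Office",
}


def folder_for(raw_message: str, default: str = "Inbox") -> str:
    """Return the folder a message should be filed in, based on its mark."""
    msg = message_from_string(raw_message)
    bucket = (msg.get(HEADER) or "").strip().lower()
    # Unmarked mail, or mail in an unknown bucket, stays in the default folder.
    return FOLDER_FOR_BUCKET.get(bucket, default)
```

A real mail client does the same thing through its rules dialog: match on the header, then move the message.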
The POPFile manual says it uses an old mathematical theorem called Bayes' theorem to sort the mails accurately. However, the sorting does not come automatically - the software needs to be trained.

So the question "Trained!? How do you train a software!" comes next, right? Well, it's quite simple. When initially installed, it classifies all mails as regular mails. You need to spend some time using the POPFile application to tell the software to "re-classify" these mails as, say, spam, personal, office, etc. POPFile analyzes these mails by breaking them down into words, parsing attachments and filtering out HTML, and then uses the above-mentioned algorithm to sort the mails into what it calls "buckets" (you may think of them as logical slots).
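For the technically curious, here is a rough sketch in Python of how a word-based Bayesian sorter with trainable buckets can work. This is not POPFile's actual code: the class name, the tokenizer and the smoothing choice are my own assumptions, and it ignores attachments and HTML entirely.

```python
import math
import re
from collections import Counter, defaultdict


class BucketClassifier:
    """Toy naive Bayes classifier; an illustration, not POPFile's implementation."""

    def __init__(self):
        self.word_counts = defaultdict(Counter)  # bucket -> word -> count
        self.message_counts = Counter()          # bucket -> number of trained mails

    @staticmethod
    def tokenize(text):
        # Break the mail down into lowercase words.
        return re.findall(r"[a-z0-9']+", text.lower())

    def train(self, bucket, text):
        # "Re-classifying" a mail amounts to adding its words to the bucket.
        self.word_counts[bucket].update(self.tokenize(text))
        self.message_counts[bucket] += 1

    def classify(self, text, default="unclassified"):
        if not self.message_counts:
            return default  # nothing has been trained yet
        vocabulary = {w for counts in self.word_counts.values() for w in counts}
        total_messages = sum(self.message_counts.values())
        words = self.tokenize(text)
        best_bucket, best_score = default, -math.inf
        for bucket, trained in self.message_counts.items():
            counts = self.word_counts[bucket]
            # Add-one smoothing, with one extra slot reserved for unseen words.
            denominator = sum(counts.values()) + len(vocabulary) + 1
            # Bayes' theorem in log form: log prior + sum of log word likelihoods.
            score = math.log(trained / total_messages)
            for word in words:
                score += math.log((counts[word] + 1) / denominator)
            if score > best_score:
                best_bucket, best_score = bucket, score
        return best_bucket
```

Training is just calling train() with the right bucket for each mail the classifier got wrong; the more mails it has seen, the better the word statistics become.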

So the next time a similar mail comes through, POPFile automatically classifies it. Whatever it misses, you can point out. With me, this took a few days. The first day took quite a bit of time. But as time passed, I had to (re)classify fewer and fewer mails, and POPFile got more and more accurate. The application shows how accurate it is, based on how many mails you re-classify. It also shows a percentage breakdown of classifications (that's where the percentages I mentioned at the start come from).
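The accuracy figure can be thought of as the share of mails that did not need re-classifying. Here is a tiny Python sketch of that arithmetic; the function name and the sample numbers are made up for illustration, and POPFile's own statistics may be computed differently.

```python
def accuracy_percent(classified: int, reclassified: int) -> float:
    """Percentage of classified mails that the user did not have to correct."""
    if classified <= 0:
        raise ValueError("need at least one classified mail")
    if not 0 <= reclassified <= classified:
        raise ValueError("reclassified must be between 0 and classified")
    return 100.0 * (classified - reclassified) / classified
```

For example, correcting 10 out of 200 mails works out to accuracy_percent(200, 10), which is 95.0.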

As of today, I have stopped teaching POPFile and it functions entirely without additional input. I do find a stray mail here and there, so it's almost like a competition between me and the software :)

So would I recommend that you use the software? A definite YES! And did I tell you it's an open source project?! My congrats to the POPFile team on a job well done! You can visit the POPFile home at http://popfile.sourceforge.net/

Here is some more reading material for the technically inclined: 
  • A Plan for Spam - This article describes the spam-filtering techniques used in the new spamproof web-based mail reader that the authors are building to exercise Arc. An improved algorithm is described in Better Bayesian Filtering.
  • Better Bayesian Filtering - This article was given as a talk at the 2003 Spam Conference. It describes the work done to improve the performance of the algorithm described in A Plan for Spam, and plans for the future.


West Nile virus caught on camera (say cheese man!)

Note: This article has been moved over from my previous blog for historical purposes

[Image: WNV (heavy - may take some time to load)]
The West Nile Virus (WNV) [What is West Nile Virus], first isolated from a febrile adult woman in the West Nile District of Uganda in 1937, is the cause of a serious seasonal epidemic in North America that flares up in summer and continues into the fall. It is generally spread by mosquitoes, and can also be passed on through transfusions and transplants. The year saw 4156 reported cases in America alone. The most serious manifestation of WN virus infection is fatal encephalitis (inflammation of the brain) in humans and horses, as well as mortality in certain domestic and wild birds. (detailed info)
In one of the first close looks at WNV, researchers at Purdue University have recently released a three-dimensional image of the virus, which looks like what they call a bumpy gum ball. The Purdue team found the virus to be about two millionths of an inch wide (about five millionths of a centimetre, or roughly 50 nanometres) - small even in the minuscule realm of viruses. Here's how it looks!

Understanding the precise orientation of its proteins - the Purdue team's next goal - could speed the development of drugs to thwart its ability to infect cells in birds, humans, horses and other animals.

10 Dec 2003

"GPL designed so people receive value of GPL-copyrighted works in return for their own contributions" says Linus

Note: This article has been moved over from my previous blog for historical purposes

Linus Torvalds recently penned (err.. make that keyed in) his reply to the open letter written by Darl McBride (CEO, SCO Group) on SCO's website, which argued that Linux hackers were threatening to undermine copyright protections provided by US and European laws.

Several voices (Matt Hines, Robert McMillan, Robyn Weisman, etc.) have already come up, stating various opinions about the affair. SCO may have gained itself an investment of $50 million, but as Tom Tulli puts it, "The Constitution will survive this dispute. The big question is whether as much can be said for SCO Group investors."

5 Dec 2003

Michael Smith Genome Sciences Centre uses Linux to sequence SARS virus in 5 days !

Note: This article has been moved over from my previous blog for historical purposes
 

I have been tinkering with Linux for a long time now. It started with a local magazine (PCQuest) distributing free Linux CDs (again a first in India). A few months later I was to learn of and join PLUG (Pune Linux Users Group) - initially as a passive listener. It was during one of PLUG's online meetings (we used to meet on IRC then) that I learnt of a certain linuxjournal.com.
I have been hooked on the magazine ever since. It's been a long time since then. Today, I am an avid supporter and advocate of OSS. I not only actively use open source software, but have also put in some humble contributions of my own.

This particular article ("Sequencing the SARS Virus") in the Nov 2003 issue caught not only my attention, but also my fancy. The author writes about the B.C. lab's use of Linux to sequence the SARS virus in 5 days! Here's the group's site (see: "Around the lab"), which displays an annotated image of the lab with the Linux equipment that was used for the feat!
The article describes in detail how open source software such as Linux, Python, MySQL, LIMS, Perl, the Sockeye 3D genome viewer and many more were used to sequence the genome of the SARS Tor2 coronavirus (see map) in 5 days.

Reading that article was enough to convince me that not only is open source here to stay, it has already made inroads where commercial software is yet to even step in. I also spent some time scouring the net for more such examples. I wasn't surprised to find many more - though not so famous!

I have long known Perl to be at the forefront of bioinformatics. Today countless books and libraries have been written on and in Perl on the subject, and several universities have long been using Perl for this purpose. See this article, "How Perl saved the human genome project" by Lincoln Stein (website).

I personally believe that open source, by its very nature, is realizing a potential that even the likes of Richard Stallman (of the famed GNU) had never imagined. I would even dare say that we have not even scratched the surface.

My hat's off to the folks at the Michael Smith Genome Sciences Centre, not only for a job well done but also for how it was done. Way to go folks! I am already passing on the link to your work to a few of my friends working in this area. Most of them have already shown interest in using Linux. I am hoping that your example has put it into their "do-able" category of things.