May 24, 2004

It was bound to happen...

Spammers have managed to pollute my Bayesian filters. Dammit. And not in a safe way.

Just because, I decided to take a look at my spam folder. Normally I don't, I just unconditionally delete the damn thing. Today... well, I figured I'd look. Couldn't hurt, right? Only 10K messages, about 91M of crud, and it's not like there are words I've never seen in it.

Unfortunately stuff in there did upset me. Not because it was foul, obscene, tasteless, or lacking in any sort of aesthetic. Nope, it's because there was mail I recognized. For some reason the autolearning that SpamAssassin does picked up on some regular mailing list mail and mail from people I know. On the first page. Making me very unhappy, because now I need to wade through all the crap. (Well, only 6.5K more messages, though now I need to clean my shoes) Damn.

Topping it off, for some reason SpamAssassin thinks there's exactly one message in my training ham folder. (There isn't, there's about 25) And 1 message in the training spam folder (nope--about 2K). So something's busted there too. Gah. I hate software.

So, if you've not heard from me, don't be too surprised, and try again--you might've gotten caught by a rogue Bayesian network.

Posted by Dan at May 24, 2004 06:05 PM
Comments

The "1 message in training ham folder" thing is because it's already learned from those (probably), so there's only 1 it hasn't already learned.

Bad news, though... I'd definitely suggest learning more ham, as at least 200 messages is about right for good manual learning -- and auto-learning is proving tricky these days with the amount of Bayes poisoning the spammers are using.

Posted by: Justin at May 25, 2004 02:20 AM

Auto learning sounds like a pretty bad idea for bayes filters.

I've had great (make that _amazing_) success with the spambayes tools. And soon I'll be able to run them on parrot, right? :-)


- ask

Posted by: Ask Bjørn Hansen at May 25, 2004 02:58 AM

For the one message only problem, did you pass --mbox to sa-learn? Otherwise it will assume it's only being given a single message.

The essential tip: configure SpamAssassin to put the spam score at the beginning of the subject of recognized spam, and then sort your spam folder by subject. This way the lowest scoring messages will float to the top, making it easy to ignore the morass of certain spam. My .spamassassin/user_prefs says

subject_tag (Score: _HITS_)

If your MUA has a powerful enough scoring feature, you might be able to do scoring by the SpamAssassin headers, and then sort by score. mutt does not, unfortunately, so I'm using the aforementioned tagging and subject sorting.
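Where the mail client cannot sort on the score itself, the same ordering can be sketched outside it. What follows is a minimal Python sketch, not anything SpamAssassin ships: it assumes the score shows up in the X-Spam-Status header as hits=N (or score=N), and the names spam_score and sort_by_score are made up for illustration. It compares scores as numbers, so 5.2 sorts ahead of 12.3 regardless of how the text is padded.

```python
import re
from email import message_from_string

# Assumed header shape: "X-Spam-Status: Yes, hits=12.3 required=5.0 ..."
# (the "score=" spelling is accepted too; both are assumptions about the header text).
SCORE_RE = re.compile(r"\b(?:hits|score)=(-?\d+(?:\.\d+)?)")


def spam_score(msg):
    """Return the score found in X-Spam-Status as a float, or None if absent."""
    match = SCORE_RE.search(msg.get("X-Spam-Status", ""))
    return float(match.group(1)) if match else None


def sort_by_score(messages):
    """Order messages lowest score first; untagged messages go to the very top."""
    def key(msg):
        score = spam_score(msg)
        # False sorts before True, so messages without a score come first.
        return (score is not None, score if score is not None else 0.0)
    return sorted(messages, key=key)


# Hypothetical sample messages, just to exercise the functions.
samples = [
    "Subject: obvious junk\nX-Spam-Status: Yes, hits=12.3 required=5.0\n\nbody\n",
    "Subject: mailing list post\nX-Spam-Status: Yes, hits=5.2 required=5.0\n\nbody\n",
    "Subject: untagged\n\nbody\n",
]
ordered = sort_by_score([message_from_string(s) for s in samples])
subjects = [m["Subject"] for m in ordered]
```

Putting untagged messages first is a design choice: anything in the spam folder without a score is worth a look before the certain spam.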

Posted by: Aristotle Pagaltzis at May 25, 2004 08:16 AM

The one message in the ham folder did turn out to be a missing --mbox switch on the sa-learn command line. Weird, as I don't remember ever having needed it before. (But, on the other hand, it's been quite a long time since I explicitly trained the filters)

I suppose I could have spamassassin mush the subject line to make the spam folder easier to deal with, but given how much of it there is, and how much has characters that give my terminal fits when I look at it (I use pine to deal with my spam folder as it's lower-overhead than grabbing it via Eudora) I'm not sure that I'd see it often enough for it to be useful. Might be worth fiddling with, though.

Posted by: Dan at May 25, 2004 09:37 AM

That cheap subject trick has saved me hours upon hours of spam box sifting. I can't imagine living without it anymore. The first time you do it with several false positives in the box and they just magically appear in a single cluster at the top is like a revelation.

Posted by: Aristotle Pagaltzis at May 26, 2004 08:34 AM