Some notes on installing SpamAssassin

SpamAssassin is free software that tags spam mail. It is very powerful, using fuzzy checks that can rely upon other spam tagging systems. Its documentation is not perfectly clear, so I set up this page to explain my experience with it, which may be useful to other people using it. References are to SpamAssassin version 2.42.

Keep in mind that:

--max-children
spamd limits the number of children it spawns. On a Sun Ultra60 with 512MB of memory, I found that 20 is a reasonable number, and it could probably be increased: the memory footprint of a single Perl interpreter for spamd is about 20MB, but the total memory occupied by several concurrent spamd processes is not much higher. In peak activity periods, with a load average around 15, more than 13 spamd processes running or sleeping, and many other amavis and sendmail processes active, the total memory used was around 350MB, plus about 200MB of swap.
required_hits
On a system serving about 500 users, I set required_hits to 5.7 for marking, and set a limit of 15 (checked by looking at X-Spam-Level with procmail) for automatically moving messages to a quarantine directory.
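
The quarantine decision above can be sketched as follows. This is only an illustration in Python, not the procmail recipe I actually use; it assumes that X-Spam-Level carries one asterisk per whole point of score, and the names spam_level and should_quarantine are made up for the example.

```python
from email import message_from_string

QUARANTINE_STARS = 15  # quarantine limit mentioned in the text


def spam_level(raw_message):
    """Count the asterisks in X-Spam-Level (0 if the header is missing)."""
    msg = message_from_string(raw_message)
    return (msg.get("X-Spam-Level") or "").count("*")


def should_quarantine(raw_message):
    """True when the message reaches the quarantine limit."""
    return spam_level(raw_message) >= QUARANTINE_STARS


if __name__ == "__main__":
    sample = "X-Spam-Level: " + "*" * 16 + "\nSubject: test\n\nbody\n"
    print(spam_level(sample), should_quarantine(sample))
```

Messages between 5.7 and 15 are only marked by SpamAssassin itself, so nothing needs to be done for them here.
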
Symbolic names
A dash is not allowed in symbolic names. By the way, the clear-terse-report-template command mentioned in the manual should be clear_terse_report_template instead.
Spamcop
The README says that you must pay for it. In fact, it is free, in the sense that no accounting is set up. The maintainers ask for a voluntary contribution if you use it for non-personal use.
RBL
The README says that it is free for personal use. In fact, you must subscribe to it, asking for free access for personal use. There are fees for sites with many users.
RBL checks
To see how they work, use `spamassassin -D rbl=-3'. This will display all debug messages related to RBL checking.
Whitelist globbing
The manual says that globbing is allowed. Note that only asterisks work, behaving as in the shell; question marks and characters in brackets do not.
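
A minimal sketch of asterisk-only globbing, to show what this means in practice. It is not SpamAssassin's code: the function names are invented, and the case-insensitive comparison is my assumption.

```python
import re


def whitelist_pattern(glob):
    """Build a regex in which only '*' is special; '?' and '[...]' are literal."""
    parts = [re.escape(part) for part in glob.split("*")]
    return re.compile(".*".join(parts), re.IGNORECASE)  # case-insensitivity is assumed


def matches(glob, address):
    """True when the whole address matches the asterisk-only glob."""
    return whitelist_pattern(glob).fullmatch(address) is not None


if __name__ == "__main__":
    print(matches("*@example.com", "joe@example.com"))    # True
    print(matches("jo?@example.com", "joe@example.com"))  # False
```

With this behaviour an entry such as jo?@example.com only matches an address containing a literal question mark.
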
Where do whitelist and blacklist look for addresses
Look at Mail::SpamAssassin::EvalTests::all_from_addrs. The way they work in 2.42 is: first look for a Resent-From header; if it is there, do not look further; else, look for all addresses in the following headers: From, Sender, Envelope-Sender, Resent-Sender, X-Envelope-From, Return-Path.
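
The lookup order just described can be sketched like this. It is a Python illustration of the behaviour as I read it, not a translation of the Perl source; the function name is simply borrowed from it.

```python
from email import message_from_string
from email.utils import getaddresses

# Headers consulted only when there is no Resent-From header.
FALLBACK_HEADERS = ["From", "Sender", "Envelope-Sender",
                    "Resent-Sender", "X-Envelope-From", "Return-Path"]


def all_from_addrs(raw_message):
    """Return the addresses that whitelist and blacklist entries are compared with."""
    msg = message_from_string(raw_message)
    values = msg.get_all("Resent-From")
    if not values:
        values = []
        for name in FALLBACK_HEADERS:
            values.extend(msg.get_all(name, []))
    return [addr for _name, addr in getaddresses(values) if addr]


if __name__ == "__main__":
    raw = "From: Joe <joe@example.com>\nReturn-Path: <bounce@example.net>\n\nhi\n"
    print(all_from_addrs(raw))  # ['joe@example.com', 'bounce@example.net']
```

Note the consequence for resent mail: once a Resent-From header is present, the original From address is no longer considered at all.
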
How does whitelist_from_rcvd look for the rcvd address
Look at Mail::SpamAssassin::EvalTests::_check_whitelist_rcvd. The from address is looked up as explained above. The rcvd address is searched for in all the Received headers; it does not allow globbing, and it must be followed by a numeric IP address in brackets.
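
A rough sketch of the rcvd check, under stated assumptions. The regular expression below is mine, not the one in the Perl source: I assume the domain may be separated from the bracketed IP by whitespace and an optional opening parenthesis, and that the comparison ignores case.

```python
import re


def rcvd_matches(received_headers, rcvd_domain):
    """True if any Received header has the literal domain followed by [a.b.c.d]."""
    pattern = re.compile(
        re.escape(rcvd_domain)                   # literal text: no globbing
        + r"\s*\(?\s*\[\d{1,3}(?:\.\d{1,3}){3}\]",  # numeric IP in square brackets
        re.IGNORECASE,
    )
    return any(pattern.search(header) for header in received_headers)


if __name__ == "__main__":
    headers = ["from mail.example.com (mail.example.com [192.0.2.10]) by mx.example.org"]
    print(rcvd_matches(headers, "example.com"))    # True
    print(rcvd_matches(headers, "*.example.com"))  # False
```
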
Auto whitelist
Be very careful with it. A single message tagged as spam from a correspondent may taint him for a long time, and mark his subsequent messages as spam. Bootstrap it by initially setting `auto_whitelist_factor 0', so that the database is built but not used. Then try enabling it by setting the factor to something nearer to 0.5, but first check the database using the program tools/check_whitelist.

The algorithm works using a database of entries. Each entry has a key formed by the From: address and the IP address (which IP?), and contains a TOTAL score and a COUNT number. The MEAN score is TOTAL/COUNT. The current algorithm works as follows:

  1. Compute the SCORE of the message without AWL (auto-whitelist)
  2. Compute AWL DELTA as (MEAN-SCORE)*auto_whitelist_factor
  3. Increment TOTAL by SCORE [see note below]
  4. Increment COUNT by one
  5. Set the final score of the message to SCORE+DELTA
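
The five steps above can be sketched in a few lines. This is an illustration only, not AutoWhitelist.pm: the AwlEntry class, the dictionary used as database and the apply_awl name are invented, and giving a zero DELTA to a sender with no history is my assumption.

```python
from dataclasses import dataclass


@dataclass
class AwlEntry:
    total: float = 0.0  # TOTAL: sum of the scores seen so far
    count: int = 0      # COUNT: number of messages seen so far


def apply_awl(db, key, score, factor):
    """Update the entry for key and return the final score (SCORE+DELTA)."""
    entry = db.setdefault(key, AwlEntry())
    if entry.count > 0:
        mean = entry.total / entry.count  # MEAN = TOTAL/COUNT
        delta = (mean - score) * factor   # step 2
    else:
        delta = 0.0                       # assumption: no history, no adjustment
    entry.total += score                  # step 3
    entry.count += 1                      # step 4
    return score + delta                  # step 5


if __name__ == "__main__":
    db = {}
    key = ("joe@example.com", "192.0.2.10")
    print(apply_awl(db, key, 2.0, 0.5))   # 2.0 (first message, no history)
    print(apply_awl(db, key, 10.0, 0.5))  # 6.0 (pulled halfway towards the mean of 2.0)
```

With a factor of 0 the final score is always SCORE, but TOTAL and COUNT are still updated, which is why that setting is useful for bootstrapping the database.
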

NOTE for version 2.42 only: in 2.42 the third step is in fact:

  3. Increment TOTAL by SCORE-5*COUNT

The logic behind the (-5*COUNT) part is flawed. Its effect is simply to lower the MEAN by about 2.5*COUNT, in an obfuscated way. This may be good for personal whitelists, because it gives a boost to long-term correspondents, as noted in a comment in the source, but it turns out to be a disaster for centralised auto whitelists, where a single spammer gains an advantage by sending more spam.

Versions 2.41 and 2.43 do not exhibit this problem. So do not use 2.42 with a centralised auto whitelist enabled, or just go to AutoWhitelist.pm and delete the offending line (the one with the 5 multiplier) and the related comment.

Apart from this bug, I am not yet able to tell whether auto whitelisting is desirable or not. I am currently keeping it disabled both for my personal spamassassin and for the centralised one I installed.
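
To see the size of the 2.42 bias, here is a small simulation of the MEAN kept for one sender. It is a sketch, not the real code: mean_after is an invented name, and I assume that COUNT in the 2.42 formula is the value before it is incremented for the current message.

```python
def mean_after(scores, buggy):
    """Return MEAN = TOTAL/COUNT after recording the given message scores."""
    total, count = 0.0, 0
    for score in scores:
        if buggy:
            total += score - 5 * count  # 2.42 variant of the third step
        else:
            total += score              # third step as in 2.41 and 2.43
        count += 1
    return total / count


if __name__ == "__main__":
    spam = [20.0] * 10             # ten messages, each scoring 20
    print(mean_after(spam, False))  # 20.0
    print(mean_after(spam, True))   # -2.5
```

After ten messages the stored MEAN is 22.5 points too low, that is 2.5*(COUNT-1), so a sender who keeps sending high-scoring mail ends up with a history that looks like ham.
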

tools/check_whitelist
It has a small printout bug. Edit the program and replace its printf statement with this corrected one:
         printf "% 8.1f %15s  --  %s\n",$t/$v,(sprintf "(%.1f/%d)",$t/$v,$v),$key;
    
Meta rules
What the manual does not say is that a meta rule can only depend on non-meta rules.
Spam phrases
Mail::SpamAssassin::PhraseFreqs::_check_phrase_freqs does its search on the Subject and body of the message. First it keeps only English alphabetical characters; everything else is taken as a word delimiter. Then it downcases everything. Then it deletes all words of length one and two, plus "and" and "the". Then it checks every pair of consecutive words and adds the score for each pair found in the spamphrase list. The total sum is divided by spamphrase_highest_score, then divided by 10, then normalized to a message 200 words long: if the number of words examined is greater than 200, the total sum is divided by (number of words / 200).
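
The phrase scoring just described can be sketched as follows. This is a Python illustration, not the Perl code: the phrase table is a made-up dictionary keyed by word pairs, and I assume that "number of words examined" means the words left after the short ones have been removed.

```python
import re

STOPWORDS = {"and", "the"}


def phrase_score(text, phrase_scores, highest_score):
    """Score text (Subject plus body) against a table of two-word phrases."""
    # Keep only English letters; anything else is a word delimiter. Then downcase.
    words = [w.lower() for w in re.split(r"[^A-Za-z]+", text) if w]
    # Drop words of length one and two, plus "and" and "the".
    words = [w for w in words if len(w) > 2 and w not in STOPWORDS]
    # Add the score of every pair of consecutive words found in the table.
    total = sum(phrase_scores.get(pair, 0.0) for pair in zip(words, words[1:]))
    result = total / highest_score / 10
    # Normalize to a message 200 words long.
    if len(words) > 200:
        result /= len(words) / 200
    return result


if __name__ == "__main__":
    table = {("make", "money"): 50.0, ("fast", "now"): 50.0}  # hypothetical scores
    print(phrase_score("Make MONEY fast, and do it now!", table, 100.0))  # 0.1
```

In the example "and", "do" and "it" are dropped before pairing, so "fast now" counts as a pair even though the two words are not adjacent in the original text.
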
Dialup codes man page
Here is the relevant man page section formatted in a readable way. Notice that there are very long lines in comments.
    dialup_codes { "domain1" => "127.0.x.y", "domain2" => "127.0.a.b" }
    Default:
    { "dialups.mail-abuse.org." => "127.0.0.3", # For DUL + other codes, we ignore that it's on DUL
      "rbl-plus.mail-abuse.org." => "127.0.0.2",
      "relays.osirusoft.com." => "127.0.0.3" };
    
    WARNING!!! When passing a reference to a hash, you need
    to put the whole hash in one line for the parser to read
    it correctly (you can check with "spamassassin -D <
    mesg")
    
    Set this to what your RBLs return for dialup IPs. It is
    used by dialup-firsthop and relay-firsthop rules so that
    you can match DUL codes and compensate DUL checks with a
    negative score if the IP is a dialup IP the mail
    originated from and it was properly relayed by a hop
    before reaching you (hopefully not your secondary MX :-).
    The trailing "-firsthop" is magic, it's what triggers
    the RBL to only be run on the originating hop. The idea
    is to not penalize (or penalize less) people who
    properly relayed through their ISP's mail server.
    
    Here's an example showing the use of Osirusoft and MAPS
    DUL, as well as the use of check_two_rbl_results to
    compensate for a match in both RBLs
    
    header   RCVD_IN_DUL       rbleval:check_rbl('dialup', 'dialups.mail-abuse.org.')
    describe RCVD_IN_DUL       Received from dialup, see http://www.mail-abuse.org/dul/
    score    RCVD_IN_DUL       4
    
    header   X_RCVD_IN_DUL_FH  rbleval:check_rbl('dialup-firsthop', 'dialups.mail-abuse.org.')
    describe X_RCVD_IN_DUL_FH  Received from first hop dialup, see http://www.mail-abuse.org/dul/
    score    X_RCVD_IN_DUL_FH  -3
    
    header   RCVD_IN_OSIRUSOFT_COM rbleval:check_rbl('osirusoft', 'relays.osirusoft.com.')
    describe RCVD_IN_OSIRUSOFT_COM Received via an IP flagged in relays.osirusoft.com
    
    header   X_OSIRU_SPAM_SRC  rbleval:check_rbl_results_for('osirusoft', '127.0.0.4')
    describe X_OSIRU_SPAM_SRC  DNSBL: sender is Confirmed Spam Source, penalizing further
    score    X_OSIRU_SPAM_SRC  3.0
    
    header   X_OSIRU_SPAMWARE_SITE rbleval:check_rbl_results_for('osirusoft', '127.0.0.6')
    describe X_OSIRU_SPAMWARE_SITE DNSBL: sender is a Spamware site or vendor, penalizing further
    score    X_OSIRU_SPAMWARE_SITE 5.0
    
    header   X_OSIRU_DUL_FH    rbleval:check_rbl('osirusoft-dul-firsthop', 'relays.osirusoft.com.')
    describe X_OSIRU_DUL_FH    Received from first hop dialup listed in relays.osirusoft.com
    score    X_OSIRU_DUL_FH    -1.5
    
    header   Z_FUDGE_DUL_MAPS_OSIRU rblreseval:check_two_rbl_results('osirusoft', "127.0.0.3", 'dialup', "127.0.0.3")
    describe Z_FUDGE_DUL_MAPS_OSIRU Do not double penalize for MAPS DUL and Osirusoft DUL
    score    Z_FUDGE_DUL_MAPS_OSIRU -2
    
    header   Z_FUDGE_RELAY_OSIRU rblreseval:check_two_rbl_results('osirusoft', "127.0.0.2", 'relay', "127.0.0.2")
    describe Z_FUDGE_RELAY_OSIRU Do not double penalize for being an open relay on Osirusoft and another DNSBL
    score    Z_FUDGE_RELAY_OSIRU -2
    
    header   Z_FUDGE_DUL_OSIRU_FH rblreseval:check_two_rbl_results('osirusoft-dul-firsthop', "127.0.0.3", 'dialup-firsthop', "127.0.0.3")
    describe Z_FUDGE_DUL_OSIRU_FH Do not double compensate for MAPS DUL and Osirusoft DUL first hop dialup
    score    Z_FUDGE_DUL_OSIRU_FH 1.5