
Antispam filter BayesIt!

Russian version here >>>

ATTENTION! This document is deprecated and applies to the plugin only up to version 0.4gm. Since the BayesIt add-on has been included in the official distribution of recent versions of The Bat!, the information on installing and setting up the plugin (ver. 0.5.x and newer) is now hosted on the RitLabs website (see the "Solution" section of the official RitLabs site, http://www.ritlabs.com).

The antispam filter BayesIt is an add-on for the e-mail program The Bat!. It uses a statistical mail filtering technology based on “Bayesian” algorithms. You can read more about this technology in the article “Filtering of the spam using Bayes method” (in Russian) or in the article “A plan for spam” by Paul Graham.

(If you've already read this section and only want to check for updates, you can go directly here. Please note, however, that the text of this section changes from time to time, and if a question arises that is already answered here, I'll simply refer you to the text instead of answering personally.)

Starting with The Bat! 1.63 Beta 7, it became possible to plug in an external antispam filter, although the options were very limited. With the release of The Bat! 2.00 these possibilities were greatly enhanced: filters can now perform operations that previously had to be compiled as a standalone application.

BayesIt! is one such filter, and it can filter junk mail effectively. I avoid the word “spam” here, because what counts as “junk mail” differs from person to person.

Some technical limitations of The Bat! Antispam plugins

BayesIt works with The Bat! through an internal API, and it knows nothing about mail protocols or the tricks spammers might use during delivery. It simply receives an e-mail from The Bat! and returns a "grade" for that e-mail, which expresses its “junkiness” on a scale from 0 to 100. The filter doesn’t know where the letter came from or how it was received (via POP3, via IMAP4 over a TLS connection, or simply by “local delivery”); all of that is handled by The Bat!. The main advantage of this approach is that all e-mail coming into your inbox can be filtered regardless of the delivery method. On the other hand, the filter's only task is to return a numerical value. It can’t report anything else and it can’t change the e-mail, so it is impossible to add something like “***SPAM***” to the message’s subject line. This is the first limitation of this type of API.
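The contract described above is narrow enough to sketch in a few lines. The Python below is only an illustration: the function names are hypothetical, and the combination rule is the one from Paul Graham's “A plan for spam”, which is assumed here and not confirmed to be what BayesIt does internally. The real plugin talks to The Bat! through its native plug-in API, not through Python.

```python
from math import prod


def combine(probabilities):
    """Combine per-token junk probabilities into one value (Graham's rule, assumed)."""
    if not probabilities:
        return 0.5  # nothing is known about the message: stay neutral
    junk = prod(probabilities)
    good = prod(1.0 - p for p in probabilities)
    if junk + good == 0.0:
        return 0.5  # contradictory extremes (a 0.0 and a 1.0): stay neutral
    return junk / (junk + good)


def grade(token_probabilities):
    """Return the only thing the filter may report: an integer from 0 to 100."""
    return round(100 * combine(token_probabilities))
```

Whatever happens inside, the host program sees nothing but the integer, which is why the filter cannot tag subjects or explain its decision.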

The second limitation follows from the fact that The Bat! asks the filter for a grade only after the letter has been delivered: The Bat! downloads the e-mail from the server and only then determines its "grade". This means the filter cannot take part in “selective downloading” directly. In other words, an e-mail can be identified as “junk” only after it has been received, not while it still resides on the mail server.

Based on the returned "grade", The Bat! can either move the e-mail to the “Junk mail” folder or simply delete it (at least in the current 2.00 version of The Bat!). The first option has some problems caused by the internal nature of the “Junk folder”. It is a folder defined internally by The Bat!, and it is created automatically the first time a "junk e-mail" needs to be moved there. It works exactly like the “Quarantine” folder, which is used when an anti-virus plugin finds a virus in an incoming e-mail. The settings for these folders are very limited: you can only define the location and the expiration period. The main inconvenience is that junk mail is not marked as read when it arrives in this folder. There is also an annoying sound when an e-mail arrives in this folder, although it can be switched off in the preferences. Unfortunately, letters in the "Junk Folder" can only be processed manually: none of the internal tools for handling e-mails, including sorting and filters, currently work with this folder. It would be great if future versions of The Bat! allowed these e-mails to be marked as read on arrival.

All of these limitations stem from the antispam plugin API of The Bat!.

Installation of the BayesIt! antispam plugin

The plug-in is intended for use with The Bat! ver 2.00 and up. For the previous beta-builds of the 1.63 series, you can use the previous version of the plugin (0.3a).

The plugin is updated constantly! I fix the bugs that are found, add new features and complete unfinished parts. The link (URL) to the latest version and a short list of changes can be found at the bottom of the page.

Run the distribution file after you download it. You will be prompted to select the path where BayesIt will be installed ("Program Files/BayesIt" by default) and the path where BayesIt will store its statistical reference bases (the default is %APPDATA%/BayesIt on Windows NT/2K/XP and Program Files/Bayesit, the same place where BayesIt is installed, on Windows 9x/Me).

The "smaller" distribution is a simple ZIP archive containing the BayesIt! files. If you use this distribution, you will need to run "install.bat" after you unpack the archive.

Then run The Bat!, and go to menu “Options/Preferences”. Select the “Anti-Spam” option and press the “Add” button.

In the dialog, add the file bayesit.tbp (it should be at “Program Files/Bayesit/Bayesit.tbp”). When you select the plugin, the initial training wizard starts automatically. It is intended for the initial configuration of the filter.

Initial training consists of pointing the filter at some mail folders and telling it what kind of mail each folder contains. The filter processes this data and builds the statistical reference base used to grade further e-mail. The only requirement is that you MUST have a certain number of letters of BOTH categories (“junk mail” and "good"); otherwise the filter won't work! It needs some 'junk e-mail' in order to learn what 'junk' actually IS. It is assumed that “good” and “junk” letters are sorted into different folders; usually your “good” e-mails are spread over several folders, while “junk mail” is simply dumped together into one folder or the trash. Only incoming mail needs to be processed, so it is not necessary to process folders such as “Sent” or “Outbox”.

First, select the folders you are going to process. To do this, click the button "Select The Bat! folders..." (an alternative way of initial setup also exists).

In the window that opens you can see the tree structure of your mailboxes, similar to the folder pane in The Bat! itself. To the left of each folder icon there is a flag that marks the folder as “junk”, "ok e-mail" or "empty"; it can be toggled with the mouse.

The empty flag means that the folder will not be processed. “Trash”, “Outbox” and “Sent” folders will not be processed by default (of course, you can select them, if you want). IMPORTANT: You DO NOT want to identify the Trash folder as "junk e-mail", as you would then potentially be telling the filter that good mail you deleted should be considered "junk e-mail" !!

The folders containing “good” letters are marked by a green flag. By default most of the folders are marked with this flag, except the Junk Folder.

The folders containing “junk e-mail” (your own definition) are marked with the red flag. By default all folders defined as “Junk e-mail” in all mailboxes should be marked by this flag. There should not be "good" mail in this folder, ONLY e-mail you want designated as "junk".

In the initial installation of the filter (when you have no "Junk mail” folders) you need to mark the folders which actually contain junk e-mail, and, if you wish, deselect folders which you don’t want to include into the filter’s statistical base. You should mark "Junk mail" folders that contain ONLY "junk mail". If necessary, create your own junk mail folder and move junk mail into this folder. Note: The more junk mail you provide in the initial set-up phase, the better the filter will work catching new junk mail.

If there are folders which the filter couldn’t find by itself (for example, a common folder placed in a special location) and you wish to add them, use the button “Add folder…” and select the file with the .tbb extension.

After you have selected all the folders you want to process, press the “Ready” button to return to the set-up wizard.

In the lower part of the window you will see the number of “good” and “junk” folders; you can edit or clear these lists if you wish.

Pressing the “Next” button at the bottom of the wizard takes you to the next step:

Here you can set up a special technology called “partial transliteration”. It is applied to e-mail in languages other than English (for example, Russian) and is intended to recognize words written in “mixed” mode, partly in the local alphabet and partly in Latin characters (for example, the Latin “A” and the Russian “А” look the same, and there are many such cases).

For this technology to work, the filter needs to know, first, the user's “local” alphabet (for example, Russian) and, second, which Latin letters must be replaced with local ones in “mixed” words. It works as follows: if a word contains at least one letter from the local alphabet, the whole word is treated as a local-language word. The filter then looks for Latin letters in the word that resemble local letters and replaces them with the corresponding local letters. (Such words are usually found only in junk mail and in some FIDO lists. Legitimate senders rarely replace local letters with look-alike Latin ones.)


Partial transliteration helps the filter recognize junk e-mail and also reduces the size of the statistical base, since all words are stored in one uniform local-alphabet form without any mixing of character sets.
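The rule described above can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not BayesIt's code: the names ALPHABET, LOOKALIKES and normalize_word are hypothetical, the alphabet is the Russian А..я range from the example, and the table holds only a handful of look-alike pairs written as Unicode escapes so that the Latin and Cyrillic letters cannot be confused.

```python
# Assumed "Alphabet" setting: Russian letters U+0410..U+044F (both cases).
ALPHABET = {chr(code) for code in range(0x0410, 0x0450)}

# Assumed transliteration table: Latin look-alike -> Cyrillic letter.
LOOKALIKES = {
    "A": "\u0410", "a": "\u0430",
    "C": "\u0421", "c": "\u0441",
    "E": "\u0415", "e": "\u0435",
    "O": "\u041e", "o": "\u043e",
    "P": "\u0420", "p": "\u0440",
    "X": "\u0425", "x": "\u0445",
}


def normalize_word(word):
    """Replace look-alike Latin letters, but only in words that are already local."""
    if not any(ch in ALPHABET for ch in word):
        return word  # purely Latin word: leave it untouched
    return "".join(LOOKALIKES.get(ch, ch) for ch in word)
```

A word spelled with a Latin "c" and "a" among Cyrillic letters comes out fully Cyrillic, while an ordinary English word such as "cap" is returned unchanged, so both spellings of a junk word end up as one token in the base.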

Note that this technology makes no sense for English-language mail. So, if all the mail you receive is in English, simply click “Yes” in the upper part of the wizard and move on to the next step.

If your native language is not English, click “No” and fill the next two fields.

In the “Alphabet” field enter your national alphabet in both cases (lowercase and uppercase letters). You can type something like "АБВГДЕЖЗИЙКЛМНОПРСТУФХЦЧШЩЪЫЬЭЮЯабвгдежзийклмнопрстуфхцчшщъыьэюя", or you can enter your locale identifier (“Russian_Russia” for Russia) and press the “Generate” button. The result looks like this:

The first few symbols are Cyrillic letters that are not part of the Russian alphabet. Just delete them.

Then go to the transliteration table.

Here you must enter the table that maps Latin letters to your local alphabet. The list contains the Latin characters plus some special characters that can also be useful (for example, the letter “A” is sometimes replaced with the digit “4”, which looks somewhat similar). Double-clicking a list element opens it for editing. Complete the line with the corresponding local letters and press “Add”. To delete an existing pair, type just the single Latin letter and press “Add”.

After entering the alphabet and transliteration table you can go to the last step of the wizard.

The only button you need here is “Create”. Press it and have a cup of coffee: depending on the amount of mail you use to 'seed' the filter, it could take a long time. Please pay ATTENTION to the remark about the “Options…” button in the description of the filter’s versions at the bottom of this page.

When the process of initial training is finished, you will see the button “Ready”. By pressing this button, you will leave the wizard and come back to The Bat! where our filter is already installed.

Alternative way of initial learning of the filter

Since version 0.4fm the filter can also be trained with the "Mark as..." buttons in The Bat!. Some may find this method more convenient than training through the folder selection wizard. In particular, there is no need to sort "good" and "bad" e-mails into different folders first. For this alternative initial training you need to:

1. Select a folder on the first step of the wizard (which exactly is not important, this is only necessary to enable the next step).

2. Set up partial transliteration (if necessary).

3. Pass to the third step and press "Cancel".

(In fact, all these manipulations are needed only to set up partial transliteration. If you don't need this feature, you can just click "Cancel" on the first step of the wizard.)

Then, directly in The Bat!, select the letters and execute the command "Mark as Junk" (for junk mail) or "Mark as NOT Junk" (for "good" mail), respectively. You can assign shortcuts to these commands via the menu “View/Edit shortcuts...” (or by pressing Alt+F12).

Some tricks and features...

The main differences from the other anti-spam filters and methods are:

  1. The filter needs no online bases or external updates. It can work effectively using only the user's own e-mails.
  2. Training the filter doesn’t require a human to read the spam, nor do you need to learn the spammers' tricks.
  3. Fully automated learning is possible [Attention! It works only in ver 0.3a and from 0.4fm onwards].
  4. Some local-language “tricks” that substitute look-alike Latin letters for local ones can be recognized.
  5. Fake HTML comments, which are usually used to split a word into parts invisibly (for example, por<!--23456-->nography), are recognized. Moreover, these fake comments are themselves treated as standalone tokens: because they are found mostly in junk mail, their presence can significantly increase an e-mail's final “spaminess” grade. It works like a counterstrike: instead of hiding the junk, the comment gives it away.
  6. “Percent-encoded” HTML strings, where some symbols are replaced with codes like %43 (mostly in URLs), are recognized.
  7. Tokens from the message headers are distinguished from tokens from the body.
  8. The filter can’t be fooled by mixing letter case inside words.
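Points 5 to 8 of the list above can be illustrated with a short Python sketch. It is not BayesIt's tokenizer: the function names, the "<!--comment-->" marker token and the "h:" prefix for header tokens are assumptions made purely for the example.

```python
import re
from urllib.parse import unquote

FAKE_COMMENT = re.compile(r"<!--.*?-->", re.DOTALL)


def body_tokens(text):
    """Tokenize a message body while undoing three common obfuscations."""
    comments = FAKE_COMMENT.findall(text)
    text = FAKE_COMMENT.sub("", text)   # re-join words split by fake comments
    text = unquote(text)                # decode percent codes such as %43
    tokens = [word.lower() for word in re.findall(r"\w+", text)]  # ignore case
    # Each fake comment also counts as a token of its own (assumed marker).
    tokens += ["<!--comment-->"] * len(comments)
    return tokens


def header_tokens(text):
    """Tokenize a header; the prefix keeps these apart from body tokens."""
    return ["h:" + word.lower() for word in re.findall(r"\w+", text)]
```

With this sketch, the body text "Buy por<!--23456-->nography at %43heap.example" yields the tokens buy, pornography, at, cheap, example, plus one comment marker, so the hiding trick ends up adding evidence against the message.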

The one disadvantage of the filter is that it needs a relatively large number of junk e-mails for initial training. (This is debatable: I used about 400 junk letters, collected by the time the first version of the filter was completed, and that was enough for practically faultless operation, but I have also heard reports that the filter was successful with fewer than 10 junk e-mails.) Of course, the more "junk e-mail" you let the filter process, the better it will "learn" YOUR specific definition of "junk" or "spam". In any case, you can work around this disadvantage by using a spam base prepared by somebody else or downloaded from the internet.

The essential advantage of the filter is that the other half of its statistical reference base consists of your “good” e-mails, which differ from user to user. This means every user has a unique statistical base. A spammer can’t work out how to fool the filter, because each user has built his own filter base.

While it works, the filter writes to its own log, which can be found at %APPDATA%/Bayesit/Bayesit.log (the location can be changed via the system registry).

The filter adds three user macros to The Bat!:

%Bayesbase — inserts the current number of “active” tokens from the base (“active” are the tokens which were met at least 5 times).

%Spaminess("word") — inserts the “junkiness” of the given word on a scale from 0 to 1 (0 means that the word isn’t present in the active base; 0.1 is the lowest grade and means a totally “good” word; 0.99 is the highest grade and means a totally “spammy” word).

%BayesItVersion — (since 0.4fm) — inserts the name and version of Bayesit, as "BayesIt! 0.4fm".
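How a per-word value like the one returned by %Spaminess might be derived can be sketched in Python. This is an assumption-laden illustration, not BayesIt's formula: it uses a simplified ratio in the spirit of Paul Graham's “A plan for spam”, and only the "at least 5 times" activity rule and the 0, 0.1 and 0.99 values are taken from the macro descriptions above. The function and parameter names are hypothetical.

```python
def spaminess(junk_count, good_count, junk_messages, good_messages,
              min_seen=5, lowest=0.1, highest=0.99):
    """Per-token junk probability; 0.0 means the token is not "active" yet."""
    if junk_count + good_count < min_seen:
        return 0.0  # seen fewer than 5 times: not part of the active base
    junk_rate = min(1.0, junk_count / junk_messages)
    good_rate = min(1.0, good_count / good_messages)
    probability = junk_rate / (junk_rate + good_rate)
    # Clamp so that no single token is ever treated as absolute proof.
    return max(lowest, min(highest, probability))
```

A token seen ten times in junk and never in good mail is clamped to 0.99, the reverse case to 0.1, and a token seen equally often in both corpora of equal size gets 0.5.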

Using these macros you can write a quick template like this:

---- Base report ------
%BAYESBASE
------ Report by mail delivering time -----------
less than 30 min: %SPAMINESS("t hhour")
30 min to hour: %SPAMINESS("t hour")
hour to 6 hours: %SPAMINESS("t 6hour")
6 hours to 1 day: %SPAMINESS("t day")
more than 1 day: %SPAMINESS("t days")
not defined: %SPAMINESS("t wrong")

----- Report by images -----------
%SPAMINESS("F image<2k")
%SPAMINESS("F image2to10k")
%SPAMINESS("F image>10k")

(it will write a small report about current active base as an example).

In The Bat! itself, besides the “Anti-Spam” property page there is also a “Plug-Ins” page, which can give you some more information about a filter:

 

The versions and the links (current is 0.4gm)

The latest version of the filter is 0.4gm (beta). You can download it here (184kb) or here (135kb). (The first is an SFX distribution, the second is a simple Zip archive; after unpacking the latter you must run "install.bat" manually.)

Since the last beta version (0.4fm) there are some more features implemented:

— Fixed a cosmetic bug that showed the wrong number of processed letters when using the "Mark as..." commands (like "0 messages processed by 1 plugin", etc.).

— Corrected the behaviour of the filter while The Bat! is starting (previously the filter didn't grade any letters while the base was loading and just wrote a note in the log that the base was not loaded yet. From now on all incoming letters are graded).

— Added support for "Selective Download" filters. Some more about this feature:

As mentioned in the section about plugin limitations, a filter cannot work with selective downloading directly. However, The Bat! itself can use an external text file as the source of "signal strings" for selective download filtering. This is exactly how BayesIt supports this function: it simply exports such a file. The file contains tokens from the messages' RFC headers which were never seen in "good" mail but were seen at least once in "junk". The tokens are formatted as regular expressions, which helps to avoid some token processing errors in The Bat!
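The export described above can be sketched in Python. The sketch rests on assumptions: the function names are hypothetical, the token counts are plain dictionaries, and the one-expression-per-line layout and UTF-8 encoding of the output file are guesses, since the actual format of selective.txt is not documented here.

```python
import re


def selective_signals(junk_header_counts, good_header_counts):
    """Header tokens seen in junk but never in good mail, escaped as regexes."""
    return sorted(
        re.escape(token)
        for token, count in junk_header_counts.items()
        if count > 0 and good_header_counts.get(token, 0) == 0
    )


def write_selective_file(path, signals):
    """Write one regular expression per line (assumed file layout)."""
    with open(path, "w", encoding="utf-8") as handle:
        handle.write("\n".join(signals) + "\n")
```

Escaping each token means that characters such as "." are matched literally when The Bat! treats the strings as regular expressions.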

To set up "selective download" with BayesIt, follow these steps:

1. Install this version (0.4gm) of the filter. If you have the previous version (0.4fm) installed, download the zip version of the distribution, overwrite the existing file bayesit.tbp with the new one from the distribution, and import the file add_04gm.reg into the system registry.

2. Run The Bat! After a few seconds the file selective.txt will be created in the working folder of the filter. This file is the source file for the "selective download" filter.

3. In The Bat! go to the "Configure message filters" and in the section "Selective download" add the new filter using these parameters:

— Rule —"Detect by" —"Entire header"

— Advanced —"Detection method" —"Match any string as a regular expression"

— Advanced —"Load signals string from the file" (must be checked) — and then provide the path to the file selective.txt

That's all! In future versions I am going to add "white" and "black" lists and implement the options (the "Options" button on the third step of the wizard).


The previous version is 0.4fm (beta). You can download it here (168kb) or here (119kb). (The first is an SFX distribution, the second is a simple Zip archive; after unpacking the latter you must run "install.bat" manually.)

Since the last beta version (0.4em) there are some more features implemented:

— The macro %BayesItVersion (see above). On that note, a request to everybody who uses BayesIt: if you've found an error, or if you have a specific question about how to set up or configure the program, please mention the version in your letters. This macro makes that easy.

— The installation script that sets up BayesIt was changed.

Some internal structures of the program were also changed, which has some consequences:

— The internal format of the stored information was changed, so I strongly recommend fully clearing the base and regenerating it. If you really DON'T want to do that, you can make some manual changes instead (written here as a batch file that runs in the folder BayesIt\base):

rem run from the folder BayesIt\base
copy spam\msid.idx spamdict.idx
copy spam\msid.lst spamdict.lst
copy nspam\msid.idx nspamdict.idx
copy nspam\msid.lst nspamdict.lst
rem remove both folders with all their subfolders and files
rem (on Windows NT/2K/XP use "rd /s /q" instead of "deltree /y")
deltree /y spam
deltree /y nspam
mkdir transact

With these small corrections the filter can continue working, though not at its best.

— The limit of 65536 letters per "corpus" was removed. Now you can "learn" as many e-mails as you want.

— Learning via the buttons "Mark as Junk" / "Mark as NOT Junk" (see the "Specials" menu in The Bat!) was implemented. Remember that you only have to "teach" the filter the e-mails it misclassified (see the list of changes for 0.4dm below for more details). If you want, you can "mark as" every letter, even correctly classified ones, but do you really want to do such redundant work? A quick note about learning: since updating the whole base takes a noticeable amount of time, it runs as a background process that starts automatically half a minute after the last "Mark as..." action (and also after new mail is received). When the update completes, the base is reloaded and the updated data is shown on the "Information" page.

— Initial training of the base is now possible using the "Mark as..." buttons. Just mark the selected letters, and the base will be created after some time.

— It is possible to "afterlearn" the base, that is, to extend an existing base. Letters that were trained before are simply skipped, and only new ones are added to the base. In particular, if somebody shares his spam dictionary with you, you can just copy it into the filter's working folder and "afterlearn" it with your own letters (either using the "Mark as Junk" button or the initial training wizard).
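The "afterlearning" idea, skipping letters that were already trained and adding only new ones, can be sketched in Python. How BayesIt actually recognizes an already trained letter is not stated here; the sketch assumes each letter carries some unique identifier (for instance its Message-ID), and all names are hypothetical.

```python
def afterlearn(trained_ids, token_counts, messages):
    """Add only letters not trained before; return how many were added."""
    added = 0
    for message_id, tokens in messages:
        if message_id in trained_ids:
            continue  # trained earlier: skip, so nothing is counted twice
        trained_ids.add(message_id)
        for token in tokens:
            token_counts[token] = token_counts.get(token, 0) + 1
        added += 1
    return added
```

Running the function twice over the same folder adds nothing the second time, which is what makes it safe to train on top of a base copied from somebody else.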


The previous version is 0.4em (beta). You can download it here (164kb) or here (115kb). (The first is an SFX distribution, the second is a simple Zip archive; after unpacking the latter you must run "install.bat" manually.)

Since 0.4dm there are some changes:

— A debug mode was implemented to make it faster to find the letters that cause errors during initial training.

In debug mode every letter is saved to a designated file before being processed, so if an error occurs (a "runtime error" or simply a hang), this file will contain the problematic letter. As the author of BayesIt, I ask you to send me such letters in packed form to my e-mail address, which you can find at the bottom of this page.

— The information displayed when you click the "Information" button in the preferences window of The Bat! was changed.

— Fixed some bugs that caused errors during initial training of the filter.


The previous version is — 0.4dm. You can download it here. (110kb).

Please, pay attention! There are some known limitations of this version:

  • — The “Options” button on the final page of the wizard doesn’t work.
  • — It is impossible to build a base that contains more than 65537 letters in either of its halves.
  • — The Gary Robinson method was excluded because it wasn’t effective enough.
  • — Learning is provided for, but not fully implemented yet. All letters that land in the “Junk folder” because of the filter’s grade are automatically recorded in the filter’s statistical base as “junk”, and the rest as “good”. So, if the filter made a mistake and failed to recognize a junk letter, you must explicitly mark it as junk using the command “Mark as Junk” in the “Specials” menu. (Likewise, if a good e-mail was mistakenly put into “junk mail”, you must not only move it to a “good” folder but also apply “Mark as NOT junk” to it.)
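The last point, automatic labelling followed by manual correction, can be sketched in Python. It is a simplified model and not the plugin's storage format: the class and method names are hypothetical, and the base is reduced to two token counters.

```python
from collections import Counter


class Base:
    """Toy statistical base with one token counter per half."""

    def __init__(self):
        self.counts = {"junk": Counter(), "good": Counter()}

    def learn(self, label, tokens):
        """Record a letter under the label the filter's grade gave it."""
        self.counts[label].update(tokens)

    def correct(self, wrong_label, tokens):
        """Undo a wrong automatic label and learn the letter under the other one."""
        right_label = "good" if wrong_label == "junk" else "junk"
        self.counts[wrong_label].subtract(tokens)
        self.counts[right_label].update(tokens)
```

This shows why moving a letter between folders is not enough: without the explicit "Mark as..." command, its tokens would stay counted in the wrong half of the base.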

In the current version the letters are stored in an internal folder of the filter. In the near future this folder will be used for autolearning.

The old versions

The previous announced version — 0.3a. You can read the description and download it here (page in Russian).


Copyright © 2002, 2003 by Alexey N. Vinogradov (the owner of klirik.narod.ru)