
The Information Filtering Module

The Information Filtering module (called YAIF, for Yet Another Information Filter) is responsible for retrieving articles from the database of news articles. YAIF takes the profiles, scores articles with respect to each profile, and selects the high-scoring articles to be presented to the user. Because YAIF is time-intensive, it is run offline: the process is executed every night, so that filtered articles are available for every profile in the morning. This frequency is sufficient, since the data within a newsgroup does not change significantly within a day. This section describes the process of finding articles matching one profile; the same process is repeated for all available profiles.
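The score-then-select step can be sketched as follows. This is a minimal illustration, not YAIF's actual code: the names `select_articles` and `keyword_overlap`, the fixed score threshold, and the toy term-counting scorer are all assumptions introduced here, since the text only says that high-scoring articles are selected.

```python
from typing import Callable, Iterable, List, Tuple


def select_articles(
    articles: Iterable[str],
    score: Callable[[str], float],
    threshold: float,
) -> List[Tuple[float, str]]:
    """Score every article and keep those at or above the threshold, best first.

    The threshold is a hypothetical selection rule; the source does not say
    how "high scoring" is decided.
    """
    scored = [(score(article), article) for article in articles]
    kept = [(s, a) for s, a in scored if s >= threshold]
    # Sort by score only, so ties keep their original order.
    kept.sort(key=lambda pair: pair[0], reverse=True)
    return kept


def keyword_overlap(profile_terms: Iterable[str]) -> Callable[[str], float]:
    """Toy stand-in scorer: counts article words that appear in the profile."""
    terms = {t.lower() for t in profile_terms}

    def score(article: str) -> float:
        return float(sum(1 for word in article.lower().split() if word in terms))

    return score
```

In a nightly run, a function like `select_articles` would be called once per profile, with the scorer built from that profile.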

Each profile is stored in a separate file. Two sets of newsgroups are searched for each profile. One set is the list of newsgroups specified in the newsgroup field of the profile. The other set consists of newsgroups that may not be mentioned in the newsgroup field but contain articles that received feedback and are therefore listed under "ArtFeedback" (see table ). This is the case when users program by demonstration. For every profile, YAIF retrieves each article from the two sets of newsgroups and scores it with respect to the profile.
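Combining the two sets amounts to a set union. The sketch below assumes a hypothetical in-memory profile layout (a dict with a `newsgroups` list and an `ArtFeedback` list whose entries each record a `newsgroup`); the real profile file format is not shown in this section.

```python
from typing import Any, Dict, List


def newsgroups_to_search(profile: Dict[str, Any]) -> List[str]:
    """Return every newsgroup to scan for one profile, without duplicates.

    The keys "newsgroups" and "ArtFeedback", and the per-entry "newsgroup"
    field, are assumed names for illustration only.
    """
    # Set 1: newsgroups named explicitly in the profile's newsgroup field.
    listed = set(profile.get("newsgroups", []))
    # Set 2: newsgroups of articles that received feedback.
    from_feedback = {entry["newsgroup"] for entry in profile.get("ArtFeedback", [])}
    # Sorted only to make the result deterministic.
    return sorted(listed | from_feedback)
```

Using a union means a newsgroup that appears in both places is searched only once.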

A typical article is shown in table . Each article contains both structured and unstructured information. The unstructured information is the actual information content of the article, namely the text. The structured part is the meta-information about the text of the article, and it varies greatly depending on the source of news. An article must carry a number of header lines to adhere to the Standard for Interchange of USENET messages [20]. The headers mandated by the USENET protocol that are interesting for filtering purposes include Date, From, and Subject. Some suppliers of information provide additional, optional fields, such as pre-indexed keywords, Organization, (number of) Lines, Sender, Location and so on.
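The split between structured headers and unstructured text can be illustrated with Python's standard `email` parser, which handles the same "header lines, blank line, body" layout that USENET articles use. This is only a sketch: `split_article`, `missing_headers`, and the `FILTER_HEADERS` tuple are names assumed here, and the sample article in the usage note is invented.

```python
import email
from typing import Dict, List, Tuple

# Mandated headers that the text singles out as useful for filtering.
FILTER_HEADERS = ("Date", "From", "Subject")


def split_article(raw: str) -> Tuple[Dict[str, str], str]:
    """Separate an article into its header fields and its body text."""
    msg = email.message_from_string(raw)
    headers = {name: value for name, value in msg.items()}
    body = msg.get_payload()  # a plain string for a non-multipart article
    return headers, body


def missing_headers(headers: Dict[str, str]) -> List[str]:
    """List which of the filtering-relevant headers are absent."""
    return [name for name in FILTER_HEADERS if name not in headers]
```

For example, an article carrying From, Subject, Date and an optional Organization line yields a four-entry header dictionary and the text after the first blank line as the body; optional fields such as Organization are simply extra keys.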

Each candidate article evaluated by YAIF is first converted to its representation before it can be scored. The method of translating documents to their representations is described in the next section.






MIT Media Lab - Autonomous Agents Group - agentmaster@media.mit.edu