Extracting Document Representations

As mentioned in an earlier section, the inverse document frequency (idf) of a term is calculated using the following formula:

$$\mathit{idf} = \log \frac{N}{n}$$

where $N$ is the total number of documents in the ``collection'', of which $n$ contain the given term. The collection of documents consists of the articles in the set of newsgroups searched. A profile searches each newsgroup mentioned in its newsgroup field. It also searches some newsgroups that may not be mentioned in the newsgroup field: those containing at least one article that received feedback through programming by demonstration. Pointers to the articles that received feedback are kept in the ``ArtFeedback'' list, which is described in a table elsewhere in this document. Therefore, the ``collection'' refers to the set of documents in any of these newsgroups, and $N$ is the total number of such documents.
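The way the collection is assembled can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the `Article` and `Profile` classes, their field names, and the `articles_by_group` mapping are hypothetical stand-ins and are not taken from the actual implementation. The sketch shows that the searched newsgroups are the union of those named in the profile and those holding an article in the ``ArtFeedback'' list, and that the collection size is the number of articles in that union.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Set


@dataclass
class Article:
    """Hypothetical article record: only the newsgroup matters here."""
    newsgroup: str
    text: str


@dataclass
class Profile:
    """Hypothetical profile: a newsgroup field plus the ArtFeedback list."""
    newsgroups: List[str]
    art_feedback: List[Article] = field(default_factory=list)


def collection_newsgroups(profile: Profile) -> Set[str]:
    # Newsgroups named in the profile's newsgroup field ...
    groups = set(profile.newsgroups)
    # ... plus any newsgroup holding an article that received feedback.
    groups.update(article.newsgroup for article in profile.art_feedback)
    return groups


def collection_size(profile: Profile,
                    articles_by_group: Dict[str, List[Article]]) -> int:
    # Total number of documents across all searched newsgroups.
    return sum(len(articles_by_group.get(group, []))
               for group in collection_newsgroups(profile))


# Usage: one newsgroup is listed explicitly, one is reached via feedback.
liked = Article("rec.music", "a review that received feedback")
profile = Profile(newsgroups=["comp.ai"], art_feedback=[liked])
articles_by_group = {
    "comp.ai": [Article("comp.ai", "x"), Article("comp.ai", "y")],
    "rec.music": [liked],
    "sci.math": [Article("sci.math", "z")],  # not searched by this profile
}
total_documents = collection_size(profile, articles_by_group)
```

Here `sci.math` is ignored because the profile neither lists it nor holds feedback on any of its articles.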

The text-indexing algorithm is implemented as a two-stage process. Two stages are needed because the term frequencies (tf's) can be computed for each document in isolation, whereas the idf's cannot be computed until the whole collection of documents has been parsed once.

In the first stage, only the tf's are computed. The text is first cleaned to remove punctuation marks and is then split into individual words. A stoplist is used to eliminate commonly used words (e.g. ``the'', ``and'', ``if''). The frequency count of each remaining word is computed and stored, since it is needed to calculate the final term weights. This is done for each document.
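A minimal sketch of the first stage, assuming a simple regular-expression cleaner and a tiny illustrative stoplist, might look as follows. The function name `term_frequencies`, the contents of `STOPLIST`, and the lowercasing step are assumptions made for this example; the text above does not say whether case is folded or which words the real stoplist contains.

```python
import re
from collections import Counter

# Illustrative stoplist only; the real list is not given in the text.
STOPLIST = {"the", "and", "if", "a", "of", "to", "in", "is"}


def term_frequencies(text: str) -> Counter:
    # Assumption: fold case so that "The" and "the" count as one word.
    lowered = text.lower()
    # Replace every character that is neither a word character nor
    # whitespace (i.e. punctuation) with a space.
    cleaned = re.sub(r"[^\w\s]", " ", lowered)
    # Split into individual words and drop those on the stoplist.
    words = [word for word in cleaned.split() if word not in STOPLIST]
    # Count how often each remaining word occurs in this document.
    return Counter(words)


# Usage: the counts are stored per document for use in the second stage.
tf = term_frequencies("The cat and the hat: a cat!")
```

Because each document is processed independently, this stage can run as soon as an article is read, without knowledge of the rest of the collection.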

In the second stage, the idf's are computed. A second pass over the accumulated data yields the document count for every term, i.e. the number of documents in which the term appears. The product of a term's frequency and its inverse document frequency gives the term's weight, and from these weights the term vector for each document is generated.
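The second stage can be sketched as below. The sketch assumes the common form idf = log(N/n) with a natural logarithm, where N is the number of documents and n the number containing the term; the function names and the use of plain dictionaries as term vectors are likewise choices made for illustration, not details confirmed by the text.

```python
import math
from collections import Counter
from typing import Dict, List


def document_frequencies(tf_per_doc: List[Dict[str, int]]) -> Counter:
    # For every term, count the number of documents it appears in.
    df = Counter()
    for tf in tf_per_doc:
        df.update(tf.keys())  # each document contributes at most 1 per term
    return df


def term_vectors(tf_per_doc: List[Dict[str, int]]) -> List[Dict[str, float]]:
    n_docs = len(tf_per_doc)
    df = document_frequencies(tf_per_doc)
    # Assumed idf form: log(N / n), natural logarithm.
    idf = {term: math.log(n_docs / n) for term, n in df.items()}
    # Term weight = term frequency * inverse document frequency.
    return [{term: count * idf[term] for term, count in tf.items()}
            for tf in tf_per_doc]


# Usage: per-document counts as produced by the first stage.
tf_per_doc = [
    {"agent": 2, "filter": 1},
    {"agent": 1, "news": 3},
]
vectors = term_vectors(tf_per_doc)
```

In this toy collection ``agent'' occurs in every document, so under the assumed formula its idf, and hence its weight, is zero, while terms confined to a single document receive a positive weight.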

MIT Media Lab - Autonomous Agents Group - agentmaster@media.mit.edu