As mentioned in Section ,
the inverse document frequency is calculated using the following formula:
\[ idf = \log \frac{N}{n} \]
where $N$ is the total number of documents in the ``collection'', of which $n$ contain a given term.
The collection of documents consists of articles in the set of newsgroups searched.
A profile searches each newsgroup mentioned in the newsgroup field of the profile.
It also searches some newsgroups which may not be mentioned in the newsgroup field.
These are the newsgroups containing at least one article that received
feedback through programming by demonstration. The pointers to the articles
that received feedback are listed in the ``ArtFeedback'' list (see table
).
Therefore, the ``collection'' refers to the set of documents in any of these
newsgroups, and $N$ is the total number of such documents.
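The construction of the collection described above can be sketched as follows. This is only an illustration: the function names, the dictionary layout of the feedback entries, and the `articles_by_group` mapping are hypothetical and are not taken from the actual implementation.

```python
def newsgroups_to_search(profile_newsgroups, art_feedback):
    """Return the set of newsgroups whose articles form the collection.

    profile_newsgroups: names listed in the profile's newsgroup field.
    art_feedback: entries for articles that received feedback; each entry
                  is assumed (hypothetically) to record its newsgroup.
    """
    groups = set(profile_newsgroups)
    for article in art_feedback:
        # A newsgroup is searched if any of its articles received feedback,
        # even when it is absent from the profile's newsgroup field.
        groups.add(article["newsgroup"])
    return groups


def collection_size(groups, articles_by_group):
    """Total number of documents in the searched newsgroups.

    Note: an article cross-posted to several groups would be counted once
    per group in this simplified sketch.
    """
    return sum(len(articles_by_group.get(group, [])) for group in groups)
```

Here the value returned by `collection_size` plays the role of the total document count used in the idf computation.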
The text-indexing algorithm has been implemented as a two-stage process. This is because, while the tf's can be computed for each document individually, the idf's cannot be computed until the whole collection of documents has been parsed once.
In the first stage, only the tf's are computed. The text is
first cleaned to remove punctuation marks and is then stripped down to
individual words. A stoplist is used to eliminate the commonly used
words (e.g. ``the'', ``and'', ``if'', etc.). The
frequency counts of the remaining words are computed and stored, as they will be
needed to calculate the final term weights. This is done for each
of the documents.
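A minimal sketch of this first stage is given below. The stoplist shown is a small illustrative sample rather than the one actually used, and lowercasing the text before counting is an assumption that the description above does not state.

```python
import re
from collections import Counter

# Illustrative stoplist only; the real list is assumed to be much longer.
STOPLIST = {"the", "and", "if", "a", "an", "of", "to", "in", "is"}


def term_frequencies(text, stoplist=STOPLIST):
    """Stage one: compute the raw term frequencies (tf's) of one document."""
    # Replace punctuation marks with spaces (lowercasing is an assumption).
    cleaned = re.sub(r"[^\w\s]", " ", text.lower())
    # Strip the text down to individual words.
    words = cleaned.split()
    # Drop stoplist words and count the remaining ones.
    return Counter(word for word in words if word not in stoplist)
```

Calling `term_frequencies` once per document and storing the resulting counts gives the data that the second stage needs.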
In the second stage, the idf's are computed. Another pass over the accumulated data yields the document count for each term, i.e. the number of documents in which the term appears. The product of the term frequency and the inverse document frequency gives the term weight, from which the term vector for each document can be generated.
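The second stage might look like the sketch below. It assumes the common form of the inverse document frequency, the logarithm of the total number of documents divided by the number of documents containing the term; the exact formula and any normalisation used in the actual system may differ. The function names are hypothetical.

```python
import math
from collections import Counter


def document_counts(tf_per_doc):
    """Number of documents in which each term appears."""
    counts = Counter()
    for tf in tf_per_doc:
        # Each document contributes at most one count per distinct term.
        counts.update(tf.keys())
    return counts


def term_vectors(tf_per_doc):
    """Stage two: turn stored tf's into tf * idf term vectors."""
    n_docs = len(tf_per_doc)
    df = document_counts(tf_per_doc)
    # Assumed idf form: log(N / n), with N documents and n containing the term.
    idf = {term: math.log(n_docs / count) for term, count in df.items()}
    return [
        {term: freq * idf[term] for term, freq in tf.items()}
        for tf in tf_per_doc
    ]
```

Under this assumed form, a term that occurs in every document of the collection receives a weight of zero, which matches the intuition that such a term does not help distinguish one document from another.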