Documents are represented as a set of fields where each field is a term-vector
(see Section ).
The fields of the document representation must be extracted from the document
itself. All document fields except the keyword field are directly extracted
from the header lines of the article. The keyword field is generated from the
text of the article.
For example, the location field can be created from the ``Location'' header line, if available. The terms in the header line are added to the appropriate field and assigned default weights. For instance, if the document has the following information:
Location: China, Taiwan
the vector representation in the Location field is
\[ L = \langle (t_1, w),\ (t_2, w) \rangle \]
where $L$ is the location field in the document representation,
$w$ is the default weight,
$t_1 =$ ``china'' and
$t_2 =$ ``taiwan''.
Since the term vectors are normalized when documents are scored, the
actual value of the default weight does not matter. The
weights only indicate the relative importance of the terms; in the
above example both terms have the same relative importance.
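The header-line extraction described above can be sketched as follows. This is only an illustration under stated assumptions: the helper names `field_from_header` and `normalize`, the comma-separated header format, the default weight of 1.0, and the use of Euclidean (unit-length) normalization are not taken from the text.

```python
import math

DEFAULT_WEIGHT = 1.0  # assumed value; the text only says a default weight is assigned


def field_from_header(header_line, default_weight=DEFAULT_WEIGHT):
    """Build a term vector (term -> weight) from a 'Name: a, b' header line."""
    _, _, value = header_line.partition(":")
    terms = [t.strip().lower() for t in value.split(",") if t.strip()]
    return {term: default_weight for term in terms}


def normalize(vector):
    """Scale a term vector to unit Euclidean length (assumed normalization)."""
    norm = math.sqrt(sum(w * w for w in vector.values()))
    if norm == 0.0:
        return dict(vector)
    return {term: w / norm for term, w in vector.items()}


location = field_from_header("Location: China, Taiwan")
print(location)
print(normalize(location))
```

Calling `field_from_header` with a different `default_weight` gives the same normalized vector, which is why only the relative weights of the terms matter.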
The term-vector for the keyword field is obtained through a full-text analysis of the documents. The weight of a term depends on its frequency of occurrence in the text and on the number of documents in which it appears. This is a well-known term weighting method for the vector-space model in the information retrieval literature [38] and has been adapted for the present use.
The weight of a keyword term is the product of its term frequency and its
inverse document frequency. The term frequency (tf) is the frequency of
occurrence of the term in the text and normally reflects the importance of the term.
The inverse document frequency (idf) is a factor that enhances the terms
that appear in fewer documents while downgrading the terms occurring in many
documents. As a result, document-specific features are highlighted
while collection-wide features are diminished in importance. The weight
of the term is given as
\[ w_{ik} = tf_{ik} \cdot idf_k \]
where $tf_{ik}$ is the number of occurrences of term $t_k$ in document $i$, and
$idf_k$ is the inverse document frequency of the term $t_k$
in the collection of documents. A commonly used measure for the inverse document
frequency is
\[ idf_k = \log\left(\frac{N}{n_k}\right) \]
where $N$ is the total number
of documents in the ``collection'', of which $n_k$
contain a given term $t_k$.
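The tf-idf weighting described above can be sketched as follows. It is a minimal illustration, not the system's implementation: the whitespace tokenizer, the natural logarithm, the function name `keyword_vectors`, and the three sample texts are all assumptions made for the example.

```python
import math
from collections import Counter


def tokenize(text):
    """Very naive tokenizer (assumption): lowercase, split on whitespace."""
    return text.lower().split()


def keyword_vectors(documents):
    """Return one tf * idf term vector (term -> weight) per document."""
    term_counts = [Counter(tokenize(doc)) for doc in documents]
    n_docs = len(documents)

    # Document frequency: number of documents that contain each term.
    doc_freq = Counter()
    for counts in term_counts:
        doc_freq.update(counts.keys())

    vectors = []
    for counts in term_counts:
        vectors.append({
            term: tf * math.log(n_docs / doc_freq[term])
            for term, tf in counts.items()
        })
    return vectors


docs = [
    "china trade talks",
    "taiwan trade trade report",
    "china taiwan talks",
]
for vec in keyword_vectors(docs):
    print(vec)
```

In this sample, ``report'' occurs in only one of the three documents and so receives the largest idf factor, while a term that occurs in every document of a collection would receive a weight of zero under this idf measure.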
The collection of documents is the context within which the inverse document
frequencies are evaluated. A profile evaluates all documents in each newsgroup
mentioned in the newsgroup field of the profile. It also searches some newsgroups
that may not be mentioned in the newsgroup field: those that contain
at least one article that received feedback through programming
by demonstration. The implementation details are given in Section
.
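The choice of collection described above can be sketched as a set union. The data layout here is hypothetical: a profile is assumed to carry a list of newsgroup names, and each article that received feedback is assumed to be a dictionary with a `newsgroups` list. The function name and the sample values are illustrative only.

```python
def collection_newsgroups(profile_newsgroups, feedback_articles):
    """Newsgroups searched by a profile: those listed in its newsgroup field
    plus those containing an article that received feedback."""
    groups = set(profile_newsgroups)
    for article in feedback_articles:
        groups.update(article["newsgroups"])
    return sorted(groups)


groups = collection_newsgroups(
    ["soc.culture.china"],
    [{"id": "a1", "newsgroups": ["soc.culture.taiwan"]}],
)
print(groups)
```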