Extracting Document Representations


Documents are represented as a set of fields where each field is a term-vector (see Section ). The fields of the document representation must be extracted from the document itself. All document fields except the keyword field are directly extracted from the header lines of the article. The keyword field is generated from the text of the article.

For example, the location field can be created from the ``Location'' header line, if available. The terms found in the header line are added to the appropriate field and assigned default weights. For instance, if the document has the following header line:

Location: China, Taiwan

the vector representation in the Location field is $l = \{(t_1, w), (t_2, w)\}$, where $l$ is the location field in the document representation, $w$ is the default weight, $t_1 =$ ``china'' and $t_2 =$ ``taiwan''. Since the term vectors are normalized when documents are scored, it does not matter what the actual value of the default weight is. The weights just indicate the relative importance of the terms - in the above example both terms have the same relative importance.
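The header-field extraction described above can be sketched as follows. This is a minimal illustration, not the implementation used in the system: the function names `header_field_vector` and `normalize`, the comma-based splitting of the header value, the lowercasing of terms, and the use of Euclidean (unit-length) normalization are all assumptions made for the example.

```python
import math

# Assumed default; the text notes that its actual value does not matter.
DEFAULT_WEIGHT = 1.0


def header_field_vector(header_line, default_weight=DEFAULT_WEIGHT):
    """Build a term vector from a header line such as 'Location: China, Taiwan'.

    Assumes the header value is a comma-separated list of terms.
    """
    _, _, value = header_line.partition(":")
    terms = [t.strip().lower() for t in value.split(",") if t.strip()]
    # Every term extracted from the header gets the same default weight.
    return {term: default_weight for term in terms}


def normalize(vector):
    """Scale a term vector to unit Euclidean length (assumed normalization)."""
    norm = math.sqrt(sum(w * w for w in vector.values()))
    if norm == 0:
        return dict(vector)
    return {term: w / norm for term, w in vector.items()}


location = header_field_vector("Location: China, Taiwan")
print(location)
print(normalize(location))
```

Under these assumptions, building the vector with a default weight of 1.0 or 5.0 gives the same normalized vector, which illustrates why only the relative weights of the terms matter.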

The term-vector for the keyword field is obtained through a full-text analysis of the documents. The weight of a term depends on its frequency of occurrence in the text and on the number of documents it appears in. This is a well-known term weighting method for the vector-space model in the information retrieval literature [38] and has been adapted for the present use.

The weight of a keyword-term is the product of its term frequency and its inverse document frequency. The term frequency (tf) is the occurrence frequency of the term in the text and normally reflects the importance of the term. The inverse document frequency (idf) is a factor which enhances the terms which appear in fewer documents, while downgrading the terms occurring in many documents. The resulting effect is that the document-specific features get highlighted, while the collection-wide features are diminished in importance. The weight of the term is given as $w_{ik} = tf_{ik} \cdot idf_k$, where $tf_{ik}$ is the number of occurrences of term $t_k$ in document $i$, and $idf_k$ is the inverse document frequency of the term $t_k$ in the collection of documents. A commonly used measure for the inverse document frequency is $idf_k = \log(N / n_k)$, where $N$ is the total number of documents in the ``collection'', of which $n_k$ contain a given term $t_k$.

The collection of documents is the context within which the inverse document frequencies are evaluated. A profile evaluates all documents in each newsgroup mentioned in the newsgroup field of the profile. It also searches some newsgroups which may not be mentioned in the newsgroup field. These are the newsgroups containing at least one article that received feedback through programming by demonstration. The implementation details are mentioned in Section .
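The tf-idf weighting above can be sketched in a few lines. This is an illustrative sketch under stated assumptions, not the system's actual implementation: the names `tokenize` and `keyword_vectors` are hypothetical, the whitespace tokenizer is a stand-in for whatever text analysis the system performs, and the natural logarithm is assumed since the base of the logarithm is not specified.

```python
import math
from collections import Counter


def tokenize(text):
    """Hypothetical tokenizer: lowercase and split on whitespace."""
    return text.lower().split()


def keyword_vectors(documents):
    """Return one tf-idf term vector (dict of term -> weight) per document.

    The weight of term k in document i is tf_ik * idf_k, with
    idf_k = log(N / n_k), where N is the number of documents in the
    collection and n_k is the number of documents containing term k.
    """
    n_docs = len(documents)
    term_counts = [Counter(tokenize(doc)) for doc in documents]

    # n_k: number of documents in which each term appears at least once.
    doc_freq = Counter()
    for counts in term_counts:
        doc_freq.update(counts.keys())

    idf = {term: math.log(n_docs / n_k) for term, n_k in doc_freq.items()}
    return [
        {term: tf * idf[term] for term, tf in counts.items()}
        for counts in term_counts
    ]


collection = ["china trade china", "taiwan trade"]
for vector in keyword_vectors(collection):
    print(vector)
```

In this toy collection, ``trade'' appears in every document, so its idf is $\log(2/2) = 0$ and it contributes nothing to either keyword vector, while ``china'' and ``taiwan'' each appear in only one document and keep a positive weight.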


MIT Media Lab - Autonomous Agents Group - agentmaster@media.mit.edu