Document

A standard method of indexing text consists of recognizing individual words, eliminating the commonly used words that appear on a word-exclusion list, and using the remaining words for content identification of the texts. Sometimes phrases are compacted and treated as a single term. Words may also be truncated to word stems. Generally speaking, a "term" (a word, phrase, or word stem) is the unit used for text identification. Since the terms are not all equally important for content representation, importance factors (or weights) are assigned to the terms in proportion to their presumed importance for text content identification [41]. A text is then representable as a vector of terms $T_i = \langle w_{ij} \rangle$, where $w_{ij}$ represents the weight of term $t_j$ in text $T_i$.
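The indexing steps described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the exclusion list `STOP_WORDS`, the suffix-stripping helper `crude_stem`, and the use of relative frequency as the weight are all hypothetical choices, since the text does not prescribe a particular stemmer or weighting formula.

```python
import re
from collections import Counter

# Hypothetical word-exclusion list; a real one would be much longer.
STOP_WORDS = {"a", "an", "and", "the", "of", "in", "is", "to", "for"}


def crude_stem(word):
    """Strip a few common suffixes (illustrative only, not a real stemmer)."""
    for suffix in ("ing", "ed", "s"):
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


def term_vector(text):
    """Map a text to {term: weight}.

    Assumption: the weight is the term's relative frequency among the
    terms kept after exclusion and stemming.
    """
    words = re.findall(r"[a-z]+", text.lower())
    terms = [crude_stem(w) for w in words if w not in STOP_WORDS]
    counts = Counter(terms)
    total = sum(counts.values())
    if total == 0:
        return {}
    return {term: n / total for term, n in counts.items()}


vector = term_vector("The filtering agents filtered")
```

Here `vector` maps each surviving term to its weight, which plays the role of $w_{ij}$ for a single text. Any other weighting scheme could be substituted without changing the shape of the representation.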

In the information filtering context being considered in this thesis, documents contain more than just text. The documents can contain other information such as the author of the document, the source of information, the geographic origin of the news article, etc. It is therefore necessary to generalize the term-vector representation mentioned above. The generalized representation used is as follows.

A document consists of many fields. The text is just one of them (and will be referred to as the keyword field hereafter). The other fields could include author, location (the geographic origin of the news article), date (when the article was posted), lines (the number of lines), etc. There is no restriction on the number of fields, so long as each represents some attribute of the document. The newsgroup field deserves special mention, since it is perhaps the only domain-specific field used in this representation. The news source dealt with in this thesis is USENET network news, available on the Internet. USENET is organized as a collection of newsgroups, each of which represents a broad category for the articles contained in it. Each document, therefore, has a newsgroup field indicating its category.

Each field is assigned terms for identification purposes. Since the terms are not all equally important, they are assigned weights. A field is thus represented as a term-vector similar to the one introduced above. In particular, $F_i^d = \langle w_{ij}^d \rangle$, where $w_{ij}^d$ is the weight of term $t_j$ in field $F_i^d$. The subscript $i$ can be "a"(uthor), "k"(eyword), "l"(ocation), and so on. The superscript $d$ indicates that $F_i^d$ is a document field (as opposed to a profile field, described below).

Since a document consists of many fields, it is represented as a set of field-vectors. Formally, $D = \{F_i^d\}$, where $F_i^d$ is a field in document $D$.
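As a rough sketch of this generalized representation, a document can be modeled as a mapping from field names to term-vectors. The class name `Document`, the `weight` helper, and the sample field names, terms, and weights below are hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass, field
from typing import Dict

# A term-vector maps each term to its weight within one field.
TermVector = Dict[str, float]


@dataclass
class Document:
    """A document as a set of field-vectors, keyed by field name."""

    fields: Dict[str, TermVector] = field(default_factory=dict)

    def weight(self, field_name: str, term: str) -> float:
        """Return the weight of a term in a field, or 0.0 if absent."""
        return self.fields.get(field_name, {}).get(term, 0.0)


# Hypothetical article with three fields; all values are made up.
doc = Document(
    fields={
        "keyword": {"agent": 0.5, "filter": 0.5},
        "author": {"jane@example.org": 1.0},
        "newsgroup": {"comp.ai": 1.0},
    }
)
```

Keying the fields by name keeps the number of fields open-ended, in line with the statement that there is no restriction on how many fields a document may have. A missing field or term simply yields a weight of zero.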

MIT Media Lab - Autonomous Agents Group - agentmaster@media.mit.edu