Scoring Documents

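In the classical vector space representation, documents matching a query are retrieved by finding vectors in the proximity of the query vector. A commonly used similarity metric is the cosine of the angle between the vectors. For vectors of unit length, this can be calculated by taking the scalar product of the two vectors:

\[ \mathbf{d} \cdot \mathbf{q} = \sum_i d_i \, q_i \]

where d is the document vector, q is the query vector, and i ranges over the terms.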
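As a minimal sketch of this metric, the following Python function computes the cosine of the angle between two sparse term vectors. The representation (a dict mapping each term to its weight), the function name, and the sample terms are assumptions made for illustration. The function divides by the vector lengths, so it does not require unit-length inputs.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two sparse term vectors.

    Each vector is a dict mapping term -> weight (an assumed representation).
    """
    # Scalar product over the terms the two vectors share.
    dot = sum(w * b.get(term, 0.0) for term, w in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    if norm_a == 0.0 or norm_b == 0.0:
        # Convention chosen for this sketch: an empty vector matches nothing.
        return 0.0
    return dot / (norm_a * norm_b)


# Hypothetical query and documents.
query = {"agent": 1.0, "filtering": 2.0}
parallel_doc = {"agent": 2.0, "filtering": 4.0}  # same direction as the query
unrelated_doc = {"music": 3.0}                   # shares no terms with the query

print(cosine_similarity(query, parallel_doc))
print(cosine_similarity(query, unrelated_doc))
```

A document whose vector points in the same direction as the query scores close to 1 regardless of its length, while a document that shares no terms with the query scores 0.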

The similarity metric above is generalized for the current application. The similarity between a document and a profile is a function of the similarities between their corresponding fields. The field similarities are computed first. Since each field is a term-vector, the similarity between two fields of the same type is measured with the same scalar product. The similarity between the complete document and the complete profile is computed next: it is the sum of the field similarity scores, weighted by the field weights of the profile.
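The two-stage computation described above can be written out as follows. The notation is an assumption made for this sketch: d_f and p_f denote the term-vectors of field f in document D and profile P, t ranges over terms, and w_f is the weight the profile assigns to field f.

```latex
% Field similarity: scalar product of the corresponding field vectors.
s_f(D, P) = \mathbf{d}_f \cdot \mathbf{p}_f = \sum_t d_{f,t} \, p_{f,t}

% Document similarity: field scores weighted by the profile's field weights.
S(D, P) = \sum_f w_f \, s_f(D, P)
```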

Scores of different fields are added together, weighted by the field weights. However, this would tend to favor fields which have arbitrarily high term-weights, even when the field weights are low. Field scores must not be compared or added unless they are all on a uniform scale. This can be achieved by normalizing the field vectors to unit length. The scalar product of two normalized vectors lies in the closed interval [-1, 1], which constrains the field scores to lie in the same interval.
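The effect of normalization can be illustrated with a short Python sketch. The dict-based field vectors, the helper names, and the sample terms and weights are all hypothetical. Two document fields that differ only in the overall magnitude of their term-weights produce very different raw scalar products, but the same score once the vectors are normalized.

```python
import math


def normalize(vec):
    """Scale a sparse term vector (dict term -> weight) to unit length."""
    norm = math.sqrt(sum(w * w for w in vec.values()))
    if norm == 0.0:
        return dict(vec)  # an empty or all-zero vector is left unchanged
    return {term: w / norm for term, w in vec.items()}


def raw_product(a, b):
    """Scalar product of two sparse term vectors, without normalization."""
    return sum(w * b.get(term, 0.0) for term, w in a.items())


def field_score(doc_field, profile_field):
    """Scalar product of the normalized field vectors."""
    return raw_product(normalize(doc_field), normalize(profile_field))


# Hypothetical fields: same direction, term-weights differing by a factor of 1000.
small = {"learning": 0.2, "agents": 0.1}
large = {"learning": 200.0, "agents": 100.0}
profile_field = {"learning": 1.0, "agents": 1.0}

# Raw products differ by the scale factor; normalized scores agree.
print(raw_product(small, profile_field), raw_product(large, profile_field))
print(field_score(small, profile_field), field_score(large, profile_field))
```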

An implicit effect of normalization is that term-weights cannot be compared across fields. For example, if the weight of a keyword is higher than the weight of an author, it does not imply that the keyword would contribute more to the document score than the author. The weight indicates the relative importance of a term only with respect to the other terms in the same field. Users creating profiles should be made aware of this fact.

A similar problem lies with the document similarity scores. Profiles go through different parts of the database and score different articles, so each document's similarity score is relative to a particular profile. When the agent collects all the top-scoring articles retrieved by the profiles, it is not possible to compare the scores unless they are all on the same scale. Besides, a user would not be able to make sense of a similarity score if the scale is not known. Hence, document scores are constrained to lie in the closed interval [-1, 1].

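The highest score of 1 should be assigned only when the document and profile representations are identical. Further, when two fields are identical, the field score is also 1, because of the definition of field similarity and the normalization constraint on the field vectors. Since the document score is the weighted sum of the field scores, these two conditions imply that the field weights of a profile must sum to 1:

\[ \sum_f w_f = 1 \]

where w_f is the weight of field f in the profile.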

Hence, the normalization of the field vectors and the requirement that the field weights sum to 1 together ensure that the field scores and document scores lie on convenient and uniform scales.
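The combined effect of the two constraints can be checked with a small Python sketch. The data layout (a document or profile as a dict mapping field name to a term-vector dict), the function names, and the sample values are assumptions made for illustration. The sketch also assumes non-negative field weights, which is what keeps the weighted sum inside [-1, 1].

```python
import math


def unit(vec):
    """Scale a sparse term vector (dict term -> weight) to unit length."""
    norm = math.sqrt(sum(w * w for w in vec.values()))
    return {t: w / norm for t, w in vec.items()} if norm else {}


def field_score(a, b):
    """Scalar product of the two normalized field vectors."""
    ua, ub = unit(a), unit(b)
    return sum(w * ub.get(t, 0.0) for t, w in ua.items())


def document_score(document, profile, field_weights):
    """Weighted sum of field scores; the weights must be non-negative and sum to 1."""
    if any(w < 0 for w in field_weights.values()):
        raise ValueError("field weights must be non-negative")
    if abs(sum(field_weights.values()) - 1.0) > 1e-9:
        raise ValueError("field weights must sum to 1")
    return sum(
        weight * field_score(document.get(field, {}), profile.get(field, {}))
        for field, weight in field_weights.items()
    )


# Hypothetical profile with two fields and its field weights.
profile = {
    "keyword": {"agents": 0.8, "filtering": 0.6},
    "author": {"maes": 1.0},
}
weights = {"keyword": 0.7, "author": 0.3}

# A document identical to the profile reaches the top of the scale.
identical = document_score(profile, profile, weights)
# A document matching only the author field scores at most that field's weight.
partial = document_score({"author": {"maes": 2.5}}, profile, weights)
print(identical, partial)
```

With these constraints in place, a document identical to the profile scores approximately 1 (up to floating-point rounding), and a document that matches a single field perfectly contributes exactly that field's weight to the total.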

MIT Media Lab - Autonomous Agents Group - agentmaster@media.mit.edu