In the classical vector space representation, documents matching a query are retrieved by finding vectors in the proximity of the query vector. A commonly used similarity metric is the cosine of the angle between the vectors. This can be calculated by taking the scalar product of the two vectors and dividing it by the product of their lengths.
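As an illustration of the cosine metric, the following is a minimal sketch that assumes term vectors are stored as sparse dictionaries mapping terms to weights. The function name `cosine_similarity`, the sample vectors, and the treatment of empty vectors are assumptions made for this example and do not come from the original system.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two sparse term vectors (term -> weight)."""
    # Scalar product over the terms the two vectors share.
    dot = sum(weight * b.get(term, 0.0) for term, weight in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    if norm_a == 0.0 or norm_b == 0.0:
        # Assumption: a vector with no terms is treated as matching nothing.
        return 0.0
    return dot / (norm_a * norm_b)


# Hypothetical query and document vectors.
query = {"agent": 1.0, "retrieval": 2.0}
document = {"retrieval": 1.0, "database": 1.0}
print(cosine_similarity(query, document))
```

Dividing by the two lengths makes the result independent of the overall magnitude of either vector, so only the direction of the term weights matters.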
The cosine metric described above
is generalized for the current application. The similarity between a document
and a profile is a function of the similarities between the corresponding fields.
The field similarities are first computed. Since each field is a term-vector,
the metric used for measuring similarity between two fields of the same type
is the same cosine metric.
A similarity score is computed in this way for each pair of corresponding fields in the document and the profile.
The similarity between a complete document and a complete profile is computed
next. It is the sum of the field similarity scores, each weighted by the
corresponding field weight of the profile.
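The two-stage computation described here, field scores first and then a weighted sum, could be sketched as follows. The field names (`keyword`, `author`), the dictionary layout, and the helper names `field_scores` and `document_score` are hypothetical and are chosen only to make the example concrete.

```python
import math


def cosine(a, b):
    """Cosine similarity of two sparse term vectors (term -> weight)."""
    dot = sum(w * b.get(t, 0.0) for t, w in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # assumption: an empty field matches nothing
    return dot / (norm_a * norm_b)


def field_scores(document, profile_fields):
    """One cosine score for each field that the profile defines."""
    return {
        name: cosine(document.get(name, {}), vector)
        for name, vector in profile_fields.items()
    }


def document_score(document, profile_fields, field_weights):
    """Sum of the field scores, each weighted by the profile's field weight."""
    scores = field_scores(document, profile_fields)
    return sum(field_weights[name] * score for name, score in scores.items())


# Hypothetical profile with two fields and their field weights.
profile_fields = {"keyword": {"retrieval": 2.0}, "author": {"smith": 1.0}}
field_weights = {"keyword": 0.7, "author": 0.3}

# Hypothetical document with the same field layout.
document = {
    "keyword": {"retrieval": 1.0, "agent": 1.0},
    "author": {"smith": 1.0},
}

print(field_scores(document, profile_fields))
print(document_score(document, profile_fields, field_weights))
```

Fields of different types are never compared with each other in this sketch; a keyword vector is only ever matched against a keyword vector.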
Scores of different fields are added together in the document score,
weighted by the field weights. However, this would tend to favor fields that
have arbitrarily high term weights, even when the field weights are low. Field
scores must not be compared or added unless they are all on a uniform scale.
This can be achieved by normalizing the field vectors. The scalar product of
two normalized vectors lies in the closed interval [-1, 1]. This constrains
the field scores to lie in the same interval.
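The effect of normalization can be seen in a small numerical sketch. The profile, document, and weights below are invented for illustration, and the helpers `dot` and `normalize` are assumed names rather than part of the original system.

```python
import math


def dot(a, b):
    """Scalar product of two sparse term vectors (term -> weight)."""
    return sum(w * b.get(t, 0.0) for t, w in a.items())


def normalize(vector):
    """Scale a term vector to unit length; an all-zero vector is returned as is."""
    length = math.sqrt(dot(vector, vector))
    if length == 0.0:
        return dict(vector)
    return {t: w / length for t, w in vector.items()}


# Hypothetical profile: the keyword field has a very large term weight
# but a low field weight; the author field is the other way round.
profile = {"keyword": {"retrieval": 50.0}, "author": {"smith": 1.0}}
field_weights = {"keyword": 0.1, "author": 0.9}
document = {"keyword": {"retrieval": 50.0}, "author": {"smith": 1.0}}

# Without normalization the raw scalar products are on different scales.
raw = {f: dot(document[f], profile[f]) for f in profile}

# With normalization every field score is a cosine in [-1, 1].
unit = {f: dot(normalize(document[f]), normalize(profile[f])) for f in profile}

for f in profile:
    print(f, field_weights[f] * raw[f], field_weights[f] * unit[f])
```

In the unnormalized case the keyword field dominates the sum despite its low field weight. After normalization both fields score 1 and the field weights alone decide their contributions.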
A side effect of normalization is that term weights cannot be
compared across fields. That is, if the weight of a keyword term
is higher than the weight of an author term,
it does not follow that the keyword
would contribute more to the document score than
the author. The
weight indicates the relative importance of a term with respect
to the other terms in the same field. Users creating profiles should
be made aware of this fact.
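A short sketch makes this point concrete. The two fields and their term weights are hypothetical, and `normalize` is an assumed helper that scales a term vector to unit length.

```python
import math


def normalize(vector):
    """Scale a non-zero sparse term vector (term -> weight) to unit length."""
    length = math.sqrt(sum(w * w for w in vector.values()))
    return {t: w / length for t, w in vector.items()}


# Hypothetical profile fields: every keyword has a larger raw weight (5.0)
# than the single author (1.0).
keyword_field = {"retrieval": 5.0, "agent": 5.0, "filtering": 5.0, "profile": 5.0}
author_field = {"smith": 1.0}

unit_keyword = normalize(keyword_field)
unit_author = normalize(author_field)

# After normalization each weight is relative to its own field only.
print(unit_keyword["retrieval"])
print(unit_author["smith"])
```

Although the keyword "retrieval" has five times the raw weight of the author "smith", its normalized component is the smaller of the two, because it shares its field with three other equally weighted terms.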
A similar problem arises with the document similarity scores. Each profile searches a different part of the database and scores different articles, so every similarity score is relative to the profile that produced it. When the agent collects the top-scoring articles retrieved by all the profiles, the scores cannot be compared unless they are on the same scale. Besides, a user would not be able to make sense of a similarity score if the scale is not known. Hence, document scores are constrained to lie in the closed interval [-1, 1].
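The highest score of 1 is assigned only when the document and profile
representations are identical.
Further, when fields are identical the field score will also be 1, because of
the definition of field similarities and the normalization constraint on the field vectors.
Together, these two conditions imply that the field weights of a profile must sum to 1:
a weighted sum of field scores that are all equal to 1 can itself equal 1 only if the weights add up to 1.
Hence, the normalization of the field vectors and the constraint on the field weights
together ensure that the field scores and document scores lie on convenient
and uniform scales.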
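The bound on the document score can be checked with a small sketch. It assumes that the field weights are non-negative, which the argument above needs but does not state explicitly, and it rescales them so that they sum to 1. The names `normalize_weights` and `combine` are hypothetical.

```python
def normalize_weights(field_weights):
    """Rescale non-negative field weights so that they sum to 1."""
    if any(w < 0 for w in field_weights.values()):
        raise ValueError("field weights are assumed to be non-negative")
    total = sum(field_weights.values())
    if total == 0:
        raise ValueError("at least one field weight must be positive")
    return {name: w / total for name, w in field_weights.items()}


def combine(field_scores, field_weights):
    """Document score as the weighted sum of the field scores."""
    return sum(field_weights[name] * score for name, score in field_scores.items())


# Hypothetical raw field weights entered by a user.
weights = normalize_weights({"keyword": 3.0, "author": 1.0})

# Field scores are cosines, so each lies in [-1, 1].
best = combine({"keyword": 1.0, "author": 1.0}, weights)     # identical fields
worst = combine({"keyword": -1.0, "author": -1.0}, weights)  # opposite fields
mixed = combine({"keyword": 0.4, "author": -0.2}, weights)

print(weights, best, worst, mixed)
```

Because the weights are non-negative and sum to 1, the document score is a weighted average of the field scores and so cannot leave the interval that contains them.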