
Efficiency Issues

As with any software system, a variety of performance tradeoffs had to be made during the course of developing Newt. A discussion of these issues and the lessons learned should be helpful when building similar systems in the future.

Not surprisingly, the most important tradeoff at the implementation stage was the classical space-time tradeoff. Recomputing information is time-intensive but saves storage; storing information saves computation time but requires substantial memory and disk space. By far the most time-intensive component of Newt is the computation of term weights for indexing documents. When several profiles each search a number of newsgroups, the same newsgroup may well be searched by more than one profile. If each profile computes the term weights independently, the work is repeated needlessly. If, instead, each index is stored once it has been computed, indexing and matching take much less time.
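A minimal sketch of what such index sharing might look like, assuming a simple tf-idf weighting and an in-memory cache keyed by newsgroup (the names build_index, get_index and the cache itself are illustrative, not part of Newt):

```python
# Sketch: cache per-newsgroup term-weight indexes so that several profiles
# searching the same newsgroup reuse a single computation (hypothetical).
import math
from collections import Counter

_index_cache = {}  # newsgroup name -> {article_id: {term: tf-idf weight}}

def build_index(articles):
    """Compute tf-idf term weights for every article in one newsgroup.
    `articles` maps article ids to raw text."""
    doc_freq = Counter()
    term_counts = {}
    for art_id, text in articles.items():
        counts = Counter(text.lower().split())
        term_counts[art_id] = counts
        doc_freq.update(counts.keys())
    n_docs = len(articles)
    return {
        art_id: {term: tf * math.log(n_docs / doc_freq[term])
                 for term, tf in counts.items()}
        for art_id, counts in term_counts.items()
    }

def get_index(newsgroup, articles):
    """Return a cached index if one exists; otherwise build and store it."""
    if newsgroup not in _index_cache:
        _index_cache[newsgroup] = build_index(articles)
    return _index_cache[newsgroup]
```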

As such, if there are enough users, economies of scale can be achieved by pre-processing all documents available in the database. Two specific forms of pre-processing could yield substantial benefits. The first is computing term frequencies once and storing them for the lifetime of the article. Since term counts depend only on the individual document, they need be computed only once, preferably when the article is introduced into the database. The second concerns the inverse document frequencies of newsgroups. These can be assumed not to change much as articles enter and leave the database: if the articles within a newsgroup typically share similar kinds of keywords, incoming articles compensate for outgoing ones. Inverse document frequencies for newsgroups could therefore be computed and maintained as approximations over long periods of time. In the context of USENET newsgroups, recomputing the inverse document frequencies once a week should be ``good enough''.
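The sketch below illustrates both suggestions under simplifying assumptions: term counts are stored once when an article arrives, and the per-newsgroup inverse document frequencies are recomputed only when they are more than a week old. The class name NewsgroupIndex and the refresh constant are hypothetical, not part of Newt:

```python
# Sketch: term counts are computed once at ingestion; the idf table is an
# approximation refreshed roughly once a week (all names hypothetical).
import math
import time
from collections import Counter

IDF_REFRESH_SECONDS = 7 * 24 * 3600  # recompute roughly once a week

class NewsgroupIndex:
    def __init__(self):
        self.term_counts = {}     # article_id -> Counter of term frequencies
        self.idf = {}             # term -> approximate idf value
        self.idf_computed_at = 0.0

    def add_article(self, article_id, text):
        """Count terms once, when the article enters the database."""
        self.term_counts[article_id] = Counter(text.lower().split())

    def remove_article(self, article_id):
        """Outgoing articles are assumed to roughly balance incoming ones,
        so the stored idf table is left untouched until the next refresh."""
        self.term_counts.pop(article_id, None)

    def idf_table(self):
        """Return the cached idf table, rebuilding it only when stale."""
        if time.time() - self.idf_computed_at > IDF_REFRESH_SECONDS:
            doc_freq = Counter()
            for counts in self.term_counts.values():
                doc_freq.update(counts.keys())
            n = max(len(self.term_counts), 1)
            self.idf = {t: math.log(n / df) for t, df in doc_freq.items()}
            self.idf_computed_at = time.time()
        return self.idf
```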

Such pre-processing would be economical if a large number of users used these indexes for personalized filtering. In fact, it might even be worth exploring the possibility of an index server, analogous to the NNTP (Network News Transfer Protocol) server. NNTP servers are the backbone of USENET: they store and forward articles and respond to user requests for reading them. A similar server could be implemented to provide article indexes rather than the articles themselves. These indexes could be matched against profiles to score articles, and once the top-scoring representations are found, the full articles could be retrieved from the NNTP server using their Message IDs.
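A rough sketch of how a client might use such an index server follows; the protocol, the index_server.get_indexes call, and nntp_server.fetch_article are hypothetical placeholders rather than an existing NNTP extension:

```python
# Sketch: score pre-computed term-weight vectors from a hypothetical index
# server, then fetch only the top-scoring articles by Message-ID from NNTP.

def score(profile_weights, article_weights):
    """Dot product between a profile's term weights and an article's."""
    return sum(w * article_weights.get(term, 0.0)
               for term, w in profile_weights.items())

def filter_newsgroup(index_server, nntp_server, newsgroup, profile, top_n=10):
    # 1. Ask the index server for {message_id: {term: weight}} -- only the
    #    indexes cross the wire at this stage, not article bodies.
    indexes = index_server.get_indexes(newsgroup)

    # 2. Score every article representation against the profile.
    ranked = sorted(indexes.items(),
                    key=lambda item: score(profile, item[1]),
                    reverse=True)

    # 3. Retrieve only the top-scoring articles from the NNTP server,
    #    addressed by Message-ID (placeholder retrieval call).
    return [nntp_server.fetch_article(message_id)
            for message_id, _ in ranked[:top_n]]
```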


