Charlotte April 1998

Tools for Information Resource Discovery on the WWW...

The first of a regular column on various aspects of information resource discovery, by Charlotte Jenkins, SEED Research Student: c.jenkins@scitsc.wlv.ac.uk

The above is my PhD project title. I am a research student at the University of Wolverhampton. My research group (Search Engine Evaluation & Design - SEED) is busy evaluating existing tools for information resource discovery and designing our own enhancements to the Wolverhampton Web Library (WWLib). WWLib is so called because it organises UK Web resources according to Dewey Decimal Classification (DDC). The existing WWLib was originally developed in 1995 in response to the poor response times, US bias and information overload of the big US search engines. My supervisor (Peter Burden - the original developer) happened to be school librarian at the time and he thought to himself "Isn't it clever how librarians organise books sharing the same subject onto the same shelves..." and he has been organising UK Web resources according to DDC ever since.

The problem with the original WWLib (aside from its uninspiring interface) is that it relies, to a large degree, on manual maintenance. We are currently redesigning a fully automated WWLib that will support a robot for resource discovery, an automatic indexer and an automatic classifier.

My initial task, as a member of the research group, was to survey existing tools and methodologies for locating information. The outcome of this survey can be found on my Searching the World Wide Web page and, in more detail, in my recent publication in IST (Searching the World Wide Web: An Evaluation of Available Tools and Methodologies - Jenkins, Jackson, Burden and Wallis, published in Information & Software Technology, Vol. 39, No. 14-15, 1998, by Elsevier).

It is interesting to observe the evolution of search engines. First there were hotlists that gradually became more and more classified until the first classified directories appeared. Classified directories could be browsed hierarchically or queried with a text string. Resources were found manually and organised according to a classification scheme. When it became apparent that there were too many Web sites to find manually, automated search engines appeared, with robots that automatically found and retrieved pages. Instead of relying on manual classification, these search engines automatically analyse the full text of documents and build huge indexes that can be queried with a text string. The tendency of these tools to inundate users with irrelevant results, however, has meant that they have never quite taken over completely. Yahoo, one of the earliest classified directories, is still respected for its ability to provide accurate, high quality results and an easy, intuitive, browsable interface for which no complex Boolean syntax is required. Now it seems that progress is back-tracking slightly in an attempt to find tools that are as accurate and intuitive as classified directories while being as comprehensive and up-to-date as automated search engines.
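
To make the idea of a full-text index that is "queried with a text string" a little more concrete, here is a minimal Java sketch of an inverted index: each word maps to the set of pages containing it, and a query returns the pages that contain every query word. It is only an illustration under simple assumptions (lower-cased words, no ranking, no stemming) and is not how any particular search engine is built; the class name, method names and example.org URLs are invented for the example.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

// Hypothetical illustration: a tiny in-memory inverted index.
final class InvertedIndexSketch {
    // word -> URLs of the documents that contain it
    private final Map<String, Set<String>> postings = new HashMap<>();

    // Index the full text of one document under its URL.
    void add(String url, String text) {
        for (String word : tokens(text)) {
            postings.computeIfAbsent(word, k -> new TreeSet<>()).add(url);
        }
    }

    // Return the URLs of documents containing ALL words in the query string.
    Set<String> query(String queryText) {
        Set<String> result = null;
        for (String word : tokens(queryText)) {
            Set<String> docs = postings.getOrDefault(word, Collections.emptySet());
            if (result == null) {
                result = new TreeSet<>(docs);
            } else {
                result.retainAll(docs);
            }
        }
        return result == null ? new TreeSet<>() : result;
    }

    // Split text into lower-case alphanumeric words, dropping empty pieces.
    private static List<String> tokens(String text) {
        List<String> words = new ArrayList<>();
        for (String w : text.toLowerCase(Locale.ROOT).split("[^a-z0-9]+")) {
            if (!w.isEmpty()) {
                words.add(w);
            }
        }
        return words;
    }

    public static void main(String[] args) {
        InvertedIndexSketch index = new InvertedIndexSketch();
        index.add("http://example.org/a", "Dewey Decimal Classification for UK Web resources");
        index.add("http://example.org/b", "Full text search engines and robots");
        System.out.println(index.query("dewey classification")); // prints [http://example.org/a]
    }
}
```

Because every page containing the query words is returned with no notion of subject, a sketch like this also hints at why full-text engines can swamp users with irrelevant results.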

I discovered a surprising number of subject-specific gateways when carrying out my survey. The ROADS approach stood out as being one good way of combining accuracy and intuitiveness with comprehensive Web coverage. While most of the other independent subject gateways solved the problem of irrelevant information overload, it seemed they could suffer from the same problem as the individual pages they referenced - i.e. being difficult to locate - and obviously, independent subject gateways do little to solve the problem of comprehensive coverage. ROADS gateways, in contrast, being part of a larger cross-compatible system, are both more visible and potentially comprehensive; and the manual maintenance by subject experts means that they provide access to high quality, accurately described information.

More recently I have been working on an automatic classifier that, it is hoped, will enable WWLib to preserve its classified nature while evolving into an automated search engine. I have written Java software which, given a URL, retrieves a document and assigns to it some (appropriate!) DDC classmarks. The challenge for any automated tool is to acquire metadata and classification information that is as accurate as that which could have been defined by a human expert.
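
As a rough illustration of what assigning DDC classmarks to a document can look like, the Java sketch below counts how many words of a text appear in a small term list attached to each classmark and returns the classmarks that reach a threshold. It is emphatically not the WWLib classifier, whose method is not described here: the class name, the threshold parameter and the term lists are all hypothetical, and only the classmarks themselves (020 library and information sciences, 510 mathematics, 780 music) are real DDC divisions. Retrieving the document from its URL is left out, so the text is passed in directly.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.Set;

// Hypothetical illustration: keyword-overlap assignment of DDC classmarks.
final class DdcClassifierSketch {
    // DDC classmark -> hand-picked terms (the term lists are assumptions).
    static final Map<String, Set<String>> CLASS_TERMS = new LinkedHashMap<>();
    static {
        CLASS_TERMS.put("020", Set.of("library", "librarian", "catalogue", "classification"));
        CLASS_TERMS.put("510", Set.of("mathematics", "algebra", "geometry", "theorem"));
        CLASS_TERMS.put("780", Set.of("music", "orchestra", "composer", "symphony"));
    }

    // Count, for each classmark, how many words of the text are in its term list.
    static Map<String, Integer> score(String text) {
        Map<String, Integer> scores = new LinkedHashMap<>();
        for (String word : text.toLowerCase(Locale.ROOT).split("[^a-z]+")) {
            for (Map.Entry<String, Set<String>> entry : CLASS_TERMS.entrySet()) {
                if (entry.getValue().contains(word)) {
                    scores.merge(entry.getKey(), 1, Integer::sum);
                }
            }
        }
        return scores;
    }

    // Return the classmarks with at least minHits matching words, in table order.
    static List<String> classify(String text, int minHits) {
        Map<String, Integer> scores = score(text);
        List<String> classmarks = new ArrayList<>();
        for (String classmark : CLASS_TERMS.keySet()) {
            if (scores.getOrDefault(classmark, 0) >= minHits) {
                classmarks.add(classmark);
            }
        }
        return classmarks;
    }

    public static void main(String[] args) {
        String text = "The school librarian explained library classification to the orchestra.";
        System.out.println(classify(text, 2)); // prints [020]
        System.out.println(classify(text, 1)); // prints [020, 780]
    }
}
```

The threshold shows the trade-off in miniature: set it too low and a passing mention of an orchestra earns a music classmark; set it too high and short pages receive no classmark at all.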

Further information regarding my progress with automatic classification can be found in a working paper entitled Automatic Classification of Web resources using Java and Dewey Decimal Classification, a short version of which I will be presenting at the 7th International World Wide Web Conference in Brisbane, and also at the Libraries on the WWW workshop.
