Contents
1 - Introduction to ROADS
1.1 - The scope of this document
1.2 - What is an information gateway?
1.3 - What is ROADS?
1.4 - What does ROADS do?
1.5 - An introduction to the ROADS data format
2 - ROADS in practice
2.1 - Creating Records
2.2 - Editing Records
2.3 - Subject Listings
2.4 - What's New listings
3 - Customising ROADS
3.1 - Customising the record templates
3.2 - Customising the data entry forms
3.3 - Customising the classification scheme
3.4 - Customising which record fields are displayed and searchable
3.5 - Customising Subject listings and What's New listings
3.6 - Customising the Search Forms
4 - ROADS technical issues
4.1 - Introduction
4.2 - A File System Based Inverted Index
4.3 - ROADS Inter-operability
4.4 - WHOIS++ and Centroids
5 - Further information
5.1 - Pointers to further information
5.2 - Contact information
5.3 - ROADS v0 and v1
5.4 - Acronyms
5.5 - Glossary
This document provides a guide to the purpose and features of the ROADS system, and how it will develop in the future, rather than a detailed instruction manual. The remainder of section 1 provides a basic introduction to the concepts that underlie ROADS, aimed at those who have little or no knowledge of ROADS or information gateways. Section 2 is intended to provide a feel for how ROADS operates in practice, and section 3 provides a detailed description of how ROADS can be customised to suit your application. Sections 2 and 3 are aimed at those who may have some experience of maintaining or using networked information, and are considering using ROADS to build an information gateway (or those who have just read Section 1). Section 4 provides technical background information, and is aimed at those with a more technical interest in ROADS. Section 5 is a list of references and pointers to further information.
At its simplest, an information gateway is just a list of internet resources. On the WWW, this may be simply a list of resource titles, with links to the resources themselves. If the list of links is large, the gateway may be divided into often arbitrary categories. An example is Galaxy <http://galaxy.einet.net/galaxy.html>. While these gateways are often very useful, they are subject to some limitations. They often include little information on the resource, and as the subject categories may be arbitrary, browsing them may be a hit and miss affair.
Search engines such as Lycos <http://lycos11.lycos.cs.cmu.edu/lycos-form.html> can also be useful in finding resources. These services are useful because they are very comprehensive, and they often include excerpts from resources. However, the very comprehensiveness of these services can be a disadvantage, as searches can often bring back hundreds of irrelevant documents, and as the excerpts are extracted automatically, they often make little sense.
ROADS (Resource Organisation And Discovery in Subject-based services) is a set of software tools and standards designed to help set up and maintain information gateways to all kinds of Internet resources, including WWW sites, Telnet based services, FTP sites, mailing lists etc. ROADS is designed to overcome the problems described in section 1.2; a ROADS information gateway allows resources to be fully described or abstracted, and classified according to a recognised classification scheme, allowing resources to be located much more efficiently. It is designed to be used by the relevant subject specialists, rather than computer specialists, and as a ROADS based gateway is maintained by a human subject expert, subject irrelevant resources can be excluded.
Underlying a ROADS system is a database of resource descriptions. The records in this database can store a wide range of information about a resource, including title, description, keywords, URLs, classification information and the maintainer or administrator of the resource. The ROADS software tools use the information in the database to automatically build a set of WWW pages allowing the information in the database to be browsed, with each resource described and organised under subject headings according to its classification information. Further ROADS tools allow the database of resource descriptions to be created and maintained from WWW forms, and provide a web-based search mechanism as an alternative access route for users. The browsing and search pages are highly configurable, both in appearance and functionality, allowing the subject gateway owners to control the identity of the service and what it offers its users.
ROADS is more than a set of software tools; it is also a standards track for the future of information about Internet resources (often referred to as metadata). The records in a ROADS database are based on a format called the IAFA template, which in turn works with a search and retrieve protocol called WHOIS++. This means that a user of one ROADS based information gateway will be able to search other ROADS based information gateways from a single search form. In this way ROADS can form the basis of a distributed database of resource descriptions, with subject specialists, rather than computer scientists (or machines), in charge of each branch of the database.
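The tools provided by ROADS fall into three main areas: the creation and maintenance of records in a database of resource descriptions, the automatic creation of a set of WWW pages from those records, and a search engine.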
The first area, the creation and maintenance of records in a database of resource descriptions, is all handled through a set of WWW forms. These forms allow you to create new records, or edit existing records, by specifying the unique "handle" (identifier) of the record you wish to edit. You can select the type of resource you wish to create a record for (e.g. service, document, image etc.) and the number of field clusters required for information such as resource maintainer, of which there may be more than one (for more information see section 1.5).
Figure 1: The record creation form
You are then presented with a form containing boxes for all the fields in the record. You simply fill in the data as required (see Figure 1). In ROADS v1, the last section of the form contains options to update the Subject and What's New listings (see sections 2.3 and 2.4). The exact way in which the data entry form is presented is fully customisable, allowing, for example, a simplified form for novices to be created (see section 3.2).
The second area is the automatic creation of a set of WWW pages using the information in the database records. As mentioned above, with the appropriate options selected on the record creation form, a resource will be automatically added to the appropriate subject listing (see Figure 2) at the time of creation. Alternatively, you can run a script from the Unix command line which will add them to the subject listings without going through the record creation/editing form. ROADS will also create a What's New list. The appearance of the subject listings is fully customisable (see section 3.5).
Figure 2: An example subject listing for Economics from SOSIG.
The third group of tools make up the search engine. The search engine allows the user of a ROADS based information gateway to query the database that underlies the system. Simple keyword search terms can be used, or sophisticated Boolean search terms (see Figure 3). The format of the search form and the options that appear on it are customisable (see section 3.6). One important option you can offer your users is the choice of whether to display the full resource description (see Figure 4), or just the title. You can also customise which fields are accessible to the user (for example, you may not wish your users to be able to search on certain administration fields, see section 3.4). A modified version of the search engine is provided for the system administrator to find records for editing when the unique handle is not known.
Figures 3 and 4: A search screen and sample search results.
It is important to note that ROADS is a set of software tools, rather than a monolithic software package. This means that it is possible to use just the parts of the system that you require, or use the whole set as a package. For example, you can create a record by hand in a standard text editor, rather than use the ROADS template management tools, or you can use alternative technology for search and retrieval (see section 4.3). In future versions of the ROADS software, users will be able to search across multiple ROADS based information gateways from a single search form. This will be useful for multidisciplinary searches.
ROADS is based on a database of resource descriptions. Each record in the database is a separate plain text file, and these plain text files are formatted according to IAFA templates. There is a collection of IAFA templates for a range of internet resources such as services, documents, images, software, mail archives etc. You are also free to create your own templates, or modify existing ones, though you are encouraged to use the standard templates as far as possible for the sake of compatibility with other information gateways.
The IAFA template organises the record into "attribute value pairs", which are the equivalent of fields in standard database terminology. The attribute value pairs consist of an attribute name to the left of a colon, and an attribute value to the right of a colon. For example:
Title: CTI Centre for Economics Home Page
There are three kinds of attributes in an IAFA template: plain attributes, variant attributes, and cluster attributes. Plain attributes describe the basic characteristics of a resource, such as Title or Description. They contain information about a resource which is only required once. Variant attributes are repeated for each version of a resource if there are multiple versions. Language and URI are examples: if a document is available in French and English, there must be two sets of variant attributes, so the language and URI of each version can be recorded. Variant attributes appear in a record thus:
Language-v1: English
Language-v2: French
URI-v1: http://www.example.ac.uk/documents/english/
URI-v2: http://www.example.ac.uk/documents/french/
Every time an individual or organisation occurs in a record there are a number of common data elements required to describe them, e.g. name, address, telephone number, e-mail address. These logically grouped data elements are termed clusters. Clusters are defined in separate templates, the fields from which are imported into another record. The two most common clusters are User and Organisation. These are prefixed by an attribute name that defines the role of the cluster. For example, a user cluster that describes the author of a document will appear in the template as:
Author-Name-v1:
Author-Work-Phone-v1:
Author-Work-Fax-v1:
Etc.
The variant number indicates that the cluster may appear more than once.
A complete template may look like this (note that unused fields are not shown):
Template-Type: SERVICE
Handle: 805990087-28320
Title: Electronic Green Journal
URI-v1: http://gopher.uidaho.edu:70/1/UI_gopher/library/egj
Admin-Name-v1: Maria Jankowska
Admin-Email-v1: majank@uidaho.edu
Description: The Electronic Green Journal is a professional refereed publication devoted to disseminating information concerning sources on international environmental topics including: pollution, resources, technology and treatment. The journal is academically sponsored; however the focus is to publish articles, bibliographies, reviews and announcements for the educated generalist as well as the specialist. It began publication in June 1994 and is produced on an irregular basis.
Keywords: environmental issues, green politics, development studies, environment, sustainable development
Subject-Descriptor-v1: 551.588 330.342
Subject-Descriptor-Scheme-v1: UDC
Record-Last-Modified-Date: Wed, 12 Mar 1996 11:06:43 +0000
For information on creating and modifying templates see sections 2.1 and 2.2.
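The attribute value format described above lends itself to simple line-based parsing. The following is a minimal Python sketch, not the ROADS implementation: the function name `parse_record` and the rule for spotting variants (a trailing `-v` followed by a number) are assumptions made for illustration, and the sketch does not handle values continued over several lines.

```python
import re
from collections import defaultdict

# Assumption: a variant attribute is any name ending in "-v" plus a number.
VARIANT_SUFFIX = re.compile(r"^(?P<name>.+)-v(?P<num>\d+)$")


def parse_record(text):
    """Split a record into plain attributes and numbered variant groups."""
    plain = {}
    variants = defaultdict(dict)  # variant number -> {attribute name: value}
    for line in text.splitlines():
        if not line.strip():
            continue
        # Split on the first colon only, so values such as URLs stay intact.
        name, _, value = line.partition(":")
        name, value = name.strip(), value.strip()
        match = VARIANT_SUFFIX.match(name)
        if match:
            variants[int(match.group("num"))][match.group("name")] = value
        else:
            plain[name] = value
    return plain, dict(variants)


SAMPLE = """\
Template-Type: SERVICE
Handle: 805990087-28320
Title: Electronic Green Journal
URI-v1: http://gopher.uidaho.edu:70/1/UI_gopher/library/egj
Admin-Name-v1: Maria Jankowska
Subject-Descriptor-v1: 551.588 330.342
Subject-Descriptor-Scheme-v1: UDC
"""

plain, variants = parse_record(SAMPLE)
print(plain["Title"])
print(variants[1]["URI"])
```

Grouping by variant number keeps the language, URI and cluster fields of each version together, which is how the `-v1`, `-v2` suffixes are used in the examples above.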
As we have already discussed, a ROADS information gateway is based on a database of records containing resource descriptions. You can write these records manually using a text editing program, but it is recommended that you use the tools which ROADS provides to help you to create and maintain this database. The first group of tools (described earlier in section 1.4) make up a system for creating and maintaining records, based on WWW forms.
Using these tools has many advantages. WWW based forms allow people to contribute to your database from anywhere in the world. The tools can also help by automatically filling in certain attributes, such as the record Handle (the unique identifier for the record), and the date and time the record was created. Attributes can also be defined as mandatory, have a list of possible values defined, or have default values set (see section 3.2).
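As a rough illustration of the checks just described, the Python sketch below applies default values, mandatory checks and lists of permitted values to submitted form data, then fills in a handle and creation date. Everything in it is hypothetical: the `RULES` table, the `prepare_record` function, the `Record-Created-Date` attribute name and the handle scheme (seconds since the epoch plus a process id) are assumptions for illustration and are not taken from the ROADS software.

```python
import os
import time

# Hypothetical per-attribute rules; ROADS keeps its real rules in template outline files.
RULES = {
    "Title": {"mandatory": True},
    "Template-Type": {"mandatory": True, "allowed": ["SERVICE", "DOCUMENT", "IMAGE"]},
    "Subject-Descriptor-Scheme-v1": {"default": "UDC"},
}


def prepare_record(fields, rules=RULES, now=None, pid=None):
    """Return (record, errors) after applying defaults, checks and auto-filled values."""
    now = time.time() if now is None else now
    pid = os.getpid() if pid is None else pid
    record = dict(fields)
    errors = []
    for name, rule in rules.items():
        # Fill in a default only when the contributor left the attribute empty.
        if not record.get(name) and "default" in rule:
            record[name] = rule["default"]
        if rule.get("mandatory") and not record.get(name):
            errors.append(f"{name} is mandatory")
        allowed = rule.get("allowed")
        if allowed and record.get(name) and record[name] not in allowed:
            errors.append(f"{name} must be one of {', '.join(allowed)}")
    # Auto-filled attributes: the handle format here is an illustrative guess.
    record.setdefault("Handle", f"{int(now)}-{pid}")
    created = time.strftime("%a, %d %b %Y %H:%M:%S +0000", time.gmtime(now))
    record.setdefault("Record-Created-Date", created)
    return record, errors


record, errors = prepare_record({"Title": "Electronic Green Journal", "Template-Type": "SERVICE"})
print(record["Handle"], errors)
```

Returning the errors alongside the record, rather than raising on the first problem, lets a form report every missing or invalid attribute in one pass.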
The first record creation screen allows you to select whether you are creating a new record, or editing an existing one (see section 2.2). It also lets you select the type of record you wish to create. When these options are selected and the "submit selection" button is pressed, a second screen is presented that allows you to select the number of clusters and variants you wish to appear in the template (see section 1.4). When the submit selection button is pressed again, the form that initiates the creation of a record is displayed. The text boxes are simply filled in as appropriate.
The record creation process allows for a number of options. The record text can be returned to the screen (useful for checking the record before submitting it to the database), emailed to the database administrator, or entered into the database. Two further options let you select whether the resource should be added to the Subject listings and the What's New listing at this stage (see sections 2.3 and 2.4).
Once a record is created, you will need to edit it to keep it up to date. To edit a record you can select the edit option from the main template creation screen, and enter the handle of the record you wish to edit. As the handles are rather long, and deliberately have no semantic meaning, this can often be difficult. Fortunately ROADS provides another means of locating records for editing. This is provided through a modified administrator's version of the search engine. A search is entered in the same way as normal. The search results have a button after each resource that will display the record creation form with the fields already filled in, ready to be edited.
In order to allow users of a ROADS based information gateway to browse the database of resource descriptions, ROADS provides tools to create a set of subject listings. These consist of a top level listing of all the subject headings, with further pages listing all the resources that come under a particular subject heading. The resource listing can contain links to both the resource itself and the corresponding resource description.
In ROADS v1, you can tell ROADS to insert a resource in the appropriate subject listing at the point of record creation simply by checking this option at the bottom of the record creation form (see section 3.1). In ROADS v0 or v1, you can run the subject listing tool from the Unix command line. This is useful if you have a body of existing templates you wish to include in your ROADS based information gateway.
ROADS does this by using the information in the subject-descriptor attribute. This is used in combination with a classification scheme mapping file. A typical line from the UDC (Universal Decimal Classification scheme) mapping file that comes with ROADS as standard might look as follows:
33:Economics:econ
This means that whenever the subject-descriptor attribute contains 33, that resource will be entered into the Economics subject listing, which is contained in a file called econ.html.
It is possible to enter more than one class number in the subject-descriptor field, in which case the resource will be entered into more than one subject listing. This is useful for multidisciplinary resources. It is also possible to have more than one set of subject descriptor fields. This means that resources can be classified under more than one classification scheme.
It is possible to build a mapping file around an established classification scheme, such as UDC or NLM (National Library of Medicine), or it is possible to create a scheme from scratch to suit your purposes (see section 3.3). It is also possible to customise the way the subject listings look (see section 3.5).
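The mapping from class numbers to subject listings can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the ROADS code: the function names are invented, the `55:Earth Sciences:earth` line is a made-up second entry, and the matching rule (a class number belongs to a listing when it begins with the mapped number) is one plausible reading of "contains 33" above.

```python
def parse_mapping(text):
    """Parse 'class:heading:file' lines into {class number: (heading, file stem)}."""
    mapping = {}
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        number, heading, stem = line.split(":", 2)
        mapping[number] = (heading, stem)
    return mapping


def listings_for(descriptor, mapping):
    """Return the (heading, filename) pairs a subject-descriptor value maps to."""
    results = []
    for class_number in descriptor.split():
        for prefix, (heading, stem) in mapping.items():
            # Assumption: prefix match between class number and mapped number.
            if class_number.startswith(prefix):
                entry = (heading, stem + ".html")
                if entry not in results:
                    results.append(entry)
    return results


# The second line is a hypothetical entry added for the example.
MAPPING = parse_mapping("33:Economics:econ\n55:Earth Sciences:earth\n")
print(listings_for("551.588 330.342", MAPPING))
```

Because the descriptor is split on whitespace, a record carrying several class numbers ends up in several listings, as described above for multidisciplinary resources.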
ROADS also lets you produce What's New listings automatically, in much the same way as the subject listings are produced. You can do this either by selecting the appropriate option at the end of the record creation form, or by executing a command at the Unix command prompt. The format of the What's New listings can also be customised (see section 3.5).
Most aspects of ROADS are customisable. This means that you can decide exactly what your users will see, and will not see. This section outlines how you can configure ROADS to your own requirements.
ROADS comes with a series of standard IAFA templates for the following types of resources:
Each template is defined in an outline file that lists the attributes a template contains. It is possible to simply add attributes to an outline file, or create completely new templates from scratch. It is important to bear in mind, however, that ROADS is all about standards. Keeping the templates standard will make it much easier to exchange and distribute data at a later date. If you find that you require a new attribute in one of the templates, it is a good idea to post to the ROADS discussion list, open-roads (see section 5).
The template outline files also let you customise how each attribute is treated by the data entry form. You can determine the size of the text input box, set either default values or lists of possible values, and make the attribute either mandatory or optional.
ROADS v1 also makes it possible to define several different "views" of a template. You can define a simple view that contains only the essential attributes that must be filled in, and more comprehensive views that include all attributes. If more than one view has been defined for a template, then you will be asked to select which view you wish to use before moving to the data entry form.
This is particularly useful if you will be asking a wide variety of people to create records. The standard IAFA templates contain attributes to cover all situations. This means the templates, and therefore the data entry form, are very long, and many of the attributes are not required for the majority of resources. To the inexperienced eye, a data entry form containing all the possible attributes can appear very daunting. It is therefore very useful to be able to create a more friendly form with a more manageable number of attributes.
As we have seen in section 2.3, the classification scheme is very important in ensuring a resource appears in the right subject listing. The classification scheme is contained in a plain text file. ROADS ships with a UDC (Universal Decimal Classification) scheme containing social science headings, but you can create a new classification scheme based either on an established classification scheme, or one of your own devising, to suit your application. Classification schemes already being used by ROADS information gateways include the NLM (National Library of Medicine) classification scheme used by OMNI.
ROADS v1 also allows you to control which attributes are made available to the user, for browsing and searching. Two lists of fields are maintained, one for users of the subject gateway, and one for administrators. The same lists apply to both searching and displaying attributes, so a field that is displayable is automatically searchable, and vice versa.
As an example, a typical subject gateway might display the title, description, keywords and URL fields for the user. Another might include the fields that contain the email of the owner of a resource, or those that contain access and charging policy if this is appropriate. However, it is unlikely that the user will be allowed to either view or search on administration fields, such as fields that contain comments, discussion, quality ratings or last modified dates. The subject gateway administrator is, on the other hand, likely to want to search and view all fields.
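The idea of one field list per audience, shared by display and search, can be shown with a short Python sketch. The field lists, the `Comments` attribute name and both function names are hypothetical; ROADS keeps its own configuration, whose format is not described here.

```python
# Hypothetical field lists; a real gateway would choose its own.
USER_FIELDS = {"Title", "Description", "Keywords", "URI-v1"}
ADMIN_FIELDS = USER_FIELDS | {"Handle", "Comments", "Record-Last-Modified-Date"}


def visible(record, fields):
    """Keep only the attributes that the given audience may see."""
    return {name: value for name, value in record.items() if name in fields}


def matches(record, term, fields):
    """Search only the visible attributes, so displayable and searchable stay in step."""
    term = term.lower()
    return any(term in value.lower() for value in visible(record, fields).values())


record = {
    "Title": "Electronic Green Journal",
    "Comments": "needs review",
    "Handle": "805990087-28320",
}
print(visible(record, USER_FIELDS))
print(matches(record, "review", USER_FIELDS), matches(record, "review", ADMIN_FIELDS))
```

Routing the search through the same filter as the display is what guarantees that a field is searchable exactly when it is displayable.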
ROADS v1 can also customise the resource listings that the ROADS software automatically generates. This covers both Subject listings and What's New listings, both of which are handled in the same way.
HTML template files are created for the listing of subject headings, the resource listings and the What's New list. The template files for the resource listings and the What's New listings contain both standard HTML tags, and special ROADS tags that allow the subject gateway administrator to define where the links to the resource itself, and the resource description, will appear.
HTML template files are also used to define how the Search forms appear. For example, it is possible to have two search forms, one simplified one with a limited set of options, and an extended form with all the options present. The subject gateway administrator can also use the HTML template to define default values for the various options on the search form.
This section discusses some of the technical issues behind ROADS, including how ROADS will develop into the future. This section is not necessary to understand the basics of how ROADS works, but is provided here as a simple introduction to these issues for those who are interested.
Section 4.2 deals with the inverted index of the database that ROADS builds to allow for fast searching. Sections 4.3 and 4.4 deal with how ROADS will develop in the future, to form the basis of a distributed database of Internet resource descriptions. ROADS v0 implements the IAFA template as the first step towards this goal. ROADS v1 will implement the WHOIS++ search and retrieve protocol as the second step. In addition ROADS v1 will implement an Applications Programming Interface (API) which will allow a back-end interface to an alternative database to be written. ROADS v2 will implement the final element, centroids, which will allow distributed databases to be efficiently searched with a single search term.
The ROADS indexing system uses the UNIX file system as a means of maintaining an inverted index of the database of resource descriptions. Inverted indexes are a common technique for indexing full text databases, such as WAIS.
Inverted index structure
The file structure of the inverted index is shown in figure 5. The root directory of the index contains a number of sub-directories. These directories have two-letter names, known as bigraphs, derived from the first two letters of each word in the index. For example, the word "adam" will be indexed in a directory called "ad". In each of these directories is a file for each word in the index.
This file is a plain text file which contains a list of resource descriptions which contain the word the file refers to, and the location of the record within the file structure. These files are named after the word they refer to, for example: resources containing the word "adam" will be listed in a file called "adam".
Figure 5: Inverted index structure
Searching the inverted index
Searching for a single search term is carried out using the following process:
The result is a list of resource description handles and locations. This information can then be used to retrieve the data from the resource descriptions, which is then built into an HTML page to be presented to the user.
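The directory layout described above can be imitated with a small Python sketch. It is an illustration only: the function names, the one-line-per-record file format ("handle location") and the decision to skip one-letter words are assumptions, and the real index is described in the technical report cited in this section.

```python
import re
import tempfile
from pathlib import Path


def index_record(root, handle, location, text):
    """Add every word of a record to the index under its two-letter directory."""
    # Assumption: words are runs of letters and digits, at least two characters long.
    for word in set(re.findall(r"[a-z0-9]{2,}", text.lower())):
        directory = Path(root) / word[:2]
        directory.mkdir(parents=True, exist_ok=True)
        # One plain text file per word; each line names a record and where it lives.
        with open(directory / word, "a", encoding="utf-8") as word_file:
            word_file.write(f"{handle} {location}\n")


def lookup(root, term):
    """Return (handle, location) pairs for records containing the term."""
    word = term.lower()
    path = Path(root) / word[:2] / word
    if not path.is_file():
        return []
    lines = path.read_text(encoding="utf-8").splitlines()
    return [tuple(line.split(" ", 1)) for line in lines]


with tempfile.TemporaryDirectory() as root:
    index_record(root, "805990087-28320", "records/805990087-28320", "Adam Smith and economics")
    print(lookup(root, "adam"))
```

A lookup touches only one directory and one file, so the file system itself does the work that a database index would otherwise do.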
This is a simplified explanation of how the inverted index works. For a full explanation, see "A File System Based Inverted Index" (LUT CS-TR 996) by Jon P. Knight and Martin Hamilton <URL:http://www.roads.lut.ac.uk/Reports/fileindex/fileindex.html>.
ROADS v1 will be provided with an Applications Programming Interface (API) which can be used to interface with an alternative backend database. This may prove useful for those who wish to use a relational database management system (RDBMS). For further details, please contact roads-liaison@bristol.ac.uk.
Work has also been undertaken to investigate interoperability between the WHOIS++ and Z39.50 search and retrieve protocols. ROADS v2 will be able to query multiple WHOIS++ servers, some of which may actually be WHOIS++ gateways to Z39.50 servers. In this way, ROADS can access data that is being made available with Z39.50.
A centroid is essentially a simple inverted index. It is designed to be shared amongst servers in a network environment to provide hints to the location of data in large, loosely coupled distributed databases. The centroid is constructed by removing duplication from a set of data.
The process of creating a centroid from a particular set of resource descriptions essentially consists of looking through all of the resource descriptions, and creating a list of unique words for each attribute. This means that only the first instance of a word is recorded, even though it may occur many times in an attribute. Thus the centroid for a typical set of resource descriptions will be much smaller than the set of resource descriptions itself. When a centroid is created, not all attributes in the set of resource descriptions need be included, for example those attributes that are only of interest to subject gateway administrators may be omitted.
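The construction just described amounts to collecting a set of unique words per attribute. A minimal Python sketch follows; the function name, the in-memory record format (a dictionary per record) and the word-splitting rule are assumptions for illustration rather than the WHOIS++ definition.

```python
import re


def build_centroid(records, attributes):
    """Collect the unique words found under each chosen attribute across all records."""
    centroid = {name: set() for name in attributes}
    for record in records:
        for name in attributes:
            words = re.findall(r"[a-z0-9]+", record.get(name, "").lower())
            centroid[name].update(words)  # a set keeps one copy of each word
    return centroid


records = [
    {"Title": "Electronic Green Journal", "Keywords": "green politics", "Comments": "checked"},
    {"Title": "Green Economics", "Keywords": "economics"},
]
# Administrative attributes such as the hypothetical "Comments" are simply left out.
print(build_centroid(records, ["Title", "Keywords"]))
```

Only the attributes named in the call appear in the result, which mirrors the option of omitting attributes that matter only to gateway administrators.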
Once a server has created centroids for its database, it can share them with other servers in the network. This can be accomplished in two ways. Firstly a server can have a list of other servers with which it should share its centroids. It can then connect to each server in turn and "push" its centroids to those servers. Secondly, a server can have a list of servers from which it will receive centroids. It can then regularly connect to each server in turn and "pull" centroids from them. A single server may use a combination of both methods.
When the end user of a subject gateway enters a search term in a WWW form and clicks on submit, the search term is passed to a program on the WWW server that acts as a WHOIS++ client. The program will be configured to talk to the subject gateway's WHOIS++ server in the first instance. If the subject gateway's WHOIS++ server contains resource descriptions which match the search term, they will be passed back to the program to be converted into HTML search results.
If the WHOIS++ server has matches in the centroids it has received from other WHOIS++ servers, a referral to those servers is then passed back to the program. If the program is configured to do so, it will automatically send the search term to those WHOIS++ servers. Alternatively, when the program receives referrals to other servers, the end user can be asked which of these servers s/he wishes to query.
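The decision a server makes for each query, answering from its own records and referring the client elsewhere when a centroid matches, can be modelled in Python as below. This is an in-memory sketch only: it does not speak the WHOIS++ protocol, and the function name, server names and data structures are invented for the example.

```python
def search(term, local_records, remote_centroids):
    """Return (local matches, servers to refer the query to) for one search term."""
    word = term.lower()
    matches = [
        handle
        for handle, text in local_records.items()
        if word in text.lower().split()
    ]
    # A centroid hit is only a hint: the remote server must still be queried.
    referrals = sorted(
        server
        for server, centroid in remote_centroids.items()
        if any(word in words for words in centroid.values())
    )
    return matches, referrals


local = {"h1": "Economics of pollution", "h2": "Medical imaging"}
remote = {
    "medical.example": {"Title": {"medical", "pollution"}},  # hypothetical servers
    "law.example": {"Title": {"law"}},
}
print(search("pollution", local, remote))
```

A client could then either query each referred server automatically or present the list to the end user, as described above.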
This is a very short and simplified explanation of centroids. For a fuller explanation, see "Overview of the ROADS software" (LUT CS-TR 1010) by Jon P. Knight and Martin Hamilton <URL:http://weeble.lut.ac.uk/Reports/arch/arch.html>.
The membership of the open-roads mailing list is open to anyone with an interest in ROADS. You can join open-roads by sending an email to:
open-roads-request@mrrl.lut.ac.uk
The body of the message should consist of the word "subscribe" alone. The open-roads archives are available at:
http://weeble.lut.ac.uk/lists/open-roads/
A WWW site is available which provides further information on the ROADS project. All the technical reports to which reference is made in section 4 are available from this site. The URL is:
http://ukoln.bath.ac.uk/roads/
For more information about ROADS, you can contact the project by email at:
roads-liaison@bristol.ac.uk
You can also contact us at:
Address: ROADS Project
Centre for Computing in the Social Sciences
University of Bristol
8 Woodland Road
Bristol BS8 1TN
Tel: 0117 928 8478
Fax: 0117 928 8473
V0.2.5 is the current version of ROADS. It is available from the ROADS WWW site. ROADS v1 is currently in alpha testing. It is expected to be made available in beta test form by Autumn 1996.
| API | Applications Programming Interface |
| FAQ | Frequently Asked Questions |
| FTP | File Transfer Protocol |
| HTML | HyperText Mark-up Language |
| IAFA | Internet Anonymous FTP Archive |
| NLM | National Library of Medicine classification scheme |
| OMNI | Organising Medical Networked Information |
| RDBMS | Relational DataBase Management System |
| ROADS | Resource Organisation And Discovery in Subject-based services |
| SOSIG | Social Science Information Gateway |
| UDC | Universal Decimal Classification scheme |
| URI | Universal Resource Identifier |
| URL | Uniform Resource Locator |
| WAIS | Wide Area Information Servers |
| WWW | World Wide Web |
| Attribute | The basic unit of data in a ROADS database record. Consists of an attribute name to the left of a colon, and an attribute value to the right. The equivalent of a field in standard database terminology. |
| Boolean | A system of algebra concerning the two truth values, TRUE and FALSE and the functions AND, OR and NOT. It is used to construct complex search terms. |
| Cluster | A group of attributes that are required each time an object such as an organisation is described in a record, eg name, address, email etc. |
| Field | The equivalent of an attribute in ROADS terminology. |
| File Transfer Protocol | A client-server protocol which allows a user on one computer to transfer files to and from another computer over the Internet. Also the client program the user executes to transfer files. |
| Handle | A unique identifier given to each record in a ROADS database |
| Internet resource | An information object, or collection of objects, accessible through the internet. |
| Mailing List | An e-mail address that is an alias which is expanded to yield many other e-mail addresses. Email sent to the mailing list is therefore sent to everyone on the list. |
| Metadata | Data about data. Data describing aspects of actual data items, such as author, language, location etc. Jack Myers, founder of Metadata Information Partners, coined the term in the early 1960's. |
| Plain attribute | An attribute used to contain the basic characteristics of a resource, that is, information only required once, such as Title or description |
| Protocol | A set of formal rules describing how to transmit data, especially across a network. |
| Resource description | A term used within ROADS to refer to a database record containing a description of an Internet resource. |
| Search term | A text string used to interrogate a database. This may range from a single word, to complex search terms based on Boolean Logic. |
| Subject gateway | An information gateway that focuses on a particular subject area |
| Subject list | A ROADS term to describe a WWW page containing a list of resources that have been classified under the same subject heading. The Subject lists are accessed from a list of subject headings. |
| Telnet | The Internet standard protocol for remote login. It allows the user to login to a remote computer, and use the remote computer as if it were on the user's own desk. |
| Template | A text file used to define the structure of a record in a ROADS database. It contains a list of the attributes that will appear in the record. |
| URI | The generic set of all names and addresses referring to objects on the WWW |
| URL | A draft standard for specifying an object on the Internet, such as a file or newsgroup. URLs are used extensively on the World Wide Web. They are used in HTML documents to specify the target of a hyperlink. |
| Variant attribute | An attribute that is required for each instance of a resource where multiple versions of the same resource exist, eg where a resource is available in two languages. |
| What's New List | A WWW page containing a list of resources newly added to a ROADS database |
| WHOIS++ | A search and retrieve protocol designed to work with IAFA templates. Will be used in ROADS versions 1 and 2 |
| World Wide Web | An Internet client-server hypertext distributed information retrieval system which originated from the CERN High-Energy Physics laboratories in Geneva, Switzerland. |
| WWW form | A WWW page that allows the user of WWW browsing software (eg Netscape) to send information to the WWW server. The form can contain text input boxes and a number of other elements such as check boxes and list boxes. |
| Z39.50 | A search and retrieve protocol designed to work with MARC records. |
Copyright
This document is copyright of the ROADS Project, University of Bristol.
Material in this document may be copied and/or adapted for bona fide academic purposes within UK Higher Education institutions as long as this copyright notice is included in any such copy or adaptation. Any charge for the supply or use of any copy should not exceed the direct production costs. Any other use is prohibited without the prior permission of the copyright owner, the ROADS Project, University of Bristol.