Integrated Access to Hybrid Information Resources
Tamar Sadeh
SIMULTANEOUS SEARCHING
Simultaneous searching refers to a process in which a user submits a single query to numerous information resources at once. The resources can be heterogeneous
in many aspects: they can reside in various places, offer information in various formats, draw on various technologies, hold
various types of materials, and more. The user's query is broadcast to each resource, and results are returned to the user.
The development of software products that offer such simultaneous searching relies on the fact that each information resource
has its own search engine. The simultaneous searching product transmits the user's query to that search engine and directs
it to perform the actual search. When the simultaneous searching software receives the results of the search, it displays
them to the user.
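The broadcast step described above can be sketched in a few lines. This is only an illustration under stated assumptions: the two search functions are hypothetical stand-ins for each resource's own search engine, and the names `RESOURCES` and `broadcast` are not taken from any product.

```python
from concurrent.futures import ThreadPoolExecutor

# Hypothetical stand-ins for the search engines of two resources.
def search_catalog(query):
    return [f"catalog record for {query}"]

def search_radio_archive(query):
    return [f"radio recording for {query}"]

RESOURCES = {
    "catalog": search_catalog,
    "radio archive": search_radio_archive,
}

def broadcast(query, resources=RESOURCES):
    """Send the same query to every resource in parallel; group results by resource."""
    with ThreadPoolExecutor(max_workers=len(resources)) as pool:
        futures = {name: pool.submit(fn, query) for name, fn in resources.items()}
        return {name: future.result() for name, future in futures.items()}

results = broadcast("Henrik Ibsen")
```

Keeping the results grouped by resource name matches the display by resource that the example later in this section relies on.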
Simultaneous searching is also known as integrated searching, metasearching, cross-database searching, parallel searching, broadcast searching, and federated searching. [1] MetaLib, the library portal from Ex Libris, provides such simultaneous searching with its Universal Gateway component. In this paper, we shall refer to these systems
as metasearch systems.
Let's take a look at an example of a metasearch process that a user carries out via MetaLib or a similar product.[2]
A student is interested in the works of Henrik Ibsen. Since the student knows that Ibsen is Norwegian, she submits a search
query in several Norwegian resources that she knows about, such as the catalog of the National Library of Norway, the catalog
of the University of Oslo, and several archives maintained by the National Library of Norway - the television, radio, and newspaper
archives. The student submits the query author = Henrik Ibsen to all these information resources. She then receives the results. If they are displayed by resource, she can easily pick
out the results that seem most relevant. Let's say that one result from the television archive is a program about the play
Peer Gynt, written by Ibsen. Looking at this record, the student decides that she can focus solely on the work Peer Gynt rather than all of Ibsen's works. She then uses additional functions of the system to submit a second query, title = Peer Gynt, to the same information resources. This time she receives different results, including the Peer Gynt Suite, composed by Edvard Grieg - a result from the radio archive that she did not obtain earlier. However, the Ibsen play A Doll's House, from the catalog of the University of Oslo, did not come back this time, although it was on the previous result list.
Let's take the process one step further and ask another question: How did the student know of the resources relevant to her
research? Of course, she could have been knowledgeable in this area and aware of pertinent resources. If she was just starting
out, however, perhaps she concentrated on the default resources that her library has set, on the basis of her group affiliation,
as a component of its gateway. Alternatively, she might have requested resources relevant to her subject, a specific geographic
region, a certain type of material, and so on, thus creating a personal searching scope maintained by the system and available
for reuse. In MetaLib, such functionality is provided through the Information Gateway component.
ONE-STOP SHOPPING
Most researchers today deal with content residing in a wide range of materials. For example, our student might want to access
materials such as the script of the play in book form or PDF file, literary analyses of the play, various recordings of the
suite, the score of the suite, or a video or poster of a specific performance. The immediate search result is typically a
bibliographic record or other form of metadata describing the actual material. From the end user's perspective, the bibliographic
records serve only as a means of obtaining the material itself. Users do not want to be bothered with technical issues such
as the format of the material they seek and the software that they need to access it - the library OPAC, Adobe® Acrobat® Reader®, Microsoft® Word or PowerPoint®, MP3, the MrSid viewer, or any other software that handles specific types of files.
To provide users with convenient access to materials contained in a range of resources, multiple software products need to
be integrated, and they should offer a seamless interface to users. The first type of information typically presented to users
as a search result is a description of the material - the metadata - such as a bibliographic record representing a video.
Ideally, the user should see the material on her screen - in this case, a video - without having to concern herself about
how to find the actual material and how to view it.[3]
The link from the bibliographic record to the actual material can be direct, an explicit URL embedded in the metadata, as
in the MARC 856 field of a bibliographic record in library catalogs. However, in many instances, the system must perform calculations
to create the link - for example, when the bibliographic record resides in one information repository, such as an abstracting
and indexing database, but the actual material resides elsewhere, such as in an e-Journal repository or the library's printed
collection. The user expects to reach the actual material nevertheless. A library can make this possible by configuring a
context-sensitive linking server, such as the Ex Libris SFX server (Van de Sompel & Beit-Arie, 2001), that links the user
to the actual material as a part of a set of extended services and onward navigation options. Such links include the appropriate
copy of an article, the holdings in the user's library OPAC or any other relevant OPAC, the institution's document delivery
service, citation information, a periodical directory, Internet searches, and information about the book in Internet bookstores
or content-based services such as those offered by Syndetics. The software determines the list of links on the basis of the
information in the specific bibliographic record and the institution's subscriptions and policies as predefined by the librarians.
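How a linking server might derive such a list can be illustrated roughly as follows. The sketch does not reproduce the logic of SFX or any other product; the record fields, the institution settings, and the example.org addresses are all hypothetical.

```python
def build_link_menu(record, institution):
    """Return (label, target) pairs for the services that apply to one record."""
    links = []
    issn = record.get("issn")
    subscriptions = institution["full_text_subscriptions"]
    # Offer full text only if the institution subscribes to the journal.
    if issn and issn in subscriptions:
        links.append(("Full text", subscriptions[issn]))
    # Offer a holdings lookup whenever the record carries a standard number.
    if issn or record.get("isbn"):
        links.append(("Library holdings", institution["opac_url"]))
    # Offer document delivery if the librarians have configured it.
    if institution.get("document_delivery_url"):
        links.append(("Document delivery", institution["document_delivery_url"]))
    return links

# Hypothetical institution settings and record.
institution = {
    "full_text_subscriptions": {"1234-5678": "http://ejournals.example.org/1234-5678"},
    "opac_url": "http://opac.example.org/search",
    "document_delivery_url": "http://library.example.org/docdel",
}
record = {"title": "Dreams", "issn": "1234-5678"}
menu = build_link_menu(record, institution)
```

The same record would yield a shorter menu at an institution without the subscription, which is the sense in which the linking is context-sensitive.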
RESOURCE DISCOVERY AND INFORMATION DISCOVERY
The process of finding relevant materials for research falls, therefore, into two stages. First is the resource discovery
phase, when the user locates the resources most relevant to the specific search. Next comes the information discovery phase,
when the search is executed in the various information resources and the results are retrieved. Institutions strive to provide
their members - students, staff, and researchers - with high quality resources that offer information of real value. It is
up to the librarians to determine what constitutes the institution's collections, both physical and virtual, and set the collections'
boundaries. Every member of the institution should be able to define a personal scope that derives from the institution's
scope.
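A personal scope derived from the institution's scope can be pictured as a filter over resource descriptions. The following sketch assumes a simple, hypothetical `Resource` record with subjects and material types; real products hold far richer descriptive metadata.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Resource:
    name: str
    subjects: frozenset
    material_types: frozenset

# Hypothetical institutional scope as selected by the librarians.
INSTITUTION_SCOPE = [
    Resource("National library catalog", frozenset({"literature", "music"}), frozenset({"book", "score"})),
    Resource("Radio archive", frozenset({"music", "drama"}), frozenset({"audio"})),
    Resource("Medical database", frozenset({"medicine"}), frozenset({"article"})),
]

def personal_scope(resources, subject=None, material_type=None):
    """Keep only the resources that match the user's stated interests."""
    return [
        r for r in resources
        if (subject is None or subject in r.subjects)
        and (material_type is None or material_type in r.material_types)
    ]

scope = personal_scope(INSTITUTION_SCOPE, subject="music")
```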
Once the user sets the scope of the search and submits a query, the information discovery phase begins. The metasearch system
delivers the query to the selected information resources and returns the results to the user. The process requires that the
system 'understands' the expectations of the resources regarding the form of the query, on the one hand, and the nature of
the results, on the other. It is up to the system to convert the unified query and adapt it to the requirements of each searched
resource, deliver the query in the form appropriate to each resource, receive the results, and manipulate them so that they
comply with the system's unified format.
RESOURCE METADATA
The first question, therefore, is which resources are available and which of those are appropriate for the institution. No
software can replace librarians when it comes to an understanding of the scholarly information arena; only they can select
the resources that are appropriate and affordable for their institution. However, the selection of a resource is just the
first step. Information about the resource, resource metadata, is necessary as well. The metasearch software needs to obtain descriptive metadata about the resource, such as its coverage
and the types of materials that it offers, and make it available to end users so that they can make a knowledgeable decision
about the relevance of the resource to their needs. Furthermore, the system needs technical metadata regarding its impending
interaction with the resource.
Resource metadata can be made available in several ways:
• Resources can offer their metadata to any metasearch system that attempts to access them for the purpose of information retrieval.
• A central repository can offer resource metadata to any metasearch system.
• Metasearch systems can maintain their own repository of resource metadata.
The first method - that a resource describes itself when relevant - seems the best. Resources provide the most accurate information
about themselves, information that other repositories need not replicate. As a matter of fact, the Z39.50 Explain function was based on this premise. The idea was that when external software needed to access an information resource,
the software would extract the details of the impending interaction from the resource on the fly and use the information to
formulate the exact steps of the interaction. Apparently, few vendors implemented the Z39.50 Explain function, and those who
did implemented it in a variety of forms. The Semantic Web approach takes the idea one step further. With this approach, a typical metasearch process involves an interaction between
agents that exchange requests and information to construct the final product, which is the information requested by the end
user. This is the vision, but today's Web does not allow for such interaction between agents, and, therefore, an automated
interaction between the metasearch system and a resource's own search engine cannot be achieved at the present time (Sadeh
& Walker, 2003).
The second method - building and maintaining a central repository - is under discussion by the new NISO metasearch committee,
MetaSearch Initiative, which was formed in early 2003. Maintaining a central repository would assure the availability of resource metadata but
would pose new challenges. First, a decision would need to be made about which kinds of resources such a repository would
store. Then a format for the resource metadata would need to be specified, as well as protocols dictating the manner in which
resource metadata find their way to and from the repository. Finally, a decision would have to be made about who is responsible
for storing information in the repository and keeping it updated - the repository, by means of harvesting programs, or the
resource itself. Another undertaking similar to that of the NISO committee is the Information Environment (IE) Service Registry pilot project, driven by MIMAS, in the UK, in collaboration with UKOLN and the University of Liverpool. The purpose of the project is to provide a registry of IE collections and services and examine the feasibility of such a
registry in terms of discovery, access, maintenance, sustainability, ownership, and scalability. The information science community
is watching these initiatives with interest to see whether such repositories become comprehensive and robust enough to provide
services as necessary.
The third method is one that various current metasearch products have already implemented. Each such product holds the metadata,
both descriptive and technical, of all the resources that it can access. Products differ in the amount of descriptive metadata
that they release to the end user and the way in which they display it. They also differ in the degree to which they implement
the search interaction and hence vary in the amount of technical metadata that they store.[4] The method whereby each metasearch system maintains information about the resources has many drawbacks. The most obvious
one is that every vendor of a metasearch system has to configure and maintain the resource metadata. Handling such a repository
requires considerable effort and therefore depends on the capabilities of the individual vendor.
MetaLib, like other products, provides a repository that includes the metadata of all the resources that it can access. However,
the metadata are not maintained as part of the software but stored in the MetaLib Knowledge Base, a repository of resource
data and rules. The software itself does not include any information that relies on specific resources: it extracts the information
from the Knowledge Base. This information enables the user to select the resources and the MetaLib Information Gateway to
perform the actual search and retrieval. If, in the future, one of the first two options regarding the origin of the resource
metadata materializes, MetaLib will only need to extract the required metadata from another repository.
THE METALIB KNOWLEDGE BASE
The MetaLib Knowledge Base is a proprietary repository provided to institutions along with the MetaLib software. The Knowledge
Base holds two types of metadata about resources:
• Descriptive metadata, such as the resource's name, coverage, language, data types, and publisher. The user sees this information
and, with it, can make a sensible selection of resources. It is the same information that enables the system to create resource
lists based on the user's specifications and display them in a comprehensive way. In short, this information serves the resource
discovery phase described earlier.
• Technical metadata, such as the type of protocol that the resource supports, the cataloging format it uses, and the physical
and logical structure of the records that it retrieves. We can describe this information as rules that define the flow, interface,
and manner of searching and that the software uses for searching, retrieving the results, and manipulating them - that is,
for the information discovery phase.
The resource metadata in the MetaLib Knowledge Base can be divided into global metadata and local metadata:
• Global metadata are that part of the resource metadata that is universal and does not depend on the implementation of MetaLib
at a specific institution. These metadata include the name of the resource owner, the coverage, and the interfacing rules.
• Local metadata are institution-specific; they relate to the way in which the resource is used in the institution's environment
and presented to the institution's members. Such metadata include elements of authentication vis-à-vis the provider of the resource, the authorization rules that apply to it within the institution, and the categorization
information that the institution uses to enable the software to offer the resource in specific contexts. For instance, one
institution might categorize a certain resource under Medicine, whereas an institution with a different orientation might categorize it under Social Studies.
Ex Libris maintains a master Knowledge Base, which is copied to every MetaLib installation. Automated routines ensure that
the Knowledge Base at each installation is updated as necessary. Institutions localize the relevant metadata and add configurations
to local resources.
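The split between global and local metadata can be modelled as two records joined per resource. The field names below are assumptions made for illustration and do not describe the actual structure of the MetaLib Knowledge Base.

```python
from dataclasses import dataclass, field

@dataclass
class GlobalMetadata:
    """Universal part, identical at every installation."""
    name: str
    owner: str
    coverage: str
    protocol: str  # one of the interfacing rules

@dataclass
class LocalMetadata:
    """Institution-specific part."""
    categories: list
    credentials: dict = field(default_factory=dict)
    authorized_groups: list = field(default_factory=list)

@dataclass
class ResourceEntry:
    global_part: GlobalMetadata
    local_part: LocalMetadata

def localize(master, local_settings):
    """Pair each master entry with the institution's own settings for it."""
    return {
        key: ResourceEntry(master[key], local_settings[key])
        for key in master
        if key in local_settings
    }

# Hypothetical master copy and two institutions that categorize it differently.
master = {"exampledb": GlobalMetadata("Example Database", "Example Publisher", "1990-", "Z39.50")}
institution_a = localize(master, {"exampledb": LocalMetadata(categories=["Medicine"])})
institution_b = localize(master, {"exampledb": LocalMetadata(categories=["Social Studies"])})
```

Both institutions share the same global part, so an update to the master copy does not disturb the local categorization.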
SEARCHING AND RETRIEVING
The process of searching and retrieving in a heterogeneous environment is far from trivial. Each resource has its own expectations
regarding the form and manner in which it receives queries; even if the resource supports a standard interface, such as the
Z39.50 protocol, the metasearch system needs to make further adjustments so that the resource's engine will interpret the
query correctly.
The types of information that the Knowledge Base maintains to enable the system to search include the following examples:
• Access mode: What kind of interfacing protocol does the resource employ? Is it a structured, documented interface, such as
Z39.50, the PubMed Entrez protocol, or a proprietary XML gateway? Or is it an unstructured HTTP protocol that dictates the
use of HTML parsing techniques to access the resource?
• Password control: How does the user access a specific, licensed resource? Are a user ID and password required, which the metasearch
system delivers when the connection is established? Should the software redirect the query via a proxy to grant the user access?
• URL creation: If a URL needs to be formulated to hold the specific query, what should the structure of the URL be?
• Character conversion: What character set does the system use at the resource end? Does the character set comply with that
of the end user?
• Query optimization: How should the query be structured?
1. What is the exact syntax that the resource's system expects?
2. How should fields be mapped to the fields of the resource; for example, to which field should the system map the "author"
field selected by the user for a specific query?
3. How does the system expect to receive an author's name? Should it be
<last name><,><first name>; <last name>< ><first initial>; or in some other format?
• Normalization: What should the system do when the search engine at the resource end does not support a specific type of search?
For instance, what rules should be applied if the user looks for a specific subject but a certain resource does not support
a search by subject?
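Two of the rule types listed above, field mapping with a normalization fallback and author-name formatting, might be expressed as follows. The configuration keys and the native field codes (`AU`, `TI`, `KW`) are hypothetical; an actual knowledge base would define its own.

```python
# Hypothetical per-resource configuration.
RESOURCE_CONFIG = {
    "field_map": {"author": "AU", "title": "TI", "keyword": "KW"},
    "fallback_field": "keyword",   # used when a search type is unsupported
    "author_style": "last_initial",
}

def format_author(last, first, style):
    """Render an author's name in the form the resource expects."""
    if style == "last_comma_first":
        return f"{last}, {first}"
    if style == "last_initial":
        return f"{last} {first[0]}"
    raise ValueError(f"unknown author style: {style}")

def map_field(field, config):
    """Map a unified field to the resource's field, falling back if unsupported."""
    mapping = config["field_map"]
    if field in mapping:
        return mapping[field]
    return mapping[config["fallback_field"]]

author = format_author("Schredl", "Michael", RESOURCE_CONFIG["author_style"])
subject_field = map_field("subject", RESOURCE_CONFIG)
```

Falling back to a keyword search is only one possible normalization rule; another resource might be better served by skipping the unsupported search altogether.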
Once the information is there, the metasearch system can indeed adapt a single, unified query to the requirements of the specific
resource, as in the following example.
The user submits a query for title = dreams and author = Schredl, Michael in the following resources:
• Library of Congress (Z39.50 access to Endeavor's Voyager ILS)
• NLM PubMed (the Entrez HTTP protocol)
• HighWire Press® (HTML parsing)
• Ovid MEDLINE® (Z39.50 access via the SilverPlatter ERL platform)
• University of East Anglia (XML access to the Ex Libris ALEPH ILS)
Even when looking at one brick of the process structure - the query syntax - we can clearly see the differences between the
resources:
• The Library of Congress expects this query string: 1=Schredl, Michael AND 4=dreams
• PubMed expects this query string: term=dreams+AND+Schredl+M
• HighWire expects to see the encoded form of the following URL: author1=Schredl,+Michael&author2=&title=dreams
• Ovid's MEDLINE via ERL, although accessed by the same protocol (Z39.50), expects this query string: 1003=Schredl-M* AND 4=dreams (Note the phrasing of the author's name.)
• The ALEPH system at UEA expects the following encoded request: wau=(Schredl, Michael) AND wti=(dreams)
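As a rough illustration, the five query strings listed above can be generated from one unified query by a small formatter per resource. The function names are invented for this sketch, and the formatters reproduce only the strings as printed in this paper (so the HighWire string is shown before final encoding); they make no claim about the full query grammar of any resource.

```python
from urllib.parse import urlencode

def loc_query(last, first, title):
    return f"1={last}, {first} AND 4={title}"

def pubmed_query(last, first, title):
    # urlencode turns spaces into "+", as in the PubMed string above.
    return urlencode({"term": f"{title} AND {last} {first[0]}"})

def highwire_query(last, first, title):
    # safe="," keeps the comma readable, matching the form shown in the text.
    return urlencode({"author1": f"{last}, {first}", "author2": "", "title": title}, safe=",")

def ovid_query(last, first, title):
    return f"1003={last}-{first[0]}* AND 4={title}"

def aleph_query(last, first, title):
    return f"wau=({last}, {first}) AND wti=({title})"

FORMATTERS = {
    "Library of Congress": loc_query,
    "PubMed": pubmed_query,
    "HighWire": highwire_query,
    "Ovid MEDLINE": ovid_query,
    "UEA ALEPH": aleph_query,
}

# One unified query, adapted to each resource.
queries = {name: fn("Schredl", "Michael", "dreams") for name, fn in FORMATTERS.items()}
```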
PRESENTATION OF SEARCH RESULTS
Up to now we have discussed only the flow from the user to the resource. However, now that the query has been processed, the
metasearch system needs to get back to the user with search results. Typically the interaction between the metasearch system
and the resource consists of two phases. The first occurs after the search has been invoked: the resource returns the number
of hits and some kind of reference to the result set. This phase is important because it gives the user some information about
the search and enables the user to refine the query before browsing through the results. For instance, if a user sees that
there are thousands of hits, she can modify the query to be more specific and thus reduce the number of results. The second
phase consists of retrieval: The metasearch system retrieves the number of hits along with the first few records for each
resource. This information is shown to the user instantly, even though the query might result in hundreds or thousands of
hits. Some systems, including MetaLib, allow for further retrieval upon request.
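The two-phase interaction can be sketched as a result set that knows the hit count from the start but fetches records only in small batches. The `fetch` callable is a hypothetical stand-in for whatever retrieval request a resource supports, and the batch size of ten is an arbitrary choice.

```python
class ResultSet:
    """Holds the hit count and retrieves records from one resource on demand."""

    def __init__(self, fetch, total_hits, batch_size=10):
        self._fetch = fetch            # hypothetical: (start, count) -> list of records
        self.total_hits = total_hits   # known after the first phase
        self.batch_size = batch_size
        self.records = []

    def fetch_more(self):
        """Retrieve the next batch, if any records remain at the resource."""
        remaining = self.total_hits - len(self.records)
        count = min(self.batch_size, remaining)
        if count > 0:
            self.records.extend(self._fetch(len(self.records), count))
        return self.records

# A fake resource holding 25 records.
all_records = [f"record {i}" for i in range(25)]
result_set = ResultSet(lambda start, count: all_records[start:start + count], total_hits=25)
first_page = list(result_set.fetch_more())
```

Calling `fetch_more()` again corresponds to the further retrieval upon request mentioned above.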
Why do the systems provide such limited retrieval initially? First, retrieval depends on the use of networks, which are still
not as rapid as one would like. Retrieving hundreds or thousands of records over a network is an extremely time-consuming
process, and users are not likely to wait until it is completed. Second, people have difficulty handling immense result sets;
after seeing the number of hits for each resource, users are likely to refine their query to obtain fewer hits. Once retrieved
from the resource, each result is converted to a unified format before the user sees it. The rules that define the manipulation
of the retrieved data are part of the resource metadata, which, in MetaLib, is stored in the Knowledge Base. These rules include
information about the logical format, the cataloging format, the script, and the structure of certain fields, such as the
citation field. For further processing to take place, the metasearch system must be able to apply these rules and convert
all retrieved records, regardless of their origin.
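Conversion to a unified format can be pictured as applying a per-resource table of rules to each retrieved record. In this sketch the resource names and the flattened native keys are assumptions; the keys loosely echo MARC tags (245 for title, 100 for author, 260 for publication date) and typical field codes of an article database.

```python
# Hypothetical conversion rules: unified field -> native field, per resource.
RULES = {
    "marc_catalog": {"title": "245a", "author": "100a", "year": "260c"},
    "article_db": {"title": "TI", "author": "AU", "year": "PY"},
}

def to_unified(raw_record, resource):
    """Convert one native record into the unified format."""
    rules = RULES[resource]
    return {
        unified: raw_record.get(native, "").strip()
        for unified, native in rules.items()
    }

catalog_record = to_unified({"245a": "Peer Gynt ", "100a": "Ibsen, Henrik", "260c": "1867"}, "marc_catalog")
article_record = to_unified({"TI": "Dreams", "AU": "Schredl M"}, "article_db")
```

Because every record ends up with the same keys, later steps such as merging and de-duplication need no knowledge of where a record came from.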
Such additional processing can include the unified display of the records to end users; the merging of result lists from heterogeneous
resources into one list; the comparison of records to eliminate duplicates; the creation of an OpenURL to allow context-sensitive
reference linking; and the saving of records in whatever format is required. Consequently, functionality that might have been
missing from the native interface of the resource, such as the provision of an OpenURL, is added to the same set of records
by the metasearch system. However, the display of result lists is not as straightforward as might be expected. Users are well
acquainted with Web search engines and therefore have solid expectations regarding the display. They would like their results
ranked, merged into one list, and filtered for a selected resource. Furthermore, they would like to be able to sort results
by various attributes, such as title, author, and date.
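Two of the processing steps named above, duplicate elimination and OpenURL creation, can be sketched on unified records. The matching key used here (normalized title plus year) is a deliberately naive assumption, the resolver address is hypothetical, and the parameter names follow the key/value style of the early OpenURL syntax.

```python
from urllib.parse import urlencode

RESOLVER_BASE = "http://resolver.example.org/sfx"  # hypothetical link server address

def dedup_key(record):
    """Naive match key: title reduced to letters and digits, plus the year."""
    title = "".join(ch for ch in record["title"].lower() if ch.isalnum())
    return (title, record.get("year"))

def deduplicate(records):
    """Keep the first record seen for each match key."""
    seen = set()
    unique = []
    for record in records:
        key = dedup_key(record)
        if key not in seen:
            seen.add(key)
            unique.append(record)
    return unique

def make_openurl(record):
    """Build a link that hands the record's metadata to the link server."""
    params = {"genre": "article", "atitle": record["title"], "date": record["year"]}
    if record.get("issn"):
        params["issn"] = record["issn"]
    return RESOLVER_BASE + "?" + urlencode(params)

records = [
    {"title": "Dreams and Sleep", "year": "2001", "issn": "1234-5678"},
    {"title": "dreams and sleep.", "year": "2001"},
]
unique = deduplicate(records)
link = make_openurl(unique[0])
```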
Given that only the first results are retrieved from the various resources, these expectations are not so easily satisfied.
When the result sets are small, all records are in the system's cache memory and so the metasearch system can offer the expected
functionality in a comprehensive manner. However, the larger the number of hits, the greater the value of merging, sorting,
de-duplication, and ranking - and the more difficult these features are to provide. Consider, for instance, the merging of
the lists. How should it be done? The number of hits may vary considerably from resource to resource. Would it be appropriate
to merge the two hits received from one resource with the dozens or hundreds of hits received from another resource? And if
so, in which order? Every resource returns results in a different sorting order - by date (ascending or descending), title,
relevance, or another attribute of which the users are not necessarily aware. Because only the first records are retrieved,
the issue of merging the results needs careful consideration.
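One possible merging policy, shown here purely as an illustration and not as the approach of any particular product, is to interleave the lists round-robin so that a resource with two hits is not buried beneath a resource with hundreds.

```python
from itertools import zip_longest

_MISSING = object()  # marks exhausted lists during interleaving

def interleave(result_lists):
    """Merge per-resource lists round-robin, labelling each record with its source."""
    labelled = [
        [(name, record) for record in records]
        for name, records in result_lists.items()
    ]
    merged = []
    for group in zip_longest(*labelled, fillvalue=_MISSING):
        merged.extend(item for item in group if item is not _MISSING)
    return merged

merged = interleave({"A": ["a1", "a2"], "B": ["b1", "b2", "b3", "b4"]})
```

Interleaving preserves each resource's own order but says nothing about relative relevance across resources, which is exactly the difficulty raised above.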
Other issues are the sorting capability and relevance ranking that users expect to find when looking at results, even when
the resource itself does not support such functionality. Does it make sense to rank and sort only those results that have
been retrieved? Let's say that the metasearch system applies certain relevance-ranking algorithms to all retrieved records
and sequences them accordingly in the display to the user. This display can be rather misleading, because the 'best' hits
are not necessarily those that were retrieved first. It could well be that if the user asks for more hits, better results will
be retrieved. A similar problem applies to sorting: even though a system might enable the user to sort the records according
to various parameters, this sorting would apply only to the set already retrieved.
MetaLib handles these issues by always allowing the user to see the results for each resource. If the resource supports sorting,
the user can request that the result list be sorted. Then MetaLib submits the search to this resource again, asking that the
entire set of results be arranged in the order requested by the user. Hence, the user indeed receives the first records of
the whole set. MetaLib also enables users to explicitly request and obtain a merged set at any point. Such a set is already
de-duplicated and sortable. Institutions are likely to limit the number of records that can be merged, to avoid lengthy waiting
periods caused by the retrieval of large result sets.
LOCAL REPOSITORIES AND LOCAL INDEXES
End users may wonder why other searching systems, primarily the Web search engines, are able to provide them with large sets
that are merged and ranked. The reason is that these systems use a different type of technology to provide the users with
search results. Metasearch systems are based on 'just-in-time' processing. The system does not maintain any indexes of its
information landscape locally; only when the information is required does the system access the various resources to obtain
the results. The approach of Web search engines is based on 'just-in-case' technology. Huge efforts are invested in preparing
the information prior to users' requests so that when the information is needed, it is obtained immediately. Google, for example, holds indexes for the entire World Wide Web, including not only pointers to sites but also information
that enables the search engine to evaluate the relevance ranking of a site. When the user searches with Google, only the indexes
are scanned - and the information that Google initially displays on the screen is not from the sites themselves but from this
vast repository of indexes. The search engine provides the actual access to a certain Web location only when the user selects
it from the list. Needless to say, huge computing power and disk space along with sophisticated technologies for harvesting,
evaluating, and maintaining the information are necessary for such powerful tools.
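The 'just-in-case' idea can be illustrated with a toy inverted index: all the work of reading the documents happens before any query arrives, and a search touches only the index. The documents and function names below are invented for the sketch, and it leaves out everything that makes a Web-scale engine hard, such as harvesting and relevance ranking.

```python
from collections import defaultdict

def build_index(documents):
    """Prepare, ahead of any query, a map from each word to the documents containing it."""
    index = defaultdict(set)
    for doc_id, text in documents.items():
        for word in text.lower().split():
            index[word].add(doc_id)
    return index

def search(index, query):
    """Answer a query from the index alone; the documents are not consulted."""
    words = query.lower().split()
    if not words:
        return set()
    result = set(index.get(words[0], set()))
    for word in words[1:]:
        result &= index.get(word, set())
    return result

documents = {
    1: "Peer Gynt a play by Henrik Ibsen",
    2: "Peer Gynt Suite by Edvard Grieg",
    3: "A Doll's House by Henrik Ibsen",
}
index = build_index(documents)
hits = search(index, "peer gynt")
```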
The use of local repositories of indexes in the library environment started some time ago. As opposed to union catalogs, which
actually replicate the information that is located in local catalogs, repositories such as MetaIndex from Ex Libris hold only
the indexes to the bibliographic materials that are kept in the resources. An example is the MetaIndex implementation at the
Cooperative Library Network Berlin-Brandenburg (KOBV), which preceded the metasearch systems a few years back: At KOBV, MetaIndex
enables each of the consortium members to maintain its library system and cataloging conventions while the consortium provides
a single search interface for end users. MetaIndex has now become a resource available to MetaLib at KOBV, along with other
resources. There is no doubt that a local repository of indexes has many advantages. Information that is gathered and processed prior
to queries can be organized, evaluated, and de-duplicated and therefore can be accessible to end users in a rapid and comprehensive
manner. However, maintaining such a repository has a major drawback: the repository is another system, with hardware and software,
to create and maintain, and personnel must be available to take care of it.
Considering libraries' budget constraints and limitations in the technical expertise available to them, a combination of just-in-case
and just-in-time approaches would be optimal for metasearch systems. Local repositories would be useful in the following cases:
• When no searching mechanism exists at the resource end. This situation is typical of various types of local repositories,
such as those that hold research papers written by institution members or spreadsheets relevant to institutional activities;
but it could obviously apply to any other data that have not yet been made available to the public.
• When the information is scattered. A local repository may be worthwhile if several resources that are mutually compliant form
a single resource of value to the institution. For example, a worldwide organization that has dozens of branches, each of
which holds regionally relevant information, wants to provide a simultaneous search capability that will cover all the local
information. Creating an index such as MetaIndex would be preferable to requiring users to search all the repositories simultaneously.
• When the interface is not reliable. Some institutions want to provide access to resources that are not always online or do
not offer reliable networking for accessing them. In such cases, an institution might be better off harvesting the information
and keeping it as a local repository.
• When preprocessing is important. Preprocessing tasks such as relevance ranking and the elimination of duplicate records can
be of value for some institutions. However, a component like MetaIndex can provide a solution only if the search scope is
defined and limited. For instance, at KOBV, the consortium catalogs represent a limited search scope; as a result, the mathematics
department of the consortium was able to develop a sophisticated de-duplication algorithm that permitted the construction
of a comprehensive MetaIndex component.
MetaIndex from Ex Libris is created through the harvesting of information from other repositories. One of the harvesting mechanisms
is the Open Archives Initiative Protocol for Metadata Harvesting (OAI-PMH). The use of such harvesting protocols can facilitate the gathering of data and is applicable to a wide range of
resources that are now becoming OAI compliant. Furthermore, MetaIndex itself can become OAI compliant, thus serving as both
a resource for MetaLib and an OAI-compliant resource that enables other systems to harvest the data from it.
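For readers unfamiliar with OAI-PMH, a harvesting request is an HTTP request carrying a verb such as ListRecords, and the response is an XML document from which the harvester extracts the metadata. The sketch below builds such a request and parses a small hand-written response; the repository address and the sample record are hypothetical, no network access takes place, and the code is not a description of how MetaIndex harvests.

```python
import xml.etree.ElementTree as ET
from urllib.parse import urlencode

DC_TITLE = "{http://purl.org/dc/elements/1.1/}title"  # Dublin Core title element

def list_records_url(base_url, metadata_prefix="oai_dc", resumption_token=None):
    """Build a ListRecords request; a resumption token continues an earlier list."""
    if resumption_token:
        params = {"verb": "ListRecords", "resumptionToken": resumption_token}
    else:
        params = {"verb": "ListRecords", "metadataPrefix": metadata_prefix}
    return base_url + "?" + urlencode(params)

def parse_titles(xml_text):
    """Collect the Dublin Core titles found in a ListRecords response."""
    root = ET.fromstring(xml_text)
    return [element.text for element in root.iter(DC_TITLE)]

# Hand-written sample response from a hypothetical repository.
SAMPLE_RESPONSE = """<OAI-PMH xmlns="http://www.openarchives.org/OAI/2.0/">
  <ListRecords>
    <record>
      <header><identifier>oai:example.org:1</identifier></header>
      <metadata>
        <oai_dc:dc xmlns:oai_dc="http://www.openarchives.org/OAI/2.0/oai_dc/"
                   xmlns:dc="http://purl.org/dc/elements/1.1/">
          <dc:title>Peer Gynt</dc:title>
        </oai_dc:dc>
      </metadata>
    </record>
  </ListRecords>
</OAI-PMH>"""

url = list_records_url("http://repository.example.org/oai")
titles = parse_titles(SAMPLE_RESPONSE)
```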
SUMMARY
The promise of a truly integrated environment in a heterogeneous world may not yet be a reality, but with the active involvement
of all the stakeholders, significant progress has been made. Just a few years back, metasearch systems seemed like a dream;
today they are already a building block in the information resource environment serving the academic and research community.
REFERENCES
Sadeh, T. & Walker, J. (2003). "Library portals: toward the semantic Web". New Library World, 104(1/2), 11-19.
WEB SITES REFERRED TO IN THE TEXT
Ex Libris. http://www.aleph.co.il/
Information Environment (IE) Service Registry. http://www.mimas.ac.uk/iesr/
MIMAS - Manchester Information & Associated Services. http://www.mimas.ac.uk/
NISO MetaSearch Initiative. http://www.niso.org/committees/metasearch-info.html
The Open Archives Initiative Protocol for Metadata Harvesting. http://www.openarchives.org/OAI/openarchivesprotocol.html
SemanticWeb.org - The Semantic Web Community Portal. http://www.semanticweb.org/
UKOLN. http://www.ukoln.ac.uk/
University of Liverpool. http://www.liv.ac.uk/
Z39.50. http://lcweb.loc.gov/z3950/agency/
NOTES
[1] The term federated searching is used by some to describe a process in which indexes are 'pregenerated'. We refer to this
concept as 'just-in-case' processing, as explained later in this paper.
[2] We provide this example only to illustrate the process; references to specific resources are not necessarily accurate.
[3] The issue of copyrights is not discussed in this paper. In this context, we assume that the system that offers the material
handles the copyright issues.
[4] For instance, some products offer unified searching, but once the user requests the result record, the software links the
user to the record in the resource's native interface. Such products do not need to maintain all the technical metadata that
is required for manipulating the retrieved record and converting it to a unified format.
LIBER Quarterly, Volume 13 (2003), No. 3/4