Harvesting metadata with OAI-PMH
Collecting, organizing, and maintaining information is only half the battle. Without access, these processes are meaningless. Additionally, there may be groups of content providers whose information has something in common, such as subject matter (literature, mathematics, gardening), file format (images, texts, sounds), or community (a library, a business, a user group). These groups may want to cooperate by gathering information about their content into a single search engine. Doing so saves users time by reducing the number of databases they have to search, and it provides access to each provider's content.
The OAI addresses the problems outlined above by articulating a method -- a protocol built on top of HTTP -- for sharing meta data buried in Internet-accessible databases. The protocol defines two entities and the language whereby these two entities communicate. The first entity is called a "data provider" or a "repository". For example, a data provider may have a collection of digital images. Each of these images may be described with a set of qualities: title, accession number, date, resolution, narrative description, etc. Alternatively, a data provider may be a pre-print archive -- a collection of pre-published papers -- and therefore each of the items in the archive could be described using title, author, date, summary, and possibly subject heading. Another example could be a list of people, experts in a field of study. The qualities describing this collection may be name, email address, postal address, telephone number, and institutional affiliation.
Thus, the purpose of the first OAI entity -- the data provider -- is to expose the qualities of its collection -- the meta data -- to a second entity, a "service provider". The purpose of the service provider is to harvest the meta data from one or more data providers and, in turn, create some sort of value-added utility. This utility is not defined by the protocol but could include things such as a printed directory, a federated index available for searching, a mirror of a data provider, a current awareness service, syndicated news feeds, etc.
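The relationship between the two entities can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the class names (`Record`, `DataProvider`), the `harvest` function, and the sample records are all hypothetical, the "federated index" is a simple in-memory dictionary keyed on title words, and the HTTP transport that the protocol actually uses is left out entirely.

```python
from collections import defaultdict
from dataclasses import dataclass, field


@dataclass
class Record:
    identifier: str
    metadata: dict  # hypothetical qualities, e.g. {"title": ...}


@dataclass
class DataProvider:
    name: str
    records: list = field(default_factory=list)

    def expose(self):
        """Expose the meta data describing this collection."""
        return list(self.records)


def harvest(providers):
    """Service-provider side: gather records and index them by title word."""
    index = defaultdict(list)
    for provider in providers:
        for record in provider.expose():
            for word in record.metadata.get("title", "").lower().split():
                index[word].append((provider.name, record.identifier))
    return dict(index)


# Two made-up data providers with one record each.
images = DataProvider("images", [Record("img-1", {"title": "Garden Roses"})])
preprints = DataProvider("preprints", [Record("pre-1", {"title": "Roses in Literature"})])

index = harvest([images, preprints])
print(index["roses"])  # entries from both providers share one index
```

A single lookup in the combined index returns matches from every harvested provider, which is the time saving a federated search service is meant to offer.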
In summary, the OAI defines two entities (data provider and service provider) and a protocol for these two entities to share meta data between themselves. The balance of this article describes the protocol in greater detail.
The OAI protocol consists of only a few "verbs" (think "commands") and a set of standardized XML responses. All of the verbs are communicated from the service provider to a data provider via an HTTP request. Each request is a set of one or more name/value pairs, either embedded in the URL (as in the GET method) or encoded in the body of the HTTP request (as in the POST method). Most of the verbs can be qualified with additional name/value pairs. The simplest verb is "Identify", and a real example of how this might be passed to a data provider via the GET method is the following:
http://www.infomotions.com/alex/oai/?verb=Identify
The example above assumes there is some sort of OAI-aware application saved as the default executable in the /alex/oai directory of the www.infomotions.com host. This application takes the name/value pair, verb=Identify, as input and outputs an XML stream confirming itself as an OAI data provider.
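As a minimal sketch of the Identify request described above, the following Python builds the same name/value pair for both the GET and POST methods using only the standard library. The helper names (`build_get_url`, `build_post_request`) are hypothetical, and nothing is sent over the network here; handing either result to `urllib.request.urlopen` would be the assumed next step.

```python
from urllib.parse import urlencode
from urllib.request import Request

# Host and directory of the data provider named in the text.
BASE_URL = "http://www.infomotions.com/alex/oai/"


def build_get_url(base_url, verb, **qualifiers):
    """Embed the verb and any qualifiers in the URL (GET method)."""
    pairs = {"verb": verb, **qualifiers}
    return base_url + "?" + urlencode(pairs)


def build_post_request(base_url, verb, **qualifiers):
    """Encode the same pairs in the request body (POST method)."""
    pairs = {"verb": verb, **qualifiers}
    body = urlencode(pairs).encode("ascii")
    headers = {"Content-Type": "application/x-www-form-urlencoded"}
    return Request(base_url, data=body, headers=headers, method="POST")


identify_url = build_get_url(BASE_URL, "Identify")
identify_post = build_post_request(BASE_URL, "Identify")

print(identify_url)
print(identify_post.get_method(), identify_post.data)
```

Both forms carry the identical pair, `verb=Identify`; only its location in the HTTP request differs.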
Other verbs work in a similar manner but may include a number of qualifiers in the form of additional name/value pairs. For example, the following verb requests a record, in the Dublin Core meta data format, describing Mark Twain's The Prince And The Pauper:
http://www.infomotions.com/alex/oai/?verb=GetRecord&metadataPrefix=oai_dc&identifier=twain-prince-30
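To illustrate how qualifiers travel alongside the verb, the sketch below assembles the GetRecord request from its name/value pairs and then splits it apart again, roughly as an OAI-aware application on the data provider might before deciding what to return. The `split_request` helper is hypothetical and does no validation; the identifier value is read from the example request as `twain-prince-30`.

```python
from urllib.parse import parse_qsl, urlencode, urlsplit

BASE_URL = "http://www.infomotions.com/alex/oai/"

# The verb plus its two qualifiers, in the order used in the example.
pairs = [
    ("verb", "GetRecord"),
    ("metadataPrefix", "oai_dc"),
    ("identifier", "twain-prince-30"),
]
url = BASE_URL + "?" + urlencode(pairs)


def split_request(request_url):
    """Return (verb, qualifiers) decoded from the URL's query string."""
    args = dict(parse_qsl(urlsplit(request_url).query))
    verb = args.pop("verb", None)
    return verb, args


verb, qualifiers = split_request(url)
print(url)
print(verb, qualifiers)
```

Keeping the verb separate from its qualifiers mirrors the protocol's own vocabulary: the verb selects the operation, and the remaining pairs narrow it down.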