LIB FACIT / 1-1044



The project started in January 1993 and finished in February 1996.


Objectives of the project

The project has been concerned with two main questions:

Results


1) The first objective has been achieved in the sense that the project has shown that such retroconversion can only be expected to be feasible under certain conditions. The Achilles' heel of fully automatic procedures in retroconversion is still the speed and quality of OCR, and this depends to a large extent on the state of the source material. The project was based on the assumption that commercially available equipment and OCR programs would be able to handle older typewritten and printed catalogue cards in a satisfactory way, so that the main effort could be aimed at formatting the cards. In practice, far more time was spent on problems of OCR than was originally envisaged.

The results concerning OCR are based mainly on the use of commercially available scanners and OCR packages in the lower or middle price range (such as seem attractive to most libraries). The results might have been different if more sophisticated (and more costly) equipment had been used, or even custom-built equipment. But in general the conclusion has to be that many older card catalogues are not suitable for this type of methodology because of the state of the source material: yellowed by age, worn, smudged, with handwritten additions, sometimes swollen or made uneven by dampness, written with a series of typewriters with varying typefaces and with ribbons that are more or less worn out, copied by stencilling etc.

The conclusion is not that the methodology is not feasible at all, but that its application is limited to fairly "well behaved" catalogues. A library wondering whether to apply scanning and OCR to retroconversion should carry out extensive tests on its own catalogue in order to assess its suitability.
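
One way such a test could be carried out is to key in a small sample of cards by hand and compare the OCR output against it. The sketch below is only an illustration and not a procedure used by the project: the function names and the sample cards are hypothetical, and the character error rate is taken here as the edit distance divided by the length of the hand-keyed text.

```python
def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance: insertions, deletions and substitutions."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        cur = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            cur.append(min(prev[j] + 1,          # deletion
                           cur[j - 1] + 1,       # insertion
                           prev[j - 1] + cost))  # substitution or match
        prev = cur
    return prev[-1]


def character_error_rate(samples):
    """samples: iterable of (ocr_text, hand_keyed_text) pairs."""
    samples = list(samples)
    total_errors = sum(edit_distance(ocr, truth) for ocr, truth in samples)
    total_chars = sum(len(truth) for _, truth in samples)
    return total_errors / total_chars if total_chars else 0.0


# Hypothetical sample: two card lines, one with a single OCR error.
SAMPLE = [
    ("Andersen, H. C.", "Andersen, H. C."),
    ("Evcntyr", "Eventyr"),
]

if __name__ == "__main__":
    print(f"Character error rate: {character_error_rate(SAMPLE):.1%}")
```

What error rate is acceptable is a judgement for the individual library; the figure is only useful when the sample covers the different typefaces and states of wear found in the catalogue.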

Formatting the cards after scanning and OCR does not, on the other hand, seem to present serious problems, provided the output from OCR has a low level of errors. Based on a thorough formal analysis of the catalogue and the rules used in producing it, it will in most cases be possible to write a series of programs specific to that catalogue to do the job. This is confirmed by other projects.

The main focus of the project was to investigate the possibility of producing one application, able to handle a wide range of card catalogues as found in European libraries, avoiding the necessity of writing the formatting programs from scratch every time. This is done by feeding the application a formal description of the catalogue at hand, using a relatively simple formal language. At the same time the application should provide a set of integrated tools for the range of different procedures that go into retroconversion work. The project has demonstrated that this is in fact feasible. But the work of formal analysis is quite demanding, both in terms of time and the necessary skills and knowledge. And it will have to be done again with each new catalogue, since no two catalogues are exactly alike.
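
The general idea of a formatter driven by a catalogue description, rather than by catalogue-specific program code, can be sketched as follows. This is a hypothetical illustration only: the rule format, the field names and the patterns are assumptions made for the example and are not the formal language of the FACIT Prototype, which is described in Technical Report no 2.

```python
import re

# Hypothetical description of one catalogue: ordered rules, each giving
# a field name and a pattern that a card line must match.
DESCRIPTION = [
    ("author", r"^[A-ZÆØÅ][^,]+, .+$"),   # "Surname, Forenames"
    ("imprint", r"^.+, \d{4}\.$"),        # "Place, year."
    ("title", r"^.+$"),                   # anything else
]


def format_card(lines, description):
    """Assign each card line to the first unused rule that matches it.

    Returns (record, unparsed): a dict of field name to text, and the
    lines that no remaining rule could place.
    """
    remaining = [(name, re.compile(pattern)) for name, pattern in description]
    record, unparsed = {}, []
    for line in lines:
        text = line.strip()
        if not text:
            continue
        for index, (name, regex) in enumerate(remaining):
            if regex.match(text):
                record[name] = text
                del remaining[index]  # each field is filled at most once
                break
        else:
            unparsed.append(text)
    return record, unparsed


# Hypothetical card text as it might come out of OCR.
CARD = [
    "Andersen, Hans Christian",
    "Eventyr og Historier.",
    "Kjøbenhavn, 1862.",
]

if __name__ == "__main__":
    print(format_card(CARD, DESCRIPTION))
```

In this sketch a new catalogue needs only a new DESCRIPTION; the formatting function stays the same. Real catalogues need far richer rules, which is why the formal analysis remains demanding.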

This process is needed in order to produce the necessary formal specifications for the formatting programs, both with a system like the FACIT Prototype and with custom-built formatting programs. This is definitely a specialist job. With automatic conversion a lot of the costs go into setting up and testing the system with each new library and each new catalogue. This means that this methodology is not suitable for a small or medium size library to handle alone without expert assistance - from a commercial service or a large library that has already done some work in this area.

An important problem that has not been solved in a satisfactory way in this project is the need for detecting and correcting errors produced by scanning and OCR. The project has investigated various possible solutions, and it seems worthwhile to pursue this further. Meanwhile corrections will have to be done by the human operator with some support from the computer.
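
A simple form of such computer support is to flag words that are not found in a word list and to offer the closest known words as suggestions, leaving the decision to the operator. The sketch below is an assumption about how this could look, not a description of the Prototype: the function name, the word list and the similarity cutoff are hypothetical. It uses get_close_matches from the Python standard library module difflib.

```python
import difflib
import re


def flag_suspect_words(text, lexicon, max_suggestions=3, cutoff=0.75):
    """Return (word, suggestions) pairs for words not in the lexicon.

    Nothing is corrected automatically; the list is meant to be shown
    to a human operator.
    """
    known = {word.lower() for word in lexicon}
    candidates = sorted(known)
    flagged = []
    # [^\W\d_]+ matches runs of letters, including non-ASCII letters.
    for token in re.findall(r"[^\W\d_]+", text):
        lower = token.lower()
        if lower in known:
            continue
        suggestions = difflib.get_close_matches(
            lower, candidates, n=max_suggestions, cutoff=cutoff
        )
        flagged.append((token, suggestions))
    return flagged


if __name__ == "__main__":
    # Hypothetical word list and OCR line.
    lexicon = ["eventyr", "og", "historier"]
    for word, suggestions in flag_suspect_words("Evcntyr og Historier", lexicon):
        print(word, "->", suggestions)
```

A word list check of this kind works poorly for names and for multilingual catalogues, where many correct words are missing from any list, so it can only supplement the operator.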

2) The second objective of the project, as stated above, has only been partly reached. A software package has been developed that is able to demonstrate the principles involved in automatic formatting of library catalogues and in customizing the procedures for use in libraries with widely different cataloguing practices as well as catalogues produced over time to different specifications. But the package does not include more advanced facilities for error detection and correction, and it still lacks a series of features that are necessary for use in large scale conversion of catalogues. Nevertheless the results of the project are promising for further development work, and constitute a solid basis for future work by the partners and the subcontractors of the project as well as others. The aim of the published reports is therefore to make available the information generated by the project, in order to help in making realistic judgements about the prospects of using the methodology described in a particular library for the conversion of a particular catalogue, and in order to make the information useful for other research and development projects.


Reports



The published reports from the FACIT project consist of the following:

Optical Character Recognition for Retroconversion of Catalogue Cards: Hardware, Software and Character Representation. By Niels Erik Wille. (FACIT Technical Report no 1). Statens Bibliotekstjeneste, Copenhagen. October 1996.
The report summarizes the experiences with scanners and OCR programs. Special treatment is given to the question of character sets and representation of characters, since this is normally of great importance in converting multilingual catalogues.

A Framework for the Analysis of Catalogue Cards. By Niels Erik Wille and Vera Valitutto. (FACIT Technical Report no 2). Statens Bibliotekstjeneste, Copenhagen. Revised version, October 1996.
The report describes the problems involved in analysing a catalogue in order to evaluate the feasibility of converting it by automatic means, as well as the formal language to be used in setting up the FACIT Prototype. This information should also be useful for someone aiming at developing similar tools for retroconversion.

Error Analysis and Correction in Retroconversion. By Hans Erik Jensen (FACIT Technical Report no 3). Statsbiblioteket, Aarhus. October 1996.
The report summarizes the issues involved in automatic or semiautomatic error detection and correction, and outlines plans for further development of the Prototype in order to incorporate more sophisticated handling of OCR errors.

The FACIT Prototype. Manual and Documentation. By SYNERGI (FACIT Technical Report no 4). Statens Bibliotekstjeneste, Copenhagen. October 1996.
The report describes the Prototype in detail and the procedures to use when setting up the demonstration version. The level of information is highly technical. Due to a series of limitations the demonstration Prototype is not suitable for large scale conversion work, but using it with a smaller sample will provide a good grasp of the problems and procedures involved in automatic formatting etc.

Retroconversion of Older Card Catalogues using OCR and Automatic Formatting. Project Overview and Final Report. By Niels Erik Wille (FACIT Technical Report no 5). Statens Bibliotekstjeneste, Copenhagen. October 1996.
This report presents the project as a whole and the main results reached. It includes a summary of the information included in the previous reports.

These reports are available free of charge.

Demonstration Prototype

A workable demonstration version of the FACIT Prototype is available. This is a combination of a suite of DOS programs and an interface produced as an application for Microsoft Access. The Prototype will run on a PC with Windows 3.11 or Windows 95 and Microsoft Access 2.0 or later versions.

The Demonstration Prototype is available free of charge for use in European libraries.


Correspondence

All correspondence concerning the reports and the Prototype should be sent to:

Niels Erik Wille
Senior lecturer
Dept. of Computer Science, Communication and Education
Building P4
Roskilde University
P.O.Box 260
DK-4000 Roskilde

or posted by e-mail to: new@ruc.dk (Internet)

Copies of the reports and the demo version of the FACIT Prototype are also available from this website by activating the appropriate links above.


Last revised 18 November 1996