LIB
FACIT / 1-1044
The project has been concerned with two main questions:
1) The first objective has been achieved in
the sense that the project has shown that such retroconversion can only
be expected to be feasible under certain conditions. The Achilles' heel
of fully automatic procedures in retroconversion is still the speed and
quality of OCR, which depends to a large extent on the state of the
source material. The project was based on the assumption that commercially
available equipment and OCR programs would be able to handle older
typewritten and printed catalogue cards satisfactorily, so that the
main effort could be aimed at formatting the cards. In practice, far more
time was spent on problems of OCR than originally envisaged.
The results concerning OCR are based mainly on the use of commercially
available scanners and OCR packages in the lower or middle price range
(such as seem attractive to most libraries). The results might have been
different if more sophisticated (and more costly) equipment, or even
custom-built equipment, had been used. But in general the conclusion has
to be that many older card catalogues are not suitable for this type of
methodology because of the state of the source material: yellowed by age,
worn, smudged, with handwritten additions, sometimes swollen or made uneven
by dampness, written with a series of typewriters with varying typefaces
and with ribbons that are more or less worn out, copied by stencilling,
etc.
The conclusion is not that the methodology is not feasible at all, but
that its application is limited to fairly "well behaved" catalogues.
A library wondering whether to apply scanning and OCR to retroconversion
should carry out extensive tests in order to assess whether the method suits
its catalogue.
Formatting the cards after scanning and OCR does not, on the other hand,
seem to present serious problems, if the output from OCR has a low level
of errors. Based on a thorough formal analysis of the catalogue and the
rules used in producing it, it will in most cases be possible to write
a series of programs specific to that catalogue to do the job. This is
confirmed by other projects.
The main focus of the project was to investigate the possibility of producing
one application, able to handle a wide range of card catalogues as found
in European libraries, avoiding the necessity of writing the formatting
programs from scratch every time. This is done by feeding the application
a formal description of the catalogue at hand, using a relatively simple
formal language. At the same time the application should provide a set
of integrated tools for the range of different procedures that go into
retroconversion work. The project has demonstrated that this is in fact feasible.
But the work of formal analysis is quite demanding, both in terms of time
and the necessary skills and knowledge. And it will have to be done again
with each new catalogue, since no two catalogues are exactly alike.
This process is needed in order to produce the necessary formal specifications
for the formatting programs, both with a system like the FACIT Prototype
and with custom-built formatting programs. This is definitely a specialist
job. With automatic conversion, a lot of the cost goes into setting up and
testing the system with each new library and each new catalogue. This
means that this methodology is not suitable for a small or medium size
library to handle alone without expert assistance - from a commercial service
or a large library that has already done some work in this area.
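The idea of one application driven by a formal description of each catalogue can be illustrated with a minimal sketch. It is not the FACIT Prototype and does not use the project's formal language: the description format (an ordered list of field names and regular expressions), the names CATALOGUE_SPEC and format_card, and the sample card are all assumptions made up for illustration.

```python
import re

# Hypothetical catalogue description: an ordered list of (field name, pattern).
# Each pattern is matched at the start of the text left over from the
# previous field. A different catalogue would get a different list.
CATALOGUE_SPEC = [
    ("author", r"(?P<value>[^:]+):\s*"),
    ("title", r"(?P<value>[^.]+)\.\s*"),
    ("place", r"(?P<value>[^,]+),\s*"),
    ("year", r"(?P<value>\d{4})\.?\s*"),
]


def format_card(text, spec):
    """Split the OCR text of one card into named fields according to spec.

    Text that cannot be matched is stored under "unparsed" so that it can
    be passed on to a human operator.
    """
    record = {}
    rest = text.strip()
    for name, pattern in spec:
        match = re.match(pattern, rest)
        if match is None:
            record["unparsed"] = rest
            return record
        record[name] = match.group("value").strip()
        rest = rest[match.end():]
    if rest:
        record["unparsed"] = rest
    return record


# Invented sample card, not taken from any real catalogue.
card = "Hansen, Peter: Danske Fugle. Copenhagen, 1923."
record = format_card(card, CATALOGUE_SPEC)
print(record)
```

In this sketch only CATALOGUE_SPEC changes from one catalogue to the next, while format_card stays the same. Writing a correct description for a real catalogue is the demanding analysis work described above.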
An important problem that has not been solved in a satisfactory way in
this project is the need for detecting and correcting errors produced
by scanning and OCR. The project has investigated various possible solutions,
and it seems worthwhile to pursue this further. Meanwhile, corrections
will have to be done by the human operator with some support from the computer.
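As an illustration of what "some support from the computer" might look like, the following minimal sketch flags tokens that look like OCR misreads so that an operator can check them. It is not one of the solutions investigated by the project: the heuristics, the names SUSPICIOUS and flag_suspicious, and the sample text are assumptions chosen for illustration only.

```python
import re

# Hypothetical heuristics for tokens that often indicate OCR misreads.
SUSPICIOUS = [
    re.compile(r"[A-Za-z]\d|\d[A-Za-z]"),  # a letter next to a digit
    re.compile(r"[^\w\s.,:;()'-]"),        # a character unusual on a card
    re.compile(r"(.)\1\1"),                # one character three times in a row
]


def flag_suspicious(text):
    """Return the whitespace-separated tokens of text that match any heuristic."""
    return [
        token
        for token in text.split()
        if any(pattern.search(token) for pattern in SUSPICIOUS)
    ]


# Invented sample line with two deliberate misreads.
flagged = flag_suspicious("Hansen, Peter: Dan5ke Fug|e. Copenhagen, 1923.")
print(flagged)
```

Heuristics of this kind only point the operator to likely problems. They will also flag legitimate strings such as shelf marks that mix letters and digits, and they miss misreads that happen to produce plausible words.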
2) The second objective of the project, as
stated above, has only been partly reached. A software package has been
developed that is able to demonstrate the principles involved in automatic
formatting of library catalogues and in customizing the procedures for
use in libraries with widely different cataloguing practices as well as
catalogues produced over time to different specifications. But the package
does not include more advanced facilities for error detection and correction,
and it still lacks a series of features that are necessary for use in large
scale conversion of catalogues. Nevertheless the results of the project
are promising for further development work, and constitute a solid basis
for future work by the partners and the subcontractors of the project as
well as others. The aim of the published reports is therefore to make available
the information generated by the project, in order to help in making realistic
judgements about the prospects of using the methodology described in a
particular library for the conversion of a particular catalogue, and in
order to make the information useful for other research and development
projects.
The published reports from the FACIT project consist of the following:
Optical Character Recognition for Retroconversion of Catalogue Cards: Hardware, Software
and Character Representation. By Niels Erik Wille. (FACIT Technical
Report no 1). Statens Bibliotekstjeneste, Copenhagen. October 1996.
The report summarizes the experiences with scanners and OCR programs.
Special treatment is given to the question of character sets and representation
of characters, since this is normally of great importance in converting
multilingual catalogues.
A Framework for the Analysis of Catalogue Cards. By Niels Erik Wille and Vera
Valitutto. (FACIT Technical Report no 2). Statens Bibliotekstjeneste, Copenhagen.
Revised version, October 1996.
The report describes the problems involved in analysing a catalogue
in order to evaluate the feasibility of converting it by automatic means,
as well as the formal language to be used in setting up the FACIT Prototype.
This information should also be useful for someone aiming at developing
similar tools for retroconversion.
Error Analysis and Correction in Retroconversion. By Hans Erik Jensen (FACIT Technical
Report no 3). Statsbiblioteket, Aarhus. October 1996.
The report summarizes the issues involved in automatic or semiautomatic
error detection and correction, and outlines plans for further development
of the Prototype in order to incorporate more sophisticated handling of
OCR errors.
The FACIT Prototype. Manual and Documentation. By SYNERGI (FACIT Technical Report no
4). Statens Bibliotekstjeneste, Copenhagen. October 1996.
The report describes the Prototype in detail and the procedures to
use when setting up the demonstration version. The level of information
is highly technical. Due to a series of limitations the demonstration Prototype
is not suitable for large scale conversion work, but using it with a smaller
sample will provide a good grasp of the problems and procedures involved
in automatic formatting etc.
Retroconversion of Older Card Catalogues using OCR and Automatic Formatting. Project Overview
and Final Report. By Niels Erik Wille (FACIT Technical Report no
5). Statens Bibliotekstjeneste, Copenhagen. October 1996.
This report presents the project as a whole and the main results reached.
It includes a summary of the information included in the previous reports.
These reports are available free of charge.
A workable demonstration version of the FACIT Prototype is available. This is a combination of a suite
of DOS programs and an interface produced as an application for Microsoft
Access. The Prototype will run on a PC with Windows 3.11 or Windows 95
and Microsoft Access 2.0 or later versions.
The Demonstration Prototype is available free of charge for use in European
libraries.
All correspondence concerning the reports and the Prototype should be sent to:
Niels Erik Wille
Senior lecturer
Dept. of Computer Science, Communication and Education
Building P4
Roskilde University
P.O.Box 260
DK-4000 Roskilde
or posted by e-mail to: new@ruc.dk
(Internet)
Copies of the reports and the demo version of the FACIT Prototype are also
available from this website by activating the appropriate links above.
Last revised 18 November 1996