Book Preservation

Overview

This page describes my process for creating digital masters from paper books. (Of course, you should have the rights to make the copies.) There are two main ways to create a digital version of a paper book:

Image-based pages

Each page is a photograph of the original paper page.
The process can be as simple as photographing each page, selecting all the photos, then printing to PDF.
Advantages:
1. Speed and ease
2. Preservation of the original layout (no typesetting required)
  This is especially important if there is a lot of fancy formatting.
Disadvantages:
- Lack of support for editing
  Any errors in the original will be carried over into the new copy.
- Doesn't easily support reformatting content or changing the page size
- The resulting PDF file is not searchable
- Quality of the result is highly dependent on the quality of the photographs (details below). In particular, unless the pages can be separated and fed through an automatic scanner, it can be difficult to get uniform lighting and avoid distortion from curving pages. Even with an automatic document scanner, pages sometimes get rotated.
- The files tend to be larger than files of equivalent text.
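
As a rough illustration of the size difference, the sketch below compares one uncompressed page image with the same page stored as plain text. The page size, resolution, and character count are assumptions chosen for illustration. Real PDFs compress their images, so the actual gap is smaller than this raw figure suggests.

```python
# Rough size comparison for one page; all inputs are illustrative assumptions.

def raw_image_bytes(width_in, height_in, dpi, bytes_per_pixel):
    """Uncompressed size of a scanned page image, in bytes."""
    width_px = round(width_in * dpi)
    height_px = round(height_in * dpi)
    return width_px * height_px * bytes_per_pixel


def plain_text_bytes(characters, bytes_per_char=1):
    """Size of the same page stored as text (1 byte per character for ASCII)."""
    return characters * bytes_per_char


# Assumed: a 6 x 9 inch page photographed at 300 dpi in 8-bit grayscale,
# holding about 2,000 characters of text.
image_size = raw_image_bytes(6, 9, 300, 1)
text_size = plain_text_bytes(2000)

print(f"image: {image_size:,} bytes")
print(f"text:  {text_size:,} bytes")
print(f"ratio: {image_size / text_size:.0f}x")
```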

Text-based pages

Content is stored as a combination of text and photos.
Advantages:
- Results are usually better than image-based books:
  no more smudged characters, no distortion from photographing pages that aren't flat.
- Supports correcting mistakes in the original text.
- Supports reformatting to fit a different size page.
- Resulting PDF file is searchable
  This is useful for proofreading, even if the final book is to be printed on paper.
- Not very sensitive to the quality of the photographs (lighting, slight distortion or rotation)
- Files with text are more compact than photos of pages
Disadvantages:
- May require time and effort to restore images, select new fonts, and lay out the pages.
- May require time and effort to review the text for errors

Because my goal was to restore old books whose print quality had already degraded, I opted for the second approach.

Some topics to be covered:

Scanning
OCR
Handling of Japanese Characters

Scanning

If the pages of the original book are separable (e.g. spiral bound or falling apart) then it is probably easiest to feed them through a document scanner.
Home 3-in-1 printer/scanner/copiers sometimes have document feeders, but they can be slow, may not support two-sided scanning, and in my experience are prone to jamming.
An alternative is to take the book to a local photocopying business that offers scanning to a PDF.

If the original is a bound book and you don't want to cut the spine off, you might want to use a camera-based scanner.

Single Camera Systems

One type of scanner photographs the book lying open.
Distortion from the curve of the pages is removed digitally.
Examples of these scanners in the $500-$600 range are:

CZUR
  (CZUR is pronounced like "Caesar", not "seizure")
  1.5 seconds per two (facing) pages, 250-330 dpi
Fujitsu SV600
  3 seconds per page, up to 600 dpi in color and 1200 dpi in grayscale.

Advantages

One-stop shopping: scanning, image correction, OCR
Compact format

Disadvantages

Digital correction of images may degrade image quality
Possible lack of detailed control over processing
Large up-front cost
The design may let in uneven ambient light, which can make it difficult to get a uniform and natural-looking image.
Some experimentation might be required to find a good environment for scanning.

Dual Camera Systems

Dual camera systems hold the book partially open and press the pages flat against glass plates arranged in a "V".
This arrangement eliminates the need for digital distortion detection and correction.
Also, the spine of the book is automatically aligned with the "V" so page rotation is minimized.

For the DIYer, a design for a dual camera scanner is described at DIY Book Scanner.
Resolution can be upgraded by swapping out the consumer-grade cameras.
Claimed throughput is 1000 pages per hour.
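
To put the claimed throughput in context, the sketch below converts the capture times quoted for the single camera scanners into pages per hour. It counts capture time only and ignores page turning and handling, which is an assumption that flatters the single camera numbers. The 300-page book is a made-up example.

```python
# Convert capture times into pages per hour and minutes per book.
# Capture time only: page turning and handling are ignored (an assumption).

SECONDS_PER_HOUR = 3600


def pages_per_hour(seconds_per_capture, pages_per_capture):
    """Pages captured in one hour of continuous scanning."""
    return SECONDS_PER_HOUR * pages_per_capture / seconds_per_capture


def minutes_for_book(pages, rate_pages_per_hour):
    """Minutes needed to scan a book at the given rate."""
    return 60 * pages / rate_pages_per_hour


rates = {
    "CZUR (1.5 s per two facing pages)": pages_per_hour(1.5, 2),
    "Fujitsu SV600 (3 s per page)": pages_per_hour(3, 1),
    "DIY dual camera (claimed)": 1000,
}

BOOK_PAGES = 300  # hypothetical book length
for name, rate in rates.items():
    print(f"{name}: {rate:.0f} pages/hour, "
          f"{minutes_for_book(BOOK_PAGES, rate):.0f} min per book")
```

The DIY figure is a claim about a whole workflow, while the other two are capture times only, so the numbers are not directly comparable.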

I have had good luck with one of these scanners, an Archivist Quill built from a kit.
I have written some image post-processing tools for cropping and background removal.
These are likely to be my first open source project.

Things to keep in mind:

Control your light.
The main illumination will come from the light built into the scanner.
If light from the room also hits the book, then the color may not be what the camera is expecting.
This can be corrected with white balance compensation.
But if the operator is partially blocking the room lights, then the uneven lighting might make it more difficult for the software to produce a uniform and natural-looking image.
No single white balance correction will work with the whole image.
One fix for this is to build a tent to block out the room light.
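
A minimal sketch of white balance compensation, assuming the image is a plain list of rows of (r, g, b) tuples and that one patch of the page is known to be blank paper. This is one simple technique (scaling each channel so the reference patch becomes neutral), not necessarily what any scanner's software does. The function names and the sample image are made up for illustration.

```python
# White balance by scaling each channel so a reference patch becomes neutral.
# Pixels are (r, g, b) tuples with values 0-255; the image is a list of rows.

def channel_gains(reference_pixels):
    """Per-channel gains that map the average reference color to neutral gray."""
    n = len(reference_pixels)
    avg = [sum(p[c] for p in reference_pixels) / n for c in range(3)]
    target = sum(avg) / 3  # keep overall brightness roughly unchanged
    return [target / a if a > 0 else 1.0 for a in avg]


def apply_gains(image, gains):
    """Scale every pixel by the gains, clamping to the 0-255 range."""
    return [
        [tuple(min(255, max(0, round(p[c] * gains[c]))) for c in range(3))
         for p in row]
        for row in image
    ]


# Hypothetical 2 x 2 image with a warm (yellowish) cast from room light.
image = [
    [(240, 230, 200), (238, 228, 198)],
    [(60, 55, 45), (242, 232, 202)],
]
# Assume the top-left pixel is blank paper that should be neutral.
gains = channel_gains([image[0][0]])
balanced = apply_gains(image, gains)
print(balanced[0][0])
```

Note that one set of gains is applied to every pixel. That is why this kind of correction cannot repair a page where the color cast changes from one side to the other.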

Text and photographs may require different types of post-processing.
For text, I like to see black text on a white background, so I want to suppress details such as gray in the text or light speckles in the background.
For photographs, I want minimal loss of detail.
If your software does not automatically detect and apply appropriate processing to photos, you may want to save the unprocessed page images and apply your own processing to them.
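
The text-page cleanup described above can be sketched as a simple contrast stretch. This is an illustration, not the processing my tools or any scanner software actually perform. It assumes the page is a list of rows of 8-bit grayscale values, and the two cutoff values are arbitrary starting points that would need tuning per book.

```python
# Contrast stretch for text pages: push dark ink to black and light
# background (including faint speckles) to white.
# The image is a list of rows of 8-bit grayscale values (0 = black).

def clean_text_page(image, black_point=80, white_point=180):
    """Map values <= black_point to 0, >= white_point to 255, linear between."""
    if not 0 <= black_point < white_point <= 255:
        raise ValueError("need 0 <= black_point < white_point <= 255")
    span = white_point - black_point

    def level(v):
        if v <= black_point:
            return 0
        if v >= white_point:
            return 255
        return round(255 * (v - black_point) / span)

    return [[level(v) for v in row] for row in image]


# Hypothetical strip of pixels: grayish ink, an edge pixel, speckled paper.
page = [[60, 75, 140, 200, 235, 190]]
print(clean_text_page(page))
```

Because this throws away midtones on purpose, it suits text only. Running it on a photograph would destroy exactly the detail you want to keep.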

Scanning for OCR may be more forgiving than scanning for an image-based PDF book.
If your OCR requires grayscale or black and white images, you won't need to obsess over perfect white balance.
And my OCR seems pretty insensitive to slight distortion and rotations.
But you may still want to remove background noise.
It's probably a good idea to start by processing only a few representative pages to make sure the processing works as you want.
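
One simple form of background noise removal is to erase isolated dark specks. The sketch below assumes a page that has already been reduced to black and white, stored as a list of rows where 0 is ink and 255 is paper. The neighbor rule and its default are assumptions, so it is worth trying it on a few representative pages first, since an aggressive setting can also erase periods and the dots on letters.

```python
# Remove isolated dark specks from a black-and-white page image.
# The image is a list of rows; 0 is black ink and 255 is white paper.

def despeckle(image, min_dark_neighbors=1):
    """Turn a black pixel white when fewer than min_dark_neighbors of its
    eight neighbors are black. Counts use the original image, not the
    partially cleaned one."""
    height = len(image)
    width = len(image[0]) if height else 0
    cleaned = [row[:] for row in image]
    for y in range(height):
        for x in range(width):
            if image[y][x] != 0:
                continue
            dark = 0
            for dy in (-1, 0, 1):
                for dx in (-1, 0, 1):
                    if dy == 0 and dx == 0:
                        continue
                    ny, nx = y + dy, x + dx
                    if 0 <= ny < height and 0 <= nx < width and image[ny][nx] == 0:
                        dark += 1
            if dark < min_dark_neighbors:
                cleaned[y][x] = 255
    return cleaned


W, B = 255, 0
# Hypothetical sample: a lone speck at top left, a short stroke at the right.
sample = [
    [B, W, W, W, W],
    [W, W, W, B, W],
    [W, W, W, B, W],
    [W, W, W, B, W],
]
for row in despeckle(sample):
    print("".join("#" if v == 0 else "." for v in row))
```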

Optical Character Recognition (OCR)

[Work in progress.]

Things to keep in mind:

OCR is not perfect.
The bulk of my processing time is usually spent proofreading and correcting OCR mistakes.
Though to be fair, this may be because I have been preserving/restoring degraded paper copies.
Sometimes it was hard for me to read the originals.
OCR can give you unexpected formatting.
When exported to Word, a page can look about right, but the text, including headers and page numbers, is put into text boxes.
If you add some content, the page numbers won't automatically update and the text won't automatically shift to the next page.
OCR can work on mixed-language texts.
I have done several projects with mixed English and Japanese.