If the pages of the original book are separable (e.g. spiral bound or falling apart)
then it is probably easiest to feed them through a document scanner.
Home 3-in-1 printer/scanner/copiers sometimes have document feeders,
but they can be slow, may not support two-sided scanning, and in my experience are prone to jamming.
An alternative is to take the book to a local photocopying business that offers scanning to a PDF.
If the original is a bound book and you don't want to cut the spine off,
you might want to use a camera-based scanner.
Single Camera Systems
One type of scanner photographs the book lying open.
Distortion from the curve of the pages is removed digitally.
Examples of these scanners in the $500-$600 range are
(CZUR is pronounced like "Caesar", not "seizure")
1.5 seconds per two (facing) pages, 250-330 dpi
3 seconds per page, up to 600 dpi in color and 1200 dpi in grayscale.
Pros:
- One-stop shopping: scanning, image correction, OCR
- Compact format
Cons:
- Digital correction of images may degrade image quality
- Possible lack of detailed control over processing
- Large up-front cost
The open design may let in uneven ambient light, which can make it difficult to get
a uniform and natural-looking image.
Some experimentation might be required to find a good environment for scanning.
Dual Camera Systems
Dual camera systems hold the book partially open and keep the pages flat
by pressing them against glass plates arranged in a "V".
This arrangement eliminates the need for digital distortion detection and correction.
Also, the spine of the book is automatically aligned with the
"V" so page rotation is minimized.
For the DIYer, a design for a dual camera scanner is described at DIY Book Scanner.
Resolution can be upgraded by swapping out the consumer-grade cameras.
Claimed throughput is 1000 pages per hour.
I have had good luck using one of these scanners, built from a kit.
I have written some image post-processing tools for cropping and background removal.
These are likely to become my first open-source project.
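The speed figures quoted in this article use different units: seconds per capture for the single camera scanners and pages per hour for the DIY design. As a rough comparison, the sketch below converts them to pages per hour and estimates the time to scan one book. The function names and the 300-page book are illustrative assumptions, and the calculation ignores page turning, so real rates will be lower.

```python
def pages_per_hour(seconds_per_capture: float, pages_per_capture: int = 1) -> float:
    """Convert a per-capture time into pages per hour (page-turning time ignored)."""
    return 3600.0 / seconds_per_capture * pages_per_capture


def hours_to_scan(book_pages: int, rate_pages_per_hour: float) -> float:
    """Hours needed to scan a book at a sustained rate."""
    return book_pages / rate_pages_per_hour


# Figures quoted in the article.
two_page_rate = pages_per_hour(1.5, pages_per_capture=2)  # 1.5 s per two facing pages
one_page_rate = pages_per_hour(3.0)                       # 3 s per page
diy_rate = 1000.0                                         # claimed pages per hour

BOOK_PAGES = 300  # assumed book length, for illustration only
for name, rate in [("two-page capture", two_page_rate),
                   ("one-page capture", one_page_rate),
                   ("DIY dual camera", diy_rate)]:
    minutes = hours_to_scan(BOOK_PAGES, rate) * 60
    print(f"{name}: {rate:.0f} pages/hour, about {minutes:.0f} minutes for {BOOK_PAGES} pages")
```

The capture-time figures are best-case numbers, so treat the results as upper bounds on throughput.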
Things to keep in mind:
Control your light.
The main illumination will come from the light built into the scanner.
If light from the room also hits the book, then the color
may not be what the camera is expecting.
This can be corrected with white balance compensation.
But if the operator is partially blocking the room lights, then
the uneven lighting might make it more difficult
for the software to produce a uniform and natural-looking image.
No single white balance correction will work for the whole image.
One fix for this is to build a tent to block out the room light.
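As a rough illustration of white balance compensation, the sketch below samples pixels from blank paper, computes per-channel gains that make that sample neutral, and applies the gains to the pixels. It is plain Python on lists of (R, G, B) tuples. The function names and sample values are hypothetical, and this is not a description of what any particular scanner software does.

```python
from typing import List, Tuple

Pixel = Tuple[int, int, int]


def white_balance_gains(paper_samples: List[Pixel]) -> Tuple[float, ...]:
    """Per-channel gains that map the average sampled paper color to neutral gray."""
    n = len(paper_samples)
    avg = [sum(p[c] for p in paper_samples) / n for c in range(3)]
    target = sum(avg) / 3.0
    # A channel with zero average cannot be scaled, so leave it unchanged.
    return tuple(target / a if a else 1.0 for a in avg)


def apply_gains(pixels: List[Pixel], gains: Tuple[float, ...]) -> List[Pixel]:
    """Scale each channel by its gain and clip to the 8-bit range."""
    return [tuple(min(255, round(v * g)) for v, g in zip(p, gains)) for p in pixels]


# Hypothetical paper samples: one under the scanner light with a warm cast,
# one from a region where the operator shades the room light.
lit_paper = [(200, 180, 160)]
shaded_paper = [(120, 115, 118)]

gains_lit = white_balance_gains(lit_paper)
gains_shaded = white_balance_gains(shaded_paper)

balanced = apply_gains(lit_paper, gains_lit)  # -> [(180, 180, 180)]
print(gains_lit, gains_shaded, balanced)
```

The two sets of gains differ, which is the problem with uneven lighting: gains that neutralize the paper in one region leave a color cast in the other.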
Text and photographs may require different types of post-processing.
For text, I like to see black text on a white background, so
I want to remove details such as gray in the text or light speckles in the background.
For photographs, I want minimal loss of detail.
If your software does not automatically detect and apply appropriate processing to photos,
you may want to save the unprocessed page images and apply your own processing to them.
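One simple way to push a text page toward black on white is a levels stretch: everything darker than a black point becomes 0, everything lighter than a white point becomes 255, and values in between are scaled linearly. The sketch below shows the idea in plain Python on a grayscale image stored as a list of rows. The function name and the default black and white points are assumptions to tune per book, and this processing should not be applied to photographs because it discards tonal detail.

```python
from typing import List


def clean_text_levels(gray: List[List[int]],
                      black_point: int = 80,
                      white_point: int = 180) -> List[List[int]]:
    """Stretch gray levels so dark text goes to black and light paper to white."""
    if white_point <= black_point:
        raise ValueError("white_point must be greater than black_point")
    span = white_point - black_point
    cleaned = []
    for row in gray:
        cleaned.append([
            max(0, min(255, round((v - black_point) * 255 / span)))
            for v in row
        ])
    return cleaned


# Hypothetical row: dark gray text, a mid-gray edge pixel, off-white paper.
row = [[60, 130, 200]]
print(clean_text_levels(row))
```

Saving the unprocessed page images first means a page containing a photograph can be skipped or handled with gentler settings.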
Scanning for OCR may be more forgiving than scanning for an image-based PDF book.
If your OCR requires grayscale or black and white images, you won't need to obsess
over perfect white balance.
And my OCR seems pretty insensitive to slight distortion and rotation.
But you may still want to remove background noise.
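A 3x3 median filter is one common way to remove isolated background speckles before OCR. The sketch below is a minimal plain-Python version for a grayscale image stored as a list of rows. The function name is hypothetical, border pixels are left unchanged for simplicity, and a real workflow would more likely use an imaging library.

```python
from typing import List


def median_filter_3x3(gray: List[List[int]]) -> List[List[int]]:
    """Replace each interior pixel with the median of its 3x3 neighborhood."""
    height, width = len(gray), len(gray[0])
    out = [row[:] for row in gray]  # copy, so border pixels keep their values
    for y in range(1, height - 1):
        for x in range(1, width - 1):
            window = [gray[y + dy][x + dx]
                      for dy in (-1, 0, 1)
                      for dx in (-1, 0, 1)]
            out[y][x] = sorted(window)[4]  # middle of the 9 sorted values
    return out


# Hypothetical 5x5 patch of white paper with one dark speckle in the middle.
page = [[255] * 5 for _ in range(5)]
page[2][2] = 0
despeckled = median_filter_3x3(page)
print(despeckled[2][2])
```

A median filter can also erode very thin strokes, so it is worth checking the result on small print before applying it to a whole book.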
It's probably a good idea to start by processing only a few representative pages
to make sure the processing works as you want.
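One way to choose a trial set is to take pages spread evenly through the book and then add any pages you know are unusual, such as photographs. The sketch below covers only the evenly spaced part. The function name is hypothetical, and even spacing is just one assumption about what counts as representative.

```python
from typing import List


def sample_page_indices(total_pages: int, sample_count: int) -> List[int]:
    """Return up to sample_count zero-based page indices spread evenly through the book."""
    if total_pages <= 0 or sample_count <= 0:
        return []
    if sample_count >= total_pages:
        return list(range(total_pages))
    if sample_count == 1:
        return [total_pages // 2]
    step = (total_pages - 1) / (sample_count - 1)
    return [round(i * step) for i in range(sample_count)]


print(sample_page_indices(101, 5))  # -> [0, 25, 50, 75, 100]
```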