Digitizing Books, Fast and Affordable
I recently chatted with a librarian who maintains a collection of rare books in several languages. He uses proprietary software to scan the books and then convert the scans to digital text. After this, he intends to keep the books preserved and offer electronic library lending. He already has a dedicated file server set up for sharing his electronic library. I pointed him to Project Gutenberg (PG) and Project Madurai (an Indic/Tamil language initiative that aims to do what PG did, for ancient and classical works in Tamil).
However, he wasn't happy that scanning a book took ages, usually anywhere from one week to several weeks, and required a sizable amount of storage. The naming of some of the scanned images could be automated, but that was as far as he could get.
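The naming step, at least, is easy to script. Here is a minimal sketch in Python, assuming the scanner drops sequentially named image files into one folder per book. The function names, the `book_id` label and the `_p0001` naming pattern are my own illustrative choices, not something the librarian's software provides.

```python
from pathlib import Path


def planned_names(scan_files, book_id, start_page=1, width=4):
    """Pair each scan file with a new name like '<book_id>_p0001.tif'.

    Files are taken in sorted order, so this assumes the scanner's own
    file names already sort in page order (an assumption, check yours).
    """
    plan = []
    for offset, src in enumerate(sorted(Path(f) for f in scan_files)):
        page = start_page + offset
        new_name = f"{book_id}_p{page:0{width}d}{src.suffix.lower()}"
        plan.append((src, src.with_name(new_name)))
    return plan


def rename_scans(folder, book_id, pattern="*.tif"):
    """Apply the plan to every file in `folder` that matches `pattern`."""
    for src, dst in planned_names(Path(folder).glob(pattern), book_id):
        src.rename(dst)
```

Calling `planned_names` on its own first lets you review the old and new names before anything on disk is touched.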
Looking at the FAQ available at PG, I found that they too rely on volunteer scanners, who are advised to use a traditional flatbed scanner (which limits the page size of scans). Some scanners have an Automatic Document Feeder (ADF), which helps if you can remove the pages from the binding. For classical works, this is very tricky and sometimes impossible. The alternative is to scan two facing pages at a time and post-process each scan afterwards. You can check out the details here at the Scanning FAQ.
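One typical post-processing step for a two-page scan is cutting it into separate left and right pages. The sketch below is only an illustration of the idea, not what PG recommends: it assumes the scan is already available as rows of grayscale values (0 is black, 255 is white) and that the gutter between the pages shows up as the darkest column near the middle. A real pipeline would load and save the images with an imaging library.

```python
def find_gutter(rows, search_fraction=0.2):
    """Return the index of the darkest column near the horizontal centre.

    `rows` is a list of equal-length lists of grayscale values.
    `search_fraction` is the share of the width, centred on the middle,
    in which the gutter is searched for (a hypothetical tuning knob).
    """
    width = len(rows[0])
    centre = width // 2
    margin = max(1, int(width * search_fraction / 2))
    first = max(0, centre - margin)
    last = min(width - 1, centre + margin)
    # The gutter shadow has the lowest total brightness of its column.
    return min(range(first, last + 1),
               key=lambda col: sum(row[col] for row in rows))


def split_spread(rows):
    """Split a two-page scan into (left_page, right_page) at the gutter."""
    gutter = find_gutter(rows)
    left = [row[:gutter] for row in rows]
    right = [row[gutter + 1:] for row in rows]
    return left, right
```

Searching only near the centre keeps a dark illustration at the page edge from being mistaken for the gutter.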
Many of the books being scanned are fragile and printed on thin paper. That rules out removing the binding and, in some rare cases, even laying them flat for scanning. This is a known problem to the gadget makers who build scanners. You can read about some of the gadgets specialized for scanning books here. (Wikipedia Link)
If you check the Wikipedia link, you will see open designs that are specifically built to accommodate a variety of book sizes. Closed units, by comparison, are less sensitive to ambient lighting and need less powerful lighting (which saves on your power budget, something that is critical today).
Now, why would this be an interesting space for innovation? The answer lies in the pricing of the available solutions, which come bundled with software. Flatbed book scanners differ from traditional flatbed scanners in that they offer high-speed scanning specifically for standard book sizes (the fastest take about 3 s per page). They do not turn pages automatically, which is obviously the tough part and also risks damaging the book.
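To put the 3 s per page figure in perspective, here is a quick back-of-the-envelope calculation in Python. The 300-page book and the 5 s of manual page turning are made-up numbers for illustration only.

```python
def scan_minutes(pages, seconds_per_page=3.0, handling_seconds=0.0):
    """Estimate the minutes needed to scan a book.

    seconds_per_page: scanner time per page (about 3 s for the fastest units).
    handling_seconds: assumed extra time per page for turning it by hand.
    """
    return pages * (seconds_per_page + handling_seconds) / 60


# Hypothetical 300-page book, scanner time only.
print(scan_minutes(300))
# Same book with an assumed 5 s of manual page turning per page.
print(scan_minutes(300, handling_seconds=5.0))
```

Even a modest handling time per page dominates the total, which is why automatic page turning matters so much.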
If you want a complete solution that turns pages automatically, you will find the following listed.
Now looking at numbers up to $10,000, this is definitely no easy investment for NGOs or even government libraries in developing nations. The factors that drive the price in many of these cases are:
- A camera with a high-end CCD and a lens capable of autofocus and control
- Robotic or non-touch techniques for page turning
- Illumination with automatic adjustment for paper type, quality and color
- Bundled software for OCR and automated PDF output
- Non-standard adjustable mechanical casing
- Demand volume that is too low to bring prices down, creating a vicious cycle
If you really want to penetrate this space, there is quite a lot that can be done. Here are a few suggestions:
- Use CMOS Cameras with In-built Autofocus DSP functions (Monochrome should suffice for most needs.)
- Use in-built DSP for quick post-processing and OCR (Processors like the OMAP already make it possible.)
- Use minimal illumination and partially closed box built on traditional flatbed scanner or OHP dimensions.
- Improvise on non-touch page turning (Robotics will need calibration and have component scalability issues.)
- Bundle Software that integrates well with Content Management Systems to support multiple output formats
I strongly feel that beating that price point and reaching a larger volume of people will create a compelling business case. New books are usually created on digital media and therefore have a digital alternative available. It is the volumes of old books containing priceless information that would need to be scanned. One could imagine a host of business models for such a case including Library centric Data centers hosting readable material accessible on almost any reading device.