Last modified: February 13, 2012
Editor: Rene Baston, Christoph Dalitz
Version: 1.0.6
Use the 'Addons' section on the Gamera home page for access to file releases of this toolkit.
The purpose of the OCR Toolkit is to help build optical character recognition (OCR) systems for standard text documents. Even though it can be used as is, it is specifically designed to make individual steps of the recognition system customizable and replaceable. The toolkit is based on and requires the Gamera framework for document analysis and recognition. As an addon package for Gamera, it provides
A comprehensive overview of design, usage and customization of the OCR toolkit can be found in the paper
C. Dalitz, R. Baston: Optical Character Recognition with the Gamera Framework. In C. Dalitz (Ed.): "Document Image Analysis with the Gamera Framework." Schriftenreihe des Fachbereichs Elektrotechnik und Informatik, Hochschule Niederrhein, vol. 8, pp. 53-65, Shaker Verlag (2009)
Optical character recognition (OCR) means the extraction of a machine readable text code from bitmap images of text documents. This process typically consists of the following steps:
The OCR toolkit only covers the process from segmentation to postprocessing. For preprocessing, the standard routines shipped with Gamera must be used beforehand, e.g. rotation_angle_projections for skew correction, or despeckle for noise removal.
For classification, the kNN classifier shipped with Gamera must be used. In particular, this means that you must train the classifier on some sample pages before doing the classification. At present, the toolkit does not include training databases for common fonts.
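To illustrate why training data is a prerequisite, here is a minimal pure-Python sketch of the k-nearest-neighbour idea. It is not Gamera's kNN classifier: the function names, the two-value feature vectors and the sample labels are all made up for this example. The point is only that an unknown glyph is labeled by a vote among the closest training samples, so without trained samples there is nothing to vote.

```python
from collections import Counter


def squared_distance(a, b):
    """Squared Euclidean distance between two equally long feature vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def knn_classify(training, features, k=3):
    """Return the majority label among the k training samples nearest to features.

    training is a list of (feature_vector, label) pairs (hypothetical data layout).
    """
    if not training:
        raise ValueError("kNN needs at least one labeled training sample")
    # Order the labeled samples by their distance to the unknown glyph.
    by_distance = sorted(training, key=lambda sample: squared_distance(sample[0], features))
    # Let the k closest samples vote on the label.
    votes = Counter(label for _, label in by_distance[:k])
    return votes.most_common(1)[0][0]


# Made-up training samples: (feature vector, character label).
training = [
    ((0.90, 0.10), "l"),
    ((0.80, 0.20), "l"),
    ((0.50, 0.50), "o"),
    ((0.45, 0.55), "o"),
    ((0.50, 0.60), "o"),
]

print(knn_classify(training, (0.48, 0.52), k=3))
```

In the real toolkit the feature vectors are computed by Gamera from the glyph images, and the training samples come from the pages you trained by hand.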
The toolkit consists of two Python modules, one image plugin function, and one end-user application.
The modules are
The end user application is
There is also one image plugin bbox_seg for textline segmentation which is simply a wrapper around the Gamera core plugin bbox_segmentation.
As the segmentation of the individual characters is based on a connected component analysis, the toolkit cannot deal with touching characters, unless they have been trained as ligatures. It is therefore in general only applicable to printed documents, rather than handwritten documents.
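The limitation with touching characters follows directly from how connected component labeling works. The sketch below is a simplified pure-Python illustration, not the toolkit's or Gamera's segmentation code; the function name and the toy bitmaps are assumptions made for this example. It groups black pixels (`#`) that touch horizontally, vertically or diagonally and returns one bounding box per group.

```python
def connected_components(bitmap):
    """Return bounding boxes (top, left, bottom, right) of 8-connected '#' regions.

    bitmap is a list of strings; '#' is a black pixel, anything else is white.
    """
    seen = set()
    boxes = []
    for r, row in enumerate(bitmap):
        for c, pixel in enumerate(row):
            if pixel != "#" or (r, c) in seen:
                continue
            # Flood fill from this unvisited black pixel.
            stack = [(r, c)]
            seen.add((r, c))
            top, left, bottom, right = r, c, r, c
            while stack:
                y, x = stack.pop()
                top, bottom = min(top, y), max(bottom, y)
                left, right = min(left, x), max(right, x)
                for dy in (-1, 0, 1):
                    for dx in (-1, 0, 1):
                        ny, nx = y + dy, x + dx
                        if (0 <= ny < len(bitmap) and 0 <= nx < len(bitmap[ny])
                                and bitmap[ny][nx] == "#" and (ny, nx) not in seen):
                            seen.add((ny, nx))
                            stack.append((ny, nx))
            boxes.append((top, left, bottom, right))
    # Sort left to right, roughly in reading order within a line.
    return sorted(boxes, key=lambda box: box[1])


# Two separate glyphs: a thin stroke and a wider block.
separate = [
    "#..##",
    "#..##",
    "#..##",
]

# The same glyphs joined by a bridge of ink in the middle row.
touching = [
    "#..##",
    "#####",
    "#..##",
]

print(len(connected_components(separate)))
print(len(connected_components(touching)))
```

As soon as a single row of ink joins the two shapes, they collapse into one component, which a classifier can only recognize if that combined shape has itself been trained.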
From a user's perspective, there are some points to be aware of in this toolkit:
This documentation is written for those who want to use the toolkit for OCR, but are not interested in extending the toolkit itself.
This documentation is for those who want to extend the functionality of the OCR toolkit, or who want to customize specific steps of the recognition process.
We have only tested the toolkit on Linux and MacOS X, but as the toolkit is written entirely in Python, the following instructions should work for any operating system.
First you will need a working installation of Gamera 3.x. See the Gamera website for details. It is strongly recommended that you use a recent version, preferably from SVN.
If you want to generate the documentation, you will need two additional third-party Python libraries:
Note
It is generally not necessary to generate the documentation because it is included in file releases of the toolkit.
To build and install this toolkit, go to the base directory of the toolkit distribution and run the setup.py script as follows:
    # 1) compile
    python setup.py build

    # 2) install
    sudo python setup.py install
Command 1) compiles the toolkit from the sources and command 2) installs it. As the latter requires root privileges, you need to use sudo on Linux and MacOS X. On Windows, sudo is not necessary.
Note that the script ocr4gamera is installed into /usr/bin on Linux, but into /System/Library/Frameworks/Python.framework/Versions/2.x/bin on MacOS X. As the latter directory is not in the standard search path, you could either add it to your search path, or install the scripts additionally into /usr/bin on MacOS X with:
    # install scripts into standard path (MacOS X only)
    sudo python setup.py install_scripts -d /usr/bin
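To check whether the installed script is actually reachable from your search path, a small lookup like the following can help. This is only a sketch: `find_on_path` is a hypothetical helper written for this example and is not part of the toolkit. It walks the directories in `PATH` and returns the first executable file with the given name.

```python
import os


def find_on_path(name, search_path=None):
    """Return the first executable file called name in search_path, or None.

    search_path uses the platform's PATH syntax; it defaults to $PATH.
    """
    if search_path is None:
        search_path = os.environ.get("PATH", "")
    for directory in search_path.split(os.pathsep):
        if not directory:
            continue
        candidate = os.path.join(directory, name)
        if os.path.isfile(candidate) and os.access(candidate, os.X_OK):
            return candidate
    return None


if __name__ == "__main__":
    location = find_on_path("ocr4gamera")
    print(location or "ocr4gamera is not on the search path")
```

If the script is reported as missing on MacOS X, either of the two remedies described above applies.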
If you want to regenerate the documentation, go to the doc directory and run the gendoc.py script. The output will be placed in the doc/html/ directory. The contents of this directory can be placed on a webserver for convenient viewing.
Note
Before building the documentation you must install the toolkit. Otherwise gendoc.py will not find the plugin documentation.
The above installation with python setup.py install will install the toolkit system wide and thus requires root privileges. If you do not have root access (Linux) or are not a sudoer (MacOS X), you can install the OCR toolkit into your home directory. Note however that this also requires that Gamera is installed into your home directory. It is currently not possible to install Gamera globally and only toolkits locally.
Here are the steps to install both Gamera and the OCR toolkit into ~/python:
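    # install Gamera locally (run in the Gamera source directory)
    mkdir ~/python
    python setup.py install --prefix=~/python

    # build and install the OCR toolkit locally (run in the toolkit source directory)
    # the compiler does not expand "~", so the include path uses $HOME
    export CFLAGS=-I$HOME/python/include/python2.3/gamera
    python setup.py build
    python setup.py install --prefix=~/python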
Moreover you should set the following environment variables in your ~/.profile:
    # search path for python modules
    export PYTHONPATH=~/python/lib/python

    # search path for executables (e.g. gamera_gui)
    export PATH=~/python/bin:$PATH
The installation uses the Python distutils, which do not support uninstallation. Thus you need to remove the installed files manually:
All Python library files of this toolkit are installed into the gamera/toolkits/ocr subdirectory of the Python library folder. Thus it is sufficient to remove this directory for an uninstallation.
The location of the Python library folder depends on your system and Python version. Here are the folders that you need to remove on MacOS X and Debian Linux ("2.3" stands for the Python version; replace it with your actual version):
- MacOS X: /Library/Python/2.3/gamera/toolkits/ocr
- Debian Linux: /usr/lib/python2.3/site-packages/gamera/toolkits/ocr
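If you are unsure where the library folder is on your system, the following sketch lists every gamera/toolkits/ocr directory reachable from the module search path of the Python interpreter that runs it. `installed_toolkit_dirs` is a hypothetical helper written for this example; it only reports directories and deliberately does not delete anything.

```python
import os
import sys


def installed_toolkit_dirs(search_path=None):
    """List existing gamera/toolkits/ocr directories below the given folders.

    search_path defaults to sys.path of the running interpreter.
    """
    if search_path is None:
        search_path = sys.path
    relative = os.path.join("gamera", "toolkits", "ocr")
    found = []
    for entry in search_path:
        if not entry:
            continue
        candidate = os.path.join(entry, relative)
        if os.path.isdir(candidate) and candidate not in found:
            found.append(candidate)
    return found


if __name__ == "__main__":
    for directory in installed_toolkit_dirs():
        print(directory)
```

Run it with the same Python interpreter that you used for the installation, and review the printed paths before removing anything by hand.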
The standalone scripts are installed into /usr/bin (Linux) or /System/Library/Frameworks/Python.framework/Versions/2.3/bin (MacOS X), unless you have explicitly chosen a different location with the options --prefix or --home during installation.
For an uninstall, remove the following script:
- ocr4gamera
Note
In older versions (1.0.0 and 1.0.1) this script was named ocr4gamera.py. Remove this old script if you are upgrading from one of these versions.
The documentation was written by Rene Baston and Christoph Dalitz. Permission is granted to copy, distribute and/or modify this documentation under the terms of the Creative Commons Attribution Share-Alike License (CC-BY-SA) v3.0. In addition, permission is granted to use and/or modify the code snippets from the documentation without restrictions.