Tools for recognizing text on collections of pages.
ocropus-nlbin
  purpose:
    noise and border removal
    page rotation correction
    grayscale normalization
    binarization
  usage:
    ocropus-nlbin image1.png image2.png ...
  inputs:
    image1.png
  outputs:
    image1.nrm.png - normalized grayscale image
    image1.bin.png - binarized image
  notes:
    optimized for 300 dpi book pages; other kinds of inputs may require parameter adjustments
  parameters:
    TBD
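The two core steps, intensity normalization and thresholding, can be sketched as follows. This is a conceptual illustration only, not the actual ocropus-nlbin algorithm; the pixel values and the 0.5 threshold are made up.

```python
# Minimal sketch of grayscale normalization followed by binarization.
# Real page binarization is adaptive; this global version only
# illustrates the idea.

def normalize(pixels):
    """Rescale pixel intensities to the range 0.0-1.0."""
    lo, hi = min(pixels), max(pixels)
    if hi == lo:
        return [0.0 for _ in pixels]
    return [(p - lo) / (hi - lo) for p in pixels]

def binarize(pixels, threshold=0.5):
    """Map each normalized pixel to 0 (ink) or 1 (background)."""
    return [1 if p >= threshold else 0 for p in pixels]

row = [30, 200, 180, 25, 240]   # one row of grayscale values
binary = binarize(normalize(row))
```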
ocropus-gpageseg
  purpose:
    column finding
    text line finding
  usage:
    ocropus-gpageseg page1.png page2.png ...
    ocropus-gpageseg 'book/????.png'
  inputs:
    page1.png - binarized or normalized page image (won't work well on other kinds of grayscale inputs)
    also looks for page1.nrm.png and page1.bin.png and uses them if available
  outputs:
    page1.pseg.png - page segmentation file
    page1/010001.png - text lines
    page1/010001.nrm.png - grayscale text line (if a .nrm.png page image was available)
  notes:
    optimized for 300 dpi book pages; other kinds of inputs may require parameter adjustment
    this program is really more of a placeholder and will be replaced by trainable layout analysis
  limitations:
    no text/image segmentation; pages with complex figures will result in a lot of "noise"
    column finding performance is variable depending on the input documents
    text lines printed in very large fonts will get erroneously split horizontally, leading to misrecognition
  TODO:
    improve column finding
    add image detection / removal
ocropus-rpred
  purpose:
    apply a neural network recognizer
  usage:
    ocropus-rpred line1.png line2.png ...
  inputs:
    line1.bin.png
    will automatically expand glob patterns in its arguments
  outputs:
    line1.txt
  notes:
    the recognizer is currently trained on UW3 and UNLV data; you can load other models with -m
    also performs text line normalization
  limitations:
    current models are trained on modern type fonts and flatbed scans (UW3, UNLV); performance will be worse on other kinds of inputs
  parameters:
    -m model.pyrnn - a line recognizer model is just a pickled Python object
    TBD
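"A pickled Python object" means that saving and loading a model is a plain pickle round trip. The LineRecognizer class below is a hypothetical stand-in, not the real .pyrnn class; only the save/load mechanics are the point.

```python
# Sketch of how a model file like model.pyrnn is written and read.
import pickle

class LineRecognizer:
    """Hypothetical minimal model: maps a line image to text."""
    def __init__(self, charset):
        self.charset = charset
    def predict(self, line_image):
        # A real model would run the network here.
        return ""

model = LineRecognizer(charset="abc")
blob = pickle.dumps(model)       # what gets written to model.pyrnn
restored = pickle.loads(blob)    # what loading with -m gets back
```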
ocropus-hocr
  purpose:
    output hOCR-encoded recognition results
  usage:
    ocropus-hocr bookdirectory -o book.html
  notes:
    outputs hOCR format, including bounding boxes for text lines and per-line baseline information
    also outputs font size information
  TODO:
    lots of bug fixing and testing using the new pipeline
    optionally output word and character bounding boxes
    output more font-related information
Tools for training OCRopus to recognize new scripts and languages.
ocropus-rtrain
  purpose:
    train a neural network recognizer
  usage:
    ocropus-rtrain -o newmodel.pyrnn line1.png line2.png ...
  inputs:
    line1.png, line1.gt.txt, etc.
    will automatically expand glob patterns in its arguments
  outputs:
    a new recurrent neural network model
  notes:
    transcriptions should be in Unicode
    internally, text lines will be normalized prior to training/recognition; the normalization process may need modification for some scripts
  parameters:
    -s 4 - show every fourth input for debugging purposes
    TBD
ocropus-gtedit html
  purpose:
    line-wise correction of input text
  usage:
    ocropus-gtedit html -o mypage.html *.png
  inputs:
    PNG image files representing text lines
    corresponding .txt or .gt.txt files
  outputs:
    an HTML file containing both images and corresponding text (in text input boxes)
ocropus-gtedit extract
  purpose:
    extract ground truth from an edited HTML file created with ocropus-gtedit
  usage:
    ocropus-gtedit extract -o mydir mypage.html
Tools that help with debugging and identifying the sources of recognition errors.
ocropus-showline
  purpose:
    show text line recognition results and details

ocropus-showpage
  purpose:
    show page recognition results and details
  usage:
    ocropus-showpage book/0001.bin.png
    ocropus-showpage book/0001.bin.png -o debug.png
  TODO:
    ocropus-showpage 'book/????.bin.png' --index index.html

ocropus-submit
  purpose:
    submit a bug report
These are the tools for an older, character-level recognizer.
recognition

ocropus-lattices
  purpose:
    compute recognition lattices for text lines
  usage:
    ocropus-lattices line1.png line2.png ...
  inputs:
    line1.png; optionally uses line1.nrm.png
    will automatically expand glob patterns in its arguments
  outputs:
    line1.rseg.png
    line1.lattice
    optionally outputs line1.txt
  notes:
    the recognizer is currently trained on UW3 and UNLV data
    input text should be around 300 dpi; rescale prior to recognition if necessary
    will refuse to recognize lines that are too small or too large (override with --xheightrange)
    examine the output/performance with "ocropus-showline image.png"; it will display all the intermediate recognition results
    also performs text line extraction (see below)
  limitations:
    current models are trained on modern type fonts and flatbed scans (UW3, UNLV); performance will be worse on other kinds of inputs
    current models have not been trained on ligatures or unsegmentable pairs, leading to some digraphs being misrecognized frequently in some printing styles
  parameters:
    -m model.cmodel - a character model is just a pickled Python class with a coutputs method; the input to the character recognizer is a normalized character image (depending on the normalization mode)
    TBD

ocropus-ngraphs
  purpose:
    perform language modeling
  usage:
    ocropus-ngraphs 'book/????/??????.png'
    ocropus-ngraphs line1.lattice line2.lattice ...
  inputs:
    line1.lattice - a lattice produced by ocropus-lattices
    will automatically expand glob patterns in its arguments
  outputs:
    line1.txt - recognized text
    line1.cseg.png - aligned character segmentation for the recognized output
  notes:
    language models are easy to train: all you need is a collection of text files, over which you run ocropus-ngraphs --build; see below
  parameters:
    there are many parameters that trade off the language model against the character recognizer, and they can make a big difference
    you need to experiment (with scripts generating various combinations) to see what works for you
    good parameter values depend on how similar your unknown text is to your training inputs, how good your character recognizer is, and how good its weights are
    language models make a distinction between weights and whether a string is in the language model at all
    a number of language models (default-2.ngraphs through default-6.ngraphs) are available for download (see the web site); they were derived from Project Gutenberg texts, as well as the UW3 and UNLV databases
  limitations:
    language models may not help recognition rates much, and can even hurt recognition
    language model parameters are highly dependent on both the quality and the content of the input
  TODO:
    simple back-off
    discriminative training of weights
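The idea behind a character n-graph model, conditional probabilities of the next character given its predecessors, estimated from a text corpus, can be sketched with a toy bigram model. The real .ngraphs format, orders, and smoothing are different; this is purely conceptual.

```python
# Toy character-bigram language model built from a plain text corpus.
from collections import Counter, defaultdict

def build_bigrams(corpus):
    """Count character bigrams and normalize into P(next | prev)."""
    counts = defaultdict(Counter)
    for prev, nxt in zip(corpus, corpus[1:]):
        counts[prev][nxt] += 1
    probs = {}
    for prev, nxts in counts.items():
        total = sum(nxts.values())
        probs[prev] = {c: n / total for c, n in nxts.items()}
    return probs

model = build_bigrams("the cat sat on the mat")
# P('h' | 't') is high because "th" occurs often in this tiny corpus.
p_h_given_t = model["t"].get("h", 0.0)
```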
character training

ocropus-tsplit
  purpose:
    compute a tree vector quantizer for its inputs

ocropus-tleaves
  purpose:
    compute terminal classifiers for each vector quantization bucket

ocropus-wtrain
  purpose:
    train a whitespace model
  TODO:
    training of whitespace models is currently broken; this command will be replaced

ocropus-db
  purpose:
    manipulate character databases (stored in HDF5 format)
  subcommands:
    cat db1 db2 ... -o db - concatenate one or more databases, optionally selecting classes and changing class labels
    shuffle db -o out-db - take a random sample from a database or shuffle all its elements
    tess2h5 *.tiff -o tess.h5 - convert Tesseract box files to OCRopus HDF5 databases for training
    predict -m model db.h5 - evaluate a character recognizer on a database
    cnormalize input -o db.h5 - change a database from line-based size normalization to character-based size normalization
    zoom -f factor input -o db.h5 - change the resolution of a database

ocropus-cedit
  purpose:
    show and edit character databases
  TODO:
    show and edit treevq files
language models

ocropus-ngraphs --build
  purpose:
    build ngraph language models from a text corpus
data generation

ocropus-align
  purpose:
    align text line images with ground truth
  TODO:
    output line1.cseg-aligned.png
    perform page level alignment (as before); maybe a separate tool

ocropus-lattices --extract
  purpose:
    extract individual characters for training
  usage:
    ocropus-lattices --extract -o chardb.h5 image1.png image2.png
    ocropus-lattices --extract -o chardb.h5 'book/????/??????.png'

ocropus-linegen
  purpose:
    generate text lines for training
  TODO:
    port over from llpy

ocropus-db tess2h5
  see above
visualization

ocropus-ngraphs --print
  purpose:
    print a textual representation of the recognition lattice
  usage:
    ocropus-ngraphs --print some.lattice
file formats

File formats are documented externally; please follow the links.
  .pseg.png - http://goo.gl/9IfXM
  .rseg.png - http://goo.gl/9IfXM
  .cseg.png - http://goo.gl/9IfXM
  .lattice - http://goo.gl/9IfXM
  .aligned - http://goo.gl/9IfXM
  hOCR output - http://goo.gl/zXNrC
directory layout

Although many OCRopus commands work on individual files, there is a standard
directory layout that some tools expect. It consists of a top-level directory
containing files of the form 0001.extension for each page, plus subdirectories
of the form 0001 that in turn contain files for each line.

book/
  0001.bin.png - a deskewed, binarized page image
  0001.nrm.png - a deskewed, intensity-normalized page image
  0001.pseg.png - a color-coded page segmentation file
  0001/ - one of these directories for each input page
    produced by the page layout analysis program (e.g., ocropus-gpageseg)
    each file name corresponds to the hex RGB value of the corresponding pixels in the 0001.pseg.png page segmentation file
    010001.bin.png - a binary text line image; produced by, e.g., ocropus-gpageseg
    010001.nrm.png - an intensity-normalized grayscale text line image; produced by, e.g., ocropus-gpageseg
    010001.rseg.png - a raw character segmentation image; produced by ocropus-lattices; some tools assume that this is also a good binarization of the input (FIXME), but really a good binarization is this image masked with the .bin.png image
    010001.lattice - a recognition lattice; produced by ocropus-lattices
    010001.cseg.png - an aligned character segmentation; produced by ocropus-align and ocropus-ngraphs
    010001.aligned - aligned character data; produced by ocropus-align and ocropus-ngraphs (FIXME); encodes ligatures and pseudo-ligatures using notation like "_fi_"
    010001.gt.txt - ground truth text used for training; used as input by ocropus-align
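The pseg-color-to-filename convention can be sketched as follows: each line's base name is the six-hex-digit RGB value of that line's pixels in the .pseg.png image. The RGB triple below is made up for illustration.

```python
# Sketch of deriving a line's file name from its segmentation color.

def line_basename(r, g, b):
    """Pack an RGB triple into the 6-hex-digit line file name."""
    value = (r << 16) | (g << 8) | b
    return "%06x" % value

# A line colored RGB (1, 0, 1) in 0001.pseg.png gets files named 010001.*:
name = line_basename(1, 0, 1)
files = [name + ext for ext in (".bin.png", ".nrm.png", ".gt.txt")]
```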
These are the models responsible for recognition.
.cmodel
  A pickled Python object representing a character classifier.
  members:
    coutputs(image) -> [(cls,prob),...] - returns a list of classes and probabilities for the given input image
    sizemode - indicates how characters are expected to be scaled for this classifier
      "perchar" - use rescaling as implemented by improc.classifier_normalize
      "perline" - use rescaling as implemented by improc.line_normalize
  comments:
    trained with "ocropus-tsplit" and "ocropus-tleaves" or other character training tools
    primarily responsible for the recognition of characters
    cmodels may expect different size normalization of their inputs; mismatches between normalization and sizemode will lead to very poor recognition performance (the sizemode is stored in the sizemode member, see above)
    Latin script is particularly sensitive to the size and position of characters and requires "perline" mode; for many other scripts (Chinese, Japanese, Greek, Hebrew, etc.), the "perchar" sizemode is easier to train and use
    if characters are trained with "perline" mode, you need to have a .lineest model
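A hypothetical minimal object satisfying this interface, a coutputs method returning (class, probability) pairs plus a sizemode member, looks like the sketch below. The uniform probabilities are placeholders, not a real classifier.

```python
# Stub illustrating the .cmodel interface described above.

class DummyCharModel:
    sizemode = "perchar"   # characters rescaled individually

    def __init__(self, classes):
        self.classes = classes

    def coutputs(self, image):
        """Return [(cls, prob), ...] for the input character image."""
        p = 1.0 / len(self.classes)
        return [(c, p) for c in self.classes]

model = DummyCharModel(["a", "b", "c"])
outputs = model.coutputs(image=None)   # image is ignored by this stub
```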
.lineest
  A pickled Python object with two methods returning polynomial models of the baseline and xline.
  members:
    lineFit(image) -> (bl_poly,xl_poly) - returns two polynomials suitable for evaluation with polyval (just lists of coefficients), one for the baseline, one for the x line
  comments:
    trained with the "python -m ocrolib.lineest" command
    if the script is similar to Latin script, it probably doesn't need retraining
    line estimators are only needed for scripts in which the relative size and position of characters makes a significant difference; for most scripts other than Latin this isn't the case
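Evaluating the returned coefficient lists can be sketched as below. polyval here is a plain Horner-scheme evaluation with highest-degree-first coefficients (the order numpy.polyval uses); the example coefficients are made up.

```python
# Sketch of using the (bl_poly, xl_poly) pair that lineFit returns.

def polyval(coeffs, x):
    """Evaluate a polynomial given highest-degree-first coefficients."""
    result = 0.0
    for c in coeffs:
        result = result * x + c
    return result

bl_poly = [0.001, 40.0]    # hypothetical baseline: y = 0.001*x + 40
xl_poly = [0.001, 25.0]    # hypothetical x-line:   y = 0.001*x + 25

x = 100
baseline_y = polyval(bl_poly, x)              # approximately 40.1
xheight = baseline_y - polyval(xl_poly, x)    # approximately 15.0
```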
.ngraphs
  A pickled Python object representing conditional probabilities.
  members:
    all members are private (the class is only used by ocropus-ngraphs)
  comments:
    trained with the "ocropus-ngraphs" command
    easy to train, and should be retrained for different corpora
    recognition performance is quite sensitive to the parameters given to ocropus-ngraphs (e.g., the language model weight); this tricky tradeoff between language models and character recognition is true of all OCR engines
.wmodel
  A pickled Python object estimating the probability that there is a space at a particular location.
  members:
    setLine(image) - sets the line image
    classifySpace(x) -> (no,yes) - returns the probabilities that there is no space / a space for the character ending at location x in the input image
  comments:
    the training tool for this is currently broken; it will be replaced
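A hypothetical minimal object with this interface is sketched below. The fixed (no, yes) probabilities are placeholders; a real model would inspect the gap between connected components around column x.

```python
# Stub illustrating the .wmodel interface described above.

class DummySpaceModel:
    def setLine(self, image):
        """Store the line image for later queries."""
        self.image = image

    def classifySpace(self, x):
        """Return (P(no space), P(space)) at column x."""
        return (0.9, 0.1)   # placeholder probabilities

model = DummySpaceModel()
model.setLine(image=None)          # a real caller passes a line image
no_p, yes_p = model.classifySpace(x=120)
```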
default
  a few default paths; this is also read by setup.py
  TODO: rename to defaults.py
toplevel
  decorators etc. intended to be imported at top level (very few)
  in particular, contains the @checks decorator, which checks arguments to Python functions; a large number of argument checks relevant to image processing and OCR are predefined
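An argument-checking decorator in this spirit can be sketched as follows. This is not the real ocrolib @checks implementation (which knows about image and array types); it only shows the mechanism of applying per-argument predicates.

```python
# Sketch of a @checks-style argument-validating decorator.
import functools

def checks(*predicates):
    def decorator(fn):
        @functools.wraps(fn)
        def wrapper(*args, **kwargs):
            for pred, arg in zip(predicates, args):
                if not pred(arg):
                    raise ValueError(
                        "argument %r failed check %s" % (arg, pred.__name__))
            return fn(*args, **kwargs)
        return wrapper
    return decorator

def positive(x):
    return x > 0

@checks(positive)
def zoom(factor):
    return factor * 2
```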
common
  reading and writing of images, segmentation files, etc.
  dealing with file names and directories
  some general simple image processing functions
  some Unicode-related utilities
  warnings, checks, loading and instantiating components
  TODO:
    move exceptions and annotations to toplevel.py
    use warnings, checks, etc. more consistently
    use mkpython more consistently
    remove old make_I... functions
    move rect_union to sl
    move drawing functions into a separate module
    make gt_explode/implode into a Python codec
sl
  utilities for dealing with Python tuples-of-slices as rectangles
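The tuples-of-slices idea: a rectangle is (slice(y0, y1), slice(x0, x1)), exactly the object you would use to index a 2-D array. The intersect helper below is an illustrative sketch, not the real sl API.

```python
# Sketch of rectangles represented as tuples of slices.

def intersect(a, b):
    """Intersection of two (row-slice, col-slice) rectangles."""
    return tuple(
        slice(max(s.start, t.start), min(s.stop, t.stop))
        for s, t in zip(a, b)
    )

r1 = (slice(10, 50), slice(20, 80))
r2 = (slice(30, 70), slice(0, 60))
r = intersect(r1, r2)    # the overlapping region of r1 and r2
```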
native
  a small library for embedding C code inside Python
morph
  morphological operations complementing scipy.ndimage.morphology
improc
  various image processing operations
lineproc
  image processing specifically related to text lines
psegutils
  image processing specifically related to page images
patrec
  various pattern recognition algorithms
mlinear
  linear classifiers, logistic regression
mlp
  MLPs and backpropagation
wmodel
  white space models
lineseg
  line segmentation by connected components
  line segmentation by dynamic programming cuts
linerec
  text line recognition
lineest
  line geometry estimation: finds the baseline and xline of text lines
  trained models have the form name.lineest
lang
  language- and script-related data
  TODO: merge with ligatures.py
ligatures
  encoding and decoding of ligatures as integers, and related functions
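Encoding ligatures as integers can be sketched as follows: single characters map to their Unicode code points, while multi-character ligatures get code points from a private range. The PRIVATE_BASE value and the ligature list below are illustrative, not the real ocrolib tables.

```python
# Sketch of a ligature <-> integer encoding scheme.

PRIVATE_BASE = 0xF0000          # hypothetical private-use starting point
LIGATURES = ["fi", "fl", "ffi"]

LIG2CODE = {lig: PRIVATE_BASE + i for i, lig in enumerate(LIGATURES)}
CODE2LIG = {code: lig for lig, code in LIG2CODE.items()}

def encode(sym):
    """Map a character or ligature string to a single integer."""
    if len(sym) == 1:
        return ord(sym)
    return LIG2CODE[sym]

def decode(code):
    """Map an integer back to its character or ligature string."""
    return CODE2LIG.get(code, chr(code))
```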
lattice
  reads lattice files and augments them with some transitions; used for both alignment and language modeling
  notes: you should usually use linerec.read_lattice and linerec.write_lattice
  TODO: change the name to something less generic