OCRopus Overview (Published)


1. commands

1.1. recognition

1.1.1. Tools for recognizing text on collections of pages.

1.1.2. ocropus-nlbin
    purpose: noise and border removal; page rotation correction; grayscale normalization; binarization
    usage: ocropus-nlbin image1.png image2.png ...
    inputs / outputs
    notes: optimized for 300 dpi book pages; other kinds of inputs may require parameter adjustments
    parameters: TBD

1.1.3. ocropus-gpageseg
    purpose: column finding; text line finding
    usage: ocropus-gpageseg page1.png page2.png ...
           ocropus-gpageseg 'book/????.png'
    inputs / outputs
    notes: optimized for 300 dpi book pages; other kinds of inputs may require parameter adjustment; this program is really more of a placeholder and will be replaced by trainable layout analysis
    limitations: no text/image segmentation, so pages with complex figures will result in a lot of "noise"; column finding performance varies depending on the input documents; text lines printed in very large fonts will get erroneously split horizontally, leading to misrecognition
    TODO: improve column finding; add image detection/removal

1.1.4. ocropus-rpred
    purpose: apply a neural network recognizer
    usage: ocropus-rpred line1.png line2.png ...
    inputs / outputs
    notes: the recognizer is currently trained on UW3 and UNLV data; you can load other models with -m; also performs text line normalization
    limitations: current models are trained on modern type fonts and flatbed scans (UW3, UNLV); performance will be worse on other kinds of inputs
    parameters: -m model.pyrnn; TBD

1.1.5. ocropus-hocr
    purpose: output hOCR-encoded recognition results
    usage: ocropus-hocr bookdirectory -o book.html
    notes: outputs hOCR format, including bounding boxes for text lines and per-line baseline information; also outputs font size information
    TODO: lots of bug fixing and testing using the new pipeline; optionally output word and character bounding boxes; output more font-related information

1.2. training tools

1.2.1. Tools for training OCRopus to recognize new scripts and languages.

1.2.2. ocropus-rtrain
    purpose: train a neural network recognizer
    usage: ocropus-rtrain -o newmodel.pyrnn line1.png line2.png ...
    inputs / outputs
    notes: transcriptions should be in Unicode; internally, text lines will be normalized prior to training/recognition; the normalization process may need modification for some scripts
    parameters: -s 4; TBD

1.2.3. ocropus-gtedit html
    purpose: line-wise correction of input text
    usage: ocropus-gtedit html -o mypage.html *.png
    inputs / outputs

1.2.4. ocropus-gtedit extract
    purpose: extract ground truth from an edited HTML file created with ocropus-gtedit
    usage: ocropus-gtedit extract -o mydir mypage.html

1.3. visualization tools

1.3.1. Tools that help with debugging and identifying the sources of recognition errors.

1.3.2. ocropus-showline
    purpose: show text line recognition results and details

1.3.3. ocropus-showpage
    purpose: show page recognition results and details
    usage: ocropus-showpage book/0001.bin.png
           ocropus-showpage book/0001.bin.png -o debug.png
    TODO: ocropus-showpage 'book/????.bin.png' --index index.html

1.3.4. ocropus-visualize-results

1.3.5. ocropus-submit
    purpose: submit a bug report

1.4. character-level recognizer

1.4.1. These are the tools for an older, character-level recognizer.

1.4.2. recognition
    ocropus-lattices: purpose / usage / notes / limitations / parameters
    ocropus-ngraphs: purpose / usage / notes / parameters / limitations / TODO

1.4.3. character training
    ocropus-tsplit: purpose
    ocropus-tleaves: purpose
    ocropus-wtrain: purpose / TODO
    ocropus-db: purpose / subcommands
    ocropus-cedit: purpose / TODO

1.4.4. language models
    ocropus-ngraphs --build: purpose

1.4.5. data generation
    ocropus-align: purpose / TODO
    ocropus-lattices --extract: purpose / usage
    ocropus-linegen: purpose / TODO
    ocropus-db tess2h5: see above

1.4.6. visualization
    ocropus-ngraphs --print: usage / purpose

2. files

2.1. intermediate files during recognition

2.1.1. file formats
    File formats are documented externally; please follow the links.
    .pseg.png: http://goo.gl/9IfXM
    .rseg.png: http://goo.gl/9IfXM
    .cseg.png: http://goo.gl/9IfXM
    .lattice: http://goo.gl/9IfXM
    .aligned: http://goo.gl/9IfXM
    hOCR output: http://goo.gl/zXNrC

2.1.2. directory layout
    Although many OCRopus commands work on individual files, there is a standard directory layout that some tools expect. It consists of a top-level directory containing files of the form 0001.extension for each page, and subdirectories of the form 0001 that in turn contain files for each line.
    book/
        0001.bin.png
        0001.nrm.png
        0001.pseg.png
        0001/
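
For illustration, the standard layout can be built and inspected with ordinary Python; the page and line numbers below are made up for the example.

```python
import os
import tempfile

# Build a miniature example of the standard "book" directory layout:
# one file per page at the top level, plus a per-page subdirectory
# holding one file per extracted text line.
root = tempfile.mkdtemp()
book = os.path.join(root, "book")
os.makedirs(os.path.join(book, "0001"))

for name in ["0001.bin.png", "0001.nrm.png", "0001.pseg.png"]:
    open(os.path.join(book, name), "wb").close()          # page-level files
for name in ["010001.bin.png", "010002.bin.png"]:
    open(os.path.join(book, "0001", name), "wb").close()  # line-level files

pages = sorted(f for f in os.listdir(book) if f.endswith(".bin.png"))
lines = sorted(os.listdir(os.path.join(book, "0001")))
print(pages)  # ['0001.bin.png']
print(lines)  # ['010001.bin.png', '010002.bin.png']
```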

2.2. model files

2.2.1. These are the models responsible for recognition.

2.2.2. .cmodel
    A pickled Python object representing a character classifier.
    members: coutputs(image) -> [(cls,prob),...]; sizemode
    comments: training with "ocropus-tsplit" and "ocropus-tleaves" or other kinds of character training tools; primarily responsible for recognition of characters; cmodels may expect different size normalization of their inputs
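
To make the coutputs interface concrete, here is a self-contained stub with the same calling convention. The real cmodel classes come from ocrolib's training tools; this toy model, its fixed outputs, and the sizemode value are purely illustrative.

```python
import pickle

class ToyCharModel:
    """Illustrative stand-in for a .cmodel character classifier.

    A real cmodel is trained; this one returns a fixed distribution
    just to show the coutputs() calling convention.
    """
    sizemode = "perchar"  # hypothetical value, not from the real models

    def coutputs(self, image):
        # -> [(cls, prob), ...], best candidates first
        return [("e", 0.85), ("c", 0.10), ("o", 0.05)]

# .cmodel files are ordinary pickles, so save/load works like this:
blob = pickle.dumps(ToyCharModel())
model = pickle.loads(blob)
outputs = model.coutputs(None)  # a real call would pass a character image
print(outputs[0])  # ('e', 0.85)
```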

2.2.3. .lineest
    A pickled Python object with two methods returning polynomial models of the baseline and xline.
    members: lineFit(image) -> (bl_poly,xl_poly)
    comments: training with the "python -m ocrolib.lineest" command; if the script is similar to Latin script, it probably doesn't need retraining; line estimators are only needed for scripts in which the relative size and position of characters makes a significant difference, which isn't the case for most scripts other than Latin scripts

2.2.4. .ngraphs
    A pickled Python object representing conditional probabilities.
    members: all the members are private (the class is only used by ocropus-ngraphs)
    comments: training with the "ocropus-ngraphs" command; this is easy to train, and needs to be trained even for different corpora; recognition performance is quite sensitive to the parameters given to ocropus-ngraphs (e.g., language model weight); this tricky tradeoff between language models and character recognition is true of all OCR engines
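
The core idea behind an .ngraphs model, conditional character probabilities estimated from text, can be sketched with a plain bigram table. This is a simplification for illustration only; the real model, its members, and its training are internal to ocropus-ngraphs.

```python
from collections import Counter, defaultdict

def train_bigrams(corpus):
    """Estimate P(next_char | prev_char) from a training corpus."""
    counts = defaultdict(Counter)
    for prev, nxt in zip(corpus, corpus[1:]):
        counts[prev][nxt] += 1
    return {
        prev: {c: n / sum(ctr.values()) for c, n in ctr.items()}
        for prev, ctr in counts.items()
    }

probs = train_bigrams("the theme then")
print(round(probs["t"]["h"], 2))  # 't' is always followed by 'h' here -> 1.0
```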

2.2.5. .wmodel
    A pickled Python object estimating the probability that there is a space at a particular location.
    members: setLine(image); classifySpace(x) -> (no,yes)
    comments: the training tool for this is currently broken; it will be replaced

3. library

3.1. utility

3.1.1. default a few default paths; this is also read by setup.py; TODO: rename to defaults.py

3.1.2. toplevel decorators etc. intended to be imported at top level (very few); in particular, contains the @checks decorator, which checks arguments to Python functions; a large number of argument checks are defined, relevant to image processing and OCR
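
For illustration, an argument-checking decorator in the spirit of @checks might look like the sketch below. The predicate style and the POSITIVE check are made-up examples, not the actual ocrolib API, which defines many image- and OCR-specific checks.

```python
import functools

def checks(*predicates):
    """Check each positional argument against the matching predicate."""
    def decorate(f):
        @functools.wraps(f)
        def wrapper(*args, **kw):
            for i, (pred, arg) in enumerate(zip(predicates, args)):
                if not pred(arg):
                    raise ValueError(
                        f"{f.__name__}: argument {i} failed {pred.__name__}")
            return f(*args, **kw)
        return wrapper
    return decorate

def POSITIVE(x):  # example check, analogous in spirit to ocrolib's named checks
    return isinstance(x, (int, float)) and x > 0

@checks(POSITIVE)
def scale(factor):
    return 2 * factor

print(scale(3))  # 6
# scale(-1) would raise ValueError
```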

3.1.3. common reading and writing of images, segmentation files, etc. dealing with file names and directories some general simple image processing functions some unicode-related utilities warnings checks loading and instantiating components TODO move exceptions and annotations to toplevel.py use warning, checks, etc. more consistently use mkpython more consistently remove old make_I... functions move rect_union to sl move drawing functions into separate module make gt_explode/implode into Python codec

3.1.4. sl utilities for dealing with Python tuples-of-slices as rectangles
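
Since a tuple of slices is exactly what numpy-style indexing accepts (image[rect] extracts the rectangle), rectangle operations reduce to slice arithmetic. The function names below are invented for the example and are not necessarily the actual sl API.

```python
# Illustrative rectangle operations on Python tuples-of-slices.

def union(a, b):
    """Bounding box of two rectangles given as (row_slice, col_slice)."""
    return tuple(slice(min(s.start, t.start), max(s.stop, t.stop))
                 for s, t in zip(a, b))

def intersection(a, b):
    """Overlap of two rectangles; empty when stop <= start."""
    return tuple(slice(max(s.start, t.start), min(s.stop, t.stop))
                 for s, t in zip(a, b))

r1 = (slice(1, 3), slice(2, 5))   # rows 1-2, cols 2-4
r2 = (slice(2, 6), slice(0, 4))   # rows 2-5, cols 0-3

print(union(r1, r2))         # (slice(1, 6), slice(0, 5))
print(intersection(r1, r2))  # (slice(2, 3), slice(2, 4))
```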

3.1.5. native a small library for embedding C code inside Python

3.2. image processing

3.2.1. morph morphological operations to complement scipy.ndimage.morphology

3.2.2. improc various image processing operations

3.2.3. lineproc image processing specifically related to text lines

3.2.4. psegutils image processing specifically related to page images

3.3. pattern recognition

3.3.1. patrec various pattern recognition algorithms

3.3.2. mlinear linear classifiers, logistic regression

3.3.3. mlp MLPs and backpropagation

3.4. recognition

3.4.1. wmodel white space models

3.4.2. lineseg line segmentation by connected components and by dynamic programming cuts

3.4.3. linerec text line recognition

3.4.4. lineest line geometry estimation: finds the baseline and xline of text lines; trained models have the form name.lineest
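
Line geometry estimation of this kind amounts to fitting low-degree polynomials to character positions along the line. The degree-1 least-squares fit below is a self-contained illustration, not the actual lineest code, and the sample points are made up.

```python
def fit_line(points):
    """Least-squares fit of y = a*x + b: a degree-1 stand-in for the
    polynomial baseline models that .lineest files contain."""
    n = len(points)
    sx = sum(x for x, _ in points)
    sy = sum(y for _, y in points)
    sxx = sum(x * x for x, _ in points)
    sxy = sum(x * y for x, y in points)
    a = (n * sxy - sx * sy) / (n * sxx - sx * sx)
    b = (sy - a * sx) / n
    return a, b

# Hypothetical bottom edges of characters along a slightly rotated text line:
baseline_points = [(0, 100.0), (10, 101.0), (20, 102.0), (30, 103.0)]
a, b = fit_line(baseline_points)
print(a, b)  # slope 0.1, intercept 100.0
```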

3.5. text

3.5.1. lang language- and script-related data; TODO: merge with ligatures.py

3.5.2. ligatures encoding and decoding of ligatures as integers, related functions

3.5.3. lattice reads lattice files and augments them with some transitions; used for both alignment and language modeling
    notes: you should usually use linerec.read_lattice and linerec.write_lattice
    TODO: change the name to something less generic

3.5.4. ngraphs

3.5.5. hocr

4. KEY

4.1. this is a command

4.2. this is an obsolete command

4.3. this is an unimplemented command

4.4. this is a group of modules by function

4.5. this is a module in ocrolib

4.6. this is a file name in the directory hierarchy

4.7. this is a file format
