Tesseract OCR - Train Font

Trains Tesseract OCR to recognize special fonts.

Remarks

This command generates a Tesseract "traineddata" file from one or more fonts, enabling recognition of the character shapes defined by each font. The output from this command is a file named "CODE.traineddata", where CODE is the string value specified in the 'Language Code' property. To make use of the newly created training, the "CODE.traineddata" file must be copied into the "{Grooper Install Path}\Tesseract\tessdata" directory on each machine where OCR will be performed.
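The copy step described above can be scripted when several machines need the same training file. The following is a minimal sketch, not part of Grooper: the function name `install_traineddata` and its `grooper_install_path` argument are hypothetical, and the script assumes the "Tesseract\tessdata" folder already exists under the install path you pass in.

```python
import shutil
from pathlib import Path


def install_traineddata(traineddata_file, grooper_install_path):
    """Copy a .traineddata file into the Tesseract/tessdata folder.

    Both arguments are supplied by the caller; nothing here discovers
    the Grooper install path automatically (that is an assumption left
    to the user of this sketch).
    """
    source = Path(traineddata_file)
    if source.suffix != ".traineddata":
        raise ValueError(f"Expected a .traineddata file, got: {source.name}")

    tessdata_dir = Path(grooper_install_path) / "Tesseract" / "tessdata"
    if not tessdata_dir.is_dir():
        raise FileNotFoundError(f"tessdata folder not found: {tessdata_dir}")

    # copy2 preserves timestamps and returns the path of the new file.
    return Path(shutil.copy2(source, tessdata_dir))
```

Run it once per machine (or once per network path to each machine's install folder), passing the generated file and that machine's Grooper install path.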

Inherits from: Object Command

Properties

The following 22 properties are defined.

Property Name Description
General
Name Type: String

The name used to identify the output training set. It must be at least 4 characters long and must not contain spaces or illegal filename characters. After the training file has been generated and installed, the name specified here may be selected in the "Special Fonts" property of the Tesseract OCR engine. It also forms the first segment of the output filename, which takes the form "{Name}.traineddata".
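The naming rules above can be checked before running the command. This is a rough sketch only: the function `is_valid_training_name` is hypothetical, and the set of illegal characters is assumed to be the usual Windows reserved filename characters, which may not match Grooper's own validation exactly.

```python
# Assumption: "illegal filename characters" means the Windows reserved set.
ILLEGAL_FILENAME_CHARS = set('<>:"/\\|?*')


def is_valid_training_name(name):
    """Return True if name meets the rules stated for the Name property."""
    if len(name) < 4:
        return False  # must be at least 4 characters
    if any(ch.isspace() for ch in name):
        return False  # must not contain spaces
    if any(ch in ILLEGAL_FILENAME_CHARS or ord(ch) < 32 for ch in name):
        return False  # must be usable as the start of a filename
    return True


if __name__ == "__main__":
    for candidate in ["micr", "ocr", "my font", "font?a"]:
        print(candidate, is_valid_training_name(candidate))
```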

Tesseract Path Type: String, Default: C:\Program Files (x86)\Tesseract-OCR

The path to the Tesseract install folder. Tesseract must be downloaded and installed separately. After running the installer, ensure that this property points at the root of the Tesseract install directory.

Build Path Type: String

The base output directory for training data.

Install Local Type: Boolean, Default: False

If set to true, the generated training file will be installed on the local machine. Enabling this option places a copy of the CODE.traineddata file in the "{Grooper Install Path}\Tesseract\tessdata" folder, making it immediately available for testing on the local machine.

Language Settings
Base Language Type: Culture Data

The language on which to base the new language.

Copy Config Type: Boolean, Default: False

If set to true, the generated config file will be copied to the local machine. Enabling this option places a copy of the CODE.config file in the "{Grooper Install Path}\Tesseract\tessdata" folder.

Copy Disambiguation Data Type: Boolean, Default: False

If set to true, the generated disambiguation data file will be copied to the local machine. Enabling this option places a copy of the CODE.unicharambigs file in the "{Grooper Install Path}\Tesseract\tessdata" folder.

Copy Language Data Type: Boolean, Default: False

If set to true, the generated language data files will be copied to the local machine. Enabling this option places a copy of the language files (bigram-dawg, number-dawg, punc-dawg, word-dawg, freq-dawg) in the "{Grooper Install Path}\Tesseract\tessdata" folder.

Font Settings
Font Names Type: List of String

The list of fonts to be included in the training.

Font Sizes Type: String, Default: 8, 10, 12

The font sizes used to draw characters on training images, given as a comma-separated list.
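The default value "8, 10, 12" suggests that this string holds several sizes separated by commas. As an illustration only, such a value could be split into individual sizes as follows; `parse_font_sizes` is a hypothetical helper, and the exact syntax Grooper accepts is an assumption.

```python
def parse_font_sizes(value):
    """Split a string such as "8, 10, 12" into a list of integer sizes."""
    sizes = []
    for part in value.split(","):
        part = part.strip()
        if not part:
            continue  # tolerate stray commas and surrounding whitespace
        size = int(part)  # raises ValueError on non-numeric input
        if size <= 0:
            raise ValueError(f"Font size must be positive: {size}")
        sizes.append(size)
    return sizes


if __name__ == "__main__":
    print(parse_font_sizes("8, 10, 12"))
```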

Font Styles Type: TrainingFontStyle, Default: Normal

The font styles used to draw characters on training images.

Fixed Pitch Type: Boolean, Default: False

Indicates whether or not the font is fixed-pitch.

Serif Type: Boolean, Default: False

Indicates whether or not the font contains serifs.

Training Content
Character Set Type: String

The set of characters to be included in training. This value is reset each time the Font Names or Language Filter is changed. It initializes to the set of characters defined by the selected fonts. If a Language Filter is specified, the list of characters is further limited to those which appear on a keyboard in the selected languages.

Characters In Scope Type: String

Represents all of the characters from all of the configured fonts.
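The relationship between these two properties can be pictured as set operations. The sketch below is an interpretation, not Grooper's implementation: it assumes Characters In Scope is the union of every font's characters, and that the language filter keeps only characters found in a supplied keyboard set. The function names and the `font_charsets` and `keyboard_chars` inputs are hypothetical.

```python
def characters_in_scope(font_charsets):
    """Union of the characters defined by every configured font."""
    scope = set()
    for chars in font_charsets.values():
        scope |= set(chars)
    return scope


def initial_character_set(font_charsets, keyboard_chars=None):
    """Characters in scope, optionally limited to a keyboard character set."""
    scope = characters_in_scope(font_charsets)
    if keyboard_chars is not None:
        scope &= set(keyboard_chars)  # keep only characters on the keyboard
    return "".join(sorted(scope))


if __name__ == "__main__":
    fonts = {"FontA": "ABC1", "FontB": "BCD2"}
    print(initial_character_set(fonts))
    print(initial_character_set(fonts, keyboard_chars="ABCD"))
```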

Training Content File Type: String

Specify the path of the training content file.

Training Page Generation
Page Size Type: Logical Size, Default: 8.5in, 11in

The page size for training images.

Border Size Type: String, Default: 0.5in

The border size for training images.

Resolution Type: Int32, Default: 300

The resolution at which training images should be created, in dots per inch.

Line Spacing Type: Double, Default: 100%

The line spacing to be used in training images, expressed as a percentage of the font size.

Characters Per Line Type: Int32, Default: 40

The number of characters to draw on each line in training images.
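To see how the page settings above interact, the arithmetic can be worked through for the default values. This is a back-of-the-envelope sketch under stated assumptions: font sizes are taken to be in points (72 per inch), the border is applied equally on all four sides, and the line height equals the font size multiplied by the Line Spacing percentage. Grooper's actual layout may differ, and `page_layout` is a hypothetical helper.

```python
def page_layout(page_w_in=8.5, page_h_in=11.0, border_in=0.5, dpi=300,
                font_size_pt=12, line_spacing_pct=100.0, chars_per_line=40):
    """Estimate training page geometry in pixels from the property values."""
    page_w_px = round(page_w_in * dpi)
    page_h_px = round(page_h_in * dpi)
    border_px = round(border_in * dpi)

    # Area left for text once the border is removed from every side.
    usable_w_px = page_w_px - 2 * border_px
    usable_h_px = page_h_px - 2 * border_px

    # Assumption: font size is in points, 72 points per inch.
    font_px = font_size_pt * dpi / 72
    line_h_px = font_px * line_spacing_pct / 100

    lines_per_page = int(usable_h_px // line_h_px)
    return {
        "page_px": (page_w_px, page_h_px),
        "usable_px": (usable_w_px, usable_h_px),
        "line_height_px": line_h_px,
        "lines_per_page": lines_per_page,
        "chars_per_page": lines_per_page * chars_per_line,
    }


if __name__ == "__main__":
    for key, value in page_layout().items():
        print(key, value)
```

With the defaults, the page is 2550 x 3300 pixels, the area inside the border is 2250 x 3000 pixels, and a 12-point font at 100% line spacing yields an estimated 60 lines, or 2400 characters, per page.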

Image Degradation IP Profile Type: IP Profile

An optional IP Profile to be executed on each training image.

See Also

Culture Data, IP Profile, Logical Size, Property Grid, Tesseract OCR