Tesseract OCR - Train Font

Trains Tesseract OCR to recognize special fonts.

Remarks

This command generates a Tesseract "traineddata" file from one or more fonts, enabling recognition of the character shapes defined by each font. The output of this command is a file named "CODE.traineddata", where CODE is the string value specified in the 'Language Code' property. To use the newly created training, the "CODE.traineddata" file must be copied into the "{Grooper Install Path}\Tesseract\tessdata" directory on each machine where OCR will be performed.
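
The copy step described above can be scripted when several machines need the file. The following is a minimal sketch in Python and is not part of Grooper; the function name install_traineddata is hypothetical, and the caller must supply the real Grooper install path for each machine.

```python
import shutil
from pathlib import Path


def install_traineddata(traineddata_file, grooper_install_path):
    """Copy a CODE.traineddata file into the tessdata folder under the
    Grooper install path and return the path of the installed copy.

    Both arguments are supplied by the caller; nothing is auto-detected.
    """
    source = Path(traineddata_file)
    if source.suffix != ".traineddata":
        raise ValueError(f"expected a .traineddata file, got {source.name!r}")
    if not source.is_file():
        raise FileNotFoundError(source)

    # Folder layout taken from the Remarks: {install path}/Tesseract/tessdata
    tessdata = Path(grooper_install_path) / "Tesseract" / "tessdata"
    if not tessdata.is_dir():
        raise FileNotFoundError(f"tessdata folder not found: {tessdata}")

    # copy2 keeps the file timestamps and overwrites an older copy
    return Path(shutil.copy2(source, tessdata / source.name))
```

The function refuses to create the tessdata folder itself, so a mistyped install path fails loudly instead of producing a copy that the OCR engine would never find.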

Inherits from: Object Command

Properties

The following 22 properties are defined.

Property Name	Description
General
Name	Type: String The name used to identify the output training set. Must be at least 4 characters long and must not contain spaces or illegal filename characters. After the training file has been generated and installed, the name specified here may be selected in the "Special Fonts" property of the Tesseract OCR engine. The name also forms the first segment of the output filename, "{Name}.traineddata", which is why illegal filename characters are not allowed.
Tesseract Path	Type: String, Default: C:\Program Files (x86)\Tesseract-OCR The path to the Tesseract install folder. The Tesseract installer can be downloaded from here. After running the installer, ensure that this property is configured to point at the root of the Tesseract install directory.
Build Path	Type: String The base output directory for training data.
Install Local	Type: Boolean, Default: False If set to true, the generated training file will be installed on the local machine. Enabling this option places a copy of the CODE.traineddata file in the "{Grooper Install Path}\Tesseract\tessdata" folder, making it immediately available for testing on the local machine.
Language Settings
Base Language	Type: Culture Data The language on which to base the new language.
Copy Config	Type: Boolean, Default: False If set to true, the generated config file will be copied to the local machine. Enabling this option places a copy of the CODE.config file in the "{Grooper Install Path}\Tesseract\tessdata" folder.
Copy Disambiguation Data	Type: Boolean, Default: False If set to true, the generated disambiguation data file will be copied to the local machine. Enabling this option places a copy of the CODE.unicharambigs file in the "{Grooper Install Path}\Tesseract\tessdata" folder.
Copy Language Data	Type: Boolean, Default: False If set to true, the generated language data files will be copied to the local machine. Enabling this option places a copy of the language files (bigram-dawg, number-dawg, punc-dawg, word-dawg, freq-dawg) in the "{Grooper Install Path}\Tesseract\tessdata" folder.
Font Settings
Font Names	Type: List of String The list of fonts to be included in the training.
Font Sizes	Type: String, Default: 8, 10, 12 The font sizes used to draw characters on training images.
Font Styles	Type: TrainingFontStyle, Default: Normal The font style used to draw characters on training images.
Fixed Pitch	Type: Boolean, Default: False Indicates whether or not the font is fixed-pitch.
Serif	Type: Boolean, Default: False Indicates whether or not the font contains serifs.
Training Content
Character Set	Type: String The set of characters to be included in training. This value is reset each time the Font Names or Language Filter is changed. It initializes to the set of characters defined by the selected fonts. If a Language Filter is specified, the list of characters is further limited to those which appear on a keyboard in the selected languages.
Characters In Scope	Type: String Represents all of the characters from all of the configured fonts.
Training Content File	Type: String Specify the path of the training content file.
Training Page Generation
Page Size	Type: Logical Size, Default: 8.5in, 11in The page size for training images.
Border Size	Type: String, Default: 0.5in The border size for training images.
Resolution	Type: Int32, Default: 300 The resolution at which training images should be created, in dots per inch.
Line Spacing	Type: Double, Default: 100% The line spacing to be used in training images, expressed as a percentage of the font size.
Characters Per Line	Type: Int32, Default: 40 The number of characters to draw on each line in training images.
Image Degradation IP Profile	Type: IP Profile An optional IP Profile to be executed on each training image.
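
The constraints on the Name property (at least 4 characters, no spaces, no illegal filename characters) can be checked before running the command. The sketch below is illustrative only: the helper name_problems is hypothetical, and it assumes that "illegal filename characters" means the characters Windows rejects in file names. Grooper's own validation may differ.

```python
# Characters Windows does not allow in file names. Assumed here to be the
# "illegal filename characters" that the Name property refers to.
WINDOWS_ILLEGAL = set('<>:"/\\|?*')


def name_problems(name):
    """Return a list of reasons the name would be rejected (empty if none)."""
    problems = []
    if len(name) < 4:
        problems.append("shorter than 4 characters")
    if any(ch.isspace() for ch in name):
        problems.append("contains whitespace")
    # Control characters (code points below 32) are also invalid on Windows
    if any(ch in WINDOWS_ILLEGAL or ord(ch) < 32 for ch in name):
        problems.append("contains an illegal filename character")
    return problems
```

A name that passes, such as "micr1", would yield the output filename "micr1.traineddata".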
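
To get a feel for how the Training Page Generation defaults interact, the arithmetic can be worked through in a few lines. This is a rough sketch, not a description of how Grooper actually lays out training pages. It assumes the border applies to all four sides, that font sizes are in points (1/72 inch), and that Line Spacing gives the distance from one line to the next as a percentage of the font size. The function training_page_geometry is hypothetical.

```python
from fractions import Fraction


def training_page_geometry(page_in=(8.5, 11.0), border_in=0.5, dpi=300,
                           font_pt=12, line_spacing_pct=100,
                           chars_per_line=40):
    """Estimate pixel dimensions and text capacity of one training page."""
    width_px = round(page_in[0] * dpi)
    height_px = round(page_in[1] * dpi)
    border_px = round(border_in * dpi)

    # Assumption: the border is removed from every edge of the page
    usable_w = width_px - 2 * border_px
    usable_h = height_px - 2 * border_px

    # 1 point = 1/72 inch; Fraction avoids floating-point rounding surprises
    font_px = Fraction(font_pt) * dpi / 72
    line_pitch_px = font_px * Fraction(line_spacing_pct) / 100
    lines_per_page = usable_h // line_pitch_px

    return {
        "page_px": (width_px, height_px),
        "usable_px": (usable_w, usable_h),
        "lines_per_page": lines_per_page,
        "chars_per_page": lines_per_page * chars_per_line,
    }
```

Under these assumptions, the defaults with a 12 point font give a 2550 x 3300 pixel page with a 2250 x 3000 pixel usable area, 60 lines per page, and 2400 characters per page.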
