Trains Tesseract OCR to recognize special fonts.
This command generates a Tesseract "traineddata" file from one or more fonts, enabling recognition of the character shapes defined by each font. The output is a file named "CODE.traineddata", where CODE is the string value specified in the 'Language Code' property. To use the new training data, copy the "CODE.traineddata" file into the "{Grooper Install Path}\Tesseract\tessdata" directory on each machine where OCR will be performed.
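The copy step described above can be scripted. The following is a minimal sketch, not part of Grooper: the helper name `install_traineddata` is hypothetical, and the caller is assumed to know the Grooper install path on the target machine.

```python
import shutil
from pathlib import Path


def install_traineddata(build_file, grooper_install_path):
    """Copy a generated .traineddata file into Grooper's tessdata folder.

    Hypothetical helper: both paths are supplied by the caller.
    """
    source = Path(build_file)
    if source.suffix != ".traineddata":
        raise ValueError(f"expected a .traineddata file, got {source.name}")
    if not source.is_file():
        raise FileNotFoundError(source)

    # Target layout taken from the text:
    # {Grooper Install Path}\Tesseract\tessdata
    tessdata = Path(grooper_install_path) / "Tesseract" / "tessdata"
    if not tessdata.is_dir():
        # Refuse to create the folder; a missing folder usually means
        # the install path is wrong.
        raise NotADirectoryError(tessdata)

    # copy2 copies the contents and preserves file timestamps.
    return Path(shutil.copy2(source, tessdata / source.name))
```

A script like this would have to be run on every machine where OCR will be performed, since each machine reads its own tessdata folder.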
Inherits from: Object Command
The following 22 properties are defined.
Property Name | Description |
---|---|
General | |
Name | Type: String
The name used to identify the output training set. Must be at least 4 characters long and must not contain spaces or illegal filename characters. After the training file has been generated and installed, the name specified here may be selected in the "Special Fonts" property of the Tesseract OCR engine. It also forms the first segment of the output filename, which takes the form "{Name}.traineddata". |
Tesseract Path | Type: String, Default: C:\Program Files (x86)\Tesseract-OCR
The path to the Tesseract install folder. The Tesseract installer can be downloaded from here. After running the installer, ensure that this property is configured to point at the root of the Tesseract install directory. |
Build Path | Type: String
The base output directory for training data. |
Install Local | Type: Boolean, Default: False
If set to true, the generated training file will be installed on the local machine. Enabling this option places a copy of the CODE.traineddata file in the "{Grooper Install Path}\Tesseract\tessdata" folder, making it immediately available for testing on the local machine. |
Language Settings | |
Base Language | Type: Culture Data
The language on which to base the new language. |
Copy Config | Type: Boolean, Default: False
If set to true, the generated config file will be copied to the local machine. Enabling this option places a copy of the CODE.config file in the "{Grooper Install Path}\Tesseract\tessdata" folder. |
Copy Disambiguation Data | Type: Boolean, Default: False
If set to true, the generated disambiguation data file will be copied to the local machine. Enabling this option places a copy of the CODE.unicharambigs file in the "{Grooper Install Path}\Tesseract\tessdata" folder. |
Copy Language Data | Type: Boolean, Default: False
If set to true, the generated language data files will be copied to the local machine. Enabling this option places a copy of the language files (bigram-dawg, number-dawg, punc-dawg, word-dawg, freq-dawg) in the "{Grooper Install Path}\Tesseract\tessdata" folder. |
Font Settings | |
Font Names | Type: List of String
The list of fonts to be included in the training. |
Font Sizes | Type: String, Default: 8, 10, 12
The font sizes used to draw characters on training images. |
Font Styles | Type: TrainingFontStyle, Default: Normal
The font style used to draw characters on training images. |
Fixed Pitch | Type: Boolean, Default: False
Indicates whether or not the font is fixed-pitch. |
Serif | Type: Boolean, Default: False
Indicates whether or not the font contains serifs. |
Training Content | |
Character Set | Type: String
The set of characters to include in training. This value is reset each time the Font Names or Language Filter is changed. It initializes to the set of characters defined by the selected fonts. If a Language Filter is specified, the list of characters is further limited to those that appear on a keyboard in the selected languages. |
Characters In Scope | Type: String
Represents all of the characters from all of the configured fonts. |
Training Content File | Type: String
Specify the path of the training content file. |
Training Page Generation | |
Page Size | Type: Logical Size, Default: 8.5in, 11in
The page size for training images. |
Border Size | Type: String, Default: 0.5in
The border size for training images. |
Resolution | Type: Int32, Default: 300
The resolution at which training images should be created, in dots per inch. |
Line Spacing | Type: Double, Default: 100%
The line spacing to be used in training images, expressed as a percentage of the font size. |
Characters Per Line | Type: Int32, Default: 40
The number of characters to draw on each line in training images. |
Image Degradation IP Profile | Type: IP Profile
An optional IP Profile to be executed on each training image. |
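To make the Training Page Generation defaults concrete, the sketch below converts them to pixels. It is illustrative arithmetic only, not Grooper's implementation: the function name is hypothetical, and it assumes that font sizes are in points (72 per inch) and that the border applies to all four sides, neither of which the property descriptions state.

```python
from fractions import Fraction


def training_page_geometry(page_in=(8.5, 11.0), border_in=0.5, dpi=300,
                           font_pt=12, line_spacing_pct=100):
    """Estimate the pixel layout of one training page (hypothetical helper)."""
    width_px = round(page_in[0] * dpi)
    height_px = round(page_in[1] * dpi)
    border_px = round(border_in * dpi)

    # Assumption: the border is applied on all four sides.
    text_width_px = width_px - 2 * border_px
    text_height_px = height_px - 2 * border_px

    # Assumption: font sizes are in points, at 72 points per inch.
    # Fractions keep the division exact (for example 8 pt -> 100/3 px).
    font_px = Fraction(font_pt) * dpi / 72
    # Line Spacing is expressed as a percentage of the font size.
    line_height_px = font_px * Fraction(line_spacing_pct) / 100

    return {
        "page_px": (width_px, height_px),
        "text_area_px": (text_width_px, text_height_px),
        "line_height_px": float(line_height_px),
        "lines_per_page": int(text_height_px // line_height_px),
    }


if __name__ == "__main__":
    # Defaults from the table with a 12 pt font:
    # page 2550 x 3300 px, text area 2250 x 3000 px,
    # line height 50.0 px, 60 lines per page
    print(training_page_geometry())
```

Under these assumptions, smaller font sizes or a Line Spacing value below 100% fit more lines on each training image, while Characters Per Line controls how many characters are drawn on each of those lines.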
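The rules stated for the Name property (at least 4 characters, no spaces, no illegal filename characters) can be checked before running the command. A minimal sketch follows; the function name is hypothetical, and the set of illegal characters is assumed to be the Windows reserved set, since the property description does not enumerate them.

```python
# Characters Windows does not allow in file names. Assumed to be the
# "illegal filename characters" that the Name property refers to.
WINDOWS_ILLEGAL_CHARS = set('<>:"/\\|?*')


def is_valid_training_name(name):
    """Return True if name satisfies the rules stated for the Name property."""
    if len(name) < 4:
        return False
    if any(ch.isspace() for ch in name):
        return False
    # Control characters (code points below 32) are also invalid on Windows.
    if any(ch in WINDOWS_ILLEGAL_CHARS or ord(ch) < 32 for ch in name):
        return False
    return True
```

For example, a four-letter name with no spaces passes, while a name containing a space or a colon is rejected.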