Defines an extractor which returns all instances of data matching a regular expression.
Includes settings which control how the input will be preprocessed, and how extracted values will be validated and filtered into a final result set.
The following 25 properties are defined.
Property Name | Description | ||||||||||||||||||||||||||||||||||||
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
General | |||||||||||||||||||||||||||||||||||||
Value Type | Type: Storage Type, Default: String
Defines the type of data this extractor will capture. Can be one of the following values:
|
||||||||||||||||||||||||||||||||||||
Mode | Type: ExtractionMode, Default: RegEx
Specifies the extraction mode. Can be one of the following values:
|
||||||||||||||||||||||||||||||||||||
Case Sensitive | Type: Boolean, Default: False
Determines whether the regular expression will be evaluated with case-sensitivity on or off. |
||||||||||||||||||||||||||||||||||||
Preprocessing Options | Type: Text Preprocessor
Specifies options for processing text prior to running the regular expression. |
||||||||||||||||||||||||||||||||||||
Regional Settings | Type: Region Settings, Default: (empty)
Defines multilanguage options used for data extraction. |
||||||||||||||||||||||||||||||||||||
Expression Lexicon | Type: Lexicon
A lookup Lexicon containing key-value pairs to be made available as @Variables in the regular expression. Entries in the lexicon should take the form Key=ReplacementValue. For example, consider a lexicon with the following entries:
Directions=N|S|W|E
Suffix=road|street|boulevard|circle
This lexicon defines variables named @Directions and @Suffix, which can be used in the regular expression:
(@Directions) [a-z]+ (@Suffix)
At run time, this expands to the following regular expression:
(N|S|W|E) [a-z]+ (road|street|boulevard|circle) |
||||||||||||||||||||||||||||||||||||
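The @Variable expansion described for the Expression Lexicon can be sketched in Python as follows. This is only an illustration of the substitution idea, not the product's implementation: the function name `expand_variables` is hypothetical, and the `Suffix` entry is inferred from the expanded expression shown in the description.

```python
import re

def expand_variables(pattern: str, lexicon_entries: list[str]) -> str:
    """Replace each @Key in the pattern with its lexicon value."""
    variables = {}
    for entry in lexicon_entries:
        # Entries take the form Key=ReplacementValue; split on the first "=" only.
        key, _, value = entry.partition("=")
        variables[key.strip()] = value.strip()

    def substitute(match: re.Match) -> str:
        name = match.group(1)
        # Assumption: unknown @names are left untouched rather than removed.
        return variables.get(name, match.group(0))

    return re.sub(r"@(\w+)", substitute, pattern)

entries = ["Directions=N|S|W|E", "Suffix=road|street|boulevard|circle"]
expanded = expand_variables("(@Directions) [a-z]+ (@Suffix)", entries)
print(expanded)  # (N|S|W|E) [a-z]+ (road|street|boulevard|circle)
```

The expanded string is an ordinary regular expression and can be passed to any regex engine.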
Referenced Lexicons | Type: List of Lexicon
Defines one or more lexicons whose contents may be referenced as a list using @Variables. Each lexicon referenced here will be available by name as an @Variable in the regular expression. The @Variable for a given lexicon will expand to a value which includes all entries separated by "|" (the regex "OR" operator). For example, consider a lexicon named "Weekdays" with the following entries: Sunday, Monday, Tuesday, Wednesday, Thursday, Friday, Saturday.
Referencing this lexicon will define a variable named @Weekdays which can be used in the regular expression:
(@Weekdays), June \d+
At run time, this expands to the following regular expression:
(Sunday|Monday|Tuesday|Wednesday|Thursday|Friday|Saturday), June \d+ |
||||||||||||||||||||||||||||||||||||
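As a rough Python illustration of how a referenced lexicon becomes an alternation, the sketch below joins the entries with "|" and substitutes the result for the @Variable. The helper name `lexicon_to_alternation` and the sample text are assumptions made for this example only.

```python
import re

def lexicon_to_alternation(entries: list[str]) -> str:
    """Join lexicon entries with "|" (the regex OR operator)."""
    return "|".join(entries)

weekdays = ["Sunday", "Monday", "Tuesday", "Wednesday",
            "Thursday", "Friday", "Saturday"]

# Substitute the lexicon name, as an @Variable, with its joined entries.
pattern = r"(@Weekdays), June \d+".replace(
    "@Weekdays", lexicon_to_alternation(weekdays)
)
match = re.search(pattern, "Payment is due Friday, June 14 at noon.")

print(pattern)
print(match.group(0))  # Friday, June 14
```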
Fuzzy Matching Options | |||||||||||||||||||||||||||||||||||||
Minimum Similarity | Type: Double, Default: 90%, Range: 1% - 99%
When Extraction Mode is set to FuzzyRegEx, specifies the minimum similarity for fuzzy matches. |
||||||||||||||||||||||||||||||||||||
Match Mode | Type: FuzzyMatchMode, Default: LeastCost
Defines how multiple overlapping matches are resolved when FuzzyRegEx is in use. Can be one of the following values:
|
||||||||||||||||||||||||||||||||||||
Fuzzy Match Weightings | Type: Fuzzy Match Weightings
When Extraction Mode is set to FuzzyRegEx, specifies the fuzzy match weightings to be used. |
||||||||||||||||||||||||||||||||||||
Regular Expression | |||||||||||||||||||||||||||||||||||||
Value Pattern | Type: String
A regular expression pattern which identifies data to be extracted. Regular expressions generally take the form of a Positive Character Group in square brackets, followed by the Quantifier in curly braces. For example:
|
||||||||||||||||||||||||||||||||||||
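A minimal Python illustration of the "character group followed by a quantifier" form mentioned for the Value Pattern. The pattern and sample text are invented for this example; Python's `re` module stands in for the product's regex engine, which may differ in details.

```python
import re

# [0-9] is a positive character group; {5} is a quantifier meaning "exactly five".
value_pattern = r"[0-9]{5}"
text = "Ship to ZIP 73102 or ZIP 73008."

# An extractor of this kind returns every instance that matches the pattern.
values = re.findall(value_pattern, text)
print(values)  # ['73102', '73008']
```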
Look Ahead Pattern | Type: String
A regular expression defining a pattern which must occur immediately before the main pattern. This value will not be returned to the output. |
||||||||||||||||||||||||||||||||||||
Look Behind Pattern | Type: String
A regular expression defining a pattern which must occur immediately after the main pattern. This value will not be returned to the output. |
||||||||||||||||||||||||||||||||||||
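The idea of requiring surrounding context without returning it can be sketched in Python. How the product combines the three patterns internally is not stated here, so this sketch simply wraps the "must occur before" and "must occur after" patterns in non-capturing groups around a named group for the value. The function name and sample text are hypothetical.

```python
import re

def extract_with_context(before: str, value: str, after: str, text: str) -> list[str]:
    """Return only the value portion of each match; the context is required but dropped."""
    combined = f"(?:{before})(?P<value>{value})(?:{after})"
    return [m.group("value") for m in re.finditer(combined, text)]

text = "Invoice No: 48213 Date: 2020-01-05 PO No: 99100 Date: 2020-01-06"

# Only digits preceded by "Invoice No: " and followed by " Date" are returned.
results = extract_with_context(r"Invoice No: ", r"\d+", r" Date", text)
print(results)  # ['48213']
```

The second number in the sample text is skipped because the text before it does not match the required pattern.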
Output Format | Type: String
An optional format string which indicates the output format for the data. The output format can contain (a) literal characters and (b) placeholders for groups captured in the regular expression. Placeholders take the general form {GroupName}, and can be expanded to include a typecast and format {GroupName:TypeCast:FormatSpecifier}. Examples:
Valid typecasts include DateTime, Decimal, Double, Integer, and String. If an extracted value cannot be converted to the specified type, the value will be excluded from the output. Two special typecasts are provided to assist with translation of values captured with the @Number and @Alpha variables. A typecast of 'Number' will convert all alpha characters which resemble numbers to their numeric equivalents. A typecast of 'Alpha' will perform the exact inverse of this operation, converting all numeric characters which resemble alpha characters to their alpha equivalents.
GroupName: The GroupName must reference a named group defined within the regular expression, may contain only characters from the [0-9A-Z_] set, and cannot exceed 64 characters in length.
FormatSpecifier: A valid .NET format specifier for the type indicated in the typecast. Please see the following links for complete documentation: Commonly-Used Format Strings
|
||||||||||||||||||||||||||||||||||||
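The placeholder behavior of the Output Format can be approximated in Python. This sketch handles only the {GroupName} and {GroupName:TypeCast} forms. The FormatSpecifier part is left out because .NET format strings do not map directly onto Python, and the `CASTS` table, function name, pattern, and sample text are all assumptions made for illustration.

```python
import re
from decimal import Decimal, InvalidOperation
from typing import Optional

# Hypothetical typecast table; only some of the documented typecasts are covered.
CASTS = {"Integer": int, "Decimal": Decimal, "Double": float, "String": str}

PLACEHOLDER = re.compile(r"\{(\w+)(?::(\w+))?\}")

def apply_output_format(match: re.Match, output_format: str) -> Optional[str]:
    """Fill placeholders from named groups; return None if a typecast fails."""
    parts = []
    position = 0
    for placeholder in PLACEHOLDER.finditer(output_format):
        # Copy the literal characters that precede this placeholder.
        parts.append(output_format[position:placeholder.start()])
        raw = match.group(placeholder.group(1))
        cast = CASTS[placeholder.group(2) or "String"]
        try:
            parts.append(str(cast(raw)))
        except (TypeError, ValueError, InvalidOperation):
            return None  # value cannot be converted, so it is excluded
        position = placeholder.end()
    parts.append(output_format[position:])
    return "".join(parts)

pattern = re.compile(r"(?P<AREA>\d{3})[ -](?P<NUM>\w+)")
fmt = "({AREA:Integer}) {NUM:Integer}"
outputs = [apply_output_format(m, fmt)
           for m in pattern.finditer("405 5551234 and 918-ABCDEFG")]
kept = [o for o in outputs if o is not None]
print(kept)  # ['(405) 5551234']
```

The second candidate is dropped because "ABCDEFG" cannot be converted to an integer, mirroring the rule that unconvertible values are excluded from the output.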
Lookup and Translation | |||||||||||||||||||||||||||||||||||||
Lookup Options | Type: Lookup Options
Defines lookup, translation, and fuzzy matching options for the entire captured value. To specify lookup options for a named group within the regular expression, or for the individual components of an nGram, use the Group Lookup Options property. |
||||||||||||||||||||||||||||||||||||
Group Lookup Options | Type: List of Group Lookup Options
Defines lookup, translation, and fuzzy matching options for named groups within the regular expression and individual components of nGrams. To specify lookup options for the captured value as a whole, use the 'Lookup Options' property. |
||||||||||||||||||||||||||||||||||||
Local Vocabulary Entries | Type: String
Specifies the local vocabulary entries stored in Lookup Options. |
||||||||||||||||||||||||||||||||||||
nGram Options | |||||||||||||||||||||||||||||||||||||
nGram Size | Type: Int32, Default: 1, Range: 1 - 5
When set to a value greater than 1, enables nGram capture mode. The output will include all possible combinations of N contiguous elements. "Contiguous" is defined as any two matches where the nGram Separator expression matches the text between them. An nGram is a sequence of words: 1 word is a unigram, 2 words are a bigram, 3 words are a trigram, and so on. Example:
|
||||||||||||||||||||||||||||||||||||
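A small Python sketch of nGram capture as described above: find every element match, then keep each run of N matches where the text between neighbors matches the separator. The function name is hypothetical, and the default separator ` ?` is this sketch's reading of "0 characters or 1 space character".

```python
import re

def extract_ngrams(text: str, element_pattern: str, n: int,
                   separator: str = r" ?") -> list[str]:
    """Return every run of n matches whose in-between text matches the separator."""
    matches = list(re.finditer(element_pattern, text))
    ngrams = []
    for i in range(len(matches) - n + 1):
        window = matches[i:i + n]
        # Text lying between each adjacent pair of matches in the window.
        gaps = (text[a.end():b.start()] for a, b in zip(window, window[1:]))
        if all(re.fullmatch(separator, gap) for gap in gaps):
            ngrams.append(text[window[0].start():window[-1].end()])
    return ngrams

bigrams = extract_ngrams("the quick brown fox", r"[a-z]+", 2)
print(bigrams)  # ['the quick', 'quick brown', 'brown fox']

# A comma between elements breaks contiguity under the default separator.
broken = extract_ngrams("quick, brown fox", r"[a-z]+", 2)
print(broken)  # ['brown fox']
```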
nGram Separator | Type: String
When nGram extraction is active, this regular expression defines allowable separators. If the pattern is blank, the default behavior is to allow nGrams which are separated by 0 characters or 1 space character. |
||||||||||||||||||||||||||||||||||||
nGram Format String | Type: String
When nGram extraction is active, defines an optional format string which transforms the final output value. The value is a .NET composite format string in which {0} indicates the entire match, {1} indicates nGram element 1, {2} indicates nGram element 2, and so on. For example, an nGram match on "quick brown fox" with the format string "phrase_{1}_{2}_{3}" would produce the output value "phrase_quick_brown_fox". |
||||||||||||||||||||||||||||||||||||
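The positional placeholders in the nGram Format String behave much like Python's `str.format` for simple cases, which allows a quick illustration. This is an analogy only: .NET composite formatting supports alignment and format components that Python handles differently, and splitting on a single space is an assumption for this example.

```python
ngram = "quick brown fox"
elements = ngram.split(" ")

# Index 0 is the entire match; indexes 1..N are the individual nGram elements.
output = "phrase_{1}_{2}_{3}".format(ngram, *elements)
print(output)  # phrase_quick_brown_fox

# {0} can be used to keep the whole match alongside a literal prefix.
whole = "phrase: {0}".format(ngram, *elements)
print(whole)  # phrase: quick brown fox
```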
Output Options | |||||||||||||||||||||||||||||||||||||
Result Filter | Type: Result Filter
Specifies optional criteria for filtering output instances. |
||||||||||||||||||||||||||||||||||||
Result Options | Type: Result Options
Specifies optional processing for each output instance. |
||||||||||||||||||||||||||||||||||||
Include Character Confidence | Type: Boolean, Default: False
If enabled, character-level OCR confidence will be factored into the final output confidence. |
||||||||||||||||||||||||||||||||||||
Align Output | Type: Boolean, Default: False
If an Output Format is specified and this property is True, then the output value will be re-aligned with the original OCR results. In cases where the Output Format is being used for correction, this ensures that literal characters from the output format are lined up with their closest OCR counterparts. |
||||||||||||||||||||||||||||||||||||
Restrict Zone | Type: Boolean, Default: False
If enabled, restricts the highlight zone for the extracted data to the zone covered by the data elements used in the output format. This is useful in situations where surrounding data is used to identify the target data, but is not actually part of the field value. |
Fuzzy Match Weightings, Group Lookup Options, Lexicon, Lookup Options, Region Settings, Result Filter, Result Options, Storage Type, Text Preprocessor
Data Format, Data Type, Embedded Extractor, Registration Zone, Text Pattern