Data Pattern

Defines an extractor which returns all instances of data matching a regular expression.

Remarks

Includes settings which control how the input will be preprocessed, and how extracted values will be validated and filtered into a final result set.

Properties

The following 25 properties are defined.

Property Name Description
General
Value Type Type: Storage Type, Default: String

Defines the type of data this extractor will capture. Can be one of the following values:

  • Boolean - Represents a Boolean (true or false) value.
  • DateTime - Represents an instant in time, typically expressed as a date and/or time of day.
  • Decimal - Represents a decimal value.
  • Double - Represents a 64-bit floating point value.
  • GUID - Represents a globally unique identifier (GUID).
  • Int16 - Represents a 16-bit integer value.
  • Int32 - Represents a 32-bit integer value.
  • Int64 - Represents a 64-bit integer value.
  • String - String values can store any type of text information.
  • URL - A Uniform Resource Locator (URL) is a string of characters used to identify a web resource, such as a web page on an HTTP server, or a file on an FTP server.
If a captured value cannot be converted to the base type, it will be excluded from the output, unless the Allow Invalid Results property of the Result Filter is set to True.
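
The exclusion rule above can be pictured with a short sketch. This is illustrative Python, not Grooper code: `to_int32`, `convert_or_exclude`, and the `allow_invalid` flag are hypothetical names that stand in for an Int32 Value Type and the Allow Invalid Results setting.

```python
def to_int32(text):
    """Convert text to a 32-bit integer, or raise ValueError."""
    value = int(text)
    if not -2**31 <= value <= 2**31 - 1:
        raise ValueError(f"{text!r} is out of range for Int32")
    return value


def convert_or_exclude(captured, convert, allow_invalid=False):
    """Keep values that convert; keep failures only when allow_invalid is set."""
    results = []
    for text in captured:
        try:
            convert(text)
        except ValueError:
            if not allow_invalid:
                continue  # excluded from the output
        results.append(text)
    return results


captured = ["12345", "12E45", "99999999999"]
print(convert_or_exclude(captured, to_int32))
print(convert_or_exclude(captured, to_int32, allow_invalid=True))
```

With the flag off, only "12345" survives; with it on, all three captured values are kept.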

Mode Type: ExtractionMode, Default: RegEx

Specifies the extraction mode. Can be one of the following values:

  • RegEx - Normal regular expression mode. Finds all instances which are a 100% match for the regular expression. RegEx mode supports the full syntax and feature set of Microsoft .Net Framework Regular Expressions.
  • FuzzyRegEx - Fuzzy regular expression mode. Finds all instances which match the regular expression to a specific percentage of similarity.
  • FuzzyList - Fuzzy list mode. Finds all instances which match an entry in the vocabulary. In this mode, no Value Pattern is required, since the values to be matched are specified as lexicon entries.

Case Sensitive Type: Boolean, Default: False

Determines whether the regular expression will be evaluated with case-sensitivity on or off.

Preprocessing Options Type: Text Preprocessor

Specifies options for processing text prior to running the regular expression.

Regional Settings Type: Region Settings, Default: (empty)

Defines multilanguage options used for data extraction.

Expression Lexicon Type: Lexicon

A lookup Lexicon containing key-value pairs to be available as @Variables in the regular expression. Entries in the lexicon should take the form Key=ReplacementValue. For example, consider a lexicon with the following entries:

Directions=N|S|W|E
Suffix=road|street|boulevard|circle

This lexicon defines variables named @Directions and @Suffix which can be used in the regular expression:

(@Directions) [a-z]+ (@Suffix)

At run time, this expands to the following regular expression:

(N|S|W|E) [a-z]+ (road|street|boulevard|circle)
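
The expansion step can be sketched as a simple text substitution. The Python below is only an approximation of the behavior described above; `expand_variables` is a hypothetical helper, and Python's `re` module is used in place of .Net regular expressions.

```python
import re


def expand_variables(pattern, lexicon_entries):
    """Replace each @Key in the pattern with the value from a Key=Value entry."""
    variables = dict(entry.split("=", 1) for entry in lexicon_entries)
    # Unknown @names are left untouched in this sketch.
    return re.sub(r"@(\w+)", lambda m: variables.get(m.group(1), m.group(0)), pattern)


entries = ["Directions=N|S|W|E", "Suffix=road|street|boulevard|circle"]
expanded = expand_variables("(@Directions) [a-z]+ (@Suffix)", entries)
print(expanded)

# Case-insensitive matching mirrors the default of the Case Sensitive property.
match = re.search(expanded, "Turn onto N main street", re.IGNORECASE)
print(match.group(0))
```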

Referenced Lexicons Type: List of Lexicon

Defines one or more lexicons whose contents may be referenced as a list using @Variables. Each lexicon referenced here will be available by name as an @Variable in the regular expression. The @Variable for a given lexicon will expand to a value which includes all entries separated by "|" (the regex "OR" operator). For example, consider a lexicon named "Weekdays" with the following entries:

Sunday
Monday
Tuesday
Wednesday
Thursday
Friday
Saturday

Referencing this lexicon will define a variable named @Weekdays which can be used in the regular expression:

(@Weekdays), June \d+

At run time, this expands to the following regular expression:

(Sunday|Monday|Tuesday|Wednesday|Thursday|Friday|Saturday), June \d+
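
As a rough illustration, the list expansion amounts to joining the entries with "|". The Python sketch below assumes entries are inserted as-is (whether special characters are escaped is not stated here), and the `referenced` dictionary is a hypothetical stand-in for the Referenced Lexicons list.

```python
import re

weekdays = ["Sunday", "Monday", "Tuesday", "Wednesday",
            "Thursday", "Friday", "Saturday"]
referenced = {"Weekdays": weekdays}  # lexicon name -> entries

pattern = r"(@Weekdays), June \d+"
for name, entries in referenced.items():
    # Each lexicon becomes an alternation of all of its entries.
    pattern = pattern.replace("@" + name, "|".join(entries))
print(pattern)

text = "Due Monday, June 15 or Friday, June 19"
hits = [m.group(0) for m in re.finditer(pattern, text)]
print(hits)
```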

Fuzzy Matching Options
Minimum Similarity Type: Double, Default: 90%, Range: 1% - 99%

When Extraction Mode is set to FuzzyRegEx, specifies the minimum similarity for fuzzy matches.

Match Mode Type: FuzzyMatchMode, Default: LeastCost

Defines how multiple overlapping matches are resolved when FuzzyRegEx is in use. Can be one of the following values:

  • LeastCost - Matches the least-cost (highest-confidence) result. If two results have equal confidence, the longer result is matched. If both results also have the same length, the result appearing first in the text flow is matched. For example, given the pattern [0-9]{5,6} and the input 10000O (five digits followed by the letter O), there is a hit on 10000 at 100% and, depending on the Fuzzy Match Weightings, a hit on 100000 at a lower percentage. In LeastCost mode, only 10000 is captured, because it has a higher confidence than the result which converts the O to a 0.
  • BestValue - Matches the longest result which is above the minimum confidence. If two overlapping results have the same length, the result with the highest confidence is matched. If both results have equal confidence, the result appearing first in the text flow is matched. For example, given the pattern [0-9]{5,6} and the input 10000O (five digits followed by the letter O), there is a hit on 10000 at 100% and, depending on the Fuzzy Match Weightings, a hit on 100000 at a lower percentage. In BestValue mode, if the longer hit is above the minimum confidence, the returned value is 100000, because it is the longer of the two values.
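
The two tie-breaking orders can be compared side by side in a small sketch. This is illustrative Python only: the `Candidate` class, the 0.92 confidence for the longer hit, and the treatment of "above the minimum" as "at or above" are all assumptions, not Grooper internals.

```python
from dataclasses import dataclass


@dataclass
class Candidate:
    text: str          # value after fuzzy correction
    confidence: float  # 0.0 to 1.0
    start: int         # position in the text flow


def least_cost(candidates):
    """Highest confidence first, then longest, then earliest."""
    return min(candidates, key=lambda c: (-c.confidence, -len(c.text), c.start))


def best_value(candidates, minimum_similarity):
    """Longest eligible result first, then highest confidence, then earliest."""
    eligible = [c for c in candidates if c.confidence >= minimum_similarity]
    if not eligible:
        return None
    return min(eligible, key=lambda c: (-len(c.text), -c.confidence, c.start))


# Overlapping hits for the pattern [0-9]{5,6} against the input 10000O.
overlapping = [Candidate("10000", 1.00, 0), Candidate("100000", 0.92, 0)]

print(least_cost(overlapping).text)        # exact hit wins
print(best_value(overlapping, 0.90).text)  # longer hit wins when eligible
print(best_value(overlapping, 0.95).text)  # longer hit is below the minimum
```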

Fuzzy Match Weightings Type: Fuzzy Match Weightings

When Extraction Mode is set to FuzzyRegEx, specifies the fuzzy match weightings to be used.

Regular Expression
Value Pattern Type: String

A regular expression pattern which identifies data to be extracted. Simple patterns often take the form of a Positive Character Group in square brackets, followed by a Quantifier in curly braces. For example:

  • [0-9]{5} will find all numeric values with a length of 5 characters.
  • [0-9]{5,8} will find all numeric values with a length of 5 to 8 characters.
  • [A-Z]{3,12} will find all alpha values with a length of 3 to 12 characters.
  • [0-9A-Z]{6} will find all alphanumeric values with a length of 6.
Grooper's regular expression implementation is based on Microsoft .Net Framework regular expressions, which are extensively discussed in Microsoft documentation. See .Net Regular Expressions or Regular Expression Language - Quick Reference for a good starting point.
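
The example patterns can be tried out in any regular expression engine. The sketch below uses Python's `re` module as a stand-in for .Net regular expressions (the two agree on this simple syntax), and matching here is case-sensitive, unlike the default of the Case Sensitive property. The sample text is made up for illustration.

```python
import re

text = "ZIP 73102, ref AB12CD, date 20240115"

patterns = [
    r"[0-9]{5}",      # five digits, also found inside longer digit runs
    r"[0-9]{5,8}",    # five to eight digits
    r"[A-Z]{3,12}",   # three to twelve uppercase letters
    r"[0-9A-Z]{6}",   # six digits or uppercase letters
    r"\b[0-9]{5}\b",  # exactly five digits, bounded on both sides
]
found = {p: re.findall(p, text) for p in patterns}
for p, values in found.items():
    print(p, values)
```

Note that a bare quantifier does not anchor the match: [0-9]{5} also returns the first five digits of the eight-digit run, while the variant with \b word boundaries returns only the standalone five-digit value.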

Look Ahead Pattern Type: String

A regular expression defining a pattern which must occur immediately before the main pattern. This value will not be returned to the output.

Look Behind Pattern Type: String

A regular expression defining a pattern which must occur immediately after the main pattern. This value will not be returned to the output.
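
Context patterns of this kind, which must be present next to the main pattern but are not returned, behave much like zero-width assertions. The sketch below assumes that analogy and uses Python's `re` module. The assertion names in Python's syntax are that engine's own terminology and need not line up with the property names used here; the sample text is made up.

```python
import re

text = "Invoice No: 48213 USD, Ticket No: 99120 EUR"

# (?<=...) requires the given text immediately before the match;
# (?=...) requires the given text immediately after it.
# Neither piece of context is included in the returned value.
pattern = r"(?<=Invoice No: )[0-9]{5}(?= USD)"

matches = re.findall(pattern, text)
print(matches)
```

Only 48213 is returned: 99120 also has five digits, but it lacks the required surrounding context.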

Output Format Type: String

An optional format string which indicates the output format for the data.

The output format can contain (a) literal characters and (b) placeholders for groups captured in the regular expression. Placeholders take the general form {GroupName}, and can be expanded to include a typecast and format {GroupName:TypeCast:FormatSpecifier}. Examples:

  • {LastName}, {FirstName} - Outputs 'Smith, John' in a case where the value of LastName is 'Smith' and the value of FirstName is 'John'.
  • {ItemNo:Integer:0000} - Outputs '0192' in a case where the value of 'ItemNo' is '192'.
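
The two examples above can be approximated in Python for experimentation. This is a sketch under stated assumptions: Python's `str.format` and `format` replace the .Net placeholder syntax, the Python format spec "04d" plays the role of the .Net specifier 0000, and the input strings are made up.

```python
import re

# Named groups supply the values that the placeholders refer to.
name = re.search(r"(?P<FirstName>[A-Z][a-z]+) (?P<LastName>[A-Z][a-z]+)", "John Smith")
print("{LastName}, {FirstName}".format(**name.groupdict()))

# Typecast to an integer, then apply a zero-padded, four-digit format.
item = re.search(r"Item (?P<ItemNo>[0-9]+)", "Item 192")
item_no = format(int(item.group("ItemNo")), "04d")
print(item_no)
```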
TypeCast

Valid typecasts include DateTime, Decimal, Double, Integer, and String. If an extracted value cannot be converted to the specified type, the value will be excluded from the output. Two special typecasts are provided to assist with translation of values captured with the @Number and @Alpha variables. A typecast of 'Number' will convert all alpha characters which resemble numbers to their numeric equivalents. A typecast of 'Alpha' will perform the exact inverse of this operation, converting all numeric characters which resemble alpha characters to their alpha equivalents.

GroupName

The GroupName must reference a named group defined within the regular expression, be limited to the [0-9A-Z_] character set, and be no more than 64 characters long.

FormatSpecifier

A valid .Net format specifier for the type indicated in the typecast. Complete documentation of format specifiers is available in the Microsoft .Net documentation.

Commonly-Used Format Strings

Type      Specifier  Description                  Example
DateTime  d          Short date format            6/15/2009
DateTime  D          Long date format             Monday, June 15, 2009
DateTime  f          Full date/time (short time)  Monday, June 15, 2009 1:45 PM
DateTime  F          Full date/time (long time)   Monday, June 15, 2009 1:45:30 PM
Numeric   c0         Currency (Precision 0)       $123
Numeric   c2         Currency (Precision 2)       $123.45
Numeric   n0         Number (Precision 0)         123
Numeric   n2         Number (Precision 2)         123.45

Lookup and Translation
Lookup Options Type: Lookup Options

Defines lookup, translation, and fuzzy matching options for the entire captured value. To specify lookup options for a named group within the regular expression, or for the individual components of an nGram, use the Group Lookup Options property.

Group Lookup Options Type: List of Group Lookup Options

Defines lookup, translation, and fuzzy matching options for named groups within the regular expression and individual components of nGrams. To specify lookup options for the captured value as a whole, use the 'Lookup Options' property.

Local Vocabulary Entries Type: String

Specifies the local vocabulary entries stored in Lookup Options.

nGram Options
nGram Size Type: Int32, Default: 1, Range: 1 - 5

When set to a value greater than 1, enables nGram capture mode. The output will include all possible combinations of N contiguous elements. "Contiguous" is defined as any two matches where the nGram Separator expression matches the text between them. An nGram is a sequence of words: 1 word is a unigram, 2 words are a bigram, 3 words are a trigram, and so on. Example:

  • Input: The quick brown fox jumped over the log.
  • Pattern: \w+
  • Output - nGram Size 1: The, quick, brown, fox, jumped, over, the, log
  • Output - nGram Size 3: The quick brown, quick brown fox, brown fox jumped, fox jumped over, jumped over the, over the log.
When nGram capture mode is enabled, settings defined in the Lookup Options property will apply to the entire captured value. Lexicon validation can be applied to individual components of the nGram in Group Lookup Options using the following special group names:
  • nGrams - Applies to all components of the nGram.
  • nGram1 - Applies to component 1 of the nGram.
  • nGram2 - Applies to component 2 of the nGram.
  • nGram3 - Applies to component 3 of the nGram.
  • nGram4 - Applies to component 4 of the nGram.
  • nGram5 - Applies to component 5 of the nGram.
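
nGram capture can be sketched as a sliding window over the individual matches. The Python below is an illustration only: the `ngrams` function is hypothetical, Python's `re` stands in for .Net regular expressions, and the default separator " ?" encodes the "0 characters or 1 space character" rule.

```python
import re


def ngrams(text, pattern, size, separator=r" ?"):
    """Return every run of `size` matches whose gaps all match `separator`."""
    matches = list(re.finditer(pattern, text))
    results = []
    for i in range(len(matches) - size + 1):
        window = matches[i:i + size]
        # Text between each pair of neighbouring matches in the window.
        gaps = (text[a.end():b.start()] for a, b in zip(window, window[1:]))
        if all(re.fullmatch(separator, gap) for gap in gaps):
            results.append(text[window[0].start():window[-1].end()])
    return results


sentence = "The quick brown fox jumped over the log."
print(ngrams(sentence, r"\w+", 1))
print(ngrams(sentence, r"\w+", 3))

# A comma plus a space is not an allowed separator, so "red, green" is skipped.
print(ngrams("red, green blue", r"\w+", 2))
```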

nGram Separator Type: String

When nGram extraction is active, this regular expression defines allowable separators. If the pattern is blank, the default behavior is to allow nGrams which are separated by 0 characters or 1 space character.

nGram Format String Type: String

When nGram extraction is active, defines an optional format string which transforms the final output value. The value is a .Net composite format string in which {0} indicates the entire match, {1} indicates nGram element 1, {2} indicates nGram element 2, and so on. For example, an nGram match on "quick brown fox" with the format string "phrase_{1}_{2}_{3}" would produce the output value "phrase_quick_brown_fox".
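
For simple positional placeholders, Python's `str.format` behaves like a .Net composite format string, so the example can be reproduced as a quick sketch (the variable names are illustrative):

```python
parts = ["quick", "brown", "fox"]  # nGram elements 1 to 3
whole = " ".join(parts)            # the entire match, placeholder {0}

output = "phrase_{1}_{2}_{3}".format(whole, *parts)
print(output)

# {0} refers to the whole match rather than an individual element.
tagged = "[{0}] starts with {1}".format(whole, *parts)
print(tagged)
```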

Output Options
Result Filter Type: Result Filter

Specifies optional criteria for filtering output instances.

Result Options Type: Result Options

Specifies optional processing for each output instance.

Include Character Confidence Type: Boolean, Default: False

If enabled, character-level OCR confidence will be factored into the final output confidence.

Align Output Type: Boolean, Default: False

If an Output Format is specified and this property is True, then the output value will be re-aligned with the original OCR results. In cases where the Output Format is being used for correction, this ensures that literal characters from the output format are lined up with their closest OCR counterpart.

Restrict Zone Type: Boolean, Default: False

If enabled, restricts the highlight zone for the extracted data to the zone covered by the data elements used in the output format. This is useful in situations where surrounding data is used to identify the target data, but is not actually part of the field value.

See Also

Fuzzy Match Weightings, Group Lookup Options, Lexicon, Lookup Options, Region Settings, Result Filter, Result Options, Storage Type, Text Preprocessor

Used By

Data Format, Data Type, Embedded Extractor, Registration Zone, Text Pattern