Grooper.Core.DataPattern

Defines an extractor which returns all instances of data matching a regular expression. Includes settings which control how the input will be preprocessed, and how extracted values will be validated and filtered into a final result set.


Inherits from: Grooper.EmbeddedObject

Constructors

Signature Description
New (Owner As ConnectedObject)
Parameters
Owner
          Type: ConnectedObject
          

Fields

Field Name Field Type Description
Database As Grooper.GrooperDb Grooper.GrooperDb
ExpressionLexiconId As System.Guid System.Guid
ReferencedLexiconIds As System.Collections.Generic.List(Of System.Guid) System.Collections.Generic.List(Of T)

Properties

Property Name Property Type Description
AlignOutput System.Boolean If an Output Format is specified and this property is True, the output value will be re-aligned with the original OCR results. In cases where the Output Format is being used for correction, this ensures that literal characters from the output format are lined up with their closest OCR counterpart.
CaseSensitive System.Boolean Determines whether the regular expression will be evaluated with case-sensitivity on or off.
ExpressionLexicon Grooper.Core.Lexicon A lookup Lexicon containing key-value pairs to be available as @Variables in the regular expression. Entries in the lexicon should take the form Key=ReplacementValue. For example, consider a lexicon with the following entries:

Directions=N|S|W|E
Suffix=road|street|boulevard|circle

This lexicon defines variables named @Directions and @Suffix which can be used in the regular expression:

(@Directions) [a-z]+ (@Suffix)

At run time, this expands to the following regular expression:

(N|S|W|E) [a-z]+ (road|street|boulevard|circle)
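The expansion above can be approximated outside Grooper with a short Python sketch. This is only an illustration of the Key=ReplacementValue substitution described here, not Grooper's implementation; the function name expand_variables is hypothetical.

```python
import re

def expand_variables(pattern, lexicon_entries):
    """Replace each @Key token with the value from a Key=ReplacementValue entry."""
    variables = {}
    for entry in lexicon_entries:
        key, _, value = entry.partition("=")
        variables[key.strip()] = value.strip()

    def substitute(match):
        # Unknown variable names are left untouched (an assumption of this sketch).
        return variables.get(match.group(1), match.group(0))

    return re.sub(r"@(\w+)", substitute, pattern)

entries = ["Directions=N|S|W|E", "Suffix=road|street|boulevard|circle"]
expanded = expand_variables("(@Directions) [a-z]+ (@Suffix)", entries)
print(expanded)
```
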

Filter Grooper.Core.InstanceFilter Specifies optional criteria for filtering output instances.
FuzzyMatchWeightings Grooper.Core.FuzzyMatchWeightings When Extraction Mode is set to FuzzyRegEx, specifies the fuzzy match weightings to be used.
GroupOptions System.Collections.Generic.List(Of T) Defines lookup, translation, and fuzzy matching options for named groups within the regular expression and individual components of nGrams. To specify lookup options for the captured value as a whole, use the 'Lookup Options' property.
HasReferenceProperties System.Boolean Returns true if the object has properties which reference Grooper Node objects.
IncludeCharacterConfidence System.Boolean If enabled, character-level OCR confidence will be factored into the final output confidence.
IsEmpty System.Boolean Returns true if all properties with a ViewableAttribute are set to their default value.
IsWriteable System.Boolean Returns true if the object is writable, or false if it is not.
ListContent System.String Specifies the local vocabulary entries stored in Lookup Options.
LookAheadPattern System.String A regular expression defining a pattern which must occur immediately after the main pattern. This value will not be returned to the output.
LookBehindPattern System.String A regular expression defining a pattern which must occur immediately before the main pattern. This value will not be returned to the output.
MainGroupOptions Grooper.Core.DataPattern.LookupOptions Defines lookup, translation, and fuzzy matching options for the entire captured value. To specify lookup options for a named group within the regular expression, or for the individual components of an nGram, use the Group Lookup Options property.
MatchMode Grooper.Core.FuzzyRegEx.FuzzyMatchMode Defines how multiple overlapping matches are resolved when FuzzyRegEx is in use. Can be one of the following values:
  • LeastCost: Matches the least cost (highest confidence) result. If two results have equal confidence, the longer result is matched. If both results also have the same length, the result appearing first in the text flow is matched. For example, suppose the pattern is [0-9]{5,6} and the value is 10000O. You should get a hit on 10000 at 100% and a hit on 100000 at a lower percentage (depending on your Fuzzy Match Weightings). In LeastCost mode, only 10000 is captured, because it has a higher confidence than the result that converts the O to a 0.
  • BestValue: Matches the longest result which is above the minimum confidence. If two overlapping results have the same length, the result with the highest confidence is matched. If both results have equal confidence, the result appearing first in the text flow is matched. For example, suppose the pattern is [0-9]{5,6} and the value is 10000O. You should get a hit on 10000 at 100% and a hit on 100000 at a lower percentage (depending on your Fuzzy Match Weightings). In BestValue mode, if 100000 is above the minimum confidence, it is returned because it is the longer of the two values.
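The two tie-breaking orders can be sketched as sort keys in Python. This is a minimal illustration of the rules stated above, not Grooper's code: the Candidate type, the function names, the 0.83 confidence figure, and the choice to treat "above the minimum" as greater-than-or-equal are all assumptions.

```python
from dataclasses import dataclass

@dataclass
class Candidate:
    text: str
    confidence: float  # 0.0 to 1.0
    start: int         # position in the text flow

def least_cost(candidates):
    # Highest confidence first, then longest, then earliest in the text.
    return min(candidates, key=lambda c: (-c.confidence, -len(c.text), c.start))

def best_value(candidates, minimum_confidence):
    eligible = [c for c in candidates if c.confidence >= minimum_confidence]
    if not eligible:
        return None
    # Longest first, then highest confidence, then earliest in the text.
    return min(eligible, key=lambda c: (-len(c.text), -c.confidence, c.start))

# Overlapping hits for [0-9]{5,6} against "10000O"; 0.83 is an illustrative score.
candidates = [Candidate("10000", 1.0, 0), Candidate("100000", 0.83, 0)]
print(least_cost(candidates).text)        # 10000
print(best_value(candidates, 0.80).text)  # 100000
```

With a minimum confidence of 0.90, the six-digit candidate in this sketch is no longer eligible and best_value falls back to 10000.
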
MinimumSimilarity System.Double When Extraction Mode is set to FuzzyRegEx, specifies the minimum similarity for fuzzy matches.
Mode Grooper.Core.DataPattern.ExtractionMode Specifies the extraction mode. Can be one of the following values:
  • RegEx: Normal regular expression mode. Finds all instances which are a 100% match for the regular expression. RegEx mode supports the full syntax and feature set of Microsoft .Net Framework Regular Expressions.
  • FuzzyRegEx: Fuzzy regular expression mode. Finds all instances which match the regular expression to a specific percentage of similarity. FuzzyRegEx mode supports most of the syntax and features of RegEx mode, with a handful of exceptions noted below.

    Processing time for FuzzyRegEx is proportional to the perplexity of the regular expression. Perplexity is the number of possible permutations of the pattern, which in turn defines the number of passes which must be made through the content.

    For example, A{1,2}B{1,2}C{1,2} has a perplexity of 2 * 2 * 2 = 8. (ABC, ABCC, ABBC, ABBCC, AABC, AABCC, AABBC, AABBCC). If we allow each element to range from {1,4}, the perplexity is 4 * 4 * 4 = 64. The expression [0-9]{4} (miles|kilometers) has a perplexity of 2. If we change it to [0-9]{1,5} (miles|kilometers), the perplexity becomes 10.

    Note that FuzzyRegEx supports an option which is unavailable in RegEx. (?r) will turn on required mode, and (?-r) will turn it off. At the start of a FuzzyRegEx, required mode always defaults to off. Once turned on, required mode will stay on until it is turned off. This mechanism can be used, for example, to require the start of a new line. The syntax to accomplish this would be (?r)\n(?-r).

    The following regular expression features are NOT supported in FuzzyRegEx mode:

  • FuzzyList: Fuzzy list mode. Finds all instances which match an entry in the vocabulary. In this mode, no Value Pattern is required, since the values to be matched are specified as lexicon entries. To configure fuzzy list mode, edit the Lookup Options as follows: (1) define one or more vocabulary entries; and (2) specify the required match percentage. The extractor will return all instances which match an entry in the lexicon to the specified degree of similarity.
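The perplexity arithmetic described for FuzzyRegEx mode can be checked with a few lines of Python. This sketch only reproduces the multiplication from the examples (one factor per quantifier range or alternation); it does not parse regular expressions, and the helper names are hypothetical.

```python
from itertools import product
from math import prod

def quantifier_choices(minimum, maximum):
    """Number of repeat counts allowed by a {minimum,maximum} quantifier."""
    return maximum - minimum + 1

def perplexity(choices_per_element):
    """Product of the number of choices contributed by each pattern element."""
    return prod(choices_per_element)

# A{1,2}B{1,2}C{1,2}: three elements with two choices each.
print(perplexity([quantifier_choices(1, 2)] * 3))   # 8
# [0-9]{1,5} (miles|kilometers): five lengths times two alternatives.
print(perplexity([quantifier_choices(1, 5), 2]))    # 10

# Enumerate the eight permutations of A{1,2}B{1,2}C{1,2}.
variants = [
    "".join(letter * count for letter, count in zip("ABC", counts))
    for counts in product(range(1, 3), repeat=3)
]
print(variants)
```
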
nGramFormatString System.String When nGram extraction is active, defines an optional format string which transforms the final output value. A .Net composite format string where {0} indicates the entire match, {1} indicates nGram element 1, {2} indicates nGram element 2, and so on. For example, an nGram match on "quick brown fox" with the format string "phrase_{1}_{2}_{3}" would produce the output value "phrase_quick_brown_fox".
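Python's str.format uses the same positional {0}, {1}, {2} placeholders for simple cases, so the nGramFormatString example can be mimicked as follows. This is an analogy only; .Net composite formatting has features that this sketch does not cover, and format_ngram is a hypothetical helper.

```python
def format_ngram(format_string, elements, separator=" "):
    """Apply a positional format string: {0} is the whole match, {1}.. are the elements."""
    whole_match = separator.join(elements)
    return format_string.format(whole_match, *elements)

print(format_ngram("phrase_{1}_{2}_{3}", ["quick", "brown", "fox"]))
# phrase_quick_brown_fox
```
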
nGramSize System.Int32 When set to a value greater than 1, enables nGram capture mode. The output will include all possible combinations of N contiguous elements. "Contiguous" is defined as any two matches where the nGram Separator expression matches the text between them. An nGram is a sequence of words: 1 word is a unigram, 2 words are a bigram, 3 words are a trigram, and so on. Example:
  • Input: The quick brown fox jumped over the log.
  • Pattern: \w+
  • Output - nGram Size 1: The, quick, brown, fox, jumped, over, the, log
  • Output - nGram Size 3: The quick brown, quick brown fox, brown fox jumped, fox jumped over, jumped over the, over the log.
When nGram capture mode is enabled, settings defined in the Lookup Options property will apply to the entire captured value. Lexicon validation can be applied to individual components of the nGram in Group Lookup Options using the following special group names:
  • nGrams - Applies to all components of the nGram.
  • nGram1 - Applies to component 1 of the nGram.
  • nGram2 - Applies to component 2 of the nGram.
  • nGram3 - Applies to component 3 of the nGram.
  • nGram4 - Applies to component 4 of the nGram.
  • nGram5 - Applies to component 5 of the nGram.
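The nGram capture example given for nGramSize can be reproduced with a sliding window over regular expression matches. The sketch below is an approximation under stated assumptions: the function name ngrams is hypothetical, and the default separator " ?" (zero characters or one space) is modeled on the documented default of the nGram Separator expression.

```python
import re

def ngrams(text, pattern, size, separator=r" ?"):
    """Return every run of `size` matches whose gaps all match the separator."""
    matches = list(re.finditer(pattern, text))
    results = []
    for i in range(len(matches) - size + 1):
        window = matches[i:i + size]
        # Two matches are contiguous when the separator matches the text between them.
        contiguous = all(
            re.fullmatch(separator, text[a.end():b.start()]) is not None
            for a, b in zip(window, window[1:])
        )
        if contiguous:
            results.append(text[window[0].start():window[-1].end()])
    return results

sentence = "The quick brown fox jumped over the log."
print(ngrams(sentence, r"\w+", 1))
print(ngrams(sentence, r"\w+", 3))
```
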
OutputFormat System.String An optional format string which indicates the output format for the data.

The output format can contain (a) literal characters and (b) placeholders for groups captured in the regular expression. Placeholders take the general form {GroupName}, and can be expanded to include a typecast and format {GroupName:TypeCast:FormatSpecifier}. Examples:

  • {LastName}, {FirstName} - Outputs 'Smith, John' in a case where the value of LastName is 'Smith' and the value of FirstName is 'John'.
  • {ItemNo:Integer:0000} - Outputs '0192' in a case where the value of 'ItemNo' is '192'.
TypeCast

Valid typecasts include DateTime, Decimal, Double, Integer, and String. If an extracted value cannot be converted to the specified type, the value will be excluded from the output. Two special typecasts are provided to assist with translation of values captured with the @Number and @Alpha variables. A typecast of 'Number' will convert all alpha characters which resemble numbers to their numeric equivalents. A typecast of 'Alpha' will perform the exact inverse of this operation, converting all numeric characters which resemble alpha characters to their alpha equivalents.
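The 'Number' and 'Alpha' typecasts can be pictured as a pair of inverse character translations. The look-alike table below is hypothetical (the character pairs Grooper actually uses are not listed here), so treat this only as a sketch of the idea.

```python
# Hypothetical look-alike pairs; the real table is not documented in this section.
LOOKALIKES = {"O": "0", "I": "1", "S": "5", "B": "8"}

TO_NUMBER = str.maketrans(LOOKALIKES)
TO_ALPHA = str.maketrans({digit: letter for letter, digit in LOOKALIKES.items()})

def cast_number(value):
    """Replace letters that resemble digits with those digits."""
    return value.translate(TO_NUMBER)

def cast_alpha(value):
    """Inverse operation: replace digits that resemble letters with those letters."""
    return value.translate(TO_ALPHA)

print(cast_number("10000O"))  # 100000
print(cast_alpha("B0B"))      # BOB
```
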

GroupName

The GroupName must reference a named group defined within the regular expression, be limited to the [0-9A-Z_] character set, and its length cannot exceed 64 characters.

FormatSpecifier

A valid .Net format specifier for the type indicated in the typecast. See Microsoft's .Net documentation on date/time and numeric format strings for complete details.

Commonly-Used Format Strings

Type      Specifier  Description                  Example
DateTime  d          Short date format            6/15/2009
DateTime  D          Long date format             Monday, June 15, 2009
DateTime  f          Full date/time (short time)  Monday, June 15, 2009 1:45 PM
DateTime  F          Full date/time (long time)   Monday, June 15, 2009 1:45:30 PM
Numeric   c0         Currency (Precision 0)       $123
Numeric   c2         Currency (Precision 2)       $123.45
Numeric   n0         Number (Precision 0)         123
Numeric   n2         Number (Precision 2)         123.45
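The {GroupName:TypeCast:FormatSpecifier} placeholder syntax described for OutputFormat can be sketched in Python. This is a deliberately narrow illustration: it handles plain placeholders and the Integer typecast with an all-zeros specifier such as 0000, and it does not implement .Net format strings in general. The names PLACEHOLDER and render are hypothetical.

```python
import re

# {GroupName}, {GroupName:TypeCast}, or {GroupName:TypeCast:FormatSpecifier}
PLACEHOLDER = re.compile(r"\{(\w+)(?::(\w+))?(?::([^}]*))?\}")

def render(output_format, groups):
    """Expand placeholders from captured groups; return None if a typecast fails."""
    def expand(match):
        name, typecast, spec = match.group(1), match.group(2), match.group(3)
        value = groups[name]
        if typecast == "Integer":
            number = int(value)  # raises ValueError when the value is not an integer
            # Only an all-zeros specifier (e.g. 0000) is handled: pad with leading zeros.
            if spec and set(spec) == {"0"}:
                return str(number).zfill(len(spec))
            return str(number)
        return value

    try:
        return PLACEHOLDER.sub(expand, output_format)
    except ValueError:
        return None  # unconvertible values are excluded from the output

print(render("{LastName}, {FirstName}", {"LastName": "Smith", "FirstName": "John"}))
print(render("{ItemNo:Integer:0000}", {"ItemNo": "192"}))
```
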
Owner Grooper.ConnectedObject Returns the node that owns the connected object, if any.
OwnerNode Grooper.GrooperNode Returns the node that owns the connected object, if any.
PreprocessingOptions Grooper.Core.TextPreprocessor Specifies options for processing text prior to running the regular expression.
ReferencedLexicons System.Collections.Generic.List(Of T) Defines one or more lexicons whose contents may be referenced as a list using @Variables. Each lexicon referenced here will be available by name as an @Variable in the regular expression. The @Variable for a given lexicon will expand to a value which includes all entries separated by "|" (the regex "OR" operator). For example, consider a lexicon named "Weekdays" with the following entries:

Sunday
Monday
Tuesday
Wednesday
Thursday
Friday
Saturday

Referencing this lexicon will define a variable named @Weekdays which can be used in the regular expression:

(@Weekdays), June \d+

At run time, this expands to the following regular expression:

(Sunday|Monday|Tuesday|Wednesday|Thursday|Friday|Saturday), June \d+
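A referenced lexicon differs from an expression lexicon in that the variable value is built by joining the entries with "|". A minimal Python sketch of that join, assuming lexicons are available as a name-to-entries mapping (the function name and data shape are hypothetical):

```python
import re

def expand_lexicon_references(pattern, lexicons):
    """Replace @Name with the named lexicon's entries joined by the regex OR operator."""
    def substitute(match):
        entries = lexicons.get(match.group(1))
        # Names that do not match a referenced lexicon are left as-is in this sketch.
        return "|".join(entries) if entries is not None else match.group(0)

    return re.sub(r"@(\w+)", substitute, pattern)

lexicons = {
    "Weekdays": ["Sunday", "Monday", "Tuesday", "Wednesday",
                 "Thursday", "Friday", "Saturday"],
}
weekday_pattern = expand_lexicon_references(r"(@Weekdays), June \d+", lexicons)
print(weekday_pattern)
```
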

RegionalSettings Grooper.Core.DataPattern.RegionSettings Defines multilanguage options used for data extraction.
RestrictZone System.Boolean If enabled, restricts the highlight zone for the extracted data to the zone covered by the data elements used in the output format. This is useful in situations where surrounding data is used to identify the target data, but is not actually part of the field value.
ResultOptions Grooper.Core.ResultOptions Specifies optional processing for each output instance.
Root Grooper.GrooperRoot Returns the root node.
SeparatorExpression System.String When nGram extraction is active, this regular expression defines allowable separators. If the pattern is blank, the default behavior is to allow nGrams which are separated by 0 characters or 1 space character.
ValuePattern System.String A regular expression pattern which identifies data to be extracted. Regular expressions generally take the form of a Positive Character Group in square brackets, followed by the Quantifier in curly braces. For example:
  • [0-9]{5} will find all numeric values with a length of 5 characters.
  • [0-9]{5,8} will find all numeric values with a length of 5 to 8 characters.
  • [A-Z]{3,12} will find all alpha values with a length of 3 to 12 characters.
  • [0-9A-Z]{6} will find all alphanumeric values with a length of 6.
Grooper's regular expression implementation is based on Microsoft .Net Framework regular expressions, which are extensively discussed in Microsoft documentation. See .Net Regular Expressions or Regular Expression Language - Quick Reference for a good starting point.
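The ValuePattern examples behave the same way in Python's re module, which shares this basic syntax with .Net regular expressions. The sample text below is invented for illustration. Note that an unanchored pattern such as [0-9A-Z]{6} also matches inside longer runs; adding \b word boundaries restricts it to whole tokens.

```python
import re

text = "INVOICE 12345 REF AB12CD ZIP 73102"

print(re.findall(r"[0-9]{5}", text))          # ['12345', '73102']
print(re.findall(r"[A-Z]{3,12}", text))       # ['INVOICE', 'REF', 'ZIP']
print(re.findall(r"[0-9A-Z]{6}", text))       # ['INVOIC', 'AB12CD']
print(re.findall(r"\b[0-9A-Z]{6}\b", text))   # ['AB12CD']

# Case-insensitive evaluation, comparable to turning CaseSensitive off.
print(re.findall(r"[a-z]{3,12}", text, re.IGNORECASE))  # ['INVOICE', 'REF', 'ZIP']
```
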
ValueType Grooper.Core.StorageType Defines the type of data this extractor will capture. If a captured value cannot be converted to the base type, it will be excluded from the output, unless the Allow Invalid Results property of the Result Filter is set to True.

Methods

Method Name Description
ExecuteExpression(Source As DataInstance, Expression As String, MaxResults As Int32) As DataInstanceCollection
Parameters
Source
          Type: DataInstance
          
 
Expression
          Type: String
          
 
MaxResults
          Type: Int32
          
FindInstances(Input As DataInstance) As DataInstanceCollection
Parameters
Input
          Type: DataInstance
          
GetListPattern(Culture As CultureData) As String
Parameters
Culture
          Type: CultureData
          
GetProperties() As PropertyDescriptorCollection
GetReferences() As List(Of GrooperNode) Returns a list of GrooperNode objects referenced in the properties of this object.
IsPropertyEnabled(PropertyName As String) As Nullable(Of Boolean) Defines whether a property is currently enabled.
Parameters
PropertyName
          Type: String
          The name of the property to determine the enabled state for.
IsPropertyVisible(PropertyName As String) As Nullable(Of Boolean) Defines whether a property is currently visible.
Parameters
PropertyName
          Type: String
          The name of the property to determine the visible state for.
IsType(Type As Type) As Boolean Returns true if the object is of the type specified, or if it derives from the type specified.
Parameters
Type
          Type: Type
          The type to check.
ProcessPattern(Culture As CultureData, ValidationMode As Boolean) As String Substitutes variable values for variable names in the pattern.
Parameters
Culture
          Type: CultureData
          
 
ValidationMode
          Type: Boolean
          
ProcessPatternString(Expression As String, Culture As CultureData) As String
Parameters
Expression
          Type: String
          
 
Culture
          Type: CultureData
          
Serialize() As String Serializes the object.
SetDatabase(Database As GrooperDb) Sets the database connection of the object.
Parameters
Database
          Type: GrooperDb
          
SetOwner(Owner As ConnectedObject, SkipInitialization As Boolean) Sets the owner of the connected object to another object that implements the IConnected interface.
Parameters
Owner
          Type: ConnectedObject
          
 
SkipInitialization
          Type: Boolean
          
ToString() As String Returns a string value representation of the connected object.
Uninitialize() Destroys the regular expression.
ValidatePattern() As ValidationErrorList
ValidateProperties() As ValidationErrorList Validates the properties of the object, returning a list of validation errors.
ValidateProps() As ValidationErrorList