A Data Type defines logic for extracting data values or data structures from the text content of a document.
Data Types are one of the primary building blocks at the foundation of Grooper's data extraction and classification capabilities. They can be used to match simple data values such as dates and phone numbers, or more complex data structures such as key-value pairs and table rows.
Each data type specifies one or more "extractors", along with settings which control how the matches produced by those extractors are collated into a final result set. Different Collation Providers allow Data Types to perform a wide variety of data extraction tasks - from matching individual words or data values to extraction of complex 2-dimensional structures such as address blocks, table rows, or entire sections of information.
Extractors may be defined through the methods outlined below. If extractors are defined using multiple methods, they will be executed in the order shown.
The set of extractors to be created for a given Data Type depends on the task at hand and the Collation Provider in use. The default collation provider, Individual, simply returns all results from all extractors. In this mode, each extractor is typically designed to match a different variation of the target data. For example, in a Data Type designed to capture dates, each extractor might match a different date/time format (e.g. '01/01/2000', 'January 1, 2000', '01-JAN-2000').
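The Individual behavior can be pictured with a short Python sketch. It is not Grooper's implementation or API: the regular expressions and the name `collate_individual` are illustrative assumptions. Each pattern stands in for one extractor, and the collation step returns every match from every pattern, ordered by position.

```python
import re

# Hypothetical stand-ins for three extractors, one per date format.
DATE_EXTRACTORS = [
    re.compile(r"\b\d{2}/\d{2}/\d{4}\b"),           # 01/01/2000
    re.compile(r"\b[A-Z][a-z]+ \d{1,2}, \d{4}\b"),  # January 1, 2000
    re.compile(r"\b\d{2}-[A-Z]{3}-\d{4}\b"),        # 01-JAN-2000
]

def collate_individual(text, extractors):
    """Return (start, value) for every match of every extractor, sorted by position."""
    results = []
    for extractor in extractors:
        for match in extractor.finditer(text):
            results.append((match.start(), match.group()))
    return sorted(results)

sample = "Issued 01/01/2000, revised January 1, 2000, expires 01-JAN-2001."
matches = collate_individual(sample, DATE_EXTRACTORS)
print([value for _, value in matches])
```

Nothing is merged or reinterpreted here: every extractor contributes its matches unchanged, which is why each one is usually written for a single variation of the target data.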
Other collation providers interpret extractor results as more complex entities such as key-value pairs, arrays, table rows, data regions, and arbitrary data shapes. For a complete list of collation providers, see the 'Collation' property.
Inherits from: Grooper Node
The following 17 properties are defined.
Property Name | Description |
---|---|
General | |
Value Type | Type: Storage Type
Specifies the base type used for interpreting captured data values. If a captured value cannot be converted to the Value Type, it will be excluded from the output by default. The 'Allow Invalid Results' property of the Result Filter can be used to override this behavior. If the specified type is a formattable type, each output value will be reformatted based on a configurable Format Specifier. |
Culture Filter | Type: List of Culture Data
An optional list of cultures supported by this extractor. If this value is empty, the extractor will execute against all documents. Otherwise, the extractor will only execute on documents with a compatible culture code. |
Description | Type: String
Specifies a description for the item. |
Data Extraction | |
Pattern | Type: Data Pattern
Defines an internal Data Pattern which can be used in place of a child Data Format. This property is useful for simple extractions where only a single format needs to be defined. |
Referenced Extractors | Type: List of Extractor Node
Defines an optional list of external extractors to be executed. At runtime, referenced extractors execute after the internal pattern and the direct children have been executed. |
Input Filter | Type: Text Extractor
An optional extractor which restricts the scope of data extraction to a subset of the input. When an Input Filter is specified, extraction logic executes against each match produced by the Input Filter, rather than against the input itself. In some cases, input filters are used to simplify data extraction logic by restricting the scope to a specific page or section of the document. In other cases, input filters are used to reduce the execution time of fuzzy extractors by limiting the amount of content they execute against. The following examples demonstrate some commonly used input filters:
|
Exclusion Extractor | Type: Text Extractor
An optional extractor to be used for filtering undesirable results from the result set. Any result which overlaps with an exclusion match will be discarded. |
Subtraction Extractor | Type: Text Extractor
An optional extractor to be used for removing content from output values. If an extractor is specified, it will be executed against each final output value. Any content which matches the extractor will be removed from the output value. If the resulting output value is empty or contains only whitespace characters, the entire output value will be discarded. The extractor specified here MUST match a contiguous sequence of characters within the text flow. As such, the extractor cannot use any Collation Provider methods which combine instances geometrically. |
Output | |
Collation | Type: Collation Provider
Defines how extractor results are interpreted and transformed into the final result set. |
Order By | Type: SortOrder, Default: Position
Specifies the sort order of the result set. Can be one of the following values:
|
Direction | Type: SortDirection, Default: Ascending
Specifies whether the result set should be ordered in ascending or descending order. Can be one of the following values:
|
Lookup | Type: Lookup Options
Defines lexicon lookup and fuzzy matching options. |
Result Filter | Type: Result Filter
Specifies options for filtering output instances. |
Result Options | Type: Result Options
Specifies optional processing for each output instance. |
Post Processing | Type: Result Processor
Specifies an optional post-processing operation to be applied to each output instance. |
Deduplication | |
Deduplicate By | Type: DeduplicationMode, Default: None
Specifies the mode to be used for deduplicating overlapping results. Can be one of the following values:
|
Distinct Values | Type: Boolean, Default: False
If True, duplicate values will be eliminated, leaving only the first instance of each distinct value. Note that deduplication is not case-sensitive. |
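The Exclusion Extractor, Subtraction Extractor, and Distinct Values descriptions in the table above can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not Grooper's implementation: the function names, the `(start, end, value)` result tuples, the sample text, and the order in which the three steps are applied are all hypothetical.

```python
import re

def spans_overlap(a, b):
    """True when two half-open (start, end) spans share at least one character."""
    return a[0] < b[1] and b[0] < a[1]

def apply_exclusion(results, exclusions):
    """Drop any (start, end, value) result that overlaps an exclusion span."""
    return [r for r in results
            if not any(spans_overlap((r[0], r[1]), e) for e in exclusions)]

def apply_subtraction(values, pattern):
    """Remove matching content; discard values left empty or whitespace-only."""
    cleaned = (pattern.sub("", v) for v in values)
    return [v for v in cleaned if v.strip()]

def distinct_values(values):
    """Keep the first instance of each value, comparing case-insensitively."""
    seen, kept = set(), []
    for v in values:
        key = v.casefold()
        if key not in seen:
            seen.add(key)
            kept.append(v)
    return kept

# Hypothetical sample input and extractors.
text = "Ref: ACME-100 Void: ACME-200 Ref: acme-100 Ref: ACME-300"
results = [(m.start(), m.end(), m.group())
           for m in re.finditer(r"(?:Ref|Void): [A-Za-z]+-\d+", text)]
exclusions = [m.span() for m in re.finditer(r"Void: \S+", text)]

kept = apply_exclusion(results, exclusions)
values = apply_subtraction([v for _, _, v in kept], re.compile(r"^Ref: "))
final = distinct_values(values)
print(final)
```

In this sketch the voided reference is discarded because it overlaps an exclusion match, the `Ref: ` label is subtracted from each surviving value, and `acme-100` is dropped as a case-insensitive duplicate of `ACME-100`.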
Command Name | Shortcut Keys | Description |
---|---|---|
Add Multiple Items | | Creates multiple items as children of the selected object. |
Clear Children | | Deletes all children of the selected object(s). |
Export to Zip Archive | | Exports a set of Grooper nodes to a ZIP archive. |
Publish to Grooper Repository | | Publishes one or more Nodes to one or more Target Grooper Repositories. |
Unpublish | | Unpublishes a set of Grooper Nodes from a Target Grooper Repository. |
Tab Name | Description |
---|---|
Data Type - General | Provides a user interface displaying the properties of a Data Type as well as an interface for testing the Data Type using test batch documents. |
Grooper Node - Scripting | Provides script viewing, compilation, management, and basic editing features. |
Grooper Node - Contents | Provides a user interface for viewing and managing the children of a Grooper Node. |
Grooper Node - Advanced | Displays detailed information about Grooper Node objects, and provides administrative functions for managing them. |
Collation Provider, Culture Data, Data Pattern, Data Type, Field Class, Lookup Options, Reference, Result Filter, Result Options, Result Processor, Storage Type, Text Pattern