Data Type

A Data Type defines logic for extracting data values or data structures from the text content of a document.

Remarks

Data Types are one of the primary building blocks at the foundation of Grooper's data extraction and classification capabilities. They can be used to match simple data values such as dates and phone numbers, or more complex data structures such as key-value pairs and table rows.

Each data type specifies one or more "extractors", along with settings which control how the matches produced by those extractors are collated into a final result set. Different Collation Providers allow Data Types to perform a wide variety of data extraction tasks - from matching individual words or data values to extraction of complex 2-dimensional structures such as address blocks, table rows, or entire sections of information.

Defining Extractors

Extractors may be defined through the methods outlined below. If extractors are defined using multiple methods, they will be executed in the order shown.

Extractors and Collation Providers

The set of extractors to be created for a given Data Type depends on the task at hand and the Collation Provider in use. The default collation provider, Individual, simply returns all results from all extractors. In this mode, each extractor is typically designed to match a different variation of the target data. For example, in a Data Type designed to capture dates, each extractor might match a different date/time format (e.g. '01/01/2000', 'January 1, 2000', '01-JAN-2000').
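
The behavior of the Individual collation provider can be approximated outside Grooper with a short Python sketch. It is only an illustration: the pattern list, the function name collate_individual and the result dictionary layout are hypothetical, and Python's re module stands in for Grooper's own pattern engine.

```python
import re

# Hypothetical extractor patterns, one per date format variation.
DATE_EXTRACTORS = [
    r"\b\d{2}/\d{2}/\d{4}\b",           # 01/01/2000
    r"\b[A-Z][a-z]+ \d{1,2}, \d{4}\b",  # January 1, 2000
    r"\b\d{2}-[A-Z]{3}-\d{4}\b",        # 01-JAN-2000
]

def collate_individual(text, patterns):
    """Run every pattern and return all matches from all of them."""
    results = []
    for index, pattern in enumerate(patterns):
        for m in re.finditer(pattern, text):
            results.append(
                {"extractor": index, "start": m.start(), "value": m.group()}
            )
    return results

sample = "Issued 01/01/2000, renewed January 1, 2001, expires 01-JAN-2002."
matches = collate_individual(sample, DATE_EXTRACTORS)
for match in matches:
    print(match)
```

Each pattern contributes its own matches, and nothing is merged or discarded at this stage.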

Other collation providers interpret extractor results as more complex entities such as key-value pairs, arrays, table rows, data regions, and arbitrary data shapes. For a complete list of collation providers, see the 'Collation' property.

Inherits from: Grooper Node

Properties

The following 17 properties are defined.

Property Name Description
General
Value Type Type: Storage Type

Specifies the base type used for interpreting captured data values. If a captured value cannot be converted to the Value Type, it will be excluded from the output by default. The 'Allow Invalid Results' property of the Result Filter can be used to override this behavior. If the specified type is a formattable type, each output value will be reformatted based on a configurable Format Specifier.
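
The conversion rule can be pictured with a rough Python sketch for a date-like Value Type. The function apply_value_type, its parameters and the two format strings are hypothetical; in particular, passing an unconvertible value through unchanged when allow_invalid is set is an assumption about what 'Allow Invalid Results' does.

```python
from datetime import datetime

def apply_value_type(values, parse_format="%m/%d/%Y",
                     output_format="%Y-%m-%d", allow_invalid=False):
    """Convert captured strings to dates and reformat the ones that convert."""
    output = []
    for raw in values:
        try:
            parsed = datetime.strptime(raw, parse_format)
        except ValueError:
            # Assumption: an invalid value is kept as-is only when allowed.
            if allow_invalid:
                output.append(raw)
            continue
        output.append(parsed.strftime(output_format))
    return output

captured = ["01/31/2000", "13/45/2000"]
print(apply_value_type(captured))
print(apply_value_type(captured, allow_invalid=True))
```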

Culture Filter Type: List of Culture Data

An optional list of cultures supported by this extractor. If this value is empty, the extractor will execute against all documents. Otherwise, the extractor will only execute on documents with a compatible culture code.

Description Type: String

Specifies a description for the item.

Data Extraction
Pattern Type: Data Pattern

Defines an internal Data Pattern which can be used in place of a child Data Format. This property is useful for simple extractions where only a single format needs to be defined.

Referenced Extractors Type: List of Extractor Node

Defines an optional list of external extractors to be executed. At runtime, referenced extractors execute after the internal pattern and the direct children have been executed.

Input Filter Type: Text Extractor

An optional extractor which restricts the scope of data extraction to a subset of the input. When an Input Filter is specified, extraction logic executes against each match produced by the Input Filter, rather than against the input itself. This mechanism can be used to restrict extraction operations to a subset of the original input.

In some cases, input filters are used to simplify data extraction logic by restricting the scope to a specific page or section of the document. In other cases, input filters are used to reduce the execution time of fuzzy extractors by limiting the amount of content they execute against.

The following examples demonstrate some commonly-used input filters:

  • Restrict to the first page of the document: ^[^\f]+
  • Restrict to the last page of the document: [^\f]+$
  • Restrict to first 5 lines of the document: ^([^\r\n]+\r\n){5}
  • Restrict to top 3 lines of each page: (^|\f)([^\r\n]+\r\n){3}
  • Restrict to the PERSONAL INFO section: \r\nPERSONAL INFO[^\f]+\r\nEDUCATION
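
As an illustration of how an input filter narrows the scope, the following Python sketch runs an extractor against each match of a filter pattern rather than against the whole text. It assumes pages are separated by a form feed character (\f) and lines end with \r\n, as the patterns above imply. The sample document, the function name and the date pattern are hypothetical, and Python's re module may differ in details from Grooper's pattern engine.

```python
import re

# Hypothetical two-page document; \f marks the page break.
document = (
    "INVOICE 1001\r\nDate: 01/01/2000\r\n"
    "\f"
    "Page two\r\nDate: 02/02/2000\r\n"
)

def extract_with_input_filter(text, input_filter, extractor):
    """Run the extractor against each input filter match, not the full text."""
    values = []
    for scope in re.finditer(input_filter, text):
        values.extend(m.group() for m in re.finditer(extractor, scope.group()))
    return values

date_pattern = r"\d{2}/\d{2}/\d{4}"

first_page = extract_with_input_filter(document, r"^[^\f]+", date_pattern)
last_page = extract_with_input_filter(document, r"[^\f]+$", date_pattern)
# Variation of the "top lines of each page" filter, using 2 lines.
top_lines = extract_with_input_filter(
    document, r"(^|\f)([^\r\n]+\r\n){2}", date_pattern
)

print(first_page, last_page, top_lines)
```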

Exclusion Extractor Type: Text Extractor

An optional extractor to be used for filtering undesirable results from the result set. Any result which overlaps with an exclusion match will be discarded.
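
A minimal Python sketch of the overlap rule, assuming each result and each exclusion match is represented by a half-open character span (start, end). The helper names, the sample text and both patterns are hypothetical.

```python
import re

def overlaps(a, b):
    """True when two half-open (start, end) spans share at least one character."""
    return a[0] < b[1] and b[0] < a[1]

def apply_exclusions(results, exclusions):
    """Discard every result whose span overlaps any exclusion span."""
    return [
        r for r in results
        if not any(overlaps(r["span"], x) for x in exclusions)
    ]

text = "Phone: 555-1234  Fax: 555-9999"
results = [
    {"value": m.group(), "span": m.span()}
    for m in re.finditer(r"\d{3}-\d{4}", text)
]
# Exclusion pattern: any number that is labeled as a fax number.
exclusions = [m.span() for m in re.finditer(r"Fax: \d{3}-\d{4}", text)]

kept = apply_exclusions(results, exclusions)
print([r["value"] for r in kept])
```

The fax number is dropped because its span lies inside the exclusion match, while the phone number does not overlap it and is kept.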

Subtraction Extractor Type: Text Extractor

An optional extractor to be used for removing content from output values. If an extractor is specified, it will be executed against each final output value. Any content which matches the extractor will be removed from the output value. If the resulting output value is empty or contains only whitespace characters, the entire output value will be discarded.

The extractor specified here MUST match a contiguous sequence of characters within the text flow. As such, the extractor cannot use any Collation Provider Methods which combine instances geometrically.
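
The subtraction step can be sketched in Python as a pattern removal followed by an emptiness check. The function name and the sample values are hypothetical, and whether Grooper trims the remaining value is not stated here, so the sketch leaves it untouched.

```python
import re

def subtract(values, subtraction_pattern):
    """Remove matching content from each value; drop values left empty."""
    output = []
    for value in values:
        remainder = re.sub(subtraction_pattern, "", value)
        # Discard values that are empty or contain only whitespace.
        if remainder.strip():
            output.append(remainder)
    return output

cleaned = subtract(["Total: $100.00", "Total: ", "$25.50"], r"Total:\s*")
print(cleaned)
```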

Output
Collation Type: Collation Provider

Defines how extractor results are interpreted and transformed into the final result set.

Order By Type: SortOrder, Default: Position

Specifies the sort order of the result set. Can be one of the following values:

  • Position - Results are ordered by their position within the content flow.
  • Frequency - Results are ordered by the number of occurrences of each distinct value.
  • Confidence - Results are ordered by confidence.
  • Extractor - Results are ordered by the extractor which produced each match. Can be used to prioritize the results from one extractor over those of another.
  • Length - Results are ordered by the length of the value.
  • Value - Results are ordered by value.

Direction Type: SortDirection, Default: Ascending

Specifies whether the result set should be ordered in ascending or descending order. Can be one of the following values:

  • Ascending - Results are returned in ascending order, where smaller values appear before larger values.
  • Descending - Results are returned in descending order, where larger values appear before smaller values.
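
The combined effect of Order By and Direction can be approximated with a keyed sort. This Python sketch is illustrative only: the result dictionaries and the order_results function are hypothetical, and tie-breaking in Grooper may differ from the stable sort used here.

```python
from collections import Counter

# Hypothetical result set; "extractor" is the index of the producing extractor.
results = [
    {"value": "beta", "position": 10, "confidence": 0.8, "extractor": 1},
    {"value": "alpha", "position": 25, "confidence": 0.95, "extractor": 0},
    {"value": "beta", "position": 40, "confidence": 0.6, "extractor": 0},
]

def order_results(results, order_by="Position", direction="Ascending"):
    """Sort results by one of the Order By modes in the given direction."""
    frequency = Counter(r["value"] for r in results)
    keys = {
        "Position": lambda r: r["position"],
        "Frequency": lambda r: frequency[r["value"]],
        "Confidence": lambda r: r["confidence"],
        "Extractor": lambda r: r["extractor"],
        "Length": lambda r: len(r["value"]),
        "Value": lambda r: r["value"],
    }
    return sorted(results, key=keys[order_by],
                  reverse=(direction == "Descending"))

by_confidence = order_results(results, "Confidence", "Descending")
print([r["value"] for r in by_confidence])
```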

Lookup Type: Lookup Options

Defines lexicon lookup and fuzzy matching options.

Result Filter Type: Result Filter

Specifies options for filtering output instances.

Result Options Type: Result Options

Specifies optional processing for each output instance.

Post Processing Type: Result Processor

Specifies an optional post-processing operation to be applied to each output instance.

Deduplication
Deduplicate By Type: DeduplicationMode, Default: None

Specifies the mode to be used for deduplicating overlapping results. Can be one of the following values:

  • None - No deduplication of overlapping results will be performed.
  • Area - The instance occupying the largest geometric region will win. In the case of a tie, the instance with the highest confidence will win.
  • Length - The instance matching the longest span will win. In the case of a tie, the instance with the highest confidence will win.
  • Confidence - The instance with the highest confidence will win. In the case of a tie, the instance with the greatest length will win.
  • Count - The instance matching the most characters will win. In the case of a tie, the instance with the highest confidence will win.

When multiple results occupy an overlapping region in the document content, deduplication ensures that only one of them will "win" and be included in the output. Below are some common uses for deduplication:

  • In some Data Types, each extractor is attempting to match the same value using different techniques for redundancy purposes. In the event that one method fails, the other may succeed. When multiple extractors do succeed, it can be undesirable to have multiple copies of the same result in the output. Deduplication by any method will ensure that only a single result is included in the output.
  • Other Data Types may prefer a match by Extractor A over a match by Extractor B. In such cases, the extractors would each be configured to output a confidence value reflecting their relative desirability level, and results would be deduplicated by confidence.
  • There are also cases where a Data Type is extracting values which are self-containing. For example, Extractor A matches "OWNERSHIP REPORT" and Extractor B matches "MINERAL OWNERSHIP REPORT". In a document containing the text "MINERAL OWNERSHIP REPORT", both would successfully match. Deduplicating by length (or area) in this case would discard "OWNERSHIP REPORT" and include "MINERAL OWNERSHIP REPORT" in the output.
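
The "MINERAL OWNERSHIP REPORT" case can be worked through with a short Python sketch. It covers only the Length and Confidence modes, uses character spans in place of geometric regions, and resolves overlaps greedily from the best-ranked result down. The greedy strategy, the function names and the confidence values are assumptions made for illustration.

```python
def overlaps(a, b):
    """True when two results cover at least one common character."""
    return a["start"] < b["end"] and b["start"] < a["end"]

def deduplicate(results, mode="Length"):
    """Keep one winner from each group of overlapping results."""
    if mode == "None":
        return list(results)
    if mode == "Length":
        # Longest span wins; ties go to the higher confidence.
        rank = lambda r: (r["end"] - r["start"], r["confidence"])
    elif mode == "Confidence":
        # Highest confidence wins; ties go to the greater length.
        rank = lambda r: (r["confidence"], r["end"] - r["start"])
    else:
        raise ValueError(f"mode not covered by this sketch: {mode}")
    winners = []
    for candidate in sorted(results, key=rank, reverse=True):
        if not any(overlaps(candidate, w) for w in winners):
            winners.append(candidate)
    return sorted(winners, key=lambda r: r["start"])

def make_result(text, value, confidence):
    start = text.find(value)
    return {"value": value, "start": start, "end": start + len(value),
            "confidence": confidence}

text = "MINERAL OWNERSHIP REPORT"
results = [
    make_result(text, "OWNERSHIP REPORT", 0.9),          # Extractor A
    make_result(text, "MINERAL OWNERSHIP REPORT", 0.7),  # Extractor B
]

by_length = deduplicate(results, "Length")
by_confidence = deduplicate(results, "Confidence")
print([r["value"] for r in by_length])
print([r["value"] for r in by_confidence])
```

With these hypothetical confidence values, deduplicating by length keeps the longer match, while deduplicating by confidence keeps the shorter one.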

Distinct Values Type: Boolean, Default: False

If True, duplicate values will be eliminated, leaving only the first instance of each distinct value. Note that deduplication is not case-sensitive.
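
A small Python sketch of the Distinct Values behavior, keeping the first instance of each value and comparing without regard to case. The function name is hypothetical, and using casefold() for the comparison is an assumption about how case-insensitivity is implemented.

```python
def distinct_values(values):
    """Keep only the first instance of each value, ignoring case."""
    seen = set()
    output = []
    for value in values:
        key = value.casefold()
        if key not in seen:
            seen.add(key)
            output.append(value)
    return output

unique = distinct_values(["Texas", "TEXAS", "Oklahoma", "texas"])
print(unique)
```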

Commands

Command Name Shortcut Keys Description
Add Multiple Items Creates multiple items as children of the selected object.
Clear Children Deletes all children of the selected object(s).
Export to Zip Archive Exports a set of Grooper nodes to a ZIP archive.
Publish to Grooper Repository Publishes one or more Nodes to one or more Target Grooper Repositories.
Unpublish Unpublishes a set of Grooper Nodes from a Target Grooper Repository.

Tabs

Tab Name Description
Data Type - General Provides a user interface displaying the properties of a Data Type as well as an interface for testing the Data Type using test batch documents.
Grooper Node - Scripting Provides script viewing, compilation, management, and basic editing features.
Grooper Node - Contents Provides a user interface for viewing and managing the children of a Grooper Node.
Grooper Node - Advanced Displays detailed information about Grooper Node objects, and provides administrative functions for managing them.

See Also

Collation Provider, Culture Data, Data Pattern, Data Type, Field Class, Lookup Options, Reference, Result Filter, Result Options, Result Processor, Storage Type, Text Pattern

Used By

Embedded Extractor, Mail Import, Redact, Reference, Render