Data Type

A Data Type defines logic for extracting data values or data structures from the text content of a document.

Remarks

Data Types are one of the primary building blocks at the foundation of Grooper's data extraction and classification capabilities. They can be used to match simple data values such as dates and phone numbers, or more complex data structures such as key-value pairs and table rows.

Each Data Type specifies one or more "extractors", along with settings which control how the matches produced by those extractors are collated into a final result set. Different Collation Providers allow Data Types to perform a wide variety of data extraction tasks, from matching individual words or data values to extracting complex 2-dimensional structures such as address blocks, table rows, or entire sections of information.

Defining Extractors

Extractors may be defined through the methods outlined below. If extractors are defined using multiple methods, they will be executed in the order shown.

Internal Pattern: The 'Pattern' property provides a single Data Pattern which can be used for simple Data Types.
Direct Children: Data Formats and other Data Types may be created as children of a Data Type.
Referenced Extractors: The 'Referenced Extractors' property can be used to reference one or more external Data Types or Field Classes.

Extractors and Collation Providers

The set of extractors to be created for a given Data Type depends on the task at hand and the Collation Provider in use. The default Collation Provider, Individual, simply returns all results from all extractors. In this mode, each extractor is typically designed to match a different variation of the target data. For example, in a Data Type designed to capture dates, each extractor might match a different date/time format (e.g. '01/01/2000', 'January 1, 2000', '01-JAN-2000').
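
As a rough illustration of the Individual mode described above, the following Python sketch uses three regular expressions as stand-ins for extractors, one per date format, and returns every match from every extractor. The names DATE_EXTRACTORS and collate_individual are hypothetical and are not part of Grooper; ordering by position is assumed here because Position is the documented default for 'Order By'.

```python
import re

# Hypothetical stand-ins for three extractors, one per date format.
DATE_EXTRACTORS = [
    re.compile(r"\b\d{2}/\d{2}/\d{4}\b"),           # 01/01/2000
    re.compile(r"\b[A-Z][a-z]+ \d{1,2}, \d{4}\b"),  # January 1, 2000
    re.compile(r"\b\d{2}-[A-Z]{3}-\d{4}\b"),        # 01-JAN-2000
]


def collate_individual(text):
    """Return all matches from all extractors, ordered by position."""
    results = []
    for index, pattern in enumerate(DATE_EXTRACTORS):
        for match in pattern.finditer(text):
            # Keep the start offset so results can be sorted by position.
            results.append((match.start(), index, match.group()))
    results.sort()
    return [value for _, _, value in results]


sample = "Issued 01/01/2000, renewed January 1, 2001, expires 01-JAN-2002."
print(collate_individual(sample))
# ['01/01/2000', 'January 1, 2001', '01-JAN-2002']
```

Each pattern covers one variation of the target data, so adding support for a new date format only requires adding another entry to the list.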

Other collation providers interpret extractor results as more complex entities such as key-value pairs, arrays, table rows, data regions, and arbitrary data shapes. For a complete list of collation providers, see the 'Collation' property.

Inherits from: Grooper Node

Properties

The following 17 properties are defined.

Property Name	Description
General
Value Type	Type: Storage Type Specifies the base type used for interpreting captured data values. If a captured value cannot be converted to the Value Type, it will be excluded from the output by default. The 'Allow Invalid Results' property of the Result Filter can be used to override this behavior. If the specified type is a formattable type, each output value will be reformatted based on a configurable Format Specifier.
Culture Filter	Type: List of Culture Data An optional list of cultures supported by this extractor. If this value is empty, the extractor will execute against all documents. Otherwise, the extractor will only execute on documents with a compatible culture code.
Description	Type: String Specifies a description for the item.
Data Extraction
Pattern	Type: Data Pattern Defines an internal Data Pattern which can be used in place of a child Data Format. This property is useful for simple extractions where only a single format needs to be defined.
Referenced Extractors	Type: List of Extractor Node Defines an optional list of external extractors to be executed. At runtime, referenced extractors execute after the internal pattern and the direct children have been executed.
Input Filter	Type: Text Extractor An optional extractor which restricts the scope of data extraction to a subset of the input. When an Input Filter is specified, extraction logic executes against each match produced by the Input Filter, rather than against the input itself. In some cases, input filters are used to simplify data extraction logic by restricting the scope to a specific page or section of the document. In other cases, input filters are used to reduce the execution time of fuzzy extractors by limiting the amount of content they execute against. The following examples demonstrate some commonly-used input filters: Restrict to the first page of the document: ^[^\f]+ Restrict to the last page of the document: [^\f]+$ Restrict to first 5 lines of the document: ^([^\r\n]+\r\n){5} Restrict to top 3 lines of each page: (^|\f)([^\r\n]+\r\n){3} Restrict to the PERSONAL INFO section: \r\nPERSONAL INFO[^\f]+\r\nEDUCATION
Exclusion Extractor	Type: Text Extractor An optional extractor to be used for filtering undesirable results from the result set. Any result which overlaps with an exclusion match will be discarded.
Subtraction Extractor	Type: Text Extractor An optional extractor to be used for removing content from output values. If an extractor is specified, it will be executed against each final output value. Any content which matches the extractor will be removed from the output value. If the resulting output value is empty or contains only whitespace characters, the entire output value will be discarded. The extractor specified here MUST match a contiguous sequence of characters within the text flow. As such, the extractor cannot use any Collation Provider methods which combine instances geometrically.
Output
Collation	Type: Collation Provider Defines how extractor results are interpreted and transformed into the final result set.
Order By	Type: SortOrder, Default: Position Specifies the sort order of the result set. Can be one of the following values: Position - Results are ordered by their position within the content flow. Frequency - Results are ordered by the number of occurrences of each distinct value. Confidence - Results are ordered by confidence. Extractor - Results are ordered by the extractor which produced each match. Can be used to prioritize the results from one extractor over those of another. Length - Results are ordered by the length of the value. Value - Results are ordered by value.
Direction	Type: SortDirection, Default: Ascending Specifies whether the result set should be ordered in ascending or descending order. Can be one of the following values: Ascending - Results are returned in ascending order, where smaller values appear before larger values. Descending - Results are returned in descending order, where larger values appear before smaller values.
Lookup	Type: Lookup Options Defines lexicon lookup and fuzzy matching options.
Result Filter	Type: Result Filter Specifies options for filtering output instances.
Result Options	Type: Result Options Specifies optional processing for each output instance.
Post Processing	Type: Result Processor Specifies an optional post-processing operation to be applied to each output instance.
Deduplication
Deduplicate By	Type: DeduplicationMode, Default: None Specifies the mode to be used for deduplicating overlapping results. Can be one of the following values: None - No deduplication of overlapping results will be performed. Area - The instance occupying the largest geometric region will win. In the case of a tie, the instance with the highest confidence will win. Length - The instance matching the longest span will win. In the case of a tie, the instance with the highest confidence will win. Confidence - The instance with the highest confidence will win. In the case of a tie, the instance with the greatest length will win. Count - The instance matching the most characters will win. In the case of a tie, the instance with the highest confidence will win. When multiple results occupy an overlapping region in the document content, deduplication ensures that only one of them will "win" and be included in the output. Below are some common uses for deduplication: In some Data Types, each extractor attempts to match the same value using a different technique, for redundancy. In the event that one method fails, the other may succeed. When multiple extractors do succeed, it can be undesirable to have multiple copies of the same result in the output. Deduplication by any method will ensure that only a single result is included in the output. Other Data Types may prefer a match by Extractor A over a match by Extractor B. In such cases, the extractors would each be configured to output a confidence value reflecting their relative desirability level, and results would be deduplicated by confidence. There are also cases where a Data Type is extracting values which are self-containing. For example, Extractor A matches "OWNERSHIP REPORT" and Extractor B matches "MINERAL OWNERSHIP REPORT". In a document containing the text "MINERAL OWNERSHIP REPORT", both would successfully match. Deduplicating by length (or area) in this case would discard "OWNERSHIP REPORT" and include "MINERAL OWNERSHIP REPORT" in the output.
Distinct Values	Type: Boolean, Default: False If True, duplicate values will be eliminated, leaving only the first instance of each distinct value. Note that deduplication is not case-sensitive.
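
The "MINERAL OWNERSHIP REPORT" case described for 'Deduplicate By' can be sketched in Python as follows. This is a minimal illustration, not Grooper's implementation: the Result class, the overlaps helper, and the greedy longest-first strategy are all assumptions made for the example. Only the rule itself (longest span wins, ties go to the highest confidence) comes from the property description.

```python
from dataclasses import dataclass


@dataclass
class Result:
    """Hypothetical match record: start offset, matched text, confidence."""
    start: int
    value: str
    confidence: float

    @property
    def end(self):
        return self.start + len(self.value)


def overlaps(a, b):
    """True when the two character spans share at least one position."""
    return a.start < b.end and b.start < a.end


def deduplicate_by_length(results):
    """Keep the longest of any overlapping results; ties go to confidence."""
    ranked = sorted(results, key=lambda r: (-len(r.value), -r.confidence))
    winners = []
    for candidate in ranked:
        # A candidate survives only if no stronger result already covers it.
        if not any(overlaps(candidate, winner) for winner in winners):
            winners.append(candidate)
    return sorted(winners, key=lambda r: r.start)


text = "MINERAL OWNERSHIP REPORT"
results = [
    Result(text.index("OWNERSHIP REPORT"), "OWNERSHIP REPORT", 0.9),  # Extractor A
    Result(0, "MINERAL OWNERSHIP REPORT", 0.9),                       # Extractor B
]
print([r.value for r in deduplicate_by_length(results)])
# ['MINERAL OWNERSHIP REPORT']
```

Swapping the sort key to put confidence first would give the Confidence mode's behavior under the same assumptions.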

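The Input Filter mechanism described in the properties table can be approximated in Python as shown below. This sketch assumes that a form feed (\f) separates pages and that lines end in \r\n, as the example patterns imply. It uses Python's re module, whose behavior may differ in detail from the engine Grooper uses. The function apply_input_filter and the sample document are hypothetical.

```python
import re

# Two-page sample: pages are separated by a form feed, lines end in CRLF.
document = (
    "COVER PAGE\r\nline 2\r\nline 3\r\nline 4\r\n"
    "\f"
    "PAGE TWO\r\nPERSONAL INFO\r\nName: Jane\r\nEDUCATION\r\nBS\r\n"
)

# Input filters modeled on the examples in the Input Filter description.
FIRST_PAGE = re.compile(r"^[^\f]+")
LAST_PAGE = re.compile(r"[^\f]+$")
TOP_3_LINES = re.compile(r"(^|\f)([^\r\n]+\r\n){3}")
PERSONAL_INFO = re.compile(r"\r\nPERSONAL INFO[^\f]+\r\nEDUCATION")

# Hypothetical extractors whose scope the filters restrict.
LINE = re.compile(r"line \d")
NAME = re.compile(r"Name: \w+")


def apply_input_filter(input_filter, extractor, text):
    """Run the extractor against each filter match, not the full text."""
    values = []
    for scope in input_filter.finditer(text):
        values.extend(m.group() for m in extractor.finditer(scope.group()))
    return values


print(apply_input_filter(FIRST_PAGE, LINE, document))
# ['line 2', 'line 3', 'line 4']
print(apply_input_filter(TOP_3_LINES, LINE, document))
# ['line 2', 'line 3']
print(apply_input_filter(LAST_PAGE, LINE, document))
# []
print(apply_input_filter(PERSONAL_INFO, NAME, document))
# ['Name: Jane']
```

Because the extractor only ever sees the text inside each filter match, a narrow filter both simplifies the extractor's pattern and reduces the amount of content it has to scan.
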
Commands

	Command Name	Shortcut Keys	Description
	Add Multiple Items		Creates multiple items as children of the selected object.
	Clear Children		Deletes all children of the selected object(s).
	Export to Zip Archive		Exports a set of Grooper nodes to a ZIP archive.
	Publish to Grooper Repository		Publishes one or more Nodes to one or more Target Grooper Repositories.
	Unpublish		Unpublishes a set of Grooper Nodes from a Target Grooper Repository.

Tabs

Tab Name	Description
Data Type - General	Provides a user interface displaying the properties of a Data Type as well as an interface for testing the Data Type using test batch documents.
Grooper Node - Scripting	Provides script viewing, compilation, management, and basic editing features.
Grooper Node - Contents	Provides a user interface for viewing and managing the children of a Grooper Node.
Grooper Node - Advanced	Displays detailed information about Grooper Node objects, and provides administrative functions for managing them.

Used By

Embedded Extractor, Mail Import, Redact, Reference, Render