Data Type

A Data Type defines extraction logic for a distinct type of data, such as a field value or a table row. Each data type defines one or more extractors, along with settings which control how the extractor results are transformed into a final result set.

Remarks

At runtime, a Data Type executes its extractors in the following order: the internal Pattern and the direct child extractors first, followed by any Referenced Extractors.

The results returned from the individual extractors are then transformed into a final result set based on various output options. By default, the output contains all results from all extractors, in the order in which they appear in the document.
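The default behavior described above can be pictured with a minimal sketch. It is an illustration only, not Grooper's implementation: plain regular expressions stand in for extractors, and the names Instance and run_extractors are hypothetical.

```python
import re
from typing import NamedTuple


class Instance(NamedTuple):
    start: int   # character offset where the match begins
    end: int     # character offset just past the match
    value: str   # matched text


def run_extractors(patterns, text):
    """Run each pattern in turn, then return every match in document order."""
    results = []
    for pattern in patterns:  # extractors execute one after another
        for m in re.finditer(pattern, text):
            results.append(Instance(m.start(), m.end(), m.group()))
    # Default output: all results from all extractors, ordered by position.
    return sorted(results, key=lambda inst: inst.start)


text = "PO AB12 shipped 2023-05 ref CD34"
# Hypothetical extractors: a year-month pattern and a two-letter code pattern.
found = run_extractors([r"\d{4}-\d{2}", r"[A-Z]{2}\d{2}"], text)
print([inst.value for inst in found])
```

Although the year-month pattern runs first, its match appears between the two code matches because the combined results are ordered by position rather than by extractor.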

Inherits from: Grooper Node

Properties

The following 16 properties are defined.

Property Name Description
General
Value Type Type: Storage Type, Default: String

Defines the type of data this extractor will capture. Can be one of the following values:

  • Boolean - Represents a Boolean (true or false) value.
  • DateTime - Represents an instant in time, typically expressed as a date and/or time of day.
  • Decimal - Represents a decimal value.
  • Double - Represents a 64-bit floating point value.
  • GUID - Represents a globally unique identifier (GUID).
  • Int16 - Represents a 16-bit integer value.
  • Int32 - Represents a 32-bit integer value.
  • Int64 - Represents a 64-bit integer value.
  • String - String values can store any type of text information.
  • URL - A Uniform Resource Locator (URL) is a string of characters used to identify a web resource, such as a web page on an HTTP server, or a file on an FTP server.
If a captured value cannot be converted to the base type, it will be excluded from the output, unless the Allow Invalid Results property of the Result Filter is set to True.
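
The conversion rule can be sketched as follows. This is a simplified model under stated assumptions: only three of the value types are shown, Python types stand in for the storage types, and the allow_invalid parameter is a hypothetical stand-in for the Allow Invalid Results property.

```python
from decimal import Decimal, InvalidOperation

# Assumed mapping from a few Value Type names to Python converters.
CONVERTERS = {
    "Int32": int,
    "Decimal": Decimal,
    "String": str,
}


def convert_results(values, value_type, allow_invalid=False):
    """Convert captured strings; drop failures unless allow_invalid is set."""
    convert = CONVERTERS[value_type]
    output = []
    for raw in values:
        try:
            output.append(convert(raw))
        except (ValueError, InvalidOperation):
            # The value cannot be converted to the requested type.
            if allow_invalid:
                output.append(raw)  # keep the unconverted text
    return output


print(convert_results(["10", "x", "12.5"], "Int32"))
print(convert_results(["10", "x", "12.5"], "Int32", allow_invalid=True))
```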

Culture Filter Type: List of Culture Data

Defines a list of cultures supported by this extractor. If this value is empty, the extractor will execute against all documents. Otherwise, the extractor will only execute on documents which map to one of the specified cultures.
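
The rule amounts to a simple membership test. The sketch below assumes cultures are represented as plain strings such as "en-US"; the function name should_execute is hypothetical.

```python
def should_execute(culture_filter, document_culture):
    """An empty filter matches every document; otherwise the culture must be listed."""
    return not culture_filter or document_culture in culture_filter


print(should_execute([], "fr-FR"))
print(should_execute(["en-US"], "fr-FR"))
```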

Description Type: String

Generic property allowing an administrator to document the purpose of this Grooper Node.

Data Extraction
Pattern Type: Data Pattern

Defines an internal Data Pattern which can be used in place of a child Data Format or Data Type. This property is useful for simple extractions where only one format needs to be defined.

Referenced Extractors Type: List of Grooper Node

Defines an optional list of external extractors to be executed. At runtime, referenced extractors execute after the internal pattern and the direct children have been executed.

Input Filter Type: Embedded Extractor

An optional extractor to be used for transforming input prior to extraction. Input filters are used to select a subset of the source content prior to running the extractors. In many cases extraction logic can be simplified if scope is limited to a small portion of the document. When an input filter is specified, it is executed against the source. The Data Type's extractors are then executed on each instance returned by the input filter.
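
A minimal sketch of the idea, assuming regular expressions stand in for both the input filter and the extractor. The function name extract_with_input_filter is hypothetical and does not reflect Grooper's API.

```python
import re


def extract_with_input_filter(filter_pattern, extractor_pattern, text):
    """Run the extractor only inside each region selected by the input filter."""
    results = []
    for region in re.finditer(filter_pattern, text):
        for m in re.finditer(extractor_pattern, region.group()):
            # Shift the offset so it refers to the original source text.
            results.append((region.start() + m.start(), m.group()))
    return results


text = "Phone: 555-0100\nTotal: 120.00\nTax: 9.60\n"
# The filter limits scope to the "Total:" line, so a generic amount
# pattern no longer picks up the tax amount.
amounts = extract_with_input_filter(r"Total:.*", r"\d+\.\d{2}", text)
print(amounts)
```

Because the scope is narrowed first, the extractor pattern itself can stay generic instead of encoding the surrounding label.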

Exclusion Extractor Type: Embedded Extractor

An optional extractor to be used for filtering undesirable results from the result set. Any output instances which overlap with an exclusion instance will be discarded.
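
The overlap rule can be illustrated with character spans. This sketch assumes each instance is a (start, end) pair of character offsets with an exclusive end; the actual overlap test may be defined differently, and the function names are hypothetical.

```python
def overlaps(a, b):
    """True when two half-open (start, end) spans share at least one character."""
    return a[0] < b[1] and b[0] < a[1]


def apply_exclusions(instances, exclusions):
    """Discard every instance that overlaps any exclusion instance."""
    return [
        inst for inst in instances
        if not any(overlaps(inst, ex) for ex in exclusions)
    ]


instances = [(0, 5), (10, 15), (20, 25)]
exclusions = [(12, 30)]
print(apply_exclusions(instances, exclusions))
```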

Subtraction Extractor Type: Embedded Extractor

An optional extractor to be used for removing content from output values. If an extractor is specified, it will be executed against each final output value. Any content which matches the extractor will be removed from the output value. If the resulting output value is empty or contains only whitespace characters, the entire output value will be discarded.

The extractor specified here MUST match a contiguous sequence of characters within the text flow. As such, the extractor cannot use any Collation Provider methods which combine instances geometrically.
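
The subtraction behavior can be sketched as follows, assuming a regular expression stands in for the subtraction extractor. The function name subtract is hypothetical.

```python
import re


def subtract(values, subtraction_pattern):
    """Remove matching content from each value; drop values left empty or blank."""
    output = []
    for value in values:
        cleaned = re.sub(subtraction_pattern, "", value)
        if cleaned.strip():  # discard empty or whitespace-only results
            output.append(cleaned)
    return output


print(subtract(["Total: 42.00", "Total: ", "N/A"], r"Total:\s*"))
```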

Output
Collation Type: Collation Provider

Defines how instances from individual extractors are transformed into the final output. Can be one of the following values:

  • Array - Matches a list of values arranged in horizontal, vertical, or flow order.
  • Combine - Combines instances from child extractors based on the grouping specified in the Group By property.
  • Individual - Combines the individual results from all extractors into a single result set.
  • Key-Value List - Matches cases where a key and a list of 1 or more values occur on the document in a specific layout.
  • Key-Value Pair - Matches cases where a key-value pair occurs on the document in a specific layout.
  • Multi-Column - Outputs a single instance where the document has been reformatted to reflect the flow of a multi-column document.
  • Ordered Array - Finds sequences of values where one result is present for each extractor, in the order in which they appear.
  • Pattern-Based - Uses a regular expression to select a sequence of child extractor results.
  • Split - Splits the input at each match found by an extractor.

Order By Type: SortOrder, Default: Position

Controls the output order of the result set. Can be one of the following values:

  • Position - Results are ordered by their position within the content flow.
  • Frequency - Results are ordered by the number of occurrences of each distinct value.
  • Confidence - Results are ordered by confidence.
  • Extractor - Results are ordered by the extractor which produced each match. Can be used to prioritize the results from one extractor over those of another.
  • Length - Results are ordered by the length of the value.
  • Value - Results are ordered by value.

Direction Type: SortDirection, Default: Ascending

Controls the sort direction of the result set. Can be one of the following values:

  • Ascending - Results are returned in ascending order, where smaller values appear before larger values.
  • Descending - Results are returned in descending order, where larger values appear before smaller values.
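
The interaction of Order By and Direction can be sketched as a keyed sort. This is an illustration under assumptions: the Result fields and the sort_results function are hypothetical, and the exact tie-breaking rules are not stated in this reference.

```python
from collections import Counter
from typing import NamedTuple


class Result(NamedTuple):
    value: str
    position: int         # offset within the content flow
    confidence: float
    extractor_index: int  # which extractor produced the match


def sort_results(results, order_by="Position", direction="Ascending"):
    """Sort results by the chosen key, reversed when direction is Descending."""
    counts = Counter(r.value for r in results)
    keys = {
        "Position": lambda r: r.position,
        "Frequency": lambda r: counts[r.value],
        "Confidence": lambda r: r.confidence,
        "Extractor": lambda r: r.extractor_index,
        "Length": lambda r: len(r.value),
        "Value": lambda r: r.value,
    }
    return sorted(results, key=keys[order_by],
                  reverse=(direction == "Descending"))


results = [
    Result("B2", 0, 0.7, 1),
    Result("A10", 5, 0.9, 0),
    Result("B2", 9, 0.8, 1),
]
by_confidence = sort_results(results, "Confidence", "Descending")
print([r.value for r in by_confidence])
```

Sorting by Extractor in ascending order, for example, places the results of the first extractor ahead of the rest, which is one way to read the prioritization described above.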

Result Filter Type: Result Filter

Specifies options for filtering output instances.

Result Options Type: Result Options

Specifies optional processing for each output instance.

Post Processing Type: Result Processor

Specifies an optional post-processing operation to be applied to each output instance. Can be one of the following values:

  • OCR Reader - Extracts text from a region near each output instance.
  • OMR Reader - Treats each extractor result as the label for an OMR zone, and attempts to detect and read the associated checkboxes.
  • Place Zone - Places a zone relative to the output instance.

Deduplication
Deduplicate Locations Type: Boolean, Default: False

If True, instances with overlapping zones will be de-duplicated, with precedence given to larger data elements.
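
One way to picture this rule is a greedy pass that visits larger instances first. The sketch below assumes one-dimensional (start, end) character spans in place of zones, which is a simplification; the function name deduplicate_locations is hypothetical.

```python
def deduplicate_locations(spans):
    """Keep the larger span whenever two half-open (start, end) spans overlap."""
    kept = []
    # Visit larger spans first so they take precedence over smaller ones.
    for span in sorted(spans, key=lambda s: s[1] - s[0], reverse=True):
        if not any(span[0] < k[1] and k[0] < span[1] for k in kept):
            kept.append(span)
    return sorted(kept)  # restore positional order


print(deduplicate_locations([(0, 10), (2, 5), (12, 15), (8, 13)]))
```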

Deduplicate Values Type: Boolean, Default: False

If True, duplicate values will be eliminated, leaving only the first instance of the value.
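
A minimal sketch of value deduplication, assuming values are compared as exact strings (whether comparison is case-sensitive is not stated here). The function name deduplicate_values is hypothetical.

```python
def deduplicate_values(values):
    """Keep only the first occurrence of each value, preserving order."""
    seen = set()
    output = []
    for value in values:
        if value not in seen:
            seen.add(value)
            output.append(value)
    return output


print(deduplicate_values(["A", "B", "A", "C", "B"]))
```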

Commands

Command Name Shortcut Keys Description
Add Multiple Items Creates multiple items as children of the selected object.
Clear Children Deletes all children of the selected object(s).
Export to Zip Archive Exports a set of Grooper nodes to a ZIP archive.
Publish to Grooper Repository Publishes one or more Nodes to one or more Target Grooper Repositories.
Unpublish Unpublishes a set of Grooper Nodes from a Target Grooper Repository.

Tabs

Tab Name Description
Data Type - General Provides a user interface displaying the properties of a Data Type as well as an interface for testing the Data Type using test batch documents.
Grooper Node - Scripting Provides script viewing, compilation, management, and basic editing features.
Grooper Node - Contents Provides a user interface for viewing and managing the children of a Grooper Node.
Grooper Node - Advanced Displays detailed information about Grooper Node objects, and provides administrative functions for managing them.

See Also

Collation Provider, Culture Data, Data Pattern, Embedded Extractor, Grooper Node, Result Filter, Result Options, Result Processor, Storage Type

Used By

Embedded Extractor, Mail Import, Redact, Reference, Render