Grooper.Activities.Recognize

Recognizes the internal content of a document or page, and saves the resulting data for use by subsequent activities. This activity detects and reads the presentation elements which convey information in a document, such as text segments, barcodes, lines, check boxes, and other shapes. The resulting character data and layout information is saved on the Batch Folder or Batch Page object being processed, where it is available to subsequent activities which depend on recognition results.

Recognize should be executed in a Batch Process after any permanent image cleanup has been applied, and before any activities which depend on recognition output. (See further discussion below.) It can be executed at the page level or at the folder level.

Page-Level Operation

When executed at the page level, recognizes one page at a time. If the input page has PDF content (i.e. was created by splitting a PDF document), then the page will be handled as a PDF page for the purposes of native text extraction. Otherwise, the page will be processed as an image, and text extraction will be purely OCR-based.

Page-level processing is the preferred operating mode in many cases, as it maximizes the benefits from parallel processing.

Folder-Level Operation

When executed at the folder level, recognizes all pages of a multipage document at once. The input folder must have a PDF or image-based document attached. To recognize other document formats, such as Microsoft Word or HTML, use the Render activity to generate a PDF version prior to executing this activity.

Due to the CPU-intensive nature of recognition, folder-level processing may be unsuitable for large documents. For example, a single task running recognition on a 1000-page document could take 20 minutes to complete. This can result in long-running tasks which appear hung and services which are difficult to start and stop. In such cases, split the document into pages (see Content Action) and run recognition at the page level.
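
The trade-off above can be illustrated with some rough arithmetic. The following is a minimal sketch, not part of Grooper: the function name, the worker count of 8, and the assumption of a uniform per-page cost with no scheduling overhead are all hypothetical. Only the 1000-page, 20-minute figure comes from the text above.

```python
import math


def estimate_wall_clock_minutes(pages: int, seconds_per_page: float, workers: int) -> float:
    """Estimate elapsed minutes when recognition tasks are spread evenly over workers.

    Assumes every page costs the same and ignores task scheduling overhead.
    """
    if pages < 0 or seconds_per_page < 0 or workers < 1:
        raise ValueError("pages and seconds_per_page must be >= 0, workers must be >= 1")
    # Each worker processes at most ceil(pages / workers) pages, one after another.
    pages_per_worker = math.ceil(pages / workers)
    return pages_per_worker * seconds_per_page / 60.0


# Per-page cost implied by the example above: 1000 pages in 20 minutes.
SECONDS_PER_PAGE = 20 * 60 / 1000

# Folder level: one task handles the whole document.
folder_level = estimate_wall_clock_minutes(1000, SECONDS_PER_PAGE, workers=1)

# Page level: 1000 single-page tasks shared by 8 hypothetical workers.
page_level = estimate_wall_clock_minutes(1000, SECONDS_PER_PAGE, workers=8)

print(f"folder level: {folder_level:.1f} min, page level: {page_level:.1f} min")
```

Under these assumptions the single folder-level task runs for about 20 minutes, while the same work split into page-level tasks finishes in about 2.5 minutes, and no individual task runs long enough to look hung.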

Activities Depending on Recognize

Any activity which accesses the internal content of a document will require recognition results in order to function properly. The following are specific examples of cases where other activities depend on the output from Recognize:


Inherits from: Grooper.Core.UnattendedActivity

Constructors

Signature Description
New (gdb As GrooperDb)
Parameters
gdb
          Type: GrooperDb
          

Fields

Field Name Field Type Description
AlternateIpId As System.Guid System.Guid
Database As Grooper.GrooperDb Grooper.GrooperDb
DiagnosticInfo As Grooper.IP.DiagnosticInfo Grooper.IP.DiagnosticInfo
OcrProfileId As System.Guid System.Guid

Properties

Property Name Property Type Description
ActivityStats Grooper.StatDictionary Dictionary of statistics for the batch processing activity.
AlternateIP Grooper.IP.IpProfile An optional IP Profile used for detecting layout elements on PDF pages, in cases where OCR will not be used. The IP Profile specified here should contain IP Commands such as Line Detection, Box Detection, Barcode Detection, and Shape Detection. This IP Profile is used for detection purposes only, and the output image will be discarded.
ConcurrencyMode Grooper.ConcurrencyModeAttribute.ConcurrencyMode Specifies the parallel processing mode for this activity. This value determines the type of Thread Pool on which the activity can be executed. Can be one of the following values:
  • Multiple: Multiple instances can run concurrently.
  • PerMachine: Only a single instance can run per machine.
  • Single: Only a single instance can run per Grooper repository.
ErrorDisposition Grooper.Core.UnattendedActivity.IssueDisposition Determines what happens when an error occurs processing an activity. A combination of the following flags:
  • None: The issue will be ignored, and the item will complete successfully.
  • Flag: The associated Batch Folder or Batch Page will be flagged.
  • Log: The issue will be logged to the Grooper log. The log can be viewed from the Grooper Root node under the Batch Event Viewer tab.
  • Stop: The Batch will stop processing, be set to an error state, and all pending tasks will be deleted.
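
Because ErrorDisposition is a combination of flags, several behaviors can apply to the same error. The sketch below models that idea with Python's `enum.Flag`. It is an illustration only: the numeric values and the `describe` helper are assumptions, not Grooper's actual definition of IssueDisposition.

```python
from enum import Flag


class IssueDisposition(Flag):
    # Numeric values are hypothetical; only the flag names come from the list above.
    NONE = 0
    FLAG = 1
    LOG = 2
    STOP = 4


def describe(disposition: IssueDisposition) -> list:
    """List the actions implied by a combination of flags."""
    actions = []
    if IssueDisposition.FLAG in disposition:
        actions.append("flag the Batch Folder or Batch Page")
    if IssueDisposition.LOG in disposition:
        actions.append("write the issue to the Grooper log")
    if IssueDisposition.STOP in disposition:
        actions.append("stop the Batch and delete pending tasks")
    # With no flags set, the issue is ignored and the item completes successfully.
    return actions or ["ignore the issue"]


# Combine flags with the bitwise OR operator.
flag_and_log = IssueDisposition.FLAG | IssueDisposition.LOG
print(describe(flag_and_log))
print(describe(IssueDisposition.NONE))
```

A combined value such as Flag plus Log marks the item and records the issue while letting the Batch continue, whereas adding Stop halts processing.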
HasReferenceProperties System.Boolean Returns true if the object has properties which reference Grooper Node objects.
IsEmpty System.Boolean Returns true if all properties with a ViewableAttribute are set to their default value.
IsWriteable System.Boolean Returns true if the object is writable, or false if it is not.
LayoutDetectionSummary System.String Summarizes the non-text features to be detected based on the configured OCR Profile. Layout Data is information describing the non-text elements of a page, such as lines, check boxes, barcodes, and shapes. It is important to capture and save layout data in cases where data extraction logic relies on the location of these elements.

Layout Data is generated when the OCR Profile is configured with an IP Profile containing certain IP Commands. For example, Line Detection, Line Removal, Box Detection, Box Removal, Barcode Detection, Barcode Removal, Shape Detection, and Shape Removal are all examples of commands which generate layout data. Generally speaking, it is best to use the removal (dropout) variants of these commands during pre-OCR image processing.

MaximumConsecutiveErrors System.Int32 The maximum number of consecutive errors, after which a critical stop will be raised. A critical stop will cause services to stop running.
NativeTextExtraction Grooper.Activities.Recognize.TextExtractMode Specifies how text should be extracted from PDF files. When enabled, reads native PDF text segments directly, rather than through OCR. In applicable use cases, this mechanism delivers 100% accurate character extraction, avoiding the uncertainty of OCR.

Note that the text extraction process operates against all text objects drawn on the page - whether they are actually visible or not. Unexpected results can occur when the input document contains text drawn transparently or behind other objects.

This setting is only applicable if the input is PDF content. When running at the folder level, this means that the input document must be a native PDF document, or must have a PDF version generated by the Render activity. When running at the page level, this means that the page object must have been created by splitting a PDF document.

Can be one of the following values:
  • Full: Native text segments and form fields will be extracted.
  • Simple: Only native text segments will be extracted.
  • None: No effort will be made to read native text segments. PDF pages will be treated as images and processed through OCR.
OcrAssist Grooper.Activities.Recognize.OcrAssistMode Specifies the extent to which OCR will be used in conjunction with Native Text Extraction. It is common for PDF pages to be constructed from mixed content, where information is presented as a combination of native text, text annotations, and/or text drawn on images. In these cases, OCR Assist supplements the native text extraction using OCR. OCR results obtained through this process are combined with the native text segments to produce a complete output document. Can be one of the following values:
  • Auto: OCR will be applied selectively to PDF pages which contain images, text annotations, or other features requiring OCR.
  • Always: OCR will be performed on all PDF pages.
  • None: No OCR will be performed on PDF pages. Use this mode when processing PDF files that are entirely text-based.
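
The interaction between NativeTextExtraction and OcrAssist described above can be summarized as a small decision function. This is a sketch of the documented behavior, not Grooper's implementation: the `PageInfo` type, its fields, and the `plan_recognition` function are hypothetical names introduced for illustration.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Tuple


class TextExtractMode(Enum):
    FULL = "Full"
    SIMPLE = "Simple"
    NONE = "None"


class OcrAssistMode(Enum):
    AUTO = "Auto"
    ALWAYS = "Always"
    NONE = "None"


@dataclass
class PageInfo:
    # Hypothetical description of an input page.
    is_pdf: bool                # page has PDF content
    has_ocr_only_content: bool  # images, text annotations, or similar features


def plan_recognition(page: PageInfo, native: TextExtractMode,
                     assist: OcrAssistMode) -> Tuple[bool, bool]:
    """Return (read_native_text, run_ocr) for a page."""
    # Image pages, and PDF pages with native extraction disabled, go through OCR only.
    if not page.is_pdf or native is TextExtractMode.NONE:
        return (False, True)
    if assist is OcrAssistMode.ALWAYS:
        return (True, True)
    if assist is OcrAssistMode.AUTO:
        # OCR supplements native text only where the page needs it.
        return (True, page.has_ocr_only_content)
    # OcrAssistMode.NONE: rely on native text alone.
    return (True, False)


mixed_page = PageInfo(is_pdf=True, has_ocr_only_content=True)
print(plan_recognition(mixed_page, TextExtractMode.FULL, OcrAssistMode.AUTO))
```

The sketch does not distinguish Full from Simple, since the two modes differ only in whether form fields are extracted along with native text segments.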
OcrProfile Grooper.OCR.OcrProfile Specifies the OCR Profile to be used for text recognition. The OCR Profile specified here is used to perform full-page OCR, and to detect layout objects such as lines, check boxes, barcodes, and shapes.

The IP Profile assigned to the OCR Profile should include IP Commands which detect the set of layout objects needed for data extraction. See the 'Layout Detection Summary' property for more information.

PageFilter System.String Restricts recognition to specific page numbers. If this value is blank, recognition will be performed on all pages. Otherwise, recognition will only be performed on pages specified in the filter.
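
The blank-versus-filtered behavior of PageFilter can be sketched as follows. The filter syntax is not specified in this reference, so the comma-separated list of 1-based page numbers and ranges used here (for example `1-3,5`) is purely an assumption for illustration, as is the function name `pages_to_recognize`.

```python
def pages_to_recognize(page_filter: str, page_count: int) -> list:
    """Return the 1-based page numbers selected by a filter string.

    Assumed syntax (hypothetical): comma-separated page numbers and
    inclusive ranges, such as "1-3,5".
    """
    # A blank filter means recognition runs on every page.
    if not page_filter.strip():
        return list(range(1, page_count + 1))

    selected = set()
    for part in page_filter.split(","):
        part = part.strip()
        if not part:
            continue
        if "-" in part:
            start_text, end_text = part.split("-", 1)
            start, end = int(start_text), int(end_text)
        else:
            start = end = int(part)
        # Keep only page numbers that exist in the document.
        for page in range(start, end + 1):
            if 1 <= page <= page_count:
                selected.add(page)
    return sorted(selected)


print(pages_to_recognize("", 4))
print(pages_to_recognize("1-2, 4", 5))
```

Pages outside the document's range are silently dropped in this sketch; the actual activity may handle such input differently.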
Root Grooper.GrooperRoot Returns the root node.
StatNames System.Collections.Generic.IEnumerable(Of T) Returns all possible statistic names which could be logged for the Activity. Derived classes should override this method to return all stat names which will be used in calls to AddCustomStatValue().

Methods

Method Name Description
AddDiagImage(Name As String, Image As GrooperImage, Annotations As IEnumerable(Of Annotation))
Parameters
Name
          Type: String
          
 
Image
          Type: GrooperImage
          
 
Annotations
          Type: IEnumerable(Of Annotation)
          
EnableDiagMode()
GetAnnotations(PageIndex As Int32, Results As RecognitionResult) As IEnumerable(Of Annotation)
Parameters
PageIndex
          Type: Int32
          
 
Results
          Type: RecognitionResult
          
GetProperties() As PropertyDescriptorCollection
GetReferences() As List(Of GrooperNode) Returns a list of GrooperNode objects referenced in the properties of this object.
InsertDiagImage(Index As Int32, Name As String, Image As GrooperImage, Annotations As IEnumerable(Of Annotation))
Parameters
Index
          Type: Int32
          
 
Name
          Type: String
          
 
Image
          Type: GrooperImage
          
 
Annotations
          Type: IEnumerable(Of Annotation)
          
IsPropertyEnabled(PropertyName As String) As Nullable(Of Boolean) Defines whether a property is currently enabled.
Parameters
PropertyName
          Type: String
          The name of the property to determine the enabled state for.
IsPropertyVisible(PropertyName As String) As Nullable(Of Boolean) Defines whether a property is currently visible.
Parameters
PropertyName
          Type: String
          The name of the property to determine the visible state for.
IsType(Type As Type) As Boolean Returns true if the object is of the type specified, or if it derives from the type specified.
Parameters
Type
          Type: Type
          The type to check.
LogStatValue(Name As String, Value As Double) Adds a custom stat value to the Batch Processing Activity statistics.
Parameters
Name
          Type: String
          
 
Value
          Type: Double
          
ProcessTask(CurNode As BatchObject) Mandatory override to implement processing logic.
Parameters
CurNode
          Type: BatchObject
          The current batch object being processed.
RunTask(CurNode As BatchObject, ipd As IProgressDisplay) As RecognitionResult
Parameters
CurNode
          Type: BatchObject
          
 
ipd
          Type: IProgressDisplay
          
Serialize() As String Serializes the object.
SetDatabase(Database As GrooperDb) Sets the database connection of the object.
Parameters
Database
          Type: GrooperDb
          
ToString() As String Returns the display name for this activity type.
ValidateProperties() As ValidationErrorList Validates the properties of the object, returning a list of validation errors.
Verify()
WriteLogEntry(Message As String, pa() As Object()) Adds an entry to the Diagnostic Info Log.
Parameters
Message
          Type: String
          
 
pa
          Type: Object
          
WriteLogEntry(TabLevel As Int32, Message As String, pa() As Object()) Adds an entry to the Diagnostic Info Log.
Parameters
TabLevel
          Type: Int32
          Level to indent the message within the log.
 
Message
          Type: String
          
 
pa
          Type: Object