Grooper.Core.Lexical

Classifies documents based on their text content, using pre-configured training and/or rules. Lexical classification is configured by defining a set of Document Types, and then teaching Grooper to recognize each document type. This can be done by training Grooper with one or more samples of the document, or by defining hand-coded rules which identify the document type.

Training-Based Classification

The training-based approach measures document similarity by analyzing the frequency of features which appear in the document. In the simplest case, a "feature" is an individual word, and training is a process of recording the word frequencies of each document type. At classification time, the word frequencies found on a document will be compared to the word frequencies of document types in the training database, generating a similarity value for each document type.
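As a rough illustration (not Grooper's actual implementation), the comparison described above can be sketched as a cosine similarity between word-frequency vectors; the function and the sample training data below are hypothetical:

```python
from collections import Counter
import math

def similarity(doc_words, type_words):
    """Cosine similarity between two word-frequency vectors (illustrative only)."""
    a, b = Counter(doc_words), Counter(type_words)
    dot = sum(a[w] * b[w] for w in a)
    norm = (math.sqrt(sum(v * v for v in a.values()))
            * math.sqrt(sum(v * v for v in b.values())))
    return dot / norm if norm else 0.0

# Hypothetical "training database": word lists recorded per document type
trained = {
    "Invoice": "invoice number amount due remit payment invoice".split(),
    "Purchase Order": "purchase order quantity ship vendor".split(),
}

# At classification time, compare the document's words against each type
doc = "invoice amount due payment".split()
best = max(trained, key=lambda t: similarity(doc, trained[t]))
```

Each document type receives a similarity value, and the highest-scoring type wins.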

Training can be performed using the Content Type - Classification Testing tab, using the Classify Review activity, or using an instance of the Review activity configured to display the Classification Viewer control.

Rules-Based Classification

The rules-based approach relies on classification rules defined on individual Document Types, which take the form of 'Positive Extractor' and 'Negative Extractor' properties. When a document is being classified, rules are applied first, and therefore always take precedence over training-based results. If the Positive Extractor for 'Document Type A' produces any hits on the document, the document will be classified as 'Document Type A' with no further analysis. If the Negative Extractor for 'Document Type A' produces any hits, 'Document Type A' will be excluded from the set of possible results for the document.
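The precedence described above can be sketched in Python; the extractor callables and fallback classifier here are hypothetical stand-ins, not Grooper API:

```python
def classify(document_text, doc_types, trained_classifier):
    """Apply positive/negative extractor rules before falling back to training.

    doc_types: {name: {"positive": callable, "negative": callable}}, where each
    callable returns True if that extractor produces any hits on the text.
    trained_classifier: fallback over the remaining candidate type names.
    (All of these are illustrative assumptions, not Grooper's API.)
    """
    candidates = []
    for name, rules in doc_types.items():
        pos, neg = rules.get("positive"), rules.get("negative")
        if pos and pos(document_text):
            return name          # positive hit: classify immediately, no further analysis
        if neg and neg(document_text):
            continue             # negative hit: exclude this type from the candidates
        candidates.append(name)
    # No rule decided; fall back to training-based comparison over candidates
    return trained_classifier(document_text, candidates)
```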

Usage Notes

If the corpus contains a small number of structured document types, a rules-based approach can yield highly accurate results with minimal setup time. Simply define a positive rule on each document type which targets a unique text string on the document, such as the title.

For corpora containing a large number of document types, or highly unstructured document types, a rules-based approach is less practical, because there are too many possibilities to account for with manual rules. In these cases, a training-based approach is more suitable.

In practice, many implementations use a combination of rules-based and training-based classification, where the bulk of classification is training-based, and rules are used to fine-tune the distinction between similar document types, prevent false positives, and so on.


Inherits from: Grooper.Core.ClassifyMethod

Constructors

Signature Description
New (Owner As ConnectedObject)
Parameters
Owner
          Type: ConnectedObject
          

Fields

Field Name Field Type Description
Database Grooper.GrooperDb The database connection of the object.

Properties

Property Name Property Type Description
EpiExtractor Grooper.Core.EmbeddedExtractor Defines an extractor which is used by ESP Auto Separation to find page numbers embedded in the document content. The provided Extractor must define and output a group named 'PageNo', and optionally may define and output a group named 'PageCount'. For example, if the document set contains page numbering like 'Page 1 of 4', the following pattern would generate the required group names: Page (?<PageNo>\d+) of (?<PageCount>\d+).
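For instance, the 'Page 1 of 4' pattern above can be tested outside Grooper with an ordinary regex engine; note that Python spells named groups `(?P<Name>...)` where .NET uses `(?<Name>...)`:

```python
import re

# The page-numbering pattern from the property description above, in Python syntax;
# the extractor must output groups named 'PageNo' and (optionally) 'PageCount'.
pattern = re.compile(r"Page (?P<PageNo>\d+) of (?P<PageCount>\d+)")

m = pattern.search("... Page 3 of 12 ...")
page_no = int(m.group("PageNo"))        # 3
page_count = int(m.group("PageCount"))  # 12
```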
HasReferenceProperties System.Boolean Returns true if the object has properties which reference Grooper Node objects.
ImageFeatureExtractor Grooper.IP.IpProfile An optional IP Profile to be used for extracting image-based features.
IsEmpty System.Boolean Returns true if all properties with a ViewableAttribute are set to their default value.
IsWriteable System.Boolean Returns true if the object is writable, or false if it is not.
Owner Grooper.ConnectedObject Returns the node that owns the connected object, if any.
OwnerNode Grooper.GrooperNode Returns the node that owns the connected object, if any.
Root Grooper.GrooperRoot Returns the root node.
SmoothIDF System.Boolean Defines how the frequency of a feature across the set of document types impacts its weighting. When this value is false, the standard inverse document frequency (IDF) is used: IDF = Log(Classes / ClassesWithFeature). In this mode, a feature appearing in all classes will have an IDF of 0.

When this value is true, +1 smoothing is added: IDF = Log(1 + Classes / ClassesWithFeature). The most notable impact of this is that the IDF can never reach zero, even if the feature appears in every class.
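The two formulas above can be checked directly; a minimal sketch:

```python
import math

def idf(classes, classes_with_feature, smooth=False):
    """Inverse document frequency as described above, with optional +1 smoothing."""
    ratio = classes / classes_with_feature
    return math.log(1 + ratio) if smooth else math.log(ratio)

# A feature appearing in all classes: standard IDF is 0, smoothed IDF is not.
standard = idf(10, 10)               # log(1) = 0.0
smoothed = idf(10, 10, smooth=True)  # log(2) ≈ 0.693
```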

SubLinearScaling System.Boolean When enabled, term frequency values will be scaled logarithmically. Term frequency (TF) refers to the number of times a term occurs in a document. In many cases it seems unlikely that 20 occurrences of a term in a document truly carry 20 times the significance of a single occurrence. Accordingly, there has been considerable research into variants of term frequency that go beyond counting the number of occurrences of a term. This option enables a common modification to TF-IDF which uses the logarithm of the term frequency rather than the raw term frequency.
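A common form of this scaling is 1 + log(TF) for nonzero counts; the exact formula Grooper uses is not documented here, so the sketch below assumes that standard variant:

```python
import math

def weighted_tf(tf, sublinear=False):
    """Raw vs. logarithmically scaled term frequency (assumes the common 1+log(TF) variant)."""
    if tf <= 0:
        return 0.0
    return 1 + math.log(tf) if sublinear else float(tf)

# 20 occurrences carry far less than 20x the weight of a single occurrence:
raw = weighted_tf(20)                   # 20.0
scaled = weighted_tf(20, sublinear=True)  # 1 + log(20) ≈ 3.996
```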
TextFeatureExtractor Grooper.Core.EmbeddedExtractor Matches the features on a document which should be used for training-based classification. Each value matched by this extractor will be considered a feature. Training-based classification measures the similarity between a document and a Document Type by comparing the set of "features" which appear on a document to the set of features found on previously-trained examples.

Single Words

The most common type of feature used in classification is a single word, or "unigram". When classifying on unigrams, there are a few important things to consider when building a feature extractor:

  • Features which appear more frequently are weighted higher. Because of this, it is important to filter out stop words. Stop words are words such as "and" or "the" which appear frequently, but have little value in a classification process. It is important that the feature extractor use a lexicon of stop words to filter these out. When good stop word filtering is absent, the frequencies of stop words can dominate the classification model, reducing Grooper's ability to distinguish between document types. There are many online sources for stop word lists, and Grooper provides downloadable stop word lists in 30+ languages.
  • Word stemming can produce better classification features. If the extractor has Porter Stemming enabled, then each word found on the document will be reduced to a root form. For example, the words 'insurance', 'insure', 'insuring', 'insured', and 'insures' would all stem down to the root word 'insur'. Stemming reduces the amount of training required, because Grooper does not have to learn on its own that every variation means the same thing.
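The two points above can be illustrated together; the stop-word list and suffix stripper below are deliberately tiny toys, not the Porter algorithm or Grooper's lexicons:

```python
STOP_WORDS = {"and", "the", "of", "a", "is", "to", "in"}  # tiny illustrative lexicon

def toy_stem(word):
    """Greatly simplified suffix stripping; a real Porter stemmer does far more."""
    for suffix in ("ances", "ance", "ings", "ing", "ed", "es", "s", "e"):
        if word.endswith(suffix) and len(word) - len(suffix) >= 4:
            return word[: len(word) - len(suffix)]
    return word

def features(text):
    """Lowercase, drop stop words, stem the rest -- the extractor's job, in miniature."""
    words = [w.lower() for w in text.split()]
    return [toy_stem(w) for w in words if w not in STOP_WORDS]

# 'insured', 'insurance', and 'insuring' all reduce to the same root feature:
features("the insured and the insurance of insuring")  # ['insur', 'insur', 'insur']
```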

nGram Features (bigrams, trigrams, etc.)

In cases where word frequency alone cannot distinguish all document types in the corpus, nGram features can be used. nGrams are sequences of words: a bigram is a pair of two adjacent words, a trigram is a triplet of three adjacent words, and so on. While it is informative to know that a document contains the word "well" and the word "oil", it is much more informative to know that the document contains the phrase "oil well".

nGram extraction can add significant processing overhead, and should only be used in cases where other methods have been deemed inadequate.
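Generating nGram features from a token stream is straightforward; a minimal sketch:

```python
def ngrams(tokens, n):
    """All runs of n adjacent tokens, each joined into a single feature string."""
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

tokens = "the oil well was capped".split()
bigrams = ngrams(tokens, 2)  # ['the oil', 'oil well', 'well was', 'was capped']
```

The overhead noted above comes from the multiplied feature count: every position in the text contributes a feature per nGram size in use.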

Other Feature Types

Feature extractors can target more than just words - they can also target data and elements of natural language. This is accomplished by creating an extractor which matches information on the document, but returns a text token in its place. For example, a document containing many VIN numbers is more likely to be a Vehicle Inventory Sheet than a Sick Leave Request. Including a feature extractor which matches VIN numbers on the document and returns the token "VIN_Number" would allow the classification engine to consider the fact that the document contains a lot of VIN numbers.
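The token-substitution idea can be sketched with a regex replacement; the VIN pattern below is a simplified assumption (17 characters, excluding I, O, and Q) used purely for illustration:

```python
import re

# Hypothetical, simplified VIN pattern -- not a validated production rule
VIN = re.compile(r"\b[A-HJ-NPR-Z0-9]{17}\b")

def tokenize_vins(text):
    """Replace each VIN match with the fixed feature token 'VIN_Number'."""
    return VIN.sub("VIN_Number", text)

tokenize_vins("Unit 1: 1HGCM82633A004352 on lot B")
# -> 'Unit 1: VIN_Number on lot B'
```

The classifier then sees the same "VIN_Number" feature regardless of which specific VIN appeared, so its frequency can be learned like any other word.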

UseCF System.Boolean When enabled, class frequency will be considered in lexical feature weightings. Class frequency (CF) refers to the number of times a feature appears within the documents of a class. Enabling this option changes the underlying weighting mechanism from TF-IDF to TF-IDF-CF. See 'An improvement of TF-IDF weighting in text categorization' - Mingyong Liu and Jiangang Yang, Zhejiang University - 2012 International Conference on Computer Technology and Science.

Methods

Method Name Description
ClassifyFolder(Folder As BatchFolder, Scope As ContentType, Level As ClassificationLevel) As ContentTypeCandidateList
Parameters
Folder
          Type: BatchFolder
          
 
Scope
          Type: ContentType
          
 
Level
          Type: ClassificationLevel
          
ClassifyPage(Page As BatchPage, Scope As ContentType) As PageTypeCandidateList
Parameters
Page
          Type: BatchPage
          
 
Scope
          Type: ContentType
          
ClearCache()
ComparePage(pt As PageType, Features As FeatureDictionary) As Double
Parameters
pt
          Type: PageType
          
 
Features
          Type: FeatureDictionary
          
ExtractEpiData(Source As DataInstance, Result As PageTypeCandidateList)
Parameters
Source
          Type: DataInstance
          
 
Result
          Type: PageTypeCandidateList
          
ExtractFeatures(Page As IPage) As FeatureDictionary
Parameters
Page
          Type: IPage
          
GenerateModel(ct As ContentType, Language As CultureData) As TfIdfModel
Parameters
ct
          Type: ContentType
          
 
Language
          Type: CultureData
          
GetCompiler(Scope As ContentType, Level As ClassificationLevel, Culture As CultureData) As TfIdfWeightingCompiler
Parameters
Scope
          Type: ContentType
          
 
Level
          Type: ClassificationLevel
          
 
Culture
          Type: CultureData
          
GetProperties() As PropertyDescriptorCollection
GetReferences() As List(Of GrooperNode) Returns a list of GrooperNode objects referenced in the properties of this object.
IsPropertyEnabled(PropertyName As String) As Nullable(Of Boolean) Defines whether a property is currently enabled.
Parameters
PropertyName
          Type: String
          The name of the property to determine the enabled state for.
IsPropertyVisible(PropertyName As String) As Nullable(Of Boolean) Defines whether a property is currently visible.
Parameters
PropertyName
          Type: String
          The name of the property to determine the visible state for.
IsType(Type As Type) As Boolean Returns true if the object is of the type specified, or if it derives from the type specified.
Parameters
Type
          Type: Type
          The type to check.
Serialize() As String Serializes the object.
SetDatabase(Database As GrooperDb) Sets the database connection of the object.
Parameters
Database
          Type: GrooperDb
          
SetOwner(Owner As ConnectedObject, SkipInitialization As Boolean) Sets the owner of the connected object with another object that implements the IConnected interface.
Parameters
Owner
          Type: ConnectedObject
          
 
SkipInitialization
          Type: Boolean
          
ToString() As String Returns a string value representation of the connected object.
TrainPages(dt As DocumentType, Pages As IEnumerable(Of IPage), Culture As CultureData, ipd As IProgressDisplay) As List(Of PageType) Trains this document type with the provided document sample. The provided sample is a list of pages which make up the document.
Parameters
dt
          Type: DocumentType
          
 
Pages
          Type: IEnumerable`1
          The document to be trained.
 
Culture
          Type: CultureData
          
 
ipd
          Type: IProgressDisplay
          
ValidateCache()
ValidateProperties() As ValidationErrorList Validates the properties of the object, returning a list of validation errors.