Grooper.Core.Lexical

Classifies documents based on their text content, using pre-configured training and/or rules. Lexical classification is configured by defining a set of Document Types, and then teaching Grooper to recognize each document type. This can be done by training Grooper with one or more samples of the document, or by defining hand-coded rules which identify the document type.

Training-Based Classification

The training-based approach measures document similarity by analyzing the frequency of features which appear in the document. In the simplest case, a "feature" is an individual word, and training is a process of recording the word frequencies of each document type. At classification time, the word frequencies found on a document will be compared to the word frequencies of document types in the training database, generating a similarity value for each document type.
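As a rough illustration (not Grooper's actual implementation), the comparison described above can be sketched as a cosine similarity between word-frequency vectors; the function and the sample training data below are hypothetical:

```python
from collections import Counter
import math

def similarity(doc_words, type_words):
    """Cosine similarity between two word-frequency vectors (illustrative only)."""
    a, b = Counter(doc_words), Counter(type_words)
    dot = sum(a[w] * b[w] for w in a)
    norm = (math.sqrt(sum(v * v for v in a.values()))
            * math.sqrt(sum(v * v for v in b.values())))
    return dot / norm if norm else 0.0

# Hypothetical "training database": word lists recorded per document type
trained = {
    "Invoice": "invoice number amount due remit payment invoice".split(),
    "Purchase Order": "purchase order quantity ship vendor".split(),
}

# At classification time, compare the document's words against each type
doc = "invoice amount due payment".split()
best = max(trained, key=lambda t: similarity(doc, trained[t]))
```

Each document type receives a similarity value, and the highest-scoring type wins.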

Training can be performed using the Content Type - Classification Testing tab, using the Classify Review activity, or using an instance of the Review activity configured to display the Classification Viewer control.

Rules-Based Classification

The rules-based approach relies on classification rules defined on individual Document Types, which take the form of 'Positive Extractor' and 'Negative Extractor' properties. When a document is being classified, rules are applied first, and therefore always take precedence over training-based results. If the Positive Extractor for 'Document Type A' produces any hits on the document, the document will be classified as 'Document Type A' with no further analysis. If the Negative Extractor for 'Document Type A' produces any hits, 'Document Type A' will be excluded from the set of possible results for the document.
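The precedence described above can be sketched in Python; the extractor callables and fallback classifier here are hypothetical stand-ins, not Grooper API:

```python
def classify(document_text, doc_types, trained_classifier):
    """Apply positive/negative extractor rules before falling back to training.

    doc_types: {name: {"positive": callable, "negative": callable}}, where each
    callable returns True if that extractor produces any hits on the text.
    trained_classifier: fallback over the remaining candidate type names.
    (All of these are illustrative assumptions, not Grooper's API.)
    """
    candidates = []
    for name, rules in doc_types.items():
        pos, neg = rules.get("positive"), rules.get("negative")
        if pos and pos(document_text):
            return name          # positive hit: classify immediately, no further analysis
        if neg and neg(document_text):
            continue             # negative hit: exclude this type from the candidates
        candidates.append(name)
    # No rule decided; fall back to training-based comparison over candidates
    return trained_classifier(document_text, candidates)
```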

Usage Notes

If the corpus contains a small number of structured document types, a rules-based approach can yield highly accurate results with minimal setup time. Simply define a positive rule on each document type which targets a unique text string on the document, such as the title.

For corpora containing a large number of document types, or highly unstructured document types, a rules-based approach is less practical, because there are too many possibilities to account for with manual rules. In these cases, a training-based approach is more suitable.

In practice, many implementations use a combination of rules-based and training-based classification, where the bulk of classification is training-based, and rules are used to fine-tune the distinction between similar document types, prevent false positives, and so on.


Inherits from: Grooper.Core.ClassifyMethod

Constructors

Signature Description
New (Owner As ConnectedObject)
Parameters
Owner
          Type: ConnectedObject
          

Fields

Field Name Field Type Description
Database Grooper.GrooperDb The database connection of the object.

Properties

Property Name Property Type Description
EpiExtractor Grooper.Core.EmbeddedExtractor Defines an extractor which is used by ESP Auto Separation to find page numbers embedded in the document content. The provided Extractor must define and output a group named 'PageNo', and optionally may define and output a group named 'PageCount'. For example, if the document set contains page numbering like 'Page 1 of 4', the following pattern would generate the required group names: Page (?<PageNo>\d+) of (?<PageCount>\d+).
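For instance, the 'Page 1 of 4' pattern above can be tested outside Grooper with an ordinary regex engine; note that Python spells named groups `(?P<Name>...)` where .NET uses `(?<Name>...)`:

```python
import re

# The page-numbering pattern from the property description above, in Python syntax;
# the extractor must output groups named 'PageNo' and (optionally) 'PageCount'.
pattern = re.compile(r"Page (?P<PageNo>\d+) of (?P<PageCount>\d+)")

m = pattern.search("... Page 3 of 12 ...")
page_no = int(m.group("PageNo"))        # 3
page_count = int(m.group("PageCount"))  # 12
```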
HasReferenceProperties System.Boolean Returns true if the object has properties which reference Grooper Node objects.
ImageFeatureExtractor Grooper.IP.IpProfile An optional IP Profile to be used for extracting image-based features.
IsEmpty System.Boolean Returns true if all properties with a ViewableAttribute are set to their default value.
IsWriteable System.Boolean Returns true if the object is writable, or false if it is not.
Owner Grooper.ConnectedObject Returns the node that owns the connected object, if any.
OwnerNode Grooper.GrooperNode Returns the node that owns the connected object, if any.
Root Grooper.GrooperRoot Returns the root node.
SmoothIDF System.Boolean Defines how the frequency of a feature across the set of document types impacts its weighting. When this value is false, the standard inverse document frequency (IDF) is used: IDF = Log(Classes / ClassesWithFeature). In this mode, a feature appearing in all classes will have an IDF of 0.

When this value is true, +1 smoothing is added: IDF = Log(1 + Classes / ClassesWithFeature). The most notable impact of this is that the IDF can never reach zero, even if the feature appears in every class.
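The two formulas above can be checked directly; a minimal sketch:

```python
import math

def idf(classes, classes_with_feature, smooth=False):
    """Inverse document frequency as described above, with optional +1 smoothing."""
    ratio = classes / classes_with_feature
    return math.log(1 + ratio) if smooth else math.log(ratio)

# A feature appearing in all classes: standard IDF is 0, smoothed IDF is not.
standard = idf(10, 10)               # log(1) = 0.0
smoothed = idf(10, 10, smooth=True)  # log(2) ≈ 0.693
```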

SubLinearScaling System.Boolean When enabled, term frequency values will be scaled logarithmically. Term frequency (TF) refers to the number of times a term occurs in a document. In many cases it seems unlikely that 20 occurrences of a term in a document truly carry 20 times the significance of a single occurrence. Accordingly, there has been considerable research into variants of term frequency that go beyond counting the number of occurrences of a term. This option enables a common modification to TF-IDF which uses the logarithm of the term frequency rather than the raw term frequency.
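A common form of this scaling is 1 + log(TF) for nonzero counts; the exact formula Grooper uses is not documented here, so the sketch below assumes that standard variant:

```python
import math

def weighted_tf(tf, sublinear=False):
    """Raw vs. logarithmically scaled term frequency (assumes the common 1+log(TF) variant)."""
    if tf <= 0:
        return 0.0
    return 1 + math.log(tf) if sublinear else float(tf)

# 20 occurrences carry far less than 20x the weight of a single occurrence:
raw = weighted_tf(20)                   # 20.0
scaled = weighted_tf(20, sublinear=True)  # 1 + log(20) ≈ 3.996
```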
TextFeatureExtractor Grooper.Core.EmbeddedExtractor Matches the features on a document which should be used for training-based classification. Each value matched by this extractor will be considered a feature. Training-based classification measures the similarity between a document and a Document Type by comparing the set of "features" which appear on a document to the set of features found on previously-trained examples.

Single Words

The most common type of feature used in classification is a single word, or "unigram". When classifying on unigrams, there are a few important things to consider when building a feature extractor:

  • Features which appear more frequently are weighted higher. Because of this, it is important to filter out stop words. Stop words are words such as "and" or "the" which appear frequently, but have little value in a classification process. It is important that the feature extractor use a lexicon of stop words to filter these out. When good stop word filtering is absent, the frequencies of stop words can dominate the classification model, reducing Grooper's ability to distinguish between document types. There are many online sources for stop word lists, and Grooper provides downloadable stop word lists in 30+ languages.
  • Word stemming can produce better classification features. If the extractor has Porter Stemming enabled, then each word found on the document will be reduced to a root form. For example, the words 'insurance', 'insure', 'insuring', 'insured', and 'insures' would all stem down to the root word 'insur'. Stemming reduces the amount of training required, because Grooper does not have to learn on its own that every variation means the same thing.
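The two points above can be illustrated together; the stop-word list and suffix stripper below are deliberately tiny toys, not the Porter algorithm or Grooper's lexicons:

```python
STOP_WORDS = {"and", "the", "of", "a", "is", "to", "in"}  # tiny illustrative lexicon

def toy_stem(word):
    """Greatly simplified suffix stripping; a real Porter stemmer does far more."""
    for suffix in ("ances", "ance", "ings", "ing", "ed", "es", "s", "e"):
        if word.endswith(suffix) and len(word) - len(suffix) >= 4:
            return word[: len(word) - len(suffix)]
    return word

def features(text):
    """Lowercase, drop stop words, stem the rest -- the extractor's job, in miniature."""
    words = [w.lower() for w in text.split()]
    return [toy_stem(w) for w in words if w not in STOP_WORDS]

# 'insured', 'insurance', and 'insuring' all reduce to the same root feature:
features("the insured and the insurance of insuring")  # ['insur', 'insur', 'insur']
```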

nGram Features (bigrams, trigrams, etc.)

In cases where word frequency alone cannot distinguish all document types in the corpus, nGram features can be used. nGrams are sequences of words: a bigram is a pair of two adjacent words, a trigram is a triplet of three adjacent words, and so on. While it is informative to know that a document contains the word "well" and the word "oil", it is much more informative to know that the document contains the phrase "oil well".

nGram extraction can add significant processing overhead, and should only be used in cases where other methods have been deemed inadequate.
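Generating nGram features from a token stream is straightforward; a minimal sketch:

```python
def ngrams(tokens, n):
    """All runs of n adjacent tokens, each joined into a single feature string."""
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

tokens = "the oil well was capped".split()
bigrams = ngrams(tokens, 2)  # ['the oil', 'oil well', 'well was', 'was capped']
```

The overhead noted above comes from the multiplied feature count: every position in the text contributes a feature per nGram size in use.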

Other Feature Types

Feature extractors can target more than just words - they can also target data and elements of natural language. This is accomplished by creating an extractor which matches information on the document, but returns a text token in its place. For example, a document containing many VIN numbers is more likely to be a Vehicle Inventory Sheet than a Sick Leave Request. Including a feature extractor which matches VIN numbers on the document and returns the token "VIN_Number" would allow the classification engine to consider the fact that the document contains a lot of VIN numbers.
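The token-substitution idea can be sketched with a regex replacement; the VIN pattern below is a simplified assumption (17 characters, excluding I, O, and Q) used purely for illustration:

```python
import re

# Hypothetical, simplified VIN pattern -- not a validated production rule
VIN = re.compile(r"\b[A-HJ-NPR-Z0-9]{17}\b")

def tokenize_vins(text):
    """Replace each VIN match with the fixed feature token 'VIN_Number'."""
    return VIN.sub("VIN_Number", text)

tokenize_vins("Unit 1: 1HGCM82633A004352 on lot B")
# -> 'Unit 1: VIN_Number on lot B'
```

The classifier then sees the same "VIN_Number" feature regardless of which specific VIN appeared, so its frequency can be learned like any other word.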

UseCF System.Boolean When enabled, class frequency will be considered in lexical feature weightings. Class frequency (CF) refers to the number of times a feature appears within the documents of a class. Enabling this option changes the underlying weighting mechanism from TF-IDF to TF-IDF-CF. See 'An improvement of TF-IDF weighting in text categorization' - Mingyong Liu and Jiangang Yang, Zhejiang University - 2012 International Conference on Computer Technology and Science.

Methods

Method Name Description
ClassifyFolder(Folder As BatchFolder, Scope As ContentType, Level As ClassificationLevel) As ContentTypeCandidateList
Parameters
Folder
          Type: BatchFolder
          
 
Scope
          Type: ContentType
          
 
Level
          Type: ClassificationLevel
          
ClassifyPage(Page As BatchPage, Scope As ContentType) As PageTypeCandidateList
Parameters
Page
          Type: BatchPage
          
 
Scope
          Type: ContentType
          
ClearCache()
ComparePage(pt As PageType, Features As FeatureDictionary) As Double
Parameters
pt
          Type: PageType
          
 
Features
          Type: FeatureDictionary
          
ExtractEpiData(Source As DataInstance, Result As PageTypeCandidateList)
Parameters
Source
          Type: DataInstance
          
 
Result
          Type: PageTypeCandidateList
          
ExtractFeatures(Page As IPage) As FeatureDictionary
Parameters
Page
          Type: IPage
          
GenerateModel(ct As ContentType, Language As CultureData) As TfIdfModel
Parameters
ct
          Type: ContentType
          
 
Language
          Type: CultureData
          
GetCompiler(Scope As ContentType, Level As ClassificationLevel, Culture As CultureData) As TfIdfWeightingCompiler
Parameters
Scope
          Type: ContentType
          
 
Level
          Type: ClassificationLevel
          
 
Culture
          Type: CultureData
          
GetProperties() As PropertyDescriptorCollection
GetReferences() As List(Of GrooperNode) Returns a list of GrooperNode objects referenced in the properties of this object.
IsPropertyEnabled(PropertyName As String) As Nullable(Of Boolean) Defines whether a property is currently enabled.
Parameters
PropertyName
          Type: String
          The name of the property to determine the enabled state for.
IsPropertyVisible(PropertyName As String) As Nullable(Of Boolean) Defines whether a property is currently visible.
Parameters
PropertyName
          Type: String
          The name of the property to determine the visible state for.
IsType(Type As Type) As Boolean Returns true if the object is of the type specified, or if it derives from the type specified.
Parameters
Type
          Type: Type
          The type to check.
Serialize() As String Serializes the object.
SetDatabase(Database As GrooperDb) Sets the database connection of the object.
Parameters
Database
          Type: GrooperDb
          
SetOwner(Owner As ConnectedObject, SkipInitialization As Boolean) Sets the owner of the connected object with another object that implements the IConnected interface.
Parameters
Owner
          Type: ConnectedObject
          
 
SkipInitialization
          Type: Boolean
          
ToString() As String Returns a string value representation of the connected object.
TrainPages(dt As DocumentType, Pages As IEnumerable(Of IPage), Culture As CultureData, ipd As IProgressDisplay) As List(Of PageType) Trains this document type with the provided document sample. The provided sample is a list of pages which make up the document.
Parameters
dt
          Type: DocumentType
          
 
Pages
          Type: IEnumerable`1
          The document to be trained.
 
Culture
          Type: CultureData
          
 
ipd
          Type: IProgressDisplay
          
ValidateCache()
ValidateProperties() As ValidationErrorList Validates the properties of the object, returning a list of validation errors.