Grooper.Core.Lexical
Classifies documents based on their text content, using pre-configured training and/or rules.
Lexical classification is configured by defining a set of Document Types, and then teaching Grooper to recognize
each one. This can be done by training Grooper with one or more sample documents, or by defining hand-coded rules which
identify the document type.
Training-Based Classification
The training-based approach measures document similarity by analyzing the frequency of features which appear in the document.
In the simplest case, a "feature" is an individual word, and training is a process of recording the word frequencies of each
document type. At classification time, the word frequencies found on a document will be compared to the word frequencies of
document types in the training database, generating a similarity value for each document type.
Training can be performed using the Content Type - Classification Testing tab, using the Classify Review activity, or using an instance
of the Review activity configured to display the Classification Viewer control.
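The frequency-comparison idea above can be sketched in a few lines of Python. This is an illustrative sketch only: the function names and the use of cosine similarity are assumptions for the example, not Grooper's internal implementation.

```python
from collections import Counter
from math import sqrt

def word_frequencies(text):
    """Count how often each word appears in a document's text."""
    return Counter(text.lower().split())

def cosine_similarity(a, b):
    """Compare two frequency profiles; 1.0 = identical, 0.0 = no overlap."""
    dot = sum(count * b[word] for word, count in a.items() if word in b)
    norm = sqrt(sum(v * v for v in a.values())) * sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0

def classify(text, training_profiles):
    """Score a document against the trained frequency profile of each document type."""
    doc = word_frequencies(text)
    return {doc_type: cosine_similarity(doc, profile)
            for doc_type, profile in training_profiles.items()}
```

A document whose word frequencies resemble one type's trained profile more than another's receives a higher similarity value for that type, which is the basis for choosing a classification candidate.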
Rules-Based Classification
The rules-based approach relies on classification rules defined on individual Document Types, which take the form of
the 'Positive Extractor' and 'Negative Extractor' properties. When a document is being classified, rules are applied first,
and therefore always take precedence over training-based results. If the Positive Extractor for 'Document Type A' produces any hits on the document, then
the document will be classified as 'Document Type A' with no further analysis. If the Negative Extractor for 'Document Type A'
produces any hits, then 'Document Type A' will be excluded from the set of possible results for the document.
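The precedence described above can be sketched as follows. This is illustrative Python only; the dictionary-of-regexes representation is a hypothetical stand-in for Grooper's extractor objects.

```python
import re

def apply_rules(text, doc_types):
    """Apply positive/negative rules before any training-based scoring.

    doc_types maps a type name to optional 'positive' / 'negative' regex
    patterns (hypothetical stand-ins for configured extractors).
    Returns (matched_type, remaining_candidates).
    """
    candidates = set(doc_types)
    for name, rules in doc_types.items():
        positive = rules.get("positive")
        if positive and re.search(positive, text):
            return name, set()          # positive hit: classify immediately
        negative = rules.get("negative")
        if negative and re.search(negative, text):
            candidates.discard(name)    # negative hit: exclude this type
    return None, candidates             # fall through to training-based scoring
```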
Usage Notes
If the corpus contains a small number of structured document types, a rules-based approach can yield highly accurate results with minimal setup time. Simply define
a positive rule on each document type which targets a unique text string on the document, such as the title.
For corpuses containing a large number of document types, or which contain highly unstructured document types, a rules-based approach is less practical, because there are
too many possibilities to account for with manual rules. In these cases, a training-based approach is more suitable.
In practice, many implementations use a combination of rules-based and training-based classification, where the bulk of classification is training-based, and rules are
used to fine-tune the distinction between similar document types, prevent false positives, and so on.
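For example, a positive rule targeting a document's title might be expressed as a regular expression like the following. The 'Certificate of Insurance' document type is a hypothetical example, and Grooper extractors are configured in the product rather than written in Python; this only illustrates the matching logic.

```python
import re

# Hypothetical positive rule for an assumed 'Certificate of Insurance' type:
# anchor on the unique title text, tolerating whitespace and case variation.
TITLE_RULE = re.compile(r"certificate\s+of\s+insurance", re.IGNORECASE)

def positive_hit(ocr_text):
    """True if the rule fires anywhere in the document's OCR text."""
    return TITLE_RULE.search(ocr_text) is not None
```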
Inherits from: Grooper.Core.ClassifyMethod
Constructors
Signature |
Description |
New (Owner As ConnectedObject) |
Parameters |
Owner |
Type: ConnectedObject |
|
|
Fields
Field Name |
Field Type |
Description |
Database As Grooper.GrooperDb |
Grooper.GrooperDb |
|
Properties
Property Name |
Property Type |
Description |
EpiExtractor |
Grooper.Core.EmbeddedExtractor |
Defines an extractor which is used by ESP Auto Separation to find page numbers embedded in the document content. The provided Extractor must define and output a group named 'PageNo', and optionally may define and output a group named 'PageCount'. For example, if the document set
contains page numbering like 'Page 1 of 4', the following pattern would generate the required group names:
Page (?<PageNo>\d+) of (?<PageCount>\d+).
|
HasReferenceProperties |
System.Boolean |
Returns true if the object has properties which reference Grooper Node objects. |
ImageFeatureExtractor |
Grooper.IP.IpProfile |
An optional IP Profile to be used for extracting image-based features. |
IsEmpty |
System.Boolean |
Returns true if all properties with a ViewableAttribute are set to their default value. |
IsWriteable |
System.Boolean |
Returns true if the object is writable, or false if it is not. |
Owner |
Grooper.ConnectedObject |
Returns the node that owns the connected object, if any. |
OwnerNode |
Grooper.GrooperNode |
Returns the node that owns the connected object, if any. |
Root |
Grooper.GrooperRoot |
Returns the root node. |
SmoothIDF |
System.Boolean |
Defines how the frequency of a feature across the set of document types impacts its weighting. When this value is false, the standard inverse document frequency (IDF) is used: IDF = Log(Classes / ClassesWithFeature). In this mode, a feature appearing in all classes
will have an IDF of 0.
When this value is true, +1 smoothing is added: IDF = Log(1 + Classes / ClassesWithFeature). The most notable impact of this is that the IDF can never reach zero,
even if the feature appears in every class. |
SubLinearScaling |
System.Boolean |
When enabled, term frequency values will be scaled logarithmically. Term frequency (TF) refers to the number of times a term occurs in a document. In many cases it seems unlikely that 20 occurrences of a term in a document truly carry 20 times the significance of a single occurrence. Accordingly, there
has been considerable research into variants of term frequency that go beyond counting the number of occurrences of a term. This option enables a common modification
to TF-IDF which uses the logarithm of the term frequency rather than the raw term frequency. |
TextFeatureExtractor |
Grooper.Core.EmbeddedExtractor |
Matches the features on a document which should be used for training-based classification. Each value matched by this extractor will be considered a feature. Training-based classification measures the similarity between a document and a Document Type by comparing the set of "features" which appear on the
document to the set of features found on previously-trained examples.
Single Words
The most common type of feature used in classification is a single word, or "unigram". When classifying on unigrams, there are a few important things to consider when building
a feature extractor:
- Features which appear more frequently are weighted higher. Because of this, it is important to filter out
stop words. Stop words are words such as "and" or "the" which appear frequently, but have little
value in a classification process. It is important that the feature extractor use a lexicon of stop words to filter these out. When good stop word filtering is absent,
the frequencies of stop words can dominate the classification model, reducing Grooper's ability to distinguish between document types. There are many online sources for
stop word lists, and Grooper provides downloadable stop word lists in 30+ languages.
- Word stemming can produce better classification features. If the extractor has Porter Stemming enabled,
then each word found on the document will be reduced to a root form. For example, the words 'insurance', 'insure', 'insuring', 'insured', and 'insures' would all stem
down to the root word 'insur'. Stemming reduces the amount of training required, because Grooper does not have to learn on its own that every variation means the same thing.
nGram Features (bigrams, trigrams, etc.)
In cases where word frequency alone cannot distinguish all document types in the corpus, nGram features can be used. nGrams are sequences of words: a bigram is a pair of two adjacent words, a trigram is a triplet of three adjacent words, and so on. While it is informative to know that a document contains the word "well" and the word "oil", it is much more informative to know that the document contains the phrase "oil well". nGram extraction can add significant processing overhead, and should only be used in cases where other methods have been deemed inadequate.
Other Feature Types
Feature extractors can target more than just words - they can also target data and elements of natural language. This is accomplished by creating an extractor
which matches information on the document, but returns a text token in its place. For example, a document containing many VIN numbers is more likely to be a Vehicle
Inventory Sheet than a Sick Leave Request. Including a feature extractor which matches VIN numbers on the document and returns the token "VIN_Number" would allow
the classification engine to consider the fact that the document contains a lot of VIN numbers. |
UseCF |
System.Boolean |
When enabled, class frequency will be considered in lexical feature weightings. Class frequency (CF) refers to the number of times a feature appears across documents of the same class. This option modifies the underlying weighting mechanism from TF-IDF to TF-IDF-CF. See 'An Improvement of TF-IDF Weighting in Text Categorization' - Mingyong Liu and Jiangang Yang, Zhejiang University - 2012 International Conference on Computer Technology and Science. |
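Taken together, the SmoothIDF and SubLinearScaling properties correspond to standard TF-IDF variants, which can be sketched as follows. This is an illustrative sketch of the formulas stated in the property descriptions above, not Grooper's internal code.

```python
from math import log

def tf_idf(term_count, classes, classes_with_feature,
           smooth_idf=False, sublinear_tf=False):
    """Weight one feature using the variants described by the SmoothIDF
    and SubLinearScaling properties (illustrative sketch only)."""
    # SubLinearScaling: scale term frequency logarithmically instead of raw.
    tf = (1 + log(term_count)) if sublinear_tf else term_count
    if smooth_idf:
        # SmoothIDF on: +1 smoothing, so the IDF never reaches zero.
        idf = log(1 + classes / classes_with_feature)
    else:
        # Standard IDF: zero when the feature appears in every class.
        idf = log(classes / classes_with_feature)
    return tf * idf
```

With SmoothIDF disabled, a feature appearing in every class contributes nothing to the model; with it enabled, such a feature retains a small non-zero weight.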
Methods
Method Name |
Description |
ClassifyFolder(Folder As BatchFolder, Scope As ContentType, Level As ClassificationLevel) As ContentTypeCandidateList |
Parameters |
Folder |
Type: BatchFolder |
|
|
Scope |
Type: ContentType |
|
|
Level |
Type: ClassificationLevel |
|
|
ClassifyPage(Page As BatchPage, Scope As ContentType) As PageTypeCandidateList |
Parameters |
Page |
Type: BatchPage |
|
|
Scope |
Type: ContentType |
|
|
ClearCache() |
|
ComparePage(pt As PageType, Features As FeatureDictionary) As Double |
Parameters |
pt |
Type: PageType |
|
|
Features |
Type: FeatureDictionary |
|
|
ExtractEpiData(Source As DataInstance, Result As PageTypeCandidateList) |
Parameters |
Source |
Type: DataInstance |
|
|
Result |
Type: PageTypeCandidateList |
|
|
ExtractFeatures(Page As IPage) As FeatureDictionary |
Parameters |
Page |
Type: IPage |
|
|
GenerateModel(ct As ContentType, Language As CultureData) As TfIdfModel |
Parameters |
ct |
Type: ContentType |
|
|
Language |
Type: CultureData |
|
|
GetCompiler(Scope As ContentType, Level As ClassificationLevel, Culture As CultureData) As TfIdfWeightingCompiler |
Parameters |
Scope |
Type: ContentType |
|
|
Level |
Type: ClassificationLevel |
|
|
Culture |
Type: CultureData |
|
|
GetProperties() As PropertyDescriptorCollection |
|
GetReferences() As List(Of GrooperNode) |
Returns a list of GrooperNode objects referenced in the properties of this object. |
IsPropertyEnabled(PropertyName As String) As Nullable(Of Boolean) |
Defines whether a property is currently enabled.
Parameters |
PropertyName |
Type: String |
The name of the property to determine the enabled state for. |
|
IsPropertyVisible(PropertyName As String) As Nullable(Of Boolean) |
Defines whether a property is currently visible.
Parameters |
PropertyName |
Type: String |
The name of the property to determine the visible state for. |
|
IsType(Type As Type) As Boolean |
Returns true if the object is of the type specified, or if it derives from the type specified.
Parameters |
Type |
Type: Type |
The type to check. |
|
Serialize() As String |
Serializes the object. |
SetDatabase(Database As GrooperDb) |
Sets the database connection of the object.
Parameters |
Database |
Type: GrooperDb |
|
|
SetOwner(Owner As ConnectedObject, SkipInitialization As Boolean) |
Sets the owner of the connected object with another object that implements the IConnected interface.
Parameters |
Owner |
Type: ConnectedObject |
|
|
SkipInitialization |
Type: Boolean |
|
|
ToString() As String |
Returns a string value representation of the connected object. |
TrainPages(dt As DocumentType, Pages As IEnumerable(Of IPage), Culture As CultureData, ipd As IProgressDisplay) As List(Of PageType) |
Trains this document type with the provided document sample. The provided sample is a
list of pages which make up the document.
Parameters |
dt |
Type: DocumentType |
|
|
Pages |
Type: IEnumerable`1 |
The pages of the document to be used as a training sample. |
|
Culture |
Type: CultureData |
|
|
ipd |
Type: IProgressDisplay |
|
|
ValidateCache() |
|
ValidateProperties() As ValidationErrorList |
Validates the properties of the object, returning a list of validation errors. |