Lookup Options

Defines lookup options which should be applied to extraction results.

Properties

The following 11 properties are defined.

Property Name Description
General
Vocabulary Type: Embedded Lexicon

Defines an optional set of allowed values. If a vocabulary is defined, any result that does not occur in the vocabulary will be discarded.

Exclusions Type: Embedded Lexicon

Defines an optional set of disallowed values. Any extracted value appearing in this list will be discarded.
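The combined effect of Vocabulary and Exclusions can be pictured with a minimal Python sketch. The function name filter_results and the exact, case-sensitive string comparison are assumptions made for illustration; they are not part of the product.

```python
def filter_results(results, vocabulary=None, exclusions=None):
    """Keep results allowed by the vocabulary and not listed in exclusions.

    Hypothetical helper: exact string comparison is an assumption.
    """
    # An undefined (or empty) vocabulary means "allow everything".
    allowed = set(vocabulary) if vocabulary else None
    blocked = set(exclusions) if exclusions else set()
    kept = []
    for value in results:
        if allowed is not None and value not in allowed:
            continue  # not in the vocabulary: discard
        if value in blocked:
            continue  # explicitly excluded: discard
        kept.append(value)
    return kept
```

With a vocabulary of "red" and "green" and an exclusion list containing "green", only "red" survives from the results "red", "blue", "green".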

Clean Key Type: Boolean, Default: False

If enabled, vocabulary lookups will be performed with all punctuation symbols and control characters removed. As an example, this option could be used to match "O'Connor" against a lexicon that contains "oconnor".
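A rough Python sketch of this kind of key cleaning follows. The helper name clean_key is hypothetical, and two details are assumptions: Unicode categories are used to decide what counts as punctuation or a control character, and the key is lower-cased so that the O'Connor example matches.

```python
import unicodedata


def clean_key(value):
    """Remove punctuation and control characters, then lower-case.

    Hypothetical helper; lower-casing is an assumption made so that
    "O'Connor" and "oconnor" compare equal.
    """
    kept = []
    for ch in value:
        category = unicodedata.category(ch)
        # "P*" covers punctuation; "Cc" covers control characters.
        if category.startswith("P") or category == "Cc":
            continue
        kept.append(ch)
    return "".join(kept).lower()
```

Applying the same cleaning to both the extracted value and the lexicon keys lets the two be compared directly.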

Enable Translation Type: Boolean, Default: False

If enabled, values will be translated to the replacement values specified in the vocabulary. Vocabulary entries may consist of key-value pairs, using the = symbol as a delimiter. For example, the vocabulary entry OK=Oklahoma indicates that if the value "OK" is found, it should be translated to "Oklahoma". If the vocabulary entry does not specify a replacement value, then no translation will be performed.
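The key-value behavior can be sketched in Python as follows. The names parse_vocabulary and translate are hypothetical, and trimming whitespace around the = delimiter is an assumption.

```python
def parse_vocabulary(lines):
    """Split each entry on the first '=' into a key and optional replacement."""
    table = {}
    for line in lines:
        key, sep, replacement = line.partition("=")
        # Entries without '=' carry no replacement value.
        table[key.strip()] = replacement.strip() if sep else None
    return table


def translate(value, table):
    """Return the replacement if one is defined, otherwise the value itself."""
    replacement = table.get(value)
    return replacement if replacement else value
```

For the entries OK=Oklahoma and Texas, the value "OK" becomes "Oklahoma", while "Texas" passes through unchanged because its entry has no replacement.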

Match Case Type: Boolean, Default: False

If enabled, the case of the extracted value will be detected, and the detected casing will be applied to the translated output value.
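One plausible reading of this behavior is sketched below in Python. The function name apply_detected_case and the three recognized casings (upper, lower, title) are assumptions; the product may detect casing differently.

```python
def apply_detected_case(source, translated):
    """Re-case the translated value to follow the casing of the source.

    Hypothetical helper: only all-upper, all-lower, and title case are
    detected; anything else leaves the translated value as-is.
    """
    if source.isupper():
        return translated.upper()
    if source.islower():
        return translated.lower()
    if source.istitle():
        return translated.title()
    return translated
```

Under these assumptions, an extracted value of "ny" with a replacement of "New York" would be output as "new york".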

Porter Stemming Type: Boolean, Default: False

If enabled, the final result will be lower-cased and stemmed to its root form using Porter Stemming. This property affects English documents only. Stemming is the process of reducing inflected words to their word stem, base, or root form. Stemming is useful when extracting features for use in the classification of documents or data elements. Below are some examples of stemming in general; the stems a Porter stemmer actually produces may differ for some of these words:

  • The strings "cats", "catlike", and "catty" reduce to "cat".
  • The strings "stems", "stemmer", "stemming", "stemmed" reduce to "stem".
  • The strings "fishing", "fished", and "fisher" reduce to "fish".
  • The strings "argue", "argued", "argues", "arguing", and "argus" reduce to "argu" (illustrating the case where the stem is not itself a word or root) but "argument" and "arguments" reduce to the stem "argument".

Fuzzy Lookup Options
Fuzzy Match Similarity Type: Double, Default: 100%, Range: 0% - 100%

The percentage of similarity that a fuzzy match candidate must have to the extracted value in order for a replacement to occur. A value of 100% disables fuzzy matching.
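The thresholding idea can be sketched in Python. This sketch uses difflib.SequenceMatcher from the standard library as a stand-in similarity measure; the product's actual similarity metric is not described here, and the function name best_fuzzy_match is hypothetical. Similarity is expressed as a fraction, so 1.0 corresponds to 100%.

```python
from difflib import SequenceMatcher


def best_fuzzy_match(value, vocabulary, min_similarity=1.0):
    """Return the closest vocabulary entry at or above the threshold.

    Assumption: SequenceMatcher.ratio() stands in for the real metric.
    """
    if value in vocabulary:
        return value  # exact match, no fuzzy lookup needed
    if min_similarity >= 1.0:
        return None  # a 100% threshold disables fuzzy matching
    best, best_score = None, 0.0
    for entry in vocabulary:
        score = SequenceMatcher(None, value, entry).ratio()
        if score >= min_similarity and score > best_score:
            best, best_score = entry, score
    return best
```

Lowering the threshold accepts more distant candidates, which raises recall at the cost of more false replacements.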

Fuzzy Match Minimum Length Type: Int32, Default: 0, Range: 0 - 512

The minimum length of values that will be considered for a fuzzy match. Any value shorter than the configured minimum will not be submitted for fuzzy matching.

Fuzzy Match Depth Type: Int32, Default: 0

If set to a value other than 0, specifies that only the top N entries in the lexicon will be considered for fuzzy matching purposes. If set to 0, fuzzy matching will be performed for all entries in the lexicon. NOTE: Depth limits work best when applied to vocabulary lexicons which are sorted in descending order by frequency.
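The minimum length and depth settings both narrow what is submitted for fuzzy matching. A minimal Python sketch, with the hypothetical function name fuzzy_candidates and the assumption that the lexicon is an ordered sequence:

```python
def fuzzy_candidates(value, lexicon, min_length=0, depth=0):
    """Return the lexicon entries that would be compared against value.

    Hypothetical helper: an empty list means the value is not
    submitted for fuzzy matching at all.
    """
    if len(value) < min_length:
        return []  # too short to be fuzzy matched
    entries = list(lexicon)
    # depth == 0 means "consider every entry"; otherwise only the top N.
    return entries if depth == 0 else entries[:depth]
```

Because a depth limit keeps only the first N entries, sorting the lexicon by descending frequency keeps the most likely candidates in play.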

Fuzzy Match Weightings Type: Fuzzy Match Weightings

Defines weightings to be used for fuzzy lookups.

Fuzzy Match Vocabulary Type: Embedded Lexicon

Defines a vocabulary to be used in place of the main vocabulary for fuzzy matching. By default, when fuzzy matching is enabled, the main vocabulary is used for fuzzy matching. However, if the main vocabulary is large, it may be desirable for performance reasons to restrict fuzzy matching to a smaller list of key values. In such cases, this property can be used to override the set of lexicon entries used for fuzzy matching.

See Also

Embedded Lexicon, Fuzzy Match Weightings

Used By

Data Pattern