Classifier

The Classifier is a natural language component that assigns a category to a text based on its content. It is trained on example pairs, each consisting of a category label and a sample sentence that ends with a space followed by a period ( .). From these examples, the classifier learns the patterns and keywords commonly associated with specific intents or commands.

How the Classifier Works

Each training example is a single line that holds a category and a sentence representative of that category. Training on these lines produces a model that later compares new inputs against what it has learned.
The training data must be supplied as a CSV file with a single column. Every line in that column must follow this exact format:

<category><TAB><text ending with a space followed by a period>

Examples:

SEARCHDOC    Find any files about budget .
SEARCHDOC    Locate docs about paper .
SEARCHDOC    Retrieve documents matching "news" .
GETDOC    Can you get doc with ID 1233587 .
GETDOC    I need to access file with id 29679 .
GETDOC    Show doc with id 299 .
SEARCHFILE    File called "mywork.docx" .
SEARCHFILE    Get document called invoice.pdf .
SEARCHFILE    Open doc titled booklet.txt .
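Before training, it can help to check a file against the format above. The following is a minimal sketch in Python, not part of the product: the function names parse_training_line and group_by_category are hypothetical, and it assumes the separator is a single TAB character as described in the format line.

```python
from collections import defaultdict


def parse_training_line(line: str) -> tuple[str, str]:
    """Split one training line into (category, text) and validate its shape."""
    line = line.rstrip("\r\n")
    # Assumption: category and text are separated by the first TAB on the line.
    category, sep, text = line.partition("\t")
    if not sep or not category.strip() or not text.strip():
        raise ValueError(f"expected '<category><TAB><text>': {line!r}")
    category, text = category.strip(), text.strip()
    # The sentence must end with a space followed by a period.
    if not text.endswith(" ."):
        raise ValueError(f"text must end with ' .': {line!r}")
    return category, text


def group_by_category(lines: list[str]) -> dict[str, list[str]]:
    """Collect the example sentences under their category label."""
    grouped: dict[str, list[str]] = defaultdict(list)
    for line in lines:
        if line.strip():  # skip blank lines
            category, text = parse_training_line(line)
            grouped[category].append(text)
    return dict(grouped)


if __name__ == "__main__":
    sample = [
        "SEARCHDOC\tFind any files about budget .",
        "GETDOC\tShow doc with id 299 .",
        "GETDOC\tI need to access file with id 29679 .",
    ]
    for category, texts in group_by_category(sample).items():
        print(category, len(texts))
```

A line such as "GETDOC<TAB>Show doc with id 299." (no space before the period) raises a ValueError in this sketch, which makes formatting mistakes visible before the file is uploaded.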

Classifier Configuration Overview

This section describes the key configuration fields of the Classifier model used in the NLP system. These settings control how the classifier is trained and how it interprets user input.

Properties

The Properties tab in the classifier interface contains the core settings that apply during training and inference. They influence how input text is processed, how features are extracted, and which language-specific rules are applied. Configuring these fields properly helps keep classification accurate and efficient.

  • Cutoff: a threshold value used during training to filter low-probability features. A lower value means more features are used; a higher value makes the model stricter.
  • Ngram Min: the minimum size of n-grams (word sequences) to consider during training (e.g., 2 = bigrams).
  • Ngram Max: the maximum size of n-grams to include during training (e.g., 4 = up to 4-word sequences).
  • Language: the language of the training dataset (e.g., English). This helps the system load appropriate stop words and processing rules.
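To make the interplay of Ngram Min, Ngram Max, and Cutoff more concrete, here is a rough Python sketch. It is an illustration only and not the system's actual feature extractor: the function names are hypothetical, tokens are obtained by simple whitespace splitting, and Cutoff is interpreted as a minimum occurrence count, which is an assumption about how the threshold is applied.

```python
from collections import Counter


def word_ngrams(text: str, ngram_min: int, ngram_max: int) -> list[str]:
    """Return all word n-grams of length ngram_min..ngram_max (inclusive)."""
    tokens = text.lower().split()  # assumption: whitespace tokenization
    grams = []
    for n in range(ngram_min, ngram_max + 1):
        for i in range(len(tokens) - n + 1):
            grams.append(" ".join(tokens[i:i + n]))
    return grams


def apply_cutoff(texts: list[str], ngram_min: int, ngram_max: int,
                 cutoff: int) -> dict[str, int]:
    """Count n-grams over all texts and keep those seen at least `cutoff` times.

    Assumption: the cutoff is a minimum count; the real system may define
    its threshold differently.
    """
    counts: Counter[str] = Counter()
    for text in texts:
        counts.update(word_ngrams(text, ngram_min, ngram_max))
    return {gram: count for gram, count in counts.items() if count >= cutoff}


if __name__ == "__main__":
    examples = ["Show doc with id 299 .", "Open doc with id 7 ."]
    print(word_ngrams(examples[0], 2, 3))
    print(apply_cutoff(examples, 2, 2, cutoff=2))
```

With these two sample sentences and a cutoff of 2, only the bigrams "doc with" and "with id" survive, since they are the only ones that occur in both sentences. Lowering the cutoff to 1 keeps every bigram, which matches the description that a lower value means more features are used.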