Fillers

Fillers are configurable components used to extract and populate structured information from documents automatically. They define how specific data, such as tags, language, and templates, are identified and retrieved from document content.

Fillers are typically used in conjunction with document processing workflows to enable automated data extraction and reduce manual input.

Fillers Management

In Administration > Artificial Intelligence > Fillers, you can manage fillers from the dedicated panel.

The Filler panel provides a list of all configured fillers and allows you to:

  • View existing fillers
  • Create new fillers (Add filler)
  • Edit filler configuration
  • Delete fillers

Each filler is displayed with the following attributes:

  • Name: Internal identifier of the filler
  • Label: User-friendly name shown in the interface
  • Type: The extraction strategy used by the filler
  • Description: Additional details about the filler’s purpose
  • Overwrite: When checked, the filler overwrites values that have already been filled in
  • Fill on check-in: When checked, the document is automatically filled at check-in

When you create a new filler by clicking Add filler, you are required to select one of the available filler types.

Filler Properties

Each filler is configured through the filler Properties panel, where you define how the extraction is performed and which technologies are used.
The configuration is dynamic: some fields appear or change depending on the selected filler type and strategy.

At the time of writing, you can choose among these filler types:

  • Tag: Assigns a value to a field based on semantic similarity or classification
  • Language: Assigns the document language based on content analysis
  • Template: Assigns a document template based on semantic similarity
  • Chain: Assigns values by combining multiple fillers in sequence

Tag Filler

When the selected filler is of type Tag, you must define a Strategy that determines how the value is retrieved, along with a Threshold value that defines the minimum confidence required to accept a result. 

The available strategies are AI Model and Semantic Similarity; the selected strategy determines which additional fields are required.

  • AI Model (Model-based)
    Uses a machine learning model to classify or extract the value directly
    • Model: selects the AI model to use; it must reference a previously configured model
  • Semantic Similarity (Retrieval-based)
    Uses vector similarity search over embeddings to retrieve the most relevant content
    • Embedding Scheme: selects the embedding scheme to use; it must reference a previously configured embedding scheme
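Whichever strategy is selected, the Threshold works the same way: the best-scoring candidate is accepted only if its confidence meets the configured minimum. A minimal sketch of that acceptance logic, with hypothetical names (`apply_tag_filler`, the candidate tuples) that are illustrative rather than the product's API:

```python
# Hypothetical sketch of Threshold handling in a Tag filler.
# `candidates` are (value, confidence) pairs produced by either
# strategy (AI Model or Semantic Similarity).

def apply_tag_filler(candidates, threshold):
    """Pick the best-scoring candidate; accept it only if its
    confidence meets the configured Threshold, else fill nothing."""
    if not candidates:
        return None
    value, confidence = max(candidates, key=lambda c: c[1])
    return value if confidence >= threshold else None

print(apply_tag_filler([("invoice", 0.91), ("contract", 0.42)], 0.8))  # invoice
print(apply_tag_filler([("invoice", 0.55)], 0.8))  # None (below threshold)
```

A higher Threshold therefore trades recall for precision: fewer documents get filled, but the filled values are more reliable.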

Language Filler

When the selected filler is of type Language, you must define a Model to specify the AI model used for extraction.

Template Filler

A filler of type Template does not require any extra fields.

The Template Filler is a specialized filler used to automatically assign a template to a document based on its content.
Instead of extracting a single field, this filler performs semantic classification, identifying which template best matches the document.

Chain Filler

When the selected filler is of type Chain, you must define a chain of fillers in the table on the right side of the panel. The filler acts as a pipeline of multiple fillers executed in sequence.

In this case, the table allows you to:

  • Add fillers to the chain
  • Reorder execution (drag & drop)
  • Remove fillers
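Conceptually, a chain runs its member fillers in order, each one contributing fields to the document. The sketch below is an illustration of that pipeline idea, not the product's implementation; the names (`run_chain`, the lambda fillers) are hypothetical:

```python
# Illustrative sketch of a Chain filler: member fillers run in
# sequence, each returning a dict of fields to populate.

def run_chain(document, fillers, overwrite=False):
    for filler in fillers:
        result = filler(document)  # {field: value} contributed by this filler
        for field, value in result.items():
            # Mirror the Overwrite flag: skip fields already filled
            # unless overwriting is allowed.
            if overwrite or field not in document:
                document[field] = value
    return document

doc = {"content": "Bonjour, voici la facture 2024-17."}
language_filler = lambda d: {"language": "fr"}   # stand-in for a Language filler
tag_filler = lambda d: {"tag": "invoice"}        # stand-in for a Tag filler
print(run_chain(doc, [language_filler, tag_filler]))
```

Because execution order matters (a later filler may rely on fields set by an earlier one), the drag & drop reordering in the table directly changes the pipeline's behavior.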

Tokens Detector

The Tokens Detector is a specialized natural language model designed to identify and extract specific pieces of information, called tokens, from within a sentence. It is used to detect structured data like IDs, names, dates, or codes embedded in natural language text.

Unlike a classifier, which categorizes entire sentences, the tokens detector focuses on marking and labeling subsections of a sentence. This is particularly useful for applications that need to extract values from user requests, such as document IDs or references.

How the Tokens Detector Works

The Tokens Detector is trained using labeled sentences in which the tokens of interest are clearly marked using a special format. Each token to be detected is wrapped in <START:label> and <END> tags, which indicate both the boundaries of the token and the type of information being extracted. From these labeled examples, the system learns to recognize similar structures and values in new, unseen text.

Examples: 

I am searching for the document with id <START:docid> 12356897 <END>, please send it to me.
All employees are encouraged to find doc with id <START:docid> 0023 <END> on the HR portal.
Interested faculty and graduate students can find document with id <START:docid> 1250 <END>. Please open file titled <START:filename> launch_brief.txt <END>.
The revised pipeline is presented in the document <START:filename> ingestion_workflow_diagram_v2.pdf <END>.
To explore the many benefits of developing a consistent reading habit, see the document called <START:filename> reading-benefits.txt <END>.
Please retrieve any documents about <START:expression> 1250 <END> paper .
Locate the finance policy document containing <START:expression> "card*" <END> usage rules and restrictions.
I need you to find all compliance documentation that specifies <START:expression> dev* <END> for our legal counsel.

As shown above, the detector uses labeled examples where the tokens are tagged explicitly. It learns to recognize these token types based on:

  • Word position
  • Word shape (numbers, uppercase, etc.)
  • Context around the token
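To make the annotation format concrete, here is a small sketch that pulls the (label, value) pairs out of a training sentence written in the <START:label> ... <END> format shown above. The function name and regular expression are illustrative, not part of the product:

```python
import re

# Extract (label, value) pairs from a sentence annotated in the
# <START:label> ... <END> training format.
TOKEN_TAG = re.compile(r"<START:(\w+)>\s*(.*?)\s*<END>")

def extract_tokens(sentence):
    """Return every annotated token as a (label, value) tuple."""
    return TOKEN_TAG.findall(sentence)

line = ("Interested faculty and graduate students can find document "
        "with id <START:docid> 1250 <END>.")
print(extract_tokens(line))  # [('docid', '1250')]
```

A sentence may carry several annotations (see the filename/expression examples above); in that case the function returns one tuple per tag, in order of appearance.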

Token Detection

When a sentence is submitted, the model first breaks it into tokens (typically words or punctuation marks) using a language-specific tokenizer. This step is essential for accurate recognition of token boundaries. The trained model then scans the tokenized sentence and identifies spans of words that match learned patterns. Each span is returned along with:

  • The label of the token (e.g., docid)
  • The value (e.g., 12356897)
  • A confidence score

Classifier

The Classifier is a natural language component that assigns a category to a given text based on its content. In this system, the classifier is trained using pairs of example data, where each pair contains a category label and a sample sentence that ends with a space followed by a period ( .). This allows the classifier to learn patterns and keywords that are commonly associated with specific intents or commands.

How the Classifier Works

The Classifier is trained using labeled examples, where each line includes a category and a sentence that reflects that category. This builds a model that can later compare new inputs against what it has learned.
To train the model, the system expects a CSV file containing only one column. Each line in this column must follow a very specific format:

<category><TAB><text ending with a space followed by a period>

Examples:

SEARCHDOC    Find any files about budget .
SEARCHDOC    Locate docs about paper .
SEARCHDOC    Retrieve documents matching "news" .
GETDOC    Can you get doc with ID 1233587 .
GETDOC    I need to access file with id 29679 .
GETDOC    Show doc with id 299 .
SEARCHFILE    File called "mywork.docx" .
SEARCHFILE    Get document called invoice.pdf .
SEARCHFILE    Open doc titled booklet.txt .
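Because the training format is strict (category, a tab, then text ending with a space and a period), it can be worth validating lines before training. The sketch below checks that format; the helper name and regular expression are illustrative, not part of the product:

```python
import re

# One training line: <category><TAB><text ending with " ."> 
LINE_FORMAT = re.compile(r"^(\S+)\t(.+ \.)$")

def parse_training_line(line):
    """Split a training line into (category, text), rejecting
    lines that do not end with a space followed by a period."""
    match = LINE_FORMAT.match(line)
    if not match:
        raise ValueError(f"Malformed training line: {line!r}")
    return match.groups()

print(parse_training_line("SEARCHDOC\tFind any files about budget ."))
# ('SEARCHDOC', 'Find any files about budget .')
```

Lines missing the tab separator or the trailing " ." are rejected, which surfaces formatting mistakes before they silently degrade the trained model.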

Classifier Configuration Overview

This section describes the key configuration fields for the Classifier model used in the NLP system. These settings define how the classifier behaves during training and how it interprets user input.

Properties

The Properties tab in the classifier interface contains the core configuration settings that define how the classifier behaves during training and inference. These parameters influence how the input text is processed, how features are extracted, and how language-specific rules are applied. Properly configuring these fields ensures accurate and efficient classification.


  • Cutoff: a threshold value used during training to filter low-probability features. A lower value means more features are used; a higher value makes the model stricter.
  • Ngram Min: the minimum size of n-grams (word sequences) to consider during training (e.g., 2 = bigrams).
  • Ngram Max: the maximum size of n-grams to include during training (e.g., 4 = up to 4-word sequences).
  • Language: the language of the training dataset (e.g., English). This helps the system load appropriate stop words and processing rules.
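To illustrate what the Ngram Min / Ngram Max settings control, here is a small sketch of n-gram feature extraction. With min=2 and max=3, every bigram and trigram of the sentence becomes a candidate feature during training (the function name is illustrative):

```python
# Enumerate all n-grams of sizes n_min..n_max from a token list.
def ngrams(tokens, n_min, n_max):
    feats = []
    for n in range(n_min, n_max + 1):
        for i in range(len(tokens) - n + 1):
            feats.append(tuple(tokens[i:i + n]))
    return feats

tokens = "find any files about budget".split()
# 4 bigrams + 3 trigrams = 7 features for this 5-token sentence
print(ngrams(tokens, 2, 3))
```

Raising Ngram Max captures longer phrases (at the cost of many more, sparser features), while raising Ngram Min drops single words or short pairs from consideration.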

Language Detector

The Language Detection model automatically identifies the language of a document based on its textual content. This enables the system to classify documents by language and support language-specific processing workflows.

The model is based on a pre-trained implementation provided by Apache OpenNLP and does not require training within the system.

How the Language Detection Model Works

The model analyzes the input text and predicts the most likely language using statistical patterns learned from large multilingual datasets.

Test the Model

To quickly test the model, right-click on the model, select Query the Model, and fill in the required field (Content).

For this example, the content used is:

Ich lehre euch den Übermenschen. Der Mensch ist etwas, das überwunden werden soll. Was habt ihr getan, ihn zu überwinden? 
... Alles Wesen bisher schuf etwas über sich hinaus; und ihr wollt die Ebbe dieses grossen Schwindens sein und lieber noch zum Tiere zurückgehen, 
als den Menschen überwinden?

The detected language is returned together with its confidence score.
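As a toy illustration of the statistical idea, the sketch below scores each language by how many of its frequent words appear in the text. This is a deliberately simplified stand-in: a real detector such as Apache OpenNLP's works on character n-gram statistics, and the word lists and names here are invented for the example:

```python
# Toy language detection by frequent-word overlap (illustrative only;
# real detectors use character n-gram models).
COMMON_WORDS = {
    "de": {"der", "die", "das", "und", "ist", "ihr", "zu", "den"},
    "en": {"the", "and", "is", "you", "to", "of"},
}

def detect_language(text):
    """Return (language, confidence) for the best-matching language."""
    words = set(text.lower().split())
    scores = {lang: len(words & vocab) for lang, vocab in COMMON_WORDS.items()}
    best = max(scores, key=scores.get)
    confidence = scores[best] / max(sum(scores.values()), 1)
    return best, confidence

lang, conf = detect_language("Der Mensch ist etwas, das überwunden werden soll")
print(lang)  # de
```

The Nietzsche excerpt above would score heavily on German function words, which is why even this crude heuristic returns "de" with high confidence.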

Embedder

Embedder models map entire documents to fixed-length vectors, making it possible to represent them in a continuous vector space. This enables efficient comparison and manipulation of textual data in natural language processing (NLP) tasks such as semantic search.

How the Embedder Works

A document's content is transformed into a vector through the Doc2Vec algorithm. A detailed treatment of this technique is beyond the scope of this manual, but there is ample literature on the topic.

In short, Doc2Vec uses a neural network to create a numerical representation of a document (the vector), which is generally stored in a Vector Store. Similar documents are represented by distinct yet adjacent vectors in the multidimensional vector space.
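The "adjacent vectors" idea can be made concrete with cosine similarity, the standard way to measure how close two embedding vectors are. The vectors below are made-up three-dimensional stand-ins for real Doc2Vec vectors, which have hundreds of dimensions:

```python
import math

# Cosine similarity: 1.0 for identical directions, near 0 for
# unrelated vectors, negative for opposed ones.
def cosine_similarity(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0

doc_a = [0.9, 0.1, 0.3]    # hypothetical document vectors
doc_b = [0.8, 0.2, 0.25]   # similar content to doc_a
doc_c = [-0.7, 0.9, 0.0]   # unrelated content
print(cosine_similarity(doc_a, doc_b) > cosine_similarity(doc_a, doc_c))  # True
```

A semantic search over the Vector Store amounts to computing this kind of similarity between the query's vector and the stored document vectors, then returning the closest matches.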

The Embedder is trained using paragraphs of text written in natural language, each paragraph terminated by a dot followed by a blank line:

Example:

She quickly dropped it all into a bin, closed it with its wooden
lid, and carried everything out. She had hardly turned her back
before Gregor came out again from under the couch and stretched
himself.

This was how Gregor received his food each day now, once in the
morning while his parents and the maid were still asleep, and the
second time after everyone had eaten their meal at midday as his
parents would sleep for a little while then as well, and Gregor's
sister would send the maid away on some errand. Gregor's father and
mother certainly did not want him to starve either, but perhaps it
would have been more than they could stand to have any more
experience of his feeding than being told about it, and perhaps his
sister wanted to spare them what distress she could as they were
indeed suffering enough.

Embedder Configuration Overview

This section describes the key configuration fields for the Embedder model. These settings define how the embedder behaves during training and how it interprets user input.

Properties

The Properties tab in the embedder interface contains the core configuration settings that define how the embedder behaves during training and embedding. These parameters influence how the input text is processed.


  • Seed: seed value used for random number generation
  • Workers: number of threads used for training
  • Window size: size of the context window used by the Doc2Vec algorithm
  • Vector Size: number of elements in each vector; should be greater than 300
  • Min. word freq: words that appear fewer times than this number are discarded
  • Max chunks: each document is subdivided into chunks of tokens; this sets the maximum number of admitted chunks
  • Chunk size: target number of tokens in a single chunk
  • Min. chunk size: minimum number of tokens and characters in a single chunk
  • Alpha: the initial learning rate (the size of weight updates during training); default is 0.025
  • Min Alpha: the learning rate drops linearly to this value over all inference epochs; default is 0.0001
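The three chunking parameters interact as follows: the token stream is split into chunks of roughly Chunk size tokens, at most Max chunks are kept, and chunks shorter than Min. chunk size are dropped. A minimal sketch of that interaction, with illustrative parameter names (the real implementation also applies the character-count minimum):

```python
# Sketch of chunking controlled by chunk_size, max_chunks and
# min_chunk_size (token count only, for illustration).
def chunk_tokens(tokens, chunk_size, max_chunks, min_chunk_size):
    chunks = []
    for i in range(0, len(tokens), chunk_size):
        if len(chunks) == max_chunks:
            break  # Max chunks reached: remaining tokens are ignored
        chunk = tokens[i:i + chunk_size]
        if len(chunk) >= min_chunk_size:
            chunks.append(chunk)  # too-short trailing chunks are dropped
    return chunks

tokens = [f"w{i}" for i in range(23)]
print([len(c) for c in chunk_tokens(tokens, chunk_size=10, max_chunks=5, min_chunk_size=5)])
# [10, 10] -- the trailing 3-token chunk is below min_chunk_size
```

Tightening Max chunks bounds the embedding cost for very long documents, while Min. chunk size prevents tiny fragments from producing noisy vectors.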