Tokens Detector

The Tokens Detector is a specialized natural language model designed to identify and extract specific pieces of information, called tokens, from within a sentence. It is used to detect structured data like IDs, names, dates, or codes embedded in natural language text.

Unlike a classifier, which categorizes entire sentences, the Tokens Detector marks and labels spans within a sentence. This is particularly useful for applications that need to extract values from user requests, such as document IDs or references.

How the Tokens Detector Works

The Tokens Detector is trained on labeled sentences in which each token of interest is wrapped in <START:label> and <END> tags. The tags mark both the boundaries of the token and the type of information it carries, and they teach the model to recognize similar structures and values in new, unseen text.

Examples:

I am searching for the document with id <START:docid> 12356897 <END>, please send it to me.
All employees are encouraged to find doc with id <START:docid> 0023 <END> on the HR portal.
Interested faculty and graduate students can find document with id <START:docid> 1250 <END>.
Please open file titled <START:filename> launch_brief.txt <END>.
The revised pipeline is presented in the document <START:filename> ingestion_workflow_diagram_v2.pdf <END>.
To explore the many benefits of developing a consistent reading habit, see the document called <START:filename> reading-benefits.txt <END>. 
Please retrieve any documents about <START:expression> 1250 <END> paper.
Locate the finance policy document containing <START:expression> "card*" <END> usage rules and restrictions.
I need you to find all compliance documentation that specifies <START:expression> dev* <END> for our legal counsel.
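
The tagged format can be read mechanically. The following is a minimal Python sketch, not the detector's own loader: the function name parse_annotated and the whitespace-based word splitting are assumptions made for illustration. It strips the tags from a training sentence and records each labeled span as word indices, rejecting sentences whose tags are unbalanced.

```python
import re

# Matches either an opening tag such as <START:docid> or the closing <END> tag.
TAG = re.compile(r"<START:(?P<label>[^>\s]+)>|<END>")


def parse_annotated(sentence):
    """Split a tagged training sentence into plain words and labeled spans.

    Returns (words, spans). Each span is (label, start, end), where start
    and end are word indices and end is exclusive.
    Assumption: words are separated by whitespace only.
    """
    words, spans = [], []
    open_label, open_start = None, None
    pos = 0
    for match in TAG.finditer(sentence):
        # Text between the previous tag and this one is plain words.
        words.extend(sentence[pos:match.start()].split())
        pos = match.end()
        if match.group("label") is not None:
            if open_label is not None:
                raise ValueError("nested <START:...> tag")
            open_label, open_start = match.group("label"), len(words)
        else:
            if open_label is None:
                raise ValueError("<END> without a matching <START:...>")
            spans.append((open_label, open_start, len(words)))
            open_label = None
    if open_label is not None:
        raise ValueError("missing <END> tag")
    words.extend(sentence[pos:].split())
    return words, spans


words, spans = parse_annotated(
    "Please open file titled <START:filename> launch_brief.txt <END>."
)
print(spans)  # [('filename', 4, 5)]
```

Validating tag balance before training is a cheap way to catch annotation mistakes, since a span that is opened but never closed with <END> raises an error instead of silently producing a wrong label.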

As the examples show, the detector learns from sentences in which the tokens are tagged explicitly. It learns to recognize each token type based on:

Word position
Word shape (numbers, uppercase, etc.)
Context around the token
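
To make these three signals concrete, here is a minimal Python sketch of how they might be computed for one word. The source does not specify the detector's actual feature set, so the functions word_shape and features, the shape alphabet (X, x, d), and the one-word context window are all assumptions for illustration.

```python
def word_shape(word):
    """Collapse a word into a coarse shape: X = upper, x = lower, d = digit."""
    shape = []
    for ch in word:
        if ch.isdigit():
            code = "d"
        elif ch.isalpha():
            code = "X" if ch.isupper() else "x"
        else:
            code = ch  # keep punctuation such as "_" or "." unchanged
        # Collapse runs so that "12356897" and "0023" share the shape "d".
        if not shape or shape[-1] != code:
            shape.append(code)
    return "".join(shape)


def features(words, i):
    """Return a feature dict for the word at index i (hypothetical feature set)."""
    return {
        "position": i,                     # word position in the sentence
        "shape": word_shape(words[i]),     # word shape
        "prev": words[i - 1].lower() if i > 0 else "<s>",                # left context
        "next": words[i + 1].lower() if i + 1 < len(words) else "</s>",  # right context
    }


print(word_shape("launch_brief.txt"))  # x_x.x
print(features(["with", "id", "0023", "on"], 2))
```

With features like these, a document ID such as 0023 and one such as 12356897 look alike to the model (shape "d", preceded by "id"), which is what lets it generalize to values it has never seen.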

Token Detection

When a sentence is submitted to the model, it is first split into word-level units (typically words or punctuation marks) using a language-specific tokenizer. This step is essential for locating token boundaries accurately. The trained model then scans the tokenized sentence and identifies spans of words that match learned patterns. Each span is returned along with:

The label of the token (e.g., docid)
The value (e.g., 12356897)
A confidence score
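
The flow above can be sketched in Python as follows. This is an illustration under stated assumptions, not the detector's API: tokenize is a simple regex stand-in for a language-specific tokenizer, DetectedToken and to_results are hypothetical names, the min_confidence threshold is an added convenience, and the predicted span is hard-coded where a trained model would supply it.

```python
import re
from dataclasses import dataclass


@dataclass
class DetectedToken:
    label: str         # token type, e.g. "docid"
    value: str         # matched text, e.g. "12356897"
    confidence: float  # score between 0 and 1


def tokenize(sentence):
    """Split into words (optionally joined by "." or "-") and punctuation marks.

    Assumption: a regex stand-in for a real language-specific tokenizer.
    """
    return re.findall(r"\w+(?:[.-]\w+)*|[^\w\s]", sentence)


def to_results(words, predicted_spans, min_confidence=0.5):
    """Convert (label, start, end, score) spans into DetectedToken records.

    start and end are word indices, end exclusive. Spans scoring below
    min_confidence (a hypothetical threshold) are dropped.
    """
    results = []
    for label, start, end, score in predicted_spans:
        if score >= min_confidence:
            value = " ".join(words[start:end])
            results.append(DetectedToken(label, value, score))
    return results


words = tokenize("I am searching for the document with id 12356897, please send it to me.")
# Hard-coded stand-in for the trained model's prediction.
predicted = [("docid", 8, 9, 0.97)]
print(to_results(words, predicted))
```

Note that the tokenizer separates the trailing comma from 12356897, so the returned value contains only the ID. This is why tokenization quality directly affects the accuracy of the extracted values.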