This page contains a glossary for the technical terms used in this tutorial. The defined words are organized in alphabetical order.


    An agent is an Arkindex-specific concept: it is an instance of Ponos, our proprietary software that executes the workers used for intensive document processing tasks.


    A CPU, or central processing unit, is the core computational hardware component of a server. It handles all the computing tasks required for the operating system and applications to run. A graphics processing unit (GPU) is a more specialized hardware component that performs much better on demanding, highly parallel tasks, such as training Machine Learning models.

    Data leakage

    In Machine Learning, data leakage is the use of information in the model training process which would not be expected to be available at prediction time, causing the predictive scores (metrics) to overestimate the model's utility when run in a production environment.

    See more on Wikipedia.
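    One common source of leakage is an element that ends up in both the training split and the evaluation split. The check below is a minimal sketch of how such overlaps could be detected; the function name `find_leaked_ids` and the page identifiers are hypothetical and only serve the illustration.

```python
def find_leaked_ids(train_ids, test_ids):
    """Return the identifiers that appear in both splits, sorted."""
    return sorted(set(train_ids) & set(test_ids))


# Hypothetical element identifiers for the two splits.
train_ids = ["page-001", "page-002", "page-003"]
test_ids = ["page-003", "page-004"]

leaked = find_leaked_ids(train_ids, test_ids)
print(leaked)  # ['page-003']
```

    An empty result does not prove the absence of leakage (pages from the same document can still be split across both sets), but a non-empty one is a clear warning sign.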


    An epoch corresponds to one complete pass of the training dataset through the algorithm. The performance of the model generally increases with the number of epochs. However, the model eventually stops improving, so specifying a very high number of epochs might waste time.
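    The toy training loop below illustrates what an epoch is and why very high epoch counts can be wasteful. It is a sketch under simple assumptions: a one-parameter model `y = w * x` fitted by gradient descent, with a hypothetical `min_improvement` threshold that stops training once an additional epoch no longer reduces the loss noticeably. It is not how any particular Arkindex worker is implemented.

```python
def train(samples, max_epochs=1000, learning_rate=0.01, min_improvement=1e-6):
    """Fit y = w * x by gradient descent; return (w, epochs_run)."""
    w = 0.0
    previous_loss = float("inf")
    for epoch in range(1, max_epochs + 1):
        # One epoch: every training sample is seen exactly once.
        for x, y in samples:
            error = w * x - y
            w -= learning_rate * 2 * error * x
        # Mean squared error over the whole training set after this epoch.
        loss = sum((w * x - y) ** 2 for x, y in samples) / len(samples)
        # Stop early once an extra epoch no longer helps.
        if previous_loss - loss < min_improvement:
            return w, epoch
        previous_loss = loss
    return w, max_epochs


samples = [(1.0, 2.0), (2.0, 4.0), (3.0, 6.0)]
w, epochs_run = train(samples)
print(f"w = {w:.3f} after {epochs_run} epochs")
```

    With this data the loop stops long before reaching `max_epochs`, because the remaining epochs would bring no measurable improvement.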


    A farm is also a specific Arkindex concept. It refers to a group of computing resources on which a Ponos agent is available to run workers.


    Handwritten Text Recognition (HTR) is the ability of a computer to take handwritten input from sources such as physical documents, pictures and other devices, and to interpret it as text.


    Markdown is a plain text format for writing structured documents, based on conventions which indicate the formatting. It is widely used for blogging, instant messaging, in collaborative software, etc. It easily enhances readability and allows formatting text as titles, subtitles, lists, links and so on.

    Learn more by reading the Markdown specification.


    A Machine Learning model is a program that has been trained on a dataset so that it can find patterns or make decisions on previously unseen data.

    For example, in Natural Language Processing (NLP), Machine Learning models can parse and correctly recognize the intent behind previously unheard sentences or combinations of words. In image recognition, a Machine Learning model can be taught to recognize objects, such as cats or dogs.


    Slug is a term borrowed from newspaper jargon. It is a string that can only include letters, numbers, dashes, and underscores. It is a unique identifier that refers to a single object, in a human-friendly form.
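    The snippet below sketches how a slug could be validated or derived from a title. The helper names `is_valid_slug` and `slugify`, the choice to lowercase the title, and the acceptance of uppercase letters are assumptions made for the example; Arkindex may apply different rules.

```python
import re

# Only letters, digits, dashes, and underscores are allowed.
SLUG_PATTERN = re.compile(r"[A-Za-z0-9_-]+")


def is_valid_slug(value):
    """Return True if the whole string is made of allowed characters."""
    return SLUG_PATTERN.fullmatch(value) is not None


def slugify(title):
    """Lowercase the title and replace runs of other characters with a dash."""
    slug = re.sub(r"[^A-Za-z0-9_-]+", "-", title.lower())
    return slug.strip("-")


print(slugify("My HTR Project (2024)"))  # my-htr-project-2024
print(is_valid_slug("my project"))       # False
```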


    SQLite is a library that implements a transactional SQL database engine. In Arkindex, it is used to export projects and all the elements they contain to a single lightweight file. The generated file can be stored, shared and easily accessed by some workers to perform demanding operations on a large number of elements.

    Learn more by reading the SQLite documentation.
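    Because an export is a regular SQLite file, it can be read with any SQLite client. The sketch below uses Python's built-in `sqlite3` module on an in-memory database; the `element` table and its `id`, `name` and `type` columns are a hypothetical stand-in, so check the actual schema of an export before reusing the query.

```python
import sqlite3


def count_elements_by_type(connection):
    """Return a {type: count} mapping for the (hypothetical) element table."""
    rows = connection.execute(
        "SELECT type, COUNT(*) FROM element GROUP BY type ORDER BY type"
    )
    return dict(rows.fetchall())


# In-memory database standing in for an export file.
connection = sqlite3.connect(":memory:")
connection.execute("CREATE TABLE element (id TEXT PRIMARY KEY, name TEXT, type TEXT)")
connection.executemany(
    "INSERT INTO element VALUES (?, ?, ?)",
    [("1", "Page 1", "page"), ("2", "Page 2", "page"), ("3", "Line 1", "text_line")],
)

counts = count_elements_by_type(connection)
print(counts)  # {'page': 2, 'text_line': 1}
```

    To query a real file, replace `":memory:"` with the path to the export.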


    Model training in Machine Learning is the process of feeding a Machine Learning algorithm with data to help identify and learn good values for all of its parameters.


    A worker is a resource required to run document processing workflows. It is programmed to apply a specialized action to one element at a time, which will produce the desired output. Workers can be chained to perform several successive actions on elements before reporting the results to Arkindex.

    For example, various dedicated workers from Arkindex have been developed with the aim of:

    • transcribing text from an image,
    • translating transcriptions available on Arkindex into another language,
    • recognizing objects in an image (cats, dogs and so on),
    • etc.

    Worker configuration

    Configuring a worker is the step that allows you to parameterize its execution to adapt the program's input (elements) or output (results) to your needs.