Training a transcription model

    In this tutorial, you will learn how to train a transcription model in Arkindex.

    This section should be carried out after creating the ground-truth annotations.

    When annotating on Callico, you reused the Arkindex Pellet dataset you created earlier. You may have already made this dataset immutable while training a segmentation model.

    Optional step - Clone an existing dataset🔗

    An immutable dataset is locked to avoid issues such as data leakage or accidental deletions; it is associated with a snapshot of the elements it contains. Therefore, when a dataset is already locked and new data is added to its elements (e.g. through the export of results from Callico), that data will not be added to the snapshot, which is unalterable.

    In the previous tutorial, you gathered annotations on Callico and exported them back to Arkindex. If the dataset you were working with is already locked, you need to clone it into a mutable one.

    To do so, navigate to your Arkindex project details page, then, open the Datasets tab and click on your dataset name:

    Access the dataset details page on Arkindex

    You will be redirected to the details page of the Pellet dataset. There, click on the Clone blue button:

    Clone the dataset into a new one

    A modal will open, asking you to confirm that you wish to clone the dataset. Approve by clicking the Clone blue button once again. Upon approval, you will reach the details page of a new dataset named Clone of Pellet, whose state will be Open!

    Your dataset was successfully cloned

    For convenience, we will give your dataset a more explicit name. First, navigate back to your project details page and open the Datasets tab. Then, click on the pencil button in the Actions column of your cloned dataset:

    Edit your new dataset

    A modal will appear; from there, you can rename your dataset. Enter Pellet - HTR. To save this new name, click on the Save blue button at the bottom of the modal:

    Rename your cloned dataset

    Make your dataset immutable🔗

    The steps to lock a dataset are explained in great detail in the segmentation training tutorial. Follow them to make your dataset (Pellet or Pellet - HTR) immutable.

    Once your dataset is properly locked, you can proceed with the next section of this tutorial.

    Create a model🔗

    The training will save the model's files as a new version on Arkindex. In this section, we will create the model that will hold this new version.

    Click on Models in the top right dropdown (with your email address).

    Browse to the Models page

    Click on the Create a model button to create a new model. This will open a new page where you can fill in your model's information. It is a good idea to name the model after:

    • the Machine Learning technology used,
    • the dataset,
    • the type of element present in the dataset.

    In our case, we are training:

    • a PyLaia model,
    • on the Pellet - HTR dataset,
    • on text_line elements.

    A suitable name would be PyLaia | Pellet - HTR (text line). In the description, you can add a link to the dataset on Europeana. The description supports Markdown input.
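    This naming convention is easy to script if you manage several models. A minimal sketch (the helper function below is purely illustrative, not part of Arkindex):

```python
# Build a model name following the convention suggested above:
# "<ML technology> | <dataset> (<element type>)".
# This helper is illustrative only, not an Arkindex API.
def model_name(technology: str, dataset: str, element_type: str) -> str:
    return f"{technology} | {dataset} ({element_type})"

print(model_name("PyLaia", "Pellet - HTR", "text line"))
# PyLaia | Pellet - HTR (text line)
```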

    Create a new model
    Information

    A model can hold multiple versions. If you run another training under different conditions (for a longer period of time, etc.), you do not have to create a brand new model.

    Start your training process🔗

    Now that you have a dataset and a model, we can create the training process.

    Create a process using all sets of the dataset. The procedure is the same as before, when we locked the dataset.

    The state of the dataset has changed; you should now see the following process selection.

    The dataset and its sets are selected

    Proceed to workers configuration. Press the Select workers button, search for PyLaia Training and press the Enter key.

    Search for the PyLaia Training worker

    Click on the name of the worker on the left and select the first version listed by clicking on the button in the Actions column.

    Add the PyLaia Training worker to the process

    Close the modal by clicking on the Done button on the bottom right.

    To improve your future model's performance, we will train on top of an existing, publicly available model. This is called fine-tuning. To do so, click on the button in the Model version column of the PyLaia Training worker. In the modal that opens:

    1. Pick the PyLaia Hugin Munin model name,
    2. Add the first listed model version by clicking on Use in the Actions column,
    3. Close the modal by clicking on Ok, in the bottom right corner.
    Add the model to fine-tune to the training process

    Then, configure the PyLaia Training worker by clicking on the button in the Configuration column. This will open a new modal, where you can set specific parameters used for training. The full description of the fields is available on the worker's description page.

    Worker's description

    Select New configuration in the left column to create a new configuration. Again, name it after the dataset you are using.

    Configure the worker

    The most important parameters are:

    • Element type to extract: add the slug of the element type which contains the transcriptions,
    • Model that will receive the new trained version: search for the name of your model,
    • Training setup | List of layers to freeze during training: click on the + button and add two elements to the list: conv and rnn. This will freeze the first layers of the model, reducing the number of parameters to train. This is necessary in our case to avoid overfitting since the training dataset is quite small.
    • Training setup | Max number of epochs: the default value is good enough, but you can set it to a larger number if you want to train for a longer period of time.
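    Conceptually, freezing a layer means excluding its parameters from gradient updates, so only the remaining layers are trained. The sketch below illustrates what prefixes like conv and rnn select; the parameter names are hypothetical stand-ins, not PyLaia's actual internals (in a real PyTorch model you would set param.requires_grad = False instead):

```python
# Hypothetical parameter names standing in for a CRNN model's layers.
params = {
    "conv.weight": True, "conv.bias": True,        # convolutional block
    "rnn.weight_ih": True, "rnn.weight_hh": True,  # recurrent block
    "linear.weight": True, "linear.bias": True,    # output layer
}  # name -> trainable?

# Mark every parameter whose name starts with one of the given prefixes
# as frozen, mirroring the "list of layers to freeze" worker field.
def freeze(params: dict, prefixes: list) -> list:
    for name in params:
        if any(name.startswith(prefix) for prefix in prefixes):
            params[name] = False
    # Return the names of the parameters that remain trainable.
    return [name for name, trainable in params.items() if trainable]

print(freeze(params, ["conv", "rnn"]))
# ['linear.weight', 'linear.bias']
```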

    Click on Create then Select when you are done filling the fields. Your process is ready to go.

    Configured process, ready to launch

    Click on the Run process button to launch the process.

    Training process is running

    While it is running, the logs of the tasks are displayed. Multiple things happen during this process:

    1. The dataset is converted into the right format for PyLaia models,
    2. A PyLaia model is created,
    3. Training starts, for as long as needed,
    4. The new model is published on Arkindex.
    Information

    During training, you may encounter warnings such as:

    2024-07-24 11:03:19,927 WARNING/laia.losses.ctc_loss: The following samples in the batch were ignored for the loss computation: ['/tmp/tmp4e_5gt4d-train/images/train/d177a42d-f082-4fb2-8323-0ced4a230acf/9889d180-3b92-4479-b025-84763aae5a6a']

    This is the expected behavior since PyLaia ignores vertical lines during training.
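    If you want to tally which samples were skipped, these warning lines can be parsed from the task logs. A minimal sketch, based only on the log format shown above (the paths here are shortened placeholders):

```python
import re

# A warning line in the format shown above (sample paths shortened).
log = (
    "2024-07-24 11:03:19,927 WARNING/laia.losses.ctc_loss: The following "
    "samples in the batch were ignored for the loss computation: "
    "['/tmp/images/a.jpg', '/tmp/images/b.jpg']"
)

# Extract the ignored sample paths from a CTC-loss warning line.
def ignored_samples(line: str) -> list:
    match = re.search(r"ignored for the loss computation: \[(.*)\]", line)
    if not match:
        return []
    return [path.strip(" '") for path in match.group(1).split(",")]

print(ignored_samples(log))
# ['/tmp/images/a.jpg', '/tmp/images/b.jpg']
```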

    When the process is finished, visit the page of your model to see your brand new trained model version. To do so, browse to the Models page and search for your model.

    The new model version is displayed under the model

    You can download it to use it on your own or you can use it to process pages already on Arkindex, as described in the next section.

    Evaluation by inference🔗

    Graphs give you an idea of how the model performs on unknown data. During training, the model saw all images from both the train and val partitions. The test partition is made up of images unknown to the model.
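    As a reminder of how such partitions are typically built, here is an illustrative split of element IDs into the three sets. The 80/10/10 ratios are an assumption for the example; the actual sets of your dataset were fixed on Arkindex when it was created:

```python
import random

# Illustrative 80/10/10 split of element IDs into train/val/test.
# The ratios are an example, not the ones Arkindex actually used.
def split_ids(ids, seed=0):
    ids = list(ids)
    random.Random(seed).shuffle(ids)  # deterministic shuffle
    n = len(ids)
    n_train, n_val = int(n * 0.8), int(n * 0.1)
    return {
        "train": ids[:n_train],
        "val": ids[n_train:n_train + n_val],
        "test": ids[n_train + n_val:],  # held out, unseen during training
    }

sets = split_ids(range(20))
print({name: len(elements) for name, elements in sets.items()})
# {'train': 16, 'val': 2, 'test': 2}
```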

    It is easier to judge the model when its predictions are actually displayed. In this section, you will learn how to process the test set of your dataset with your newly trained model.

    Creating the process🔗

    Browse to the folder containing the elements of the dataset, in the project you created in the earlier steps of the tutorial.

    Folders containing the elements of the dataset

    Click on the test folder. Elements in the test set will be displayed. Then click on Create process in the Actions menu.

    Create a process on the `test` set

    Filter elements by the page type and enable the Load children toggle to display page elements.

    Filter the elements to process to list pages

    Click on Configure workers to move on to worker selection. Press the Select workers button, search for PyLaia Generic and press the Enter keyboard key. Just like we did in the previous sections, click on the name of the worker on the left and select the first version listed by clicking on the button in the Actions column.

    Add the PyLaia Generic worker to the process

    Close the modal by clicking on the Done button on the bottom right.

    Now it is time to select the model you trained. Click on the button in the Model version column. In the modal that opens:

    1. Trigger the Show all models toggle,
    2. Look for the name of your trained model,
    3. Add the model version by clicking on Use in the Actions column,
    4. Close the modal by clicking on Ok, in the bottom right corner.
    Add your trained model to the process

    The process is ready and you can launch it using the Run process button. Wait for its completion before moving to the next step.

    Visualizing predictions🔗

    To see the predictions of your model, browse back to the test folder in your project. There you can click on one of the displayed pages and highlight a text line by selecting it from the children tree displayed on the left.

    Highlight a text line element on a page

    On all text lines of the test set, you can see several transcriptions, coming either from the annotations made in Callico or from the model's predictions.

    A text line transcribed by a human that also has a predicted transcription

    On the transcriptions annotated by humans, Callico is mentioned. On the predicted transcriptions, PyLaia is mentioned. The confidence score of the PyLaia prediction is also displayed.

    Information

    In this tutorial, we do not compute evaluation scores for this transcription model, as that would require running scoring tools through procedures outside the Arkindex and Callico frameworks.
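    For reference, the standard metric for such an evaluation is the Character Error Rate (CER): the edit distance between the predicted and reference transcriptions, divided by the reference length. A minimal, framework-free sketch of the idea:

```python
# Character Error Rate: Levenshtein distance between the predicted and
# reference transcriptions, divided by the reference length.
def cer(reference: str, prediction: str) -> float:
    # Classic dynamic-programming edit distance, computed row by row.
    prev = list(range(len(prediction) + 1))
    for i, ref_char in enumerate(reference, start=1):
        cur = [i]
        for j, pred_char in enumerate(prediction, start=1):
            cur.append(min(
                prev[j] + 1,                             # deletion
                cur[j - 1] + 1,                          # insertion
                prev[j - 1] + (ref_char != pred_char),   # substitution
            ))
        prev = cur
    return prev[-1] / len(reference)

print(cer("kitten", "sitting"))  # 3 edits over 6 characters -> 0.5
```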

    Next step🔗

    If the model's initial results are close to the ground truth annotations, you could use it on all your pages through a dedicated process.