Overview

    The goal of this long-form tutorial is to guide you through a complete handwritten or printed text recognition project using our products, Arkindex and Callico.

    The basic assumptions for this project are:

    • You have images containing handwritten or printed text,
    • You are able to read the text in these images,
    • There are too many images for manual transcription (more than 100 pages per person),
    • You want to retrieve the text from these images as PAGE XML files.

    Workflow

    Below is an overview of all the tasks we will need to complete in order to convert your images into text.

    Tasks to follow for a full transcription project using Machine Learning

    Arkindex and Callico access

    Goal: Get an account on an Arkindex and a Callico instance.

    You have several options to get an account on Arkindex & Callico instances:

    1. Create an account on a publicly available instance, such as our own demonstration instances.
    2. Self-host Arkindex and Callico on your own servers. Arkindex is available under an open-source license and a proprietary one. Self-hosting instructions are available for both Arkindex and Callico. We do not recommend this approach if you are not a seasoned system administrator.

    Once you have access to an Arkindex instance, you can follow these instructions to create an account.

    Project creation

    Goal: Import images into a new Arkindex project.

    This tutorial documents how to import data from a publicly available source: Europeana collections.

    You can also import your own images, either through the Arkindex upload form for a few images, or as ZIP archives similar to Transkribus exports for larger datasets.

    If you have rather large datasets (over 10 GB in size), please contact us directly to work out a solution for you on our own instances.
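
    To find out which side of that 10 GB threshold your dataset falls on, you can measure it locally before importing. The following is a minimal sketch, not part of Arkindex or Callico: the function names and the list of image extensions are illustrative assumptions, and it assumes decimal gigabytes since the text does not specify the unit convention.

```python
from pathlib import Path

# Threshold from the text above: over 10 GB, contact us instead of uploading.
# Decimal gigabytes are an assumption; the tutorial does not say GB vs GiB.
LARGE_DATASET_BYTES = 10 * 1000**3

# Hypothetical set of image extensions; adjust it to your own files.
IMAGE_EXTENSIONS = {".jpg", ".jpeg", ".png", ".tif", ".tiff"}


def dataset_size_bytes(folder: Path) -> int:
    """Return the total size in bytes of all image files under `folder`."""
    return sum(
        path.stat().st_size
        for path in folder.rglob("*")
        if path.is_file() and path.suffix.lower() in IMAGE_EXTENSIONS
    )


def is_large_dataset(size_bytes: int) -> bool:
    """Return True when the size is strictly above the 10 GB threshold."""
    return size_bytes > LARGE_DATASET_BYTES
```

    For example, `is_large_dataset(dataset_size_bytes(Path("my_images")))` tells you whether the hypothetical `my_images` folder is above the threshold.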

    Classification

    Goal: Train a Machine Learning model to classify your images, so that irrelevant ones (empty pages, pages without text, etc.) can be filtered out.

    Information

    This part of the tutorial is not available yet.

    Segmentation

    Goal: Train a Machine Learning model to identify zones of interest on your images (lines, illustrations, ...).

    Once your images are available in Arkindex, the first step is to annotate them for segmentation on Callico. This takes some time, as you need to annotate a random subset of your images (at least 100 documents).
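
    The idea of a random subset can be illustrated with a short sketch. This is not how Arkindex or Callico select images; it only shows reproducible random sampling on a list of identifiers. The function name, the `seed` parameter and the identifier format are assumptions made for the example.

```python
import random

# Minimum number of annotated documents mentioned in the text above.
MIN_ANNOTATED = 100


def pick_annotation_subset(image_ids, size=MIN_ANNOTATED, seed=0):
    """Return a reproducible random sample of `size` distinct image ids.

    Sorting first makes the result independent of the input order,
    so the same seed always yields the same subset.
    """
    items = sorted(image_ids)
    if len(items) < size:
        raise ValueError(
            f"Need at least {size} images, but only {len(items)} were given"
        )
    return random.Random(seed).sample(items, size)
```

    Sampling at random, rather than taking the first 100 images, helps the annotated subset reflect the variety of layouts and hands found across the whole collection.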

    When the annotations are published back to Arkindex, you can use them to train a new segmentation Machine Learning model that will later be used to identify the position of text lines and illustrations.

    Transcription

    Goal: Train a Machine Learning model to transcribe the text found in your images.

    Once text lines have been segmented from your images and are available in Arkindex, you need to transcribe them on Callico. This takes some time, as you need to transcribe a random subset of your images (at least 100 documents).

    When the transcriptions are published back to Arkindex, you can use them to train a new transcription Machine Learning model that will later be used to transcribe the text on your images.

    Production

    Goal: Execute all these models on all your images to produce meaningful results.

    Once you have trained a segmentation model and a transcription model using Arkindex, you can run them on all your images: the segmentation model locates text lines and illustrations, and the transcription model transcribes the text.

    Export

    Goal: Export all your data from your Arkindex project so you can use it elsewhere.

    When segmentation and transcription predictions from your own models have been added to your project, you can easily export them from Arkindex in PAGE XML format.
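
    Once you have PAGE XML files, the transcribed text can be read back with any XML library. The sketch below uses Python's standard library and assumes the common PAGE layout in which each `TextLine` element holds its text in a `TextEquiv/Unicode` child. The sample document is hand-written and trimmed (it omits coordinates, for instance), and the exact structure and schema version of the files exported by Arkindex are assumptions, which is why the code ignores the XML namespace.

```python
import xml.etree.ElementTree as ET

# Hand-written, trimmed sample; real exports carry more attributes and elements.
SAMPLE = """<PcGts xmlns="http://schema.primaresearch.org/PAGE/gts/pagecontent/2019-07-15">
  <Page imageFilename="page_0001.jpg" imageWidth="1000" imageHeight="1500">
    <TextRegion id="r1">
      <TextLine id="l1"><TextEquiv><Unicode>First line</Unicode></TextEquiv></TextLine>
      <TextLine id="l2"><TextEquiv><Unicode>Second line</Unicode></TextEquiv></TextLine>
    </TextRegion>
  </Page>
</PcGts>"""


def local_name(tag: str) -> str:
    """Strip the XML namespace from a tag such as '{namespace}TextLine'."""
    return tag.rsplit("}", 1)[-1]


def extract_line_texts(root: ET.Element) -> list[str]:
    """Return the text of every TextLine element, in document order."""
    texts = []
    for element in root.iter():
        if local_name(element.tag) != "TextLine":
            continue
        # Only inspect direct children: nested Word elements may carry
        # their own TextEquiv, which would duplicate the line text.
        for child in element:
            if local_name(child.tag) != "TextEquiv":
                continue
            unicode_node = next(
                (node for node in child if local_name(node.tag) == "Unicode"),
                None,
            )
            if unicode_node is not None and unicode_node.text:
                texts.append(unicode_node.text)
            break  # use the first TextEquiv of the line only
    return texts


print(extract_line_texts(ET.fromstring(SAMPLE)))
```

    To process a file on disk, pass `ET.parse(path).getroot()` to `extract_line_texts` instead of the parsed sample.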

    Information

    You can also use the advanced export options of our CLI tool to export your results as CSV, PDF, ALTO XML or Word DOCX.