Overview

    The goal of this long-form tutorial is to guide you through a complete handwritten or printed text recognition project, using our tools Arkindex and Callico.

    The basic assumptions for this project are:

    • You have images containing handwritten or printed text,
    • You are capable of reading the text on these images,
    • The number of images is too large for manual transcription (more than 100 pages per person),
    • You want to retrieve the text from these images as a CSV, PDF, Word, raw text or XML file.

    Workflow

    Below is an overview of all the tasks we will need to complete in order to convert your images into text.

    [Figure: Tasks to follow for a full transcription project using Machine Learning]

    Arkindex and Callico access

    Goal: Get an account on an Arkindex and Callico instance.

    You have several options to get an account on Arkindex & Callico instances:

    1. Create an account on a publicly available instance, such as our own demonstration instances.
    2. Self-host Arkindex and Callico on your own servers. Arkindex is available under an open-source license and a proprietary one. Self-hosting instructions are available for both Arkindex and Callico. We do not recommend this approach if you are not a seasoned system administrator.

    Once you have access to an Arkindex instance, you can follow these instructions to create an account.

    Project creation

    Goal: Import images into a new Arkindex project.

    This tutorial documents how to import data from a publicly available source: the Europeana collections.

    You can also import your own images: use the Arkindex upload form for a few images, or ZIP archives similar to Transkribus exports for larger datasets.

    If your dataset is large (over 10 GB), please contact us directly so we can work out a solution for you on our own instances.
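
    To decide which import route applies, it can help to measure your dataset first. The following is a minimal sketch, not part of Arkindex or Callico: it assumes your images sit in a local folder, and the names `dataset_size_bytes`, `is_large_dataset` and the list of file extensions are illustrative choices. It also assumes that "10 GB" means 10 * 1000**3 bytes.

```python
from pathlib import Path

# Assumption: these extensions cover your images; adjust to your dataset.
IMAGE_EXTENSIONS = {".jpg", ".jpeg", ".png", ".tif", ".tiff"}
# Assumption: "10 GB" is read as 10 * 1000**3 bytes.
LARGE_DATASET_BYTES = 10 * 1000**3


def dataset_size_bytes(folder):
    """Sum the sizes of all image files found under `folder`, recursively."""
    return sum(
        path.stat().st_size
        for path in Path(folder).rglob("*")
        if path.is_file() and path.suffix.lower() in IMAGE_EXTENSIONS
    )


def is_large_dataset(size_bytes, threshold=LARGE_DATASET_BYTES):
    """Return True when the size is strictly above the threshold."""
    return size_bytes > threshold
```

    Calling `is_large_dataset(dataset_size_bytes("my_images"))` on a hypothetical `my_images` folder tells you whether the dataset exceeds the threshold mentioned above.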

    Segmentation

    Goal: Train a Machine Learning model to identify zones of interest on your images (lines, illustrations, ...).

    Once your images are available in Arkindex, the first step is to annotate them for segmentation. This takes some time, as you need to produce annotations for a small subset of your images (from 2 to 10% of the overall dataset).
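
    One way to choose which images to annotate is to draw a random sample of the dataset. The sketch below is only an illustration under stated assumptions: the function name `pick_annotation_subset`, the default of 5% and the fixed seed are hypothetical choices, not something prescribed by Arkindex or Callico.

```python
import math
import random


def pick_annotation_subset(image_names, percent=5, seed=0):
    """Return a reproducible random sample of images to annotate.

    `percent` is the share of the dataset to sample (for example 2 to 10).
    """
    if not 0 < percent <= 100:
        raise ValueError("percent must be in the range (0, 100]")
    names = list(image_names)
    if not names:
        return []
    # Round up, so that even a tiny dataset yields at least one image.
    count = max(1, math.ceil(len(names) * percent / 100))
    # A seeded generator makes the selection repeatable between runs.
    rng = random.Random(seed)
    return sorted(rng.sample(names, count))
```

    With 1,000 images and `percent=5`, this selects 50 images. Random sampling is only one possible strategy; you may prefer to pick pages by hand so that every layout in your collection is represented.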

    Once the annotations are published on Arkindex, you can use them to train a new segmentation Machine Learning model, which will later be used to identify the positions of text lines and illustrations.

    Classification

    Goal: Train a Machine Learning model to classify your images so that unusable ones (empty, no text, ...) can be filtered out.

    Information

    This part of the tutorial is not available yet.

    Transcription

    Goal: Train a Machine Learning model to transcribe the text found in your images.

    Information

    This part of the tutorial is not available yet.

    Production

    Goal: Execute all these models on all your images to produce meaningful results.

    Information

    This part of the tutorial is not available yet, but you can read how to run a process in the generic documentation.

    Export

    Goal: Export all your data from your Arkindex project so you can use it elsewhere.

    Information

    This part of the tutorial is not available yet, but you can read how to export a project as an SQLite database, or use the export options of our CLI tool (CSV, PDF, ALTO XML, Word DOCX).
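
    Once you have an SQLite export on disk, you can inspect it with Python's standard `sqlite3` module. The sketch below makes no assumption about the schema of the export: it only lists whatever tables the file contains and counts their rows. The helper names `list_tables` and `count_rows`, and the file name used in the usage note, are hypothetical.

```python
import sqlite3


def list_tables(database_path):
    """Return the names of all tables in an SQLite file, sorted by name."""
    connection = sqlite3.connect(database_path)
    try:
        rows = connection.execute(
            "SELECT name FROM sqlite_master WHERE type = 'table' ORDER BY name"
        ).fetchall()
    finally:
        connection.close()
    return [name for (name,) in rows]


def count_rows(database_path, table):
    """Return the number of rows in `table`, which must exist in the file."""
    # Check the name against the real table list before using it in SQL.
    if table not in list_tables(database_path):
        raise ValueError(f"unknown table: {table}")
    connection = sqlite3.connect(database_path)
    try:
        (count,) = connection.execute(
            f'SELECT COUNT(*) FROM "{table}"'
        ).fetchone()
    finally:
        connection.close()
    return count
```

    For example, `list_tables("export.sqlite")` on a hypothetical `export.sqlite` file shows which tables are available before you write more specific queries. Note that `sqlite3.connect` creates an empty database when the path does not exist, so check the file name first.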