Corpus import

In this tutorial, you will learn how to import images into Arkindex.

Corpus description

As an example, you will import the Pellet dataset from the Europeana 1914-1918 collection.
The corpus contains 471 scanned pages related to Casimir Marius PELLET, a French soldier during World War I. Each document has been transcribed by volunteers and includes descriptive metadata.

Info

Annotations from Europeana are available at page level. However, most Machine Learning models require line level annotations.
In this tutorial, we will show you how to create ground truth for text line segmentation and transcription, and how to train machine learning models from these annotations.

The pages are written in French and include various content types, such as campaign diaries, photographs, and postcards. We have selected this corpus as it covers a large variety of pages, as illustrated below.

Pages from the Pellet corpus

Of course, you may import your own data directly into Arkindex using file uploads. Arkindex supports images, PDFs, METS, ALTO, ZIP archives compatible with Transkribus, and more; see this section for more details.

Create a project in Arkindex

Info

This section expects you to have an Arkindex account. Learn how to register here.

Log in to Arkindex by entering your email and password.

On the front page, you will find an empty project entitled My Project. We will publish the data from Europeana in this project. Alternatively, you can create a new project by clicking on the New Project button at the top right of the page. Note that this project is personal and can only be accessed by you.

To edit your project name and description:

  • Click on My Project
Select your project
  • Go to your project information page
Go to your project information page
  • Edit your project name and description and click on Update
  • Name: Europeana | Pellet
  • Description: Corpus from [Europeana](https://europeana.transcribathon.eu/documents/story/?story=121795)
Edit your project name and description

Info

The project Description field supports Markdown input.

Import data to Arkindex

For the purposes of this tutorial, we have prepared a ZIP archive, containing all the images from the Pellet corpus, which is freely available on our servers. You can download it directly from this link.
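
Before uploading, you may want to check that the download is complete. The sketch below counts the image entries in a ZIP archive so the result can be compared with the 471 pages of the corpus. It assumes the downloaded file is named europeana_pellet_images.zip and that pages use one of the listed extensions; both are assumptions, so adjust them to your download.

```python
import zipfile
from pathlib import PurePosixPath

# Assumption: the pages are stored with one of these extensions.
IMAGE_SUFFIXES = {".jpg", ".jpeg", ".png", ".tif", ".tiff"}


def count_images(archive) -> int:
    """Count the entries of a ZIP archive whose extension looks like an image."""
    with zipfile.ZipFile(archive) as zf:
        return sum(
            1
            for info in zf.infolist()
            if not info.is_dir()
            and PurePosixPath(info.filename).suffix.lower() in IMAGE_SUFFIXES
        )


# Example (assumed file name):
# print(count_images("europeana_pellet_images.zip"))
```

If the count differs from 471, the download may be incomplete or the archive may use an extension that is not in the list.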

Once you have downloaded the data, you can import it to Arkindex. To do so, go to your project, then click on Import / Export > Import files.

Access the import page from your project

You will be redirected to a new page from where you can import files to Arkindex. Click on the Select files… button located next to the From local files label, and browse your file system to find the ZIP archive you just downloaded.

Add the downloaded archive to the files to import

The archive upload to Arkindex will take from a few seconds to a couple of minutes.

The archive is being uploaded

Once the archive is successfully uploaded to Arkindex, a green tick is displayed next to its name, in the list of Available files to import.

The archive was successfully uploaded and can be imported

This means that you can proceed to the next step and click the blue Import button in the bottom-right corner of the current page.

You will be redirected to the Process status page. Wait a bit for the process to start (i.e. for its status to go from Unscheduled to Running). This process will extract the ZIP archive and upload every image it contains to Arkindex in a few moments.

The import process is running and extracting the archive

Once your process has ended (i.e. its status has changed to Completed), you can navigate back to your project to view the 471 imported images by clicking your project’s name under the Project label.

Navigate back to your project from the process page

From there, you should be able to browse through the newly created folder named europeana_pellet_images.zip:

Browse the images from the imported folder

You can also rename this folder to PELLET casimir marius (which is much nicer) by clicking the small pencil icon next to its name, at the top-right corner of the page. Do not forget to validate your input by clicking the blue pencil button once you are done.

Rename the imported folder

Info

This import procedure is simplified and only allows you to import partial data from the Pellet corpus. This is sufficient for this tutorial, since we will only be using images.

However, the Pellet corpus is much more substantial, as it also contains a large amount of metadata and page level transcriptions from Europeana. If you wish to import this additional data, you can follow the advanced import tutorial at the bottom of this page.

Data partitioning

To train Machine Learning models, you first need to select a random sample of the corpus. In this tutorial, we will limit the sample to 100 documents to reduce the annotation effort. From this sample, you will create three sets for training, validation and evaluation.

  • 80 page elements (80% of the sample) in the train set (used for model training),
  • 10 page elements (10% of the sample) in the dev set (used for model validation),
  • 10 page elements (10% of the sample) in the test set (used for model evaluation).

Info

In a real HTR project, you would typically select a larger subset of the corpus, using the same partitioning strategy. For example you could sample 500 documents: 400 for training, 50 for validation and 50 for testing.
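
The partitioning strategy can be sketched in a few lines of Python. This is only an illustration of the 80/10/10 split, not what Arkindex runs internally; the page identifiers, the function name and the fixed seed are hypothetical choices made for the example.

```python
import random


def sample_and_split(page_ids, sample_size=100, ratios=(0.8, 0.1, 0.1), seed=0):
    """Draw a random sample and cut it into train/dev/test sets."""
    rng = random.Random(seed)  # fixed seed so the split is reproducible
    sample = rng.sample(list(page_ids), sample_size)
    n_train = round(sample_size * ratios[0])
    n_dev = round(sample_size * ratios[1])
    return {
        "train": sample[:n_train],
        "dev": sample[n_train:n_train + n_dev],
        "test": sample[n_train + n_dev:],  # whatever remains goes to test
    }


# Hypothetical identifiers standing in for the 471 imported pages.
pages = [f"page_{i:03d}" for i in range(471)]
sets = sample_and_split(pages)
print({name: len(items) for name, items in sets.items()})
# {'train': 80, 'dev': 10, 'test': 10}
```

Calling `sample_and_split(pages, sample_size=400)` would keep the same proportions on a larger sample. Because the three sets are slices of one shuffled sample, no page can end up in two sets.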

Dataset creation

First, you need to create an Arkindex dataset. To do that, go to your project, then click on Project > Project information > Datasets > +. Enter a description of this dataset, and set the following set names:

  • train,
  • dev,
  • test.
Create the dataset

Add pages to a dataset set

Arkindex offers the possibility to automatically populate your dataset. The images will be selected randomly from the folder. Checking the “Require unique elements among sets” option ensures that there will be no data leakage between sets.
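
To make the notion of data leakage concrete, here is a minimal sketch of a check that reports elements present in more than one set. It works on plain Python lists of identifiers; the function name and the sample identifiers are hypothetical and unrelated to the Arkindex API.

```python
from collections import Counter


def find_leaks(sets):
    """Return the elements that appear in more than one set."""
    counts = Counter()
    for items in sets.values():
        counts.update(set(items))  # count each element once per set
    return sorted(element for element, n in counts.items() if n > 1)


# "b" is in both train and dev, so it is reported as a leak.
print(find_leaks({"train": ["a", "b"], "dev": ["b"], "test": ["c"]}))
# ['b']
```

An empty result means the sets are disjoint, which is the property the option enforces.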

To select 100 pages and add them to the dataset, follow these steps:

  1. Go to the folder named PELLET casimir marius
  2. In the Processes menu, click on Populate a dataset
Populate the dataset
  3. Select the empty dataset you just created, and then select the type of elements you need.
    • Leave the Select child elements recursively toggle untoggled, because you only want the page elements for your dataset.
    • Under Filter by element types, click on Page. The dataset should only have Page elements.
  4. Under Number of elements, input 100 to select 100 pages at random.
  5. Under Distribution per set, select the proportion of images you need for each set. This can be done either by using the sliders, or by typing in the numbers.
    • 80% will go to train
    • 10% will go to dev
    • 10% will go to test
  6. Validate by clicking on Populate.
Populate the dataset with 100 pages

Your dataset is now populated.

Visualize your dataset

Click on Project > Project information > Datasets, then select your dataset to visualize its content:

The Pellet dataset imported in Arkindex

Next steps

You can now annotate text lines and illustrations. This will provide you with ground truth data to train a segmentation model on Arkindex.


Optional section - Full data import to Arkindex

Warning

This section is intended for advanced users who wish to import the entire Pellet corpus into Arkindex (images, transcriptions and metadata).

The following instructions are NOT needed to proceed with the rest of this tutorial.

Moreover, importing page level transcriptions from Europeana will not reduce the workload of this tutorial. They are not sufficient to train a Machine Learning model to transcribe text: such models usually work at line level, so they need both the location of each line (segmentation) and its transcription.

Two steps are required to import the Pellet corpus, in its entirety, to Arkindex:

  1. Extract the data from Europeana (images, transcriptions and metadata)
  2. Publish it to your Arkindex project

Info

You will need Python 3.10 and a shell environment (we recommend Ubuntu or macOS).

We have released a Python package named teklia-scrapers to help you achieve these steps. To install it in your environment, run:

pip install teklia-scrapers

Data extraction

To extract data from the Europeana website, you need to specify two arguments:

  • --story_id: the Europeana identifier for the Pellet corpus ("121795")
  • --output_dir: the directory in which the corpus will be extracted ("pellet_corpus")

Running the following command will start the extraction:

scrapers eu-trans --story_id 121795 --output_dir pellet_corpus/

Warning

The command should take about 2 hours to complete, depending on your network connection and the current availability of Europeana. If you do not have that much time, you can download the data directly from this link.

Once the extraction is done, you will find a JSON file named 121795.json in the directory named pellet_corpus/.
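
As a quick sanity check before publishing, you can confirm that the extracted file is valid JSON. The sketch below only parses the file; it makes no assumption about its internal structure, which is not described here. The function name is hypothetical.

```python
import json
from pathlib import Path


def load_story(output_dir, story_id):
    """Parse <output_dir>/<story_id>.json and return the decoded content."""
    path = Path(output_dir) / f"{story_id}.json"
    with path.open(encoding="utf-8") as f:
        return json.load(f)


# Example, using the values from the extraction command:
# data = load_story("pellet_corpus", "121795")
# print(type(data).__name__)
```

A `FileNotFoundError` or `json.JSONDecodeError` at this stage suggests the extraction did not complete.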

Publication to Arkindex

Then, you can use the scrapers publish command to publish the data to Arkindex.

You will need to provide the following arguments:

  • --arkindex-api-url: The Arkindex instance in which you wish to import the corpus. By default, you should use https://demo.arkindex.org/.
  • --arkindex-api-token: Your API token. If you do not know your API token, refer to this page.
  • --corpus-id: The UUID of the Arkindex project created in the previous step. This value can be copied from your Arkindex project details page, just below its name.
Find your project's UUID on Arkindex
  • --worker-run-id: The worker run UUID that will be used to import the data. Refer to this page to create your own worker run.
  • --folder-type: The type of the top level element ("folder")
  • --page-type: The type of the child level elements ("page")
  • --report: The path to the JSON report file ("report.json")
  • folder: The path to the local directory containing the 121795.json JSON file, generated using the previous command ("pellet_corpus/")
scrapers publish --folder-type folder \
                 --page-type page \
                 --report report.json \
                 --corpus-id aaaaaaaa-bbbb-cccc-dddd-eeeeeeeeeeee \
                 --arkindex-api-url https://demo.arkindex.org/ \
                 --arkindex-api-token aaaaaaaa-bbbb-cccc-dddd-eeeeeeeeeeee \
                 --worker-run-id aaaaaaaa-bbbb-cccc-dddd-eeeeeeeeeeee \
                 pellet_corpus/
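
If you prefer not to paste your API token on the command line, one option is to assemble the same command from Python and read the token from an environment variable. This is a sketch under assumptions: the variable name ARKINDEX_API_TOKEN and the function name are hypothetical, and only the flags listed above are used.

```python
import os


def build_publish_command(corpus_id, worker_run_id, folder,
                          api_url="https://demo.arkindex.org/",
                          report="report.json",
                          token_var="ARKINDEX_API_TOKEN"):
    """Assemble the argument list for `scrapers publish`."""
    # ARKINDEX_API_TOKEN is a hypothetical variable name chosen for this sketch.
    token = os.environ[token_var]
    return [
        "scrapers", "publish",
        "--folder-type", "folder",
        "--page-type", "page",
        "--report", report,
        "--corpus-id", corpus_id,
        "--arkindex-api-url", api_url,
        "--arkindex-api-token", token,
        "--worker-run-id", worker_run_id,
        folder,  # positional argument: directory containing 121795.json
    ]


# To run it, pass the list to subprocess.run(..., check=True).
```

Passing the arguments as a list avoids shell quoting issues and the need for line continuations.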

Once the import is finished, you should be able to navigate through the folder named PELLET casimir marius in Arkindex:

An image from the Pellet corpus in Arkindex

Then, you can go back up this page to the Data partitioning section, where you will learn how to partition your data to create a dataset made up of three sets.

Optional section - Extended import capabilities

Transkribus collections

The procedure to import Transkribus collections, containing images and annotations, to Arkindex is documented here.

PAGE XML files

Warning

This section is intended for advanced users who wish to import their own data to Arkindex.

The following instructions are NOT needed to proceed with the rest of this tutorial.

If you want to import PAGE XML files to Arkindex, you can follow this documentation.