Corpus import

In this tutorial, you will learn how to import images in Arkindex.

Corpus description¶

As an example, you will import the Pellet dataset from the Europeana 1914-1918 collection.
The corpus contains 471 scanned pages related to Casimir Marius PELLET, a French soldier during World War I. Each document has been transcribed by volunteers and includes descriptive metadata.

Info

Annotations from Europeana are available at page level. However, most Machine Learning models require line level annotations.
In this tutorial, we will show you how to create ground truth for text line segmentation and transcription, and how to train machine learning models from these annotations.

The pages are written in French and include various content types, such as campaign diaries, photographs, and postcards. We have selected this corpus as it covers a large variety of pages, as illustrated below.

Of course you may import your own data directly in Arkindex, using file uploads. Arkindex supports images, PDFs, METS, ALTO, ZIP archive compatible with Transkribus, etc, see this section for more details.

Create a project in Arkindex¶

Info

This section expects you to have an Arkindex account. Learn how to register here.

Log in to Arkindex by entering your email and password.

On the front page, you will find an empty project entitled My Project. We will publish the data from Europeana in this project. Alternatively, you can create a new project by clicking on the New Project button at the top right of the page. Note that this project is personal and can only be accessed by you.

To edit your project name and and description:

Click on My Project

Go on your project information page

Edit your project name and description and click on Update
Name: Europeana | Pellet
Description: Corpus from [Europeana](https://europeana.transcribathon.eu/documents/story/?story=121795)

Info

The project Description field supports Markdown input.

Import data to Arkindex¶

For the purposes of this tutorial, we have prepared a ZIP archive, containing all the images from the Pellet corpus, which is freely available on our servers. You can download it directly from this link.

Once you have downloaded the data, you can import it to Arkindex. To do so, go to your project, then click on Import / Export > Import files.

Access the import page from your project

You will be redirected to a new page from where you can import files to Arkindex. Click on the Select files… button located next to the From local files label, and browse your file system to find the ZIP archive you just downloaded.

Add the downloaded archive to the files to import

The archive upload to Arkindex will take from a few seconds to a couple of minutes.

Once the archive is successfully uploaded to Arkindex, a green tick is displayed next to its name, in the list of Available files to import.

The archive was successfully uploaded and can be imported

It means that you can proceed to the next step and click the Import blue button available in the bottom-right corner of the current page.

You will be redirected to the Process status page, wait a bit for it to start (i.e. for its status to go from Unscheduled to Running). This process will extract the ZIP archive and upload every image it contains to Arkindex in a few moments.

The import process is running and extracting the archive

Once your process has ended (i.e. its status has changed to Completed), you can navigate back to your project to view the 471 imported images by clicking your project’s name under the Project label.

Navigate back to your project from the process page

From there, you should be able to browse through the newly created folder named europeana_pellet_images.zip:

Browse the images from the imported folder

You can also rename this folder to PELLET casimir marius (which is much nicer) by clicking the small pencil icon, next to its name, at the top-right corner of the page. Do not forget to validate your input by clicking the pencil blue button once you are done.

Info

This import procedure is simplified and only allows you to import partial data from the Pellet corpus. This is sufficient for this tutorial, since we will only be using images.

However, the Pellet corpus is much more substantial, as it also contains a large amount of metadata and page level transcriptions from Europeana. If you wish to import this additional data, you can follow the advanced import tutorial at the bottom of this page.

Data partitioning¶

To train Machine Learning models, you first need to select a random sample of the corpus. In this tutorial, we will limit the sample to 100 documents to reduce the annotation effort. From this sample, you will create three sets for training, validation and evaluation.

80 page elements (80% of the sample) in the train set (used for model training),
10 page elements (10% of the sample) in the dev set (used for model validation),
10 page elements (10% of the sample) in the test set (used for model evaluation).

Info

In a real HTR project, you would typically select a larger subset of the corpus, using the same partitioning strategy. For example you could sample 500 documents: 400 for training, 50 for validation and 50 for testing.

Dataset creation¶

First, you need to create an Arkindex dataset. To do that, go to your project, then click on Project > Project information > Datasets > +. Enter a description of this dataset, and set the following set names:

train,
dev,
test.

Add pages to a dataset set¶

Arkindex offers the possibility to automatically populate your dataset. The images will be selected randomly from the folder. Checking the “Require unique elements among sets” ensures that there will be no data leakage between sets.

To select 100 pages and add them to the dataset, follow these steps:

Go to the folder named PELLET casimir marius
In the Processes menu, click on Populate a dataset

Select the empty dataset you just created, and then select the type of elements you need.
- Leave the Select child elements recursively toggle untoggled, because you only want the page elements for your dataset.
- Under Filter by element types, click on Page. The dataset should only have Page elements.
Under Number of elements, input 100 to select 100 pages at random.
Under Distribution per set, select the proportion of images you need for each set.
- 80% will go on train
- 10% will go on dev
- 10% will go on test This can be done either by using the cursors, or by typing in the numbers.
Validate by clicking on Populate.

Your dataset is now populated.

Visualize your dataset¶

Click on Project > Project information > Datasets, then select your dataset to visualize its content:

Next steps¶

You can now annotate text lines and illustrations. This will provide you ground truth data to train a segmentation model on Arkindex.

Optional section - Full data import to Arkindex¶

Warning

This section is intended for advanced users who wish to import the entire Pellet corpus into Arkindex (images, transcriptions and metadata).

The following instructions are NOT needed to proceed with the rest of this tutorial.

Moreover, importing page level transcriptions from Europeana will not reduce the workload of this tutorial. These are not enough to train a Machine Learning model to transcribe the text as they usually work at line level so we need both the location of the line (segmentation) and its transcription.

Two steps are required to import the Pellet corpus, in its entirety, to Arkindex:

Extract the data from Europeana (images, transcriptions and metadata)
Publish it to your Arkindex project

Info

You will need Python 3.10 and a shell environment (we recommend Ubuntu or Mac OS X)

We have released a Python package named arkindex-scrapers to help you achieve these steps. To install it to your environment, run:

pip install arkindex-scrapers

Data extraction¶

To extract data from the Europeana website, you need to specify two arguments:

--story_id: the Europeana identifier for the Pellet corpus ("121795")
--output_dir: the directory in which the corpus will be extracted ("pellet_corpus")

Running the following command will start the import:

scrapers eu-trans --story_id 121795 --output_dir pellet_corpus/

Warning

The command should take about 2 hours to complete, depending on your network connection and the current availability of Europeana. If you do not have that much time, you can download the data directly from this link.

Once the extraction is done, you will find a JSON file named 121795.json in the directory named pellet_corpus/.

Publication to Arkindex¶

Creating a user worker run¶

All data published on Arkindex that is not manually created needs a worker run UUID. To publish your data using the scrapers publish command, you must create a user worker run.

First, click on Worker runs in the e-mail address dropdown menu on Arkindex.

Then, click on the Create a worker run button in the top-right corner. This opens a modal in which you must select a worker version.

On the Arkindex Demo instance, a worker is available for the purpose of this tutorial: use the name filter / search bar on top of the workers list (left column) to look for the Arkindex CLI worker. Clicking on the worker in the list displays its details in the right-hand side panel. In that panel, open the Versions tab and select the first available worker version by clicking on the green + button.

On your own instance, you should create a new worker and worker version, following the user worker runs documentation. You can then select the created worker version by clicking on the green + button.

Select a worker version to create a new worker run

A success notification informs you that the worker run was successfully created. You can copy the worker run UUID from that notification, or close the modal and find the created worker run in the list of your worker runs.

Copy the worker run UUID from the worker runs list

Publishing the data to Arkindex¶

Then, you can use the scrapers publish command to publish the data to Arkindex.

You will need to provide the following arguments:

arkindex-api-url: The Arkindex instance in which you wish to import the corpus. By default, you should use https://demo.arkindex.org/.
arkindex-api-token: Your API token. If you do not know your API token, refer to this page.
--corpus-id: The UUID of the Arkindex project created in the previous step. This value can be copied from your Arkindex project details page, just below its name.

--worker-run-id: The UUID of the user worker run you created.
--folder-type: The type of the top level element ("folder")
--page-type: The type of the child level elements ("page")
--report: The path to the JSON report file ("report.json")
folder: The path to the local directory containing the 121795.json JSON file, generated using the previous command ("pellet_corpus/")

scrapers publish --folder-type folder \
                 --page-type page \
                 --report report.json \
                 --corpus-id aaaaaaaa-bbbb-cccc-dddd-eeeeeeeeeeee \
                 --arkindex-api-url https://demo.arkindex.org/ \
                 --arkindex-api-token aaaaaaaa-bbbb-cccc-dddd-eeeeeeeeeeee \
                 --worker-run-id aaaaaaaa-bbbb-cccc-dddd-eeeeeeeeeeee
                 pellet_corpus/

Once the import is finished, you should be able to navigate through the folder named PELLET casimir marius in Arkindex:

An image from the Pellet corpus in Arkindex

Then, you can move back up on this page to follow this section where you will learn how to partition your data to create a dataset made up of three sets.

Optional section - Extended import capabilities¶

Transkribus collections¶

The procedure to import Transkribus collections, containing images and annotations, to Arkindex is documented here.

PAGE XML files¶

Warning

This section is intended for advanced users who wish to import their own data to Arkindex.

The following instructions are NOT needed to proceed with the rest of this tutorial.

If you want to import PAGE XML files to Arkindex, you can follow this documentation.