In this tutorial, you will learn how to import images into Arkindex.
As an example, you will import the Pellet dataset from the Europeana 1914-1918 collection.
The corpus contains 471 scanned pages related to Casimir Marius PELLET, a French soldier during World War I. Each document has been transcribed by volunteers and includes descriptive metadata.
Annotations from Europeana are available at page level. However, most Machine Learning models require line level annotations. In this tutorial, we will show you how to create ground truth for text line segmentation and transcription, and how to train machine learning models from these annotations.
The pages are written in French and include various content types, such as campaign diaries, photographs, and postcards. We have selected this corpus as it covers a large variety of documents, as illustrated below.
Of course, you may import your own data directly into Arkindex using file uploads. Arkindex supports images, PDFs, METS, ALTO, ZIP archives compatible with Transkribus, etc.
This section expects you to have an Arkindex account. Learn how to register here.
Log in to Arkindex by entering your email and password.
On the front page, you will find an empty project entitled My Project. We will publish the data from Europeana in this project. Alternatively, you can create a new project by clicking on the New Project button at the top right of the page. Note that this project is personal and can only be accessed by you.
To edit your project name and description:
1. Open My Project.
2. Click on Update.
3. Set the name to Europeana | Pellet.
4. Set the description to: Corpus from [Europeana](https://europeana.transcribathon.eu/documents/story/?story=121795)
The project Description field supports Markdown input.
For the purposes of this tutorial, we have prepared a ZIP archive, containing all the images from the Pellet corpus, which is freely available on our servers. You can download it directly from this link.
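Before uploading, it can be useful to check that the downloaded archive is complete. The sketch below counts the image files stored in a ZIP archive using Python's standard library. The file name `europeana_pellet_images.zip` and the list of image extensions are assumptions made for illustration, not values confirmed by Arkindex.

```python
import zipfile
from pathlib import PurePosixPath

# Extensions treated as images here; this list is an assumption, adjust as needed.
IMAGE_EXTENSIONS = {".jpg", ".jpeg", ".png", ".tif", ".tiff"}


def count_images(archive_path):
    """Return the number of image files stored in a ZIP archive."""
    with zipfile.ZipFile(archive_path) as archive:
        return sum(
            1
            for info in archive.infolist()
            # Skip directory entries and keep only known image extensions.
            if not info.is_dir()
            and PurePosixPath(info.filename).suffix.lower() in IMAGE_EXTENSIONS
        )


# Example call (assumed local file name):
# count_images("europeana_pellet_images.zip")
```

If the archive matches the corpus described above, the count should be close to the 471 pages mentioned in the introduction.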
Once you have downloaded the data, you can import it into Arkindex. To do so, go to your project, then click on Actions > Import files.
You will be redirected to a new page from where you can import files to Arkindex. Click on the Select files... button located next to the From local files label, and browse your file system to find the ZIP archive you just downloaded.
The archive upload to Arkindex will take from a few seconds to a couple of minutes.
Once the archive is successfully uploaded to Arkindex, a green tick is displayed next to its name, in the list of Available files to import.
It means that you can proceed to the next step and click the blue Import button in the bottom-right corner of the current page.
You will be redirected to the Process status page. Wait a bit for the process to start (i.e. for its status to go from Unscheduled to Running). This process will extract the ZIP archive and upload every image it contains to Arkindex in a few moments.
Once your process has ended (i.e. its status has changed to Completed), you can navigate back to your project to view the 471 imported images by clicking your project's name under the Project label.
From there, you should be able to browse through the newly created folder named europeana_pellet_images.zip.
You can also rename this folder to PELLET casimir marius (which is much nicer) by clicking the small pencil icon next to its name, at the top-right corner of the page. Do not forget to validate your input by clicking the blue pencil button once you are done.
This import procedure is simplified and only allows you to import partial data from the Pellet corpus. This is sufficient for this tutorial, since we will only be using images.
However, the Pellet corpus is much more substantial, as it also contains a large amount of metadata and page level transcriptions from Europeana. If you wish to import this additional data, you can follow the advanced import tutorial at the bottom of this page.
To train Machine Learning models, you first need to select a random sample of the corpus. In this tutorial, we will limit the sample to 100 documents to reduce the annotation effort. From this sample, you will create three sets for training, validation and evaluation.
- 80 page elements (80% of the sample) in the train set (used for model training)
- 10 page elements (10% of the sample) in the val set (used for model validation)
- 10 page elements (10% of the sample) in the test set (used for model evaluation)

In a real HTR project, you would typically select a larger subset of the corpus, using the same partitioning strategy. For example, you could sample 500 documents: 400 for training, 50 for validation and 50 for testing.
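The partitioning strategy itself is simple to express in code. In this tutorial the selection is done through the Arkindex interface, so the following is only a minimal sketch of the same idea: draw a random sample, then cut it into train, val and test sets of fixed sizes. The function name `split_sample` and the page identifiers are hypothetical.

```python
import random


def split_sample(page_ids, sizes, seed=0):
    """Draw a random sample and partition it into named sets.

    sizes maps set names to counts, e.g. {"train": 80, "val": 10, "test": 10}.
    """
    total = sum(sizes.values())
    if total > len(page_ids):
        raise ValueError("Not enough pages for the requested set sizes")
    rng = random.Random(seed)  # fixed seed so the split is reproducible
    sample = rng.sample(list(page_ids), total)  # sampling without replacement
    sets, start = {}, 0
    for name, count in sizes.items():
        sets[name] = sample[start:start + count]
        start += count
    return sets


# Hypothetical identifiers standing in for the 471 imported pages.
pages = [f"page_{i:03d}" for i in range(471)]
sets = split_sample(pages, {"train": 80, "val": 10, "test": 10})
print({name: len(ids) for name, ids in sets.items()})
```

Because the sample is drawn without replacement, no page can end up in two sets, which keeps the evaluation set independent from the training data. For the larger example above, the sizes would become 400, 50 and 50.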
First, you need to create an Arkindex dataset. To do that, go to your project, then click on Actions > Project information > Datasets > +. Enter a description of this dataset, and set the following set names: train, val, test.
We also recommend that you create three folders inside the PELLET casimir marius folder, named train, val, and test. To create a folder, click on Actions, then Add folder and enter the name of your folder.
To select 80 random page elements for the train set of your dataset, follow these steps:
1. Open the PELLET casimir marius folder, filter elements by type Page and sort them by Random order.
2. Click on Actions > Select all displayed elements.
3. Go to the Selection page by clicking on the document icon.
4. Click on Actions > Add to a dataset, then select the train set of your dataset.
5. Move the selected pages to the train folder, then click on Unselect all.
To add 10 random page elements to the val set of your dataset, you can simply select them by clicking on the checkbox at the bottom right of each image.
You can then add these pages to your dataset and move them to the val folder by following steps 3 to 5 from the section above.
Finally, repeat these steps for the test set.
You now have a complete dataset. You can go back to your project by clicking on its name.
Click on Actions > Project information > Datasets, then select your dataset to view its content.
You can now annotate text lines and illustrations. This will provide you with ground truth data to train a segmentation model on Arkindex.
This section is intended for advanced users who wish to import the entire Pellet corpus into Arkindex (images, transcriptions and metadata).
The following instructions are NOT needed to proceed with the rest of this tutorial.
Moreover, importing page level transcriptions from Europeana will not reduce the workload of this tutorial. Page level transcriptions are not enough to train a Machine Learning model to transcribe text: such models usually work at line level, so they need both the localization of each line (segmentation) and its transcription.
Two steps are required to import the Pellet corpus, in its entirety, to Arkindex:
1. Extract the data from the Europeana website.
2. Publish the extracted data to Arkindex.

You will need Python 3.10 and a shell environment (we recommend Ubuntu or Mac OS X).
We have released a Python package named arkindex-scrapers to help you achieve these steps. To install it to your environment, run:
pip install teklia-scrapers
To extract data from the Europeana website, you need to specify two arguments:
- --story_id: the Europeana identifier for the Pellet corpus ("121795")
- --output_dir: the directory in which the corpus will be extracted ("pellet_corpus")

Running the following command will start the import:
scrapers eu-trans --story_id 121795 --output_dir pellet_corpus/
The command should take about 2 hours to complete, depending on your network connection and the current availability of Europeana. If you do not have that much time, you can download the data directly from this link.
Once the extraction is done, you will find a JSON file named 121795.json in the directory named pellet_corpus/.
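Before publishing, you may want to check that the extracted file is valid JSON and get a rough idea of its contents. The tutorial does not describe the layout of this file, so the sketch below makes no assumption about it: it only reports the top-level type and, for an object, its keys. The helper name `summarize_json` is hypothetical.

```python
import json
from pathlib import Path


def summarize_json(path):
    """Load a JSON file and describe its top-level structure."""
    data = json.loads(Path(path).read_text(encoding="utf-8"))
    if isinstance(data, dict):
        return {"type": "object", "keys": sorted(data)}
    if isinstance(data, list):
        return {"type": "array", "length": len(data)}
    # Scalars (string, number, boolean, null) are reported by Python type name.
    return {"type": type(data).__name__}


# Example call, using the path produced by the extraction step:
# summarize_json("pellet_corpus/121795.json")
```

A `json.JSONDecodeError` at this point would suggest that the extraction was interrupted and should be run again.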
Then, you can use the scrapers publish command to publish the data to Arkindex.
You will need to provide the following arguments:
- --arkindex-api-url: The Arkindex instance in which you wish to import the corpus. By default, you should use https://demo.arkindex.org/.
- --arkindex-api-token: Your API token. If you do not know your API token, refer to this page.
- --corpus-id: The UUID of the Arkindex project created in the previous step. This value can be copied from your Arkindex project details page, just below its name.
- --worker-run-id: The worker run UUID that will be used to import the data. Refer to this page to create your own worker run.
- --folder-type: The type of the top level element ("folder")
- --page-type: The type of the child level elements ("page")
- --report: The path to the JSON report file ("report.json")
- folder: The path to the local directory containing the 121795.json JSON file, generated using the previous command ("pellet_corpus/")

scrapers publish --folder-type folder \
--page-type page \
--report report.json \
--corpus-id aaaaaaaa-bbbb-cccc-dddd-eeeeeeeeeeee \
--arkindex-api-url https://demo.arkindex.org/ \
--arkindex-api-token aaaaaaaa-bbbb-cccc-dddd-eeeeeeeeeeee \
--worker-run-id aaaaaaaa-bbbb-cccc-dddd-eeeeeeeeeeee \
pellet_corpus/
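With this many arguments, a mistyped identifier is easy to miss. As a minimal sketch, the command can be assembled in Python so that the two UUIDs are checked before anything is sent to Arkindex. The helper `build_publish_command` is hypothetical; it only reuses the flags listed above, and the identifiers are the same placeholders as in the command.

```python
import shlex
import uuid


def build_publish_command(corpus_id, worker_run_id, api_url, api_token,
                          folder, report="report.json"):
    """Assemble the `scrapers publish` argument list described above."""
    # Fail early if an identifier is not a well-formed UUID.
    for value in (corpus_id, worker_run_id):
        uuid.UUID(value)
    return [
        "scrapers", "publish",
        "--folder-type", "folder",
        "--page-type", "page",
        "--report", report,
        "--corpus-id", corpus_id,
        "--arkindex-api-url", api_url,
        "--arkindex-api-token", api_token,
        "--worker-run-id", worker_run_id,
        folder,  # positional argument: directory holding 121795.json
    ]


# Placeholder values; replace them with your own project and worker run.
command = build_publish_command(
    corpus_id="aaaaaaaa-bbbb-cccc-dddd-eeeeeeeeeeee",
    worker_run_id="aaaaaaaa-bbbb-cccc-dddd-eeeeeeeeeeee",
    api_url="https://demo.arkindex.org/",
    api_token="aaaaaaaa-bbbb-cccc-dddd-eeeeeeeeeeee",
    folder="pellet_corpus/",
)
print(shlex.join(command))
```

The resulting list can be passed to `subprocess.run(command, check=True)` to launch the import, or the printed line can be pasted into a shell.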
Once the import is finished, you should be able to navigate through the folder named PELLET casimir marius in Arkindex.
You can then go back up this page to the section on partitioning your data, where you will learn how to create a dataset made up of three sets.