Corpus import

    In this tutorial, you will learn how to import images into Arkindex.

    Corpus description

    As an example, you will import the Pellet dataset from the Europeana 1914-1918 collection.

    The corpus contains 471 scanned pages related to Casimir Marius PELLET, a French soldier during World War I. Each document has been transcribed by volunteers and includes descriptive metadata.

    Information

    Annotations from Europeana are available at page level. However, most Machine Learning models require line level annotations. In this tutorial, we will show you how to create ground truth for text line segmentation and transcription, and how to train machine learning models from these annotations.

    The pages are written in French and include various content types, such as campaign diaries, photographs, and postcards. We have selected this corpus as it covers a large variety of documents, as illustrated below.

    Pages from the Pellet corpus

    You can of course import your own data directly into Arkindex using file uploads. Arkindex supports images, PDFs, METS, ALTO, ZIP archives compatible with Transkribus, and more.

    Create a project in Arkindex

    Information

    This section expects you to have an Arkindex account. Learn how to register here.

    Log in to Arkindex by entering your email and password.

    On the front page, you will find an empty project entitled My Project. We will publish the data from Europeana in this project. Alternatively, you can create a new project by clicking on the New Project button at the top right of the page. Note that this project is personal and can only be accessed by you.

    To edit your project name and description:

    • Click on My Project
    Select your project
    • Go to your project information page
    Go to your project information page
    • Edit your project name and description and click on Update
      • Name: Europeana | Pellet
      • Description: Corpus from [Europeana](https://europeana.transcribathon.eu/documents/story/?story=121795)
    Edit your project name and description
    Information

    The project Description field supports Markdown input.

    Import data to Arkindex

    For the purposes of this tutorial, we have prepared a ZIP archive containing all the images from the Pellet corpus, freely available on our servers. You can download it directly from this link.
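
    If you want to sanity-check the download before importing it, here is a minimal sketch using Python's standard zipfile module (assuming the archive was saved as europeana_pellet_images.zip in your working directory):

    import zipfile

    # List the files inside the downloaded archive, skipping directory
    # entries; the Pellet corpus should contain 471 scanned pages.
    with zipfile.ZipFile("europeana_pellet_images.zip") as archive:
        files = [name for name in archive.namelist() if not name.endswith("/")]
        print(f"{len(files)} files in the archive")  # expected: 471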

    Once you have downloaded the data, you can import it to Arkindex. To do so, go to your project, then click on Actions > Import files.

    Access the import page from your project

    You will be redirected to a new page where you can import files to Arkindex. Click on the Select files... button located next to the From local files label, and browse your file system to find the ZIP archive you just downloaded.

    Add the downloaded archive to the files to import

    Uploading the archive to Arkindex will take from a few seconds to a couple of minutes.

    The archive is being uploaded

    Once the archive is successfully uploaded to Arkindex, a green tick is displayed next to its name, in the list of Available files to import.

    The archive was successfully uploaded and can be imported

    You can then proceed to the next step: click the blue Import button in the bottom-right corner of the page.

    You will be redirected to the Process status page. Wait a moment for the process to start (i.e. for its status to go from Unscheduled to Running). It will extract the ZIP archive and upload every image it contains to Arkindex.

    The import process is running and extracting the archive

    Once your process has ended (i.e. its status has changed to Completed), you can navigate back to your project to view the 471 imported images by clicking your project's name under the Project label.

    Navigate back to your project from the process page

    From there, you should be able to browse through the newly created folder named europeana_pellet_images.zip:

    Browse the images from the imported folder

    You can also rename this folder to PELLET casimir marius (which is much nicer) by clicking the small pencil icon next to its name, at the top-right corner of the page. Do not forget to validate your input by clicking the blue pencil button once you are done.

    Rename the imported folder
    Information

    This import procedure is simplified and only allows you to import partial data from the Pellet corpus. This is sufficient for this tutorial, since we will only be using images.

    However, the Pellet corpus is much more substantial, as it also contains a large amount of metadata and page level transcriptions from Europeana. If you wish to import this additional data, you can follow the advanced import tutorial at the bottom of this page.

    Data partitioning

    To train Machine Learning models, you first need to select a random sample of the corpus. In this tutorial, we will limit the sample to 100 documents to reduce the annotation effort. From this sample, you will create three sets for training, validation and evaluation.

    • 80 page elements (80% of the sample) in the train set (used for model training)
    • 10 page elements (10% of the sample) in the val set (used for model validation)
    • 10 page elements (10% of the sample) in the test set (used for model evaluation)
    Information

    In a real HTR project, you would typically select a larger subset of the corpus, using the same partitioning strategy. For example you could sample 500 documents: 400 for training, 50 for validation and 50 for testing.
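
    If you ever need to script this partitioning, the split itself is straightforward. Here is a minimal sketch using only the Python standard library (the page identifiers below are placeholders for illustration; in practice you would use the identifiers of your sampled elements):

    import random

    # Placeholder identifiers standing in for the 100 sampled pages
    pages = [f"page_{i:03d}" for i in range(100)]

    random.seed(42)  # fixed seed so the split is reproducible
    random.shuffle(pages)

    # 80% train, 10% validation, 10% test
    train, val, test = pages[:80], pages[80:90], pages[90:]
    print(len(train), len(val), len(test))  # 80 10 10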

    Dataset creation

    First, you need to create an Arkindex dataset. To do that, go to your project, then click on Actions > Project information > Datasets > +. Enter a description of this dataset, and set the following set names:

    • train,
    • val,
    • test.
    Create the dataset

    We also recommend that you create three folders inside the PELLET casimir marius folder, named train, val, and test. To create a folder, click on Actions, then Add folder and enter the name of your folder.

    Create three folders

    Add pages to a dataset set

    Train set

    To select 80 random page elements for the train set of your dataset, follow these steps:

    1. Go to the folder named PELLET casimir marius
    2. Select 80 random pages
    • Filter elements by type Page and sort them by Random order
    Filter pages and sort randomly
    • Select 20 elements by clicking on Actions > Select all displayed elements
    Select all displayed elements
    • Go to the next page
    Go to the next page
    • Repeat these steps until you have 80 elements in your selection
    View how many elements are currently selected
    • If needed, you can select or unselect elements individually by clicking on the checkbox at the bottom right of each element
    Select pages one by one
    3. Add these pages to your dataset
    • Go to the Selection page by clicking on the document icon
    Go to your selection
    • From this page, click on Actions > Add to a dataset, then select the train set of your dataset
    Add your selection to the train set of the Pellet dataset
    4. Move the selected elements to the train folder
    Move elements to the train folder
    5. Clear your selection by clicking on Unselect all
    Clear your selection

    Validation and test sets

    To add 10 random page elements to the val set of your dataset, you can simply select them by clicking on the checkbox at the bottom right of each image.

    Select pages one by one

    You can then add these pages to your dataset and move them to the val folder by following steps 3 to 5 from the section above.

    Finally, repeat these steps for the test set.

    You now have a complete dataset. You can go back to your project by clicking on its name.

    Go back to your project from the selection

    Visualize your dataset

    Click on Actions > Project information > Datasets, then select your dataset to visualize its content:

    The Pellet dataset imported in Arkindex

    Next steps

    You can now annotate text lines and illustrations. This will provide you with ground truth data to train a segmentation model on Arkindex.


    Optional section - Full data import to Arkindex

    Warning

    This section is intended for advanced users who wish to import the entire Pellet corpus into Arkindex (images, transcriptions and metadata).

    The following instructions are NOT needed to proceed with the rest of this tutorial.

    Moreover, importing page level transcriptions from Europeana will not reduce the workload of this tutorial. They are not enough to train a Machine Learning model to transcribe the text: such models usually work at line level, so we need both the location of each line (segmentation) and its transcription.

    Two steps are required to import the Pellet corpus in its entirety to Arkindex:

    1. Extract the data from Europeana (images, transcriptions and metadata)
    2. Publish it to your Arkindex project
    Information

    You will need Python 3.10 and a shell environment (we recommend Ubuntu or Mac OS X).

    We have released a Python package named arkindex-scrapers to help you with these two steps. To install it in your environment, run:

    pip install teklia-scrapers
    

    Data extraction

    To extract data from the Europeana website, you need to specify two arguments:

    • --story_id: the Europeana identifier for the Pellet corpus ("121795")
    • --output_dir: the directory in which the corpus will be extracted ("pellet_corpus")

    Running the following command will start the extraction:

    scrapers eu-trans --story_id 121795 --output_dir pellet_corpus/
    
    Warning

    The command should take about 2 hours to complete, depending on your network connection and the current availability of Europeana. If you do not have that much time, you can download the data directly from this link.

    Once the extraction is done, you will find a JSON file named 121795.json in the directory named pellet_corpus/.
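
    If you are curious about its content, you can peek at the file with Python's standard json module. This is purely exploratory: the exact structure of the file is defined by arkindex-scrapers, so inspect it rather than assume a layout.

    import json

    # Load the extracted corpus and print a quick overview of its structure
    with open("pellet_corpus/121795.json") as f:
        data = json.load(f)

    print(type(data).__name__)
    if isinstance(data, dict):
        print(list(data.keys()))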

    Publication to Arkindex

    Then, you can use the scrapers publish command to publish the data to Arkindex.

    You will need to provide the following arguments:

    • --arkindex-api-url: The Arkindex instance in which you wish to import the corpus. By default, you should use https://demo.arkindex.org/.
    • --arkindex-api-token: Your API token. If you do not know your API token, refer to this page.
    • --corpus-id: The UUID of the Arkindex project created in the previous step. This value can be copied from your Arkindex project details page, just below its name.
    Find your project's UUID on Arkindex
    • --worker-run-id: The worker run UUID that will be used to import the data. Refer to this page to create your own worker run.
    • --folder-type: The type of the top level element ("folder")
    • --page-type: The type of the child level elements ("page")
    • --report: The path to the JSON report file ("report.json")
    • folder: The path to the local directory containing the 121795.json JSON file, generated using the previous command ("pellet_corpus/")
    scrapers publish --folder-type folder \
                     --page-type page \
                     --report report.json \
                     --corpus-id aaaaaaaa-bbbb-cccc-dddd-eeeeeeeeeeee \
                     --arkindex-api-url https://demo.arkindex.org/ \
                     --arkindex-api-token aaaaaaaa-bbbb-cccc-dddd-eeeeeeeeeeee \
                     --worker-run-id aaaaaaaa-bbbb-cccc-dddd-eeeeeeeeeeee \
                     pellet_corpus/
    

    Once the import is finished, you should be able to navigate through the folder named PELLET casimir marius in Arkindex:

    The Pellet corpus in Arkindex

    Then, you can scroll back up to the Data partitioning section of this page to learn how to partition your data and create a dataset made up of three sets.