Corpus import

    In this tutorial, you will learn how to import images and metadata in Arkindex.

    Corpus description

    As an example, you will import the Pellet dataset from the Europeana 1914-1918 collection.

    The corpus contains 471 scanned documents related to Casimir Marius PELLET, a French soldier during World War I.

    The documents are written in French and include various content types, such as campaign diaries, photographs, and postcards. Each document has been transcribed by volunteers and includes descriptive metadata. We selected this corpus because it covers a wide variety of documents while remaining small enough to keep annotation and training needs manageable.

    Documents from the Pellet corpus

    Of course, you may also import your own data directly into Arkindex using file uploads. Arkindex supports images, PDFs, METS, ALTO, ZIP archives compatible with Transkribus, and more.

    Create a project in Arkindex

    Information

    This section expects you to have an Arkindex account. Learn how to register here.

    Log in to Arkindex by entering your email and password.

    On the front page, you will find an empty corpus entitled My Project. We will publish the data from Europeana in this corpus. Alternatively, you can create a new project by clicking on the New Project button at the top right of the page. Note that this corpus is personal and can only be accessed by you.

    To edit your project name and description:

    • Click on My Project
    Select your project
    • Go to your project information page
    Go to your project information page
    • Edit your project's name and description and click on Update
      • Name: Europeana | Pellet
      • Description: Corpus from [Europeana](https://europeana.transcribathon.eu/documents/story/?story=121795)
    Edit your project's name and description

    Import data to Arkindex

    Two steps are required to import the corpus into Arkindex:

    1. Extract the data from Europeana (images, transcriptions and metadata)
    2. Publish it to your Arkindex project
    Information

    You will need Python 3.10 and a shell environment (we recommend Ubuntu or macOS).

    We have released a Python package named arkindex-scrapers (published on PyPI as teklia-scrapers) to help you complete these steps. To install it in your environment, run:

    pip install teklia-scrapers
    
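    If the installation succeeds, the scrapers command used in the following steps becomes available in your shell.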

    Data extraction

    To extract data from the Europeana website, you need to specify two arguments:

    • --story_id: the Europeana identifier for the Pellet corpus ("121795")
    • --output_dir: the directory in which the corpus will be extracted ("pellet_corpus")

    Running the following command will start the extraction:

    scrapers eu-trans --story_id 121795 --output_dir pellet_corpus/
    
    Warning

    The command should take about 2 hours to complete, depending on your network connection and the current availability of Europeana. If you do not have that much time, you can download the data directly from this link.

    Once the extraction is done, you will find a JSON file named 121795.json in the directory named pellet_corpus/.
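
    If you want a quick look at the extracted data before publishing it, you can load this file and print its top-level structure. This is only a convenience sketch: the exact schema of the JSON file is defined by arkindex-scrapers and is not documented in this tutorial.

    import json
    from pathlib import Path

    # Load the file produced by the extraction step.
    data = json.loads(Path("pellet_corpus/121795.json").read_text(encoding="utf-8"))

    # Print the top-level structure; adapt this to the schema you actually see.
    if isinstance(data, dict):
        print("Top-level keys:", sorted(data))
    else:
        print("Number of top-level items:", len(data))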

    Publication to Arkindex

    Then, you can use the scrapers publish command to publish the data to Arkindex.

    You will need to provide the following arguments:

    • --arkindex-api-url: The Arkindex instance into which you wish to import the corpus. By default, you should use https://demo.arkindex.org/.
    • --arkindex-api-token: Your API token. If you do not know your API token, refer to this page.
    • --corpus-id: The UUID of the Arkindex project created in the previous step. This value can be copied from your Arkindex project details page, just below its name.
    Find your project's UUID on Arkindex
    • --worker-run-id: The worker run UUID that will be used to import the data. Refer to this page to create your own worker run.
    • --folder-type: The type of the top level element ("folder")
    • --page-type: The type of the child level elements ("page")
    • --report: The path to the JSON report file ("report.json")
    • folder (positional argument): The path to the local directory containing the 121795.json file generated by the previous command ("pellet_corpus/")
    scrapers publish --folder-type folder \
                     --page-type page \
                     --report report.json \
                     --corpus-id aaaaaaaa-bbbb-cccc-dddd-eeeeeeeeeeee \
                     --arkindex-api-url https://demo.arkindex.org/ \
                     --arkindex-api-token aaaaaaaa-bbbb-cccc-dddd-eeeeeeeeeeee \
                     --worker-run-id aaaaaaaa-bbbb-cccc-dddd-eeeeeeeeeeee \
                     pellet_corpus/
    
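    The report.json file written by the publication command can be inspected the same way as the extracted JSON (just change the path in the snippet above) if you need to review the outcome of the publication.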

    Once the import is finished, you should be able to navigate through the folder named PELLET casimir marius in Arkindex:

    The Pellet corpus in Arkindex

    Data partitioning

    To train Machine Learning models on this corpus, you need to split it into three sets for training, validation and evaluation:

    • 321 page elements (around 75% of the corpus) in the train set (used for model training)
    • 50 page elements (around 12.5% of the corpus) in the val set (used for model validation)
    • 50 page elements (around 12.5% of the corpus) in the test set (used for model evaluation)
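
    These proportions are easy to verify: the two fixed-size sets hold 50 pages each, and all remaining pages go to training. As a quick sanity check:

    # Set sizes used in this tutorial.
    val_size = test_size = 50
    train_size = 321
    total_pages = train_size + val_size + test_size  # 421 page elements

    print(train_size / total_pages)  # ~0.76, i.e. around 75%
    print(test_size / total_pages)   # ~0.12, i.e. around 12.5%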

    Dataset creation

    First, you need to create an Arkindex Dataset. To do that, go to your corpus, then click on Actions > Project information > Datasets > +. Enter a description of this dataset, and set the following set names:

    • train,
    • val,
    • test.
    Create the dataset

    We also recommend that you create three folders inside the PELLET casimir marius folder, named train, val, and test. To create a folder, click on Actions, then Add folder and enter the name of your folder.

    Create three folders

    Add pages to a dataset set

    To add 50 random pages to the test set of your Dataset, follow these steps:

    1. Go to the folder named PELLET casimir marius
    2. Select 50 random pages
    • Filter elements by type page and sort them by Random order
    Filter pages and sort randomly
    • Click on Display > Pagination size and set it to 50
    Update pagination size
    • Select 50 pages by clicking on Actions > Select all displayed elements
    Select all displayed elements
    3. Add these pages to your Dataset
    • Go to the Selection page
    Selection
    • Click on Actions > Add to a dataset, then select the test set of your Dataset
    The Pellet dataset imported in Arkindex
    4. Move these elements to the test folder
    Move elements to the test folder

    Repeat these steps for the val set.

    Finally, select all the remaining page elements and add them to the train set and folder.

    Visualize your dataset

    Click on the Dataset name to visualize its content:

    The Pellet dataset imported in Arkindex

    Next steps

    As you can see, transcriptions on this corpus are available at page-level.

    You can now annotate text lines and illustrations. This will provide you with ground truth data to train a segmentation model on Arkindex.