Datasets

    In the context of Machine Learning, a dataset is a collection of data, organized into sets which may or may not overlap. In Arkindex, datasets are collections of elements.

    Rules

    • A dataset must contain at least one set.
    • An element can be included in multiple sets within the same dataset.
    • A dataset is tied to a project and can only include elements from that project.
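
    The rules above can be sketched in code. This is a minimal illustrative model, not Arkindex's actual data model; all class and field names are assumptions.

```python
# Illustrative sketch of the dataset rules; names are not Arkindex's API.
class Dataset:
    def __init__(self, name, project_elements, set_names):
        if not set_names:
            raise ValueError("A dataset must contain at least one set")
        self.name = name
        # A dataset is tied to a project: only these elements may be added
        self.project_elements = set(project_elements)
        self.sets = {s: set() for s in set_names}

    def add(self, set_name, element_id):
        if element_id not in self.project_elements:
            raise ValueError(f"{element_id!r} is not part of this project")
        # The same element may appear in several sets of one dataset
        self.sets[set_name].add(element_id)

ds = Dataset("My Dataset", {"page-1", "page-2"}, ["train", "dev", "test"])
ds.add("train", "page-1")
ds.add("dev", "page-1")  # allowed: multiple sets within the same dataset
```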

    Dataset states

    Datasets go through multiple states during their life cycle.

    • Open: when a dataset is created, it is in the Open state. You can edit its details, the names of its sets, and manage the elements it includes.
    • Building: when a worker starts generating an archive of a dataset, the dataset goes into the Building state.
    • Error: if the worker fails to generate the dataset archive, the dataset goes into the Error state.
    • Complete: if the worker succeeds in generating the archive, the dataset goes into the Complete state. The dataset is now immutable: no element can be added or removed.
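
    The life cycle above can be sketched as a transition table. This is illustrative only; Arkindex drives these transitions server-side, and retry behavior is not modelled here.

```python
# Illustrative transition table for the dataset life cycle.
TRANSITIONS = {
    "open": {"building"},               # a worker starts building the archive
    "building": {"error", "complete"},  # the build fails or succeeds
    "error": set(),                     # terminal here; retries not modelled
    "complete": set(),                  # immutable: elements are frozen
}

def advance(state, new_state):
    if new_state not in TRANSITIONS[state]:
        raise ValueError(f"Cannot go from {state!r} to {new_state!r}")
    return new_state
```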

    If you want to use the dataset outside of Arkindex, for example through the API client or the SQLite export, you do not need to change its state.
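
    As a sketch of reading a dataset from an SQLite export: the table and column names below (dataset, dataset_element) are assumptions for illustration; check the schema of your actual Arkindex export before using this.

```python
import sqlite3

# Hypothetical schema mirroring a dataset export; verify against the
# real Arkindex SQLite export before relying on these names.
con = sqlite3.connect(":memory:")  # use your export file path instead
con.executescript("""
CREATE TABLE dataset (id TEXT PRIMARY KEY, name TEXT);
CREATE TABLE dataset_element (dataset_id TEXT, set_name TEXT, element_id TEXT);
INSERT INTO dataset VALUES ('d1', 'My Dataset');
INSERT INTO dataset_element VALUES
    ('d1', 'train', 'page-1'),
    ('d1', 'test', 'page-2');
""")
rows = con.execute("""
    SELECT d.name, de.set_name, de.element_id
    FROM dataset d JOIN dataset_element de ON de.dataset_id = d.id
    ORDER BY de.set_name
""").fetchall()
```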

    Different states of a Dataset

    Manage your datasets in Arkindex

    Datasets can be managed from the Datasets tab in a project's details page.

    Manage datasets in a project

    From this interface, if you have contributor access to the project, you are able to:

    • create a new dataset,
    • edit an existing dataset, unless it is in the Complete state,
    • view the elements in each set of the datasets,
    • clone an existing dataset.

    Create a new dataset

    To create a new dataset, click on the + button, on the bottom right of the dataset list. This opens a dataset creation modal.

    Create a new dataset

    Fill in a name and a description for your dataset.

    Warning

    Dataset names are unique within a corpus: you cannot pick a name that is already used by another dataset.

    Choose the names of your dataset's sets. They should match the names expected by the ML technology you plan to use later. You can always rename them later if there is a mismatch.

    If you wish to avoid data leakage, i.e. having the same element in more than one set of your dataset, check the Require unique elements among sets checkbox.
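
    For datasets that were built without that checkbox, overlap between sets can be checked offline. A minimal sketch (set contents below are illustrative):

```python
from itertools import combinations

# Report every element that appears in more than one set of a dataset.
def leaked_elements(sets_by_name):
    leaks = set()
    for (_, a), (_, b) in combinations(sets_by_name.items(), 2):
        leaks |= a & b
    return leaks

example = {"train": {"p1", "p2"}, "dev": {"p2"}, "test": {"p3"}}
# "p2" sits in both train and dev: that is data leakage
```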

    Created dataset

    The dataset's state is Open at first. In this state, you can add elements to its sets and edit any of its attributes (name, description, set names, etc.).

    Edit an existing dataset

    To edit an existing dataset, click on the pencil-shaped icon on the far right of a dataset's row, in the Actions column. This opens the same modal as the one described for dataset creation.

    Editing is not available for datasets in the Complete state.

    Adding elements to a dataset

    Using the web interface

    Once you have your dataset, you can add elements to each set.

    Using existing splits

    If the elements of each set are already split into separate folders, adding them to your dataset is easier.

    The flow is the same for each set, but you have to do it separately.

    1. Browse to a folder selected for the set.
    2. List all elements that should be added, recursively if there are subfolders.
    3. Add all these elements to the selection. To do that faster,
      1. Increase the pagination size to the maximum (Display -> Pagination size),
      2. Use the Select all displayed elements button from the Actions menu on the right.

    Repeat this operation for every folder selected for this set.

    When all elements have been selected, browse to the selection using the icon next to your email address in the navigation bar. The last operations are detailed in a later section. At the end, don't forget to unselect all elements to avoid data leakage.

    Create new splits

    You first need to decide on the number of elements and the ratio of each split. The number of elements depends on the machine learning technology you are using: some require larger amounts of data than others.
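
    Turning ratios into element counts can be sketched as follows, using largest-remainder rounding so the counts always add up to the total (the ratios below are illustrative):

```python
# Convert split ratios into element counts that sum exactly to the total.
def split_sizes(total, ratios):
    exact = {name: total * r for name, r in ratios.items()}
    sizes = {name: int(v) for name, v in exact.items()}
    leftover = total - sum(sizes.values())
    # hand out the remainder to the sets with the largest fractional part
    for name in sorted(exact, key=lambda n: exact[n] - sizes[n], reverse=True)[:leftover]:
        sizes[name] += 1
    return sizes

sizes = split_sizes(1000, {"train": 0.8, "dev": 0.1, "test": 0.1})
```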

    To avoid data-leakage, create a folder in your corpus, named after your dataset. This folder will hold all the elements selected for a split.

    Then, browse to the folder element which holds the elements you want to use. Add the relevant filters to display your elements. To select page elements from anywhere below this folder, add:

    • recursive=Yes
    • type=page

    List page elements under a folder

    To select elements at random, set the order to Random, instead of Position. The switch is available on the right of the filter bar.

    For easier browsing, you can also increase the pagination size. There are multiple sizes available, pick one that is either:

    • close to the number of elements you wish to select (e.g. 100 if you want to select 95 elements),
    • a divisor of the number of elements (e.g. 100 to select 400 elements).

    Display 100 elements per page and order at random
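
    The advice above can be expressed as a small helper: prefer pagination sizes that divide the selection count exactly (no partial last page), then the fewest pages. The candidate sizes are an assumption; use the ones your Arkindex instance actually offers.

```python
import math

# Pick a pagination size for selecting n_elements, per the advice above.
def best_page_size(n_elements, candidates=(20, 50, 100, 500)):
    # sort key: (has a partial last page?, number of pages to browse)
    return min(candidates,
               key=lambda s: (n_elements % s != 0, math.ceil(n_elements / s)))
```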

    Repeat the following procedure for each set in your dataset.

    1. Browse to the folder which holds the elements, add the filters and random ordering as before,
    2. Set the optimal pagination size depending on the number of elements to add,
    3. Use the Select all displayed elements button from the Actions menu on the right (you might have to browse multiple pages),
    4. Add selected elements to the dataset,
    5. Move elements to the data-leakage folder.
      1. Use the Move elements button in the Actions menu,
      2. Select the folder created at the very beginning to avoid data leakage,
      3. Wait for the asynchronous task to end, it should take a few minutes at most.
    6. Unselect all elements, using the dedicated button on the selection page.
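
    The random, non-overlapping selection performed in the steps above can be sketched offline like this (element IDs and set sizes are illustrative):

```python
import random

# Draw disjoint random samples of element IDs for each set.
def draw_splits(element_ids, sizes, seed=0):
    rng = random.Random(seed)  # fixed seed for reproducibility
    pool = list(element_ids)
    rng.shuffle(pool)
    splits, start = {}, 0
    for name, size in sizes.items():
        splits[name] = pool[start:start + size]  # disjoint slices: no leakage
        start += size
    return splits

splits = draw_splits([f"page-{i}" for i in range(120)],
                     {"train": 100, "dev": 10, "test": 10})
```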

    Add elements to a dataset from selection

    Add all elements to the right set of the dataset, using the Add to a dataset button from the Actions menu. This will open a modal to select the dataset and the set.

    Add selected elements to the 'train' set of the 'My Dataset' dataset

    A green notification will be displayed when the operation is done. You can browse to the dataset's details page to make sure your elements have been added.

    The dataset's 'train' set now has 100 elements

    Command Line Interface

    There is a command-line tool that creates a random dataset from all the elements in a folder, or from the whole corpus. Its documentation is available here.

    We recommend using this tool on Ubuntu or Mac OS X.

    View the dataset's elements

    To view a dataset's details and its elements, click on the name of a dataset in the list. Cycle through the tabs to see the elements in each set.

    View the elements in each set

    To remove an element from a set, use the button in the bottom-right corner of its thumbnail.

    Clone an existing dataset

    To create a new dataset with the same collection of elements as another, use the Clone button in the top-right corner of a dataset's details page. This creates a new dataset with the same elements and sets, in the Open state. This is helpful when you need to build a v2 of a dataset from a v1 that is in the Complete (thus immutable) state.
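
    Conceptually, cloning copies the sets and elements into a fresh dataset that is editable again. A minimal sketch (the dict shapes are illustrative, not Arkindex's data model):

```python
import copy

# A clone keeps the same sets and elements but starts back in Open.
def clone_dataset(dataset, new_name):
    return {
        "name": new_name,
        "state": "open",  # clones are editable again
        "sets": copy.deepcopy(dataset["sets"]),  # independent copy of the sets
    }

v1 = {"name": "My Dataset v1", "state": "complete",
      "sets": {"train": ["p1"], "test": ["p2"]}}
v2 = clone_dataset(v1, "My Dataset v2")
```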

    Delete an existing dataset

    If you are an administrator on a project, you can delete an existing dataset from the datasets list page.

    See the datasets an element is a part of

    On an element's details page, a Datasets section lists all the datasets and sets that include the element. From this list, you can remove the element from a dataset's set if you have contributor access to the project.
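
    That listing is essentially a reverse index from an element to the (dataset, set) pairs containing it, which can be sketched like this (data shapes are illustrative):

```python
# Map an element ID to every (dataset, set) pair that includes it.
def datasets_of_element(datasets, element_id):
    return [
        (name, set_name)
        for name, sets in datasets.items()
        for set_name, elements in sets.items()
        if element_id in elements
    ]

datasets = {
    "My Dataset": {"train": {"p1", "p2"}, "test": {"p3"}},
    "Other": {"dev": {"p1"}},
}
```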

    See which datasets an element is included in

    API Endpoints

    These endpoints are the most useful to handle Datasets:

    Use datasets to train a model

    Once your dataset is ready, you can start training in Arkindex. Learn more about: