Datasets

    In machine learning, a dataset is a collection of data organized into sets, which may or may not overlap. In Arkindex, a dataset is a collection of elements.

    Rules

    • A dataset must contain at least one set.
    • An element can be included in multiple sets within the same dataset.
    • A dataset is tied to a project and can only include elements from that project.
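
    The three rules above can be illustrated with a small in-memory model. This is only a sketch: the `Dataset` class, its fields and the `add_element` method are hypothetical names chosen for this example and do not reflect the Arkindex data model or API.

```python
from dataclasses import dataclass, field


@dataclass
class Dataset:
    """Hypothetical in-memory model, not the Arkindex schema."""

    name: str
    project_id: str
    # Maps each set name to the IDs of the elements it contains.
    sets: dict[str, set[str]] = field(default_factory=dict)

    def __post_init__(self) -> None:
        # Rule 1: a dataset must contain at least one set.
        if not self.sets:
            raise ValueError("a dataset must contain at least one set")

    def add_element(self, set_name: str, element_id: str, element_project_id: str) -> None:
        # Rule 3: only elements from the dataset's own project are accepted.
        if element_project_id != self.project_id:
            raise ValueError("element belongs to another project")
        if set_name not in self.sets:
            raise KeyError(f"unknown set: {set_name}")
        # Rule 2: the same element may appear in several sets of one dataset.
        self.sets[set_name].add(element_id)


dataset = Dataset("demo", "project-a", {"training": set(), "test": set()})
dataset.add_element("training", "element-1", "project-a")
dataset.add_element("test", "element-1", "project-a")
```

    Here `element-1` ends up in both the `training` and `test` sets, while adding an element whose project differs from `project-a` would raise an error.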

    Dataset states

    Datasets go through multiple states during their life cycle.

    • Open: when a dataset is created, it is in the Open state. You can edit its details, rename its sets and manage the elements it includes.
    • Building: when a worker starts generating an archive of a dataset, the dataset goes into the Building state.
    • Error: if the worker fails to generate the dataset archive, the dataset goes into the Error state.
    • Complete: if the worker succeeds in generating the archive, the dataset goes into the Complete state. The dataset is now immutable and no element can be added or removed.
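
    The life cycle described in this list can be sketched as a small state machine. The enum values and helper names below are assumptions made for illustration, not Arkindex identifiers. This page does not say whether a dataset in the Error state can be rebuilt, so the sketch lists no transition out of it.

```python
from enum import Enum


class DatasetState(Enum):
    # String values are illustrative, not the values used by Arkindex.
    OPEN = "open"
    BUILDING = "building"
    ERROR = "error"
    COMPLETE = "complete"


# Only the transitions named in the list above are modelled here.
ALLOWED_TRANSITIONS = {
    DatasetState.OPEN: {DatasetState.BUILDING},
    DatasetState.BUILDING: {DatasetState.ERROR, DatasetState.COMPLETE},
    DatasetState.ERROR: set(),
    DatasetState.COMPLETE: set(),
}


def can_transition(current: DatasetState, target: DatasetState) -> bool:
    """Return True if the sketch allows moving from current to target."""
    return target in ALLOWED_TRANSITIONS[current]


def is_editable(state: DatasetState) -> bool:
    """A Complete dataset is immutable; other states are treated as editable here."""
    return state is not DatasetState.COMPLETE
```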

    If you want to use the dataset outside of Arkindex, using the API Client or the SQLite export, you do not need to change its state.

    Different states of a Dataset

    Manage your datasets in Arkindex

    Datasets can be managed from the Datasets tab in a project's details page.

    Manage datasets in a project

    From this interface, if you have contributor access to the project, you are able to:

    • create a new dataset,
    • edit an existing dataset, unless it is in the Complete state,
    • view the elements in each set of a dataset,
    • clone an existing dataset.

    Create a new dataset

    To create a new dataset, click the + button at the bottom right of the dataset list. This opens a dataset creation modal.

    Create a new dataset

    To create a new dataset, the following fields are mandatory:

    • the name of the dataset,
    • the dataset's description.

    The sets field is optional; if you leave it empty, then your dataset will be created with the following default sets:

    • training,
    • validation,
    • test.
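
    The form behaviour described above (two mandatory fields, and default sets when none are given) can be sketched as follows. The function names and the `set_names` payload key are hypothetical and only serve this example; they are not taken from the Arkindex API.

```python
from typing import Optional

# Default sets listed above, used when the sets field is left empty.
DEFAULT_SET_NAMES = ("training", "validation", "test")


def resolve_set_names(requested: Optional[list[str]] = None) -> list[str]:
    """Return the requested set names, or the defaults if none were given."""
    names = [name.strip() for name in (requested or []) if name.strip()]
    return names or list(DEFAULT_SET_NAMES)


def build_dataset_payload(
    name: str, description: str, set_names: Optional[list[str]] = None
) -> dict:
    """Check the mandatory fields and assemble an illustrative payload."""
    if not name.strip():
        raise ValueError("the dataset name is mandatory")
    if not description.strip():
        raise ValueError("the dataset description is mandatory")
    return {
        "name": name.strip(),
        "description": description.strip(),
        "set_names": resolve_set_names(set_names),
    }
```

    For instance, `build_dataset_payload("Registers", "Parish registers")` yields a payload whose sets are `training`, `validation` and `test`.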

    Edit an existing dataset

    To edit an existing dataset, click on the pencil-shaped icon on the far right of a dataset's row, in the Actions column. This opens the same modal as the one described for dataset creation.

    Editing is not available for Complete datasets.

    View the dataset's elements

    To view a dataset's details and its elements, click on the name of a dataset in the list. Cycle through the tabs to see the elements in each set.

    View the elements in each set

    To remove an element from a set, use the − (minus) button in the bottom-right corner of its thumbnail.

    Clone an existing dataset

    To create a new dataset with the same collection of elements as another, you can use the Clone button, in the top-right corner of a dataset's details page. This will create a new dataset with the same elements and sets, in the Open state. This is helpful when you need to build the v2 of a dataset from a v1 that is in the Complete (thus immutable) state.
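
    The effect of cloning can be sketched on a plain dictionary: the sets and their elements are copied, and the copy starts in the Open state whatever the state of the source. The dictionary layout, the `clone_dataset` helper and the way the clone is named are assumptions for illustration only.

```python
import copy


def clone_dataset(source: dict, new_name: str) -> dict:
    """Return an editable copy of a dataset, leaving the source untouched."""
    return {
        "name": new_name,
        # A clone always starts in the Open state, even if the source is Complete.
        "state": "open",
        # Deep copy so that editing the clone never changes the source sets.
        "sets": copy.deepcopy(source["sets"]),
    }


v1 = {
    "name": "registers-v1",
    "state": "complete",
    "sets": {"training": ["element-1", "element-2"], "test": ["element-3"]},
}
v2 = clone_dataset(v1, "registers-v2")
v2["sets"]["training"].append("element-4")
```

    After this, `v2` holds four elements across its sets and is editable, while `v1` is unchanged.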

    Delete an existing dataset

    If you are an administrator on a project, you can delete an existing dataset from the datasets list page.

    See the datasets an element is a part of

    On an element's details page, there is a Datasets section listing all the datasets and sets that include the element. From this list you can remove the element from a dataset's set, if you have contributor access to the project.

    See which datasets an element is included in

    API Endpoints

    These endpoints are the most useful to handle Datasets:

    Use datasets to train a model

    Once your dataset is ready, you can start training in Arkindex. Learn more about: