Datasets
In the context of Machine Learning, a Dataset is a collection of data. Datasets are organized in sets, which may or may not overlap. In Arkindex, datasets are collections of elements.
Rules¶
- A dataset must contain at least 1 set.
- An element can be included in multiple sets within the same dataset.
- A dataset is tied to a project and can only include elements from that project.
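These rules can be read as simple invariants. The sketch below is a minimal Python model for illustration only: `Element`, `Dataset` and `validate` are hypothetical names, not part of Arkindex or its API client.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class Element:
    id: str
    project: str  # hypothetical: identifier of the project owning the element


@dataclass
class Dataset:
    name: str
    project: str
    # Maps each set name to the elements it contains.
    sets: dict[str, list[Element]] = field(default_factory=dict)


def validate(dataset: Dataset) -> list[str]:
    """Return a list of rule violations (empty when the dataset is valid)."""
    errors = []
    if not dataset.sets:
        errors.append("a dataset must contain at least 1 set")
    for set_name, elements in dataset.sets.items():
        for element in elements:
            if element.project != dataset.project:
                errors.append(
                    f"element {element.id} in set {set_name} belongs to another project"
                )
    return errors


page = Element(id="page-1", project="project-a")
# The same element may appear in several sets of one dataset.
ok = Dataset("demo", "project-a", {"train": [page], "dev": [page]})
empty = Dataset("empty", "project-a")
foreign = Dataset("foreign", "project-b", {"train": [page]})
```

Here `validate(ok)` returns an empty list, while `empty` and `foreign` each break one rule.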
Dataset states¶
Datasets go through multiple states during their life cycle.
- Open: when a dataset is created, it is in the Open state. You can edit its details, the name of its sets and manage the elements included.
- Building: when a worker tries to generate an archive of a dataset, it goes into the Building state.
- Error: when the worker failed to generate the dataset archive, the dataset goes into the Error state.
- Complete: when the worker succeeded in generating the archive, the dataset goes into the Complete state. The dataset is now immutable and no element can be added or removed.
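The life cycle can be pictured as a small state machine. This is an illustrative sketch, not Arkindex code: the enum values and helper names are assumptions, and only the transitions described in this section are modeled.

```python
from enum import Enum


class DatasetState(Enum):
    # The string values are assumptions made for this example.
    OPEN = "open"
    BUILDING = "building"
    ERROR = "error"
    COMPLETE = "complete"


# Only the transitions described in this section are listed. Whether a
# dataset can leave the Error state is not covered here, so it is left out.
TRANSITIONS = {
    DatasetState.OPEN: {DatasetState.BUILDING},
    DatasetState.BUILDING: {DatasetState.ERROR, DatasetState.COMPLETE},
    DatasetState.ERROR: set(),
    DatasetState.COMPLETE: set(),
}


def can_transition(current: DatasetState, target: DatasetState) -> bool:
    """Tell whether moving from `current` to `target` is a modeled transition."""
    return target in TRANSITIONS[current]


def is_immutable(state: DatasetState) -> bool:
    """A Complete dataset can no longer gain or lose elements."""
    return state is DatasetState.COMPLETE
```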
If you want to use the dataset outside of Arkindex, using the API Client or the SQLite export, you do not need to change its state.
Manage your datasets in Arkindex¶
Datasets can be managed from the Datasets tab in a project’s details page.
From this interface, if you have contributor access to the project, you are able to:
- create a new dataset,
- edit an existing dataset, unless it is in the Complete state,
- view the elements in each set of the datasets,
- clone an existing dataset.
Create a new dataset¶
To create a new dataset, click on the + button, on the bottom right of the dataset list. This opens a dataset creation modal.
Fill in a name and a description for your dataset.
Warning
Names of datasets are unique in a corpus. This means that you cannot select a name already taken by another dataset.
Select the names of the sets of your dataset. They should match the names supported by the ML technology you plan to use later. You can always rename them later if there is a mismatch.
If you wish to avoid data-leakage, which is having elements in more than one set of your dataset, you can check the Require unique elements among sets checkbox.
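If you already have the element IDs of each set at hand (for example from an export), a quick check for leakage could look like the sketch below. The function name and the sample IDs are made up for illustration.

```python
from collections import defaultdict


def find_leaks(sets):
    """Map each element ID found in several sets to the sorted names of those sets."""
    seen = defaultdict(set)
    for set_name, element_ids in sets.items():
        for element_id in element_ids:
            seen[element_id].add(set_name)
    return {
        element_id: sorted(names)
        for element_id, names in seen.items()
        if len(names) > 1
    }


sets = {
    "train": ["page-1", "page-2", "page-3"],
    "dev": ["page-4"],
    "test": ["page-3", "page-5"],
}
leaks = find_leaks(sets)
print(leaks)  # {'page-3': ['test', 'train']}
```

An empty result means that no element is shared between sets.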
The dataset’s state is Open at first. In this state, you can add elements to its sets and edit any of its attributes (name, description, set names, …).
Edit an existing dataset¶
To edit an existing dataset, click on the pencil-shaped icon on the far right of a dataset’s row, in the Actions column. This opens the same modal as the one described for dataset creation.
Editing is not available for Complete datasets.
Adding elements to a dataset¶
Using the web interface¶
Once you have your dataset, you can add elements to each set.
Using existing splits¶
If the elements of each set are already split in separate folders, the procedure to add them to your dataset is easier.
The flow is the same for each set, but you have to do it separately.
- Browse to a folder selected for the set,
- List all elements that should be added, recursively if there are subfolders,
- Add all these elements to the selection. To do that faster:
  - Increase the pagination size to the maximum (Display -> Pagination size),
  - Use the Select all displayed elements button from the Elements menu on the right.
Repeat this operation for every folder selected for this set.
When all elements have been selected, browse to the selection, using the icon next to your email address in the navigation bar. The last operations are detailed in a later section. At the end, don’t forget to unselect all elements to avoid data-leakage.
Create new splits¶
You first need to decide on the number of elements and the ratios of each split. The number of elements depends on the machine learning technology you are using. Some require larger amounts than others.
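To turn ratios into a number of elements per set, a small helper such as the one below can be used. It is only a sketch: the function name, the integer weights and the 8/1/1 ratio are assumptions chosen for the example, not recommendations.

```python
def split_sizes(total, ratios):
    """Turn integer weights into set sizes that sum to `total`."""
    weight = sum(ratios.values())
    sizes = {name: total * ratio // weight for name, ratio in ratios.items()}
    # Flooring may leave a few elements unassigned: give them to the largest set.
    largest = max(ratios, key=ratios.get)
    sizes[largest] += total - sum(sizes.values())
    return sizes


sizes = split_sizes(500, {"train": 8, "dev": 1, "test": 1})
print(sizes)  # {'train': 400, 'dev': 50, 'test': 50}
```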
To avoid data-leakage, create a folder in your corpus, named after your dataset. This folder will hold all the elements selected for a split.
Then, browse to the folder element which holds the elements you want to use. Add the relevant filters to display your elements. To select page elements from anywhere below this folder, add recursive=Yes and type=page.
To select elements at random, set the order to Random, instead of Position. The switch is available on the right of the filter bar.
For easier browsing, you can also increase the pagination size. There are multiple sizes available, pick one that is either:
- close to the number of elements you wish to select (e.g. 100 if you want to select 95 elements),
- a divisor of the number of elements (e.g. 100 to select 400 elements).
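The two criteria above can be expressed as a short helper. This is a sketch under assumptions: the list of page sizes is hypothetical (check the values offered in the Display menu), and preferring a divisor over the closest size is a choice made for this example.

```python
# Hypothetical list of sizes; check the values offered in the Display menu.
PAGE_SIZES = [20, 50, 100, 500]


def pick_page_size(target, sizes=PAGE_SIZES):
    """Prefer the largest size dividing `target`, else the closest size."""
    divisors = [size for size in sizes if target % size == 0]
    if divisors:
        return max(divisors)
    return min(sizes, key=lambda size: abs(size - target))


print(pick_page_size(95))   # 100
print(pick_page_size(400))  # 100
```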
Repeat the following procedure for each set in your dataset.
- Browse to the folder which holds the elements, add the filters and random ordering as before,
- Set the optimal pagination size depending on the number of elements to add,
- Use the Select all displayed elements button from the Elements menu on the right (you might have to browse multiple pages),
- Add selected elements to the dataset,
- Move elements to the data-leakage folder:
  - Use the Move elements button in the Actions menu,
  - Select the folder created at the very beginning to avoid data leakage,
  - Wait for the asynchronous task to end, it should take a few minutes at most.
- Unselect all elements, using the dedicated button on the selection page.
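The logic behind this procedure (draw at random, then take the drawn elements out of the pool so they cannot be picked again) can be sketched offline in Python. This is an illustration only and does not call Arkindex; the function name, the element IDs and the set sizes are made up.

```python
import random


def random_split(element_ids, sizes, seed=None):
    """Draw disjoint random sets, removing each drawn element from the pool."""
    pool = list(element_ids)
    if sum(sizes.values()) > len(pool):
        raise ValueError("not enough elements for the requested sizes")
    random.Random(seed).shuffle(pool)
    split = {}
    for set_name, size in sizes.items():
        # Taking elements out of the pool plays the role of moving them to
        # the dedicated folder: they cannot be drawn again for another set.
        split[set_name], pool = pool[:size], pool[size:]
    return split


pages = [f"page-{i}" for i in range(100)]
split = random_split(pages, {"train": 80, "dev": 10, "test": 10}, seed=42)
```

Passing a fixed `seed` makes the draw reproducible, which helps when a split has to be rebuilt later.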
Add elements to a dataset from selection¶
Add all elements to the right set of the dataset, using the Add to a dataset button from the Actions menu. This will open a modal to select the dataset and the set.
A green notification will be displayed when the operation is done. You can browse to the dataset’s details page to make sure your elements have been added.
Command Line Interface¶
There is a command-line tool that creates a random dataset from all elements in a folder or a corpus. Its documentation is available here.
This tool also supports picking elements from the whole corpus.
We recommend using this tool on Ubuntu or Mac OS X.
View the dataset’s elements¶
To view a dataset’s details and its elements, click on the name of a dataset in the list. Cycle through the tabs to see the elements in each set.
To remove an element from a set, use the — button in the bottom-right corner of its thumbnail.
Clone an existing dataset¶
To create a new dataset with the same collection of elements as another, you can use the Clone button, in the top-right corner of a dataset’s details page. This will create a new dataset with the same elements and sets, in the Open state. This is helpful when you need to build the v2 of a dataset from a v1 that is in the Complete (thus immutable) state.
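Conceptually, cloning copies the sets and their elements and resets the state. The sketch below models this in plain Python; the `Dataset` class, the `clone` function, the state strings and the dataset names are hypothetical and do not reflect how Arkindex implements the feature.

```python
from copy import deepcopy
from dataclasses import dataclass, field


@dataclass
class Dataset:
    name: str
    state: str = "open"
    # Maps each set name to the IDs of its elements.
    sets: dict[str, list[str]] = field(default_factory=dict)


def clone(dataset: Dataset, new_name: str) -> Dataset:
    """Copy the sets and elements into a new dataset that starts as open."""
    return Dataset(name=new_name, state="open", sets=deepcopy(dataset.sets))


v1 = Dataset("htr-v1", state="complete", sets={"train": ["page-1"], "test": ["page-2"]})
v2 = clone(v1, "htr-v2")
# The clone is editable again, and changing it leaves v1 untouched.
v2.sets["train"].append("page-3")
```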
Delete an existing dataset¶
If you are an administrator on a project, you can delete an existing dataset from the datasets list page.
See the datasets an element is a part of¶
On an element’s details page, there is a Datasets section listing all the datasets and sets that include the element. From this list you can remove the element from a dataset’s set, if you have contributor access to the project.
API Endpoints¶
These endpoints are the most useful to handle Datasets:
Use datasets to train a model¶
Once your dataset is ready, you can start training in Arkindex. Learn more about:
- creating dataset processes,
- training a model, using said processes.