In the context of Machine Learning, a Dataset is a collection of data. Datasets are organized in sets, which may or may not overlap. In Arkindex, datasets are collections of elements.
Datasets go through multiple states during their life cycle.
Open: when a dataset is created, it is in the Open state. You can edit its details, the name of its sets and manage the elements included.
Building: when a worker tries to generate an archive of a dataset, it goes into the Building state.
Error: when the worker failed to generate the dataset archive, the dataset goes into the Error state.
Complete: when the worker succeeded in generating the archive, the dataset goes into the Complete state. The dataset is now immutable and no element can be added or removed.
If you want to use the dataset outside of Arkindex, using the API Client or the SQLite export, you do not need to change its state.
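The life cycle described above can be pictured as a small state machine. The following is only a sketch: the enum values and the helper names are hypothetical, and only the transitions stated in the text are modeled (whether a dataset in the Error state can be built again is not stated here, so that transition is left out).

```python
from enum import Enum


class DatasetState(Enum):
    # The string values are illustrative; they are not taken from the Arkindex API.
    OPEN = "open"
    BUILDING = "building"
    ERROR = "error"
    COMPLETE = "complete"


# Only the transitions described in the text are listed.
ALLOWED_TRANSITIONS = {
    DatasetState.OPEN: {DatasetState.BUILDING},
    DatasetState.BUILDING: {DatasetState.ERROR, DatasetState.COMPLETE},
    DatasetState.ERROR: set(),
    DatasetState.COMPLETE: set(),
}


def can_transition(current, target):
    """Return True when the text describes a move from current to target."""
    return target in ALLOWED_TRANSITIONS[current]


def is_immutable(state):
    """A Complete dataset can no longer have elements added or removed."""
    return state is DatasetState.COMPLETE
```

For example, `can_transition(DatasetState.OPEN, DatasetState.BUILDING)` is true, while a Complete dataset has no outgoing transition in this model.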
Datasets can be managed from the Datasets tab in a project's details page.
From this interface, if you have contributor access to the project, you are able to:
Complete state,
To create a new dataset, click on the + button, on the bottom right of the dataset list. This opens a dataset creation modal.
Enter a name and a description for your dataset.
Names of datasets are unique in a corpus. This means that you cannot select a name already taken by another dataset.
Select the names of the sets of your dataset. They should match the names supported by the ML technology you plan to use later. You can always rename them later if there is a mismatch.
To avoid data leakage, that is, having the same element in more than one set of your dataset, check the Require unique elements among sets checkbox.
The dataset's state is Open at first. In this state, you can add elements to its sets and edit any of its attributes (name, description, set names, ...).
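To make the idea behind the Require unique elements among sets option concrete, here is a minimal sketch of a leakage check done outside of Arkindex. It assumes you already have the element IDs of each set in a plain dictionary; the function name, the set names and the IDs are all illustrative and are not part of Arkindex.

```python
def find_leaks(sets):
    """Map each element ID found in more than one set to the names of those sets.

    `sets` is a dict of set name -> iterable of element IDs.
    """
    membership = {}
    for set_name, element_ids in sets.items():
        # set() ignores duplicates inside a single set
        for element_id in set(element_ids):
            membership.setdefault(element_id, []).append(set_name)
    return {
        element_id: sorted(names)
        for element_id, names in membership.items()
        if len(names) > 1
    }


# Hypothetical set names and element IDs
example = {
    "train": ["a", "b", "c"],
    "dev": ["c", "d"],
    "test": ["e"],
}
leaks = find_leaks(example)
```

Here `leaks` reports that element `c` belongs to both `dev` and `train`. An empty result means that no element is shared between sets.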
To edit an existing dataset, click on the pencil-shaped icon on the far right of a dataset's row, in the Actions column. This opens the same modal as the one described for dataset creation.
Editing is not available for Complete datasets.
Once you have your dataset, you can add elements to each set.
If the elements of each set are already split in separate folders, the procedure to add them to your dataset is easier.
The flow is the same for each set, but you have to do it separately.
Repeat this operation for every folder selected for this set.
When all elements have been selected, browse to the selection, using the icon next to your email address in the navigation bar. The last operations are detailed in a later section. At the end, don't forget to unselect all elements to avoid data-leakage.
You first need to decide on the number of elements and the ratios of each split. The number of elements depends on the machine learning technology you are using. Some require larger amounts than others.
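As an illustration of turning ratios into element counts, the sketch below distributes a total number of elements across sets and hands the rounding leftovers to the sets with the largest fractional share. The function name, the set names and the 8/1/1 weights are assumptions chosen for the example; use whatever split suits your ML technology.

```python
def split_sizes(total, weights):
    """Convert relative weights into integer set sizes that sum to `total`."""
    weight_sum = sum(weights.values())
    if weight_sum <= 0:
        raise ValueError("weights must sum to a positive number")
    exact = {name: total * w / weight_sum for name, w in weights.items()}
    # Round every share down first
    sizes = {name: int(value) for name, value in exact.items()}
    leftover = total - sum(sizes.values())
    # Hand the remaining elements to the sets with the largest fractional part
    by_remainder = sorted(exact, key=lambda n: exact[n] - sizes[n], reverse=True)
    for name in by_remainder[:leftover]:
        sizes[name] += 1
    return sizes


# Hypothetical 8/1/1 split of 500 elements
sizes = split_sizes(500, {"train": 8, "dev": 1, "test": 1})
```

With these weights, `sizes` is 400 elements for `train` and 50 each for `dev` and `test`. These are the counts you would then select for each set.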
To avoid data-leakage, create a folder in your corpus, named after your dataset. This folder will hold all the elements selected for a split.
Then, browse to the folder element which holds the elements you want to use. Add the relevant filters to display your elements. To select page elements from anywhere below this folder, add:
recursive=Yes
type=page
To select elements at random, set the order to Random, instead of Position. The switch is available on the right of the filter bar.
For easier browsing, you can also increase the pagination size. There are multiple sizes available; pick one that is either:
at least the number of elements you want to select (for example, 100 if you want to select 95 elements),
a divisor of that number (for example, 100 to select 400 elements).
Repeat the following procedure for each set in your dataset.
Add all elements to the right set of the dataset, using the Add to a dataset button from the Actions menu. This will open a modal to select the dataset and the set.
A green notification will be displayed when the operation is done. You can browse to the dataset's details page to make sure your elements have been added.
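The pagination advice above gives two examples: 100 for 95 elements and 100 for 400 elements. One way to read them is "use a page that holds the whole selection, or else a page size that divides it evenly". The sketch below follows that reading. The list of available sizes is hypothetical (only 100 is mentioned in this page) and the function is not part of Arkindex.

```python
def pick_page_size(target, available_sizes):
    """Return (page_size, number_of_pages) for selecting `target` elements.

    Prefers the smallest size that holds the whole selection on one page,
    otherwise the largest size that divides the target evenly.
    Returns None when no available size fits either rule.
    """
    fitting = [size for size in available_sizes if size >= target]
    if fitting:
        return min(fitting), 1
    divisors = [size for size in available_sizes if target % size == 0]
    if divisors:
        size = max(divisors)
        return size, target // size
    return None


# Hypothetical list of pagination sizes
SIZES = [20, 50, 100]
one_page = pick_page_size(95, SIZES)
four_pages = pick_page_size(400, SIZES)
```

Under these assumptions, 95 elements fit on a single page of 100, and 400 elements correspond to four full pages of 100.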
There is a command-line tool that creates a random dataset from all elements in a folder or a corpus. Its documentation is available here.
This tool also supports picking elements from the whole corpus.
We recommend using this tool on Ubuntu or Mac OS X.
To view a dataset's details and its elements, click on the name of a dataset in the list. Cycle through the tabs to see the elements in each set.
To remove an element from a set, use the — button in the bottom-right corner of its thumbnail.
To create a new dataset with the same collection of elements as another, you can use the Clone button, in the top-right corner of a dataset's details page. This will create a new dataset with the same elements and sets, in the Open state. This is helpful when you need to build the v2 of a dataset from a v1 that is in the Complete (thus immutable) state.
If you are an administrator on a project, you can delete an existing dataset from the datasets list page.
On an element's details page, there is a Datasets section listing all the datasets and sets that include the element. From this list you can remove the element from a dataset's set, if you have contributor access to the project.
These endpoints are the most useful to handle Datasets:
Once your dataset is ready, you can start training in Arkindex. Learn more about: