Creating ground truth for transcription
In this tutorial, you will learn how to create ground truth data to train an HTR model, using the Callico collaborative annotation platform.
Complete this tutorial only after you have finished the steps for generating ground truth data to train an image segmentation model, described in the segmentation tutorial.
Import data to transcribe in Callico
Since you learned how to create image segmentation data in the previous sections, you should have:
- Accounts on both the Arkindex and the Callico instances,
- An Arkindex project along with a dataset containing data from the Pellet corpus,
- Text line elements, annotated on Callico and exported back to Arkindex, available in your dataset,
- A project containing one completed segmentation campaign on Callico.
The first step of this tutorial is to import the data to transcribe into Callico. Log in on Callico’s demonstration instance and open the details page of your project by clicking on it from the homepage.
Import an Arkindex dataset
Let’s start by importing the data to be transcribed. You can click on the Import from Arkindex action in the Elements section of the menu on the left side of the project details page:
Then, fill in the import form as presented below, to import all of the Text line elements along with their Page parent from your dataset containing data from Europeana:
- Process name - Name of your process to import elements from Arkindex.
- Element - Not relevant when importing an Arkindex dataset.
- Dataset - UUID of your Arkindex dataset. Replace the aaaaaaaa-bbbb-cccc-dddd-eeeeeeeeeeee value with your own, which can be copied from your Arkindex dataset details page, just below its name.
- Filter sets to import - Filters the sets from your Arkindex dataset that should be imported to Callico. It is not used in this tutorial.
- Filter types to import - Filters the Arkindex elements to import by their type. Here we want to import Page and Text line elements from your dataset. Hold the CTRL key on your keyboard while clicking to select multiple types.
- All subsequent fields of this form can be ignored in this tutorial. You can learn more about them on a dedicated page of Callico’s documentation.
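Before pasting the identifier into the Dataset field, it can help to check that the value is a well-formed UUID and that the placeholder has actually been replaced. The following is a minimal Python sketch using only the standard library. The helper name `is_valid_dataset_uuid` is hypothetical and is not part of Callico or Arkindex, and the sketch assumes the UUID is copied in its canonical hyphenated form.

```python
import uuid

# Placeholder value shown in the import form; it must be replaced by a real UUID.
PLACEHOLDER = "aaaaaaaa-bbbb-cccc-dddd-eeeeeeeeeeee"


def is_valid_dataset_uuid(value: str) -> bool:
    """Return True if `value` is a well-formed UUID other than the placeholder.

    Hypothetical helper for illustration; not a Callico or Arkindex function.
    """
    candidate = value.strip().lower()
    if candidate == PLACEHOLDER:
        return False
    try:
        # uuid.UUID raises ValueError on malformed input.
        parsed = uuid.UUID(candidate)
    except ValueError:
        return False
    # Assumption: only the canonical hyphenated form is accepted.
    return str(parsed) == candidate


print(is_valid_dataset_uuid(PLACEHOLDER))  # False
```

A check like this only catches copy-paste mistakes. It cannot tell whether the UUID matches an existing dataset on your Arkindex instance.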
Track your import’s progress
Once you have started the data import, you will be redirected to a new page where you can track its progress. Note that this page is not dynamically refreshed. You will need to reload it manually to see updated status and logs. When the import is complete, its status will be updated to Completed.
Create transcription tasks to annotate
While Arkindex elements are being imported into your Callico project, you can start setting up your annotation campaign.
Create a transcription campaign
First, navigate back to the details page of your Callico project by using the navbar at the top of the page and clicking on your project name.
From there, you can click on the Create action in the Campaigns section of the menu on the left side of the project details page:
Then, fill in the creation form as presented below:
- Name - Name of your campaign, you can copy the name from the screenshot above.
- Mode - Mode of your campaign, you have to pick the Transcription one to follow this tutorial.
- Description (optional) - Description of your campaign, it supports Markdown.
Configure the newly created campaign
Once your transcription campaign is created, you will be redirected to its configuration page. Fill in the configuration form as presented below:
- Name - Keep it unchanged.
- Description (optional) - Keep it unchanged.
- Number of tasks to assign per volunteer - Set this field to 10, which lets annotators request tasks in batches of 10 pages.
- Number of allowed assignments for available tasks - Ignore this field.
- Group the transcription inputs for a lighter display during annotation - Check this option for a cleaner display during annotation.
- Element types to annotate - The element types that will be transcribed in your tasks. Keep the Text line type and uncheck all others.
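With 10 tasks handed out per request, the number of requests needed to cover a whole campaign follows from a ceiling division. The sketch below illustrates that arithmetic. The function name and the page count of 47 are made up for the example, and it assumes each task is assigned to a single contributor.

```python
import math

# Value entered in "Number of tasks to assign per volunteer".
TASKS_PER_REQUEST = 10


def requests_needed(total_tasks: int, batch_size: int = TASKS_PER_REQUEST) -> int:
    """Number of task requests required to hand out every task once."""
    if total_tasks < 0 or batch_size <= 0:
        raise ValueError("total_tasks must be >= 0 and batch_size must be > 0")
    # The last batch may be smaller than batch_size, hence the ceiling.
    return math.ceil(total_tasks / batch_size)


print(requests_needed(47))  # 5
```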
Create annotation tasks
After configuring your campaign, you will be redirected to its details page. From there, you can access the form to create annotation tasks by clicking the Create action in the Tasks section of the menu on the left:
Warning
Please make sure your import process is complete before creating your annotation tasks, otherwise you may miss pages while annotating.
Then, fill in the creation form as presented below:
- Element type - The element type to create your tasks on. Here we are annotating Pages.
- Users - Ignore this field. Contributors will request the number of tasks they want, rather than having tasks assigned to them directly.
- Sequential - Keep it unchanged. Pages will be annotated following their import order in Callico.
- Elements to use - Keep it unchanged. We want to annotate all the imported pages.
- Maximum number of tasks per user - Ignore this field.
- Create unassigned tasks - This option creates annotation tasks that annotators request as they go. In this tutorial, you must check it.
Once the tasks are created, you will be redirected to the task list which should contain many items, one for each page to be annotated from your dataset.
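The task list is expected to hold one item per imported page, with the Text line elements of each page transcribed inside that task. The sketch below models this relationship on made-up data. The `Element` class and its fields are a simplified, hypothetical stand-in and do not reflect Callico's or Arkindex's actual data model.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional


@dataclass(frozen=True)
class Element:
    """Simplified stand-in for an imported element (hypothetical structure)."""
    id: str
    type: str  # "Page" or "Text line"
    parent_id: Optional[str] = None


def lines_per_page(elements: List[Element]) -> Dict[str, List[str]]:
    """Map each Page id to the ids of its Text line children, in input order."""
    tasks: Dict[str, List[str]] = {e.id: [] for e in elements if e.type == "Page"}
    for e in elements:
        # Text lines whose parent page was not imported are skipped.
        if e.type == "Text line" and e.parent_id in tasks:
            tasks[e.parent_id].append(e.id)
    return tasks


# Made-up sample: two pages with two and one text lines respectively.
sample = [
    Element("page-1", "Page"),
    Element("line-1", "Text line", "page-1"),
    Element("line-2", "Text line", "page-1"),
    Element("page-2", "Page"),
    Element("line-3", "Text line", "page-2"),
]
tasks = lines_per_page(sample)
print(len(tasks))  # 2, one task per page
```

If the number of tasks is lower than the number of pages in your dataset, the import was probably still running when the tasks were created.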
You can navigate back to the details page of your Callico project by using the navbar at the top of the page and clicking on your project name.
Invite collaborators
After completing the first tutorial on creating image segmentation data for training, you should already know how to invite contributors to your project and/or have a contributor account at your disposal. If you need a quick reminder, you can read the dedicated section of the segmentation tutorial.
Annotate the transcription tasks
In this section, we will put ourselves in the shoes of a Contributor user whose role is to annotate tasks from one or more campaigns.
Request annotation tasks
As a contributor of the project, you can request tasks as explained in the dedicated section of the segmentation tutorial. You can either click on the blue My tasks button to select one task to annotate, or on the grey Request tasks button to receive a batch of 10 tasks at once.
Annotate your tasks
Now that you know how to request tasks, you will learn how to annotate transcription tasks. Here is an annotation page:
Transcribe all elements
When working on transcription campaigns, you need to transcribe the elements highlighted in green on the image.
In our case, all displayed elements are Text lines that we have previously segmented ourselves. While transcribing, a blue visual aid is displayed to map each annotation input to an element from the image.
Mark a transcription as uncertain
If you are not completely sure about one of your transcriptions, you can mark your answer as uncertain by clicking on the yellow square ! button displayed next to the input you are working in.
Other tools on the image component
A few other tools are available to ease the annotation process:
- A slider to Zoom in or Zoom out the image being worked on,
- An Open in a new tab tool to better visualize large images,
- Two Rotate left and Rotate right tools to pivot your image.
Warning
Do not forget to validate your task by clicking the green Submit button when you are done annotating.
Correct an annotated task
If you have submitted a task without finishing your annotation or want to correct transcribed lines, you can edit it by going to the Annotated tab in your task list and clicking the green Change annotation button:
You will be redirected to the task annotation page, pre-filled with the last annotation you made:
In this case, we can correct the transcriptions marked as uncertain, remove the associated markers by clicking the red square ! button, and submit a new version for our task:
Only the latest version of an annotation task is exported to the provider. In this tutorial, that is the version published back to Arkindex.
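The rule that only the most recent submission of a task is exported can be pictured as keeping, for each task, the annotation with the highest version number. The sketch below shows that selection on made-up data. The `AnnotationVersion` class, its fields, and the integer version counter are assumptions for illustration and are not Callico's actual storage format.

```python
from dataclasses import dataclass
from typing import Dict, Iterable


@dataclass(frozen=True)
class AnnotationVersion:
    """Hypothetical record of one submission for a task."""
    task_id: str
    version: int  # assumed: 1 for the first submission, +1 per correction
    transcription: str


def latest_versions(annotations: Iterable[AnnotationVersion]) -> Dict[str, AnnotationVersion]:
    """Keep, for each task, only the annotation with the highest version."""
    latest: Dict[str, AnnotationVersion] = {}
    for ann in annotations:
        current = latest.get(ann.task_id)
        if current is None or ann.version > current.version:
            latest[ann.task_id] = ann
    return latest


# Made-up history: task-1 was corrected once, task-2 was submitted once.
history = [
    AnnotationVersion("task-1", 1, "Cher ami"),
    AnnotationVersion("task-2", 1, "Je vous remercie"),
    AnnotationVersion("task-1", 2, "Chère amie"),
]
exported = latest_versions(history)
print(exported["task-1"].transcription)  # Chère amie
```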
Track and export annotations back to Arkindex
Info
If necessary, log out of your Contributor account and log in with your first email address.
Back in your Manager account, you can track the progress of your transcription campaign from its details page:
Once it is completed, i.e. when all tasks from this tutorial are annotated, you can proceed with the export to Arkindex.
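The completion criterion above amounts to checking that every task has reached the Annotated state. The sketch below expresses that check on a plain list of state names. Only the Annotated value comes from this tutorial. The function name and the other state names in the sample are hypothetical.

```python
from collections import Counter
from typing import List, Tuple


def campaign_progress(task_states: List[str]) -> Tuple[int, int, bool]:
    """Return (annotated, total, is_complete) for a list of task states."""
    counts = Counter(task_states)
    total = len(task_states)
    annotated = counts["Annotated"]
    # A campaign with no tasks is not considered complete.
    return annotated, total, total > 0 and annotated == total


# Made-up sample: "Pending" is a hypothetical state name.
print(campaign_progress(["Annotated", "Annotated", "Pending"]))  # (2, 3, False)
```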
Export results to Arkindex
To export your results back to Arkindex, you will need to click on the To Arkindex action in the Export results section of the menu on the left of the campaign details page.
Then, fill in the export form as presented below:
- Process name - Name of your process to export annotations to Arkindex.
- Status of tasks to be exported - Pick the Annotated value to export your tasks.
- Force the republication of annotations - Ignore this field.
- Publish each annotation separately - Ignore this field.
Track your export’s progress
Once you have started the results export, you will be redirected to a new page where you can track its progress. Note that this page is not dynamically refreshed. You will need to reload it manually to see updated status and logs. When the export is complete, its status will be updated to Completed.
Check that your Arkindex export went smoothly
Once the export process is complete, you should check that the annotations for your transcription tasks have been properly published to Arkindex by browsing your dataset elements:
Congratulations, you have successfully transcribed lines in Callico and exported the annotations back to Arkindex!
Next step
Now that the ground truth has been annotated on Callico and collected in Arkindex, you are ready to train a Machine Learning transcription model.