Conclusion

Throughout this tutorial, we have guided you step by step through using Arkindex and Callico for automatic document processing.

You learned how to import your documents into Arkindex and annotate them in Callico for two key document processing tasks: document layout analysis (DLA) and handwritten text recognition (HTR). You also explored how to bring these annotations back into Arkindex to train a segmentation model with YOLO and a text recognition model with PyLaia. Finally, we covered how to run these trained models on new data within Arkindex and export the results in PAGE XML format.

In this section, we will discuss the limitations of the trained models and how they can be improved.

Limitations

The models trained in this tutorial are based on annotations from a relatively small dataset, as we annotated only 100 pages. While this is sufficient for demonstration purposes, the limited data may impact the performance and accuracy of the models.

Specifically, models trained on small datasets may struggle to recognize diverse layouts or handwriting styles, and their ability to generalize to new documents might be limited.
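One way to check how well a text recognition model generalizes is to compare its predictions with reference transcriptions on pages that were not used for training. The sketch below computes a character error rate (CER) from the Levenshtein edit distance. It is only an illustration: the function names are hypothetical, and it is not the evaluation code used by PyLaia or Arkindex.

```python
def levenshtein(reference: str, hypothesis: str) -> int:
    """Return the minimum number of character edits between two strings."""
    # previous[j] holds the distance between the processed prefix of
    # `reference` and the first j characters of `hypothesis`.
    previous = list(range(len(hypothesis) + 1))
    for i, ref_char in enumerate(reference, start=1):
        current = [i]
        for j, hyp_char in enumerate(hypothesis, start=1):
            deletion = previous[j] + 1
            insertion = current[j - 1] + 1
            substitution = previous[j - 1] + (ref_char != hyp_char)
            current.append(min(deletion, insertion, substitution))
        previous = current
    return previous[-1]


def character_error_rate(references, hypotheses) -> float:
    """Return total edits divided by total reference characters."""
    if len(references) != len(hypotheses):
        raise ValueError("references and hypotheses must have the same length")
    total_chars = sum(len(ref) for ref in references)
    if total_chars == 0:
        raise ValueError("references must contain at least one character")
    total_edits = sum(
        levenshtein(ref, hyp) for ref, hyp in zip(references, hypotheses)
    )
    return total_edits / total_chars
```

A CER that is much higher on held-out pages than on the training pages suggests that the model has not generalized well beyond the annotated set.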

How do I improve my models?

The models trained in this tutorial could be improved in several ways:

Increase the dataset size: The main reason the models do not perform optimally is that they are trained on a small dataset (100 pages). Small datasets often lead to overfitting, where models memorize the training data rather than learning generalizable patterns. Performance would improve with more annotated pages.
Diversify your data: Including a variety of document styles, layouts, and handwriting samples in your training set will help the model generalize better to different scenarios. Check out our publicly available datasets, either on Arkindex or HuggingFace.
Start from a pre-trained model: A few models are already available in Arkindex or HuggingFace, with more to come. Adapting a pre-trained model to your dataset could also help with generalization.
Tune hyperparameters: Hyperparameters were not optimized in this tutorial. Adjusting elements such as data pre-processing, data augmentation, optimizer selection, and the early stopping strategy could boost performance.
Adopt more recent models: Finally, while the models used in this tutorial are effective, more recent architectures may yield better results.
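To make the early stopping strategy mentioned above more concrete, here is a minimal, framework-independent sketch of a patience-based rule: training stops once the validation loss has not improved for a given number of epochs. The class, its parameters (`patience`, `min_delta`) and the helper function are illustrative assumptions, not the implementation used by the YOLO or PyLaia trainers.

```python
from dataclasses import dataclass


@dataclass
class EarlyStopping:
    """Stop training when the validation loss stops improving."""

    patience: int = 3       # epochs to wait after the last improvement
    min_delta: float = 0.0  # smallest decrease that counts as an improvement
    best_loss: float = float("inf")
    epochs_without_improvement: int = 0

    def should_stop(self, val_loss: float) -> bool:
        """Record one epoch's validation loss and report whether to stop."""
        if val_loss < self.best_loss - self.min_delta:
            self.best_loss = val_loss
            self.epochs_without_improvement = 0
        else:
            self.epochs_without_improvement += 1
        return self.epochs_without_improvement >= self.patience


def stopping_epoch(val_losses, patience=3, min_delta=0.0):
    """Return the 1-based epoch at which training would stop, or None."""
    stopper = EarlyStopping(patience=patience, min_delta=min_delta)
    for epoch, loss in enumerate(val_losses, start=1):
        if stopper.should_stop(loss):
            return epoch
    return None
```

A larger `patience` gives the model more time to recover from a temporary plateau, at the cost of longer training. With only 100 annotated pages the validation loss can be noisy, so a very small value may stop training too early.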

As experts in the field, we can help you design, train, and optimize models tailored to your specific needs.

Thank you for following along with this tutorial. We hope it has given you the knowledge and tools you need to make full use of Arkindex and Callico for your document processing needs. We encourage you to keep experimenting with these platforms. Please contact us if you have any questions!

Where do I find the models and datasets of this tutorial?

The datasets are published, in the right format, on HuggingFace:

link to the image segmentation dataset,
link to the line transcription dataset.

The models are also published on HuggingFace:

link to the image segmentation model,
link to the line transcription model.

Head over there to learn how to use them outside the scope of this tutorial.