Data model
This section is an introduction to the main concepts used in Arkindex to model a document hierarchy, the document structure and its content. For a deeper dive, please look into the Projects section.
Images
Arkindex is a document image processing platform. The starting point of a project is a collection of document images. All the images in Arkindex are served by a IIIF server, either the IIIF server integrated in Arkindex or an external one. The first step in a project is to import the images into Arkindex, either through the web interface or from external storage.
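Because every image is served over IIIF, a page or a region of a page can be addressed with a plain URL. The sketch below builds such a URL following the generic IIIF Image API pattern (`{base}/{identifier}/{region}/{size}/{rotation}/{quality}.{format}`). The server address and image identifier are made up for illustration, and the helper `iiif_image_url` is hypothetical, not part of Arkindex.

```python
from urllib.parse import quote


def iiif_image_url(base, identifier, region="full", size="max",
                   rotation=0, quality="default", fmt="jpg"):
    """Build a IIIF Image API URL (hypothetical helper, not an Arkindex API).

    The identifier is percent-encoded so that any slash it contains
    is not mistaken for a path separator.
    """
    encoded = quote(identifier, safe="")
    return f"{base.rstrip('/')}/{encoded}/{region}/{size}/{rotation}/{quality}.{fmt}"


# Assumed server and identifier, for illustration only.
BASE = "https://iiif.example.org/iiif"
IMAGE = "registers/1921/page_001.jpg"

full_page = iiif_image_url(BASE, IMAGE)
# Region is "x,y,width,height" in pixels of the full image.
one_line = iiif_image_url(BASE, IMAGE, region="100,400,1800,60")
```

The `max` size keyword is the one used by recent versions of the IIIF Image API; an older server may expect `full` instead.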
Elements
In Arkindex, elements are the building blocks for both the hierarchical organisation of documents and the analysis of image structure. Any number of element types can be created, depending on the project. Element types are project-specific.
Elements are used to model the document hierarchy. For example, to organise the census images for a French department in 1921, 1926 and 1936, the following elements could be defined:
- Year (folder) to organise all the census registers for a given year
- Town (folder) to organise the pages of a register for one town and one year
- Page: the pages of the register
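The Year / Town / Page hierarchy above can be pictured as a small tree. The following is a minimal sketch using plain Python dataclasses; the `Element` class, its fields and the town name are assumptions made for illustration and do not reflect the actual Arkindex API or database schema.

```python
from __future__ import annotations

from dataclasses import dataclass, field


@dataclass
class Element:
    """Hypothetical stand-in for an Arkindex element (not the real API)."""
    type: str  # project-specific element type, e.g. "year", "town", "page"
    name: str
    children: list[Element] = field(default_factory=list)

    def add(self, child: Element) -> Element:
        """Attach a child element and return it, to allow chaining."""
        self.children.append(child)
        return child


def page_paths(element, prefix=""):
    """Yield a slash-separated path for every page found below `element`."""
    path = f"{prefix}/{element.name}" if prefix else element.name
    if element.type == "page":
        yield path
    for child in element.children:
        yield from page_paths(child, path)


# One census year, one (made-up) town, two register pages.
year = Element("year", "1921")
town = year.add(Element("town", "Exampleville"))
town.add(Element("page", "page 1"))
town.add(Element("page", "page 2"))

paths = list(page_paths(year))
```

Here the folder-like types (`year`, `town`) only organise other elements, while `page` is the level at which images are attached.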
Elements are also used to model the structure of a page. The structure elements depend on the type of document. The default elements are Page, Text Zone, Paragraph, Text Line and Word.
Using the example of census tables, the table structure could be modelled using the following elements:
- Header
- Table Body
- Table Line
- Cell
Element types must be sufficiently generic to appear many times in a corpus and must, in most cases, be locatable on the image.
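To make "locatable on the image" concrete, a structure element can be thought of as a type plus a polygon in pixel coordinates. The sketch below is an illustration under that assumption: the `Zone` class, the type names and the coordinates are hypothetical, and the containment test uses bounding boxes as a simplification.

```python
from dataclasses import dataclass
from typing import List, Tuple

Point = Tuple[int, int]  # (x, y) in image pixel coordinates


@dataclass
class Zone:
    """Hypothetical structure element: a type plus a polygon on the page image."""
    type: str
    polygon: List[Point]

    def bounding_box(self) -> Tuple[int, int, int, int]:
        """Return (min_x, min_y, max_x, max_y) of the polygon."""
        xs = [x for x, _ in self.polygon]
        ys = [y for _, y in self.polygon]
        return min(xs), min(ys), max(xs), max(ys)


def contains(outer: Zone, inner: Zone) -> bool:
    """True if inner's bounding box lies inside outer's bounding box."""
    ox1, oy1, ox2, oy2 = outer.bounding_box()
    ix1, iy1, ix2, iy2 = inner.bounding_box()
    return ox1 <= ix1 and oy1 <= iy1 and ix2 <= ox2 and iy2 <= oy2


# Made-up coordinates for a census table on a page image.
header = Zone("header", [(100, 200), (1900, 200), (1900, 380), (100, 380)])
table_line = Zone("table_line", [(100, 400), (1900, 400), (1900, 460), (100, 460)])
cell = Zone("cell", [(100, 400), (400, 400), (400, 460), (100, 460)])

cell_in_line = contains(table_line, cell)
cell_in_header = contains(header, cell)
```

A check like `contains` is one way to decide which Table Line a detected Cell should be attached to.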
Info
From a machine learning point of view, elements are generated by image segmentation or object detection algorithms.
If you want to know more about Elements, please read the dedicated section.
Classes
Classes are attributes of elements. They qualify an element with a more precise characteristic. Classes are project-dependent.
Continuing with the example of census processing, the use of classes would allow the types of page to be specified more precisely: cover page, list page, summary page, blank page, and so on. Page processing could then be different depending on the type: ignore the page, recognise individuals, etc.
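As an illustration of class-dependent processing, the sketch below routes a page to an action based on its class. The mapping from class to action is an assumption chosen for the example (the text only says that processing may differ per class), and `plan_processing` is a hypothetical helper, not an Arkindex function.

```python
# Assumed routing table: which action to run for each page class.
ACTIONS = {
    "cover page": "ignore",
    "blank page": "ignore",
    "list page": "recognise individuals",
}


def plan_processing(page_class: str) -> str:
    """Return the action for a page class, falling back to manual review."""
    return ACTIONS.get(page_class, "manual review")


pages = {"page 1": "cover page", "page 2": "list page", "page 3": "summary page"}
plan = {name: plan_processing(cls) for name, cls in pages.items()}
```

Unknown or unmapped classes fall back to `"manual review"` here so that no page is silently skipped; a real project would pick its own default.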
Info
From a machine learning point of view, classes are generated by image or text classifiers.
Transcription
Any type of element can be enriched with a transcription. In most cases, transcriptions are added to textual structure elements (paragraph, line of text), but it is possible to add a transcription to a photograph to describe it, for example.
Info
From a machine learning point of view, transcriptions are generated by OCR (Optical Character Recognition) or HTR (Handwritten Text Recognition) systems.
Entities
Entities in Arkindex correspond to the concept of a named entity, i.e. a linguistic expression referring to a proper noun or to a pre-defined repository. In Arkindex, entities are only defined by reference to a text, present either in a transcription (on an element) or in a metadata value (also on an element). Entities are therefore generally used on text.
The mention of an entity in a text is called a transcription entity in Arkindex. A transcription entity is made up of its position in the text (start and end characters), its type and the link to the entity to which the mention refers.
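The relationship between a text, an entity and its mentions can be sketched as follows. The class names and fields are hypothetical and only mirror the description above; in particular, treating `end` as exclusive (Python slice convention) is an assumption, and the sample sentence is invented.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Entity:
    """Hypothetical entity: the thing being referred to."""
    name: str
    type: str


@dataclass(frozen=True)
class TranscriptionEntity:
    """Hypothetical mention of an entity inside a transcription text."""
    start: int  # index of the first character of the mention
    end: int    # index just past the last character (assumed exclusive)
    type: str
    entity: Entity

    def mention(self, text: str) -> str:
        """Return the span of `text` covered by this mention."""
        return text[self.start:self.end]


text = "Jean Dupont, born in Lyon"
person = Entity("Jean Dupont", "person")
place = Entity("Lyon", "location")

mentions = [
    TranscriptionEntity(0, 11, "person", person),
    TranscriptionEntity(21, 25, "location", place),
]
spans = [m.mention(text) for m in mentions]
```

Keeping the entity separate from its mentions lets several mentions, possibly spelled differently, point to the same entity.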
Info
From a machine learning point of view, transcription entities are generated by named-entity recognition (NER) systems.
Meta-data
Metadata can be added to all element types and entities. Metadata consists of free-form key-value pairs, which can optionally be typed to constrain their format (text, date, number, URL). Metadata can be used to store external data about documents imported into Arkindex.
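The effect of typing a metadata value can be sketched with a small validator. This is an illustration only: the function `is_valid`, the lowercase type names, the ISO 8601 date format and the restriction to http(s) URLs are assumptions, not the validation rules Arkindex actually applies.

```python
from datetime import date
from urllib.parse import urlparse


def is_valid(md_type: str, value: str) -> bool:
    """Check a metadata value against its declared type (hypothetical rules)."""
    if md_type == "text":
        return True  # free text: anything goes
    if md_type == "number":
        try:
            float(value)
            return True
        except ValueError:
            return False
    if md_type == "date":
        try:
            date.fromisoformat(value)  # assumes YYYY-MM-DD
            return True
        except ValueError:
            return False
    if md_type == "url":
        parsed = urlparse(value)
        return parsed.scheme in ("http", "https") and bool(parsed.netloc)
    raise ValueError(f"unknown metadata type: {md_type}")


# Made-up metadata for a census page: (type, key, value).
metadata = [
    ("text", "archive reference", "6 M 123"),
    ("date", "census date", "1921-03-06"),
    ("number", "household count", "42"),
    ("url", "catalogue record", "https://example.com/register/12"),
]
all_valid = all(is_valid(t, v) for t, _, v in metadata)
```

Untyped (text) metadata accepts any value, which matches the "free-form" behaviour; the other types reject values that do not parse.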