Database exports

Projects can be exported as SQLite databases. This page describes its structure and provides usage examples.

To learn how to export a project as a database, see Export a project as a database.

A Python library is available to help in using these exports within Python code.

Database exports are a more advanced format intended for developers and data engineers.

To export a project in a simpler format better suited for other users, see Export a project as a document.

Database structure

graph LR element_path --> element element --> image element --> worker_run transcription --> element transcription --> worker_run classification --> element classification --> worker_run transcription_entity --> transcription transcription_entity --> entity_type transcription_entity --> worker_run metadata --> element metadata --> entity metadata --> worker_run worker_run --> worker_version image --> image_server dataset_element --> element dataset_element --> dataset

export_version

Name Type Description

version

INTEGER

Version number of the database structure. Currently set to 11.

classification

Name Type Description

id

VARCHAR(37)

UUID of the classification.

element_id

VARCHAR(37)

UUID of the element the classification applies to.

class_name

VARCHAR(1024)

Name of the class applied by the classification.

state

VARCHAR(16)

Moderation state of the classification. Either pending, validated or rejected.

moderator

VARCHAR(255)

Email of a user that validated or rejected the classification. Null when there has been no moderation.

confidence

REAL

Confidence level, between 0 and 1.

high_confidence

INTEGER

Whether the worker states this class is the correct class for the element. Either 0 or 1.

worker_run_id

VARCHAR(37)

UUID of a worker run that created this classification. Nullable.

dataset

Name Type Description

id

VARCHAR(37)

UUID of the dataset.

name

VARCHAR(100)

Name of the dataset.

description

TEXT

Description of the dataset.

state

VARCHAR(50)

State of the dataset. Either open, building, complete or error.

sets

TEXT

Comma-separated list of allowed set names.

dataset_element

Name Type Description

id

VARCHAR(37)

UUID of the dataset-element relationship.

dataset_id

VARCHAR(37)

UUID of the dataset the element belongs to.

element_id

VARCHAR(37)

UUID of the element contained in the dataset.

set_name

VARCHAR(50)

Name of the set within the dataset where the element is.

element

Name Type Description

id

VARCHAR(37)

UUID of the element.

created

REAL

Creation date of the element, as a UNIX timestamp.

updated

REAL

Last update date of the element, as a UNIX timestamp.

name

VARCHAR(250)

Name of the element.

type

VARCHAR(50)

Slug of the element’s type.

image_id

VARCHAR(37)

UUID of the image the element is on. Nullable.

polygon

TEXT

Coordinates of the element’s polygon on the image, as a JSON array of arrays of floats. Null if the element is not on an image.

rotation_angle

INTEGER

Clockwise angle of rotation to apply on the image after cropping and mirroring.

mirrored

INTEGER

Whether or not to mirror the image along its vertical axis after cropping and before rotating.

worker_run_id

VARCHAR(37)

UUID of a worker run that created this element. Nullable.

confidence

REAL

Confidence level reported by the worker, between 0 and 1. Nullable.

element_path

Name Type Description

id

VARCHAR(37)

UUID of the element path.

parent_id

VARCHAR(37)

UUID of the parent element.

child_id

VARCHAR(37)

UUID of the child element.

ordering

INTEGER

Position of the child within the parent.

entity_type

Name Type Description

id

VARCHAR(37)

UUID of the entity type.

name

TEXT

Name of the entity type.

color

VARCHAR(6)

Color used for the entity type in the frontend.

image

Name Type Description

id

VARCHAR(37)

UUID of the image.

url

TEXT

URL leading to the image’s IIIF identifier.

width

INTEGER

Width of the image in pixels. Can be 0 if it is unknown.

height

INTEGER

Height of the image in pixels. Can be 0 if it is unknown.

server_id

INTEGER

Integer ID of the server that hosts this image.

image_server

Name Type Description

id

INTEGER

Integer ID of the IIIF image server.

display_name

VARCHAR(250)

Human-readable name of the server.

url

TEXT

Base URL for all images in this server.

max_width

INTEGER

Maximum width of any image that the server will return. Nullable.

max_height

INTEGER

Maximum height of any image that the server will return. Nullable.

metadata

Name Type Description

id

VARCHAR(37)

UUID of the metadata.

element_id

VARCHAR(37)

UUID of the element this metadata belongs to.

name

VARCHAR(250)

Name of the metadata.

type

VARCHAR(50)

Type of the metadata: text, date, reference, etc.

value

TEXT

Value of the metadata.

worker_run_id

VARCHAR(37)

UUID of a worker run that created this metadata. Nullable.

transcription

Name Type Description

id

VARCHAR(37)

UUID of the transcription.

element_id

VARCHAR(37)

UUID of the element this transcription is attached to.

text

TEXT

Text of the transcription.

confidence

REAL

Confidence level reported by the worker, between 0 and 1. Nullable.

orientation

TEXT

Text orientation: horizontal-lr, horizontal-rl, vertical-lr or vertical-rl.

worker_run_id

VARCHAR(37)

UUID of a worker run that created this transcription. Nullable.

transcription_entity

Name Type Description

id

VARCHAR(37)

UUID of the transcription-entity link.

transcription_id

VARCHAR(37)

UUID of the transcription.

type_id

VARCHAR(37)

UUID of the entity type.

offset

INTEGER

Zero-based index of the first character in the transcription’s text that maps to the entity.

length

INTEGER

Length of the portion of the transcription’s text that maps to the entity.

worker_run_id

VARCHAR(37)

UUID of a worker run that created this transcription-entity link. Nullable.

confidence

REAL

Confidence level reported by the worker, between 0 and 1. Nullable.

worker_version

Name Type Description

id

VARCHAR(37)

UUID of the worker version.

name

VARCHAR(100)

Human-readable name of the worker.

slug

VARCHAR(100)

Slug of the worker.

type

VARCHAR(50)

Type of the worker (recognizer, classifier, etc.).

version

INTEGER

Version number for this worker version, when it does not have a Git commit. Nullable.

revision

VARCHAR(200)

Full URL of a Git commit for this version, when the worker is based on a repository. Nullable.

repository_url

TEXT

URL of the Git repository that hosts this worker version. Nullable.

Either version, or both revision and repository_url must be set. All three fields cannot be set at once.

worker_run

Name Type Description

id

VARCHAR(37)

UUID of the worker run.

worker_version_id

VARCHAR(37)

UUID of the worker version of this worker run.

model_version_id

VARCHAR(37)

UUID of the selected model version. Nullable.

model_id

VARCHAR(37)

UUID of the model of the selected model version. Nullable.

model_name

VARCHAR(100)

Name of the model of the selected model version. Nullable.

configuration_id

VARCHAR(37)

UUID of the selected worker configuration. Nullable.

configuration

TEXT

Contents of the selected worker configuration, as JSON. Nullable.

Sample queries

Below are some sample SQL queries you can run on this database.

Listing elements by type

SELECT *
FROM element
WHERE type = 'page';

Listing child elements recursively

WITH RECURSIVE child_ids (id) AS (
        SELECT child_id
        FROM element_path
        WHERE parent_id = '0ac0a631-dddd-4fbe-9c9c-a5d803b3bc94'
    UNION
        SELECT child_id
        FROM element_path
        INNER JOIN child_ids ON (element_path.parent_id = child_ids.id)
)
SELECT *
FROM element
INNER JOIN child_ids USING (id);

To learn more about the recursive query syntax used here, see the official SQLite documentation on Recursive Common Table Expressions.

List all transcriptions on direct children

This assumes that all the transcriptions are on children with a text_line type.

SELECT transcription.*
FROM transcription
INNER JOIN element ON (element.id = transcription.element_id)
INNER JOIN element_path ON (element_path.child_id = element.id)
WHERE element_path.parent_id = '0ac0a631-dddd-4fbe-9c9c-a5d803b3bc94'
AND element.type = 'text_line';

List transcription entities on an element

SELECT entity_type.name, SUBSTR(transcription.text, transcription_entity.offset + 1, transcription_entity.length)
FROM transcription
INNER JOIN transcription_entity ON transcription_entity.transcription_id = transcription.id
INNER JOIN entity_type ON entity_type.id = transcription_entity.type_id
WHERE transcription.element_id = '0ac0a631-dddd-4fbe-9c9c-a5d803b3bc94';

This will return the entity type and the portion of text within the transcription, for each TranscriptionEntity in any transcription of the given element.

Our Python library provides this text splitting through a simpler helper named TranscriptionEntity.text:

from pathlib import Path
from arkindex_export import open_database, EntityType, TranscriptionEntity, Transcription

open_database(Path("./database.sqlite"))

query = (
    TranscriptionEntity
    .select(EntityType.name, TranscriptionEntity.text)
    .join(Transcription)
    .join(EntityType)
    .where(Transcription.element_id = '0ac0a631-dddd-4fbe-9c9c-a5d803b3bc94')
)
for type_name, text in query.tuples():
    print(f"{type_name}: {text}")