Arkindex 1.9.0

A new Arkindex release is available.

To upgrade a development instance, follow this documentation.

To upgrade a production instance, you need to:

  • Deploy this release’s Docker image: registry.gitlab.teklia.com/arkindex/backend:1.9.0

  • Run the database migrations: docker exec ark-backend arkindex migrate

  • Update the system workers: docker exec ark-backend arkindex update_system_workers

The release notes for Arkindex 1.9.0 are available here.

The main changes impacting developers and system administrators are detailed below.

Entity removal

The Entity table has been removed, and TranscriptionEntities are now directly linked to entity types. This can save a significant amount of disk space, since in most cases one Entity existed for each TranscriptionEntity. On projects that include an entity recognition step on all documents, this can reduce the database size by up to 40%.
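To measure the impact on a specific instance, you can compare the disk usage of the TranscriptionEntity table before and after the upgrade. This is a minimal sketch assuming a PostgreSQL database and the default table name:

-- Total on-disk size of the TranscriptionEntity table, including its indexes.
-- Running this before and after the upgrade shows the space reclaimed by the change.
SELECT pg_size_pretty(pg_total_relation_size('documents_transcriptionentity'));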

Long database migration

Two database migrations are required to make this change, and one of them has to update every row of the TranscriptionEntity table to switch from entity IDs to entity type IDs. The documents.0029_migrate_entity migration can thus take a long time and may require significant backend downtime to run to completion. It can, however, be executed normally through arkindex migrate.
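To get a rough idea of how long this migration might take on your instance, you can estimate how many TranscriptionEntity rows it will have to update. A minimal sketch, assuming PostgreSQL, that relies on planner statistics rather than a full count so it stays fast on large tables:

-- Approximate row count of the TranscriptionEntity table, taken from planner statistics.
-- Every one of these rows is rewritten by documents.0029_migrate_entity.
SELECT reltuples::bigint AS estimated_rows
FROM pg_class
WHERE relname = 'documents_transcriptionentity';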

Manual deduplication

The database migrations will add a new unique constraint to ensure that only one TranscriptionEntity with a given entity type can be declared at the same position, on the same transcription, and with the same WorkerRun. A uniqueness constraint on entities already existed, but it allowed multiple distinct entities of the same entity type.
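Judging from the index name and columns that appear in the error message below, the new rule is roughly equivalent to the following unique index. This is only an illustrative sketch: the actual index is generated by the migration and may differ, in particular in how it treats null WorkerRun values.

-- Illustrative sketch of the new uniqueness rule: only one TranscriptionEntity per
-- transcription, entity type, position, length and WorkerRun.
CREATE UNIQUE INDEX unique_transcription_entity
    ON documents_transcriptionentity (transcription_id, type_id, "offset", length, worker_run_id);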

If multiple entities of the same entity type exist under those conditions, the documents.0030_drop_entity migration can fail with the following error:

django.db.utils.IntegrityError: could not create unique index "unique_transcription_entity"
DETAIL:  Key (transcription_id, type_id, "offset", length, worker_run_id)=(a07c6e1f-097b-401a-9041-925415df1b5d, d498d6c9-0763-4130-95eb-3d4bda72ef43, 100, 4, null) is duplicated.
CONTEXT:  parallel worker

Manual intervention on the database, for example through arkindex dbshell, will be necessary to deduplicate the entities, as Arkindex cannot determine on its own whether this data is vital to a project.

To list every element ID for which there are duplicate entities that require attention, you can use the following query:

SELECT DISTINCT any_value(element_id)
FROM documents_transcriptionentity te
INNER JOIN documents_transcription t ON t.id = te.transcription_id
GROUP BY transcription_id, type_id, "offset", length, te.worker_run_id
HAVING COUNT(*) > 1;
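If you want to review the duplicated rows themselves before deciding what to keep, a query along these lines can help. It is a sketch that follows the same logic as the queries above, grouping null WorkerRun values together:

-- List every duplicated TranscriptionEntity row, with the number of copies in its group,
-- so the rows can be reviewed before any deletion.
SELECT id, transcription_id, type_id, "offset", length, worker_run_id, copies
FROM (
    SELECT
        id, transcription_id, type_id, "offset", length, worker_run_id,
        COUNT(*) OVER (
            PARTITION BY transcription_id, type_id, "offset", length, worker_run_id
        ) AS copies
    FROM documents_transcriptionentity
) AS counted
WHERE copies > 1
ORDER BY transcription_id, type_id, "offset", length;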

If none of the duplicates are of any importance, you can deduplicate across the whole database with the following query:

DELETE FROM documents_transcriptionentity
WHERE id IN (
    SELECT duplicates.id FROM (
        SELECT
            id, row_number() OVER (
                PARTITION BY transcription_id, type_id, "offset", length, worker_run_id
                ORDER BY id
            ) AS position
        FROM documents_transcriptionentity
    ) AS duplicates
    WHERE duplicates.position > 1
);
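After running the deletion, ideally inside a transaction so it can be rolled back if anything looks wrong, you can check that no duplicates remain before retrying the migration. The following query should return 0:

-- Count the groups of TranscriptionEntities that still violate the new uniqueness rule.
SELECT COUNT(*) AS remaining_duplicates
FROM (
    SELECT 1
    FROM documents_transcriptionentity
    GROUP BY transcription_id, type_id, "offset", length, worker_run_id
    HAVING COUNT(*) > 1
) AS duplicates;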

Full reindexation required

Entities could be searched through the Solr search feature. Breaking changes had to be made to the search index, so a full reindexation is required. It can be run with the arkindex reindex --all --drop command.

Until this reindexation has been executed, search results will not include entities, and the search API will not be able to return any facets at all, not just the facets related to entities.