Arkindex 1.9.0
A new Arkindex release is available.
To upgrade a development instance, follow this documentation.
To upgrade a production instance, you need to:
- Deploy this release’s Docker image:
  registry.gitlab.teklia.com/arkindex/backend:1.9.0
- Run the database migrations:
  docker exec ark-backend arkindex migrate
- Update the system workers:
  docker exec ark-backend arkindex update_system_workers
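Put together, the full upgrade sequence looks like the following sketch, assuming the backend container is named ark-backend as in the commands above, and that your deployment pulls the image explicitly before restarting the container:
docker pull registry.gitlab.teklia.com/arkindex/backend:1.9.0
# restart the backend container on the new image, then:
docker exec ark-backend arkindex migrate
docker exec ark-backend arkindex update_system_workers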
The release notes for Arkindex 1.9.0 are available here.
The main changes impacting developers and system administrators are detailed below.
Entity removal
The Entity table has been removed, and TranscriptionEntities are now directly linked to entity types. This can save a significant amount of disk space, as in most cases, one Entity existed for each TranscriptionEntity. On projects that include an entity recognition step on all documents, this can reduce the database size by up to 40%.
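To measure the actual savings on your own instance, one option (not part of the release itself) is to compare sizes before and after the upgrade from a database shell; keep in mind that space freed by the updated rows is only reclaimed once the table has been vacuumed:
-- Overall database size, and size of the table rewritten by the migration
SELECT pg_size_pretty(pg_database_size(current_database()));
SELECT pg_size_pretty(pg_total_relation_size('documents_transcriptionentity'));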
Long database migration
Two database migrations are required to make this change, and one of them has to update every row in the TranscriptionEntity table to switch from entity IDs to entity type IDs. The documents.0029_migrate_entity migration can thus take a long time, and may require significant backend downtime to execute to completion. This migration can however be executed normally through arkindex migrate.
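If you want to time or isolate the expensive step during a maintenance window, Django's migrate command accepts an app label and migration name, so something like the following sketch should work, assuming the arkindex command exposes the standard Django migrate arguments:
# apply everything up to and including the long migration, timing it
time docker exec ark-backend arkindex migrate documents 0029_migrate_entity
# then apply the remaining migrations
docker exec ark-backend arkindex migrate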
Manual deduplication
The database migrations add a new unique constraint ensuring that only one TranscriptionEntity with a given entity type can be declared at the same position, on the same transcription, and with the same WorkerRun. A constraint previously required unique entities, but it still allowed multiple distinct entities of the same entity type at the same position.
If multiple entities of the same entity type exist in those conditions, the documents.0030_drop_entity migration can fail with the following error:
django.db.utils.IntegrityError: could not create unique index "unique_transcription_entity"
DETAIL: Key (transcription_id, type_id, "offset", length, worker_run_id)=(a07c6e1f-097b-401a-9041-925415df1b5d, d498d6c9-0763-4130-95eb-3d4bda72ef43, 100, 4, null) is duplicated.
CONTEXT: parallel worker
Manual intervention on the database, for example through arkindex dbshell, will be necessary to deduplicate the entities, as Arkindex cannot assess on its own whether this data is vital to a project or not.
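A database shell can be opened in the running container with the standard Django dbshell command; the -it flags are needed to get an interactive session:
docker exec -it ark-backend arkindex dbshell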
To list every element ID for which there are duplicate entities that require attention, you can use the following query:
SELECT DISTINCT any_value(element_id)
FROM documents_transcriptionentity te
INNER JOIN documents_transcription t ON t.id = te.transcription_id
GROUP BY transcription_id, type_id, "offset", length, te.worker_run_id
HAVING COUNT(*) > 1;
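To inspect the duplicated rows themselves before deciding what to delete, you can group on the same columns as the new constraint; this is a sketch based on the key reported in the error above:
SELECT transcription_id, type_id, "offset", length, worker_run_id, COUNT(*)
FROM documents_transcriptionentity
GROUP BY transcription_id, type_id, "offset", length, worker_run_id
HAVING COUNT(*) > 1;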
If none of the duplicates are of any importance, you can deduplicate across the whole database with the following query:
DELETE FROM documents_transcriptionentity
WHERE id IN (
SELECT duplicates.id FROM (
SELECT
id, row_number() OVER (
PARTITION BY transcription_id, type_id, "offset", length, worker_run_id
ORDER BY id
) AS position
FROM documents_transcriptionentity
) AS duplicates
WHERE duplicates.position > 1
);
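If you would rather check how many rows the deduplication removes before making it permanent, a possible approach (not part of the release notes) is to run the same statement inside a transaction and only commit once the reported row count looks reasonable:
BEGIN;
-- same DELETE statement as above; psql reports the number of deleted rows
DELETE FROM documents_transcriptionentity
WHERE id IN (
    SELECT duplicates.id FROM (
        SELECT
            id, row_number() OVER (
                PARTITION BY transcription_id, type_id, "offset", length, worker_run_id
                ORDER BY id
            ) AS position
        FROM documents_transcriptionentity
    ) AS duplicates
    WHERE duplicates.position > 1
);
-- run COMMIT; to keep the deletion, or ROLLBACK; to undo it
ROLLBACK;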
Full reindexation required
It was possible to search through entities using the Solr search feature. Breaking changes had to be made to the search index, and a full reindexation will be required. This can be run with the arkindex reindex --all --drop command.
Until this reindexation is executed, search results will not include entities, and the search API will not be able to return any facets at all, not only the facets related to entities.
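Following the pattern of the other commands in this release, the reindexation can presumably be run inside the backend container like this:
docker exec ark-backend arkindex reindex --all --drop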