Blog

From Database to Knowledge Graph: The Septuagint Catalogue and Wikidata

Jonathan Groß
March 25, 2024


In the digital age, research projects are not only able but also required to make parts of their research data available to the public in an accessible way. For this reason, the first phase of the Psalter project (2020–2023) focused on creating and publishing a descriptive catalogue of the Septuagint manuscripts. This catalogue is available online and proves useful not only to researchers but to anyone interested in the history of the Greek Bible.

However, the applications of the database go beyond simple information seeking. One example is the sharing of select data with other databases. Of course, Pinakes – the most comprehensive database for Greek manuscripts – is the most important partner in this regard, and matching between the two databases has already begun. But another project has also started ingesting data from the Septuagint Catalogue: Wikidata, the database which ties together Wikipedia and its sister projects.

Wikidata is one of the more recent projects in the Wikimedia family. Started in 2012 as a database to connect the many language editions of Wikipedia as well as its sister projects (Wikisource, Wikimedia Commons, Wiktionary etc.), it has since evolved into what is known as a knowledge graph: a database of interconnected items which are described through meaningful statements. Like Wikipedia, Wikidata is run by volunteers, with technical support from Wikimedia Deutschland, the German chapter of the international Wikimedia movement.

Wikidata aims at describing real-world concepts (such as persons, locations, works of art, publications, historical objects) in a way that is both comprehensible to humans and interpretable by machines. Within Wikidata, these concepts are represented through dedicated items which have a descriptive ‘fingerprint’ (consisting of labels, descriptions and aliases in different languages) as well as what are called ‘statements’. These statements are stored in an RDF triple format (subject–predicate–object) where the subject is the main item in question, the predicate expresses a quality of this item (through an entity called a ‘property’) and the object is a value with a specific data type (another item, a date, a string of text, a quantity, an external identifier etc.). Each statement can have multiple values, which can be further differentiated using qualifiers (stating, for example, the time when a statement was true), references to support the statement, and a three-tier ranking system for the reliability of a value. Besides the default ‘normal’ rank, a ‘preferred’ or ‘deprecated’ rank may be assigned to values which are considered to be most or least accurate. Instead of a concrete value, a statement can also record ‘no value’ (if none exists) or an ‘unknown value’.

All these terms can be reviewed in the diagram below:

Data Model of Wikidata

"Datamodel in Wikidata.svg" (by Charlie Kritschmar, CC0 1.0)
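To make the triple structure concrete, here is a minimal sketch using the well-known documentation example of Douglas Adams (item Q42): the statement ‘Q42 – instance of (P31) – human (Q5)’ can be checked directly against the query service with an `ASK` query, which simply returns whether the triple exists:

```sparql
# Subject:   wd:Q42  (Douglas Adams, the standard Wikidata example item)
# Predicate: wdt:P31 (instance of)
# Object:    wd:Q5   (human)
ASK { wd:Q42 wdt:P31 wd:Q5 . }
```

Pasted into https://query.wikidata.org/, this query returns `true`.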

Statements in Wikidata are accessible in a number of ways. The most intuitive is the internal search function (currently CirrusSearch), but Wikidata is also indexed by search engines and can be accessed through them. The most reliable way to produce meaningful results, however, is by using SPARQL queries, which access Wikidata content directly and yield results tailored to specific questions. For example, if you want to know how many historical manuscripts have Wikidata items, a simple query for “instance of:manuscript” returns a list of more than 127,000 items.1 If you would like to know how many of these manuscripts have a Diktyon number assigned to them on Wikidata (“instance of:manuscript” AND “Diktyon number”), a query for this combination yields 2,891 items.2

Queries like these can be built with ease using the Wikidata query builder.3 However, to take full advantage of the diverse features of the query service, one needs to learn SPARQL.
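The two queries mentioned above might be sketched as follows. Q87167 is, to my knowledge, the Wikidata item for ‘manuscript’; rather than hard-coding the property ID of the Diktyon number (which should be verified on Wikidata), the second query looks the property up by an assumed English label:

```sparql
# Query 1: count all items that are instances of "manuscript" (Q87167)
SELECT (COUNT(DISTINCT ?ms) AS ?manuscripts) WHERE {
  ?ms wdt:P31 wd:Q87167 .
}
```

```sparql
# Query 2: of those, count the items that also carry a Diktyon number.
# The property is found via its English label ("Diktyon ID" is an
# assumption; adjust if the actual label differs).
SELECT (COUNT(DISTINCT ?ms) AS ?withDiktyon) WHERE {
  ?ms wdt:P31 wd:Q87167 ;
      ?diktyonClaim ?diktyon .
  ?prop wikibase:directClaim ?diktyonClaim ;
        rdfs:label "Diktyon ID"@en .
}
```

Both can be run as-is at https://query.wikidata.org/; the `wd:`, `wdt:` and `wikibase:` prefixes are predeclared there.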

With regard to historical manuscripts, a number of libraries and archival institutions have initiated projects involving Wikidata. From April 2015 to March 2016, the Bodleian Library employed a ‘Wikimedian in Residence’ (Dr. Martin Poulter) to work with the library staff. As a result, many Bodleian manuscripts now feature on Wikidata with curated data coming directly from the library. More recently (September 2023), a project for historical manuscripts was initiated by members of the Wikidata community. The WikiProject Manuscripts4 aims at further developing the data model for manuscripts, enhancing data quality for manuscript items and gathering data from additional resources. To facilitate data exchange, members of the community reached out to research projects and database hosts. One early test case was the import of the Rahlfs Catalogue of Septuagint manuscripts, which was overseen by the author of this blogpost.

As a first step, I matched the manuscript items in the Catalogue with those on Wikidata. The Septuagint Catalogue comprises more than 2,300 manuscripts, about 1,600 of which are already published. A list of these, featuring their canonical Rahlfs numbers as well as Diktyon and Trismegistos IDs, was provided by the Psalter project and matched to Wikidata with OpenRefine. It turned out that 69 Septuagint manuscripts already had Wikidata items. This number does not include the baker’s dozen of smaller papyrus remnants which have a Wikipedia article (and consequently a Wikidata item) but are not yet published online in the Septuagint Catalogue. For the remaining 1,530 of the 1,599 manuscripts (published at the time), new Wikidata items were created. OpenRefine allows for automated item creation with basic statements: each item was created with a label and a generic description (“Greek manuscript of the Septuagint”) in English, German, French and Italian, as well as basic statements like “instance of:manuscript” and “language of work or name:Ancient Greek”, and, most importantly, the external identifiers for Pinakes (the Diktyon numbers), Trismegistos and the Rahlfs Catalogue. The Rahlfs numbers were integrated in two ways: first with the statement “catalog code:[Rahlfs number]” qualified with “catalog:Rahlfs catalog”, and then with a custom identifier statement “Rahlfs catalogue” which links directly to the online Catalogue and also serves as a reference for some statements in the Wikidata item.
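As a rough sanity check on such an import, the newly created items can be retrieved via the generic English description they share. This is a heuristic sketch (descriptions are not stable identifiers, and the manuscript item Q87167 is my reading of the data model), but it illustrates how the batch becomes queryable immediately:

```sparql
# Items classified as manuscripts whose English description matches
# the generic text used during the OpenRefine import
SELECT ?ms WHERE {
  ?ms wdt:P31 wd:Q87167 ;
      schema:description "Greek manuscript of the Septuagint"@en .
}
```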

You would be right to ask: Cui bono? What is the use of creating a bunch of redundant items with limited information in Wikidata? There are many possible answers, but I will just outline two.

On the one hand, the Wikimedia projects profit directly from this matching. They receive quality data to build upon and reuse. One practical application is the inclusion of the Septuagint Catalogue in the “External links” section of Wikipedia articles on individual manuscripts (or in the list of Septuagint manuscripts, which currently exists in six languages: German, English, Spanish, Polish, Russian and Indonesian). Many Septuagint manuscripts have had Wikipedia articles for a long time, especially the more prominent ones like the Joshua Scroll (Ra 661), the Codex Vaticanus (Ra B), the Khludov Psalter (Ra 1101) or the famous Codex Sinaiticus (Ra S) – the German article on which is currently being rewritten for the 40th instalment of the German Wikipedia Schreibwettbewerb. Even lesser-known exemplars like Papyrus Chester Beatty V (Ra 962) have received attention.5 Because the Rahlfs numbers for these manuscripts are now recorded on the Wikidata items connected to these Wikipedia articles, they can be pulled from Wikidata into Wikipedia. I created templates doing just that for the three Wikipedia editions with the most articles on individual manuscripts. Hence, since October 2023, the Rahlfs Catalogue has been prominently shown in the “External links” section of 96 Wikipedia articles (43 in German, 31 in English and 22 in Polish).
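The templates work because every Wikipedia article is linked to exactly one Wikidata item, from which identifier statements can be read. A query in the same spirit (a sketch, not the template code itself, and again assuming Q87167 for ‘manuscript’) lists the German Wikipedia articles whose items are manuscripts – the candidate pages for such an “External links” template:

```sparql
# German Wikipedia articles about items classified as manuscripts;
# schema:about and schema:isPartOf are the standard sitelink patterns
# of the Wikidata Query Service
SELECT ?article ?ms WHERE {
  ?ms wdt:P31 wd:Q87167 .
  ?article schema:about ?ms ;
           schema:isPartOf <https://de.wikipedia.org/> .
}
```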

On the other hand, the matching between Wikidata and other databases provides a learning opportunity for the Digital Humanities. Codicology has made tremendous advancements in the 20th and 21st centuries, and Data Science has yet to tackle the challenge of turning the research consensus (if one can be found) into reliable data models. To give just one example: There is no agreement on what a manuscript actually is.6 The Pinakes database is based on library catalogues and defines a manuscript as a bound volume (or fragment of one) as housed in a holding institution. The Rahlfs Catalogue, on the other hand, conceives of manuscripts as historical objects which sometimes have to be reconstructed (especially in the case of dismembered manuscripts like Ra S). This discrepancy becomes visible when comparing the IDs in the catalogues, which is one of the results of the Wikidata matching. There are quite a few Septuagint manuscripts which have several Diktyon numbers but only one Rahlfs number. Perhaps the most extreme case is Ra 456, which has 21 Diktyon numbers assigned to it. These are related to the various fragments of the manuscript housed in five different institutions across three countries (the bulk is in Messina, Biblioteca Regionale Universitaria “Giacomo Longo”). This discrepancy is a consequence of competing concepts of what constitutes a “manuscript”. Basically, the 21 scraps recorded in Pinakes are “manuscript fragments” (or “severed” viz. “defective codicological units”, according to Gumbert) while Ra 456 is a “dismembered codex”. Ideally, in Wikidata, each of the 21 fragments would receive a dedicated item with a statement that they are “part of:Ra 456”, while the item for Ra 456 would link to these fragments using the property “has part(s)”.
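The ideal modelling described above could then be queried from either direction. In this sketch, `wd:QXXXXX` is a placeholder for the (here hypothetical) item representing the dismembered codex Ra 456; “part of” (P361) and “has part(s)” (P527) are the established Wikidata properties for part–whole relations:

```sparql
# Fragments of a dismembered codex, linked in both directions:
# the codex "has part(s)" (P527) each fragment, and each fragment
# is "part of" (P361) the codex. wd:QXXXXX is a placeholder item.
SELECT DISTINCT ?fragment WHERE {
  { wd:QXXXXX wdt:P527 ?fragment . }
  UNION
  { ?fragment wdt:P361 wd:QXXXXX . }
}
```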

This, however, is only part of the solution. A clear-cut data model for manuscripts, their constituent parts and their properties, which at the same time treats them seriously as historical objects, is still a long way down the road.

2 https://w.wiki/9WrD

3 https://query.wikidata.org/querybuilder/

4 https://www.wikidata.org/wiki/Wikidata:WikiProject_Manuscripts

5 https://de.wikipedia.org/wiki/Papyrus_Chester_Beatty_V

6 For a survey of the field and a critique of the different approaches towards describing manuscripts, see Patrick Andrist, Paul Canart, Marilena Maniaci: The Syntax of the Codex: Towards a Structural Codicology, Turnhout: Brepols (forthcoming – an updated English edition of their 2013 monograph La syntaxe du codex); Patrick Andrist and Marilena Maniaci, "The Codex’s Contents: Attempt at a Codicological Approach", in: Jörg B. Quenzer (ed.), Exploring Written Artefacts. Objects, Methods and Concepts. Vol. 1 (Studies in Manuscript Cultures 25), Berlin/Boston: de Gruyter 2021, 1–19.