Aldrovandi

Context

One of the main manuscript collections held by the University of Bologna is that of Ulisse Aldrovandi, a renowned 16th-century naturalist who is considered the father of natural history studies.
Prof. Monica Azzolini, one of the main experts on the Aldrovandi manuscripts at the University of Bologna, expressed the need for automatic transcription in order to speed up the ongoing Edizione Nazionale Aldrovandi project. In particular, she asked for automatic transcription of Aldrovandi’s own handwriting. Since no annotated transcriptions exist for those documents, we opted instead to train two models on two other hands that appear frequently in the collection.
Aldrovandi’s manuscripts are being progressively uploaded to AMS Historica, the university's digital library, using IIIF, while a census of the whole collection of manuscripts is available on Manus Online. In 2001, a web portal containing the transcription of some letters was published. It offers only plain-text (txt) transcriptions, with no direct comparison against the original page.
Taking all these factors into account, we decided to create two models based on the hands of two different copyists, selecting those for which the most data was available. We also developed a prototype digital library using the "sites" feature of Transkribus.

Workflow

The images of Calzolari’s letters were downloaded through requests to the IIIF Image API, using the IIIF manifest stored in AMS Historica and setting the image quality to maximum. For Manuscript 99, which is still being digitized by an external company, we requested the images directly from the person in charge of digitization at the University of Bologna. To obtain good samples, we excluded pages that were slightly damaged or showed ink bleed. We also excluded pages with complex layouts and pages whose transcription was missing many words.
In our case, the ground truth required as input by the supervised machine-learning model PyLaia HTR consists of transcriptions made by experts in the field. The transcriptions used here were already available on the web; both expand the abbreviations that appear in abbreviated form in the manuscript.
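The image requests described above can be sketched as follows. This is only an illustration under stated assumptions: it assumes the manifest follows the IIIF Presentation API 2 layout (`sequences` > `canvases` > `images` > `resource` > `service`), which the text does not confirm for AMS Historica. The function name `image_urls_from_manifest` and the `example.org` addresses are hypothetical. The code only builds Image API URLs and does not download anything.

```python
def image_urls_from_manifest(manifest: dict, size: str = "max", fmt: str = "jpg") -> list[str]:
    """Build one full-region IIIF Image API URL per page image in a manifest.

    Assumes a Presentation API 2 style manifest (an assumption, see lead-in).
    The URL pattern is {base}/{region}/{size}/{rotation}/{quality}.{format}.
    The size keyword "max" is defined in Image API 2.1 and 3; servers that
    only implement Image API 2.0 expect "full" instead.
    """
    urls = []
    for sequence in manifest.get("sequences", []):
        for canvas in sequence.get("canvases", []):
            for image in canvas.get("images", []):
                service = image["resource"]["service"]
                # Some manifests wrap the image service in a list.
                if isinstance(service, list):
                    service = service[0]
                base = service["@id"].rstrip("/")
                urls.append(f"{base}/full/{size}/0/default.{fmt}")
    return urls


# Hypothetical two-page manifest, used only to show the expected structure.
sample_manifest = {
    "sequences": [{
        "canvases": [
            {"images": [{"resource": {"service": {"@id": "https://example.org/iiif/page-26r"}}}]},
            {"images": [{"resource": {"service": [{"@id": "https://example.org/iiif/page-26v/"}]}}]},
        ]
    }]
}

for url in image_urls_from_manifest(sample_manifest):
    print(url)
```

Each resulting URL can then be fetched with any HTTP client. Requesting the full region at the largest available size corresponds to the "maximum quality" setting mentioned above.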
Considering all of these aspects, we defined which goals the model should achieve:

automatically recognize text and layout;
automatically recognize abbreviations and the corresponding word(s).

In order to achieve an automatic recognition of text and layout, we relied on previously trained models:

For all the documents we used the automatic layout recognizer “Universal Lines”, which gave us a good starting point even though it needed some manual corrections. In particular, we noticed that lines were not correctly recognized in the “signature” part of the letters, and that the detected page layout did not include page numbers or cross-reference marks.

After preparing our documents, we trained the models using the existing “Italian Administrative Hands 1550-1700” model as a base and tuning some parameters through the additional features of Transkribus Expert. In particular, we noticed that keeping the standard Transkribus batch size (24) was not effective at all, so for all training runs we reduced it to 4 and saw a great improvement in performance. We also reduced the number of epochs to 100, because we were working with a small number of words and our models were already building on a base model.

The graphs show the learning curves for the training and validation sets, that is, how the Character Error Rate (CER) decreases over the epochs. Generally speaking, a model with a CER above 10% produces transcriptions that require many manual corrections. For this reason, a trial-and-error approach to adjusting the parameter values was fundamental to reaching these results, which are the best we could achieve given our dataset. As the graphs show, in the generated models the validation CER curve is slightly higher than the training curve. This may indicate overfitting: with the reduced batch size, the model may have closely memorized the training data while struggling to generalize to unseen data. To prevent this, a larger dataset is needed, along with further parameter tuning.
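For readers unfamiliar with the metric, CER is commonly defined as the character-level edit distance between the predicted text and the ground truth, divided by the length of the ground truth. The sketch below follows that common definition. It is an assumption that this matches exactly how Transkribus computes its figures, and the function names and the sample strings are illustrative only.

```python
def levenshtein(a: str, b: str) -> int:
    """Minimum number of character insertions, deletions and substitutions turning a into b."""
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            substitution_cost = 0 if char_a == char_b else 1
            current.append(min(
                previous[j] + 1,                       # delete char_a
                current[j - 1] + 1,                    # insert char_b
                previous[j - 1] + substitution_cost,   # substitute or keep
            ))
        previous = current
    return previous[-1]


def cer(ground_truth: str, predicted: str) -> float:
    """Character Error Rate: edit distance divided by ground-truth length."""
    if not ground_truth:
        raise ValueError("ground truth must not be empty")
    return levenshtein(ground_truth, predicted) / len(ground_truth)


# Illustrative strings, not taken from the manuscripts.
truth = "lettera"
prediction = "letera"
print(f"CER: {cer(truth, prediction):.2%}")
```

Under this definition, a CER of 10% corresponds to roughly one character in ten needing correction, which is why the text treats that value as a practical threshold.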

BiancolinusHandwriting

The first model was trained on the handwriting of Andrea Biancolino, one of Aldrovandi’s most prolific copyists and one of the few who signed some documents. We used Manuscript 99, composed of 90 pages, as ground truth because it was the only one that already had a complete transcription. The manuscript is mainly written in Italian, but some parts, mainly quotes, are written in Latin; as a consequence, the resulting model ideally transcribes both languages. Moreover, thanks to Manus Online we identified the other manuscripts that Biancolino copied in Italian and Latin, and we estimate that the model could be reused to transcribe seven more manuscripts. The Character Error Rate (CER) value for this model is 7.50% with a training set size of 9674 words.
The image shows the comparison between the ground truth transcription and the model’s prediction, which yielded a CER of 11.26%.

Comparison between the model’s prediction and the ground truth (p.88).

Both the comparison and the test on unseen text showed that the model performs poorly on abbreviations and, in particular, on words written without whitespace between them. We also noticed a few minor errors that occurred frequently: initial capital letters rendered as lowercase, and missing punctuation, in particular accent marks and apostrophes.

How the model behaves on unseen text: a page taken from Ms.139.
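The recurring error types noted above (lowercased capitals, missing accents, missing punctuation, merged words) could be counted automatically by aligning a predicted line with its ground truth. The following is a rough sketch using Python's standard `difflib` module, not the procedure we actually used. The category names and the function `classify_errors` are hypothetical, and the sketch assumes both strings use precomposed (NFC) accented characters.

```python
import difflib
import unicodedata
from collections import Counter


def strip_accents(text: str) -> str:
    """Remove combining accent marks, e.g. 'à' becomes 'a'."""
    decomposed = unicodedata.normalize("NFD", text)
    return "".join(c for c in decomposed if unicodedata.category(c) != "Mn")


def classify_errors(ground_truth: str, predicted: str) -> Counter:
    """Count differences between two strings by (hypothetical) error category."""
    counts = Counter()
    matcher = difflib.SequenceMatcher(None, ground_truth, predicted, autojunk=False)
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag == "equal":
            continue
        ref, hyp = ground_truth[i1:i2], predicted[j1:j2]
        if tag == "replace" and ref.lower() == hyp.lower():
            counts["case"] += 1
        elif tag == "replace" and strip_accents(ref) == hyp:
            counts["missing_accent"] += 1
        elif tag == "delete" and all(unicodedata.category(c).startswith("P") for c in ref):
            counts["missing_punctuation"] += 1
        elif tag == "delete" and ref.isspace():
            counts["missing_space"] += 1
        else:
            counts["other"] += 1
    return counts


# Illustrative pairs, not taken from the manuscripts.
print(classify_errors("Verona", "verona"))
print(classify_errors("de la", "dela"))
```

Such counts would only approximate a manual review, since `difflib` alignments are heuristic and longer lines can be aligned in more than one way.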

CalzolariHandwriting

The second model was trained on the handwriting of another copyist, who is anonymous. We made this choice because a substantial number of already transcribed pages from Francesco Calzolari's letters were available on the “Teatro della Natura di Ulisse Aldrovandi” portal. The manuscript examined is 38/2.3, available in IIIF format on AMS Historica. We focused on the correspondence at 26r-72v (Verona, 1554), which contains Calzolari's letters. However, since the name of the copyist is unknown in this case, we cannot determine to what extent this model can be reused. The Character Error Rate (CER) value for this model is 16% with a training set size of 10565 words.

We also used Transkribus Expert to compare the model's prediction with the ground truth, and the CER was 18.10%. The main errors appear similar to those of the first model: abbreviations are not correctly recognized, and prepositions and words written without whitespace between them are sometimes not separated by the model. In general, this model can be considered a good starting point for a transcription, although it needs further manual refinement to reach a fully correct result.

Comparison between the model’s prediction and the ground truth (p.43).

DECIPHERING
ALDROVANDI’S COPYISTS
HANDWRITING USING TRANSKRIBUS

Context

Documents

Ms.99

Calzolari's letters

Workflow

BiancolinusHandwriting

CalzolariHandwriting

Transkribus sites

Conclusion