ICT Projects - Voynich

The Voynich Manuscript ‒ considered to be “the most mysterious manuscript in the world” [1] [2] ‒ has been subject to intense efforts to decode its mysterious text. To date, all such attempts have been unsuccessful.

The manuscript is believed to be divided into six major categories, and this supposition is based on the illustrations on the pages that act as indicators of the presumed topics. These have been organised into the following categories: herbal, astronomical, biological, cosmological, pharmaceutical, and recipes [1].

Figure 1. A sample page from the original manuscript

The goal of this research is to attempt to examine whether the Voynich Manuscript is an elaborate hoax or indeed an unknown language. If the manuscript is the ‘real deal’, ‒ as suggested by Reddy and Knight [3] that it is written in a real, but unknown language – then, it would be legitimate to surmise that, as a rule, the text within the respective categories is more semantically related within a given category, than is the text across different categories.

To test the above hypothesis, this study adopted a statistical approach in analysing the manuscript. The research has been performed using the Extensible Voynich Alphabet (EVA) ‒ a transcription system that is digitally available to the public. Figure 2 is a snippet of the dataset using EVA and is a transliteration of the original text, a sample of which is shown in Figure 1. Since this dataset is an ongoing work and contains terms that have more than one interpretative reading, pre-processing techniques have been adopted to minimise any ambiguities. Various machine learning algorithms, including classification and clustering tools such as: support-vector machine (SVM), k-nearest neighbours ( kNN), multinomial naïve Bayes, k-means and No- k-means have been utilised. The most reliable results were achieved when text from each category was extracted by page rather than on a line-by-line basis. This is in line with expectations, because a page of text offers a wider context than a line, and when seeking to assign a page or a line to a specific topic, the former would be easier to assign, since more words would be present and can be related to a specific topic [4]. Through different metrics and diagrams, these algorithms show clear and expressive results. Similar distributions over lexical items were found among the categories, and less so between the different categories, which provides evidence of an underlying lexico-semantic categorisation that is in line with the categorisation indicated by the illustrations. This contributes towards supporting the hypothesis of the existence of a genuine message within the manuscript.

Figure 2. Dataset snippet based on an EVA transcription of the sample page in Figure 1

References/Bibliography:

[1] R. Clemens and D. Harkness, The Voynich Manuscript, New Haven, Conn./London: Yale University Press, 2016.

[2] S. Skinner, R. T. Prinke and R. Zandbergen, The Voynich Manuscript: The World’s Most Mysterious and Esoteric Codex, Watkins Publishing, 2017.

[3] S. Reddy and K. Knight, “What we know about the Voynich manuscript,” in In Proceedings of the 5th ACL-HLT Workshop on Language Technology for Cultural Heritage, Social Sciences, and Humanities, Chicago, 2011.

[4] C. C. Aggarwal

Student: Adriana Camilleri
Course: B.Sc. IT (Hons.) Software Development
Supervisor: Dr. Colin Layfield
Co-supervisor: Dr. Lonneke van der Plas