Hoaxing the Voynich Manuscript, part 6: Planning the word structure

By Gordon Rugg

In this series of articles, we’re imagining that you’ve gone back in time, and that you want to produce the Voynich Manuscript as a hoax to make money. We’re looking at the problems and decisions you’d face, and at the implications of various possible solutions.

The first article looked at why a mysterious manuscript would be a good choice of item to hoax. The second article looked at some of the problems involved in hoaxing a text that looked like an unknown language, from the linguistic viewpoint. The third examined the same subject in more depth, and the fourth discussed the choice of materials, going into some detail about the choice between using freshly-made or already-old vellum. The fifth was about the layout, structure and contents of the book.

This article is about how to create a plausible-looking structure for the individual words in the text that you’re going to produce. We’ll look at the choice of script, and how to combine the words, in later articles.

A key word in the paragraph above is “plausible”. If you’re a hoaxer focused on getting the end product out as quickly as possible, then you don’t need to produce a perfect imitation of a language; you just need to produce something that’s good enough to be plausible, even if it’s not perfect.

Codebreakers have known since the tenth century that particular letters occur with regular frequencies within a given language. Counting those frequencies is a standard way of cracking the simplest codes, where each letter of the plaintext is systematically swapped for a different letter.

Codebreakers were also aware by the fifteenth century that, within a given language, the lengths of words vary, and some syllables and some words are more common than others. So, if you were a hoaxer in the past, trying to produce something that looked like a language, you’d want some way of producing syllables and words of different lengths.

Any educated hoaxer in the past would almost certainly be familiar with Latin and with traditional grammar, where there’s a standard way of dividing words into three parts, namely a prefix, root and suffix. In this system, a root can stand alone as a word in its own right, like the word do in English, or it can have just a prefix, such as un, giving us undo, or a suffix such as ing, so we get doing, or both a prefix and a suffix, so we get undoing.

You could use this approach to produce words that looked superficially like English, by putting together a “pick and mix” table along the following lines.

[Image: table of English syllables]

This table includes short, medium and long examples of each type of syllable (prefix, root and suffix).

By combining different syllables from the table, we can produce gibberish English-looking words ranging in length from 6 letters (redoer) to 12 letters (omnisingable).
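The pick-and-mix idea can be sketched in a few lines of Python. This is only an illustration: the table below is hypothetical. The syllables re, do, er, omni, sing and able are implied by the two example words, but the medium-length entries (pre, run, ing) are stand-ins chosen for this sketch, and the function names are made up.

```python
import random

# Hypothetical pick-and-mix table. Only "re", "do", "er", "omni", "sing" and
# "able" are implied by the example words; the middle entries are assumptions.
PREFIXES = ["re", "pre", "omni"]
ROOTS = ["do", "run", "sing"]
SUFFIXES = ["er", "ing", "able"]


def make_word(rng):
    """Pick one prefix, one root and one suffix at random and join them."""
    return rng.choice(PREFIXES) + rng.choice(ROOTS) + rng.choice(SUFFIXES)


def all_words():
    """List every prefix + root + suffix combination the table allows."""
    return [p + r + s for p in PREFIXES for r in ROOTS for s in SUFFIXES]


rng = random.Random(0)  # fixed seed so repeated runs give the same words
print([make_word(rng) for _ in range(5)])
print(min(all_words(), key=len), max(all_words(), key=len))
```

With this particular table, the shortest and longest words are the two quoted above; a different choice of stand-in syllables would change the words in between.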

That example used already-existing English syllables. If you’re generating gibberish syllables from scratch, then it’s easy to slip into the trap of generating them by systematically adding a letter at a time. It’s plausible on a small scale – in English, for instance, there are the prefixes a, an and ana, all relating to an absence of something. However, if you do it on a large scale, and too systematically, then there’s a risk of looking too mechanical.

In Voynichese, there are hints of this effect. Here are some Voynichese prefixes: o, ol, olo, and o, or, oro. Some Voynichese roots: k, ke, kee, and t, te, tee. Some Voynichese suffixes: y, dy and ldy.

Here’s an example, using the normal alphabet, of what a small Voynichese table produced in this way would look like.

[Image: table of Voynichese syllables]

As with the pseudo-English example, this would give you words of varying length, depending on which combination of prefix, root and suffix you chose.

One interesting side-effect of this method is a regularity in the distribution of the lengths of the words that it produces. There is only one combination that will give you the shortest word, oky. There’s only one combination that will give you the longest word, olokeeldy. There are various combinations that will give you the intermediate-length words. If you plot those combinations out as a table, you see this pattern. The numbers in the bottom row show the word lengths; each column shows all the combinations that will produce that word length.

[Image: table of syllable combinations by word length]
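The counting argument can be checked with a short Python sketch. It assumes a three-by-three table built from the syllables mentioned in the text (o, ol, olo; k, ke, kee; y, dy, ldy); the original table image may have differed in detail, and the function name is just for illustration.

```python
from collections import Counter
from itertools import product

# Assumed table, reconstructed from the syllables quoted in the text.
PREFIXES = ["o", "ol", "olo"]
ROOTS = ["k", "ke", "kee"]
SUFFIXES = ["y", "dy", "ldy"]


def length_counts(prefixes, roots, suffixes):
    """Count how many prefix + root + suffix combinations give each length."""
    return Counter(
        len(p + r + s) for p, r, s in product(prefixes, roots, suffixes)
    )


counts = length_counts(PREFIXES, ROOTS, SUFFIXES)
for length in sorted(counts):
    # One '#' per combination, so the printout is a sideways histogram.
    print(length, "#" * counts[length])
# Lengths 3 to 9 give 1, 3, 6, 7, 6, 3, 1 combinations respectively.
```

The hump in the middle appears without anyone designing it: there are simply more ways to make a medium-length word than a very short or very long one.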

If that shape looks familiar, that’s because it is. It’s the start of a common statistical distribution. If we added more words to the sample, we would end up with a binomial distribution, very similar to the distribution that you see if you plot the lengths of Voynichese words. However, it’s a rare pattern if you plot the lengths of words from real languages. They’re usually skewed to one side.

It was widely argued in the past that this distinctive distribution of Voynichese word lengths would be beyond the knowledge of hoaxers in the fifteenth or sixteenth century. However, the illustrations above show how this distribution could easily arise as a completely unintended side-effect of one low-tech way of generating words of different lengths. It’s not the only case where statistical regularities emerge as unintended side-effects; we’ll encounter more in the article on using the table and grille technique.

Other issues

This gives you a quick and easy way of producing text that looks superficially like a real language. The text has a syllable structure, like a real language, with prefixes, roots and suffixes. There are syllables of different lengths, as in a real language. The syllables combine to produce words of different lengths, as in a real language.

So far, so good.

If you want to produce something that looks a lot like a language, though, you need to add some refinements. For instance, a real language will typically have a lot more roots than either prefixes or suffixes, so you’d want to incorporate a way of imitating that feature. An easy way would be simply to invent a lot of root syllables.

There are various other features that aren’t essential, but that would make the text look a lot more like a real language. For instance, you could easily mimic the feature of vowel harmony, which occurs in various languages; in these languages, there are restrictions on which vowels can occur within the same word. You could imitate this by using a table like the one below, where syllables containing an “o” are written in red, and syllables containing an “a” are written in blue.

[Image: vowel harmony table]

With this table, you can use a rule of only allowing two colours within a word (the ordinary ink plus either red or blue, but never red and blue together), so that oky or okedor would be fine, and ykedar would also be fine, but you’d never get okedar or olkdar. That would look a lot like a sophisticated complex regularity, but it would actually just require swapping ink colours a couple of times when you filled in your table.
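Here is a rough Python sketch of that colour rule. The table is hypothetical, pieced together from the example words (oky, okedor, ykedar, okedar, olkdar), and the label "plain" for syllables with neither vowel is an assumption of this sketch.

```python
from itertools import product

# Hypothetical table, inferred from the example words in the text.
PREFIXES = ["o", "ol", "y"]
ROOTS = ["k", "ke"]
SUFFIXES = ["y", "dor", "dar"]


def colour(syllable):
    """Red for syllables containing 'o', blue for 'a', otherwise plain."""
    if "o" in syllable:
        return "red"
    if "a" in syllable:
        return "blue"
    return "plain"


def is_harmonic(syllables):
    """A word is allowed unless it mixes red and blue syllables."""
    colours = {colour(s) for s in syllables}
    return not {"red", "blue"} <= colours


def harmonic_words():
    """Every prefix + root + suffix combination that obeys the rule."""
    return [
        "".join(parts)
        for parts in product(PREFIXES, ROOTS, SUFFIXES)
        if is_harmonic(parts)
    ]


print(harmonic_words())
```

In this sketch the whole "grammar" is one set comparison, which is the software equivalent of glancing at the ink colours before writing the word down.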

But you don’t see either of these features in Voynichese.

For a start, the number of different root syllables in Voynichese is very low – of the order of a couple of dozen common roots, after which their frequency drops off rapidly. That’s a very low number compared to a real language.

The absence of vowel harmony is understandable – it only occurs in some languages, and most people haven’t heard of it. However, it would be easy to hoax various other features that are more widely known, such as the type of agreement between adjective and noun that you get in Latin, that would make the text look that bit more plausible as a language. You could do that with a variation on the method we’ve just seen, by using the rule that if one word ends in a red suffix, then the next word you generate after it can’t end in a blue suffix. That wouldn’t need much extra effort, but we don’t see that, or anything like it, in Voynichese.
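The agreement rule could be sketched like this in Python. Everything here is illustrative: the syllable table, the suffix colours and the function name are assumptions of this sketch, and it deliberately ignores any within-word rule so that only the word-to-word constraint is shown.

```python
import random

# Hypothetical table; the suffix colours are assumptions for this sketch.
PREFIXES = ["o", "ol", "y"]
ROOTS = ["k", "ke"]
SUFFIXES = {"y": "plain", "dor": "red", "dar": "blue"}


def generate_line(n_words, rng):
    """Generate (word, suffix colour) pairs, never putting a blue-suffix
    word directly after a red-suffix word."""
    line = []
    previous = None
    for _ in range(n_words):
        allowed = [
            suffix
            for suffix, col in SUFFIXES.items()
            if not (previous == "red" and col == "blue")
        ]
        suffix = rng.choice(allowed)
        word = rng.choice(PREFIXES) + rng.choice(ROOTS) + suffix
        line.append((word, SUFFIXES[suffix]))
        previous = SUFFIXES[suffix]
    return line


rng = random.Random(0)  # fixed seed so repeated runs give the same line
print(" ".join(word for word, _ in generate_line(8, rng)))
```

The only extra bookkeeping compared with unconstrained generation is remembering the colour of the previous suffix, which fits the point that the rule wouldn’t need much extra effort.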

It’s because of significant absences such as these that most serious researchers agree that Voynichese isn’t simply an unknown human language. However, if it’s easy to include them in a hoax, and thereby make the output text more plausible, then why didn’t our hypothetical hoaxer do this with the Voynich Manuscript?

I think there are two main explanations, assuming for the moment that the Voynich Manuscript is a meaningless hoax. One is that the hoaxer deliberately made a sophisticated choice not to include them. That looks like an odd decision, but actually it makes a lot of sense. By not including these features, the hoaxer gives critical examiners less to get hold of. The presence of features such as vowel harmony and adjective/noun agreement would narrow down the possibility space, and would thereby increase the likelihood of the manuscript being judged a hoax, because there wouldn’t be enough alternative explanations for any oddities. The absence of such features leaves more possibilities open. It has the further advantage of requiring less work.

Another explanation is that the hoaxer simply didn’t bother. This is consistent with other features of the manuscript, such as the low attention to quality when the illustrations were coloured in, and the absence of corrections in the manuscript. Those point towards someone slapping together a quick and dirty hoax, in a way that ironically produced the same effect as a sophisticated decision.


It’s easy to produce plausible-looking words using a three-syllable structure, with tables of syllables of varying lengths. Although this approach is very simple, it produces some surprisingly complex unintended side-effects, in terms of statistical features in the output.

It would be easy to add refinements to this method to make the result look more realistic, but the text of the Voynich Manuscript doesn’t show these refinements. These absences are also strong arguments against the manuscript simply containing an unidentified language. One possible explanation is that they are deliberate absences used by a hoaxer to conceal a hoax; another is that they are simply the side-effects of a quick and dirty hoax.

In the next articles, we’ll look at the choice of script, the illustrations, the combination of words into lines of text, and the logistics of the hypothetical hoax.


In some of my earlier work, I tried using the term “midfix” for the middle syllable in Voynichese, on the grounds that “root” implied that the “words” in Voynichese were from a real language, rather than just meaningless combinations of meaningless syllables. I’ve gone back to using “root” because it means that I don’t need to go through a discussion about what “midfix” means.

Some languages don’t use the prefix/root/suffix structure, or don’t use it in all their words. For example, in English, there are numerous verbs and a few nouns that are modified by changes in the vowel within the root syllable, as in sing/sang/sung or woman/women.

Some languages also use multiple prefixes and/or suffixes within a single word, sometimes to such an extent that the distinction between word, phrase and even sentence becomes blurred. That’s where the meme about “Eskimos have 50 words for snow” originates.

I’ve discussed these issues and others in my book Blind Spot.


I’m posting this series of articles as a way of bringing together the various pieces of information about the hoax hypothesis, which are currently scattered across several sites.

Quick reassurance for readers with ethical qualms, about whether this will be a tutorial for fraudsters: I’ll only be talking about ways to tackle authenticity tests that were available before 1912, when the Voynich Manuscript appeared. Modern tests are much more difficult to beat, and I won’t be saying anything about them.

All images above are copyleft Hyde & Rugg, unless otherwise stated. You’re welcome to use the copyleft images for any non-commercial purpose, including lectures, provided that you state that they’re copyleft Hyde & Rugg.

  1. One of the more important points in your explanations is how you dispel a common objection to these hoaxing ideas: that certain features seen in the Voynich text, under certain analyses, could not have been known to a forger in the past. Perhaps they did not know of such features, but of course they did know many, and then the others we see today can easily be accounted for as side effects of the production process.

    There would be a parallel in art forgery, where a copy in the style of an artist from the past, done in the past, could pass, and has passed, the tests of modern experts. The right proportion of color, the type of brush, the length and size of stroke, the appropriate use of perspective, may all match. Does that mean the forger needed a spectrometer? An X-ray machine to see underneath? Of course not… in the generation of the painting, many features would naturally follow, as side effects of an attempt to get the overall impression correct. We can’t say, “Because they did not know of the future invention of X-rays, they would not have made the under-strokes correct”… of course they might, while copying those features of the real artist that they did know should exist.

