How It Works
The linguistics behind the nonsense
The Problem With Pure Randomness
A truly random string of characters — say, xqzbt or fkrmpl — is unpronounceable and immediately recognisable as machine-generated. For a word to feel like it could be a real English word, it needs to follow the same phonological rules that real English words follow.
Those rules are not arbitrary. Every language constrains which sounds can appear next to each other, in what positions, and in what combinations. This is called phonotactics. Wurbz applies English phonotactics at every stage of word construction.
Syllable Structure: Onset · Nucleus · Coda
The fundamental unit of English phonology is the syllable. Every English syllable is built from three components:
- Onset — the opening consonant or consonant cluster. Can be empty (as in up), a single consonant (t in top), or a cluster of up to three consonants (str in strength).
- Nucleus — the vowel core of the syllable. Can be a single vowel (a, e, i, o, u) or a vowel digraph (ea, ou, ai, ee).
- Coda — the closing consonant or consonant cluster. Can be empty (open syllable), a single consonant (t, n, d), or a cluster (nd, st, ng). The generator also places common suffixes in this slot: -ing, -tion, -ness, -er.
Each syllable in a generated word is assembled from these three slots independently, then joined together.
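The slot-by-slot assembly can be sketched in a few lines of Python. This is a minimal illustration, not Wurbz's actual code: the inventories `ONSETS`, `NUCLEI` and `CODAS` and the function names are hypothetical, and selection here is uniform for simplicity.

```python
import random

# Hypothetical slot inventories; the generator's real tables are not published here.
ONSETS = ["", "t", "br", "str"]   # "" models an empty onset, as in "up"
NUCLEI = ["a", "e", "ou"]         # single vowels and vowel digraphs
CODAS = ["", "n", "st", "ng"]     # "" models an open syllable


def build_syllable(rng: random.Random) -> str:
    """Pick onset, nucleus and coda independently, then concatenate them."""
    onset = rng.choice(ONSETS)
    nucleus = rng.choice(NUCLEI)
    coda = rng.choice(CODAS)
    return onset + nucleus + coda


def build_word(syllable_count: int, rng: random.Random) -> str:
    """Join independently built syllables into one candidate word."""
    return "".join(build_syllable(rng) for _ in range(syllable_count))
```

Because the nucleus slot is never empty, every syllable built this way contains at least one vowel.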
Weighted Phoneme Frequency
Not all sounds are equally common in English. The generator uses frequency tables where each phoneme or cluster is assigned a weight proportional to how often it appears in real English words.
Common onsets (high weight): s, c, p, t, m, b, r, st, br, tr
Rare onsets (low weight): x, z, qu, wh, str, spr
Common nuclei: a, e, i, o, er, ea, ou
Common codas: -e (silent final-e pattern), -t, -n, -d, -s, -ng, -tion, -er, -ing, -ness
A weighted random selection is made at each slot, so common patterns appear more often while rare ones still occur occasionally — just as in real English vocabulary.
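A weighted pick of this kind might look like the sketch below, which uses Python's standard `random.choices`. The weights in `ONSET_WEIGHTS` are invented for illustration; only the ordering (common onsets heavier than rare ones) comes from the lists above.

```python
import random

# Illustrative weights only: larger numbers mean "more common in English".
ONSET_WEIGHTS = {"s": 10, "t": 9, "st": 6, "br": 5, "str": 1, "x": 1, "qu": 1}


def pick_weighted(table: dict[str, float], rng: random.Random) -> str:
    """Return one key from the table, chosen in proportion to its weight."""
    items = list(table)
    weights = list(table.values())
    return rng.choices(items, weights=weights, k=1)[0]


rng = random.Random()
print(pick_weighted(ONSET_WEIGHTS, rng))  # usually a common onset, occasionally a rare one
```

Keeping rare items at a small but non-zero weight is what lets them surface occasionally instead of disappearing entirely.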
Syllable Count Patterns
Words are built from one of five patterns, each with a weighted probability that reflects the natural distribution of English word lengths:
| Pattern | Example shape | Chance |
|---|---|---|
| Monosyllabic | CVC → Grolt | 25% |
| Disyllabic (CVC + CVC) | Bran · ston | 30% |
| Disyllabic (CV + CVC) | Ve · xmore | 25% |
| Disyllabic + suffix | Crest · er | 15% |
| Trisyllabic | Me · rri · den | 5% |
The three disyllabic patterns together account for 70% of generated words because two-syllable words are so common in everyday English. The trisyllabic pattern is rare by design — longer words are harder to evaluate quickly and less useful as brand names or character names.
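As a rough sketch, the table can be expressed as a weight map and sampled directly. The pattern keys below are hypothetical labels; the percentages are the ones listed in the table.

```python
import random

# Percentages taken from the table; the key names are made up for this sketch.
PATTERNS = {
    "monosyllabic": 25,
    "disyllabic_cvc_cvc": 30,
    "disyllabic_cv_cvc": 25,
    "disyllabic_suffix": 15,
    "trisyllabic": 5,
}


def pick_pattern(rng: random.Random) -> str:
    """Choose a syllable pattern according to the listed probabilities."""
    names = list(PATTERNS)
    weights = list(PATTERNS.values())
    return rng.choices(names, weights=weights, k=1)[0]
```

Over many draws, about one word in twenty would come out trisyllabic.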
A Word Being Built: Step by Step
Here is an example of the trisyllabic pattern producing the word Merriden:
Validation Rules
After a candidate word is assembled, it is tested against a set of hard rules. Words that fail are discarded and a new attempt is made (up to 50 times).
- Must contain at least one vowel. Pure consonant strings are rejected.
- No four or more consecutive consonants. Long clusters like nstr or rltk read as unpronounceable and are rejected.
- No three or more consecutive vowels. Sequences like aou or eea produce awkward results.
- Q must be followed by U. The English spelling rule qu is enforced; a lone q is rejected.
- Length between 3 and 12 characters. Very short words are too common; very long words are unwieldy.
Refinement
Words that pass validation are then refined to remove a few patterns that slip through the statistical model:
- Triple character runs — sequences of three identical characters in a row (e.g., sss) are collapsed to two.
- Malformed Q — any q not already followed by u has a u inserted after it.
- Double-J — jj is reduced to a single j, since this cluster never appears in English.
After refinement, the final word is capitalised and returned to the user.
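The refinement pass lends itself to a few regular-expression substitutions. The sketch below is one possible reading of the rules, with a hypothetical function name `refine`; the order of the steps is an assumption.

```python
import re


def refine(word: str) -> str:
    """Clean up a validated word, then capitalise it."""
    w = word.lower()
    # Triple character runs: collapse three or more identical characters to two.
    w = re.sub(r"(.)\1{2,}", r"\1\1", w)
    # Malformed Q: insert "u" after any "q" not already followed by "u".
    w = re.sub(r"q(?!u)", "qu", w)
    # Double-J: reduce "jj" to a single "j".
    w = w.replace("jj", "j")
    return w.capitalize()


print(refine("busss"))  # Buss
print(refine("qat"))    # Quat
```

Running the triple-run collapse first means a run such as jjj becomes jj and is then reduced to j by the Double-J step.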
Fallback Words
In extremely rare cases — when 50 consecutive generation attempts all fail validation — the generator falls back to a curated list of hand-picked nonsense words. This list was assembled manually to guarantee quality and serves as a safety net, not a primary source. In practice, it is almost never reached.
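The retry-then-fallback flow might be sketched as follows. Everything here is illustrative: `generate`, `make_candidate` and `FALLBACK_WORDS` are hypothetical names, the placeholder list simply reuses example words from the table above rather than the real curated list, and the refinement step is left out to keep the sketch short.

```python
import random
from typing import Callable

# Placeholder entries; the actual curated list is not shown in this document.
FALLBACK_WORDS = ["Grolt", "Branston", "Crester"]


def generate(
    make_candidate: Callable[[random.Random], str],
    is_valid: Callable[[str], bool],
    rng: random.Random,
    max_attempts: int = 50,
) -> str:
    """Try up to max_attempts candidates; fall back to the curated list if all fail."""
    for _ in range(max_attempts):
        candidate = make_candidate(rng)
        if is_valid(candidate):
            return candidate
    return rng.choice(FALLBACK_WORDS)
```

Passing the candidate builder and the validator in as functions keeps the loop itself trivial to test, for instance by supplying a validator that always fails to confirm the fallback path is taken.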