I’ve been studying eSpeak NG in preparation for building a browser around it (and later Julius). eSpeak is a command-line tool and C library for converting text to speech, either writing WAV files or playing the audio through your speakers.
I certainly don’t understand all of eSpeak as it implements processes I’m not conscious of, but this morning I will explain what I do understand of how it works.
The interesting thing about eSpeak is that the codebase is mostly made up of data files, written in their own Domain-Specific Languages, describing how to pronounce words in various languages. eSpeak’s code largely serves to interpret these DSLs and to apply corrections so the output sounds more natural.
The first DSL eSpeak interprets comes from the caller, which may instruct it to interpret XML tags or square-bracketed pronunciations. These operations may optionally be handled asynchronously via a ring buffer.
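To make the ring buffer idea concrete, here is a minimal sketch of a fixed-size queue that a caller could fill with commands while a worker drains them. The class and method names are my own for illustration, not eSpeak's, and this version is not thread-safe; a real asynchronous setup would need locking or a single producer and consumer.

```python
class RingBuffer:
    """Fixed-capacity FIFO queue stored in a circular list (hypothetical sketch)."""

    def __init__(self, capacity):
        self._slots = [None] * capacity
        self._head = 0    # index of the oldest queued item
        self._count = 0   # number of items currently queued

    def push(self, item):
        """Queue an item; return False if the buffer is full."""
        if self._count == len(self._slots):
            return False
        tail = (self._head + self._count) % len(self._slots)
        self._slots[tail] = item
        self._count += 1
        return True

    def pop(self):
        """Remove and return the oldest item, or None if the buffer is empty."""
        if self._count == 0:
            return None
        item = self._slots[self._head]
        self._slots[self._head] = None
        self._head = (self._head + 1) % len(self._slots)
        self._count -= 1
        return item
```

The indices wrap around with the modulo operator, so the storage is reused without ever shifting items.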
The next step is the translation of text into “phonemes”, for which each language typically has three files to interpret. Some characters like emojis translate directly to words, and those are handled via a simple Tab-Separated Values file. Mostly it relies on a file matching certain character sequences within individual words, though it’ll fall back to another TSV file in order to spell out words it can’t pronounce.
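The three-file lookup order described above can be sketched roughly as follows. Everything here is a toy: the table contents, the phoneme spellings and the function names are assumptions of mine, and `apply_rules` is only a stand-in for the real rule-matching file.

```python
# Hypothetical toy data; not eSpeak's real file contents or phoneme names.
SYMBOL_WORDS_TSV = "😀\tgrinning face\n&\tand\n"
LETTER_NAMES_TSV = "x\teks\nq\tkju\nz\tzed\n"


def load_tsv(text):
    """Parse two-column tab-separated text into a dict."""
    table = {}
    for line in text.splitlines():
        if not line.strip():
            continue
        key, value = line.split("\t", 1)
        table[key] = value
    return table


def apply_rules(word):
    """Stand-in for the rule-matching file; returns None when nothing applies."""
    toy_rules = {"cat": "k a t"}
    return toy_rules.get(word)


def pronounce(token, symbols, letters):
    # 1. Symbols such as emoji translate directly to words.
    if token in symbols:
        return symbols[token]
    # 2. Ordinary words go through the pronunciation rules.
    result = apply_rules(token)
    if result is not None:
        return result
    # 3. Otherwise spell the word out letter by letter.
    return " ".join(letters.get(ch, ch) for ch in token)


symbols = load_tsv(SYMBOL_WORDS_TSV)
letters = load_tsv(LETTER_NAMES_TSV)
print(pronounce("😀", symbols, letters))
print(pronounce("cat", symbols, letters))
print(pronounce("xqz", symbols, letters))
```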
The pronunciation rules are indexed by their first character or two and ranked by their specificity. From there these rules can match the current characters (which it “eats”), the following characters (which it doesn’t eat) or the preceding characters (which have already been eaten).
Upon the first successful match it’ll replace those “current” characters with ones more closely representing the pronunciation.
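Here is a small sketch of that matching loop as I understand it. The rule tuples, the toy rules themselves and the scoring of specificity as the total pattern length are all assumptions for illustration; eSpeak's real rule syntax and ranking are richer than this, and I index on the first character only.

```python
from collections import defaultdict

# Hypothetical rules: (preceding, current, following, output).
RULES = [
    ("", "c", "", "k"),
    ("", "c", "e", "s"),    # "c" before "e" sounds like "s"
    ("", "ch", "", "tS"),
    ("", "s", "", "s"),
    ("o", "s", "e", "z"),   # "s" between "o" and "e" sounds like "z"
]


def build_index(rules):
    """Group rules by the first character they eat, most specific first."""
    index = defaultdict(list)
    for rule in rules:
        index[rule[1][0]].append(rule)
    for group in index.values():
        group.sort(key=lambda r: len(r[0]) + len(r[1]) + len(r[2]), reverse=True)
    return index


def translate(word, index):
    out = []
    pos = 0
    while pos < len(word):
        for pre, current, post, output in index.get(word[pos], []):
            end = pos + len(current)
            if (word.startswith(current, pos)      # characters to eat
                    and word.startswith(post, end)  # lookahead, not eaten
                    and word[:pos].endswith(pre)):  # already eaten
                out.append(output)
                pos = end
                break
        else:
            out.append(word[pos])  # no rule matched: pass the character through
            pos += 1
    return " ".join(out)


index = build_index(RULES)
for word in ("cat", "cent", "chat", "rose"):
    print(word, "->", translate(word, index))
```

Because each group is sorted before matching starts, taking the first successful match is the same as taking the most specific one.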
Once we have a sequence of “phonemes” eSpeak NG needs to translate them into sounds. For many phonemes eSpeak packages brief WAV files (though I recall them starting out with just the “a” sound), whilst others are defined by another DSL distorting these sounds.
This one includes IF statements like you’d see in any procedural language, and a compiler lowers those IFs to a more efficient bytecode. The sound effects it outputs are applied in a separate pass.
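To illustrate what lowering IFs to bytecode means in general, here is a toy compiler and interpreter. The instruction names, the condition name and the emitted effect names are invented for this sketch and do not reflect eSpeak's actual bytecode.

```python
def compile_block(stmts, code):
    """Flatten nested ("if", cond, then, else) statements into jump instructions."""
    for stmt in stmts:
        if stmt[0] == "if":
            _, cond, then_body, else_body = stmt
            code.append(["JUMP_IF_FALSE", cond, None])
            jump_false = len(code) - 1
            compile_block(then_body, code)
            code.append(["JUMP", None])
            jump_end = len(code) - 1
            code[jump_false][2] = len(code)   # else branch starts here
            compile_block(else_body, code)
            code[jump_end][1] = len(code)     # both branches rejoin here
        else:
            code.append(["EMIT", stmt[1]])
    return code


def run(code, context):
    """Execute the flat instruction list and collect the emitted effects."""
    pc, out = 0, []
    while pc < len(code):
        op = code[pc]
        if op[0] == "EMIT":
            out.append(op[1])
            pc += 1
        elif op[0] == "JUMP":
            pc = op[1]
        else:  # JUMP_IF_FALSE
            pc = pc + 1 if context.get(op[1], False) else op[2]
    return out


# Hypothetical phoneme program with one condition.
PROGRAM = [
    ("if", "next_is_vowel",
        [("emit", "short_burst")],
        [("emit", "long_burst")]),
    ("emit", "release"),
]

bytecode = compile_block(PROGRAM, [])
print(run(bytecode, {"next_is_vowel": True}))
print(run(bytecode, {"next_is_vowel": False}))
```

The interpreter only collects effect names rather than applying them, which loosely mirrors the effects being applied in a later pass.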
Finally, that gives us sound waves, which eSpeak either plays via a shallow wrapper around PulseAudio, ALSA, Core Audio, etc. or saves as a WAV file.
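For a feel of the "save it as a WAV file" half, this sketch writes a plain sine tone using Python's standard `wave` module. It is not how eSpeak does it; the sample rate, amplitude and function names are arbitrary choices of mine.

```python
import io
import math
import struct
import wave


def sine_samples(freq_hz, seconds, rate=22050, amplitude=0.5):
    """Generate signed 16-bit samples of a sine tone."""
    count = int(rate * seconds)
    return [
        int(32767 * amplitude * math.sin(2 * math.pi * freq_hz * n / rate))
        for n in range(count)
    ]


def write_wav(target, samples, rate=22050):
    """Write mono 16-bit PCM samples to a path or a binary file object."""
    with wave.open(target, "wb") as wav:
        wav.setnchannels(1)
        wav.setsampwidth(2)   # bytes per sample
        wav.setframerate(rate)
        wav.writeframes(struct.pack("<%dh" % len(samples), *samples))


# Write into memory here; pass a filename instead to create a file on disk.
buffer = io.BytesIO()
write_wav(buffer, sine_samples(440, 0.5))
print(len(buffer.getvalue()), "bytes written")
```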
The other notable thing about eSpeak is that it has a number of key-value “voice” files augmenting the phoneme -> sound or, to a significantly lesser extent, text -> phoneme DSLs with minor variations.
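The idea of a voice file as a small set of overrides can be sketched like this. I don't reproduce the exact syntax here, so the parser assumes one setting per line with the key separated from its value by `=` or whitespace, and the setting names and default values are hypothetical.

```python
import re

# Hypothetical base settings that a voice file may override.
DEFAULTS = {"pitch": "80 120", "speed": "100", "flutter": "2"}


def parse_voice(text):
    """Parse one setting per line; key and value split on '=' or whitespace."""
    settings = {}
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        parts = re.split(r"\s*=\s*|\s+", line, maxsplit=1)
        settings[parts[0]] = parts[1] if len(parts) > 1 else ""
    return settings


def apply_voice(defaults, voice):
    """Overlay the voice's settings on a copy of the defaults."""
    merged = dict(defaults)
    merged.update(voice)
    return merged


voice = parse_voice("pitch=140 200\nspeed 120\n")
print(apply_voice(DEFAULTS, voice))
```

Settings the voice file doesn't mention keep their default values, which is what makes each voice a minor variation rather than a full definition.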