My Rhapsode auditory web browser relies upon Voice2JSON as a high-level speech recognition engine. This page describes how Voice2JSON & several of its underlying tools work. While Voice2JSON shares much code with Rhasspy, I will not distinguish where that split is.
Before you can use Voice2JSON you need to separately download a “profile” for your language, configure sentences.ini within it, & run
voice2json train. Rhapsode 5 does the last two steps itself, & instructs you on performing the first.
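As a flavour of what that file holds (this grammar is my own toy example, not Rhapsode’s): each [section] names an “intent”, & each line under it is a JSGF-style sentence template with alternatives & {tags}.

```ini
[LightState]
turn (on | off){state} the light
```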
voice2json train converts that sentences.ini file Rhapsode can generate into a lower-level “language model” understood by the “backend” (which may be PocketSphinx, Kaldi, Mozilla DeepSpeech, or Julius), merging it with those from the downloaded profile.
Once logging is set up, the commandline arguments are parsed, & configuration is loaded (apparently the profile can be autodownloaded, but it’s not working for me) into a Voice2JsonCore object.
voice2json train-profile starts by parsing the given profile’s sentences.ini file.
The first pass of parsing is done using Python’s builtin INI parser, before handing off to a JSGF (Java Speech Grammar Format) parser to handle the grammar rules on each line. That parser is a handwritten pushdown automaton with no lexer.
With the phrases & the “rules” they might reference collected into separate mappings, it recurses over the parsed Abstract Syntax Trees (ASTs) to normalize upper/lower case and/or convert numeric ranges into “slot commands”.
It then gathers the set of all “slots” referenced by the grammar rules, so it can load the corresponding plain-text (JSGF) or executable files, yielding a dictionary of all slot values to be merged with the rules dictionary.
The next piece of JSGF voice2json train-profile lowers is Arabic numerals, which it passes through the num2words Python module to match how you’d say them in your native language, annotating the result so
recognize-intent will yield those digits instead.
num2words has a different class for each human language it supports, dispatching to the right one; these usually refer to a “cards” mapping from denominations to names.
To get the weights right on the corresponding graph for the grammar, it first counts how many expanded sentences there are for each “intent” (a command triggered by saying one of several “sentences”), normalizing those counts by dividing the Least Common Multiple by each count, then by their sum.
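One plausible reading of that balancing step, so each intent ends up equally likely regardless of how many sentences it expands to (the intent names & counts here are invented):

```python
import math

# hypothetical expanded-sentence counts per intent
counts = {"SetLight": 4, "GetTime": 2, "PlayMusic": 6}

lcm = math.lcm(*counts.values())                    # 12
weights = {k: lcm // v for k, v in counts.items()}  # 3, 6, 2
total = sum(weights.values())                       # 11
probs = {k: w / total for k, w in weights.items()}
```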
It creates a start node in the graph, both globally & for each intent. The final nodes are gathered afterwards to link them all to a single global one.
Active substitutions are counted, & tagged AST nodes emit edges marking where the tag begins & ends.
If there are any “converters” (JSON type-conversion programs or Python functions) on a tagged or substituted AST node, they are added onto a stack for which it’ll emit
__convert__… edges for all of them in reverse order.
It recurses into each subexpression of a collection.
ALTERNATIVE sequences share the same start & end nodes across all subexpressions.
GROUP chains them together.
Word nodes add an edge for themselves & any substitutions, if nothing else is silencing them.
Rules & slot references are looked up in the previously-computed mapping before converting those expressions into a graph. Slots also add
__source__… edges to indicate their names.
After any of that recursion it emits edges to close any specified substitutions, JSON type conversions, & tagging.
This graph will then be Python “pickled” (via the standard Python (de)serializer) & GZip’d.
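That serialization amounts to something like the following (the file name & edge attribute here are invented for illustration):

```python
import gzip
import os
import pickle
import tempfile

import networkx as nx

# a stand-in for the intent graph
g = nx.DiGraph()
g.add_edge(0, 1, olabel="turn")

path = os.path.join(tempfile.mkdtemp(), "intent_graph.pickle.gz")

# pickle the graph, GZip-compressed
with gzip.open(path, "wb") as f:
    pickle.dump(g, f)

# round-trip it back, as recognize-intent later does
with gzip.open(path, "rb") as f:
    g2 = pickle.load(f)
```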
It reads in the pronunciations table & the rest is back-end specific.
The Python module used to represent the graph corresponding to a voice-command grammar is “networkx” & its DiGraph class, which stores successor & predecessor 2D edge mappings containing all caller-specified attributes. Nodes also have their own attributes mapping, from which new node IDs for the edge mappings are allocated.
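Those two edge mappings are plainly visible in the DiGraph API (the labels here are my own, not Voice2JSON’s):

```python
import networkx as nx

g = nx.DiGraph()
g.add_edge("a", "b", word="hello", weight=1)

# per-node attributes mapping
g.nodes["a"]["kind"] = "start"

# the same edge-attribute dict is reachable from both directions:
print(g.succ["a"]["b"])  # successor view
print(g.pred["b"]["a"])  # predecessor view, same attributes
```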
A tool called Phonetisaurus is used to guess any missing pronunciations, which I should probably have installed! I’ll discuss this component later.
For PocketSphinx it converts that graph to an ARPA language model, & writes out the subset of pronunciations it requires.
It converts the graph to an ARPA language model by first outputting it to the binary FST format & running various OpenGrm (which I’ll also describe later) commands over it, whilst extracting all input words from the graph edges to output to a separate vocabulary file.
For Kaldi it specially outputs a G.fst file or generates an ARPA language model as per PocketSphinx. It writes the pronunciation dictionary & language model, extends the shell PATH for Kaldi commands, clears & repopulates some files & dirs, & runs some OpenGrm & Kaldi commands.
For Mozilla DeepSpeech it also converts the graph to an ARPA language model, which it’ll run further commands over to convert to a binary format & to “generate a scorer package”.
For Julius it adds stop/silence words to the vocabulary, normalizes arguments, converts the graph to an ARPA language model, & writes the pronunciation files.
Turns out this backend-specific code is fairly similar…
As any English speaker (amongst other languages) will tell you, the way we write or type our words does not necessarily line up with how we say them. On Rhapsode’s output side I use eSpeak NG to resolve all the pronunciation rules (and exceptions) amongst other things. On the input side Voice2JSON uses Phonetisaurus which can, with or without a neuralnet, learn your peculiarities. The results are handed off to your selected backend.
For every word in the grammar that Voice2JSON doesn’t yet know how to pronounce, it consults
phonetisaurus-apply with the pre-prepared FST model.
Once phonetisaurus-apply has parsed all the commandline arguments, configured logging, & validated them, it wraps
phonetisaurus-g2pfst with or without first consulting a “lexicon” file.
Once it’s parsed all the commandline arguments & referenced files,
phonetisaurus-g2pfst evaluates that model in C++.
This core “Phoneticize” logic of
phonetisaurus-g2pfst starts by splitting the word, optionally by some delimiter, & looking up a number to represent each character (or character cluster). It then iterates over these to produce a Finite State Automaton (graph/network) represented via OpenFST (another project in the stack to study!), which it’ll combine with the preprepared model.
After adjusting the FSA it searches from the final state to the first to guess which edges were taken to produce the output.
With the results of that “Hidden Markov Model” it applies an array of “filters” twice to postprocess the cost.
So what’s OpenFST?
It’s a self-contained, mostly-headerfile C++ library for representing NFAs & DFAs ([Non-]Deterministic Finite Automatons), directed labelled networks for simple parsing. It offers (de)serialization to/from LempelZiv-compressed files, and common/generic transformations, many of which are wrapped in shell commands.
And collections for implementing them.
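For a sense of the format: OpenFST’s textual representation (consumed by its fstcompile command) lists one arc per line as source state, destination state, input label, output label, & an optional weight, with final states on a line of their own. A two-arc acceptor for the word “hi” might read:

```
0 1 h h
1 2 i i
2
```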
NFA to NGrams ARPA via Opengrm
If, when I discussed OpenFST, you thought of the networkx graph Voice2JSON parses its sentences.ini files into, you’d be correct! That graph is also serialized to an OpenFST graph to be processed into NGrams via Opengrm. Some variant or other of those NGrams is what’s handed to the voice recognition backend to describe what it should expect to hear.
Voice2JSON uses it to gather count estimates from the graph for each NGram in a parallel array, and convert back into a graph, …
In recreating the OpenFST graph for some n-gram counts it can fit those counts to a specified distribution, in Voice2JSON’s case Witten Bell. These distributions are C++ subclasses of the class which rebuilds the FST.
Voice2JSON further uses opengrm to optionally merge & normalize that ngram-normalized model with a prepared one from the profile (with different formulas to re-weight shared edges), and to output it to ARPA format which is directly or indirectly supported by the backends.
Once Voice2JSON has fine-tuned the backend voice recognition engine to better recognize the vocabulary an app like Rhapsode expects, we need to actually feed said backend audio input. From this we get a “transcribed” text which isn’t yet constrained to match the listed voice commands, though the raw text will still be useful for, say, input[type=text] controls.
Once the normal commandline parsing & initialization is done, it creates an audio source & a segmenter.
The segmenter is sourced from Google’s WebRTC implementation, whilst the audio input is either a commandline-specified file or a YAML-configured command defaulting to
arecord -q -r 16000 -c 1 -f S16_LE -t raw. In either case it expects raw 16bit 16khz mono audio.
Another subcommand doesn’t do the segmenting, passing the prerecorded wav file directly over to the backend.
It also constructs a Transcriber to wrap the backend speech-to-text engine.
It reads the audio from the live file stream in fixed-sized chunks. For each it atomically resets a flag, sends the chunk through the segmenter & transcriber, & outputs any outstanding transcriptions.
If the segmenter detected a pause in speech it enqueues None onto the transcriber, stops & restarts the segmenter, & extracts its audio to optionally output to stdout and/or a specified directory. It counts transcriptions, waits for the next, & validates it before restarting the segmenter.
The transcriber runs in its own thread (because apparently threading no longer sucks in Python!) utilizing an atomic queue, printing the results itself before resetting the atomic condition.
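A minimal sketch of that producer/consumer shape, substituting strings for audio chunks & for the real backend (all names here are my own):

```python
import queue
import threading

audio_chunks = queue.Queue()
results = []

def transcriber():
    # hypothetical worker thread: None marks the end of an utterance
    words = []
    while True:
        chunk = audio_chunks.get()
        if chunk is None:
            results.append(" ".join(words))
            break
        words.append(chunk)

t = threading.Thread(target=transcriber)
t.start()
for c in ["turn", "on", "the", "light"]:
    audio_chunks.put(c)
audio_chunks.put(None)  # the segmenter's end-of-speech signal
t.join()
```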
For PocketSphinx it uses language bindings to run it in-process:
- Initialize the commandline options
- Convert all raw audio bytes to a C file descriptor to hand to PocketSphinx
- Wrap the result in the standard Voice2JSON type
For Kaldi it might use one of two approaches.
It can start its neural net command & write the WAV files to its stdin & FIFO inputs, before parsing its stdout into standard Voice2JSON types. Or it can concatenate the WAV file to send to the pipeline for GMM language models.
For Julius it runs the command with appropriate flags, points Julius to each concatenated WAV file, & parses its results.
Finally for Mozilla DeepSpeech it loads the neural net if missing, along with the configured scorer, has it open an audio stream to feed the WAV to, & extracts the results to be extensively converted into the standard Voice2JSON type.
The segmenter or, as they call it, “recorder” adds a couple layers around Google’s WebRTC code to analyze whether you’re talking. Not as straightforward as you might guess.
The default audio input is from ALSA, possibly intercepted by Pulse Audio.
The process of converting a graph-based language model to an ngram-based language model is lossy, so there’s a separate command
voice2json recognize-intent for constraining the backend’s output (via the
transcribe-stream subcommands) to match the graph. This can be done fuzzily or strictly according to the profile’s configuration.
Rhapsode further applies its own fuzzy matching (Levenshtein distance) to determine which URL to load.
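That metric is the classic edit distance, computable in a few lines (this sketch is mine, not Rhapsode’s actual code):

```python
def levenshtein(a: str, b: str) -> int:
    # dynamic-programming edit distance: minimum number of
    # insertions, deletions, & substitutions to turn a into b
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,            # delete from a
                           cur[j - 1] + 1,          # insert into a
                           prev[j - 1] + (ca != cb)))  # substitute
        prev = cur
    return prev[-1]

print(levenshtein("kitten", "sitting"))  # 3
```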
The first thing the
voice2json recognize-intent subcommand does is to read in all “stopwords” to remove from the transcribed speech, and the available converter commands to possibly run over any captured “tags”, & reads in the previously “pickled” (Python-serialized) & GZip’d graph.
It iterates over & optionally JSON-parses every sentence piped on stdin from the backend. It splits each sentence by space, optionally replaces any digits, applies the constraints, & serializes that to JSON.
For strict matching
voice2json recognize-intent finds the start node & breadth-first traverses the transcription as a path through the Nondeterministic Finite Automaton. Failing this it tries again after removing stopwords.
For lazy matching it performs a breadth-first search, discarding any partially-computed results worse than the best so far & yielding the top result.
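In miniature, the strict traversal is a breadth-first walk of a word-labelled graph (this toy graph & vocabulary are my own invention, not Voice2JSON’s):

```python
from collections import deque

# hypothetical word automaton: state -> {word: next state}
edges = {0: {"turn": 1}, 1: {"on": 2, "off": 2}, 2: {"the": 3}, 3: {"light": 4}}
final = {4}

def matches(words):
    # breadth-first search over (state, words consumed) pairs
    todo = deque([(0, 0)])
    while todo:
        state, i = todo.popleft()
        if i == len(words):
            if state in final:
                return True
            continue
        nxt = edges.get(state, {}).get(words[i])
        if nxt is not None:
            todo.append((nxt, i + 1))
    return False
```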
In either case it postprocesses the selected path through the graph.
The first pass of this postprocessing gathers all “output labels”, thereby applying any substitutions specified by the grammar, including the substitutions inserted when expanding numbers into text.
Then it iterates over them again to track a stack of converters active along each path segment, running the looked-up converters upon closing a segment.
Next it repeats a similar process to gather each tagged segment to produce a list (and mapping) of “entities”. This’ll be crucial for webforms!
Finally it concatenates the words into a string & computes how confident it is in the new transcription.
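The end result, as I understand it, resembles the following JSON (the values here are invented; consult Voice2JSON’s docs for the full schema):

```json
{
  "text": "turn on the light",
  "intent": {"name": "LightState", "confidence": 1.0},
  "entities": [{"entity": "state", "value": "on"}],
  "slots": {"state": "on"}
}
```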
transcribe-stream & recognize-intent can be used together to determine which voice commands the user has said. But how do we know when to listen for a voice command?
We could use some other channel like, say, the spacebar, and/or we could use
wait-wake to determine when the program’s name is called.
voice2json wait-wake streams the same 16bit 16khz audio input (that 16khz sampling rate is interesting: it’s far below the standard 44.1khz, presumably to reduce the computational load without cutting off typical frequencies of human speech) as
transcribe-stream to the Mycroft Precise command & edge-detects the output probabilities.
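A toy version of that edge detection, with the threshold & probability stream invented for illustration:

```python
def rising_edges(probs, threshold=0.5):
    # yield the indices where the wake-word probability
    # first crosses the threshold upward
    active = False
    for i, p in enumerate(probs):
        if p >= threshold and not active:
            active = True
            yield i
        elif p < threshold:
            active = False

print(list(rising_edges([0.1, 0.2, 0.9, 0.8, 0.3, 0.7])))  # [2, 5]
```

Only the rising edge triggers, so a sustained high probability counts as one wake event rather than one per audio chunk.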
Mycroft Precise is a Tensorflow or Keras neuralnet from spectrograms to (un-sigmoid’d) probabilities, with tools to train it on a directory of WAV files & compare the resulting performance to PocketSphinx. These try to minimize both false positives & consecutive false negatives, the “annoyance” factor.
Keras is easier to train, but Tensorflow is faster so there’s a command to convert from Keras to Tensorflow neuralnets.
sonopy (or legacy
speechpy) computes the spectrograms.
Tensorflow is Google’s heavily optimized (for a wide variety of hardware, including custom designs) neuralnet C++ implementation, with lots of language bindings. There’s also JIT’ing; is this for evolutionary programming? And it schedules tasks between all the different hardware e.g. Google owns. While it’s not my taste I can see the appeal of having a singular toolbox for machine learning, and I’m sure it has features other equivalent tools lack.
It’s quite a massive, very thorough codebase for a relatively simple concept which I’ll let 3Blue1Brown explain: https://invidious.silkky.cloud/watch?v=aircAruvnKk
Underlying Voice2JSON is a choice of four lower-level speech-to-text backends.
CMU PocketSphinx is probably the fastest & arguably simplest. I don’t understand any details, but it measures various aspects of the input audio & matches them to a “language model” of one of a handful of types, including wakewords.
Kaldi (named for the Ethiopian goatherder who discovered the coffee plant) appears at a glance to work similarly to PocketSphinx, except it supports many, many more types of language models, including several types of neuralnets (custom implementation). Maybe even combining them?
I have heard that Kaldi’s output’s better, which makes sense. Interestingly both Kaldi & PocketSphinx provide GStreamer plugins, so how about exposing this automated subtitling @elementary ?
Mozilla DeepSpeech appears to be little more than specially trained Tensorflow neuralnets (“deep learning”, since apparently neuralnets need a new name when they’re too big). These tend to require much more training, but once you do they can give excellent results.
One drawback of neuralnets is that it’s harder for Voice2JSON to guide them to follow a more constrained grammar. So it converts that model to a “scorer” which re-weights the neuralnet’s candidate transcriptions as it decodes.
Finally Julius seems to be more opinionated about the type of language models it uses, which should help to make it fast & reliable. Specifically it uses “word N-gram and context-dependent HMM” via “2-pass tree-trellis search”.