Tonight I wish to start studying how CMU Sphinx works, in reference to the hello world app from their getting started tutorial. Specifically I’ll describe PocketSphinx’s cmd_ln_free_r function & its relatives, leaving the 5 other PocketSphinx functions for the rest of the week.
PocketSphinx will power Rhapsode’s voice recognition capabilities, though I’m looking at Mozilla DeepSpeech as an alternative I wish to support.
This function (amongst others like cmd_ln_parse_file_r, which I think I’ll use in Rhapsode) receives configuration arguments using a commandline-like syntax, and mostly reformats them into an array for parse_options to handle. From there it reformats those arguments into a closed hashmap stored in the cmdln structure, parsed in reference to a second hashmap reformatted from an array provided by ps_args(), before storing the raw arguments in that allocated structure.
cmd_ln_init(...) is the generic SphinxBase routine; ps_args() is the PocketSphinx component of it, which just returns a program constant. That constant is defined via a bunch of macros in both SphinxBase & PocketSphinx, defining options for each.
There’s also a reference-counted destructor, cmd_ln_free_r, to release the memory associated with the configuration, its closed hashmap, & its array.
The PocketSphinx-specific initializer ps_init allocates extensive state (some of which is computed or parsed from referenced files), extracting various configuration properties as parsed above. The corresponding destructor ps_free simply frees that memory & closes those files.
ps_start_utt starts by validating the recognizer’s state, resets & starts some performance counters, NULLs/frees previous recognition data (including in the ACoustic MODel & its underlying Feature Extractor, etc), opens logging files, & calls the start method on the configured searchers.
ps_end_utt on the other hand validates its current state, & flushes data from the ACoustic MODel & its underlying Feature Extractor (various properties computed live) into a ringbuffer & its log files. This is split into read & write stages, the latter of which does most of the processing. It then flushes data out of the registered search (or two) via its step methods, stops the performance counters, &, if configured, retrieves the “hypothesis” (which I’ll describe later) & logs it to a configured logfile.
ps_get_hyp calls a method on the configured “searcher” to read back the (partial?) output, whilst profiling how long it takes.
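Tying those functions together, the getting-started tutorial’s decode loop looks roughly like this (paraphrased from memory rather than compiled here; MODELDIR, fh, buf, nread, hyp, & score are left undeclared):

```c
cmd_ln_t *config = cmd_ln_init(NULL, ps_args(), TRUE,
        "-hmm", MODELDIR "/en-us/en-us",
        "-lm", MODELDIR "/en-us/en-us.lm.bin",
        "-dict", MODELDIR "/en-us/cmudict-en-us.dict",
        NULL);
ps_decoder_t *ps = ps_init(config);

ps_start_utt(ps);
while ((nread = fread(buf, sizeof(int16), 512, fh)) > 0)
    ps_process_raw(ps, buf, nread, FALSE, FALSE);
ps_end_utt(ps);

hyp = ps_get_hyp(ps, &score);   /* read back the recognized text */

ps_free(ps);
cmd_ln_free_r(config);          /* the reference-counted destructor */
```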
The language model refers to an ngram file which may be parsed as a bin, arpa, or dmp file & optionally modified (via a call to the .apply_weights() method) by the -lw &/or -wip flags.
If it’s a binary format, it first opens the file and checks the header’s correct. Next it reads off the “order” and that many “counts”, before alloc/init’ing the ngram model with a method table. Then it reads in dummy data, “quants”, possibly the ngram_mem property, & the underlying text before closing the file.
The arpa parser also incorporates a “line iter” to help parse the plaintext syntax & uses a separate routine to build a trie from those parsed/sorted lines with help from a priority queue. dmp meanwhile appears to be a binary format that also requires the same trie builder.
These ngram models include methods for:
- Freeing their memory.
- Storing those weights parsed from the input flags.
- Looking up a weighted or unweighted score for some key in the trie, which is structured weirdly in memory.
- Adding a unigram.
- Flushing those changes.
Moving onto the allphone-specific parts, it first allocs/inits the memory for the searcher (including via a generic initializer). From there it initializes a hidden-markov-model, extracts the -lw flags, builds a graph corresponding to the language model, extracts information from that language model (complaining if one isn’t provided), & initializes the rest of the properties.
The allphone search has methods for:
- Starting a search resets the Hidden Markov Model properties, n_hmm/sen_eval, history, acmod->mdef, frame, & perf properties, as well as the HMM’s score, history, & frame properties.
- For each “step” it clears any HMMs active in the ACoustic MODel, reenabling the ones that have progressed to the current frame. Then it extensively computes a score for that frame & progresses any active HMMs through their probabilities graph, dispatching to different functions depending on the n_emit_state properties. Before finally reading their current state in a more usable format.
- Finishing a search involves traversing the history backwards to find the highest score & outputting profiling data.
- .reinit() applies to the generic search properties & the allphone-specific initialization described above.
- Freeing the memory starts with outputting some profiling data.
- You cannot get a lattice output for an allphone search.
- The hypothesis output is read from the backtrace and all active HMMs.
- It doesn’t know the probabilities.
- You can iterate over the hypotheses.
Hidden Markov Models
In PocketSphinx most searchers (including FSG) use “Hidden Markov Models” (HMM) to recognize phonemes from the computed feature vectors.
An HMM models its input as following a probabilistic state machine, and attempts to figure out which state that input is in. The feature vectors are various metrics (which I don’t fully comprehend) that describe various aspects of the input sound wave.
It looks best to start by describing its datamodel.
An HMM context consists of fields for:
- Number of emitting states
- A 3D (tp[id][from][to]) array of transition probabilities
- A 1D array of probabilities for each state
- A 2D array for “senone sequence matching”
- A temporary 1D array for senone scores, with its own freelist.
And the HMM itself contains:
- A context
- Scores and history indices for emitting states
- Non-emitting exit state score & history index
- Senone sequence ID
- Senone or sequence IDs
- Best (emitting) state score in current frame
- The transition matrix ID (first index into the context’s 3D probabilities matrix)
- Frame in which the HMM was last active
- Is it multiplexed?
- Number of emitting states from the context.
There are routines to init, free, & print this data but I’m interested in the eval functions, of which there are multiple chosen based on those last two properties.
Looking at the most general HMM eval implementation, it starts by computing the basic probabilities for each state based in part on the old one.
Next it computes/selects the most favourable score for the exit state and stores it in the history, followed by the scores for all the other states, returning the best score of all of them. The most probable states so far are stored in the history array.
The other variants of this function amount to manual inlining.
The ngram searcher is initialized upon encountering & parsing the -lmctl flags, after which it also captures -bestpath whilst allocating several lists.
Upon starting a search, it unsets its done flag, flushes its model (via its flush method), & starts its fwd tree/flat (by unsetting all their properties & reconstructing the latter’s “wordlist” & “channels”).
Upon step it progresses both its fwdtree & fwdflat. The fwdtree progresses the ACoustic MODel’s feature vectors & HMMs, normalizing scores (subtracting the best score) & pruning the worst scores if necessary. 3 layers of HMMs attempt to match the words, the last judged by the ngrams. Following that the fwdtree filters out any ummh “fillers”; scores words according to the HMMs, the ACoustic MODel’s MDef, & the ngram’s score method; deactivates filtered channels; & increments the framecount.
fwdflat alternatively stores the result for the previous frame, renormalizes HMM scores if necessary, evaluates the HMMs for all channels (pruning and/or persisting results as necessary), and judges results based on the ngram & fillers list.
Upon finish it’ll reset all the properties for the fwd tree/flat subsearchers & output profiling information; if both are configured, the flat subsearcher will be run over the results of the tree subsearcher.
Upon (re)init it (re)allocates its arrays if there’s a different number of words available, recalculates beamwidth based on various configuration flags (reallocating arrays and the ngram as necessary), & reinits the HMMs for the fwd tree/flat for the ngram words.
As always there’s a free method to release all its memory & output profiling information.
You can convert the results recorded by those subsearchers into a lattice, by first constructing the corresponding DAG (skipping fillers, etc) then locating the start and (judging by recorded scores) end nodes. After which the scores are adjusted according to the ngrams.
You can also get a “hypothesis” for the lattice, or if it’s not complete the underlying recordings.
Scores can only be computed for finished lattices.
You can similarly construct an iterator over the lattice, or if it’s not complete its underlying recordings, assembled into full words with computed scores.
Phone Loop Searcher
Upon start it resets all its HMMs and frees any renormalization data.
For each step it progresses the HMMs via the ACoustic MODel, scores the feature vector via the ACoustic MODel, & evaluates all HMMs, pruning & recording the results.
Upon finish it doesn’t do anything.
Upon (re)init it initializes the underlying HMMs, extracting the -pl_window flag to determine how many, and extracting the log of the
This time the free method doesn’t also output profiling information.
No lattice, hypothesis, probability, or segment iterator may be outputted from it; it serves only to augment other searchers.
Lattices are a straightforward in-memory (Directed Acyclic) Graph: lists of start, end, & all nodes, as well as all edges, alongside freelists and other data. This means you can process it pretty much however you like, though it is slightly more expensive for PocketSphinx to build.
State Align Searcher
The State Align Searcher must be initialized (by initializing an HMM & extracting information from the provided “alignment”, which it loads into said HMM) by the calling application rather than via flags; the alignment specifies the text the user is expected to read aloud.
Upon start it “enters” the HMM.
Upon step it progresses the HMMs according to the ACoustic MODel’s feature vector, renormalizing & pruning the HMMs whilst recording the remaining transitions.
Upon finish it finds the best exit state of the HMM and iterates over the frames to see & record how they align, before extrapolating those transitions from applying to states to applying to phones and then words.
It does nothing upon reinit, and doesn’t output profiling upon free.
To return a hypothesis it iterates over the alignments, returning how far it got through the alignment text. And you can iterate over the possible alignments.