CMU Sphinx

Tonight I wish to start studying how CMU Sphinx works in reference to their hello world app from their getting started tutorial. Specifically I’ll describe PocketSphinx’s cmd_ln_init & cmd_ln_free_r functions, leaving the 5 other PocketSphinx functions for the rest of the week.

PocketSphinx will power Rhapsode’s voice recognition capabilities, though I’m looking at Mozilla DeepSpeech as an alternative I wish to support.

This function (amongst others like cmd_ln_parse_file_r, which I think I’ll use in Rhapsode) receives configuration arguments using a commandline-like syntax, and mostly reformats them into an array for parse_options to handle.

From there it reformats those arguments into a closed hashmap stored in the cmd_ln structure, parsed in reference to a second hashmap built from an array provided by ps_args(), before storing the raw arguments in the allocated cmd_ln_t structure.

Whereas cmd_ln_init(...) is the generic SphinxBase routine, ps_args() is the PocketSphinx component of it, which just returns a program constant. That constant is defined via a bunch of macros in both SphinxBase & PocketSphinx, defining options for each.

There’s also a reference-counted destructor cmd_ln_free_r, to release the memory associated with the configuration, its closed hashmap, & its array.
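To make this concrete, here’s a minimal sketch along the lines of the getting started tutorial (the model paths are placeholders you’d adjust for your install):

    #include <pocketsphinx.h>

    int main(void)
    {
        cmd_ln_t *config;

        /* ps_args() supplies the option definitions; the varargs supply
         * the values, which land in the closed hashmap described above. */
        config = cmd_ln_init(NULL, ps_args(), TRUE,
                             "-hmm", "model/en-us/en-us",
                             "-lm", "model/en-us/en-us.lm.bin",
                             "-dict", "model/en-us/cmudict-en-us.dict",
                             NULL);
        if (config == NULL)
            return 1;

        /* ... hand the config to ps_init(), decode, etc ... */

        /* The reference-counted destructor releases the hashmap & array. */
        cmd_ln_free_r(config);
        return 0;
    }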


The PocketSphinx-specific initializer ps_init allocates extensive state (some of which is computed or parsed from referenced files), extracting various configuration properties as parsed above.

The corresponding destructor ps_free simply frees that memory & closes those files.
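In application code that pairing looks something like this sketch (error handling kept minimal):

    #include <pocketsphinx.h>
    #include <sphinxbase/err.h>

    /* Sketch: allocate a decoder from a parsed config; ps_init() parses
     * the model files the configuration references. */
    static ps_decoder_t *make_decoder(cmd_ln_t *config)
    {
        ps_decoder_t *ps = ps_init(config);
        if (ps == NULL)
            E_FATAL("ps_init failed; check the -hmm/-lm/-dict paths\n");
        return ps;
    }

    /* ... and later: ps_free(ps); releases that state & closes the files. */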


ps_start_utt starts by validating the recognizer’s state, resets & starts some performance counters, NULLs/frees previous recognition data (including in the ACoustic MODel & its underlying feature extractor, etc), opens logging files, & calls the start method on the configured searchers.

ps_end_utt on the other hand validates its current state, then flushes data from the ACoustic MODel & its underlying feature extractor (various properties computed live) into a ringbuffer & its log files. This is split into read & write stages, the latter of which does most of the processing.

It then flushes data out of the registered search (or two) via its finish & step methods, stops the performance counters, &, if configured, retrieves the “hypothesis” (which I’ll describe later) and logs it to a configured logfile.


ps_get_hyp calls a method on the configured “searcher” to read back the (partial?) output, whilst profiling how long it takes.
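Putting those together, the tutorial’s decode loop looks roughly like this (a sketch assuming 16kHz 16-bit mono PCM & the 5prealpha signatures):

    #include <stdio.h>
    #include <pocketsphinx.h>
    #include <sphinxbase/err.h>

    /* Sketch of the utterance lifecycle: start, feed raw audio,
     * end, then ask the configured searcher for its hypothesis. */
    static void decode_file(ps_decoder_t *ps, const char *path)
    {
        FILE *fh = fopen(path, "rb");
        int16 buf[512];
        size_t nread;
        int32 score;
        const char *hyp;

        if (fh == NULL)
            E_FATAL_SYSTEM("Failed to open %s", path);

        ps_start_utt(ps);                /* resets counters & searcher state */
        while ((nread = fread(buf, sizeof(int16), 512, fh)) > 0)
            ps_process_raw(ps, buf, nread, FALSE, FALSE);
        ps_end_utt(ps);                  /* flushes the ACoustic MODel & searches */

        hyp = ps_get_hyp(ps, &score);
        printf("Hypothesis: %s (score %d)\n", hyp ? hyp : "(none)", score);
        fclose(fh);
    }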

The ngram language model refers to an ngram file, which may be parsed as a bin, arpa, or dmp file & optionally modified (via a call to the .apply_weights() method) according to the -lw &/or -wip flags.

If it’s a binary format, it first opens the file and checks the header’s correct. Next it reads off the “order” and that many “counts”, before alloc/init’ing the ngram model with a method table. Then it reads in dummy data, “quants”, possibly the ngram_mem property, & the underlying text before closing the file.

The arpa parser also incorporates a “line iter” to help parse the plaintext syntax & uses a separate routine to build a trie from those parsed/sorted lines with help from a priority queue. dmp meanwhile appears to be a binary format that also requires the same trie builder.
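You can also load such a model standalone via SphinxBase; a sketch (NGRAM_AUTO has it sniff which of the three formats the file is):

    #include <sphinxbase/logmath.h>
    #include <sphinxbase/ngram_model.h>

    /* Sketch: read an ngram model outside the decoder. NGRAM_AUTO
     * detects bin vs arpa vs dmp from the file itself. */
    static ngram_model_t *load_lm(cmd_ln_t *config, const char *path)
    {
        logmath_t *lmath = logmath_init(1.0001, 0, FALSE);
        return ngram_model_read(config, path, NGRAM_AUTO, lmath);
    }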

These ngram models include methods for:


Moving onto the allphone-specific parts, it first allocs/inits the memory for the searcher (including a generic initializer). From there it initializes a hidden-markov-model, extracts the -allphone_ci & -lw flags, builds a graph corresponding to the language model, extracts information from that language model (complaining if one isn’t provided), & initializes the rest of the properties.

The allphone search has methods for:

Hidden Markov Models

In PocketSphinx most searchers (including FSG) use “Hidden Markov Models” (HMM) to recognize phonemes from the computed feature vectors.

An HMM models its input as following a probabilistic state machine, and attempts to figure out which state that input is in. The feature vectors are various metrics (each of which I don’t fully comprehend) that describe various aspects of the input sound wave.


It looks best to start by describing its datamodel.

An HMM context consists of fields for:

And the HMM itself contains:

There are routines to init, free, & print this data but I’m interested in the eval functions, of which there are multiple, chosen based on those last two properties.

Looking at the most general HMM eval implementation, it starts by computing the basic probabilities for each state based in part on the old ones.

Next it computes/selects the most favourable score for the exit state and stores it in the history, followed by the scores for all the other states, returning the best score of all of them. The most probable states so far are stored in the history array.

The other variants of this function amount to manual inlining.
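To make that concrete, here’s a toy sketch of the Viterbi-style update such an eval performs. It’s my own simplification, not PocketSphinx’s actual hmm_vit_eval; the topology & names are assumptions:

    #include <limits.h>

    #define N_STATE 5                  /* assumed left-to-right topology */
    #define WORST_SCORE (INT_MIN / 2)  /* guards against overflow */

    /* For each state pick the best predecessor, add the transition &
     * senone (acoustic) log-probabilities, & record where the best path
     * came from in the history. In the log domain, multiply = add. */
    static int hmm_eval_sketch(int score[N_STATE], int history[N_STATE],
                               const int tprob[N_STATE][N_STATE],
                               const int senscr[N_STATE])
    {
        int newscore[N_STATE], best = WORST_SCORE;
        for (int to = 0; to < N_STATE; ++to) {
            int bestpred = WORST_SCORE, from_best = -1;
            for (int from = 0; from <= to; ++from) {
                if (score[from] <= WORST_SCORE)
                    continue;          /* unreachable predecessor */
                int s = score[from] + tprob[from][to];
                if (s > bestpred) { bestpred = s; from_best = from; }
            }
            newscore[to] = bestpred + senscr[to];
            history[to] = from_best;
            if (newscore[to] > best) best = newscore[to];
        }
        for (int i = 0; i < N_STATE; ++i)
            score[i] = newscore[i];
        return best;   /* callers renormalize by subtracting this */
    }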

NGrams Searcher

This is initialized upon encountering & parsing the -lm (without -allphone) or -lmctl flags, after which it also captures -fwdflat, -fwdtree, -latsize, & -bestpath whilst allocating several lists.

Upon starting a search, it unsets its done flag, flushes its model (via its flush method), & starts its fwd tree/flat (by unsetting all their properties & reconstructing the latter’s “wordlist” & “channels”).

Upon step it progresses both its fwdtree & fwdflat. The fwdtree progresses the ACoustic MODel’s feature vectors & HMMs, normalizing scores (subtracting the best score) & pruning the worst scores if necessary. 3 layers of HMMs attempt to match the words, the last judged by the ngrams. Following that the fwdtree filters out any ummh “fillers”; scores words according to the HMMs, ACoustic MODel’s MDef, & the ngram’s score method; deactivates filtered channels; & increments the framecount.

fwdflat alternatively stores the result for the previous frame, renormalizes HMM scores if necessary, evaluates the HMMs for all channels (pruning and/or persisting results if necessary), and judges results based on the ngram & fillers list.
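That renormalize-&-prune step, shared by both passes, amounts to something like this toy sketch (illustrative names, not PocketSphinx’s actual channel structures):

    /* Shift scores so the best becomes 0, then deactivate any channel
     * whose score falls below the (negative, log-domain) beam width. */
    static int prune_channels(int *scores, int *active, int n,
                              int best, int beam)
    {
        int n_active = 0;
        for (int i = 0; i < n; ++i) {
            scores[i] -= best;
            active[i] = (scores[i] >= beam);
            n_active += active[i];
        }
        return n_active;
    }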

Upon finish it resets all the properties of the fwd tree/flat subsearchers & outputs profiling information, and if both are configured the flat subsearcher will be run over the results of the tree subsearcher.

Upon (re)init it (re)allocates its arrays if there’s a different number of words available, recalculates beamwidths based on various configuration flags (reallocating arrays and the ngram as necessary), & reinits the HMMs for the fwd tree/flat for the ngram words.

As always there’s a free method to release all its memory & output profiling information.

You can convert the results recorded by those subsearchers into a lattice, by first constructing the corresponding DAG (skipping fillers, etc) then locating the start and (judging by recorded scores) end nodes. After which the scores are adjusted according to the ngrams.

You can also get a “hypothesis” for the lattice, or, if it’s not complete, the underlying recordings.

Scores can only be computed for finished lattices.

You can similarly construct an iterator over the lattice or, if it’s not complete, its underlying recordings assembled into full words with computed scores.
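Walking that iterator through the public API looks something like this sketch (signatures vary between PocketSphinx versions; this follows 5prealpha):

    #include <stdio.h>
    #include <pocketsphinx.h>

    /* Sketch: print each recognized segment with its frame range.
     * ps_seg_next() frees the iterator once it runs out. */
    static void print_segments(ps_decoder_t *ps)
    {
        int32 score;
        ps_seg_t *seg;
        for (seg = ps_seg_iter(ps, &score); seg; seg = ps_seg_next(seg)) {
            int sf, ef;
            ps_seg_frames(seg, &sf, &ef);
            printf("%s [frames %d..%d]\n", ps_seg_word(seg), sf, ef);
        }
    }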

Phone Loop Searcher

Upon start it resets all its HMMs and frees any renormalization data.

For each step it progresses the HMMs via the ACoustic MODel, scores the feature vector via the ACoustic MODel, & evaluates all HMMs, pruning & recording the results.

Upon finish it doesn’t do anything.

Upon (re)init it initializes the underlying HMMs extracting the -pl_window flag to determine how many, and extracting the log of the -pl_weight, -pl_beam, -pl_pbeam, & -pl_pip flags.
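For flavour, extracting those flags might look like this sketch (cmd_ln_float*_r & logmath_log are real SphinxBase calls; the struct & field names are my own illustration):

    #include <sphinxbase/cmd_ln.h>
    #include <sphinxbase/logmath.h>

    /* Illustrative container for the extracted phone-loop parameters. */
    struct pl_params {
        float32 weight;
        int32 beam, pbeam, pip;
    };

    static void read_pl_flags(struct pl_params *p,
                              cmd_ln_t *config, logmath_t *lmath)
    {
        p->weight = cmd_ln_float32_r(config, "-pl_weight");
        /* Beams & the phone insertion penalty live in the log domain. */
        p->beam = logmath_log(lmath, cmd_ln_float64_r(config, "-pl_beam"));
        p->pbeam = logmath_log(lmath, cmd_ln_float64_r(config, "-pl_pbeam"));
        p->pip = logmath_log(lmath, cmd_ln_float64_r(config, "-pl_pip"));
    }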

This time the free method doesn’t also output profiling information.

No lattice, hypothesis, probability, or segment iterator may be outputted from it; it serves only to augment other searchers.

Lattices

Lattices are a straightforward in-memory (Directed Acyclic) Graph with lists of start, end, & all nodes as well as all edges, alongside freelists and other data. This means you can process it pretty much however you like, though it is slightly more expensive for PocketSphinx to build.
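As a rough sketch of that datamodel (field names are my own illustration, not ps_lattice_t’s actual layout):

    /* A word lattice as described above: a DAG with explicit lists of
     * nodes & edges, plus distinguished start & end nodes. */
    typedef struct latnode latnode_t;
    typedef struct latlink latlink_t;

    struct latlink {
        latnode_t *from, *to;    /* directed edge between word hypotheses */
        int score;               /* score of taking this edge */
        latlink_t *next;         /* next edge in the global edge list */
    };

    struct latnode {
        const char *word;        /* word hypothesis this node represents */
        int first_frame;         /* when the hypothesis starts */
        latlink_t *exits;        /* outgoing edges */
        latnode_t *next;         /* next node in the global node list */
    };

    typedef struct {
        latnode_t *nodes;        /* all nodes */
        latnode_t *start, *end;  /* entry & exit of the DAG */
        latlink_t *links;        /* all edges */
        /* ... freelists & other bookkeeping ... */
    } lattice_sketch_t;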

State Align Searcher

The State Align Searcher must be initialized (by initializing an HMM & extracting information from the provided “alignment”, which it loads into said HMM) by the calling application rather than via flags; the alignment specifies the text the user is expected to read aloud.

Upon start it “enters” the HMM.

Upon step it progresses the HMMs according to the ACoustic MODel’s feature vector, renormalizing & pruning the HMMs whilst recording the remaining transitions.

Upon finish it finds the best exit state of the HMM and iterates over the frames to see & record how they align, before extrapolating those transitions from states up to phones and then words.

It does nothing upon reinit, and doesn’t output profiling information upon free.

To return a hypothesis it iterates over the alignments, returning how far it got through the alignment text. And you can iterate over the possible alignments.