Tonight I wish to start studying how CMU Sphinx works in reference to the hello world app from their getting started tutorial. Specifically I’ll describe PocketSphinx’s configuration functions (including cmd_ln_free_r), leaving the 5 other PocketSphinx functions for the rest of the week.
PocketSphinx will power Rhapsode’s voice recognition capabilities, though I’m looking at Mozilla DeepSpeech as an alternative I wish to support.
This function (amongst others like cmd_ln_parse_file_r, which I think I’ll use in Rhapsode) receives configuration arguments using a commandline-like syntax, and mostly reformats them into an array for parse_options to handle. From there it reformats that array into a closed hashmap stored in the cmdln structure, parsed in reference to a second hashmap reformatted from an array provided by ps_args(), before storing the raw arguments in the allocated structure.
cmd_ln_init(...) is the generic SphinxBase routine; ps_args() is the PocketSphinx component of it, which just returns a program constant. That constant is defined via a bunch of macros, in both SphinxBase & PocketSphinx, declaring options for each.
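To make that flow concrete, here’s a minimal Python sketch of the idea — not PocketSphinx’s actual code; the function name & defaults table here are purely illustrative — where commandline-like flag/value pairs get checked against a ps_args()-style defaults table & merged into a single hashmap:

```python
# Stand-in for the macro-built options table ps_args() returns;
# flags & default values here are illustrative.
DEFAULTS = {
    "-hmm": None,
    "-lm": None,
    "-lw": "6.5",
    "-wip": "0.65",
}

def cmd_ln_init_sketch(*args):
    """Parse "-flag", "value" pairs into a config hashmap,
    rejecting any flag absent from the defaults table."""
    config = dict(DEFAULTS)
    for flag, value in zip(args[::2], args[1::2]):
        if flag not in DEFAULTS:
            raise KeyError("unknown option: " + flag)
        config[flag] = value
    return config
```

So `cmd_ln_init_sketch("-lw", "9.5")` would yield a config where `-lw` is overridden but `-wip` keeps its default.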
There’s also a reference-counted destructor,
cmd_ln_free_r, to release the memory associated with the configuration, its closed hashmap, & its array.
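The reference-counting here is the usual pattern; a toy Python sketch (class & function names are mine, not SphinxBase’s) of how such a destructor only releases the underlying storage once the last reference is dropped:

```python
class ConfigSketch:
    """Toy refcounted configuration object."""
    def __init__(self, table):
        self.table = table   # the "closed hashmap" of options
        self.refcount = 1

def retain_sketch(config):
    # Hand out another reference, bumping the count.
    config.refcount += 1
    return config

def free_r_sketch(config):
    """Drop one reference; free storage only at zero.
    Returns the remaining refcount."""
    config.refcount -= 1
    if config.refcount == 0:
        config.table = None  # release the hashmap & array
    return config.refcount
```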
The PocketSphinx-specific initializer
ps_init allocates extensive state (some of which is computed or parsed from referenced files), extracting various configuration properties as parsed above.
The corresponding destructor
ps_free simply frees that memory & closes those files.
ps_start_utt starts by validating the recognizer’s state, resets & starts some performance counters, NULLs/frees previous recognition data (including in the ACoustic MODel & its underlying Feature Extractor, etc), opens logging files, & calls the
start method on the configured searchers.
ps_end_utt on the other hand validates its current state, then flushes data from the ACoustic MODel & its underlying Feature Extractor (various properties computed live) into a ringbuffer & its log files. This is split into read & write stages, the latter of which does most of the processing. It then flushes data out of the registered search (or two) via its
step methods, stops the performance counters, &, if configured, retrieves the “hypothesis” (which I’ll describe later) and logs it to a configured logfile.
ps_get_hyp calls a method on the configured “searcher” to read back the (partial?) output, whilst profiling how long it takes.
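Putting that utterance lifecycle together, here’s a hypothetical toy state machine in Python — illustrative names & behaviour, nothing lifted from PocketSphinx itself — capturing the validation & reset behaviour described above: starting fails mid-utterance, previous recognition data gets NULLed, & the hypothesis is read back after ending:

```python
class DecoderSketch:
    """Toy model of the start_utt/end_utt/get_hyp lifecycle."""
    def __init__(self):
        self.in_utt = False
        self.hyp = None

    def start_utt(self):
        if self.in_utt:
            return -1        # validate state: already mid-utterance
        self.in_utt = True
        self.hyp = None      # NULL out previous recognition data
        return 0

    def process(self, samples):
        assert self.in_utt
        self.hyp = "hello world"   # stand-in for the searcher's work

    def end_utt(self):
        if not self.in_utt:
            return -1        # validate state: nothing to end
        self.in_utt = False  # flush searcher/acmod, stop counters
        return 0

    def get_hyp(self):
        return self.hyp
```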
The language model refers to an ngram file which may be parsed as a bin, arpa, or dmp file & optionally modified (via a call to its
.apply_weights() method) by the -lw &/or -wip flags.
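If I’m reading the code right, that weighting roughly amounts to scaling each log-probability by the language weight & adding the log of the word insertion penalty — a hedged sketch, where the function name is mine & the 6.5/0.65 defaults are the documented ones:

```python
import math

def apply_weights_sketch(log_prob, lw=6.5, wip=0.65):
    """Roughly what applying -lw & -wip amounts to:
    scale the log-probability by the language weight,
    then add the log word-insertion penalty."""
    return log_prob * lw + math.log(wip)
```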
If it’s a binary format, it first opens the file and checks the header’s correct. Next it reads off the “order” and that many “counts”, before alloc/init’ing the ngram model with a method table. Then it reads in dummy data, “quants”, possibly the
ngram_mem property, & the underlying text before closing the file.
The arpa parser also incorporates a “line iter” to help with the plaintext syntax, & uses a separate routine to build a trie from those parsed/sorted lines with help from a priority queue. dmp meanwhile appears to be a binary format that also requires the same trie builder.
These ngram models include methods for:
- Freeing their memory.
- Storing those weights parsed from the input flags.
- Looking up a weighted or unweighted score for some key in the trie, which is structured weirdly in memory.
- Adding a unigram.
- Flushing those changes.
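As a rough model of that trie lookup — plain nested dicts standing in for the packed in-memory layout, with a simple backoff to shorter histories; all names here are mine, not the library’s:

```python
def add_ngram(trie, words, score):
    """Insert an ngram's score at the leaf of its word path."""
    node = trie
    for w in words:
        node = node.setdefault(w, {})
    node["#score"] = score

def ngram_score(trie, words):
    """Look up a score, backing off to shorter histories
    when the full ngram is absent."""
    for start in range(len(words)):
        node = trie
        found = True
        for w in words[start:]:
            if w not in node:
                found = False
                break
            node = node[w]
        if found and "#score" in node:
            return node["#score"]
    return float("-inf")
```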
Moving onto the allphone-specific parts, it first allocs/inits the memory for the searcher (including a generic initializer). From there it initializes a hidden-markov-model, extracts the
-lw flag, builds a graph corresponding to the language model, extracts information from that language model (complaining if one isn’t provided), & initializes the rest of the properties.
The allphone search has methods for:
- Starting a search by resetting the Hidden Markov Model properties (n_hmm/sen_eval, history, acmod->mdef, frame, & perf properties), plus the HMM’s score, history, & frame properties.
- For each “step” forwards it clears any HMMs active in the ACoustic MODel, reenabling the ones that have progressed to the current frame. Then it extensively computes a score for that frame & progresses any active HMMs through their probability graphs, dispatching to different functions depending on the
n_emit_state properties, before finally reading their current state out in a more usable format.
- Finishing a search involves traversing the history backwards to find the highest score & outputs profiling data.
- .reinit() applies to the generic search properties.
- Freeing the memory starts with outputting some profiling data.
- You cannot get a lattice output for an allphone search.
- The hypothesis output is read from the backtrace and all active HMMs.
- It doesn’t know the probabilities.
- You can iterate over the hypotheses.
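The backtrace that finishing a search performs can be sketched like so — a toy in Python, where the history layout & field names are my guesses rather than the real structs: each history entry records a phone, a score, & a link to its predecessor, and we walk backwards from the best-scoring entry:

```python
def backtrace_sketch(history):
    """history: list of (phone, score, prev_index_or_None) tuples.
    Pick the highest-scoring entry & follow predecessor links
    backwards, returning the phone sequence in forward order."""
    if not history:
        return []
    best = max(range(len(history)), key=lambda i: history[i][1])
    phones = []
    i = best
    while i is not None:
        phone, _score, prev = history[i]
        phones.append(phone)
        i = prev
    return phones[::-1]
```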