CMU Sphinx

Tonight I wish to start studying how CMU Sphinx works in reference to their hello world app from their getting started tutorial. Specifically I’ll describe PocketSphinx’s cmd_ln_init & cmd_ln_free_r functions, leaving the 5 other PocketSphinx functions for the rest of the week.

PocketSphinx will power Rhapsode’s voice recognition capabilities, though I’m looking at Mozilla DeepSpeech as an alternative I wish to support.

cmd_ln_init (amongst others like cmd_ln_parse_file_r, which I think I’ll use in Rhapsode) receives configuration arguments using a commandline-like syntax, and mostly reformats them into an array for parse_options to handle.

From there it reformats the array into a closed hashmap stored in the cmdln structure, parsed in reference to a second hashmap reformatted from an array provided by ps_args(), before storing the raw arguments in the allocated cmd_ln_t structure.
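To make that flow concrete for myself, here’s a minimal Python sketch of parsing a flat argument list against a table of option definitions. This is not SphinxBase’s code: the parse_options signature, the DEFINITIONS table & the default values in it are all assumptions of mine for illustration.

```python
# Hypothetical option table standing in for what ps_args() returns:
# flag name -> (converter, default value). The entries are illustrative only.
DEFINITIONS = {
    "-hmm": (str, None),
    "-lw": (float, 6.5),
    "-samprate": (int, 16000),
}


def parse_options(argv, definitions):
    """Parse a flat ["-flag", "value", ...] list into a dict of typed values."""
    if len(argv) % 2 != 0:
        raise ValueError("every flag needs a value")
    # Start from the defaults, as a lookup table keyed by flag name.
    config = {name: default for name, (_convert, default) in definitions.items()}
    for flag, raw in zip(argv[0::2], argv[1::2]):
        if flag not in definitions:
            raise ValueError(f"unknown argument: {flag}")
        convert, _default = definitions[flag]
        config[flag] = convert(raw)
    return config


config = parse_options(["-hmm", "model/en-us", "-lw", "2.0"], DEFINITIONS)
print(config)
```

The definitions act as the second lookup table: they decide which flags are legal & how each raw string gets converted.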

Whereas cmd_ln_init(...) is the generic SphinxBase routine, ps_args() is the PocketSphinx component of it & just returns a program constant. That constant is defined via a bunch of macros, in both SphinxBase & PocketSphinx, each defining its own options.

There’s also a reference-counted destructor, cmd_ln_free_r, to release the memory associated with the configuration, its closed hashmap, & its array.
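Reference counting is easy to sketch. The Config class below is a toy stand-in I’ve made up, not the real cmd_ln_t; I’m assuming the destructor reports how many references remain, & only releases anything once that hits zero.

```python
class Config:
    """Toy stand-in for a reference-counted configuration object."""

    def __init__(self, options):
        self.refcount = 1            # the creator holds the first reference
        self.options = dict(options)
        self.freed = False

    def retain(self):
        """Register another owner of this configuration."""
        self.refcount += 1
        return self

    def free_r(self):
        """Drop one reference; release resources once nobody holds one."""
        self.refcount -= 1
        if self.refcount > 0:
            return self.refcount
        self.options.clear()         # stands in for freeing the hashmap & array
        self.freed = True
        return 0


cfg = Config({"-lw": 2.0})
cfg.retain()                         # a second owner, e.g. a decoder
remaining_after_first = cfg.free_r()
still_alive = not cfg.freed
remaining_after_second = cfg.free_r()
```

The point of the design is that whichever owner lets go last pays for the cleanup, so neither needs to know about the other.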

The PocketSphinx-specific initializer ps_init allocates extensive state (some of which is computed or parsed from referenced files), extracting various configuration properties as parsed above.

The corresponding destructor ps_free simply frees that memory & closes those files.

ps_start_utt starts by validating the recognizer’s state, resets & starts some performance counters, NULLs/frees previous recognition data including in the ACoustic MODel & its underlying Feature Extractor etc, opens logging files, & calls the start method on the configured searchers.

ps_end_utt on the other hand validates its current state, then flushes data from the ACoustic MODel & its underlying Feature Extractor (various properties computed live) into a ringbuffer & its log files. This is split into read & write stages, the latter of which does most of the processing.

It then flushes data out of the registered search (or two) via its finish & step methods, stops the performance counters, & if configured retrieves the “hypothesis” (which I’ll describe later) and logs it to a configured logfile.
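The start/end pairing boils down to a small state machine. Here’s a rough sketch of just the state validation & the calls into the searchers; the Decoder & RecordingSearch classes, the state names & the -1 error returns are my own assumptions, & everything to do with audio, counters & logging is left out.

```python
class RecordingSearch:
    """Hypothetical searcher that just records which methods were called."""

    def __init__(self):
        self.calls = []

    def start(self):
        self.calls.append("start")

    def finish(self):
        self.calls.append("finish")


class Decoder:
    """Toy utterance lifecycle: idle -> started -> ended -> started ..."""

    def __init__(self, searches):
        self.state = "idle"
        self.searches = searches
        self.hypothesis = None

    def start_utt(self):
        if self.state == "started":
            return -1                # refuse to start an utterance twice
        self.hypothesis = None       # drop the previous recognition result
        for search in self.searches:
            search.start()
        self.state = "started"
        return 0

    def end_utt(self):
        if self.state != "started":
            return -1                # there is no utterance to end
        for search in self.searches:
            search.finish()
        self.state = "ended"
        return 0


search = RecordingSearch()
decoder = Decoder([search])
```

Calling decoder.end_utt() before decoder.start_utt() fails here, which is the kind of misuse the state validation is there to catch.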

ps_get_hyp calls a method on the configured “searcher” to read back the (partial?) output, whilst profiling how long it takes.

This refers to an ngram file which may be parsed as a bin, arpa, or dmp file & optionally modified (via a call to the .apply_weights() method) by the -lw &/or -wip flags.
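For a feel of what those two flags do, here’s a sketch of one common convention for combining them with a language model score: scale the log probability by the language weight, then add the log of the word insertion penalty. I’m assuming that formula & natural logarithms here, I haven’t confirmed it’s exactly what .apply_weights() computes, & the function below is my own.

```python
import math


def apply_weights(log_prob, lw=1.0, wip=1.0):
    """Scale a log probability by lw, then add log(wip) as a per-word penalty."""
    return lw * log_prob + math.log(wip)


unweighted = apply_weights(-2.0)
weighted = apply_weights(-2.0, lw=2.0)
penalized = apply_weights(-2.0, lw=2.0, wip=0.5)
```

With lw=1 & wip=1 the score passes through unchanged; a wip below 1 makes every extra word cost a little more.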

If it’s a binary format, it first opens the file and checks that the header is correct. Next it reads off the “order” and that many “counts”, before alloc/init’ing the ngram model with a method table. Then it reads in dummy data, “quants”, possibly the ngram_mem property, & the underlying text before closing the file.
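The first steps of that (check the header, read the order, read that many counts) look something like the sketch below. The byte layout is an assumption of mine for illustration: a made-up magic string, a 1-byte order & little-endian 32-bit counts. I haven’t checked it against the real file format.

```python
import io
import struct

MAGIC = b"Toy Language Model"        # hypothetical header string


def read_header(stream):
    """Validate the magic string, then read the order & per-order counts."""
    header = stream.read(len(MAGIC))
    if header != MAGIC:
        raise ValueError("bad header")
    (order,) = struct.unpack("<B", stream.read(1))
    # One unsigned 32-bit count per n-gram order (unigrams, bigrams, ...).
    counts = struct.unpack(f"<{order}I", stream.read(4 * order))
    return order, list(counts)


# Build a tiny in-memory "file" for a trigram model & read it back.
blob = MAGIC + struct.pack("<B", 3) + struct.pack("<3I", 100, 250, 75)
order, counts = read_header(io.BytesIO(blob))
print(order, counts)
```

Knowing the counts up front is what lets the reader allocate the model before reading the rest of the file.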

The arpa parser also incorporates a “line iter” to help parse the plaintext syntax & uses a separate routine to build a trie from those parsed/sorted lines with help from a priority queue. dmp meanwhile appears to be a binary format that also requires the same trie builder.
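ARPA files are simple enough to sketch a parser for: a \data\ section listing counts, then one \N-grams: section per order with “logprob words [backoff]” lines, then \end\. The code below is my own simplification, not SphinxBase’s parser: it uses nested dicts for the trie & a plain sort where the real routine uses a priority queue, & the sample model is made up.

```python
SAMPLE = r"""
\data\
ngram 1=3
ngram 2=2

\1-grams:
-1.0 <s> -0.5
-0.7 hello -0.3
-0.9 world

\2-grams:
-0.2 <s> hello
-0.4 hello world

\end\
"""


def parse_arpa(text):
    """Return (counts per order, list of (words, log_prob, backoff))."""
    counts, entries, section = {}, [], None
    for line in text.splitlines():
        line = line.strip()
        if not line or line == "\\data\\":
            continue
        if line == "\\end\\":
            break
        if line.startswith("ngram "):
            order, count = line[len("ngram "):].split("=")
            counts[int(order)] = int(count)
        elif line.startswith("\\") and line.endswith("-grams:"):
            section = int(line[1:-len("-grams:")])
        else:
            if section is None:
                raise ValueError(f"n-gram line outside a section: {line}")
            fields = line.split()
            words = tuple(fields[1:1 + section])
            # The trailing backoff weight is optional; treat it as 0.0 if absent.
            backoff = float(fields[1 + section]) if len(fields) > 1 + section else 0.0
            entries.append((words, float(fields[0]), backoff))
    return counts, entries


def build_trie(entries):
    """Insert the sorted n-grams into a trie of nested dicts keyed by word."""
    root = {"children": {}, "log_prob": None, "backoff": 0.0}
    for words, log_prob, backoff in sorted(entries):
        node = root
        for word in words:
            node = node["children"].setdefault(
                word, {"children": {}, "log_prob": None, "backoff": 0.0})
        node["log_prob"], node["backoff"] = log_prob, backoff
    return root


counts, entries = parse_arpa(SAMPLE)
trie = build_trie(entries)
```

Looking up the bigram “hello world” is then a walk through trie["children"]["hello"]["children"]["world"].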

These ngram models include methods for:

Moving on to the allphone-specific parts, it first allocs/inits the memory for the searcher (including a generic initializer). From there it initializes a hidden-markov-model, extracts the -allphone_ci & -lw flags, builds a graph corresponding to the language model, extracts information from that language model (complaining if one isn’t provided), & initializes the rest of the properties.

The allphone search has methods for: