Let’s say we’ve downloaded an HTML file, as per usual. Here I’ll discuss how we’d use the hypothetical hardware I’ve established to reformat it into interactive audio!
I’d parse HTML or XML tags & text to populate the datastack, whether pushing, popping, or swapping its head whilst sorting attributes, running a given callback each time to parse said datastack. This HTML/XML parser would be run multiple times with different callbacks.
Upon HTML closing tags I’d run 2 passes over the datastack in addition to the callback parser. The first checks whether the tag hasn’t already been implicitly closed. The second closes all tags up to & including the given one. This strategy isn’t WHATWG-compliant, but it’s working pretty well in HTML-Conduit.
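As a minimal sketch of those 2 closing passes (in Python rather than anything our hypothetical hardware would run; the function name is mine):

```python
def close_tag(open_tags, name):
    """Close `name` against a stack of open tags, implicitly closing
    anything opened since it; a no-op if it was already closed."""
    if name not in open_tags:    # pass 1: already implicitly closed?
        return []
    closed = []
    while open_tags:             # pass 2: close up to & including it
        tag = open_tags.pop()
        closed.append(tag)
        if tag == name:
            break
    return closed
```

So closing a p against [html, body, p, b] implicitly closes the dangling b first, whilst closing a p that was never opened (or already closed) does nothing.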
To allow restyling webpages into a vastly different medium than the one they’re all-too-often designed for, it is useful to separate HTML’s “semantics” from the page’s styles. This is what CSS is for! On our hypothetical hardware I’d implement it by compiling CSS into a parser over the stack of opentags during HTML parsing. So in detail, how would this work?
The first pass would evaluate media queries in the Arithmetic Core & concatenate fetched imports, with recursion & parentheses balancing. Next I’d split style rules with multiple selectors & possibly do some desugaring. That allows a pass which counts up the selector specificity (using the Parsing Unit & Arithmetic Core) to sort stylerules (all subcomponents) by their now-single selector’s specificity! Cascade Layers may use a fastpath involving an ordered-trie to be concatenated for output. A follow-up pass would split off !important properties for higher precedence.
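To illustrate that specificity-sorting pass, here’s a toy specificity counter (simple compound selectors only; combinators & pseudo-classes omitted, and the names are mine):

```python
def specificity(selector):
    """Count (IDs, classes, type names) in a simple compound selector."""
    ids = classes = types = 0
    for part in selector.replace('#', ' #').replace('.', ' .').split():
        if part.startswith('#'):
            ids += 1
        elif part.startswith('.'):
            classes += 1
        elif part != '*':
            types += 1
    return (ids, classes, types)

# Sort rules lowest-specificity first, so later (higher) rules win:
rules = [('p.note', 'speak: normal'), ('#main', 'volume: loud'), ('p', 'pause: 1s')]
rules.sort(key=lambda r: specificity(r[0]))
```

Python’s tuple comparison gives us the IDs-beat-classes-beat-types ordering for free.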
A 5th pass would desugar some more now that we don’t need to worry about messing up selector-specificity, including handling shorthand properties & stripping out invalid longhand properties to aid webdevs in targeting multiple browsers. Prior to codegen a 6th pass would sort selector tests into the order we expect to see their matches on the datastack. Then for each rule codegen can emit code that outputs the properties upon seeing a matching element!
Except… The Parsing Unit I described offloads some of its instruction-prefetch demands onto the code it runs, so a final 8th pass would need to enforce a rigid structure onto the code we output whilst maintaining the presorted output order. To do so I’d track how much memory is being used. When we overflow a Rule, Grammar, or even Language I’d bucket selectors by their prefix to ensure they fit. Where they don’t overflow we can leave this bucketting to the Parsing Unit instruction decoder.
From the processes I’ve been describing we’d get a mostly standard XML file, though it wouldn’t yet adhere to the relevant schema SSML. So I’d include a pass which deduplicates via the XML-parser’s attribute sorting (supporting revert would complicate this) & syntactically transforms it into valid SSML. Besides, that’d aid a pause-collapsing pass! On the less trivial side, speak: never would strip out elements whilst content would replace children.
The pause-collapsing pass would manually-parse the XML to merge contiguous pauses.
There are counters to be evaluated over the styletree! Parse counters, increment them by the given value, & serialize according to the referenced counter-style at-rule. The Parsing Unit could look up counters by name, but implementing a hashmap in the Arithmetic Core may be faster.
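A rough sketch of that counter evaluation, modelling the hashmap the Arithmetic Core might maintain (the node format here is invented purely for illustration):

```python
def evaluate_counters(nodes):
    """Walk flattened styled nodes, tracking CSS-style counters."""
    counters = {}               # name -> current value: the hashmap
    out = []
    for node in nodes:
        for name, value in node.get('reset', []):       # counter-reset
            counters[name] = value
        for name, delta in node.get('increment', []):   # counter-increment
            counters[name] = counters.get(name, 0) + delta
        if 'use' in node:                               # content: counter(name)
            out.append(counters.get(node['use'], 0))
    return out

# e.g. an <ol>: reset once, then increment-and-read per <li>
tree = [{'reset': [('list-item', 0)]},
        {'increment': [('list-item', 1)], 'use': 'list-item'},
        {'increment': [('list-item', 1)], 'use': 'list-item'}]
```

The serialization to e.g. Roman Numerals per the counter-style at-rule would follow afterwards.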
The whole processor would perform deduplication as described in my sorting section (later properties taking precedence may be the simpler order to generate). The Parsing & Output Units would do the bulk of this transform. The Arithmetic Core would evaluate counters, possibly using the Parsing Unit for counter lookups. The Arithmetic Core & Parsing Unit would perform the bulk of pause-collapsing.
SSML is a XML-based markup language for denoting the intonations which should be applied to some text. On our hypothetical hardware I’d parse this into a pull pipeline concatenating silences, fetched & decoded audio files, & parameterized speech synthesis.
Then we have <mark> elements. The spec defers to the embedding app for how to handle those, & suggests that we build a trie mapping these names to their slices of this pull pipeline so we can jump to these spots.
In our case we’d hook <mark> elements up to a callback in the Arithmetic Core’s privileged mode setting the specified counter, so that upon hitting the joystick’s buttons I’d increment or decrement the appropriate counter, outputting the new label to set the playback-callback to. Horizontal buttons navigate paragraphs. Vertical navigates headers, or in a <table> they navigate its <td>s 2-dimensionally. The center button switches to silence & back (play/pause).
eSpeak is an iconic robotic-sounding speech synthesizer supporting SSML, implemented predominantly as numerous textual datafiles parsed & interpreted by a C core.
This architecture makes it a strong candidate for porting to our hypothetical device! We just need to port the interpreters to hardware arguably more suited to the task. Amongst these are voices (looked up by name or by perceived gender & age with an index) tweaking these speech-synthesis passes.
The first of these passes is to convert the Unicode characters into more phonetic “phonemes”. eSpeak provides 3 datafiles for this per-language!
Several words, especially our most common ones, do not follow standard pronunciation rules. So eSpeak provides a TSV pronunciation dictionary for these exceptions, which listeners would notice if we didn’t correct them. This includes letter pronunciation, used according to the SSML/CSS/caller-configured mode. Also certain characters like emojis or maths operators need to be read out, so they collectively get their own TSV name dictionary.
The final file is a true DSL encoding those normal pronunciation rules! Each line of it matches preceding (already “eaten”), current (will be eaten), & proceeding (won’t be eaten) chars, mapping them to a phonetic ASCII alphabet of output.
The ability for the Parsing Unit to be in multiple states would help parse the proceeding characters. As for the preceding ones, that could readily be handled if we allow parameterizing the linker… I’d build a trie from the preceding chars for the custom-linker to constrain which rules can match subsequent chars. The parser it generates would speculatively include parsers for other prefixes to fill the memory-page, based on text prediction.
Each line of that file should nicely fit in a “rule”, though there’d often be a massive branching factor we can’t possibly branch-predict. Also we’d record the matched word so we can reconsider pronunciation after stripping the suffix, if we find one…
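To make that preceding/current/proceeding structure concrete, here’s a toy matcher for a single such rule (the tuple format & sample rules are my own simplification of eSpeak’s actual syntax):

```python
def apply_rule(word, pos, rule):
    """Try one (pre, match, post, phonemes) rule at `pos` in `word`.
    Only `match` is eaten; `pre` must lie behind us, `post` ahead."""
    pre, match, post, phonemes = rule
    end = pos + len(match)
    if (word[max(0, pos - len(pre)):pos] == pre and
            word[pos:end] == match and
            word[end:end + len(post)] == post):
        return end, phonemes    # advance past `match`, emit phonemes
    return pos, None            # no match: try the next rule
```

E.g. a rule mapping “ph” to the phoneme “f” fires in “uphill” but not in “pin”.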
Also there’d need to be a codepath for serializing numbers to localized words… eSpeak consults its pronunciation dictionaries as part of its process for this… You can probably imagine how we’d program the Parsing Unit to do this for whatever languages you speak!
Emphasis & Enunciation
Emphasis serves to reinforce the meaning of our words & sentences, whilst clarifying any ambiguity. Some languages use emphasis to distinguish words, in others incorrect emphasis sounds very wrong.
So we add a pass(es) which finalizes emphasis from the determined pronunciation according to the language’s various rules. Then determine which frequency/pitches (and their gradients) in the configured range to use according to a chosen mode. That process could be largely done by the Parsing & Output Units, with minor use of the Arithmetic Core for e.g. counting.
This is followed up by converting the phonemes into parameters for shaping a simulated vocal tract, for which eSpeak provides datafiles in a TCL-like scripting language which it precompiles, whilst pre-generating histograms/“spectrograms” to aid transitions. In our hypothetical hardware I’d have the Parsing Unit dispatch these to the Arithmetic Core where the parameters will be used, & we can easily evaluate the conditions.
How would we simulate a vocal tract through which our hypothetical device can monolog webpages at you? The way eSpeak does it is using a technique developed by a rushed Dr. Dennis Klatt in the 1980s to give his voice to the then-recently paralyzed Prof Stephen Hawking. That’s what I’ll explore, but I’m sure we can do better! Even without resorting to machine learning!
At its core it sends a choice of rudimentary waveforms (or occasionally, a WAV recording) through 7–17 “resonators”. These resonators are 3-wide multiply-sums over the previous resonators, which, combined with some miscellaneous multiplies tweaking results, should just barely fit in the 64 multiply-adds per live audiosample limit I’ve arbitrarily set!
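A single such resonator is a 2-pole filter whose 3 coefficients are exactly that 3-wide multiply-sum. Here’s a sketch using the standard Klatt-style coefficient derivation from a centre frequency & bandwidth (both in Hz):

```python
import math

def resonator_coeffs(freq, bw, rate=44100):
    """Klatt-style 2-pole resonator coefficients for a given
    centre frequency & bandwidth at the given sample rate."""
    r = math.exp(-math.pi * bw / rate)
    c = -r * r
    b = 2 * r * math.cos(2 * math.pi * freq / rate)
    a = 1 - b - c               # normalizes gain to 1 at DC
    return a, b, c

def resonate(samples, a, b, c):
    """Run samples through one resonator: 3 multiply-adds each."""
    y1 = y2 = 0.0
    out = []
    for x in samples:
        y = a * x + b * y1 + c * y2
        out.append(y)
        y2, y1 = y1, y
    return out
```

Cascading 7–17 of these (each feeding the next) shapes the raw waveform into vowels & consonants.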
Also there’s echoing, mixing in previous samples. Then we have breathing… This can be represented using random noise, with its volume (amongst other aspects of the speech synthesis) conditional upon 3 summed & scaled sine waves using the builtin wave generators.
So I’d have the Output Unit generate some basic waveforms, the Arithmetic Core determine which factors to send to the Multiply-Adder conditional upon samples from a wave-generator circuit (representing breathing), & the multiply-adder between us & the speakers would evaluate the “resonators” (plus a couple other things).
Combine that with Parsing & Output Units determining how to pronounce words, and we have a device that can read text aloud! With intonations specified by SSML or CSS!
What intonation knobs do SSML & auditory CSS offer? How’d we implement them within the eSpeak process?
There’s of course the concept of “voices”, which primarily impacts the “resonators” forming the simulated vocal tract. Also there’s volume for a final multiply, combined with stereo positioning implemented by adjusting relative volumes between left & right speaker “drivers”. Rate/duration adjusts the timing of pauses & transitions between phonemes.
Then we have pitch, range, & contour, which are resolved (according to pronunciation rules & the specified “stress” level) during an intonation pass over the selected phonemes. A resonator is adjusted (using some trig; there’s probably enough time in transitions & pauses to compute that in the multiply-adder?) to produce the selected frequency.
And finally we have say-as to adjust the rules of when it spells out letters whilst lowering characters to phonemes.
To give the link-nav experience I want where you state a topic for our hypothetical device to infodump on, we need speech recognition. & in turn AI.
Users would start the process by either holding the joystick (interrupts to Arithmetic Core) or saying a “wakeword” (detected by a smaller neural net running in the AI co-processor with a single output, hooked up to cleaned microphone input). In either case this would cause the AI co-processor to switch to a larger neural network translating cleaned microphone input into guessed text.
Maybe we split the speech-recognition into audio -> phonemes & phonemes -> characters/words? I struggle to fully comprehend this! Regardless, a model we can use to reason about this is to linguistically model the input as a state machine with edge probabilities. The computer’s job then is to calculate the possible paths through the “Hidden Markov Model” state machines explaining the observed output, alongside their probabilities, yielding text + probabilities sent to the main processor.
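A toy Viterbi decoder makes that “most probable path” computation concrete (the state names & probabilities below are invented for illustration):

```python
def viterbi(obs_probs, trans, start):
    """Most probable state path through a Hidden Markov Model.
    obs_probs: per-frame {state: P(frame|state)};
    trans: {(s1, s2): P(s2 follows s1)}; start: {state: P(start)}."""
    paths = {s: ([s], start[s] * obs_probs[0][s]) for s in start}
    for frame in obs_probs[1:]:
        new = {}
        for s2 in frame:
            # keep only the best way of arriving at s2 this frame
            path, p = max(((path + [s2],
                            p * trans.get((s1, s2), 0.0) * frame[s2])
                           for s1, (path, p) in paths.items()),
                          key=lambda t: t[1])
            new[s2] = (path, p)
        paths = new
    return max(paths.values(), key=lambda t: t[1])
```

Real recognizers prune aggressively & work in log-probabilities, but the shape of the computation is the same.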
This “confusion graph” could then readily be converted into a parser over the link table we’ve extracted from the HTML, selecting a link. From which we can restart the process I’ve described from the start! Whilst telling the AI chip to tweak its model to rank the accepted result higher for similar input next time. This could increase accuracy in a conversational webform experience.
Computer output, if not its input, is typically modeled numerically. So (let alone that we ask them to do it so often many consider a Full Adder to be the core of computing) computers do a lot of arithmetic! So in turn, since text is a common input, they frequently need to parse numbers!
Parsing an int is easy! For each digit we multiply the result by the “base” (typically 10 or a power-of-2) & add the digit. On our Arithmetic Core 10x = x<<1 + x<<3, & a power-of-two base can be folded into the adds.
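That shift-&-add parse, sketched in Python:

```python
def parse_int(text):
    """Parse a decimal string multiplying only by shifts & adds."""
    result = 0
    for ch in text:
        result = (result << 1) + (result << 3)  # result *= 10, shift-&-add style
        result += ord(ch) - ord('0')            # then add the new digit
    return result
```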
Floats parse directly to Scientific Notation, decrementing the exponent for each digit past the decimal place to line things up. Then OSs put a lot of work into optimizing conversion from x×10^y into x×2^y, i.e. dividing by the appropriate power-of-10. But if our floating point arithmetic is mainly converting between metric or percentage units, it’s faster to just not! Though we would still need to convert back to 16bit binary numbers, but given those support a maximum of 5 decimal digits a naive implementation should be fast enough.
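A sketch of parsing straight into that ×10^y representation, never touching binary floating point:

```python
def parse_decimal(text):
    """Parse e.g. "12.5" into (mantissa, base-10 exponent): 125 × 10^-1."""
    mantissa, exponent, seen_point = 0, 0, False
    for ch in text:
        if ch == '.':
            seen_point = True
        else:
            mantissa = mantissa * 10 + (ord(ch) - ord('0'))
            if seen_point:
                exponent -= 1   # each post-point digit shifts the decimal place
    return mantissa, exponent
```

Arithmetic between two such numbers just needs the exponents lined up first, which is a matter of multiplying one mantissa by a power of 10.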
As for outputting numbers, we won’t need to do that as often… Serializing a number to text involves repeatedly dividing by the base until we reach 0 & outputting the remainders in reverse order as characters.
Except… Our Arithmetic Core doesn’t directly implement divide-with-remainder! So we implement it by repeating for each bit:
- Rotate a bit into remainder, whilst rotating carry into dividend.
- Compare remainder against divisor.
- Conditionally subtract!
Use the Lempel-Ziv decompressor to unroll that loop, & have an outer loop output the number.
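Those three steps, plus the outer serialization loop, sketched in Python (the real thing would be unrolled as described):

```python
def divmod_bitwise(dividend, divisor, bits=16):
    """Restoring division: one shift, compare & conditional subtract
    per bit. The quotient accumulates in the freed dividend bits."""
    remainder = 0
    for _ in range(bits):
        top = (dividend >> (bits - 1)) & 1
        remainder = (remainder << 1) | top              # rotate a bit into remainder…
        dividend = (dividend << 1) & ((1 << bits) - 1)  # …whilst shifting the dividend
        if remainder >= divisor:                        # compare against divisor
            remainder -= divisor                        # conditionally subtract!
            dividend |= 1                               # record a quotient bit
    return dividend, remainder                          # dividend now holds the quotient

def serialize(n, base=10):
    """The outer loop: divide until 0, emitting remainders in reverse."""
    digits = []
    while True:
        n, r = divmod_bitwise(n, base)
        digits.append(chr(ord('0') + r))
        if n == 0:
            break
    return ''.join(reversed(digits))
```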
Except… The notation Europeans adopted from India via Arabia isn’t the only way to write numbers. Especially if you’re numbering lists! But most of those are some variation upon that divide-with-remainder algorithm with or without the outer loop.
Then we have notation like Roman Numerals which requires the versatile Change Making Algorithm! Iterate over some symbols & try subtracting off each one’s weight, to determine whether to try again outputting this symbol or proceed to the next one. Please note that our Arithmetic Core is more efficient at serializing Roman Numerals (though for e.g. 4 we’re outputting “IIII” not “IV”) than decimal numbering systems, and that Roman Numerals are far from the only “additive” numbering system.
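The Change Making Algorithm, sketched for purely-additive Roman Numerals (hence the “IIII”):

```python
# Additive weights only: no subtractive pairs like IV or IX.
ROMAN = [(1000, 'M'), (500, 'D'), (100, 'C'), (50, 'L'),
         (10, 'X'), (5, 'V'), (1, 'I')]

def to_roman(n):
    """Greedy change-making: subtract the largest weight that fits,
    emitting its symbol, until nothing remains."""
    out = []
    for weight, symbol in ROMAN:
        while n >= weight:      # try subtracting off its weight…
            n -= weight
            out.append(symbol)  # …outputting this symbol again
    return ''.join(out)
```

Swapping in a different symbol table handles the other additive systems the same way.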
Computers rarely generate audio from scratch (less so when porting eSpeak to our hypothetical hardware), typically anything you hear from them are pre- & post-processed recordings. For which we need standardized fileformats.
A common format for earcons & SFX is WAV. After a header with structural & provenance metadata (encoded in a hierarchy of tagged/sized “chunks”) these store raw audio samples. WAV can store compressed audio, but that hasn’t been coherently designed & is rarely supported.
In our hypothetical hardware, we’d expect CD-quality 16bit 44.1khz stereo output. Though we could easily handle mono too!
Correcting bitdepths could be little more than a bitshift. Extra channels could be summed into the left/right channels in the appropriate proportions. Correcting sampling rate would have the Arithmetic Core track where we’re at in the new rate & look up bell-curve weights for the multiply-adder to interpolate surrounding values in its subsampling.
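Reducing that bell-curve interpolation to its simplest (2-tap linear) form for illustration; the hardware version would look up wider weight windows instead:

```python
def resample(samples, src_rate, dst_rate):
    """Naive linear-interpolation resampler: track a fractional
    position in the source & weight the two neighbouring samples."""
    out, step = [], src_rate / dst_rate
    pos = 0.0
    while pos < len(samples) - 1:
        i, frac = int(pos), pos - int(pos)
        out.append(samples[i] * (1 - frac) + samples[i + 1] * frac)
        pos += step
    return out
```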
For compressed audio of non-trivial length WAV has extensions, but there’s no coherent design so it’s rarely used or supported. Industry tends to stick to their MP3 standards. Though the public standard FLAC has gained significant adoption due to its excellent performance despite being lossless! FLAC’s the one I’m familiar with, so how does it work? How do we implement it on our hypothetical hardware?
We start by parsing off headers for whoever’s interested (joystick nav? Onscreen display?). Once the Parsing Unit splits the FLAC file into blocks (& subblocks per each channel), the appropriate handler is dispatched to the Output Unit, Arithmetic Core, or multiply-adder to:
- Predict the samples.
- Add in “residuals”.
- Mix the channels.
The Output Unit can handle incompressible whitenoise or trivially compressible near-silence itself. Aside from those, Linear Prediction multiply-sums the last N samples by factors in the file. This can be handled by the multiply-adder.
Some of those polynomials are common enough that they can be concisely called up by a FLAC file, & those are trivial enough we can let our Arithmetic Core handle them.
In theory 64 multiply-adds per sample might not be enough for degenerate FLAC files (especially taking account later stages), but I sincerely doubt these occur in the wild. And the Arithmetic Core could perform a couple of those multiply-adds itself, it’d just be a little slow at it!
As for inevitably incorrect predictions… They’re almost-always close to correct, so to encode the minor corrections we use “Rice codes”: a unary-coded count supplies the high-order bits & a configurable number of literal bits supplies the rest. Our hypothetical Parsing Unit’s bitwise-mode combined with the Arithmetic Core can trivially handle this! These values are added to the output sample for the channel, tweaking future predictions.
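A sketch of decoding one such Rice code from a bit stream, including the zigzag fold FLAC uses so small residuals of either sign stay small (the function shape is mine):

```python
def read_rice(bits, k):
    """Decode one Rice code with parameter k from a bit iterator."""
    q = 0
    while next(bits) == 0:      # unary quotient: zeros until the terminator
        q += 1
    low = 0
    for _ in range(k):          # k literal low-order bits
        low = (low << 1) | next(bits)
    value = (q << k) | low
    # zigzag unfold: 0,-1,1,-2,2,… mapped back to signed residuals
    return -(value >> 1) - 1 if value & 1 else value >> 1
```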
Finally in stereo recordings there’s plenty of duplicate data between the 2 channels, so FLAC standardizes a choice of some trivial arithmetic (which our multiply-adder or even our Arithmetic Core can perform) to remove most of that redundancy between the 2 channels. If the file contains more than 2 channels… I might not be able to justify having enough grunt in our hypothetical hardware to decompress them all live!
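One of those stereo modes, mid/side, undone with exactly that trivial arithmetic (a sketch: the encoder stores the averaged “mid” minus its lowest bit, which rides along on “side”):

```python
def decode_mid_side(mid, side):
    """Undo FLAC's mid/side decorrelation, returning (left, right).
    mid ≈ (L+R)/2, side = L-R; side's low bit restores mid's dropped bit."""
    mid = (mid << 1) | (side & 1)
    return (mid + side) >> 1, (mid - side) >> 1
```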
Resampling, etc might be required, just like for WAV!
The Quite OK Audio format is a simple & fast lossy codec by software developer Dominic Szablewski, exploring the broader space of compression technology. It uses Linear Predictive Coding like FLAC does, but lossily.
We’d implement it by having the Parsing Unit extract the metadata (though I think this aspect of the spec needs proofreading to avoid complicated edgecases Dominic appears to be unaware of) & look up the “residuals” to be fed into the multiply-adder alongside 4 “history” & “weights” parsed out of that metadata. The Arithmetic Core would finalize the 4bit “quantized scalefactor”.
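An illustrative sign-sign LMS predictor in the style QOA uses, with its 4 history samples & 4 weights; the shift amounts here are placeholders for illustration, not values I can vouch for from the spec:

```python
def lms_predict(history, weights):
    """Predict the next sample from 4 past samples & 4 adaptive weights
    (fixed-point: the shift scales the products back down)."""
    return sum(h * w for h, w in zip(history, weights)) >> 13

def lms_update(history, weights, residual):
    """Sign-sign LMS: nudge each weight toward/away by a scaled residual,
    according to the sign of its history sample."""
    delta = residual >> 4
    return [w + (-delta if h < 0 else delta)
            for h, w in zip(history, weights)]
```

The decoded residual both corrects the predicted sample & trains the weights, so the predictor adapts to the signal as it plays.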
Again resampling, etc might be required.