How would our hypothetical hardware-browser decode & display the images commonly found online?
I’d have the main processor output texture data alongside some opcodes telling the compositor how to dereference a coordinate stored in certain registers. Thus giving those registers special meaning.
The Layout Co-Processor may wrap the images in additional opcodes (operating on those special registers) to implement image resizing & the like.
Atlasing
The compositing hardware will only have so many “texture units” in which to store images to be composited onscreen. So what’d we do if we have more images to render onscreen than straightforwardly fit in the Compositing Co-processor? We’d merge them into a single image!
This can be split between the Layout & Compositing Co-processors.
I’d extend CSS’s DSL so that there’s an alternate entrypoint the main processor can call when it runs out of texture units. This entrypoint would gather up all resized/transformed images from the styletree & max-heap by height. With the Layout Co-Processor’s concurrency, this should be done by the time the Compositor Co-Processor is ready for more work. During a static frame the Layout Unit would repeatedly lookup the next image & where to place it to be concurrently composited into the atlas.
As I do so the layout unit would track the “skyline” (using an urban metaphor) of how tall the atlas is at each range of x coordinates. Both look ups can benefit from the Layout Concurrency & complete well before the Compositing Unit dequeues the resulting sprite.
This doesn’t actually stop the Compositing Unit from needing to load new images mid-frame, which could be optimized using a ringbuffer of upcoming images. It however does cut down significantly on memory use & compute overhead in the subsequent frames.
QOI
QOI is an image format designed by a gamedev trying his hand at compression. The results are fairly competitive (and faster to compress!) on a wide range of sample images, whilst being trivial to implement. QOI (after a brief header) consists of a handful of opcodes emitting pixels left-to-right top-to-bottom.
One of those opcodes repeats the previous pixel a given number of times (run-length encoding), which could be implemented using a subset of the LZ decompressor I established as part of the hypothetical Output Unit.
There’s a couple opcodes encoding a diff from the previous colour, at different bitwidths for that diff. Trivial for our hypothetical Arithmetic Core to do! Then there’s a couple literal opcodes.
For each new colour (even for literals) we output we write them into a hashmap at: (r * 3 + g * 5 + b * 7 + a * 11) % 64
. Here I’d send them through the Arithmetic Core so that it can compute this hash & write the colour into that address whilst outputting it.
It’d be useful for QOI if I could tell this Arithmetic Core (using a bootstrapping program?) to run brief instructions whilst leaving RGBA values in registers, & the hashmap (top of RAM would be trivial to access!) in RAM alongside its hashfunction. The hashing should take about a dozen opcodes, given how primitive the ones I’ve defined are, plus the code to output & store the pixel.
The compositor simply dereferences.
PNG
PNG is a popular image format designed in the 1990s to compress images effectively whilst avoiding infringing any patents of the time. And avoiding compatibility issues between programs which support the format, since partial-support had been an issue for previous file formats.
The core compression logic is Lempel-Ziv 77 a.k.a. Deflate, same as you’d find in a Zip file. And which I’ve already established that I’d dedicate circuits to in our Output Unit. But that was designed for text, whilst images need to be “filtered” to meet Deflate’s assumptions.
PNG has a handful of different ways in which it might filter each row of an image:
- Identity transform (that is: no change)
- Diff from previous pixel
- Diff from above pixel
- Diff from average of above & left
- Choose between top, left, & topleft pixels by which has the smallest diff from
left + top - topleft
.
To handle cases involving edges any pixels above or to the left are 0.
After decompression a decoder de-filters to produce red, green, blue, & alpha channels, yielding them to the Compositor Co-Processor for it to perform 4 fetches to lookup the colour for each pixel.
JPEG
JPEG is a popular & longstanding lossy compressed image format (pedantry to be had) designed to efficiently store the natural detail in photos. Whilst not storing the imperceptibly fine details. Its an ISO standard, so I’m left referring to the myriad of secondary sources.
It shrinks memory/bandwidth requirements by roughly a tenth, trading off more compute to decode it. How’d I implement it in the hypothetical hardware I’ve established?
The Parsing Unit would decode the container format JPEG sits within, & Huffman-decode the actual image data. The Output Unit would perform run-length decoding.
The rest of the JPEG serves to extract redundancy (with some fudging) to expose to this lossless compression. It is well suited to the Compositing Co-Processor’s multiply-summer & random-access texture fetches, though it is a lot to compute live per-pixel. It’d especially benefit from being loaded into an atlas! And from special resizing logic.
After multiplying some constants to undo JPEG’s “quantization” (where detail is lost) we can use a handful of multiply-sums (including both additions & subtractions in that sum) accessing the data in a more-compressible zig-zag pattern to move from processing frequencies to samples. Judging by the JuicyPixels implementation I’m using, no trigonemetry is needed for decoding, which would significantly simplify implementation.
Now we have a brightness channel, & 2 colour channels.
The colour channels need to be enlarged to twice their size horizontally and/or vertically, since our eyes have more brightness sensors than colour sensors. There’s several tweening formulas we can use for this.
And we’d need to convert from YCbCr colourspace to RGB. This takes a handful of multiplies, adds, & subtracts clipped to not overflow a byte. Colour TVs have long implemented this formula in hardwired analog circuitry.
Then there’s a “progressive encoding” which sends a smaller-scale image before encoding additional details of larger images. This would require some additional Compositing Unit logic to merge these refinements, whilst allowing resizing to be largely accomplished by truncating the file.
SVG
SVG is an XML-based vector-graphics image format. Its very well suited to iconography, diagrams, charts, & in many cases logos. Things which are common online!
On our hypothetical hardware I’d reuse the XML & (a slight tweak on) CSS-styling infrastructure I’ve already established. With a bit of desugaring I’d parse the path
attribute, amongst other XML-attributes/CSS-properties, to the Layout Co-Processor to initiate “rasterization”.
The tree-SIMT would serially iterate (to maximize adders available to assemble into multiply-summers) over the points to multiply them by a matrix transform, then it’d simplify the paths before interpolating the curves into line-strings.
One path-simplification pass would concurrently check whether each pair of contiguous points are imperceptible close. Another path simplification pass would find the point on a subpath (initially the full path) furthest from the approximating straightline. If the difference is imperceptible replace that subpath with its straightline approximation, otherwise split by that point & (concurrently, in our case) recurse.
If we’re rendering a stroke we duplicate the path & shift each line segment in opposite directions along their normal by half the width. Adding connecting lines as specified by the SVG file.
Vector Fills
I’ve just described how SVG is parsed/processed into shapes to fill. So how do we fill these shapes? There’s 2 common algorithms: Tor-Scanlines (optimal for curves) & Bentley-Ottmann (optimal for straight lines). The hypothetical hardware I designed specifically for laying out & rendering HTML/CSS webpages is very nicely suited for Bentley-Ottmann!
Here we start by converting the line segments into START events & heapify them by the uppermost point (horizontal lines are discarded).
For each point we update a sorted list (a mostly-balanced binary tree would work great on the Tree-SIMT I described) of line-segments intersecting the horizontal scanline also intersecting the event, caching the iterator to optimize for the contiguous line segments we’re dealing with. Inserting, swapping, or deleting line-segments.
We can not only derive trapezoidal output from this, but it also helps us compute INTERSECTION & END events to priority-enqueue.
That yields a (large) sorted list of trapezoids for the Compositing Co-Processor to assemble into rasterized image according to the SVG XML-attributes/CSS-properties. Which is just what the Compositor wants! (It can handle trapezoids due to circuit included predominantly for non-rounded borders) It could even do so concurrently as these trapezoids are being computed! Include a merge-sort & we could rasterize all SVG paths simultaneously.
To save on memory for later frames I’d render SVGs’ sprites into rasters during static frames.
That covers SVG’s core…
Fonts
Having just described how I’d implement vector graphics on our hypothetical hardware: Do you know where we see (for those of us who do see) vectors most often on the web?
Text is rendered via vectors! Modern font files contain bytecode programs for each “glyph” (roughly corresponds to a letter) emitting a vector graphic to represent it.
The fancy branching hardware put in for the CSS display
property can interpret those opcodes.
Furthermore we can collect all the glyphs used for each font file & take full use of the tree-SIMT’s concurrency to render them concurrently into a “font atlas”. Encoding it as a bitmask to save on graphics memory, & attempt to fit the compute in the (small) time between frames. We’d want to recolour these glyphs anyways as we composite them onscreen!
To handle opacity
correctly we’d need to mix all the glyphs (or whatever) into a single layer before applying opacity. Lets assume we have a half-decent display, to avoid me describing we’d hack enough resolution out of it for text to be semi-legible. Which we did use to do in the 1990s, the code is still with us to this day!
As for where these fonts come from… Since in my hypothetical we don’t have an underlying OS I’d make all fonts webfonts. Only some are defined in the useragent stylesheet, whilst others might be defined in userstyles. No features lost!
In the layout-loops prelude (which’ll compile specially from the CSS DSL to use all available hardware) I’d have the Parsing Unit scan the font-face
collection to find the closest-matching at-rule. These would be downloaded & parsed to metacollections.
The Tree-SIMT can then query the various “tables”, which modern font files encode their data into, parsed by the main processor to inform its layout calculations for text data.
Video Playback
There’s a lot of videos online, some are even good entertainment! So how’d I reprogram this hypothetical hardware-browser to play these videos in, say, Xiph’s formats? It turns out: Very straightforwardly!
Xiph designs quality royalty-free public-standard media formats, being possibly best known for FLAC. Though by the time their message got through to industry those royalties are now a moot point. Still… I like Xiph’s formats & documentation better, less design-by-committee!
For any of these formats, I’d task the Arithmetic Core with, amongst other things, controlling the output timing.
Xiph Ogg
Ogg is a lightweight container format for multiplexing media streams (audio, visuals, text, etc) into a single file with error detection & seeking. A very elegant one! Inner streams might care how they’re “chunked”.
Each chunk(s) in an Ogg file are preceding with a header consisting of:
- Identifying bytes “OggS”
- Version
- Start-stream, end-stream, & continue-chunk flag bits in a flags byte
- Position in substream-specific unit
- Substream identifier
- “Page” counter (error detection)
- CRC (error detection)
- Length fields
The pre-established Parsing Unit, with Output Unit help, is perfectly suited to parsing these headers & having the Output Unit demultiplex the results! Using the Arithmetic Core for error detection, & maybe a dedicated CRC circuit. These are even simpler to implement in hardware than they are in software!
There’s some twists in how length is encoded to ensure that both tiny & massive “chunks” can be stored as efficiently as moderately-sized ones. For tiny chunks we store an array of lengths (each 1byte), so they can share a header. Runs of 255bytes-long chunks are concatenated with the subsequent chunk. For massive chunks we use that continue-chunk bitflag I mentioned. The pre-established Input Preprocessor can handle, with or without Output Unit or Arithmetic Core help.
Seeking in Ogg Files
Seeking is handled more like on DVDs/CDs as opposed to previous container formats: using a weighted binary-search. When we seek to a guessed byte offset, we can scan to the next occurrence of “OggS” to locate the next header & determine where we’re at in the file letting us refine our guess. If the error detection checks out we’ve near-certainly found the start of an Ogg chunk. Other formats have included a table-of-contents, but Xiph decided against it for performance & coordination reasons.
Once we’ve defined a convention for pull-pipelines to seek to a given offset, & added optional-support for it in the conversion from push to pull pipelines, the Parsing Unit could locate those chunks under the direction of the Arithmetic Core. Also, you may have noticed filetypes of the substreams aren’t specified in that header, Xiph prefers making sniffing trivial. That’s simpler! And the Parsing Unit can handle it easy enough.
Ogg/Theora
I’ve already discussed audio (FLAC) & I’ll discuss text-rendering (for subtitles) on its own page, leaving the topic of video.
Xiph’s Theora format (named after the Max Headroom character) encodes a frame (or a metadata fragment) per Ogg chunk, thus relying on the container format. Many videoframes are encoded as JPEGs, or a refinement thereof since small improvements quickly add up in video. Even outside Theora. Since I’ve already discussed how I’d decode JPEG, I’ll focus on Theora’s interframe compression.
A Theora interframe consists of pixel-correction values (page 79 of Thoera spec), per-“macroblock” modes (page 69), & motion vectors (page 72) all huffman-coded. Thus rearranging pixels onscreen to resemble the desired imagery. Interestingly an empty Ogg chunk (page 17) gets decoded to a static frame.
In our hypothetical browser-hardware, we can take those motion vectors & apply some mode-dependent maths (page 72-78 of Theora spec; uses a similar compression strategy to PNG) in the Arithmetic Core to compute some sprites indicating where to place each “macro-block” from the previously-rendered frame. Or the previous “reference frame”. The Compositing Co-Processor would then lookup the appropriate pixel from its last-rendered frame (or reference frame) to render at the computed position. Ontop of which it’d render the JPEG-like “QI values” (Theora’s terminology, page 79) encoding difference from the desired image. Since the Parsing Unit would be kept very busy decompressing these frames (amongst tasks for other streams) I’d offloading sorting these sprites into the order the Compositing Co-Processor expects onto the Layout Unit.
One nuance of Theora is that the video’s frame must be multiples of 16px in both dimensions (page 7). But in its metadata Theora can crop the frame (page 6) to any arbitrary size. This proves tricky for the hypothetical hardware I’ve established if I want videos to render across full width of the screen. For simple cases we could let the framebuffer store more pixels than are sent to the screen. For more complex cases (e.g. cropping widescreen to traditional fullscreen) we could bump the performance of the Compositing CoProcessor & include zoom postprocessing. Since the Compositing Co-Processor has proved quite busy otherwise in this hypothetical, we should be able to justify this… And boosting its performance would allow us to handle more resolutions…
Also: We’d need to ensure subtitles don’t get rendered into the framebuffer which Theora’s interframes would rearrange! That’d mess things up…
Ogg/Daala
Daala (named after the Star Wars character) is Xiph’s attempted successor to Theora under Mozilla funding, though due to its unfinished documentation its difficult for me to quickly get my head around it. Daala’s research was then folded into the AV1 format, using a Matroska-variant container format “WebM” since apparently that’s more bandwidth-efficient. Given Ogg’s elegance I can’t imagine much, but in video small improvements quickly add up.
Daala consists of a handful of new compression technologies
- The boundaries between “macro-blocks” (8px squared in Theora) are blurred to make compression artifacts less visible, using a technique called “lapped transforms” similar to JPEG XR. I’m not clear on further details.
- Colours can be more efficiently compressed within human perceptual limits by treating colours as a vector, separating (quantized) direction from (range-coded, I think?) magnitude.
- Range-coding recursively splits a number-line into equal-probability chunks to produce a “codebook” for how to use fewer bits for more common numbers. Solving a related problem to Huffman-coding.
In the hypothetical hardware I’ve described, from what little I can tell, I’d involve my hypothetical hardware the Compositing Co-Processor I’ve established is extremely suited to merging those directions & magnitudes. Its a vector-multiply!
Ogg/Vorbis Audio
Vorbis (Ogg may not admit to being named after the Terry Pratchett character, but Vorbis does!) is a lossy-compressed surround-sound track for Ogg files. Unlike FLAC (which reimplements relevant parts of Ogg) Vorbis requires Ogg’s chunking & cannot stand alone. How’d I implement it in my hypothetical browser-hardware?
Once bitwise-parsed we’d need to compute a sine-wave “window” covering each chunk, we compute a dot-product (a multiply-sum) with the preconfigured “floor” & reverses a “DCT”. While this math isn’t as bad as it sounds, it is still more than was required for FLAC or Speech Synthesis. However I could reuse the “tree-SIMT”/Layout-Unit to perform these handfuls of multiply-sums per output-sample. Whilst leaving plenty of CPU-time left over to render subtitles and sort sprites.
Then there’s also Huffman-coding for the underlying compression.
Pseudocode
Here I’m writing out what the hypothetical code showing what some of this might look like in languages designed for our hypothetical hardware. This is incomplete rough code, it will have loose ends covering only the bits I found most interesting.
QOI
QOI is, quite intentionally, the simplest format I’ve discussed here, though it would require some work to setup the Arithmatic Core to handle it.
QOI
main: header {
!/Math.out
# Allow loading input as instructions to run upon 0 word.
alu << r7:f LIT
alu << IF0; target <- |alu|; 0 r8: LIT IF0 0 J r8 # Ensure rx is reset before executing the program
alu << CMP rx IF0 0 rx: r8
alu << REF rx RAM: r7 INCrx
alu << r9:+1 PC HALT # Capture hashing function in r9.
# Hash & output a pixel
CONST 255 r0:& r0LIT r1:& r1LIT r2:& r2LIT r3:& r3LIT # Clip output to bytes
SHIFT 0,8 OUT| r0r1 SHIFT 0,8 OUT| r2r3 # Output it
SHIFT 1,0 ADDR: r0r0 ADDR: ADDRr1 SHIFT 2,0 ADDR: r1ADDR # 3r + 5g
ADDR: r2ADDR SHIFT 1,2 r7: r2r2 ADDR: r7ADDR # + 7b
ADDR: r3ADDR SHIFT 1,3 r7: r3r3 ADDR: r7ADDR # + 11a
CONST 63 ADDR:& ADDRLIT SHIFT 0,1 ADDR:- ADDR # hash is at top-of-RAM in 32bit words
SHIFT 0,8 RAM:| r0r1 SHIFT 0,8 RAM:| r2r3 HALT # Write to hash & exit
target << |alu|
QOIrun <- alu
} opcode*
header: "qoif" @.{4} @.{4} (3|4) @(0|1) {
# FIXME: Other metadata to send to the layout unit?
!/gfx/Compositor.out
texture!
out << gLD &texture grPOS # Instructions to run to output a pixel, very basic!
out = texture
}
opcode:
-b1111110 @. { QOIrun << CONST text r0: LIT } @. { QOIrun << CONST text r1: LIT } @. { QOIrun << CONST text r2: LIT J rx 0 0 } |
-b1111111 @. { QOIrun << CONST text r0: LIT } @. { QOIrun << CONST text r2: LIT } @. { QOIrun << CONST text r2: LIT } @. { QOIrun << CONST text r3: LIT J rx 0 0 } |
@00x { QOIrun << SHIFT 0,1 CONST text ADDR:- LIT OUT RAM ADDR:++ OUT RAM HALT } |
@01x {
QOIrun << CONST 3 r7: LIT CONST 2 r6: LIT CONST text
QOIrun << r8:& r7LIT r2: r8r2 r2:- r2r6
QOIrun << SHIFT 0,2 ADDR: ADDR r8:& r7LIT r1: r8r1 r1:- r1r6
QOIrun << SHIFT 0,2 ADDR: ADDR r8:& r7LIT r0: r8r0 r0:- r0r6
QOIrun << J rx 0 0
} |
@@10x {
QOIrun << CONST text r7:- LIT r0: r0LIT CONST 15 r8: LIT
} @. {
QOIrun << CONST text r6:& LITr8 r2: r2r8 r2: r2r7
QOIrun << SHIFT 0,8 r1LIT r1: r1r7 J rx 0 0
} |
@@11x { out * text }
Path Simplification
Before rasterization-proper SVG & OpenType vector graphics need benefit from removing imperceptible detail at the display’s resolution.
Contiguous Simplification
[/lang/op/prelude.op];
[path.op];
simplify0 :: L Step -> L Step;
simplify0 = foldFilter isContiguous (Nothing, Nothing);
isContiguous x (Just y) Nothing = x ~= y, Just y, Just x;
isContiguous x Nothing (Just y) = x ~= y, Just x, Just y;
isContiguous x (Just y) (Just z)
| x ~= y && y ~= z = True, Just y, Just z { NOTE: We need to simplify concatenated y & z too! }
| otherwise = x ~= y || x ~= z, Just y, Just z;
isContiguous x Nothing Nothing = Just x, Nothing
{ `foldFilter` would propagate values up the binary tree removing needs as determined by callback,
& reconsidering filter for the subtrees it merges doing so. Highly concurrent }
Gradient Simplification
[/lang/op/prelude.op];
[path.op]; [geom.op]
simplify1 :: L Step -> L Step;
simplify1 path = reduced = first path `Line` last path,
foldAnnotated max (linedist reduced) path -> dist :> annotated
choose (perceptible dist) (pivotWith simplify1 annotated) (drawLine reduced);
choose True truthy _ = truthy;
choose False _ falsy = falsy
{ `foldAnnotated` would mark the path to (in this case) the largest distance,
So that `pivotWith` can split the tree at that point & run given callback on both subtrees
concatenating results }
Rasterization
Vectors are lowered to fills, which will use the Bentley-Ottmann algorithm to lower to Compositing Co-Processor sprites.
Bentley-Ottmann
[/lang/op/prelude.op];
[geom.op];
rasterize :: L Line -> L Sprite;
rasterize = rasterize' Nil . heapify line2event;
Event = Start | Intersect | End
rasterize' scanline events = yieldTraps scanline ++ step $ popMin lineCmp0 events
step ((line, Start), events) = (computeIntersections $ insert line $ iterTo line scanline -> line' :> event',
events' = priorityEnqueue lineCmp0 event' events,
rasterize' line' events')
step ((line, Intersect), events) = (computeIntersections2 $ swap $ iterTo line scanline -> line' :> event1,
computeIntersections $ iterAwaySlope line'-> line'' :> event2,
events' = priorityEnqueue lineCmp0 event1 $ priorityEnqueue lineCmp0 event2 events,
rasterize' line'' events')
step ((line, Intersect), events) = (computeIntersections $ dropTop $ iterTo line scanline -> line' :> event',
events' = priorityEnqueue lineCmp0 event' events,
rasterize' line' events');
{ Need to implement yieldTraps according to SVG parameters }
{ Collections drafted below }
{ Utilizes plenty of geometry }
The collections used there could benefit from merging tree traversal passes!
Heaps
[/lang/op/prelude.op];
Pair a b = `:>` :: a -> b -> Pair a b;
Tuple a b = `,` :: a -> b -> Tuple a b;
Heap a = Branch :: a -> Heap a -> Heap a -> Heap a, Tip :: Heap a;
popMin cmp self = last self -> Branch ret left right :> x,
siftDown $ Heap x left right, ret,
last (Heap x left right@(Heap _ _ _)) = (last right -> right' :> ret, Heap x left right' :> ret)
last _ = error "Non-empty heap!",
siftDown self@(Heap x left@(Heap y left0 right0) right@(Heap z left1 right1))
| x `(`<=` cmp)` y && x `(`<=` cmp)` z = self
| x `(`>` cmp)` y && y `(`<=` cmp)` z = Heap y (siftDown $ Heap x left0 right0) right
| otherwise = Heap z left $ siftDown $ Heap x left1 right1
siftDown self@(Heap x (Heap y left right) Tip)
| x `(`<=` cmp)` y = self
| otherwise = Heap y (siftDown $ Heap x left right) Tip
siftDown self@(Heap x Tip (Heap y left right))
| x `(`<=` cmp)` y = self
| otherwise = Heap y Tip $ siftDown $ Heap x left right
siftDown self = self;
priorityEnqueue cmp self x = heapify cmp $ asList self $ appendL x;
heapify cmp (Branch x left right) = inner $ Branch x (heapify left) (heapify right),
inner self@(Branch x left@(Branch y left0 right0) right@(Branch z left1 right1))
| x `(`<=` cmp)` y && x `(`<=` cmp)` y = self
| x `(`>` cmp)` y && y `(`<=` cmp)` z = Branch y (Branch x left0 right0) right
| otherwise = Branch z left $ Branch x left1 right1
{ Couple more cases for mono-branches }
inner Tip = Tip;
heapify Tip = Tip;
List iteration
iterLeft (L x before after) = rotL' before `rightSet` cons x $ balance after
iterLeft Nil = Nil
rotL' self@(L _ _ Nil) = self
rotL' self = rotL' $ rotL self
{ iterRight is equivalent }
iterTo cmp y self@(L x _ _)
{ FIXME: Find appropriate slot if missing }
| cmp x y -> LE = iterTo cmp y $ iterLeft self
| cmp x y -> EQ = self
| cmp x y -> GE = iterTo cmp y $ iterRight self
iterTo _ _ Nil = Nil
dropTop (L x before after) = rotL' before `rightSet` rotR' after :> x
{ Question: Which way should we swap? }
swap (L x (L y before' after') after) = L y (L x before' after') after
Videos
Ogg is a perfect for the parsing unit!
Ogg Demultiplexer
main: CRC=0x04c11db7 { !/Map.syn; streams <- map } chunks* # What's the actual CRC polynomial?
chunk: "OggS" 0 1(. . . . . . . (1 { out <<< SEGMENT } | 0) isend@. chunk2) validate body
header: 1 8(time@.[4] id@.[4] count@.[4] CRCDAT=0 crc@.[4]) {
# !/checks/Monotonic.syn tracks last time/count in Arithmetic Core to validate next one.
streams << id 0 !/checks/Monotonic.syn !/formats;
out <- stream << id 0;
pos << id time # For seeking code to read
} | 0 8(pos@.[4] id@.[4] page@.[4] CRC@.[4]) {
out <- streams << id 0;
pos << id time # For seeking code to read
}
# X enqueues lengths in Input Preprocessor, read during "@^", trailing .* scans for "OggS".
body: LEN@. (XEND=0 XLEN@255 | XEND=CHUNK XLEN@.)^ @^ .*
# Catch errors!
validate: (LE { error << 1 "non-monotonic" } fail | GT) CRCDAT=crc (CRCf { error << 2 "CRC fails" } fail | CRC0)
fail: .** # Read rest of stream with priority!
Theora meanwhile takes a bit more effort. What I’ve drafted is by no means complete, but can still be mostly-handled by the Input Preprocessor, Parsing Unit, & Compositing Co-Processor!
Theora Decoding
main: identifier comment* setup* frame*;
identifier: 0x80 "theora" vmaj@. vmin@. vrev@. fmbw@.[2] fmbh@.[2]
picw@.[3] pich@.[3] picx@. picy@. frn@.[4] frd@.[4] parn@.[3]
pard@.[3] cs@. 1(qual@.[6] kfgshift@.[5] pf@.[2] . . .) {
# TODO: Handle this configuration!
};
comment: 0x81 "theora" LEN@U32 vendor@* LEN@U32 (LEN@U32 comment@^)^ SEGMENT {
# TODO: Handle this metadata!
};
U32: . . . .;
setup: 0x82 "theora" 1( (- LFLIMS parsing got cropped in the primary PDF... -)
nbits@.[4] (acscale@(. .[nbits]))[63]
nbits@.[4] (dcscale@(. .[nbits]))[63]
nbms@.[9] ((bms@.[8])[63])[nbms]
((0 rpqr@(.|$) { # TODO: Copy data over }
| (1|$) qrbms@.[log nbms]
qi=6 (qrsize@.[qi] nbms@.[log nbms])[{
# TODO: Implement E. Assign qi the value qi + QRSIZES[qti ][pli ][qri ].
# F. Assign qri the value qri + 1.
# & take 62 - ilog(qi) for next qrsize-len.
}]
)[3])[2]
(1 token@.[5] { hts << hbits token } | 0 { # TODO: De-flatten 0 & 1 branches of this tree... } )[79])
SEGMENT {
# TODO: Handle this metadata!
};
# The core logic, all parsed in binary!
frame: 1(0 interframe@. frameheader blocks modes vectors{interframe} qis coeffs SEGMENT {
# Upload data to Compositing Co-Processor via Layout Unit sorting.
};
frameheader: qis@.[6] (0 nqis=1 | 1 qis@.[6] (0 nqis=2 | qis@.[6] nqis=3)) (0 0 0){!interframe};
longrun: nbits==0 | { out = "" } bit@. (lengthcode roffs@.[rbits] {
# FIXME: How much of this belongs in the Arithmetic Core as opposed to Output Unit?
out << bit[1b:rstart + roffs]; bit = !bit; |out| < nbits
})* GE;
lengthcode: 0 rstart=1 rbits=1 | 1 0 rstart=3 rbits=1 | 1 1 0 rstart=5 rbits=1
| 1 1 1 0 rstart=7 rbits=2 | 1 1 1 1 0 rstart=11 rbits=2
| 1 1 1 1 1 rstart=15 rbits=4;
blocks: interframe==0 { bcoded << 1[1b:nbs-1] }
| nbits=nsbs longrun { sbpcoded << out[1b:nsbs-1]; countzeros! << sbpcoded }
nbits=zerocount longrun { sbfcoded << out[1b:zerocount-1]; countones! << sbfcoded }
nbits=onecount longrun {
# TODO: Some nontrivial logic is needed to compute "bcoded" from sbpcoded, sbfcoded, & bits.
# 0 in sbpcoded outputs sbfcoded bit whilst 1 in sbpcoded outputs from bits.
# So express this as parsing `sbpcoded` with axis to `sbfcoded` & `bits`.
# Meanwhile we need to convert from "coded order" to "visual order" whilst
# Computing "superblocks" as indices into sbfcoded.
# The Arithmetic Core can be heavily involved here, finally using a decent fraction of its 16bit address-space!
}
modes: interframe==0 { mbmodes << 1[:nmbs-1] }
| { !/Map.out; malphabet <- map } modecodes
(- TODO: consult malphabet to decode modes in reordered from coded to visual order -);
modecodes: 0 (mi@.[3] { malphabet << mi 0 $0 # FIXME: convert mi to Huffman codes })[7]
| (- harcode mappings, mode=7 has 3bit keys -);
vectors: 0 (mvx@offsetcode mvy@offsetcode)[nbms] | 1 (mvx@signflagged mvy@signflagged)[nbms] {
# TODO: Evaluate mvx & mvy against corresponding mode.
};
signflagged: ret@.[6] (1 { out << ret } | 0 { alu << LIT ret OUT-: CONST; out << alu });
offsetcode: {- TODO -}
qis: { # TODO: Clear bi array } ({ # Assign nbits count of nonzero bcoded bits for which corresponding qii==index }
lengthcode { # Populate qii entry })[nqis - 2];
coeffs: (- This is where parsing gets convoluted, with a bit of metaprogramming to interpret Huffman tables from header! -)
(- Not listing it here... -) {
# Compute multiply-sums to invert DC-prediction... This'll be run during compositing!
};
I’m not gonna rewrite Theora’s standardized logic into Compositing Unit formulas (which here indicated by the “sprite” functions)!
Theora Rendering (Layout Unit functional code)
[/hw/compositing.op];
[/lang/op/prelude.op];
[blockorder.op]; { Glossing over conversions between visual & coded orders... }
rpyw = 16*fmbw; rpyh = 16*fmbh;
rpcw | pf == 3 = 16*fmbw | otherwise = 8*fmbw; rpch | pf == 0 = 8*fmbh | otherwise = 16*fmbh;
reconstruct = map reconstruct' $ 0...pred nbs;
reconstruct' bi
| bcoded ! bi != 0 =
mbi = macroblock bi, qti | mbmodes ! mbi == 1 = 0 | otherwise = 1,
rfi = (1:0:1:1:1:2:2:1:Nil) ! mbmodes ! mbi
ret | rfi == 0 = predictorsprite prevrefy rpyw rpyh nomovement mbi { FIXME: Correct arguments? }
| otherwise = (
refp | rfi == 1 && pli == 0 = prevrefy
| rfi == 1 && pli == 1 = prevrefcb
| rfi == 1 && pli == 2 = prevrefcr
| rfi == 2 && pli == 0 = goldrefy
| rfi == 2 && pli == 1 = goldrefcb
| rfi == 2 && pli == 2 = goldrefcr;
rpw | pli == 0 = rpyw
| otherwise = rpcw;
rph | pli == 0 = rpyh
| otherwise = rpch;
mv = mvects ! bi;
if (round (x mv) && round (y mv))
(predictorsprite refp rpw rph mv mbi)
(halfpixsprite refp rpw rph mv mbi)
)
if (ncoeffs ! bi < 2)
(trivialDCsprite ret acscale dcscale bms nqrs qrsizes qrbms qti pi qi coeffs)
(blocksprite ret (qis ! qiis ! bi) acscale dcscale bms nqrs qrsizes qrbmis qti pli (qis ! 0))
| otherwise =
refp = (prevrefy:prevrefcb:prevrefcr:Nil) ! pli;
rpw | pli == 0 = rpyw | otherwise = rpcw;
rph | pli == 0 = rpyh | otherwise = rpch;
clearsprite $ predictorsprite refp rpw rph mv mbi { FIXME: Which args are needed? }
main = loopfiltersprite:sort spritecompare reconstruct