libxml2

Recently I’ve been describing the APIs I call (possibly via Vala’s syntactic sugar) to implement the core of Odysseus’s templating language. Now I want to describe some of the parsers I wrap around its data model.

For the sake of rendering webfeed previews, I wrap libxml2 in a layer that hides the difference between attributes and elements (for the sake of RSS) and handles Atom’s type attribute.
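To make the idea concrete, here is a minimal sketch of such a lookup. It is written in Python with the standard library's xml.etree.ElementTree rather than in Vala against libxml2, and the `lookup` helper is a hypothetical name of mine, not Odysseus's actual code. It leaves Atom's type attribute out entirely.

```python
import xml.etree.ElementTree as ET

def lookup(node, name):
    """Return the value of `name`, whether it is an attribute or a child element."""
    if name in node.attrib:
        return node.attrib[name]
    child = node.find(name)
    if child is None:
        return None
    # For a child element, use all the text it contains.
    return "".join(child.itertext())

# Hypothetical RSS-like fragment: one value is an element, the other an attribute.
rss_item = ET.fromstring('<item><title>Hello</title><enclosure url="a.mp3"/></item>')
enclosure = rss_item.find("enclosure")
```

A template can then ask for `lookup(rss_item, "title")` and `lookup(enclosure, "url")` the same way, without caring which of the two forms the feed used.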

libxml2 (which, despite being a GNOME project, uses neither GObject nor GIO; the lack of GIO is a minor inconvenience to me) starts with a component that wraps its input in an “input stream” and interprets the bytes, translating them into calls to “SAX” methods.

That input stream handles reading from an in-memory buffer and possibly a callback function. It also handles tracking line numbers for error reporting and switching text encodings via another callback.
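A rough sketch of what such an input stream does might look like the following. This is Python, not libxml2's C, and the `InputStream` class and its `read_more` parameter are made up for illustration. Switching text encodings is left out.

```python
class InputStream:
    """Hypothetical sketch: serves bytes from a buffer, optionally refilled by a callback."""

    def __init__(self, data=b"", read_more=None):
        self.buf = data
        self.pos = 0
        self.line = 1               # current line, for error reporting
        self.read_more = read_more  # callable returning more bytes, b"" at end of input

    def next_byte(self):
        # Refill from the callback once the in-memory buffer is exhausted.
        if self.pos >= len(self.buf) and self.read_more is not None:
            self.buf = self.read_more()
            self.pos = 0
        if self.pos >= len(self.buf):
            return None             # end of input
        byte = self.buf[self.pos]
        self.pos += 1
        if byte == 0x0A:            # newline: bump the line counter
            self.line += 1
        return byte

# Usage: feed the stream from a callback that hands out two chunks.
chunks = iter([b"<a>\n", b"</a>\n"])
stream = InputStream(read_more=lambda: next(chunks, b""))
while stream.next_byte() is not None:
    pass
```

After draining the stream, `stream.line` tells an error message which line the parser had reached.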

The parser verifies that end tags are balanced, using the call stack as its implicit stack, and supports two different versions of the SAX standard. libxml2’s default implementations of the SAX callbacks construct a linked tree representing the XML nodes and validate it against the XML schema.
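The trick of letting the call stack double as the tag stack can be sketched as a small recursive checker. This is a toy in Python, not libxml2's parser: the `TOKEN` pattern, `parse_children` and `check_balanced` are names I made up, and the tokenizer ignores attributes, comments and everything else a real parser handles.

```python
import re

# Toy tokenizer: <name>, </name>, <name/>, or a run of text. No attributes.
TOKEN = re.compile(r"<(/?)([A-Za-z][\w.-]*)\s*(/?)>|[^<]+")

def parse_children(tokens, pos, open_tag):
    """Consume tokens until `open_tag` is closed; recursion depth mirrors nesting depth."""
    while pos < len(tokens):
        closing, name, self_closing = tokens[pos]
        pos += 1
        if not name or self_closing:
            continue  # text or <empty/>: nothing to balance
        if closing:
            if name != open_tag:
                raise ValueError(f"mismatched </{name}>")
            return pos
        # Recurse for the nested element; the call stack remembers `open_tag` for us.
        pos = parse_children(tokens, pos, name)
    if open_tag is not None:
        raise ValueError(f"unclosed <{open_tag}>")
    return pos

def check_balanced(text):
    tokens = [m.groups() for m in TOKEN.finditer(text)]
    parse_children(tokens, 0, None)
    return True
```

No explicit stack of open tags appears anywhere: each recursive call holds exactly one open tag name in its `open_tag` argument.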

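As for who looks up namespaces, that’s the main difference between the two SAX versions: with the older one the callbacks resolve them, whereas with the newer one the parser does it before calling them.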
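Python's standard xml.sax module has a similar split, which makes for a convenient illustration (this is Python's API, not libxml2's; the `Recorder` class and `element_names` function are my own names). With the namespaces feature off, the handler receives raw qualified names and would have to resolve prefixes itself; with it on, the parser hands over (URI, local name) pairs.

```python
import io
import xml.sax
from xml.sax.handler import ContentHandler, feature_namespaces

DOC = b'<feed xmlns:a="http://www.w3.org/2005/Atom"><a:title/></feed>'

class Recorder(ContentHandler):
    def __init__(self):
        super().__init__()
        self.seen = []

    def startElement(self, name, attrs):
        # Namespace-unaware callback: `name` is the raw qualified name.
        self.seen.append(name)

    def startElementNS(self, name, qname, attrs):
        # Namespace-aware callback: `name` is a (uri, localname) pair.
        self.seen.append(name)

def element_names(namespaces):
    parser = xml.sax.make_parser()
    parser.setFeature(feature_namespaces, namespaces)
    handler = Recorder()
    parser.setContentHandler(handler)
    parser.parse(io.BytesIO(DOC))
    return handler.seen
```

Calling `element_names(False)` records the prefixed name as written in the document, while `element_names(True)` records the resolved Atom namespace URI alongside the local name.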

The XML tree constructors are what handle merging consecutive text nodes, and the SAX delegate has a field through which the parser returns the constructed tree from its API entrypoints.
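Here is a small sketch of SAX callbacks that build a tree, merge adjacent text, and hand the result back through a field. It uses Python's xml.sax as a stand-in for libxml2, and the `Node`, `TreeBuilder` and `parse` names are hypothetical.

```python
import xml.sax

class Node:
    def __init__(self, name, parent=None):
        self.name = name
        self.parent = parent
        self.children = []    # mix of Node objects and text strings

class TreeBuilder(xml.sax.ContentHandler):
    def __init__(self):
        super().__init__()
        self.root = None      # the finished tree is handed back through this field
        self.current = None   # element currently being filled in

    def startElement(self, name, attrs):
        node = Node(name, self.current)
        if self.current is None:
            self.root = node
        else:
            self.current.children.append(node)
        self.current = node

    def endElement(self, name):
        self.current = self.current.parent

    def characters(self, content):
        if self.current is None:
            return
        kids = self.current.children
        if kids and isinstance(kids[-1], str):
            kids[-1] += content       # merge with the preceding text node
        else:
            kids.append(content)

def parse(data):
    builder = TreeBuilder()
    xml.sax.parseString(data, builder)
    return builder.root

root = parse(b"<a>one &amp; two<b/>three</a>")
```

The parser is free to deliver "one & two" in several pieces, but the tree ends up with a single text child before the `b` element because `characters` appends to the previous text rather than starting a new node.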

Callers like Odysseus can then process this linked XML tree directly however they like, possibly with help from utility functions which serialize the tree into XML or extract its contained text.
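For a feel of those two kinds of utility, here is the equivalent pair in Python's xml.etree.ElementTree (an analogy, not libxml2's own functions, though libxml2 offers counterparts such as xmlNodeGetContent for the contained text):

```python
import xml.etree.ElementTree as ET

tree = ET.fromstring("<p>Hello <b>web</b> feeds</p>")

# Serialize the subtree back into XML markup.
as_xml = ET.tostring(tree, encoding="unicode")

# Or flatten it to just the text it contains, dropping the tags.
as_text = "".join(tree.itertext())
```

The first is what you want when passing markup through to the page, the second when a plain-text summary is enough.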