hx-wls & HTML-XML-Utils

NOTE: This page is a recollection of what I've written previously. I do not guarantee its accuracy.

The first thing Odysseus does to autodiscover any webfeeds on the pages you visit is run hx-wls, from W3C’s HTML-XML-Utils, over the page’s source code to find any links in that page.

hx-wls registers handlers to receive every start tag from a simple parser implemented in Yacc. Those handlers capture many different types of link tags, resolve relative URLs against a base provided by the HTML or by commandline arguments, and output the results in a range of formats. It’s even capable of downloading the HTML to pass through that parser, using either CURL or a naive HTTP/FTP implementation internal to HTML-XML-Utils.
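To make the start-tag approach concrete, here is a minimal sketch in Python rather than the C and Yacc that HTML-XML-Utils actually uses. The class name LinkLister, the LINK_ATTRS table, and the choice to let a <base> tag override the caller's base are all assumptions made for illustration; they are not taken from hx-wls itself.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin

# Which attribute carries the URL for each tag.
# An illustrative subset only, not hx-wls's real table.
LINK_ATTRS = {
    "a": "href", "link": "href", "area": "href",
    "img": "src", "script": "src", "iframe": "src", "frame": "src",
}

class LinkLister(HTMLParser):
    """Collect links by looking at start tags only; no document tree is built."""

    def __init__(self, base=None):
        super().__init__()
        self.base = base   # base URL from the caller (like a commandline argument)
        self.links = []    # (tag, rel, type, absolute URL) tuples

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "base" and attrs.get("href"):
            # Assumption: a <base> in the HTML replaces the caller's base.
            self.base = urljoin(self.base or "", attrs["href"])
            return
        attr = LINK_ATTRS.get(tag)
        url = attrs.get(attr) if attr else None
        if url:
            absolute = urljoin(self.base or "", url)
            self.links.append((tag, attrs.get("rel"), attrs.get("type"), absolute))

lister = LinkLister(base="http://example.com/blog/post.html")
lister.feed(
    '<html><head>'
    '<link rel="alternate" type="application/rss+xml" href="../feed.xml">'
    '</head><body><a href="/about">About</a><img src="pic.png"></body></html>'
)
for link in lister.links:
    print(*link, sep="\t")
```

Because only start tags are inspected, unclosed or misnested elements make no difference to the result, which is the same property that lets the real parser stay simple.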

The parser can afford to be so simple because it doesn’t need to construct a document tree, so the complex implicit closing rules are entirely meaningless here. Its URL implementation parses each URL into its separate components using a regular expression, so it can merge a relative URL with the base before textually removing any /.. segments along with their parent directories.
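The URL handling described above can be sketched along these lines. This is Python, not the C code in HTML-XML-Utils, and every name in it (URL_RE, split_url, drop_parent_refs, resolve) is hypothetical. The regular expression is the generic one from RFC 3986, Appendix B, which may differ from the one hx-wls uses, and the sketch ignores single-dot segments for brevity.

```python
import re

# Generic URL-splitting regular expression from RFC 3986, Appendix B.
URL_RE = re.compile(
    r"^(?:([^:/?#]+):)?"   # scheme
    r"(?://([^/?#]*))?"    # authority
    r"([^?#]*)"            # path
    r"(?:\?([^#]*))?"      # query
    r"(?:#(.*))?$"         # fragment
)

# A "/segment/.." pair, where the segment is not itself "..".
PARENT_RE = re.compile(r"/(?!\.\.(?:/|$))[^/]+/\.\.(?:/|$)")

def split_url(url):
    """Return (scheme, authority, path, query, fragment); absent parts are None."""
    return URL_RE.match(url).groups()

def drop_parent_refs(path):
    """Textually remove each "/.." together with the directory before it."""
    while True:
        path, count = PARENT_RE.subn("/", path, count=1)
        if count == 0:
            return path

def resolve(base, ref):
    """Merge a possibly relative URL with a base URL."""
    b_scheme, b_auth, b_path, b_query, _ = split_url(base)
    scheme, auth, path, query, fragment = split_url(ref)
    if scheme is None:
        scheme = b_scheme
        if auth is None:
            auth = b_auth
            if path == "":
                # Empty reference path: keep the base path (and query if none given).
                path = b_path
                if query is None:
                    query = b_query
            elif not path.startswith("/"):
                if b_auth is not None and b_path == "":
                    path = "/" + path
                else:
                    # Keep the base's directory, then append the relative path.
                    path = b_path[: b_path.rfind("/") + 1] + path
    path = drop_parent_refs(path)
    out = ""
    if scheme is not None:
        out += scheme + ":"
    if auth is not None:
        out += "//" + auth
    out += path
    if query is not None:
        out += "?" + query
    if fragment is not None:
        out += "#" + fragment
    return out

print(resolve("http://example.com/a/b/page.html", "../feed.xml"))
# prints http://example.com/a/feed.xml
```

Working on the path as text keeps the code short, at the cost of edge cases that a full RFC 3986 resolver would handle, such as references that climb above the root.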