Mozilla Readability

Thanks to (and I think I recall someone else being involved), a new Odysseus release is coming out soon with support for a “reader mode”.

I find it ridiculous that I feel the need to support this feature. It’s saying “webdevs are doing such a poor job that I need to offer to clear away their mess!”

In celebration of this I will describe how this code (the same as used in Firefox and Pocket) works.

When a page loads, this code injects some JavaScript to check if it’s probably “readerable”.

That is, it looks through any visible <p>s (not in an <li>), <pre>s, or <br>-containing <div>s with more than 140 characters, discards some depending on class names, and sums the square roots of the remaining character counts.

If that sum is high enough, it sends a message to the UI telling it to show the button offering a reader mode.
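To make that check concrete, here is a rough sketch of the scoring in Python (not the actual JavaScript). It takes the text lengths of the candidate elements as plain numbers. The threshold of 20 and the choice to take the square root of the length beyond 140 (rather than of the whole length) are my assumptions, and the function names are made up for illustration.

```python
import math

MIN_CONTENT_LENGTH = 140  # from the text above: shorter blocks are ignored
MIN_SCORE = 20            # assumed threshold, not stated in this post


def readerable_score(text_lengths):
    """Sum sqrt(length - 140) over blocks longer than the minimum."""
    return sum(
        math.sqrt(length - MIN_CONTENT_LENGTH)
        for length in text_lengths
        if length > MIN_CONTENT_LENGTH
    )


def is_probably_readerable(text_lengths):
    """True once the summed score passes the (assumed) threshold."""
    return readerable_score(text_lengths) > MIN_SCORE
```

The square root means one very long block can't win on its own as easily as several medium-sized paragraphs, which is roughly what an article looks like.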

When you click that button, there are three layers to the JavaScript run in the page in order to remove its junk.

The first layer uses document.write() to drop the existing markup from the page. Then, in addition to the page’s text, it adds back in its extracted title and byline. It also computes an estimated reading time at 200 words per minute, before removing any attribute besides “src” and “href” and annotating the page with a theme class.
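Two of those steps are simple enough to sketch in Python. This is only an illustration: the function names are hypothetical, attributes are modelled as a plain dict rather than DOM nodes, and rounding up to whole minutes is my assumption.

```python
import math

WORDS_PER_MINUTE = 200  # rate given in the text above
KEPT_ATTRIBUTES = {"src", "href"}


def estimate_reading_minutes(text, words_per_minute=WORDS_PER_MINUTE):
    """Estimate reading time; rounding up to whole minutes is an assumption."""
    word_count = len(text.split())
    return math.ceil(word_count / words_per_minute)


def strip_attributes(attributes):
    """Keep only the src and href entries of an attribute mapping."""
    return {
        name: value
        for name, value in attributes.items()
        if name in KEPT_ATTRIBUTES
    }
```

So a 400-word article comes out at 2 minutes, and an element with "href", "class" and "onclick" attributes keeps only its "href".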

The next layer might (if configured to do so) consider giving up if there are too many elements on the page.

Then it removes all

From there it examines the metatags for useful information (filling in any missing excerpts), and after layer 3 it (unnecessarily here) makes links absolute and removes any classes.

Layer 3 considers elements which:

From there it scores each of those elements by number of:

These scores also count towards the parent elements, but scaled by depth (especially beyond depth 2).
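Here is a small Python sketch of how that depth scaling could work. The exact divisors (1 for the parent, 2 for the grandparent, then 3 times the level) are how I remember it, so treat them as an assumption. The function names are hypothetical.

```python
def depth_divider(level):
    """Divisor for an ancestor; level 0 is the parent, 1 the grandparent."""
    if level == 0:
        return 1
    if level == 1:
        return 2
    return level * 3  # assumed: deeper ancestors are penalised much harder


def propagate_score(score, ancestor_count):
    """Return the share of `score` each ancestor receives, nearest first."""
    return [score / depth_divider(level) for level in range(ancestor_count)]
```

With these divisors a paragraph scoring 12 gives its parent 12, its grandparent 6 and its great-grandparent only 2, so the score drops off sharply past the second level.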

From there it scales these scores by how much of the text is in links (a likely indicator of navigation) and tracks only the top 5 candidates.
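As a sketch of that last step in Python: I'm assuming the scaling is score × (1 − link density), where link density is the fraction of an element's text found inside links. Candidates are plain tuples here rather than DOM elements, and all names are made up for illustration.

```python
import heapq

TOP_CANDIDATE_COUNT = 5  # from the text above


def link_density(text_length, link_text_length):
    """Fraction of an element's text that sits inside links."""
    if text_length == 0:
        return 0.0
    return link_text_length / text_length


def adjusted_score(score, text_length, link_text_length):
    """Scale a score down by link density (assumed: score * (1 - density))."""
    return score * (1 - link_density(text_length, link_text_length))


def top_candidates(candidates):
    """candidates: (name, score, text_length, link_text_length) tuples.

    Returns up to five (adjusted_score, name) pairs, best first.
    """
    scored = [
        (adjusted_score(score, text_len, link_len), name)
        for name, score, text_len, link_len in candidates
    ]
    return heapq.nlargest(TOP_CANDIDATE_COUNT, scored)
```

A navigation block that is 90% link text keeps only about a tenth of its raw score, so it rarely outranks a real article body.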

Then it looks to see if it captures more useful text by looking at ancestors and/or siblings.