No programmer can get far without reusing existing code, whether provided by the operating system or by others. At the very least we need our software to communicate with other programs (such as, crucially, the Linux kernel), & through them humans, to have any reason to exist. Such reusable code is packaged into "libraries" or "modules".
GNU LibC, following the POSIX standards, is the most fundamental library on most UNIX-like systems today. It allows for I/O, memory allocation, filesystem manipulation, etc. Put simply, its primary job is to abstract away the data structures exposed by the Linux kernel, central config files, GCC, & the underlying CPU. Full details on the linked page.
Related: WeLibC’s blog
C++ is a variant of C which adds syntax for classes & more. As such C++ has its own standard library atop C's to take advantage of these syntactic features.
Though projects may implement their own C++ standard library, reimplement something similar but not identical via an extremely verbose pile of C macros, or use a higher-level language.
C++’s stdlib mostly just wraps C’s (which itself often uses methodtables internally), so that should make this faster to get through!
I’m ignoring the headerfiles in this study. I skimmed them, they do contain some nontrivial implementation, but nothing that interesting.
Compiling these C++ files works mostly the same as for C, though with added complications throughout. Still nothing compared to what’s required to make optimal use of a variety of modern CPUs!
In the implementation code there’s an array of prime numbers used for implementing hashmaps in those headerfiles, and maybe elsewhere too.
For the C++ filesystem APIs (as provided by GCC) there's a _Dir_base struct which wraps closedir. Separate classes further abstract this with a C++ iterator, maybe a depth-first stack, & more.
There's platform-independent reexports of the basic file operations, followed by utils for: reading the time, filetype, & status properties from the result of the "stat" syscall; reformatting copy-file options; copying a file from a source path to a destination path after validating that destination using "stat" & the above functions, throwing any relevant errors & [f]chmoding it (uses the sendfile syscall with a byte-copying fallback); & platform-independently retrieving available diskspace.
There’s a large suite of higher level file operations abstracting these, including finding absolute & canonical paths, copying files & symlinks, creating dirs & sym/hardlinks, etc.
There's a path struct (used by directory traversal) which stores a "components" array/stack (split by "/") for userspace traversal alongside the filepath string, overrides the divide operator to concatenate paths, tracks the filepath type, & provides other filepath manipulation/comparison/hashing utilities.
Introduced in C++ 98…
There’s a simple hash function for doubles involving multiplies, adds, & divides. Strings use FNV hashing, as implemented in the headerfiles.
ios_base contains numeric formatting parameters, & error-throwing methods.
basic_istream has an ignore(n) method, with error handling & buffering.
[un]hook doubly-linked list methods.
free_list appears to allocate out of arrays.
codecvt implements format conversion upon generics.
There's a routine for wrapping stdin/stdout/stderr with >> operator & getline methods on basic_istreams; these retrieve the underlying buffer & read chunks until they find the delimiter.
There’s code for serializing & validating float format strings.
There's a global of type … which ensures you can do trigonometry, etc. on long doubles & floats.
There's functions for copying data from one basic_streambuf to another until EOF.
__pool_alloc_base will allocate its larger chunks (split between multiple callers) via bump-pointer, falling back to a smallintmap of linkedlist freelists with error handling.
gslice_to_index does something involving populating certain indices of an array with indices &, in part, values computed via an intermediary array.
There's a locale class for parsing locale configuration from strings, environment variables, & elsewhere. Includes a cache.
There's a redblack tree implementation. A redblack tree is a binary search tree which enforces a balanced depth by having each node track its own balance state, which we label "red" or "black" for the sake of a visual, non-jargony explanation. Thereby ensuring all keys present are equally fast to look up!
There's a stream implementation for strings, strstreambuf, with read/write indices & reallocation.
The __pool allocator has plenty of setup & cleanup, managing a segmented array to allocate out of, with a free array as fallback.
In C++11 the following was introduced to its standard library…
A hashfunction on doubles, & another (FNV) on strings, for use by a hashmap implementation sized to prime numbers (shortcut smallintmap for choosing capacities by length < 13).
A “future” type tracking locked state of arrival of asynchronous data.
A wrapper class around atomic mutexes.
A string implementation for those compiled into your executable, dynamically copied to heap as-needed.
Then there’s new exceptions!
There’s a class for character categorization.
There's a pseudorandom number generator backed by a "random device" class.
There's further wrappers around mutexes like _Sp_locker & atomic conditions, upon which the gthread class is built.
There’s a trivial snprintf implementation.
There’s a class representing locale-related configuration.
A "futex" class wraps those syscalls (as well as time syscalls) & userspace atomics.
“systemclock” class wraps the gettimeofday syscall.
There's an implementation of swappable buffers to underlie I/O streams.
Threads have methods wrapping other syscalls.
There’s copy-on-write strings which reallocate into heap upon first write.
There’s an atomic flags class.
There are more locale classes.
There’s text decoding.
There’s debugging utilities for outputting human-semireadable data & attaching profilers to sequences/collections.
In C++17, the following were added to the standard library:
There's an allocator out of a pre-allocated buffer used for parsing & formatting between numbers & strings. Error handling is involved.
For some reason there’s a struct wrapping fixed-sized int types with operator overloading.
Additional filesystem classes were introduced, including ones for iterating over directories, filepaths (including copy, create, equivalence, etc operations) largely wrapping previous implementations.
There’s a reimplementation of formatting floating point numbers to (decimal or hexadecimal) text.
There's an arena allocator, with its own bitset implementation & suballocators, whereby it increments a pointer into a "chunk" until overflow, at which point another chunk will be allocated. All memory allocated in a "pool" waits until the pool is freed to be released.
Much of this implementation for C++17 is built on code taken from “ryu”…
The code GCC's C++17 implementation takes from Apache's Ryu library uses lookuptables with multiplies, adds, shifts, & comparisons for postprocessing, to compute powers, inverse powers, corresponding divides, & (via a separate lookuptable) reencoding. It may use similar math (without lookup tables) to compute multiplies, shifts, divides, remainders, etc.
It decomposes floats &, using those power computations, a digit lookup table, & debugging messages, formats them as text.
For C++20 I just see an internal headerfile defining basic_[i,o]stringstreams, so I can't comment further. C++20 didn't change the standard libraries much…
Then there’s the more extensive public headers which contains more of the implementation…
When communicating textually it is important to be prepared to teach your program the (natural) languages spoken by its prospective users. I don't think it's reasonable to expect devs to "localize" the software themselves, but thanks in large part to Gettext it is definitely reasonable to expect us to "internationalize" our software!
I’ll study Gettext following along the dev pipeline of extracting, translating, compiling, & applying the translated UI text.
UI String Extraction
To use Gettext you mark any UI text in your programs with the _ macro (or variants) & run the xgettext command over your source code.
After extensive initialization, including parsing extensive per-computerlanguage lists of flags into a hashmap, xgettext parses its extensive commandline flags. It handles --version & --help specially before extensively validating/normalizing those flags. After possibly reading input files from a file, it appends the remaining commandline args.
Then xgettext considers the issue of text encoding, normalizing everything to UTF-8; allocates an array; generates a plaintext metadata entry to append; & possibly parses the previous .po file (using a plugin, like any other language) to append its previous entries, possibly translating their charset.
And finally it iterates over each input file before cleaning up, sorting results, & outputting them as serialized text. For each file it possibly parses a rulelist from XML or infers the computerlanguage (based on filenames, particularly extensions where there's a fastpath, & the XML root tag), & based on that info it parses the file using a selected callback or traverses an XML file looking for translatable text.
C UI String Extraction
I'll describe how the callback for extracting strings from C (and C++/Objective-C) works here, without exploring the other callbacks. They all work more-or-less the same way with a few exceptions, like for Ruby, where it delegates to…
This involves initializing various globals, including what to look for (KDE extends this), then balancing parens like a LISP parser.
Balancing parens involves a recursive inner loop (restarting once it's balanced the parens it found so far) examining each C token (lexer described in subsequent toots). If the token's an identifier it'll set a flag & proceed immediately to symbol logic without unsetting it. If the token's a symbol it unsets that flag & swaps out its iterator, whilst handling ObjC specially. For LParen tokens it recurses, exiting upon EOF, before nulling out the iterators & unsetting the flag. Upon RParen it closes the iterator & exits the innermost recursive loop. For commas it resets context, iterators, & the flag. For colons it handles ObjC specially or nulls the iterators, resetting the flag in either case. For string literals it saves them aside, with or without checking (via previously-saved state) that they're in an appropriate macro call, before nulling iterators & resetting the flag. It closes the iterator & exits upon EOF. Otherwise it nulls iterators & resets the flag.
The C lexer it uses operates in 9 “phases”. The topmost phase of which lightly postprocesses the tokens largely serving to lookup all names in its hashset falling back to calling them “symbol” tokens.
Phase8 is split into 4 subphases. The topmost concatenates consecutive strings. Phase8c strips ‘@’ symbols preceding strings when lexing ObjC. Phase8b drops whitespace. Phase8a lowers inttype macros to strings.
In Phase6 (phase7 got rearranged), for each phaseX token preceded by a '#' it looks for the end of line or a "define" token, buffering these into a "pushback" buffer. Then it checks whether that was a linenumber macro to update that global counter, before freeing the buffered tokens & clearing saved comments. The bodies of "#define"s are left in the tokenstream in case they contain UI text.
PhaseX lowers ‘#’ tokens differently for start-of-line vs mid-of-line.
Phase5 does much of the work, branching over the phase4 initial char!
For EOF chars phase5 emits a EOF token. For newlines it emits end-of-line tokens. For other whitespace it collapses subsequent non-newline whitespace before emitting a whitespace token. For letters & ‘_’ it scans subsequent ones & digits capturing them in a buffer with extensive special handling for C++11 strings before emitting name tokens. For apostrophes it retrieves the next char & emits a character token. Parens, commas, hashes, colons, & (for ObjC) ‘@’s each have their own token. For ‘.’ it decides whether that’s a decimal number or symbol. For digits or decimal numbers it scans & buffers subsequent chars emitting a number token parsed via standard lib. For doublequotes it scans each phase7 char until it sees quotes emitting a string literal token. Otherwise it yields symbol tokens.
Phase7 handles lexing escape sequences in string or char literals for phase5.
Phase4 lexes, removes, & saves aside comments to be attached to translation messages.
Phase3 handles escaped newlines.
Phase2 optionally decodes "trigraphs", a concept grappling with the limitations of early character sets & keyboards, by interpreting "??"-prefixed chars as a different char, e.g. "??(" = '['.
Phase1 counts line numbers & strips escaped newlines.
Phase0 handles alternate newline encodings, using ungetc as a scanner.
To start translating a program using Gettext into a new language you run msginit to copy the extracted UI strings from a .pot file into a new .po file. Today I'll study how that works.
After initializing its own internationalization (amongst a couple other things), parsing commandline flags (handling --version & --help specially), & validating there aren't any more commandline args, msginit locates the .pot file in the current directory if not explicitly given & warns about using the "C" locale.
msginit normalizes the localename, generates an output file name if not explicitly given (warning about overriding existing files), parses the .pot file (next toot), on Windows overrides $PATH, normalizes metadata on each of the UI strings, ensures there's the correct number of plural forms (extracting that count from metadata somehow; I'm failing to find all that code) OR for English there's a variant which prefills from the POT file, outputs these UI strings, & exits.
Opening the .pot file handles "-" specially & incorporates a configurable list of relative paths to search within, as well as a couple of extensions it'll try appending. The successfully opened file's returned to the parser.
The parser allocates/initializes some initial state to be cleaned up on exit, before calling a parse method surrounded by setting/unsetting a global, calling parse_[de]brief methods, & error reporting. The parse method is given externally; other methods collect/track parsed data.
msginit can parse one of 3 "input formats", the default of course being PO.
The .po lexer is handwritten with a scanner dealing with escaped newlines & full UTF8. This lexer repeatedly branches over the initial ASCII char, handles UTF8 EOF errors, or complains about unquoted Unicode characters.
For newlines it unsets some flags. For other whitespace it continues the loop. For '#' it checks whether it's followed by a '~' or '|' (which have special meaning) to skip that prefix & set a flag.
Otherwise ‘#’ scans to the end of line with or without emitting a token as per config. Double quotes collects all chars until the endquote warning about newlines or EOF & resolving escape sequences. ASCII letters, ‘_’, or ‘$’ collects subsequent such chars or digits to convert into the appropriate keyword token or an error. For digits it scans subsequent digits, parses via the LibC, & emits a NUMBER. Brackets are each their own token with state updates.
The parser is implemented via Bison.
Said parser is relatively simple with rules to parse comment (own token), domain (domain & STRING tokens), message (some combination of intro, stringlist, MSGSTR token, & pluralforms). Message intros in turn consist of a MSGID token possibly preceded by MSGCTXT & stringlist and/or PREV variants of these tokens. Pluralforms consist of MSGID_PLURAL token & a stringlist. Each pluralform consists of a MSGSTR token, bracketed NUMBER token, & a stringlist.
A stringlist is a sequence of STRING tokens.
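Putting that grammar together, a single message entry with context & plurals looks like:

```po
#  translator comment
#: src/main.c:42
msgctxt "menu"
msgid "%d file"
msgid_plural "%d files"
msgstr[0] "%d fichier"
msgstr[1] "%d fichiers"
```

The bracketed NUMBER after each msgstr selects which plural form the stringlist translates.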
The stringtable & properties alternate input syntaxes more resemble the C lexing.
For output it first checks whether there is anything to output & whether we want to output nothing anyway. It checks if/how it should complain about there being multiple domains based on the given output syntax class, otherwise checks whether/how it should complain about plurals & contexts. Then it checks whether we're outputting to stdout & whether to configure colouring, & calls into the chosen output syntax.
The output syntax options are the same as for input, but independent from it.
The default PO syntax iterates over each message, outputting the metadata entry specially before extracting the header & in turn charset to inform how it outputs each message, obsolete ones last.
Outputting a message involves outputting all the different comments first then message context (with optional charset warnings), message ID, plural ID, & message string(s).
Colors are added via a rabbithole of a bundled library. I'm not going to discuss libgettextpo, responsible for Gettext's fancy formatting, since it looks like quite the tangent! Though, of course, colouring is optional. Interestingly it does support outputting to HTML for this formatting!
There's a complex utility within the PO writer module for wrapping a text string with its preceding identifier, amongst other simpler helper functions.
The other output syntaxes are quite straightforward without colouring & text wrapping support.
As software continues to develop it’ll add (and remove) UI strings for localizers to keep on top of. If the devs are considerate they’ll stick to a lifecycle where over certain periods they’ll avoid adding new UI strings so the localizers have a chance to catch up.
To incorporate these upstream changes into their localization, localizers use the msgmerge command. After initializing its own internationalization & parsing commandline flags (validating 2 additional commandline args are given), it handles --version & --help specially, validates commandline flags, in update mode ensures output syntax matches input syntax, tweaks flags for use by msgfmt, possibly initializes OpenMP, calls the core logic, applies desired sorting, & handles update mode specially whilst outputting the messages just like msginit.
In update mode it might sort obsolete entries to the end, checks whether any changes were made, & if so generates a backup file & outputs the messages.
The core logic involves reading the two inputs (just like msginit), ensuring there's a metadata message, iterating over the POT file looking for charsets in the headers, converting everything to UTF-8 or otherwise handling text encodings, handling charsets specially in fuzzy mode, allocating an output messagelist, in multidomain mode allocating sublists (whilst matching domains, validating plural expressions, & iterating over existing definitions to find a previously-translated match, fuzzily or not, to combine) & otherwise skipping the sublist allocations, optionally performing postprocessing regarding source locations for msgfmt, & validating results.
The next step of the internationalization/localization workflow described by Gettext’s manual is to actually edit the localizations! Said manual mentions desktop & emacs interfaces, but personally I’m more tempted to study a web interface “WebLate”. Since (even without JS, just forms) the web is a handy way to quickly gather information from people!
WebLate is a self-hostable Django webapp facilitating multilingual translation contributions to software projects which use tools like Gettext!
Upon Django’s builtin accounts system WebLate adds:
- A utility for replacing a user with placeholder fields so you can “delete” their account without deleting their activity
- Track verified email addresses
- Integrates “social” signin including GitHub with (via Celery) periodic cleanup
- eMailing them or web admins, with WebLate-specific topics.
- Basic password checks (without anything silly!)
- Change logging
- Reachability-validated avatar images, with Gravatar integration & cleanup shell-command
- Math captchas (looks screenreader inaccessible…)
- Log-in by email
- Django Admin integration for audit logs, profiles, & verified emails.
- Transferring users between WebLate instances via JSON
- Demo accounts (for advertising WebLate)
- Password resets
- Extends templating language for socialauth
- Preferences forms
- Localization & other config
WebLate bundles a suite of “addons” you can enable or disable which upon database update may:
- Ensure all components within a project accept localizations into the same languages.
- Automatically add or remove components to reflect the Git (or whatever) repo; the logic's in the trans "app" so I'll discuss it later.
- Generate activity summaries before each database change according to configured templates
- Adjust JSON formatting settings.
- Automatically generate .mo files.
- Automatically execute external scripts
- Flag new UI strings as needing editing
- Flag new translations as needing editing
- Flag where translated text is the same as the internal text as needing editing
- Update ALL_LINGUAS var in the configure file
- Various/aggregated cleanup tasks
- Automerge new UI strings
- Autosort Java properties files.
- Configure Gettext formatting
- & extract authors from Gettext comments
Each of these has a form integrated into the admin web & commandline UIs. There's a common superclass & a database model holding JSON for addon-specific configuration.
To provide an HTTP API WebLate has a routing util for capturing which virtually-hosted WebLate instance is selected, an auth token, declarative serializers via rest_framework, routing paths to views via that utility, & various views returning results of database queries.
WebLate's custom "auth" Django-app for some reason implements its own permissions system with WebLate-specific instances, Django-Admin integration, migration utility functions, & various utilities for checking these permissions. It bundles its own setupgroups management command, & declares its own Group & User classes upon Django's baseclasses, for some reason. It also adds a templatetag for checking permissions for the logged-in user. Auth's database model can automatically group users based on email-address regexps.
If you're selling access to your WebLate instance, "billing" adds a database model for which plans your customers are on & their invoices, with Django Admin integration. With Celery-automated tasks & a management command to compute & send invoices. Or there's views to download your invoices online.
WebLate has a Django-app which highlights syntax errors according to a choice of classes checking different interpolation or markup syntaxes, involving gather, sort, & overlap-removal steps to be rendered via a template. It has database models for which checks to perform. There's web UIs for viewing these.
It incorporates a datafile of localized language names, & has management commands for listing ignored checks, listing the top failing checks for untranslated entries, & rerunning checks.
Supported checks include:
- Whether plural UI strings are actually using multiple plural forms.
- That triple dots aren't used in place of ellipses.
- Whether multiple translations have failing checks.
- Angular.js interpolation syntax with equivalent fields.
- Missing, or duplicate, plural forms.
- Whether the same string gets localized differently elsewhere, or historically.
- Translations roughly identical to source.
- BBCode markup
- LXML-parsable XML markup.
- XML with same tags.
- Various checks regarding initial or trailing chars.
- Newline count matches (don’t see the code?).
- No zero-width spaces.
- Charcount limit.
These all share a superclass (with variants), and are often heuristic.
For importing & exporting localizations WebLate implements a minor variation upon Python's builtin io.BytesIO that adds name attributes, a ParseError exception, & infrastructure to call the appropriate importer or exporter class.
Most of these exporters are provided by the Translate Toolkit (which I won't discuss) via wrappers, but there's also an importer for Excel via OpenPyXL, & logic automatically determining which importer to delegate to. They share featureful datamodelling superclasses. Each of its many supported importer & exporter formats gets its own, typically trivial, class.
The "gitexport" Django-app provides utility functions to compute the Git URIs, & views which proxy Git's internals.
The “lang” Django-app datamodels the human languages being localized into with Django Admin integration. Including standard plurals config, datafiles, display web UI, shell commands, & fuzzy matching.
For the sake of its own localization language data is moved into its own near-empty Django-app.
There's a Django-app for tracking agreements to legal terms with its own database model, forms, web UI (with even its own "templates"), a decorator for use by other Django-apps' web UIs or a "middleware" around all, Django-Admin integration, & a template tag for linking to these agreements.
WebLate’s “machinery” Django-app offers classes integrating into various machine translation services including previous translations on this WebLate instance, dummy translations for “Hello, world!”, deepl.com, glosbe.com, translation.googleapis.com, terminology.microsoft.com, mymemory.translated.net, amagama-live.translatehouse.org or other instances, AWS Translate (via boto3 module), microsofttranslator.com possibly with API keys from cognitive.microsoft.com, SAP MT instances, translate.yandex.net, youdao.com, fanyi.baidu.com, or an Apertium instance.
Which have their own configuration fields, and all have a common baseclass helping to abstract HTTP API usage, prioritizing, rate limiting, & language selection.
WebLate's "memory" Django-app offers forms, its own "machinery" integration class, a bridge over to the whoosh module, Celery-automated cleanup, web UI, & shell commands for importing, exporting, deleting, & listing translation-memory XML/JSON files.
WebLate's "screenshots" Django-app offers Django Admin integration, a database fieldtype subclassing ImageField with additional validation, a modelform, database model, Celery-automated cleanup, & a CRUD (Create/Read/Update/Delete) web UI recording illustrative app screenshots clarifying what UI strings refer to. With integrated OCR via tesserocr.
WebLate’s “vcs” Django-app implements support for Git with or without Gerrit or GitHub, Subversion, GPG, Mercurial, & SSH keys as dynamically-loadable subclasses of a common baseclass aiding those in deferring to the commandline; with its own configuration fields.
And there’s various tweaks to the Django Admin, mostly to add a performance dashboard & special SSH keys screen.
For the clientside WebLate vendors Chartist.js, specially-configured BootStrap including their Datepicker, slugify.js, autosize.js, js.cookie, Modernizr.js, Mousetrap.js for keyboard shortcuts, multi.js for nicer multiselects (tells me I’ve got some redesign work to do in Iris…), Clipboard.js, FontAwesome & Font Linux, jQuery, & a couple more fonts for serverside CAPTCHAs alongside its own robots.txt & security.txt.
The general utils WebLate implements for itself include:
- Akismet spam protection
- data simplification for the Celery API
- retrieving the datadir path for a component, dynamically importing classes
- escaping certain regexp constructs for database queries
- event-disabling whilst loading importing data
- validating external components are working fine
- linking to documentation, error logging to Rollbar or Raven
- a JSON database fieldtype
- removing readonly files
- font rendering via PIL
- extra configuration options
- conditional sum database queries
- retrieving IP addresses & useragents
- ratelimiting outgoing requests
- similarity comparer based on Levenshtein distance as implemented by Jellyfish
- text replacement template tag
- noop localization
- a UnitData abstract database model
- tempdirs for unittesting
- validate dependencies’ version numbers
- gather translation statistics, perform various form validation
- & a few utils for building web UIs
There’s symbols & localized labels for EMPTY, FUZZY, TRANSLATED, & APPROVED states.
As well as abstractions around Whoosh, Django Messaging (for use alongside Django ReST framework), Django templating (without autoescaping but with special context), the current site, & Django Templates localization.
At WebLate's core is its "trans" Django-app! This provides:
- events emitted upon version-control system updates
- util to render DiffLib output to HTML
- dataconversion into a dictionary for language statistics
- simple field validations
- debug filter to optionally include useful dev information in error emails
- a regex database fieldtype
- several RSS feeds
- util to retrieve filter choices with IDs & localized labels
- webforms (for translations zen or not, antispam, downloads, uploads [simple, full or extra], various search, merging, reversion, autotranslation, words, dictionary uploads, reviews, letters, comments, engagement, new languages, priorities, context, check flags, adding users, reporting, component settings, create projects & components, project settings & access, text replacement with confirm, language matrices, new units, mass state setting, & license agreements) with checksum superclass
- customized form inputs (for dates, checksums, users, plural textareas & textentries, search filters, fuzziness, & numeric lists)
- fulltext search via Whoosh
- a list subclass for translation checks
- automatic fixes with common baseclass (including replacing trailing tripledots with an ellipsis, removing zero-width space, removing control chars, & whitespace corrections) with utility to apply all enabled ones
- Management commands (list a component's translators, required dependency versions, reread locale files, "lock" editing of a component, unlock said editing, git commit, commit by age, rebuild search index, upload translation file as suggestions, benchmark project import, trigger autotranslate, cleanup orphaned checks, compile messages, fixup flags, & import JSON or git repos) with common base classes
- Inject various vars into all templates
- Automatically find locale files to import from registered repos, integrated as a Celery autorepeated task.
- Util to gather special chars to incorporate into the translation-entry inputs
- Rendering of various “widgets” to proudly display in your READMEs
- Django Admin integration
- Various Celery-automated cleanup & automation tasks
- Various stats-gathering & formatting utils for those badges & elsewhere.
For the sake of templates WebLate provides a template tag to extensively format translations including diff rendering & whitespace highlighting, as well as tags to render random WebLate project self-“adverts”.
Also various simple accessors on checks, name database lookups for slugs, counts, numerous links, rendering of messages & checks & translation progress, outputting time more naturally, querying message state, retrieving messages, aiding a tab UI, & checksum, permission, & legal checks.
But mostly it is lots of views & models!
For WebLate is centred around these database models:
- Comments (text, users, & timestamps)
- Component lists (name, slug, whether to show on the dashboard, & contained components) & automatic ones (regexp on project & component names with a foreign key to the underlying list)
- Components (name, slug, project link, used VCS, repo link, VCS push URL, VCS web link, WebLate git export URI, related component, where to email i18n bugs, repo branch, pathpattern for which files to localize, whether to allow editing the basefile for monolingual translations, basefile for new translations, fileformat, whether it is locked, whether to propagate translations, whether to save history, whether to suggest translations, whether to allow voting, how many votes until autoaccept, enabled checkflags, license for translations with link, contributor agreement link, how to handle new language requests, how to merge translations, which commit messages to use, git identity, whether to push on every commit, how long to wait, regexp on language codes, & a priority to sort by); directly integrates into the VCS system & locates matching files itself
- Projects (name, slug, link, mailinglist, translation instructions link, whether to set “Translation-Team” header, whether to use translation memory, degree of public access, whether to require dedicated reviewers, whether to allow updating via remote hooks, & source language)
- Sources (id hash, component, timestamp, priority, check flags, & context description); has method to run associated checks
- Per-component contributor legal agreements (user, component, & timestamp)
- Whiteboard messages (message, whether it’s HTML-markedup, associated project/component/language, & seriousness level)
- Dictionaries (project, language, source text, & target text)
- Suggestions (target text, user, language, timestamp, & associated votes)
- Votes (suggestion, user, & up/down)
- Change log (associated unit/project/component/translation/dictionary, user, author, timestamp, action enum, target text, old text, & details in JSON)
- Units (associated translation, ID hash, content hash, location, context, comment, flags, source, previous source, target text, state, position, whether it has suggestions, has comments, has failing checks, wordcount, priority, & whether it is pending)
- & last but not least Translations (associated component & language, how plurals are handled as defined by “lang” app, revision, filename, & language code)
Alongside this it defines config fields, & event handling to reflect database changes in repo files. Lots of methods largely relating to VCS, including on Manager/QuerySet classes.
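As one concrete example of how these models interact, the suggestion-voting flow above can be sketched with plain dataclasses. The field names & the autoaccept rule here are illustrative guesses, not WebLate’s actual Django code:

```python
from dataclasses import dataclass, field

@dataclass
class Vote:
    user: str
    positive: bool          # the up/down column

@dataclass
class Suggestion:
    target: str             # suggested translation text
    user: str
    votes: list = field(default_factory=list)

    def tally(self) -> int:
        # net score: upvotes minus downvotes
        return sum(1 if v.positive else -1 for v in self.votes)

    def should_autoaccept(self, threshold: int) -> bool:
        # components store "how many votes until autoaccept";
        # assume a threshold of 0 disables autoaccepting entirely
        return threshold > 0 and self.tally() >= threshold

s = Suggestion("Hallo Welt", "alice")
s.votes += [Vote("bob", True), Vote("carol", True), Vote("dave", False)]
```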
For the core WebLate web UI there’s Django-views for:
- confirming agreement to contributor terms
- setting a user’s groups according to form submission
- paginated changes viewer
- CSV download of those
- view engagement
- download translations to add into your project manually
- machine translation via AJAX
- about, stats, & keys pages
- add a user to a project
- view a project
- create new projects
- create project components
- retrieve yearly activity stats
- retrieve monthly activity stats
- view a component
- view a component’s dictionaries
- upload translations
- remove a user from a project
- view a translation
- edit a dictionary
- retrieve a Unit’s translations (something about JS here…)
- Lock & unlock components or projects (4 views)
- Alter project settings
- Alter component settings
- Find & replace within matching units
- Change public access level for a project
- Render forms to control who can edit a project’s translations
- Error pages
- Delete a dictionary
- Retrieve git status for a project, component, or translation (3 views)
- View all who contributed to localization a.k.a. credits
- Views to respond to GitHub, GitLab, or BitBucket hooks
- Retrieve data for a project
- Upload a dictionary
- Retrieve Machine translation as JSON
- Upload multiple translations simultaneously
- Search all units for some text
- View several counts
- Paginated translations to review
- Expose the various “widgets” online
- Add a new language
- Respond to healthchecks
- view component list
- A homepage which checks for warning messages to show before displaying a user dashboard
- Download a project’s dictionary
- The userdashboard retrieves various activities to display in the HTML template, or shows all & top projects if not logged in
- Add dictionary entry
- Show a project’s dictionary
- Edit the various fields of a source
- View a project/component’s translation matrix
- Detailed per-string translation matrix
- Manage the git repos over the web.
- Extensive handler for submitted translations
- Request autotranslation
- Add a comment
- Delete a comment
- A less busy “zen” translation editor with separate load & save views
- Create new units
Outside the translation view there’s very little in the way of helper functions beyond what I’ve discussed previously. Though Django’s forms framework is used extensively to interpret/validate user input!
Beyond the more boiler-platey logic
msgcat reads & deduplicates the input filenames from commandline args & maybe a given file & calls
catenate_msgdomain_list which in turn parses those files & iterates over them to determine the encoding, again to determine the identifications, count the number of translations for each message, twice to drop unneeded messages, maybe determine common encoding, determine output encoding (ideally no conversion) if not given by user, apply text reencoding, copy all the messages into a single output catalog, & handle duplicates specially.
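That merge-&-deduplicate pass can be sketched roughly as follows, modelling each catalog as a plain msgid→msgstr dict. This is purely illustrative of the flow; gettext’s real message lists carry far more state (comments, flags, positions, etc):

```python
def catenate(catalogs, min_count=1):
    # count how many inputs define each msgid, merge keeping the first
    # translation seen, & flag conflicting duplicates for special handling
    counts, merged, conflicts = {}, {}, set()
    for cat in catalogs:                       # each catalog: msgid -> msgstr
        for msgid, msgstr in cat.items():
            counts[msgid] = counts.get(msgid, 0) + 1
            if msgid in merged and merged[msgid] != msgstr:
                conflicts.add(msgid)           # duplicates handled specially
            merged.setdefault(msgid, msgstr)
    # only keep messages appearing in at least min_count inputs
    out = {m: s for m, s in merged.items() if counts[m] >= min_count}
    return out, conflicts

out, dupes = catenate([{"Hello": "Hallo", "Bye": "Tschüss"},
                       {"Hello": "Servus"}], min_count=1)
```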
msggrep parses 5 input “grep tasks” deferring to a
libgrep to handle multiple regexp syntaxes using them to compile the given regexps, before filtering all messages by filename, msgctxt, msgid, msgid plural, msgstrs, translator comments, & other comments.
msgcomm works basically the same way as
msgcat but with additional globals.
msgconv calls iconv on all parsed messages before writing them back out possibly sorted.
msgfilter runs a subcommand (or builtin function) upon all parsed messages once text-reencoded before serializing them back out possibly sorted.
msgexec runs a subcommand upon all parsed messages once text-reencoded echoing their output instead of serializing results.
msguniq catenates a single file.
msgcmp removes obsolete entries between 2 inputs, extracts textencoding from header fields to ensure if one's UTF-8 the other is as well, canonicalizes text for fuzzy matching if requested, allocs an output, iterates over messages to retrieve & display matching entries in the other file, & a final iteration outputs strings which weren't present.
msgattrib reads the catalog possibly alongside allow/block-list catalogues to filter by, & iterates over it to update fuzzy & obsolete flags.
msgen (did I cover this already? Name assumes English is the source language) copies source text to translated text for each entry in the parsed file.
You can also build your own utils based on the same library all these commands I’ve been describing use.
Compiling PO files
Once you have fully-enough translated .po files Gettext requires you to compile them into .mo files, each an on-disk sorted parallel array with an optional hashmap index! To do so you use the
msgfmt command, which can be reversed with the
msgunfmt command.
After initializing its own internationalization & parsing commandline flags handling --help & --version
msgfmt validates there are additional commandline args unless handling XML or .desktop input.
msgfmt extensively validates those commandline flags, handles .desktop files or directories of them specially echoing the data it parses with added localizations, handles XML mode specially parsing a rulelist before merging it with all XML in a directory both utilizing an external XML parser, possibly allocates a new domain in the absence of an output filename, reads the specified input file according to the specified syntax, checks that syntax produces UTF8, & removes obsolete strings.
With special cases out of the way & .po (or whatever) messages parsed
msgfmt now iterates over the catalog domains to check plural formulas match the counts seen elsewhere in the file whilst trial-evaluating said formulas & various other per-message basic checks (i.e. begins or ends with newlines, validates format strings match, validates both have accelerators, & validates metadata has all necessary fields), then outputs the messages in the appropriate syntax & maybe outputs stats.
For .mo output, after validating there are in fact messages to output,
msgfmt deletes the “POT-Creation-Date” header for reproducible builds before opening the output file if not stdout (a.k.a. “-“) taking care not to overwrite existing files and:
- With some arrays allocated iterates over all messages to concatenate msgctxt & msgid into msgctid, tests for system-dependent strings, parses C or ObjC format strings to see if there are any platform-specific directives, & gathers strings into the appropriate array.
- Sort the platform-independent strings if any found
- Computes min output version
- Computes hashmap (if desired) size of a prime number at least 4/3s full & > 3
- Gather a header struct with or without including the headerfields for platform-specific strings
- Optionally apply a byteswap to the header & output it.
- Iterate over strings to prepare length & offset fields optionally byteswapped before outputting them.
- Do same for their corresponding translations.
- If outputting a hashmap index alloc/zero said hashmap, insert each entry (using HashPJW hashfunction with increment-rehashing), optionally byteswaps each entry, & writes them out.
- If including platform-specific strings, generate an array splitting them by platform, writing the segments header out followed by the clustered strings
- Write each original string then all of their translations
- If needed do same for platform-specific strings
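The hashmap-sizing step in that list can be sketched like so, using simple trial division as a stand-in for gettext’s actual next-prime search: pick a prime at least 4/3 the message count (so the table stays at most ~75% full) & always > 3.

```python
def is_prime(n):
    # simple trial division; fine for the table sizes involved here
    if n < 2:
        return False
    d = 2
    while d * d <= n:
        if n % d == 0:
            return False
        d += 1
    return True

def hash_table_size(n_messages):
    # smallest prime > 4/3 of the message count, & always > 3
    candidate = max(4, (n_messages * 4) // 3 + 1)
    while not is_prime(candidate):
        candidate += 1
    return candidate
```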
After initializing & parsing/validating commandline flags handling --help & --version
msgunfmt parses each input file for the specified syntax, possibly sorts messages by their ID, & outputs them back out using the same .po serialization most other commands use!
For .mo files (also C#, or separately Java, C#, or TCL) it opens the file (if not stdin a.k.a. “-“), checks whether we need to swap the byteorder, performs format validations, & iterates over all strings reading them into a messagelist.
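The core of the .mo layout msgfmt writes & msgunfmt reverses can be sketched as a minimal writer & binary-search reader: a 7-field uint32 header, then two sorted parallel length/offset tables, then the strings themselves. The optional hashmap index & platform-specific segments are omitted here for brevity:

```python
import struct

MAGIC = 0x950412DE  # little-endian .mo magic number

def write_mo(messages):
    entries = sorted(messages.items())          # .mo keys must be sorted
    n = len(entries)
    header_len = 7 * 4                          # 7 uint32 header fields
    otab_off = header_len                       # original-strings table
    ttab_off = otab_off + n * 8                 # translations table
    blob_off = ttab_off + n * 8                 # string data follows
    otab = ttab = blob = b""
    for key, val in entries:
        k = key.encode() + b"\0"
        v = val.encode() + b"\0"
        # stored length excludes the trailing NUL; offsets are absolute
        otab += struct.pack("<II", len(k) - 1, blob_off + len(blob))
        blob += k
        ttab += struct.pack("<II", len(v) - 1, blob_off + len(blob))
        blob += v
    # magic, revision, count, orig table, trans table, hash size, hash off
    header = struct.pack("<7I", MAGIC, 0, n, otab_off, ttab_off, 0, 0)
    return header + otab + ttab + blob

def lookup(mo, msgid):
    magic, _rev, n, otab, ttab, _hsz, _hoff = struct.unpack_from("<7I", mo, 0)
    assert magic == MAGIC
    key = msgid.encode()
    lo, hi = 0, n                               # binary search on sorted keys
    while lo < hi:
        mid = (lo + hi) // 2
        klen, koff = struct.unpack_from("<II", mo, otab + mid * 8)
        k = mo[koff:koff + klen]
        if k == key:
            vlen, voff = struct.unpack_from("<II", mo, ttab + mid * 8)
            return mo[voff:voff + vlen].decode()
        if k < key:
            lo = mid + 1
        else:
            hi = mid
    return None

mo = write_mo({"Hello": "Hallo", "Bye": "Tschüss"})
```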
Once you’ve gone through the translation process I described the tools for above, you now need to actually incorporate those translations into your software! For this the functions you call to mark text to be translated also look up those translations to be swapped into the UI.
But first you need to call
textdomain to set the catalog from which it looks these UI strings up.
If its argument is NULL
textdomain returns the current global.
textdomain claims a writelock, then examines the arg further. If it’s empty
textdomain sets the global (and local) to “messages”. If it’s unchanged it sets the local. Otherwise it sets the global & local to a new copy of the arg.
In any case if the new local is non-NULL it increments the catalog counter & considers freeing the old value, before releasing the lock & returning the new value.
textdomain is a slightly fancy accessor.
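Python’s bundled gettext module mirrors this C API, so wiring up a catalog looks roughly like this. Here “myapp” & the “locale” directory are hypothetical; with fallback=True lookup degrades to returning strings untranslated when no matching .mo catalog is found, just like the C implementation does:

```python
import gettext

# select the catalog, analogous to textdomain()/bindtextdomain() in C
t = gettext.translation("myapp", localedir="locale", fallback=True)
t.install()   # binds `_` as a builtin, akin to `#define _(s) gettext(s)` in C

print(_("Hello, world!"))   # looked up in the catalog, or passed through
```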
The rest of Gettext's API including
_ are trivial wrappers around
dcigettext. Here I’ll describe how
dcigettext works when the args unspecified by
_ are NULLed out. Domains, categories, & plurals will be described later. All of which are handled in this function.
If the UI string is unspecified
dcigettext returns NULL. Otherwise it saves the error number, claims readlocks, & retrieves the configured catalogue.
After that initialization it populates a cache-search key, searches that binary-search-tree under a readlock, & returns the looked-up translation if it found one cached, releasing locks & restoring errorcodes.
Otherwise it determines whether it needs to be more careful due to running as root, determines the directory to search for in the path, & iterates over all configured locales exiting upon “C” or “POSIX”,
mmaping & validating & caching the files so it can lookup translations in them.
If it successfully found a translation
dcigettext updates the cache (checking whether there's an existing cache entry it can overwrite) before restoring errorcodes, releasing locks, & returning the result.
Otherwise if not in a SUID program it checks
$GETTEXT_LOG_UNTRANSLATED to see if (under a lock) it should log the untranslated UI string, possibly lazily-opening the logfile to do so, to aid localizers in prioritizing. Then returns the untranslated string!
Searching for a localization in a
mmaped file involves checking if said file has a hashtable. If so it performs a hashmap lookup (HashPJW with increment rehashing until tombstone or match), otherwise performs a binary search over the sorted keys table (almost as fast!).
In either case upon success it looks up the translation in either the cross-platform or platform-specific arrays, extensively considers whether we need to convert text encodings, & returns the result with its length.
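The hashmap side of that lookup can be sketched as follows. The HashPJW function & the 1 + hash % (size - 2) probe increment match the scheme described above, though gettext’s real .mo table stores string indices (with 0 meaning empty) rather than the keys themselves:

```python
def hashpjw(s: bytes) -> int:
    # the classic P.J. Weinberger string hash, 32-bit
    h = 0
    for byte in s:
        h = ((h << 4) + byte) & 0xFFFFFFFF
        g = h & 0xF0000000
        if g:
            h ^= g >> 24
            h ^= g          # clears the top nibble again
    return h

def insert(table, key: bytes):
    size = len(table)
    h = hashpjw(key)
    # per-key increment; a prime table size guarantees a full probe cycle
    idx, inc = h % size, 1 + h % (size - 2)
    while table[idx] is not None:
        idx = (idx + inc) % size
    table[idx] = key

def lookup(table, key: bytes):
    size = len(table)
    h = hashpjw(key)
    idx, inc = h % size, 1 + h % (size - 2)
    while table[idx] is not None:    # probe until empty slot or match
        if table[idx] == key:
            return idx
        idx = (idx + inc) % size
    return None

table = [None] * 7                   # a prime-sized table
for k in (b"Hello", b"Bye", b"Yes"):
    insert(table, k)
```

In the .mo format this table is built once at compile time by msgfmt; at runtime only the lookup side runs.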
What with synonyms & context sometimes the untranslated UI string is not enough to identify the appropriate translation! So for disambiguation
dcigettext & some of its wrappers accept a “category” & heavier-weight “domain”.
Categories get validated first-thing (after the untranslated UI string) at the start of each call with
LC_MESSAGES_COMPAT being converted into the default
LC_MESSAGES. They are incorporated into the caching, & are used in determining which locale to use!
I’m failing to see where the functions it calls to convert the category into a locale are defined even searching online, but I think I can infer they relate to LibC's APIs. Unless this locale is “C” it then consults $LANGUAGE & applies platform-specific preprocessors to normalize format, before returning that priority list.
The category is then incorporated into the filename of the .mo files it should consult. I described how it handles the priority list yesterday.
Domains default to a configurable global, are considered in cache lookup, located within a global path to get a directory to look for the .mo within, are incorporated into the .mo filepath, & domains are incorporated into missing-message logging.
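Putting domain, category, & locale together, the catalog path it ends up searching looks like the sketch below (with “myapp” as a hypothetical domain):

```python
# gettext composes the catalog path from its pieces as roughly:
#   <dirname>/<locale>/<category>/<domain>.mo
def mo_path(dirname, locale, category, domain):
    return f"{dirname}/{locale}/{category}/{domain}.mo"

print(mo_path("/usr/share/locale", "de_DE", "LC_MESSAGES", "myapp"))
```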
Many if not most natural languages have different grammatical structures (“pluralforms”/“plurals”) to indicate different quantities. Though not every language agrees on how quantities map to their pluralforms! e.g. Is 0 plural, not, or something else?
One of those things you might assume is trivial…
Gettext’s facilities for this assumes English as a source language, though I suspect those assumptions can easily be overcome for programming in other languages.
dcigettext & its many wrappers will resolve plurals (count defaults to 0) once a translation has been successfully looked up in the configured/selected catalogue. If unsuccessful it may optionally apply an English-like “germanic”
n == 1 pluralform between the 2 given strings.
This involves interpreting (for the given count) the plural formula from the catalogue & iterating over the multistring to find the computed index, which .mo compilation validates stays in-range via bruteforce.
Interpreting the plural formula is done over the abstract syntax tree recursively, branching over the number of operands (0-3 inclusive) before branching over & applying the mathematical operation.
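A toy version of that recursive evaluation, paired with the multistring indexing from above. The node shape & operation names here are illustrative, not gettext’s actual expression structs:

```python
def eval_plural(node, n):
    # each node is a tuple: (operation, *operands); 0-3 operands inclusive
    op, *args = node
    if op == "n":                        # 0 operands: the count variable
        return n
    if op == "num":                      # literal number
        return args[0]
    if op == "?:":                       # 3 operands: C-style ternary
        cond = eval_plural(args[0], n)
        return eval_plural(args[1] if cond else args[2], n)
    vals = [eval_plural(a, n) for a in args]   # binary operations
    if op == "==":
        return int(vals[0] == vals[1])
    if op == "!=":
        return int(vals[0] != vals[1])
    if op == "%":
        return vals[0] % vals[1]
    raise ValueError(f"unknown operation {op!r}")

def pick_pluralform(multistring: bytes, formula, n) -> bytes:
    # translations are stored as NUL-separated pluralforms; the formula's
    # result indexes into them (validated in-range at .mo compile time)
    return multistring.split(b"\0")[eval_plural(formula, n)]

# the default Germanic formula: plural = (n != 1)
germanic = ("!=", ("n",), ("num", 1))
msgstr = b"1 Datei\0%u Dateien"          # two NUL-separated pluralforms
```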
Said expression is parsed when loading in the .mo file by locating the “plural=” & “nplurals=” headerfields of the metadata entry (translation for “”) parsing nplurals via
strtoul after scanning digits, & parsing plural using Bison & a manual lexer. Relatively trivial usage. Defaults to returning an AST representing the Germanic
n != 1 expression.