Rewrite of the L10n Subsystem

Originally, MidCOM managed the L10n translation tables on a one-snippet-per-string basis, with one subdirectory per language. This has proven inefficient (it was born out of a NemeinLocalization legacy), and it is quite unusable once the data is managed on a filesystem. Therefore, a rewrite of the L10n system is necessary, mainly revising the storage format.

Current situation and Requirements

The old L10n framework was built around the following requirements:

Store all data as UTF-8; convert to the client's charset only when necessary.
Strings are identified by ID strings.

To work around potential encoding troubles with PHP/Midgard, the L10n System stored everything Base64 encoded.

Redesign

The new storage format is aimed at both speed and filesystem-based data management. The basic idea is to use a two-level associative array, indexed first by language, then by string ID. Within the L10n classes, a translation string can then essentially be retrieved with the construct $library[$language][$stringID].
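The two-level lookup can be sketched as follows. The sample strings and the helper l10n_lookup() are hypothetical. Returning the string ID when no translation exists is an assumption, not something this document specifies.

```php
<?php
// Hypothetical string table: first index is the language, second the string ID.
$library = array(
    'en' => array('save' => 'Save', 'cancel' => 'Cancel'),
    'de' => array('save' => 'Speichern', 'cancel' => 'Abbrechen'),
);

// Hypothetical helper: returns the translation, or the string ID itself
// when the language or the string is missing (assumed fallback).
function l10n_lookup(array $library, $language, $stringID)
{
    if (isset($library[$language][$stringID])) {
        return $library[$language][$stringID];
    }
    return $stringID;
}

echo l10n_lookup($library, 'de', 'save'), "\n";
echo l10n_lookup($library, 'fr', 'save'), "\n";
```

Using isset() on the nested index checks both levels at once, so an unknown language does not raise a notice.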

The most important implication is that the complete string table will be held in memory from the very first request to a given library. This will increase performance, but in some situations it may prove to be a bit of a resource hog (mainly where AIS loads all components with their localized names). I think, though, that this still stays well below 100 KB of memory usage in the extreme cases, as the tables are not that big.

These arrays will be stored to disk by serializing them directly, making later retrieval fast and easy. The Base64 encoding will also be dropped; all strings will be stored natively in UTF-8.
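A minimal sketch of the storage round trip, using PHP's serialize() and unserialize(). The file name in the temporary directory and the sample strings are assumptions for illustration only. file_put_contents() requires PHP 5 or later.

```php
<?php
// Hypothetical table with UTF-8 strings stored as-is (no Base64).
$library = array(
    'en' => array('greeting' => 'Hello'),
    'de' => array('greeting' => 'Grüße'),
);

// Hypothetical location; the real file lives in the component's locale directory.
$file = sys_get_temp_dir() . '/l10n_example_stringtable.dat';

// Write the whole two-level array in one step.
file_put_contents($file, serialize($library));

// Read it back; the result is the same nested array.
$loaded = unserialize(file_get_contents($file));

echo $loaded['de']['greeting'], "\n";
```

serialize() records the byte length of each string, so multi-byte UTF-8 text survives the round trip unchanged.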

Note that there will be no locking whatsoever on file access, so if two developers update the same string table simultaneously, only the later change will survive. It also means that Apache must have write permission to the string table files on the development servers.

The string tables will be loaded only once and cached in a central place (most probably a global), which automatically gives quite efficient caching during runtime.
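One way such a load-once cache could look is sketched below. The global key 'l10n_cache' and the function l10n_load_table() are hypothetical names, and keying the cache by file path is an assumption.

```php
<?php
// Hypothetical global cache, keyed by the path of the string table file.
$GLOBALS['l10n_cache'] = array();

// Hypothetical loader: reads and unserializes a table on first use only.
function l10n_load_table($file)
{
    if (!isset($GLOBALS['l10n_cache'][$file])) {
        $data = is_readable($file) ? unserialize(file_get_contents($file)) : false;
        // A missing or unusable file yields an empty table.
        $GLOBALS['l10n_cache'][$file] = is_array($data) ? $data : array();
    }
    return $GLOBALS['l10n_cache'][$file];
}

// Demonstration with a throwaway file in the temporary directory.
$file = sys_get_temp_dir() . '/l10n_cache_example.dat';
file_put_contents($file, serialize(array('en' => array('ok' => 'OK'))));

$first = l10n_load_table($file);   // reads from disk
unlink($file);                     // the file is gone now
$second = l10n_load_table($file);  // still answered from the cache
```

The second call succeeds although the file has been deleted, which shows that later requests within the same run never touch the disk again.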

The string table file will be stored as $component/locale/stringtable.dat, as outlined in mRFC 6. A component will be allowed to have multiple string tables in a single directory, if necessary.

API Changes

The API will stay like it is, with two notable exceptions:

Calls to the set member function will no longer cause a write to the filesystem, since writing on every call would make changes to large string tables a tedious task. Instead, a flush() call will be added that writes the current string table to disk.

As the locale directory will be allowed to contain multiple string tables, the constructor will take an optional second argument identifying the table you wish to access.
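Both changes can be illustrated with a small stand-in class. This is only a sketch: the class name ExampleL10n, the method signatures, the first constructor argument, and the rule that a table named X is stored as X.dat are all assumptions, not MidCOM's actual API.

```php
<?php
// Hypothetical class; names and signatures are illustrative only.
class ExampleL10n
{
    private $file;
    private $table = array();

    // $componentDir: component directory (assumed first argument).
    // $tableName: optional second argument selecting the string table.
    public function __construct($componentDir, $tableName = 'stringtable')
    {
        $this->file = $componentDir . '/locale/' . $tableName . '.dat';
        if (is_readable($this->file)) {
            $data = unserialize(file_get_contents($this->file));
            if (is_array($data)) {
                $this->table = $data;
            }
        }
    }

    public function get($language, $stringID)
    {
        return isset($this->table[$language][$stringID])
            ? $this->table[$language][$stringID]
            : $stringID;
    }

    // Changes the in-memory table only; nothing is written yet.
    public function set($language, $stringID, $value)
    {
        $this->table[$language][$stringID] = $value;
    }

    // Writes the whole table in one step. No locking: the last writer wins.
    public function flush()
    {
        $dir = dirname($this->file);
        if (!is_dir($dir)) {
            mkdir($dir, 0777, true);
        }
        return file_put_contents($this->file, serialize($this->table)) !== false;
    }
}

// Usage with a throwaway component directory.
$componentDir = sys_get_temp_dir() . '/l10n_example_' . uniqid();
$l10n = new ExampleL10n($componentDir);
$l10n->set('en', 'save', 'Save');
$l10n->set('de', 'save', 'Speichern');
$existsBeforeFlush = file_exists($componentDir . '/locale/stringtable.dat');
$l10n->flush();
$reloaded = new ExampleL10n($componentDir);
$otherTable = new ExampleL10n($componentDir, 'admin');
```

Any number of set() calls cost only memory operations; the single flush() at the end is the only disk write.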

Conversion of the old tables

The conversion into the serialized files will be done using a simple PHP script running on a MidCOM-enabled host, preferably the MidCOM development server. It will "just" skim through the complete list of strings available for a given component and add them to a new-style array. The serialized output can be written to /tmp or a similar location, so that it can easily be moved into the new MidCOM source tree.

A simple conversion script writing everything to /tmp can be found here:
http://cvs.devel.midcom-project.org/midcom/de.linkm.taviewer/l10ndump.html
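The core of such a conversion might look like the sketch below. It is not the linked script. It assumes the old Base64-encoded strings have already been fetched into a plain array (the Midgard snippet access is left out), and the sample data and output file name are hypothetical.

```php
<?php
// Assumption: old-style entries already read as language => (string ID => Base64 value).
$oldStrings = array(
    'en' => array('save' => base64_encode('Save')),
    'de' => array('save' => base64_encode('Speichern')),
);

// Build the new-style two-level array with plain UTF-8 values.
$newTable = array();
foreach ($oldStrings as $language => $strings) {
    foreach ($strings as $stringID => $encoded) {
        // Strict mode returns false for input that is not valid Base64.
        $decoded = base64_decode($encoded, true);
        if ($decoded === false) {
            continue; // skip broken entries instead of storing garbage
        }
        $newTable[$language][$stringID] = $decoded;
    }
}

// Write the serialized table to the temporary directory for later pickup.
$target = sys_get_temp_dir() . '/l10n_converted_stringtable.dat';
file_put_contents($target, serialize($newTable));
```

The resulting file can then be copied into the component's locale directory in the new source tree.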