random thoughts to oil the mind

Tag: Regex

Commenting memoQ Light Resources

Editing memoQ’s light resources can be a painful experience, with somewhat clunky interfaces, intimidating lists, small default window sizes, and the inability to add comments to rules written with regular expressions. Anyone who’s tried to pin down an error in a list of dozens of autotranslatable rules will know the maddening experience of wading through reams of similar-looking rules to find the culprit. It doesn’t help that any edited rules are automatically re-added to the bottom of the list, so any initial effort at structuring rules according to their purpose gradually falls apart over time.

As Kevin Lossner recently pointed out, one clever strategy is to export copies of the rulesets, add comments to these XML files, and essentially manage the rules outside of memoQ. This makes the rules themselves easier to navigate, and indeed edit, with changes being implemented with a simple reimport of the file.

However, there can be a couple of disadvantages to this solution, depending on your workflow. Firstly, the comments only work in one direction, as they’re lost on import. If you make any alterations to resources from within memoQ, exporting the updated ruleset would mean having to combine the files or otherwise restore all the comments. Secondly, memoQ prevents you from overwriting resources on import, so while you can always add a new version and delete the previous one, it can prove to be a royal pain if you also need to update a large number of templates and/or ongoing projects which are using the resource.

Fortunately, there is one more cheeky option available to us to make our rules easier to read, and that is to abuse named capturing groups and use them as comments. For example, take a rule from a cascading filter for tagging the asterisks in a Markdown list:

^\s*\*\s+

We can simply wrap the whole rule in a named capturing group, whose name makes it easy to identify:

(?<unordered_list>^\s*\*\s+)

Once all rules have been commented, we immediately have a much cleaner overview of the ruleset, especially useful when you need to go back in and tweak or add something later.

[Screenshot: Example cascading filter for Markdown]
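As a quick sanity check that the name really is just a label, here is a minimal sketch in JavaScript. It assumes JavaScript's regex engine rather than the one memoQ uses, and the sample line is made up for illustration, so treat it as a demonstration of the idea only.

```javascript
// The original rule and the same rule wrapped in a named group used as a label
const plainRule = /^\s*\*\s+/;
const commentedRule = /(?<unordered_list>^\s*\*\s+)/;

// Hypothetical sample line from a Markdown list
const line = "  * first item";
const plainMatch = line.match(plainRule);
const commentedMatch = line.match(commentedRule);

// Both rules match exactly the same text; the name only adds a label
console.log(plainMatch[0] === commentedMatch[0]);

// The label travels with the match, which also helps when debugging
console.log(Object.keys(commentedMatch.groups));
```

Because the group wraps the entire expression, the matched text is unchanged, and only the bookkeeping of groups differs.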

Note that one potential side-effect of this strategy is that mixing named and numbered capturing groups may upset the numbering, so particularly for autotranslatable rules it may be easiest simply to use named capturing groups consistently throughout.
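The numbering issue can be illustrated with a small sketch, again in JavaScript. This is an assumption-laden example: JavaScript counts named groups together with unnamed ones in order of their opening bracket, whereas other engines (possibly including the one behind memoQ) may number them differently, and the group names and sample text here are invented.

```javascript
// A rule with two numbered groups: $1 is the whole part, $2 the fraction
const numbered = /(\d+)\.(\d+)/;

// The same rule wrapped in a named "comment" group (hypothetical name)
const wrapped = /(?<decimal>(\d+)\.(\d+))/;

// In JavaScript the wrapper now occupies index 1, shifting the others along
console.log("3.14".match(numbered)[1]);
console.log("3.14".match(wrapped)[1]);

// Naming every group sidesteps the question of which number belongs to what
const named = /(?<decimal>(?<whole>\d+)\.(?<fraction>\d+))/;
const converted = "3.14".replace(named, "$<whole>,$<fraction>");
console.log(converted);
```

Referring to groups by name in the replacement means the result no longer depends on how a particular engine happens to count them.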

[Photo by Daiga Ellaby on Unsplash]

Migrating phpBB to NodeBB

I recently set about migrating an aging phpBB forum to NodeBB and ran into enough problems that I considered cancelling the whole project.

The phpBB exporter script has been updated various times, and I managed to find a fork which appears to work with phpBB 3.2. Unfortunately, it refused to install itself correctly and appeared to land in the wrong directory, so I had to manually clone the GitHub project into the expected subdirectory.

And bingo! The import worked, and after disabling/deleting unnecessary plugins and updating NodeBB to the latest branch, the majority of things were working as expected. A few things remain to be fixed, in particular navigating MongoDB’s structure to perform a few custom replacements where the import script had trouble deciphering BBCode.

Fortunately, Stack Overflow provided a good start:

// Queue the updates in an unordered bulk operation on the 'objects' collection
var bulk = db.getCollection('objects').initializeUnorderedBulkOp();
var count = 0;
// Select only post objects whose content still contains a size="150" tag
db.getCollection('objects').find({$and: [{_key:{$regex: /^post:\d+$/}}, {content: {$regex: /<size size="150">(.*?)<\/size>/}}]}).forEach(function(entry){
    // Turn each tag at the start of a line into a second-level Markdown heading
    var newContent = entry.content.replace(/^<size size="150">(.*?)<\/size>/gm, "## $1");
    // Log the new content so the changes can be reviewed
    print(newContent);
    bulk.find( { _key: entry._key } ).updateOne( {
        $set: { 'content': newContent }
    });
    count++;
    if (count % 100 === 0) {
        // Execute per 100 operations and re-init
        bulk.execute();
        bulk = db.getCollection('objects').initializeUnorderedBulkOp();
        count = 0;
    }
});

// Clean up queue
if (count > 0) bulk.execute();

Using this I was able to find the tags which had been missed and replace them with valid Markdown.
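Before letting a script like this loose on the database, it can be worth trying the replacement on a plain string first. The following is a minimal sketch that can be run outside the database; the function name `convertSizeTags` and the sample post are made up for illustration, while the pattern and replacement are the ones from the script.

```javascript
// Same pattern and replacement as used in the bulk update
function convertSizeTags(content) {
    return content.replace(/^<size size="150">(.*?)<\/size>/gm, "## $1");
}

// Hypothetical post content: one tag at the start of a line, one mid-line
const sample = '<size size="150">Rules</size>\nBe nice.\nSee <size size="150">this</size> too.';
const result = convertSizeTags(sample);
console.log(result);
```

Note that the pattern is anchored with `^`, so only tags at the start of a line become headings. A tag in the middle of a line still satisfies the unanchored search filter, but its content is written back unchanged.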

[Photo by Kobu Agency on Unsplash]

A Little Trick for Quality Assurance in memoQ

This post is also available in English.

The translation software memoQ has a useful feature as part of its QA check which looks for doubled words in the target language. In my case it has pointed out small typos on several occasions where I had accidentally written an “and and” or a “to to”. However, the benefit of this check is somewhat limited if your language regularly uses forms which require this doubling. In German, sentences requiring the words “die die” come to mind. The French translator, meanwhile, tears their hair out when memoQ flags a “nous nous” for the umpteenth time.

However, with the help of the relatively new regex feature, this problem can in fact be solved. When editing your QA ruleset, you can switch off the standard check under the Consistency tab and instead create a new rule under the Regex tab which replaces this check but takes exceptions into account. For French, for example, you can enter the following rule as a Forbidden regex match in target (unfortunately Kilgray’s German help pages are not up to date, hence the English names here):

(?i)(?![nv]ous\b)(\b\S+\b)\s+\b\1\b

When enabled, this rule continues to check for doubled words in the target language, including words containing common punctuation marks such as apostrophes, but ignores every case of “nous nous” or “vous vous”. The exceptions at the front of the rule can then be extended as needed. The rule is certainly not flawless, but it can greatly reduce the number of false warnings without having to give up this check completely.

[Photo by Ilya Pavlov on Unsplash]

memoQ QA Check Tweak

This post is also available in German.

memoQ has a handy little feature as part of its QA check which warns you whenever you double up a word in the target language. I’ve had it catch numerous little and ands and to tos which slip into my work on occasion. However, certain combinations of doubled-up words are fairly commonplace, which can lead to this feature producing lots of unnecessary false positives. A classic example in English might be two hads in a sentence like ‘I had had enough,’ but that pales in comparison to a language like French, which sees plenty of doubled-up words in pronominal verbs (nous nous lavons, vous vous souvenez, etc.).

One way to fix this is to make use of the relatively new regex feature built into the QA check. Untick the option to check for duplicate words in the target under the Consistency tab. Then under the Regex tab we can replicate this functionality, while including our own exception to the rule. Add a new rule of the type Forbidden regex match in target, give it a relevant description, and then add this target regex:

(?i)(?![nv]ous\b)(\b\S+\b)\s+\b\1\b

When active, this rule will continue to highlight any duplicate words in the translation, including words containing the usual punctuation marks such as apostrophes, but ignores any occurrences of nous nous or vous vous. Obviously these exceptions at the front can be replaced with whatever is required in the target language. The rule isn’t by any means flawless, and will for example also complain about repeated sequences of numbers, but it can help to reduce the number of false positives without having to abandon the check altogether.
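To get a feel for what the rule does and does not flag, here is a minimal sketch in JavaScript. Two assumptions apply: JavaScript does not accept the leading `(?i)`, so the `i` flag is used in its place, and memoQ's own regex engine may behave differently in edge cases. The function name `findDoubledWord` and the sample sentences are made up for illustration.

```javascript
// The rule from above, with the i flag standing in for the inline (?i)
const doubledWord = /(?![nv]ous\b)(\b\S+\b)\s+\b\1\b/i;

// Return the first doubled word found in the text, or null if there is none
function findDoubledWord(text) {
    const match = text.match(doubledWord);
    return match ? match[0] : null;
}

const samples = [
    "I had had enough",     // genuine doubling, flagged
    "Nous nous lavons",     // listed exception, ignored
    "vous vous souvenez",   // listed exception, ignored
    "call 123 123 now",     // repeated numbers are flagged too
];
for (const sample of samples) {
    console.log(sample, "->", findDoubledWord(sample));
}
```

Further exceptions can be added by extending the negative lookahead at the front, for instance with an alternation of the words that are allowed to repeat in the target language.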

[Photo by Ilya Pavlov on Unsplash]
