Classification of Content

The Translation Proxy distinguishes two main types of content: text content and resources. The key difference is that text content may be translated, while resource content is treated opaquely, and only references can be replaced as resource localization. It is possible to reclassify entities from Resource to Text, but not the other way around.

During proxying, resources are enumerated, and any already-localized references are replaced, while text content is passed to an applicable Translator implementation for segmented translation.

Text content

By default, the Proxy only handles responses with Content-Type:text/html as translatable. To process HTML content, the source response’s content is parsed into a Document, then text content is extracted from the DOM-nodes. Additionally, various attribute values are processed (without additional configuration, title and alt).

The content is then transformed into SourceEntry entities server-side. Each block element comprises one source entry, with a globally unique key. If segmentation is enabled on the project, the appropriate rules are loaded (either using the default segmentation or by loading the custom SRX file attached to the project), and the content is segmented accordingly, with the resulting token bounds stored in the SourceEntry.
Along with the SourceEntry entities, the corresponding TargetEntry and SourceEntryTarget entities are created. TargetEntry entities, as the name suggests, hold the translations; SourceEntryTargets act as the bridge between the two, and hold the segment status indicators for both.

The content of source entries is analyzed in the context of the project, and statistics are computed. These statistics include the amount of repeated content at different confidence levels based on the similarity of the segment - The Translation Proxy differentiates five levels of similarity:

102%: Strong contextual matches: every segment in the block level element (~paragraph) is a 101% match, where all the tags are identical. These matches do not result in the creation of new SourceEntry entities, thus changes in one place are propagated instantly to all occurrences.
101%: Contextual matches: both tags in the segment, and contexts (segments immediately before and after) match.
100%: Regular matches: the segment is repeated exactly, including all tags.
99%: Strong fuzzy matches: tags from the ends are stripped out, words lowercased, numbers ignored.
98%: Weak fuzzy matches: all tags are stripped out (may have to be adjusted manually afterwards!), words lowercased, numbers ignored. If the Proxy cannot match the tags between the translation and the source due to excessive differences, all tags are placed at the end of the segment, requiring manual review!

These classifications are reused during memory-powered pre-translation in order to select the best applicable translation or to propagate existing translations.

Resource content

By default, any content with content types other than text/html are treated as a resource, and is not a candidate for translation, only replacement en bloc. This mainly includes application/javascript, text/xml, and various image/* content types. Every resource can be given different replacements per target language, and if required, certain resources (application/javascript and text/xml) can be made translatable after pre-configuration is done. In this case, instead of references being replaced, the appropriate Translator will be instantiated and the content passed to it. This can enable partial or complete translation of dynamic content transmitted as JSON or XML.