Extractor

Overview

In some cases, the remote server's response deviates severely from industry standards but still requires translation. To handle such responses, the proxy slices up the incoming stream to extract the relevant content, then restores the original structure after processing. Regular expressions designate the patterns that require handling within otherwise unprocessable strings.

All regular expressions are Java Patterns (java.util.regex.Pattern). See the official Java documentation for the finer points.

Parameters

  • Content Type Pattern: a regular expression that designates the content types eligible for extraction.
  • Validation Pattern: a regular expression that is applied to the incoming content to decide whether it should be extracted.
  • Extractor Pattern: a regular expression that locates and extracts content from the incoming string. It must contain at least one capture group!
  • Processing Prefix: an optional prefix that is prepended to the extracted string before handling. It is stripped before the content is re-inserted into the original!
  • Processing Suffix: an optional suffix that is appended to the extracted string before handling. It is stripped before the content is re-inserted into the original!
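
The parameters above can be pictured as a small configuration object. The following is a minimal sketch, not the proxy's actual code: the class name ExtractorConfig and its fields are hypothetical, and it only illustrates how the three patterns could be compiled as Java Patterns and how the capture-group requirement could be checked up front.

```java
import java.util.regex.Pattern;

// Hypothetical holder for the five parameters; names are illustrative only.
final class ExtractorConfig {
    final Pattern contentType;
    final Pattern validation;
    final Pattern extractor;
    final String prefix; // optional, empty when not set
    final String suffix; // optional, empty when not set

    ExtractorConfig(String contentType, String validation, String extractor,
                    String prefix, String suffix) {
        // Pattern.compile throws PatternSyntaxException for an invalid expression.
        this.contentType = Pattern.compile(contentType);
        this.validation = Pattern.compile(validation);
        this.extractor = Pattern.compile(extractor);
        // groupCount() reports how many capture groups the pattern defines.
        if (this.extractor.matcher("").groupCount() < 1) {
            throw new IllegalArgumentException(
                "Extractor Pattern must contain at least one capture group");
        }
        this.prefix = prefix == null ? "" : prefix;
        this.suffix = suffix == null ? "" : suffix;
    }

    public static void main(String[] args) {
        // Java string literals need the backslashes doubled.
        new ExtractorConfig("text/.*", "x+", "\\#(.*?)\\#", "<p>", "</p>");
        System.out.println("configuration accepted");
    }
}
```

Checking the group count when the configuration is created surfaces a pattern without a capture group immediately, rather than at the first response that reaches the extractor.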

Example

For example, consider the following snippet being sent:

xxx#It's so beautiful!#It's not so beautiful!#xxx

  1. First, the content type is verified to avoid trying to process images and the like:
     Content Type Pattern: text/.*
  2. Then the content is validated to see if it starts with at least one x:
     Validation Pattern: x+
  3. The content is targeted:
     Extractor Pattern: \\#(.*?)\\#
  4. Optionally, HTML processing may be forced by using a prefix and a suffix to wrap the content prior to handling:
     Processing Prefix: <p>
     Processing Suffix: </p>
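
The first two steps act as a gate before any extraction happens. The sketch below shows one way such a gate could look in Java. The class and method names are hypothetical, and two details are assumptions rather than documented behavior: the content type is matched as a whole with matches(), and the validation pattern is anchored at the start of the content with lookingAt(), in line with the "starts with" wording of step 2.

```java
import java.util.regex.Pattern;

// Hypothetical gate: decides whether a response is handed to the extractor.
final class ExtractorGate {
    static boolean shouldExtract(Pattern contentTypePattern, Pattern validationPattern,
                                 String contentType, String body) {
        // Assumption: the entire content-type value must match the pattern.
        if (!contentTypePattern.matcher(contentType).matches()) {
            return false;
        }
        // Assumption: lookingAt() matches from the start of the content only.
        return validationPattern.matcher(body).lookingAt();
    }

    public static void main(String[] args) {
        Pattern contentType = Pattern.compile("text/.*");
        Pattern validation = Pattern.compile("x+");
        String body = "xxx#It's so beautiful!#It's not so beautiful!#xxx";
        System.out.println(shouldExtract(contentType, validation, "text/plain", body));
        System.out.println(shouldExtract(contentType, validation, "image/png", body));
    }
}
```

If the proxy searches for the validation pattern anywhere in the content instead, find() would take the place of lookingAt().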

Internally, the proxy will see the following when translating:

xxx# <p>It's so beautiful!</p> # <p>It's not so beautiful!</p> #xxx

This is then transformed:

xxx# <p>Es ist so schön!</p> # <p>Es ist eigentlich nicht schön!</p> #xxx

Finally, the original form is restored before the content is sent out:

xxx#Es ist so schön!#Es ist eigentlich nicht schön!#xxx
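
The whole round trip can be sketched in Java as follows. This is an illustration under stated assumptions, not the proxy's implementation: ExtractorSketch, process, strip and fakeTranslate are hypothetical names, and fakeTranslate is a hard-coded stand-in for the real translation step. How the proxy iterates over matches is not described here; the sketch resumes each search at the end of the previous capture group so that two segments sharing a # delimiter are both picked up, which reproduces the output shown above.

```java
import java.util.function.UnaryOperator;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical end-to-end sketch: extract, wrap, translate, strip, re-insert.
final class ExtractorSketch {

    static String process(String input, Pattern extractor, String prefix, String suffix,
                          UnaryOperator<String> translate) {
        StringBuilder out = new StringBuilder();
        Matcher m = extractor.matcher(input);
        int copied = 0; // end of the input already written to out
        int from = 0;   // index where the next search starts
        while (from <= input.length() && m.find(from)) {
            int next = Math.max(m.end(1), from + 1); // always move forward
            if (m.start(1) >= copied) { // skip unmatched or overlapping groups
                String handled = translate.apply(prefix + m.group(1) + suffix);
                out.append(input, copied, m.start(1));
                out.append(strip(handled, prefix, suffix));
                copied = m.end(1);
            }
            // Assumption: resume at the closing delimiter so it can open the next match.
            from = next;
        }
        return out.append(input, copied, input.length()).toString();
    }

    // Removes the processing prefix and suffix again, if present.
    static String strip(String s, String prefix, String suffix) {
        if (s.startsWith(prefix)) s = s.substring(prefix.length());
        if (s.endsWith(suffix)) s = s.substring(0, s.length() - suffix.length());
        return s;
    }

    // Stand-in for the translation step; only knows the two example sentences.
    static String fakeTranslate(String html) {
        return html.replace("It's so beautiful!", "Es ist so schön!")
                   .replace("It's not so beautiful!", "Es ist eigentlich nicht schön!");
    }

    public static void main(String[] args) {
        String input = "xxx#It's so beautiful!#It's not so beautiful!#xxx";
        Pattern extractor = Pattern.compile("\\#(.*?)\\#");
        System.out.println(
            process(input, extractor, "<p>", "</p>", ExtractorSketch::fakeTranslate));
    }
}
```

Only the text inside the first capture group is replaced, so the delimiters and everything outside the matches are copied through unchanged.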