Structured Text: what it is
and how to handle it in Eclipse

Author	Matitiahu Allouche (matial@il.ibm.com)
Date	2012-02-07
This Version	1.1
Previous Version	1.0

Change History

Version 1.1

Minor editorial changes
Fixed some broken internal links
Added some examples

Version 1.0

First version

1. Introduction
2. Design Overview
3. Specific Cases
4. Eclipse Support for Structured Text Presentation

1. Introduction

1.1 The Need

Languages like Arabic and Hebrew are generally written from right to left, but included numbers and phrases in English must be written from left to right. This is the origin of the term "bidirectional" which qualifies these languages.

In most computer environments, the text is stored in logical order (the order the text is read) but is reordered into visual order for presentation. For plain text, the Unicode Bidirectional Algorithm (UBA) generally specifies satisfactorily how to reorder bidirectional text for display. This algorithm, or close to it, is implemented in the presentation systems of a number of platforms, giving them a good handle on bidirectional support.

However, not all bidirectional text is plain text. There are also instances of text structured to follow a given syntax, which should be reflected in the display order. The general algorithm, which has no awareness of these special cases, often gives incorrect results when displaying such structured text.

This document describes various examples of this issue, and proposes a methodology to solve the related problems. The types of structured text treated in this document are all excerpted from actual products, including Eclipse.

For a general introduction to bidirectional concepts, the reader is kindly referred to the following technical article: "Bidirectional script support: a primer" available at http://www-128.ibm.com/developerworks/websphere/library/techarticles/bidi/bidigen.html.

1.2 Goal of this Document

The goal of this document is to provide a comprehensive and consistent solution for various cases where bidirectional text must be displayed in a specific way. This document provides a high level design for such cases, based on general principles, and describes how to implement this design using appropriate packages in Eclipse.

1.3 Abbreviations Used in this Document

UBA: Unicode Bidirectional Algorithm
Bidi: Bidirectional
GUI: Graphical User Interface
UI: User Interface
LTR: Left to Right
RTL: Right to Left
LRM: Left-to-Right Mark
RLM: Right-to-Left Mark
LRE: Left-to-Right Embedding
RLE: Right-to-Left Embedding
PDF: Pop Directional Formatting

1.4 Known Limitations

The proposed solution makes extensive use of the LRM, RLM, LRE, RLE and PDF directional controls, which are invisible but affect the way bidi text is displayed. The following related key points merit special attention:

Implementations of the UBA on various platforms (e.g., Windows and Linux) are very similar but nevertheless have known differences. Those differences are minor and will not have a visible effect in most cases. However, there may be cases in which the same bidi text looks different on two platforms.
This design assumes support for LRE/RLE/PDF controls in the presentation engine.
Because some presentation engines are not strictly conformant to the UBA, this document specifies adding LRM or RLM characters in association with LRE, RLE or PDF in cases where this would not be needed for implementations fully conformant to the UBA. Such added marks have no harmful effects with conformant implementations and help less conformant implementations achieve the desired presentation.

2. Design Overview

2.1 General Definitions, Terminology and Conventions

Every instance of bidi text has a base text direction. Bidi text in Arabic or Hebrew has a RTL base direction, even if it includes numbers or Latin phrases which are written from left to right. Bidi text in English or Greek has a LTR base direction, even if it includes Arabic or Hebrew phrases which are written from right to left.

Structured expressions also have a base text direction, which is often determined by the type of structured expression, but may also be affected by the content of the expression (whether it contains Arabic or Hebrew words).

This document addresses two groups of problematic cases:

Expressions with simple internal structure: this category groups cases in which strings are concatenated in simple ways using known separators. For example: variable names, "name = value" specifications, file paths, etc.
Expressions with complex internal structure: this category groups structured text like regular expressions, XPath expressions and Java code. It differs from the previous one in that its expressions have a syntax of their own which cannot be described as a concatenation of string segments using separators.

We will see that the same algorithms can handle both groups, with some adaptations in the details.

In the examples appearing in this document, upper case Latin letters represent Arabic or Hebrew text, lower case Latin letters represent English text.

"@" represents an LRM, "&" represents an RLM.

Notations like LRE+LRM represent character LRE immediately followed by character LRM.

2.2 Bidirectional Control Characters

When there are problems of wrong display of bidi text, it is often possible to cure them by adding some bidi control characters at appropriate locations in the text. There are 7 bidi control characters: LRM, RLM, LRE, RLE, LRO, RLO and PDF. Since this design has no use for LRO and RLO (Left-to-Right and Right-to-Left Override, respectively), the following paragraphs will describe the effect of the 5 other characters.

LRM (Left-to-Right Mark): LRM is an invisible character which behaves like a letter in a Left to Right script such as Latin or Greek. It can be used when a segment of LTR text starts or ends with characters which are not intrinsically LTR and is displayed in a component with a RTL orientation.
Example: assume in memory the string "\\myserver\myshare(mydirectory)". We want it displayed identically, but within a component with RTL orientation it would be displayed as "(myserver\myshare(mydirectory\\". Adding one LRM character at the beginning of the string will cause the leading backslashes to be displayed on the left side, and adding one LRM character at the end of the string will cause the trailing parenthesis to be displayed on the right side.
RLM (Right-to-Left Mark): RLM is an invisible character which behaves like a letter in a Right to Left script like Hebrew. It can be used when a segment of RTL text starts or ends with characters which are not intrinsically RTL and is displayed in a component with a LTR orientation.
Example: assume in memory the string "HELLO WORLD !". We want it displayed as "! DLROW OLLEH", but within a component with a LTR orientation it would be displayed as "DLROW OLLEH !" (exclamation mark on the right side). Adding one RLM character at the end of the string will cause the trailing exclamation mark to be displayed on the left side.
LRE (Left-to-Right Embedding): LRE can be used to give a base LTR direction to a piece of text. It is most useful for mixed text which contains both LTR and RTL segments.
Example: assume in memory the string "i love RACHEL and LEA" which should be displayed as "i love LEHCAR and AEL". However, within a component with RTL orientation, it would be displayed as "AEL and LEHCAR i love". Adding one LRE character at the beginning of the string and one PDF (see below) character at the end of the string will cause proper display.
RLE (Right-to-Left Embedding): RLE can be used to give a base RTL direction to a piece of text. It is most useful for mixed text which contains both LTR and RTL segments.
Example: assume in memory the string "I LOVE london AND paris" which should be displayed as "paris DNA london EVOL I". However, within a component with LTR orientation, it would be displayed as "EVOL I london DNA paris". Adding one RLE character at the beginning of the string and adding one PDF (see below) character at the end of the string will cause proper display.
PDF (Pop Directional Formatting): PDF may be used to limit the effect of a preceding LRE or RLE. It may be omitted if not followed by any text.

Note that pieces of text bracketed between LRE/PDF or RLE/PDF can be contained within larger pieces of text themselves bracketed between LRE/PDF or RLE/PDF. This is why the "E" of LRE and RLE means "embedding". This could happen if we have for instance a Hebrew sentence containing an English phrase itself containing an Arabic segment. In practice, such complex cases should be avoided if possible. The present design does not use more than one level of LRE/PDF or RLE/PDF, except possibly in section 3.12 Message with Placeholders.
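
As a minimal illustration of the five controls described in this section, the following Python sketch spells out their Unicode code points and brackets a string in an embedding. The helper names embed_ltr and embed_rtl are hypothetical and do not belong to any Eclipse API. The sample string uses real Hebrew letters where this document's notation would use upper case.

```python
# Unicode code points of the five directional controls used in this design.
LRM = "\u200e"  # Left-to-Right Mark
RLM = "\u200f"  # Right-to-Left Mark
LRE = "\u202a"  # Left-to-Right Embedding
RLE = "\u202b"  # Right-to-Left Embedding
PDF = "\u202c"  # Pop Directional Formatting


def embed_ltr(text: str) -> str:
    """Give `text` an LTR base direction (hypothetical helper)."""
    return LRE + text + PDF


def embed_rtl(text: str) -> str:
    """Give `text` an RTL base direction (hypothetical helper)."""
    return RLE + text + PDF


# "i love RACHEL and LEA", with Hebrew letters standing in for the names.
sample = "i love \u05e8\u05d7\u05dc and \u05dc\u05d0\u05d4"
wrapped = embed_ltr(sample)
print([f"U+{ord(c):04X}" for c in (wrapped[0], wrapped[-1])])
```

The controls are ordinary characters in the string, so the wrapped text can be passed to any display component that implements the UBA.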

2.3 Bidi Classification

Characters can be classified according to their bidi type as described in the Unicode Standard (see Bidirectional_Character_Types for a full description of the bidi types). For our purpose, we will distinguish the following types of characters:

"Strong" characters: those with a bidi type of L, R or AL (letters in LTR or RTL scripts);
Numbers: European Numbers (type EN) or Arabic Numbers (type AN);
Neutrals: all the rest.
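
The three groups above can be sketched in Python with the standard unicodedata module, whose bidirectional() function returns the Unicode bidi type of a character. The function name bidi_group is an assumption made for this illustration only.

```python
import unicodedata

STRONG = {"L", "R", "AL"}
NUMBER = {"EN", "AN"}


def bidi_group(ch: str) -> str:
    """Return 'strong', 'number' or 'neutral' for one character."""
    bidi_type = unicodedata.bidirectional(ch)
    if bidi_type in STRONG:
        return "strong"
    if bidi_type in NUMBER:
        return "number"
    return "neutral"


# Latin letter, Hebrew letter, Arabic letter, European digit,
# Arabic-Indic digit, equal sign, space.
for ch in ("a", "\u05d0", "\u0627", "5", "\u0665", "=", " "):
    print(f"U+{ord(ch):04X}", unicodedata.bidirectional(ch), bidi_group(ch))
```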

2.4 Text Analysis

In all the structured expressions that we are addressing, we can see characters with a special syntactical role that we will call "separators", and pieces of text between separators that we will call "tokens". The separators vary according to the type of structured expression. Often they are punctuation signs like colon (:), backslash (\) and full stop (.), or mathematical signs like Plus (+) or Equal (=).

Our objective is that the relative progression of the tokens and separators for display should always follow the base text direction of the text, while each token will go LTR or RTL depending on its content and according to the UBA.

For this to happen, the following must be done:

Parse the expression to locate the separators and the tokens.
While parsing, note the bidi classification of characters parsed.
Depending on the bidi types of the characters before a token and in that token, a LRM or a RLM may have to be added. The algorithm for this is detailed below.
If the expression has a LTR base direction and the component where it is displayed has a RTL orientation, add LRE+LRM at the beginning of the expression and LRM+PDF at its end.
If the expression has a RTL base direction and the component where it is displayed has a LTR orientation, add RLE+RLM at the beginning of the expression and RLM+PDF at its end.

The original structured expression, before addition of directional formatting characters, is called lean text.

The processed expression, after addition of directional formatting characters, is called full text.
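
The last two steps of the list above, and the relation between lean and full text, can be sketched as follows. This is only an illustration: the function names are hypothetical, and strip_controls assumes that the lean text never contained directional controls of its own.

```python
LRM, RLM = "\u200e", "\u200f"
LRE, RLE, PDF = "\u202a", "\u202b", "\u202c"


def wrap_for_component(text: str, base_rtl: bool, component_rtl: bool) -> str:
    """Bracket the expression only when its base direction differs from
    the orientation of the component which displays it."""
    if base_rtl == component_rtl:
        return text
    if base_rtl:
        return RLE + RLM + text + RLM + PDF
    return LRE + LRM + text + LRM + PDF


def strip_controls(full: str) -> str:
    """Recover lean text by dropping the five controls (assumes the lean
    text itself never contained any of them)."""
    return "".join(c for c in full if c not in (LRM, RLM, LRE, RLE, PDF))


lean = "\u05d0\u05d1\u05d2=5"                # three Hebrew letters, '=', '5'
marked = "\u05d0\u05d1\u05d2" + LRM + "=5"   # mark added before the separator
full = wrap_for_component(marked, base_rtl=False, component_rtl=True)
print(strip_controls(full) == lean)
```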

2.5 LRM Addition (structured text with LTR base text direction)

A LRM will be added before a token if either of the following conditions is satisfied:

The last strong character before the token has a bidi type equal to R or AL and the first non-neutral character in the token itself has a bidi type equal to R, AL, EN or AN.

Examples (strings in logical order where "@" represents where an LRM should be added):

   HEBREW @= ARABIC
   HEBREW @= 123

The last non-neutral character before the token has a bidi type equal to AN and the first non-neutral character in the token has a bidi type equal to R, AL or AN.

Examples (strings in logical order where "@" represents where an LRM should be added):

   ARABIC NUMBER 123 @< MAX
   ARABIC NUMBER 123 @< 456
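
The two conditions of this section can be sketched in Python as shown below. This is an illustration under stated assumptions, not the Eclipse implementation: the mark is placed immediately before the separator, as in the examples; separators are assumed to be neutral characters; the forward scan looks past any run of neutrals (so adjacent separators act as one boundary); and an inserted LRM is counted as an LTR letter for the rest of the scan. The helpers from_notation and to_notation are hypothetical test aids which map the upper-case notation of this document to real Hebrew letters and back.

```python
import unicodedata

LRM = "\u200e"
NON_NEUTRAL = {"L", "R", "AL", "EN", "AN"}


def from_notation(s: str) -> str:
    """Map the upper-case letters of this document's notation to Hebrew
    letters so that real bidi types apply (test helper, an assumption)."""
    return "".join(chr(0x05D0 + ord(c) - ord("A")) if "A" <= c <= "Z" else c
                   for c in s)


def to_notation(s: str) -> str:
    """Inverse of from_notation; also shows each LRM as '@'."""
    out = []
    for c in s:
        if c == LRM:
            out.append("@")
        elif 0x05D0 <= ord(c) <= 0x05E9:
            out.append(chr(ord("A") + ord(c) - 0x05D0))
        else:
            out.append(c)
    return "".join(out)


def first_non_neutral(text: str, start: int):
    """Bidi type of the first strong or numeric character at or after start."""
    for ch in text[start:]:
        bidi_type = unicodedata.bidirectional(ch)
        if bidi_type in NON_NEUTRAL:
            return bidi_type
    return None


def add_lrm(lean: str, separators: str) -> str:
    out = []
    last_strong = None       # last L, R or AL seen so far
    last_non_neutral = None  # last L, R, AL, EN or AN seen so far
    for i, ch in enumerate(lean):
        if ch in separators:
            nxt = first_non_neutral(lean, i + 1)
            rule1 = last_strong in ("R", "AL") and nxt in ("R", "AL", "EN", "AN")
            rule2 = last_non_neutral == "AN" and nxt in ("R", "AL", "AN")
            if rule1 or rule2:
                out.append(LRM)
                # The inserted mark behaves like an LTR letter from here on.
                last_strong = last_non_neutral = "L"
        else:
            bidi_type = unicodedata.bidirectional(ch)
            if bidi_type in NON_NEUTRAL:
                last_non_neutral = bidi_type
                if bidi_type not in ("EN", "AN"):
                    last_strong = bidi_type
        out.append(ch)
    return "".join(out)


for lean, seps in (("HEBREW = ARABIC", "="), ("HEBREW = 123", "="),
                   ("c:\\DIR1\\DIR2\\MYFILE.ext", ":\\.")):
    print(to_notation(add_lrm(from_notation(lean), seps)))
```

Only the separator set changes from one type of structured text to another, which is why the same routine can serve most of the simple cases of chapter 3.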

2.6 RLM Addition (structured text with RTL base text direction)

A RLM will be added before a token if the following condition is satisfied:

The last strong character before the token has a bidi type equal to L and the first non-neutral character in the token itself has a bidi type equal to L or EN.

Example (string in logical order where "&" represents where an RLM should be added):

   my_pet &= dog
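
A corresponding sketch for RLM addition follows, under the same assumptions as for LRM addition: the mark goes immediately before the separator, separators are neutral characters, and an inserted RLM counts as an RTL letter afterwards. The function name add_rlm is hypothetical.

```python
import unicodedata

RLM = "\u200f"
NON_NEUTRAL = {"L", "R", "AL", "EN", "AN"}


def add_rlm(lean: str, separators: str) -> str:
    """Insert an RLM before each separator that sits between LTR text and a
    token starting with an LTR letter or a European number."""
    out = []
    last_strong = None
    for i, ch in enumerate(lean):
        if ch in separators:
            # Bidi type of the first strong or numeric character that follows.
            nxt = next((t for t in map(unicodedata.bidirectional, lean[i + 1:])
                        if t in NON_NEUTRAL), None)
            if last_strong == "L" and nxt in ("L", "EN"):
                out.append(RLM)
                last_strong = "R"  # the mark now counts as an RTL letter
        else:
            bidi_type = unicodedata.bidirectional(ch)
            if bidi_type in ("L", "R", "AL"):
                last_strong = bidi_type
        out.append(ch)
    return "".join(out)


marked = add_rlm("my_pet = dog", "=")
print(marked.replace(RLM, "&"))
```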

3. Specific Cases

In this chapter, we consider in detail a number of specific cases and how the general solution applies to them. We start by discussing simple cases and progressively move to more complex ones.

Since the cases addressed are well known, we don't attempt to give a complete formal definition of their syntax, but rather submit one or more representative symbolic patterns.

The algorithms we propose assume that the text to process conforms syntactically to the pattern it implements. Syntax checking is not in the scope of this document. Incorrect syntax in the data may also lead to anomalies in presentation of bidi text.

Unless specified otherwise, our requirement for presentation in all the cases below is that the relative progression of the tokens and separators for display should always be from left to right, while the text of each token will go LTR or RTL depending on its content and according to the UBA.

3.1 Property File

Pattern

[variable name] = [value]

Limitation: variable names must not include equal signs.

Detailed Design

The general algorithm described in above sections 2.4 Text Analysis and 2.5 LRM Addition applies, with the following adaptations:

There is only one separator, the equal sign (=).
It is enough to locate the first occurrence of the separator. Everything before can be considered as the first token, everything after can be considered the second (and last) token.

Example:

   Logical order (without LRM):   PRIORITY=5
   Display (without LRM):         5=YTIROIRP
   Logical order (with LRM):      PRIORITY@=5
   Display (with LRM):            YTIROIRP=5
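
The "first occurrence only" adaptation can be sketched in Python with str.partition, which splits a string at the first occurrence of a separator. The names mark_property and first_type are hypothetical, and for brevity only the first condition of section 2.5 is checked.

```python
import unicodedata

LRM = "\u200e"


def first_type(text, wanted):
    """Bidi type of the first character of `text` whose type is in `wanted`."""
    for ch in text:
        t = unicodedata.bidirectional(ch)
        if t in wanted:
            return t
    return None


def mark_property(line: str) -> str:
    # Only the first '=' is a separator; later ones belong to the value.
    name, sep, value = line.partition("=")
    if not sep:
        return line
    last_strong = first_type(reversed(name), {"L", "R", "AL"})
    first_in_value = first_type(value, {"L", "R", "AL", "EN", "AN"})
    needs_lrm = (last_strong in ("R", "AL")
                 and first_in_value in ("R", "AL", "EN", "AN"))
    return name + (LRM if needs_lrm else "") + sep + value


# PRIORITY=5 with Hebrew letters standing in for the upper-case name.
name = "\u05e2\u05d3\u05d9\u05e4\u05d5\u05ea"
print(mark_property(name + "=5") == name + LRM + "=5")
```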

3.2 Compound Name

Pattern

[first part] _ [second part] _ [third part]

Limitation: name parts must not include underscores.

Detailed Design

The general algorithm described in above sections 2.4 Text Analysis and 2.5 LRM Addition applies, with the following adaptation:

There is only one separator, the underscore (_).

Example:

   Logical order (without LRM):   MYPACKAGE_MYPROGRAM
   Display (without LRM):         MARGORPYM_EGAKCAPYM
   Logical order (with LRM):      MYPACKAGE@_MYPROGRAM
   Display (with LRM):            EGAKCAPYM_MARGORPYM

3.3 Comma-delimited List

Pattern

[first list item] , [second list item] , . . . , [last list item]

Detailed Design

The general algorithm described in above sections 2.4 Text Analysis and 2.5 LRM Addition applies, with the following adaptations:

There is only one separator, the comma(,).
This design can easily be adapted to accommodate a different separator, like a semicolon (;) or a tab character, etc.

Example:

   Logical order (without LRM):   ABC,DE,FGH
   Display (without LRM):         HGF,ED,CBA
   Logical order (with LRM):      ABC@,DE@,FGH
   Display (with LRM):            CBA,ED,HGF

3.4 System, Userid Specification

Pattern

[system ID] ( [user ID] )

Limitation: the system ID must not include parentheses.

Detailed Design

The general algorithm described in above sections 2.4 Text Analysis and 2.5 LRM Addition applies, with the following adaptations:

It is enough to consider one separator, the left parenthesis ( ( ).
It is enough to locate the first occurrence of the separator. Everything before can be considered as the first token, everything after can be considered the second (and last) token.

Example:

   Logical order (without LRM):   MY_HOST(MY_USERID)
   Display (without LRM):         DIRESU_YM)TSOH_YM)
   Logical order (with LRM):      MY_HOST@(MY_USERID)
   Display (with LRM):            TSOH_YM(DIRESU_YM)

3.5 Full Path - Relative Path - File Name

Patterns

Windows full path: [drive letter]:\ [sub-path] \ . . . \ [sub-path]

Windows relative path: [sub-path] \ . . . \ [sub-path]

Windows full file path: [drive letter]:\ [sub-path] \ . . . \ [sub-path] \ [file name] . [extension]

Windows relative file path: [sub-path] \ . . . \ [sub-path] \ [file name] . [extension]

Linux full path: / [sub-path] / . . . / [sub-path]

Linux relative path: [sub-path] / . . . / [sub-path]

Linux full file path: / [sub-path] / . . . / [sub-path] / [file name] . [extension]

Linux relative file path: [sub-path] / . . . / [sub-path] / [file name] . [extension]

Detailed Design

The general algorithm described in above sections 2.4 Text Analysis and 2.5 LRM Addition applies, with the following adaptation:

The separators are colon (:), backslash (\) and full stop (.) for Windows, slash (/) and full stop (.) for Linux.

Example:

   Logical order (without LRM):   c:\DIR1\DIR2\MYFILE.ext
   Display (without LRM):         c:\ELIFYM\2RID\1RID.ext
   Logical order (with LRM):      c:\DIR1@\DIR2@\MYFILE.ext
   Display (with LRM):            c:\1RID\2RID\ELIFYM.ext

3.6 URL, URI, IRI

Patterns

http:// [domain label] . . . . . [domain label]

http:// [domain label] . . . . . [domain label] / [sub-path] / . . . / [sub-path] / [file name] . [extension]

http:// [domain label] . . . . . [domain label] / [sub-path] / . . . / [sub-path] / [file name] . [extension] # [local reference]

http:// [domain label] . . . . . [domain label] / [sub-path] / . . . / [sub-path] / [file name] . [extension] ? [key1] = [value1] & [key2] = [value2]

Detailed Design

The general algorithm described in above sections 2.4 Text Analysis and 2.5 LRM Addition applies, with the following adaptations:

The detailed syntax of URLs, URIs, IRIs is described in RFC 3986 and RFC 3987. A rigorous analysis to identify tokens and separators is not simple.
For most practical cases, it is sufficient to consider the following separators: colon (:), question mark (?), number sign (#), slash (/), commercial at (@), full stop (.), left bracket ([), right bracket (]).

Example:

   Logical order (without LRM):   www.DOC.MYDOMAIN.com/HEB/LESSON1.html
   Display (without LRM):         www.NIAMODYM.COD.com/1NOSSEL/BEH.html
   Logical order (with LRM):      www.DOC@.MYDOMAIN.com/HEB@/LESSON1.html
   Display (with LRM):            www.COD.NIAMODYM.com/BEH/1NOSSEL.html

3.7 Mathematical Formula

Requirement

Preserve the relative order of the formula components according to the base text direction of the formula.

Detailed Design

The general algorithm described in above sections 2.4 Text Analysis and 2.5 LRM Addition applies, with the following adaptations:

The separators are the usual arithmetic operators.
Tokens will be ordered according to the base text direction of the formula.
If the first strong directional character in the formula is a Hebrew or LTR letter, the base text direction of the formula is LTR.
If the first strong directional character in the formula is an Arabic letter, the base direction of the formula must be RTL.
If there is no strong directional character in the formula but there are Arabic-Indic digits, the base direction of the formula must be RTL, otherwise it must be LTR.

Example (Hebrew):

   Logical order (without LRM):   PROFIT = REVENUE - COST
   Display (without LRM):         TSOC - EUNEVER = TIFORP
   Logical order (with LRM):      PROFIT @= REVENUE @- COST
   Display (with LRM):            TIFORP = EUNEVER - TSOC

Example (Arabic, ampersand represents RLM):

   Logical order (without RLM):   DIVIDEND = SHARE x 0.10
   Display (without RLM):         x 0.10 ERAHS = DNEDIVID
   Logical order (with RLM):      DIVIDEND = SHARE x& 0.10
   Display (with RLM):            0.10 x ERAHS = DNEDIVID
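
The rules given above for the base direction of a formula can be sketched as follows. The function name formula_is_rtl is hypothetical. The sketch relies on Unicode bidi types as an approximation: type AL stands for "Arabic letter" (it also covers a few other scripts), type R for "Hebrew letter" (it also covers other RTL scripts), and type AN for "Arabic-Indic digit" (it also covers some Arabic number signs).

```python
import unicodedata


def formula_is_rtl(formula: str) -> bool:
    """Base direction of a formula per the rules above (True means RTL)."""
    has_arabic_indic_digit = False
    for ch in formula:
        bidi_type = unicodedata.bidirectional(ch)
        if bidi_type == "AL":        # first strong character is Arabic
            return True
        if bidi_type in ("L", "R"):  # first strong character is LTR or Hebrew
            return False
        if bidi_type == "AN":
            has_arabic_indic_digit = True
    # No strong character at all: RTL only if Arabic-Indic digits were seen.
    return has_arabic_indic_digit


print(formula_is_rtl("\u0627\u0628 = 5"))   # Arabic letters first
print(formula_is_rtl("\u05d0\u05d1 = 5"))   # Hebrew letters first
print(formula_is_rtl("\u0661 + \u0662"))    # Arabic-Indic digits only
```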

3.8 Regular Expression

Requirement

Preserve the relative order of the regular expression components identical to the order in which they appear when exclusively Latin characters are used.

Detailed Design

The general algorithm described in above sections 2.4 Text Analysis and 2.5 LRM Addition applies, with the following adaptations:

Regular expressions consist of operators, pattern characters, and – in most implementations of extended syntax – named identifiers.
Since the syntax of regular expression is not standardized, the list of operators should be adapted to the specific implementation at hand.
Common operators include: question mark (?), circumflex (^), dollar ($), plus (+), minus (-), asterisk (*), vertical bar (|), tilde (~), left and right parentheses ( ( ) ), left and right square brackets ([ ]), left and right curly brackets ( { } ), commercial at (@), number sign (#), ampersand (&), backslash (\).
The separators will be the characters used as operators for regular expressions.
Characters which are not operators are pattern characters. If an operator is immediately preceded by a backslash, both the backslash and the operator must be handled as pattern characters.
Each pattern character is a separate token, so pattern characters will always be ordered according to the base text direction of the expression.
Identifiers appear in certain syntactic constructs, and are treated as tokens. For example, the strings “digit” and “number” in the expression “total: (?<number>[:digit:]+)\s” are identifiers, whereas “total” is just a sequence of 5 pattern characters.
The following constructs must be recognized as delimiting tokens (note: this list should be adapted to the specific syntax of regular expressions in a given environment):
   (?<name>
   (?'name'
   (?(<name>)
   (?('name')
   (?(name)
   (?&name)
   (?P<name>
   \k<name>
   \k'name'
   \k{name}
   (?P=name)
   \g{name}
   \g<name>
   \g'name'
   (?(R&name)
   [:class:]
Comments of the form (?# . . . ) must be handled as individual tokens.
Quoted sequences of the form \Q . . . \E must be handled as individual tokens.
Numbers used as quantifiers (numbers of occurrences) or as group references must be handled as individual tokens.
If the first strong directional character in a regular expression is an Arabic letter, the base direction of the expression must be RTL.
If the first strong directional character in a regular expression is a Hebrew letter or a LTR letter, the base direction of the expression must be LTR.
If the regular expression contains no strong directional character, its base direction must be LTR for Hebrew users. For Arabic users, its base direction should follow the GUI direction (RTL if mirrored, LTR otherwise).

Example (Hebrew):

   Logical order (without LRM):   ABC(?'DEF'GHI
   Display (without LRM):         IHG'FED'?(CBA
   Logical order (with LRM):      A@B@C@(?'DEF'@G@H@I
   Display (with LRM):            ABC(?'FED'GHI

Example (Arabic):

   Logical order (without LRM):   ABC(?'DEF'GHI
   Display (without LRM):         IHG'FED'?(CBA
   Logical order (with LRM):      ABC(?'DEF'GHI
   Display (with LRM):            IHG'FED'?(CBA

3.9 Java Code

Requirement

We can classify elements of a Java program as:

white space
operators
String literals: they start with a double quote and end with a double quote which is not escaped (not preceded by a backslash).
comments: they start with /* and end with */ or start with // and end at the end of the line.
tokens: anything delimited by the previous items.

The requirement is to make the relative order of elements left-to-right, while each element by itself will be presented according to the UBA.

Detailed Design

The general algorithm described in above sections 2.4 Text Analysis and 2.5 LRM Addition applies, with the following adaptations:

Each String literal or comment is considered as one token.
The separators are all the characters used as operators and separators in the Java language: plus (+), minus (-), asterisk (*), slash (/), percent (%), less-than (<), greater-than (>), ampersand (&), vertical bar (|), circumflex (^), tilde (~), left and right parentheses ( ( ) ), left and right square brackets ([ ]), left and right curly brackets ( { } ), comma (,), full stop (.), semicolon (;), exclamation mark (!), question mark (?), colon (:), spaces which are not part of a String literal or a comment.
If a String literal or a comment includes LRE or RLE characters but does not include the proper number of matching PDF characters, the missing PDF characters must be added at the end of the literal or comment.

Example:

   Logical order (without LRM):   A = /*B+C*/ D;
   Display (without LRM):         D /*C+B*/ = A;
   Logical order (with LRM):      A@ = /*B+C@*/ D;
   Display (with LRM):            A = /*C+B*/ D;
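
The rule about unmatched LRE or RLE characters in a String literal or comment can be sketched as below. The function name close_embeddings is hypothetical, and the sketch simply appends the missing PDF characters to whatever segment it is given; where exactly the end of a literal or comment lies (before or after its closing delimiter) is left to the caller.

```python
LRE, RLE, PDF = "\u202a", "\u202b", "\u202c"


def close_embeddings(segment: str) -> str:
    """Append the PDF characters needed to close every LRE or RLE that is
    still open at the end of a string literal or comment."""
    depth = 0
    for ch in segment:
        if ch in (LRE, RLE):
            depth += 1
        elif ch == PDF and depth > 0:
            # A PDF with nothing to close is ignored.
            depth -= 1
    return segment + PDF * depth


comment = "/* " + RLE + "abc */"
print(close_embeddings(comment) == comment + PDF)
```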

3.10 Other Programming Languages

Other programming languages can be handled like Java, with adaptation to the characteristics of each language.

String literals according to the syntax of the language
Comments according to the syntax of the language
Operators used in the language

3.11 XPath

Patterns

/ book / chapter / paragraph

/ year / month [@name = "April"]

Detailed Design

The general algorithm described in above sections 2.4 Text Analysis and 2.5 LRM Addition applies, with the following adaptations:

Strings
- Strings are started by a quotation mark which can be a double-quote (") or an apostrophe ('), and are closed by the same character.
- Double-quotes may appear within a string limited by apostrophes and vice versa, and must be handled as characters internal to the string.
- A string started on one line is not necessarily closed on the same line.
Whitespace (e.g. blanks and tab characters) appearing outside of strings constitutes a delimiter for tokens.
Each occurrence of a string must be handled as one token.
After isolating strings, the following characters are separators: white space, slash (/), square brackets ( [ and ] ), less-than (<), greater-than (>), equal sign (=), exclamation mark (!), colon (:), at sign (@), period (.), vertical bar (|), parentheses ( ( and ) ), plus (+), minus (-), asterisk (*).
Some operators are words like "and", "or", "div", "mod". For our purpose, they can be handled as tokens.
Some operators are represented by a pair of symbols like "not equal" (!=), "descendant-or-self" (//), "parent" (..). For our purpose, they can be handled as 2 successive operators represented by one symbol each.

Example:

   Logical order (without LRM):   DEF!GHI 'A!B'=JK
   Display (without LRM):         KJ='B!A' IHG!FED
   Logical order (with LRM):      DEF@!GHI@ 'A!B'@=JK
   Display (with LRM):            FED!IHG 'B!A'=KJ
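
The isolation of strings before the search for separators can be sketched as follows. The function name split_xpath and the labels 'string', 'sep' and 'token' are assumptions made for this illustration. The sketch handles one piece of text at a time, so a string left open is simply taken to run to the end of that text.

```python
XPATH_SEPARATORS = " \t\n/[]<>=!:@.|()+-*"


def split_xpath(expr: str):
    """Split an XPath expression into ('string' | 'sep' | 'token', text)
    pairs, treating each quoted string as a single token."""
    parts, i, n = [], 0, len(expr)
    while i < n:
        ch = expr[i]
        if ch in "\"'":
            # A string runs to the matching quote, or to the end of the
            # text if it is not closed here (it may continue on a later line).
            end = expr.find(ch, i + 1)
            end = n if end == -1 else end + 1
            parts.append(("string", expr[i:end]))
            i = end
        elif ch in XPATH_SEPARATORS:
            parts.append(("sep", ch))
            i += 1
        else:
            start = i
            while (i < n and expr[i] not in XPATH_SEPARATORS
                   and expr[i] not in "\"'"):
                i += 1
            parts.append(("token", expr[start:i]))
    return parts


print(split_xpath("DEF!GHI 'A!B'=JK"))
```

In the printed result the exclamation mark inside 'A!B' stays part of the string, which is why no LRM is added before it in the example.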

3.12 Message with Placeholders

Products often use template messages where placeholders are replaced by custom data at run time.

Requirement

The display considerations must ensure correct presentation of both the template text and the custom data replacing the placeholders, taking into account that these data might have an internal structure, which should be preserved.

Detailed Design

The message template will be considered as having a LTR base direction if it is not translated, a RTL base direction if it is translated to Arabic or Hebrew.
Let us call "insertion unit" a piece of custom data which is to replace a placeholder. Insertion units with an internal structure also have a defined base direction, generally LTR. For insertion units without internal structure, their base direction will be defined as RTL if they contain at least one Arabic or Hebrew letter, LTR otherwise.
Each insertion unit with an internal structure must be processed according to its specific structure.
If the base direction of an insertion unit is the same as that of the template, there is nothing more to do for it.
If the base direction of the template is LTR and the base direction of an insertion unit is RTL, the insertion unit should have RLE+RLM added at its beginning and RLM+PDF added at its end.
If the base direction of the template is RTL and the base direction of an insertion unit is LTR, the insertion unit should have LRE+LRM added at its beginning and LRM+PDF added at its end.
If the component in which the formatted message is displayed has an orientation different from the template direction, the formatted message must have LRE+LRM added to its beginning and LRM+PDF added to its end if its base direction is LTR, RLE+RLM added to its beginning and RLM+PDF added to its end if its base direction is RTL.

Example (">" represents LRE, "^" represents PDF, "@" represents LRM):

   Logical order (template without LRM):   err012: FILE "%1" NOT FOUND!
   Logical order (%1 without LRM):         c:\DIR1\MYFILE.ext
   Display (without LRM):                  !DNUOF TON "ext.ELIFYM\1RID\:c" ELIF :err012
   Logical order (with LRM):               err012: FILE ">@c:\DIR1@\MYFILE.ext@^" NOT FOUND!
   Display (with LRM):                     !DNUOF TON "c:\1RID\ELIFYM.ext" ELIF :err012
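
The handling of an insertion unit whose base direction differs from that of the template can be sketched as follows. The function names are hypothetical, the "%1" placeholder syntax is taken from the example of this section, and the file path is assumed to have been processed already according to section 3.5.

```python
LRM, RLM = "\u200e", "\u200f"
LRE, RLE, PDF = "\u202a", "\u202b", "\u202c"


def isolate_unit(unit: str, unit_rtl: bool, template_rtl: bool) -> str:
    """Bracket an insertion unit whose base direction differs from that of
    the template; leave it untouched otherwise."""
    if unit_rtl == template_rtl:
        return unit
    if unit_rtl:
        return RLE + RLM + unit + RLM + PDF
    return LRE + LRM + unit + LRM + PDF


def format_message(template: str, template_rtl: bool,
                   unit: str, unit_rtl: bool) -> str:
    # "%1" is the placeholder syntax used in the example of this section.
    return template.replace("%1", isolate_unit(unit, unit_rtl, template_rtl))


def show(text: str) -> str:
    """Render the controls with the symbols used in the example."""
    return text.replace(LRE, ">").replace(PDF, "^").replace(LRM, "@")


# The file path has already received its LRM per section 3.5.
path = "c:\\DIR1" + LRM + "\\MYFILE.ext"
message = format_message('err012: FILE "%1" NOT FOUND!', True, path, False)
print(show(message))
```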

4. Eclipse Support for Structured Text Presentation

Eclipse provides support for correct presentation of structured text. This support is oriented towards three categories of users:

Regular users who need to transform lean text to full text or vice versa and can accept default parameters for the environment.
Advanced users who need finer control and extra features when transforming lean text to full text or vice versa, or need to specify non-default parameters for the environment.
Developers of processors for new types of structured text.

4.1 Support for Regular Users

This support is provided by the package "org.eclipse.equinox.bidi". It is appropriate when the following conditions are satisfied:

There exists an appropriate handler for the type of the structured text.
There is no need to specify non-default conditions related to the environment.
The only operations needed are to transform lean text into full text or vice versa.
There is no interdependence between the processing of a given string and the processing of preceding or succeeding strings.

In this case, the user will essentially use the methods process() and deprocess() and their variants in class STextProcessor.

4.2 Support for Advanced Users

This support is provided by the package "org.eclipse.equinox.bidi.advanced" and the API specified mainly in its interface ISTextExpert. With this package, the user can:

process types of structured text other than those predefined in Eclipse.
specify a non-default environment.
pass state information between calls to text processing methods.
manage the offsets where directional formatting characters are inserted in the text.

A non-default environment can be instantiated with the class STextEnvironment. One of the items which can be specified is the orientation of the GUI component which will display the structured text. This orientation may have a number of values, and depending on its value and on the base text direction of the structured text, directional formatting characters may be added when transforming lean text to full text, as follows:

When the orientation is ORIENT_LTR and the structured text has a RTL base direction, RLE+RLM will be added at the head of the full text and RLM+PDF at its end.
When the orientation is ORIENT_RTL and the structured text has a LTR base direction, LRE+LRM will be added at the head of the full text and LRM+PDF at its end.
When the orientation is ORIENT_CONTEXTUAL_LTR or ORIENT_CONTEXTUAL_RTL and the data content would resolve to a RTL orientation while the structured text has a LTR base direction, LRM will be added at the head of the full text.
When the orientation is ORIENT_CONTEXTUAL_LTR or ORIENT_CONTEXTUAL_RTL and the data content would resolve to a LTR orientation while the structured text has a RTL base direction, RLM will be added at the head of the full text.
When the orientation is ORIENT_UNKNOWN and the structured text has a LTR base direction, LRE+LRM will be added at the head of the full text and LRM+PDF at its end.
When the orientation is ORIENT_UNKNOWN and the structured text has a RTL base direction, RLE+RLM will be added at the head of the full text and RLM+PDF at its end.
When the orientation is ORIENT_IGNORE, nothing is added as either prefix or suffix of the full text.
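
The rules listed above can be summarized in a small decision function. This Python sketch is only an illustration: the orientation values are plain string stand-ins for the constants named in the list, the function name affixes is hypothetical, and the way a contextual component resolves its orientation from the data content is supplied by the caller rather than computed.

```python
LRM, RLM = "\u200e", "\u200f"
LRE, RLE, PDF = "\u202a", "\u202b", "\u202c"

# String stand-ins for the orientation values named above.
ORIENT_LTR, ORIENT_RTL = "ltr", "rtl"
ORIENT_CONTEXTUAL_LTR, ORIENT_CONTEXTUAL_RTL = "contextual_ltr", "contextual_rtl"
ORIENT_UNKNOWN, ORIENT_IGNORE = "unknown", "ignore"


def affixes(orientation: str, base_rtl: bool, content_rtl: bool = False):
    """Return the (prefix, suffix) added around the full text.

    content_rtl says how a contextual component would resolve the data
    content; it is only consulted for the two contextual orientations."""
    embed_rtl = (RLE + RLM, RLM + PDF)
    embed_ltr = (LRE + LRM, LRM + PDF)
    if orientation == ORIENT_LTR:
        return embed_rtl if base_rtl else ("", "")
    if orientation == ORIENT_RTL:
        return ("", "") if base_rtl else embed_ltr
    if orientation in (ORIENT_CONTEXTUAL_LTR, ORIENT_CONTEXTUAL_RTL):
        if content_rtl and not base_rtl:
            return (LRM, "")
        if base_rtl and not content_rtl:
            return (RLM, "")
        return ("", "")
    if orientation == ORIENT_UNKNOWN:
        return embed_rtl if base_rtl else embed_ltr
    return ("", "")  # ORIENT_IGNORE


prefix, suffix = affixes(ORIENT_RTL, base_rtl=False)
print([f"U+{ord(c):04X}" for c in prefix + suffix])
```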

4.3 Support for New Types of Structured Text

Developers wishing to create handlers for types of structured text not currently supported by Eclipse "out of the box" will create extensions for class STextTypeHandler in package "org.eclipse.equinox.bidi.advanced". They will probably also use the other classes in this package.

The best way to learn how to write a type handler, beyond the javadoc in the "advanced" package, is to study the code of existing type handlers. Start with very simple ones like STextComma, then proceed to somewhat more complex ones like STextMath and STextEmail, then to the most complex ones like STextJava and STextRegex (all to be found in the "org.eclipse.equinox.bidi.internal.consumable" package). The longest one has fewer than 300 lines of source code (including comments and blank lines), so this is a fairly light task.

Plug-ins which implement new types of structured text handlers for general use should register them using the extension point bidiTypes (identifier "org.eclipse.equinox.bidi.bidiTypes" in plugin "org.eclipse.equinox.bidi").
