Structured Text: what it is
and how to handle it in Eclipse

Author Matitiahu Allouche (matial@il.ibm.com)
Date 2012-02-07
This Version 1.1
Previous Version 1.0

 

 

Change History

Version 1.1
Version 1.0

 

 


1. Introduction

 

1.1 The Need

Languages like Arabic and Hebrew are generally written from right to left, but numbers and English phrases embedded in them must be written from left to right. This is the origin of the term "bidirectional", which is applied to these languages.

In most computer environments, text is stored in logical order (the order in which it is read) and is reordered into visual order for presentation. For plain text, the Unicode Bidirectional Algorithm (UBA) generally specifies satisfactorily how to reorder bidirectional text for display. This algorithm, or something close to it, is implemented in the presentation systems of a number of platforms, which gives them sound basic bidirectional support.

However, not all bidirectional text is plain text. Some text is structured to follow a given syntax, which should be reflected in the display order. The general algorithm, which has no awareness of these special cases, often gives incorrect results when displaying such structured text.

This document describes various examples of this issue and proposes a methodology to solve the related problems. The types of structured text treated in this document are all taken from actual products, including Eclipse.

For a general introduction to bidirectional concepts, the reader is kindly referred to the following technical article: "Bidirectional script support: a primer" available at http://www-128.ibm.com/developerworks/websphere/library/techarticles/bidi/bidigen.html.

 

1.2 Goal of this Document

The goal of this document is to provide a comprehensive and consistent solution for various cases where bidirectional text must be displayed in a specific way. This document provides a high-level design for such cases, based on general principles, and describes how to implement this design using appropriate packages in Eclipse.

 

1.3 Abbreviations Used in this Document

UBA    Unicode Bidirectional Algorithm
Bidi   Bidirectional
GUI    Graphical User Interface
UI     User Interface
LTR    Left to Right
RTL    Right to Left
LRM    Left-to-Right Mark
RLM    Right-to-Left Mark
LRE    Left-to-Right Embedding
RLE    Right-to-Left Embedding
PDF    Pop Directional Formatting

 

1.4 Known Limitations

The proposed solution makes extensive use of the LRM, RLM, LRE, RLE and PDF directional controls, which are invisible but affect the way bidi text is displayed. The following related key points merit special attention:

 

2. Design Overview

2.1 General Definitions, Terminology and Conventions

Every instance of bidi text has a base text direction. Bidi text in Arabic or Hebrew has an RTL base direction, even if it includes numbers or Latin phrases which are written from left to right. Bidi text in English or Greek has an LTR base direction, even if it includes Arabic or Hebrew phrases which are written from right to left.

Structured expressions also have a base text direction, which is often determined by the type of structured expression, but may also be affected by the content of the expression (whether it contains Arabic or Hebrew words).

This document addresses two groups of problematic cases:

  1. Expressions with simple internal structure: this category covers cases in which strings are concatenated in simple ways using known separators. For example: variable names, "name = value" specifications, file paths, etc.
     
  2. Expressions with complex internal structure: this category covers structured text like regular expressions, XPath expressions and Java code. It differs from the previous one in that its expressions have a specific syntax which cannot be described as a concatenation of string segments using separators.

We will see that the same algorithms can handle both groups, with some adaptations in the details.

In the examples appearing in this document, upper case Latin letters represent Arabic or Hebrew text, lower case Latin letters represent English text.

"@" represents an LRM, "&" represents an RLM.

Notations like LRE+LRM represent character LRE immediately followed by character LRM.
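
To experiment with the examples, the notation above can be converted into real Unicode text. The following Java sketch is only an illustration: the class name BidiNotation and its method are hypothetical, and mapping upper case letters to Hebrew letters (rather than Arabic ones) is an arbitrary assumption.

```java
/** Illustrative helper (hypothetical name): converts the example notation into real Unicode text. */
final class BidiNotation {
    static final char LRM = '\u200E'; // Left-to-Right Mark
    static final char RLM = '\u200F'; // Right-to-Left Mark

    /** Upper case A-Z become Hebrew letters, '@' becomes LRM, '&' becomes RLM. */
    static String toUnicode(String pseudo) {
        StringBuilder sb = new StringBuilder(pseudo.length());
        for (int i = 0; i < pseudo.length(); i++) {
            char c = pseudo.charAt(i);
            if (c >= 'A' && c <= 'Z') {
                // Assumption: U+05D0 (Hebrew letter alef) onwards stands in for "Arabic or Hebrew".
                sb.append((char) (0x05D0 + (c - 'A')));
            } else if (c == '@') {
                sb.append(LRM);
            } else if (c == '&') {
                sb.append(RLM);
            } else {
                sb.append(c); // lower case letters, digits and punctuation are kept as-is
            }
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // Print the code points so that the invisible marks can be seen.
        for (char c : toUnicode("PRIORITY@=5").toCharArray()) {
            System.out.printf("U+%04X ", (int) c);
        }
        System.out.println();
    }
}
```

Such a helper cannot represent a literal "@" or "&" in the text, so it is only suitable for examples that do not use these characters as separators.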

 

2.2 Bidirectional Control Characters

When bidi text is displayed incorrectly, it is often possible to cure the problem by adding bidi control characters at appropriate locations in the text. There are 7 bidi control characters: LRM, RLM, LRE, RLE, LRO, RLO and PDF. Since this design has no use for LRO and RLO (Left-to-Right Override and Right-to-Left Override, respectively), the following paragraphs describe the effect of the 5 other characters.

Note that pieces of text bracketed between LRE/PDF or RLE/PDF can be contained within larger pieces of text themselves bracketed between LRE/PDF or RLE/PDF. This is why the "E" of LRE and RLE means "embedding". This could happen if we have, for instance, a Hebrew sentence containing an English phrase itself containing an Arabic segment. In practice, such complex cases should be avoided if possible. The present design does not use more than one level of LRE/PDF or RLE/PDF, except possibly in section 3.12 Message with Placeholders.

 

2.3 Bidi Classification

Characters can be classified according to their bidi type as described in the Unicode Standard (see Bidirectional_Character_Types for a full description of the bidi types). For our purpose, we will distinguish the following types of characters:

 

2.4 Text Analysis

In all the structured expressions that we are addressing, we can see characters with a special syntactical role that we will call "separators", and pieces of text between separators that we will call "tokens". The separators vary according to the type of structured expression. Often they are punctuation marks like colon (:), backslash (\) and full stop (.), or mathematical signs like plus (+) or equals (=).
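
As a rough illustration of this vocabulary, the Java sketch below locates separators and tokens in a string, given the set of separator characters for a type of expression. The class and method names are hypothetical, and the sketch ignores the complex cases (such as quoted literals or comments) where a separator character does not act as a separator.

```java
import java.util.ArrayList;
import java.util.List;

/** Illustrative scanner (hypothetical name) for simple separator-based expressions. */
final class SeparatorScanner {

    /** Returns the tokens of the text, dropping the separators and any empty tokens. */
    static List<String> tokens(String text, String separators) {
        List<String> result = new ArrayList<>();
        int start = 0;
        for (int i = 0; i <= text.length(); i++) {
            boolean atEnd = i == text.length();
            if (atEnd || separators.indexOf(text.charAt(i)) >= 0) {
                if (i > start) {
                    result.add(text.substring(start, i));
                }
                start = i + 1;
            }
        }
        return result;
    }

    /** Returns the offsets of the separator characters found in the text. */
    static List<Integer> separatorOffsets(String text, String separators) {
        List<Integer> offsets = new ArrayList<>();
        for (int i = 0; i < text.length(); i++) {
            if (separators.indexOf(text.charAt(i)) >= 0) {
                offsets.add(i);
            }
        }
        return offsets;
    }

    public static void main(String[] args) {
        String path = "c:\\dir1\\dir2\\myfile.ext";
        System.out.println(tokens(path, ":\\."));           // [c, dir1, dir2, myfile, ext]
        System.out.println(separatorOffsets(path, ":\\.")); // [1, 2, 7, 12, 19]
    }
}
```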

Our objective is that the relative progression of the tokens and separators for display should always follow the base text direction of the text, while each token will go LTR or RTL depending on its content and according to the UBA.

For this to happen, the following must be done:

  1. Parse the expression to locate the separators and the tokens.
     
  2. While parsing, note the bidi classification of characters parsed.
     
  3. Depending on the bidi types of the characters before a token and in that token, an LRM or an RLM may have to be added. The algorithm for this is detailed below.
     
  4. If the expression has an LTR base direction and the component where it is displayed has an RTL orientation, add LRE+LRM at the beginning of the expression and LRM+PDF at its end.
     
  5. If the expression has an RTL base direction and the component where it is displayed has an LTR orientation, add RLE+RLM at the beginning of the expression and RLM+PDF at its end.
     

The original structured expression, before addition of directional formatting characters, is called lean text.

The processed expression, after addition of directional formatting characters, is called full text.
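
Steps 4 and 5 above can be sketched in Java as follows. This is a minimal illustration, not the Eclipse implementation: the class, enum and method names are hypothetical, and the text passed in is assumed to have already gone through steps 1 to 3.

```java
/** Illustrative wrapper (hypothetical names) for steps 4 and 5 of the text analysis. */
final class EmbeddingWrapper {
    static final char LRM = '\u200E'; // Left-to-Right Mark
    static final char RLM = '\u200F'; // Right-to-Left Mark
    static final char LRE = '\u202A'; // Left-to-Right Embedding
    static final char RLE = '\u202B'; // Right-to-Left Embedding
    static final char PDF = '\u202C'; // Pop Directional Formatting

    enum Direction { LTR, RTL }

    /**
     * Wraps the expression only when its base direction differs from the
     * orientation of the component which displays it.
     */
    static String wrapForComponent(String text, Direction base, Direction component) {
        if (base == component) {
            return text; // same direction: no embedding needed
        }
        if (base == Direction.LTR) {
            return "" + LRE + LRM + text + LRM + PDF; // step 4
        }
        return "" + RLE + RLM + text + RLM + PDF;     // step 5
    }

    public static void main(String[] args) {
        String full = wrapForComponent("name=value", Direction.LTR, Direction.RTL);
        // The full text is 4 characters longer than the text passed in.
        System.out.println(full.length() - "name=value".length());
    }
}
```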

 

2.5 LRM Addition (structured text with LTR base text direction)

An LRM will be added before a token if the following conditions are satisfied:

Examples (strings in logical order where "@" represents where an LRM should be added):

   HEBREW @= ARABIC
   HEBREW @= 123

OR

Examples (strings in logical order where "@" represents where an LRM should be added):

   ARABIC NUMBER 123 @< MAX
   ARABIC NUMBER 123 @< 456
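
The Java sketch below shows one way such an insertion could be coded. It is not the algorithm of this design: it uses a single simplified rule inferred from the examples only (an LRM is inserted before a separator when the last strong character before it is RTL and the first strong character or number after it is RTL or a number). The class and method names are hypothetical, and bidi types are taken from java.lang.Character.getDirectionality.

```java
/** Illustrative sketch (hypothetical names): simplified LRM insertion for LTR base direction. */
final class LrmInsertionSketch {
    static final char LRM = '\u200E'; // Left-to-Right Mark

    static boolean isStrongRtl(char c) {
        byte d = Character.getDirectionality(c);
        return d == Character.DIRECTIONALITY_RIGHT_TO_LEFT
                || d == Character.DIRECTIONALITY_RIGHT_TO_LEFT_ARABIC;
    }

    static boolean isStrongLtr(char c) {
        return Character.getDirectionality(c) == Character.DIRECTIONALITY_LEFT_TO_RIGHT;
    }

    static boolean isNumber(char c) {
        byte d = Character.getDirectionality(c);
        return d == Character.DIRECTIONALITY_EUROPEAN_NUMBER
                || d == Character.DIRECTIONALITY_ARABIC_NUMBER;
    }

    /** True if the first strong character or number at or after 'from' is RTL or a number. */
    static boolean startsRtlOrNumber(String s, int from) {
        for (int i = from; i < s.length(); i++) {
            char c = s.charAt(i);
            if (isStrongRtl(c) || isNumber(c)) return true;
            if (isStrongLtr(c)) return false;
        }
        return false;
    }

    /** Transforms lean text into full text under the simplified rule (assumption). */
    static String leanToFull(String lean, String separators) {
        StringBuilder full = new StringBuilder(lean.length() + 8);
        boolean lastStrongIsRtl = false;
        for (int i = 0; i < lean.length(); i++) {
            char c = lean.charAt(i);
            if (separators.indexOf(c) >= 0) {
                if (lastStrongIsRtl && startsRtlOrNumber(lean, i + 1)) {
                    full.append(LRM);
                }
            } else if (isStrongRtl(c)) {
                lastStrongIsRtl = true;
            } else if (isStrongLtr(c)) {
                lastStrongIsRtl = false;
            }
            full.append(c);
        }
        return full.toString();
    }

    public static void main(String[] args) {
        String heb = "\u05D0\u05D1\u05D2"; // three Hebrew letters standing for an RTL token
        String full = leanToFull(heb + "=5", "=");
        for (char c : full.toCharArray()) {
            System.out.printf("U+%04X ", (int) c); // code points make the LRM visible
        }
        System.out.println();
    }
}
```

Because the rule is a simplification, it should be checked against the conditions of this section before being relied upon; it also treats every occurrence of a separator character as a separator, which is not valid for the complex cases of chapter 3.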

 

2.6 RLM Addition (structured text with RTL base text direction)

An RLM will be added before a token if the following conditions are satisfied:

Example (string in logical order where "&" represents where an RLM should be added):

   my_pet &= dog

 

3. Specific Cases

In this chapter, we consider in detail a number of specific cases and how the general solution applies to them. We start by discussing simple cases and progressively move to more complex ones.

Since the cases addressed are well known, we don't attempt to give a complete formal definition of their syntax, but rather submit one or more representative symbolic patterns.

The algorithms we propose assume that the text to process conforms syntactically to the pattern it implements. Syntax checking is not in the scope of this document. Incorrect syntax in the data may lead to anomalies in the presentation of bidi text.

Unless specified otherwise, our requirement for presentation in all the cases below is that the relative progression of the tokens and separators for display should always be from left to right, while the text of each token will go LTR or RTL depending on its content and according to the UBA.

 

3.1 Property File

Pattern

[variable name] = [value]

Limitation: variable names must not include equal signs.

Detailed Design

The general algorithm described in sections 2.4 Text Analysis and 2.5 LRM Addition above applies, with the following adaptations:

Example:
   Logical order (without LRM):   PRIORITY=5
   Display (without LRM):         5=YTIROIRP
   Logical order (with LRM):      PRIORITY@=5
   Display (with LRM):            YTIROIRP=5

 

3.2 Compound Name

Pattern

[first part] _ [second part] _ [third part]

Limitation: name parts must not include underscores.

Detailed Design

The general algorithm described in sections 2.4 Text Analysis and 2.5 LRM Addition above applies, with the following adaptation:

Example:
   Logical order (without LRM):   MYPACKAGE_MYPROGRAM
   Display (without LRM):         MARGORPYM_EGAKCAPYM
   Logical order (with LRM):      MYPACKAGE@_MYPROGRAM
   Display (with LRM):            EGAKCAPYM_MARGORPYM

 

3.3 Comma-delimited List

Pattern

[first list item] , [second list item] , . . . , [last list item]

Detailed Design

The general algorithm described in sections 2.4 Text Analysis and 2.5 LRM Addition above applies, with the following adaptations:

Example:
   Logical order (without LRM):   ABC,DE,FGH
   Display (without LRM):         HGF,ED,CBA
   Logical order (with LRM):      ABC@,DE@,FGH
   Display (with LRM):            CBA,ED,HGF

 

3.4 System, Userid Specification

Pattern

[system ID] ( [user ID] )

Limitation: the system ID must not include parentheses.

Detailed Design

The general algorithm described in sections 2.4 Text Analysis and 2.5 LRM Addition above applies, with the following adaptations:

Example:
   Logical order (without LRM):   MY_HOST(MY_USERID)
   Display (without LRM):         DIRESU_YM)TSOH_YM)
   Logical order (with LRM):      MY_HOST@(MY_USERID)
   Display (with LRM):            TSOH_YM(DIRESU_YM)

 

3.5 Full Path - Relative Path - File Name

Patterns

Windows full path: [drive letter]:\ [sub-path] \ . . . \ [sub-path]

Windows relative path: [sub-path] \ . . . \ [sub-path]

Windows full file path: [drive letter]:\ [sub-path] \ . . . \ [sub-path] \ [file name] . [extension]

Windows relative file path: [sub-path] \ . . . \ [sub-path] \ [file name] . [extension]

Linux full path: / [sub-path] / . . . / [sub-path]

Linux relative path: [sub-path] / . . . / [sub-path]

Linux full file path: / [sub-path] / . . . / [sub-path] / [file name] . [extension]

Linux relative file path: [sub-path] / . . . / [sub-path] / [file name] . [extension]

Detailed Design

The general algorithm described in sections 2.4 Text Analysis and 2.5 LRM Addition above applies, with the following adaptation:

Example:
   Logical order (without LRM):   c:\DIR1\DIR2\MYFILE.ext
   Display (without LRM):         c:\ELIFYM\2RID\1RID.ext
   Logical order (with LRM):      c:\DIR1@\DIR2@\MYFILE.ext
   Display (with LRM):            c:\1RID\2RID\ELIFYM.ext

 

3.6 URL, URI, IRI

Patterns

http:// [domain label] . . . . . [domain label]

http:// [domain label] . . . . . [domain label] / [sub-path] / . . . / [sub-path] / [file name] . [extension]

http:// [domain label] . . . . . [domain label] / [sub-path] / . . . / [sub-path] / [file name] . [extension] # [local reference]

http:// [domain label] . . . . . [domain label] / [sub-path] / . . . / [sub-path] / [file name] . [extension] ? [key1] = [value1] & [key2] = [value2]

Detailed Design

The general algorithm described in sections 2.4 Text Analysis and 2.5 LRM Addition above applies, with the following adaptations:

Example:
   Logical order (without LRM):   www.DOC.MYDOMAIN.com/HEB/LESSON1.html
   Display (without LRM):         www.NIAMODYM.COD.com/1NOSSEL/BEH.html
   Logical order (with LRM):      www.DOC@.MYDOMAIN.com/HEB@/LESSON1.html
   Display (with LRM):            www.COD.NIAMODYM.com/BEH/1NOSSEL.html

 

3.7 Mathematical Formula

Requirement

Preserve the relative order of the formula components according to the base text direction of the formula.

Detailed Design

The general algorithm described in sections 2.4 Text Analysis and 2.5 LRM Addition above applies, with the following adaptations:

Example (Hebrew):
   Logical order (without LRM):   PROFIT = REVENUE - COST
   Display (without LRM):         TSOC - EUNEVER = TIFORP
   Logical order (with LRM):      PROFIT @= REVENUE @- COST
   Display (with LRM):            TIFORP = EUNEVER - TSOC
Example (Arabic, ampersand represents RLM):
   Logical order (without RLM):   DIVIDEND = SHARE x 0.10
   Display (without RLM):         x 0.10 ERAHS = DNEDIVID
   Logical order (with RLM):      DIVIDEND = SHARE x& 0.10
   Display (with RLM):            0.10 x ERAHS = DNEDIVID

 

3.8 Regular Expression

Requirement

Keep the relative order of the regular expression components identical to the order in which they appear when only Latin characters are used.

Detailed Design

The general algorithm described in sections 2.4 Text Analysis and 2.5 LRM Addition above applies, with the following adaptations:

Example (Hebrew):
   Logical order (without LRM):   ABC(?'DEF'GHI
   Display (without LRM):         IHG'FED'?(CBA
   Logical order (with LRM):      A@B@C@(?'DEF'@G@H@I
   Display (with LRM):            ABC(?'FED'GHI
Example (Arabic):
   Logical order (without LRM):   ABC(?'DEF'GHI
   Display (without LRM):         IHG'FED'?(CBA
   Logical order (with LRM):      ABC(?'DEF'GHI
   Display (with LRM):            IHG'FED'?(CBA

 

3.9 Java Code

Requirement

We can classify elements of a Java program as:

The requirement is to make the relative order of elements left-to-right, while each element by itself will be presented according to the UBA.

Detailed Design

The general algorithm described in sections 2.4 Text Analysis and 2.5 LRM Addition above applies, with the following adaptations:

Example:
   Logical order (without LRM):   A = /*B+C*/ D;
   Display (without LRM):         D /*C+B*/ = A;
   Logical order (with LRM):      A@ = /*B+C@*/ D;
   Display (with LRM):            A = /*C+B*/ D;

 

3.10 Other Programming Languages

Other programming languages can be handled like Java, with adaptation to the characteristics of each language.

 

3.11 XPath

Patterns

/ book / chapter / paragraph

/ year / month [@name = "April"]

Detailed Design

The general algorithm described in sections 2.4 Text Analysis and 2.5 LRM Addition above applies, with the following adaptations:

Example:
   Logical order (without LRM):   DEF!GHI 'A!B'=JK
   Display (without LRM):         KJ='B!A' IHG!FED
   Logical order (with LRM):      DEF@!GHI@ 'A!B'@=JK
   Display (with LRM):            FED!IHG 'B!A'=KJ

 

3.12 Message with Placeholders

Products often use template messages where placeholders are replaced by custom data at run time.

Requirement

The display considerations must ensure correct presentation of both the template text and the custom data replacing the placeholders, taking into account that these data might have an internal structure, which should be preserved.

Detailed Design

  1. The message template will be considered as having an LTR base direction if it is not translated, and an RTL base direction if it is translated to Arabic or Hebrew.
  2. Let us call "insertion unit" a piece of custom data which is to replace a placeholder. Insertion units with an internal structure also have a defined base direction, generally LTR. For insertion units without internal structure, their base direction will be defined as RTL if they contain at least one Arabic or Hebrew letter, LTR otherwise.
  3. Each insertion unit with an internal structure must be processed according to its specific structure.
  4. If the base direction of an insertion unit is the same as that of the template, there is nothing more to do for it.
  5. If the base direction of the template is LTR and the base direction of an insertion unit is RTL, the insertion unit should have RLE+RLM added at its beginning and RLM+PDF added at its end.
  6. If the base direction of the template is RTL and the base direction of an insertion unit is LTR, the insertion unit should have LRE+LRM added at its beginning and LRM+PDF added at its end.
  7. If the component in which the formatted message is displayed has an orientation different from the template direction, the formatted message must have LRE+LRM added to its beginning and LRM+PDF added to its end if its base direction is LTR, RLE+RLM added to its beginning and RLM+PDF added to its end if its base direction is RTL.
Example (">" represents LRE, "^" represents PDF, "@" represents LRM):
   Logical order (template without LRM):   err012: FILE "%1" NOT FOUND!
   Logical order (%1 without LRM):         c:\DIR1\MYFILE.ext
   Display (without LRM):                  !DNUOF TON "ext.ELIFYM\1RID\:c" ELIF :err012
   Logical order (with LRM):               err012: FILE ">@c:\DIR1@\MYFILE.ext@^" NOT FOUND!
   Display (with LRM):                     !DNUOF TON "c:\1RID\ELIFYM.ext" ELIF :err012
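
Steps 2 and 4 to 6 of the design above can be sketched in Java as follows. The class and method names are hypothetical, the "%1" placeholder syntax is taken from the example, and step 3 (processing a structured insertion unit according to its own structure) and step 7 are assumed to be done separately. Note that the sketch detects RTL letters through their bidi type, which also accepts letters of RTL scripts other than Arabic and Hebrew.

```java
/** Illustrative sketch (hypothetical names) of embedding an insertion unit in a template. */
final class PlaceholderEmbedding {
    static final String LRE_LRM = "\u202A\u200E"; // Left-to-Right Embedding + Left-to-Right Mark
    static final String LRM_PDF = "\u200E\u202C"; // Left-to-Right Mark + Pop Directional Formatting
    static final String RLE_RLM = "\u202B\u200F"; // Right-to-Left Embedding + Right-to-Left Mark
    static final String RLM_PDF = "\u200F\u202C"; // Right-to-Left Mark + Pop Directional Formatting

    /** Step 2: base direction test for an insertion unit without internal structure. */
    static boolean containsRtlLetter(String s) {
        for (int i = 0; i < s.length(); i++) {
            byte d = Character.getDirectionality(s.charAt(i));
            if (d == Character.DIRECTIONALITY_RIGHT_TO_LEFT
                    || d == Character.DIRECTIONALITY_RIGHT_TO_LEFT_ARABIC) {
                return true;
            }
        }
        return false;
    }

    /** Steps 4 to 6: embed the unit only when its direction differs from the template's. */
    static String embed(String unit, boolean unitIsRtl, boolean templateIsRtl) {
        if (unitIsRtl == templateIsRtl) {
            return unit;                       // step 4
        }
        return unitIsRtl
                ? RLE_RLM + unit + RLM_PDF     // step 5
                : LRE_LRM + unit + LRM_PDF;    // step 6
    }

    /** Replaces the "%1" placeholder with the (possibly embedded) insertion unit. */
    static String format(String template, boolean templateIsRtl, String unit, boolean unitIsRtl) {
        return template.replace("%1", embed(unit, unitIsRtl, templateIsRtl));
    }

    public static void main(String[] args) {
        // A structured unit (file path, base direction LTR) inserted into an untranslated template.
        System.out.println(format("err012: file \"%1\" not found!", false, "c:\\dir1\\myfile.ext", false));
    }
}
```

For an insertion unit without internal structure, the last argument of format() would simply be the result of containsRtlLetter() applied to that unit.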

 

4. Eclipse Support for Structured Text Presentation

Eclipse provides support for correct presentation of structured text. This support is oriented towards three categories of users:

 

4.1 Support for Regular Users

This support is provided by the package "org.eclipse.equinox.bidi". It is appropriate when the following conditions are satisfied:

In this case, the user will essentially use the methods process() and deprocess() and their variants in class STextProcessor.

 

4.2 Support for Advanced Users

This support is provided by the package "org.eclipse.equinox.bidi.advanced" and the API specified mainly in its interface ISTextExpert. With this package, the user can:

A non-default environment can be instantiated with the class STextEnvironment. One of the items which can be specified is the orientation of the GUI component which will display the structured text. This orientation may have a number of values, and depending on its value and on the base text direction of the structured text, directional formatting characters may be added when transforming lean text to full text, as follows:

 

4.3 Support for New Types of Structured Text

Developers wishing to create handlers for types of structured text not currently supported by Eclipse "out of the box" will extend class STextTypeHandler in package "org.eclipse.equinox.bidi.advanced". They will probably also use the other classes in this package.

The best way to learn how to write a type handler, beyond the javadoc in the "advanced" package, is to study the code of existing type handlers. Start with very simple ones like STextComma, then proceed to somewhat more complex ones like STextMath and STextEmail, then to the most complex ones like STextJava and STextRegex (all to be found in the "org.eclipse.equinox.bidi.internal.consumable" package). The longest one has fewer than 300 lines of source code (including comments and blank lines), so this is a fairly light task.

Plug-ins which implement new types of structured text handlers for general use should register them using the extension point bidiTypes (identifier "org.eclipse.equinox.bidi.bidiTypes" in plug-in "org.eclipse.equinox.bidi").