Author | Matitiahu Allouche (matial@il.ibm.com) |
Date | 2012-02-07 |
This Version | 1.1 |
Previous Version | 1.0 |
Languages like Arabic and Hebrew are generally written from right to left, but included numbers and phrases in English must be written from left to right. This is the origin of the term "bidirectional" which qualifies these languages.
In most computer environments, the text is stored in logical order (the order the text is read) but is reordered into visual order for presentation. For plain text, the Unicode Bidirectional Algorithm (UBA) generally specifies satisfactorily how to reorder bidirectional text for display. This algorithm, or close to it, is implemented in the presentation systems of a number of platforms, giving them a good handle on bidirectional support.
However, not all bidirectional text is plain text. Some text is structured to follow a given syntax, and that syntax should be reflected in the display order. The general algorithm, which has no awareness of these special cases, often gives incorrect results when displaying such structured text.
This document describes various examples of this issue, and proposes a methodology to solve the related problems. The types of structured text treated in this document are all excerpted from actual products, including Eclipse.
For a general introduction to bidirectional concepts, the reader is kindly referred to the following technical article: "Bidirectional script support: a primer" available at http://www-128.ibm.com/developerworks/websphere/library/techarticles/bidi/bidigen.html.
The goal of this document is to provide a comprehensive and consistent solution for various cases where bidirectional text must be displayed in a specific way. This document provides a high level design for such cases, based on general principles, and describes how to implement this design using appropriate packages in Eclipse.
The proposed solution makes extensive use of the LRM, RLM, LRE, RLE and PDF directional controls, which are invisible but affect the way bidi text is displayed. The following related key points merit special attention:
Every instance of bidi text has a base text direction. Bidi text in Arabic or Hebrew has a RTL base direction, even if it includes numbers or Latin phrases which are written from left to right. Bidi text in English or Greek has a LTR base direction, even if it includes Arabic or Hebrew phrases which are written from right to left.
Structured expressions also have a base text direction, which is often determined by the type of structured expression, but may also be affected by the content of the expression (whether it contains Arabic or Hebrew words).
This document addresses two groups of problematic cases:
We will see that the same algorithms can handle both groups, with some adaptations in the details.
In the examples appearing in this document, upper case Latin letters represent Arabic or Hebrew text, lower case Latin letters represent English text.
"@" represents an LRM, "&" represents an RLM.
Notations like LRE+LRM represent character LRE immediately followed by character LRM.
When there are problems of wrong display of bidi text, it is often possible to cure them by adding some bidi control characters at appropriate locations in the text. There are 7 bidi control characters: LRM, RLM, LRE, RLE, LRO, RLO and PDF. Since this design has no use for LRO and RLO (Left-to-Right and Right-to-Left Override, respectively), the following paragraphs will describe the effect of the 5 other characters.
Note that pieces of text bracketed between LRE/PDF or RLE/PDF can be contained within larger pieces of text themselves bracketed between LRE/PDF or RLE/PDF. This is why the "E" of LRE and RLE means "embedding". This could happen if we have for instance a Hebrew sentence containing an English phrase itself containing an Arabic segment. In practice, such complex cases should be avoided if possible. The present design does not use more than one level of LRE/PDF or RLE/PDF, except possibly in section 3.8 Message with Placeholder.
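As a small illustration of the controls discussed above, the following Java sketch declares the five characters used by this design with their Unicode code points and shows how a piece of text can be bracketed between LRE/PDF or RLE/PDF. The class and method names (`BidiControls`, `embedLtr`, `embedRtl`, `strip`) are hypothetical and are not part of any Eclipse API.

```java
// Hypothetical helper, for illustration only.
final class BidiControls {
    static final char LRM = '\u200E'; // Left-to-Right Mark
    static final char RLM = '\u200F'; // Right-to-Left Mark
    static final char LRE = '\u202A'; // Left-to-Right Embedding
    static final char RLE = '\u202B'; // Right-to-Left Embedding
    static final char PDF = '\u202C'; // Pop Directional Formatting

    /** Brackets the text between LRE and PDF (an LTR embedding). */
    static String embedLtr(String text) {
        return new StringBuilder().append(LRE).append(text).append(PDF).toString();
    }

    /** Brackets the text between RLE and PDF (an RTL embedding). */
    static String embedRtl(String text) {
        return new StringBuilder().append(RLE).append(text).append(PDF).toString();
    }

    /** Removes the five controls above; a naive way to recover the bare text. */
    static String strip(String text) {
        StringBuilder sb = new StringBuilder(text.length());
        for (int i = 0; i < text.length(); i++) {
            char c = text.charAt(i);
            if (c != LRM && c != RLM && c != LRE && c != RLE && c != PDF) {
                sb.append(c);
            }
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // A Hebrew letter, an embedded English phrase, another Hebrew letter:
        // one LRE/PDF pair nested inside one RLE/PDF pair.
        String nested = embedRtl("\u05D0 " + embedLtr("abc") + " \u05D1");
        System.out.println(strip(nested));
    }
}
```

The nested call in `main` corresponds to the two-level case mentioned above, which this design tries to avoid where possible.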
Characters can be classified according to their bidi type as described in the Unicode Standard (see Bidirectional_Character_Types for a full description of the bidi types). For our purpose, we will distinguish the following types of characters:
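Java exposes the Unicode bidi type of a character through `Character.getDirectionality`. The sketch below groups those types into four coarse kinds. The grouping and the names `BidiKind` and `classify` are assumptions made for illustration; they do not necessarily match the exact list of types used by this design.

```java
// Hypothetical classification helper, for illustration only.
final class BidiKind {
    enum Kind { STRONG_LTR, STRONG_RTL, NUMBER, OTHER }

    /** Maps the Unicode bidi type of c to a coarse kind. */
    static Kind classify(char c) {
        switch (Character.getDirectionality(c)) {
            case Character.DIRECTIONALITY_LEFT_TO_RIGHT:         // L
                return Kind.STRONG_LTR;
            case Character.DIRECTIONALITY_RIGHT_TO_LEFT:         // R (e.g. Hebrew)
            case Character.DIRECTIONALITY_RIGHT_TO_LEFT_ARABIC:  // AL (e.g. Arabic)
                return Kind.STRONG_RTL;
            case Character.DIRECTIONALITY_EUROPEAN_NUMBER:       // EN
            case Character.DIRECTIONALITY_ARABIC_NUMBER:         // AN
                return Kind.NUMBER;
            default:                                             // separators, whitespace, etc.
                return Kind.OTHER;
        }
    }

    public static void main(String[] args) {
        char[] samples = { 'a', '\u05D0', '\u0627', '5', '=', ' ' };
        for (char c : samples) {
            System.out.println(Integer.toHexString(c) + " -> " + classify(c));
        }
    }
}
```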
In all the structured expressions that we are addressing, we can see characters with a special syntactical role that we will call "separators", and pieces of text between separators that we will call "tokens". The separators vary according to the type of structured expression. Often they are punctuation signs like colon (:), backslash (\) and full stop (.), or mathematical signs like Plus (+) or Equal (=).
Our objective is that the relative progression of the tokens and separators for display should always follow the base text direction of the text, while each token will go LTR or RTL depending on its content and according to the UBA.
For this to happen, the following must be done:
The original structured expression, before addition of directional formatting characters, is called lean text.
The processed expression, after addition of directional formatting characters, is called full text.
A LRM will be added before a token if the following conditions are satisfied:
Examples (strings in logical order where "@" represents where an LRM should be added):
HEBREW @= ARABIC
HEBREW @= 123
OR
Examples (strings in logical order where "@" represents where an LRM should be added):
ARABIC NUMBER 123 @< MAX
ARABIC NUMBER 123 @< 456
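The transformation from lean text to full text can be sketched in Java as follows. This is a deliberately simplified sketch, not the Eclipse implementation: it assumes an LTR base direction and inserts an LRM before a separator when the last strong character before it is right-to-left and the first strong character or digit after it is right-to-left or a number. That rule is an assumption chosen to agree with the examples in this document; the class and method names are hypothetical.

```java
// Simplified, hypothetical sketch of LRM addition for an LTR base direction.
final class LrmInsertionSketch {
    static final char LRM = '\u200E';

    static boolean isRtl(char c) {
        byte d = Character.getDirectionality(c);
        return d == Character.DIRECTIONALITY_RIGHT_TO_LEFT
                || d == Character.DIRECTIONALITY_RIGHT_TO_LEFT_ARABIC;
    }

    static boolean isLtr(char c) {
        return Character.getDirectionality(c) == Character.DIRECTIONALITY_LEFT_TO_RIGHT;
    }

    static boolean isNumber(char c) {
        byte d = Character.getDirectionality(c);
        return d == Character.DIRECTIONALITY_EUROPEAN_NUMBER
                || d == Character.DIRECTIONALITY_ARABIC_NUMBER;
    }

    /** True if the first strong character or digit at or after 'from' is RTL or a digit. */
    static boolean nextIsRtlOrNumber(String s, int from) {
        for (int i = from; i < s.length(); i++) {
            char c = s.charAt(i);
            if (isRtl(c) || isNumber(c)) return true;
            if (isLtr(c)) return false;
        }
        return false;
    }

    /** Adds LRMs before the separators that would otherwise be reordered (assumed rule). */
    static String leanToFull(String lean, String separators) {
        StringBuilder full = new StringBuilder(lean.length() + 8);
        boolean lastStrongIsRtl = false;
        for (int i = 0; i < lean.length(); i++) {
            char c = lean.charAt(i);
            if (separators.indexOf(c) >= 0) {
                if (lastStrongIsRtl && nextIsRtlOrNumber(lean, i + 1)) {
                    full.append(LRM);
                    lastStrongIsRtl = false; // the LRM itself is a strong LTR character
                }
            } else if (isRtl(c)) {
                lastStrongIsRtl = true;
            } else if (isLtr(c)) {
                lastStrongIsRtl = false;
            }
            full.append(c);
        }
        return full.toString();
    }

    public static void main(String[] args) {
        // Three Hebrew tokens separated by commas; '@' makes the LRMs visible.
        String lean = "\u05D0\u05D1,\u05D2,\u05D3";
        System.out.println(leanToFull(lean, ",").replace(LRM, '@'));
    }
}
```

Text made only of Latin tokens passes through unchanged, so the lean and full forms differ only where reordering would otherwise occur.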
A RLM will be added before a token if the following conditions are satisfied:
Example (string in logical order where "&" represents where an RLM should be added):
my_pet &= dog
In this chapter, we consider in detail a number of specific cases and how the general solution applies to them. We start by discussing simple cases and progressively move to more complex ones.
Since the cases addressed are well known, we don't attempt to give a complete formal definition of their syntax, but rather submit one or more representative symbolic patterns.
The algorithms we propose assume that the text to process conforms syntactically to the pattern it implements. Syntax checking is not in the scope of this document. Incorrect syntax in the data may also lead to anomalies in the presentation of bidi text.
Unless specified otherwise, our requirement for presentation in all the cases below is that the relative progression of the tokens and separators for display should always be from left to right, while the text of each token will go LTR or RTL depending on its content and according to the UBA.
[variable name] = [value]
Limitation: variable names must not include equal signs.
The general algorithm described in above sections 2.4 Text Analysis and 2.5 LRM Addition applies, with the following adaptations:
Logical order (without LRM): PRIORITY=5
Display (without LRM): 5=YTIROIRP
Logical order (with LRM): PRIORITY@=5
Display (with LRM): YTIROIRP=5
[first part] _ [second part] _ [third part]
Limitation: name parts must not include underscores.
The general algorithm described in above sections 2.4 Text Analysis and 2.5 LRM Addition applies, with the following adaptation:
Logical order (without LRM): MYPACKAGE_MYPROGRAM
Display (without LRM): MARGORPYM_EGAKCAPYM
Logical order (with LRM): MYPACKAGE@_MYPROGRAM
Display (with LRM): EGAKCAPYM_MARGORPYM
[first list item] , [second list item] , . . . , [last list item]
The general algorithm described in above sections 2.4 Text Analysis and 2.5 LRM Addition applies, with the following adaptations:
Logical order (without LRM): ABC,DE,FGH
Display (without LRM): HGF,ED,CBA
Logical order (with LRM): ABC@,DE@,FGH
Display (with LRM): CBA,ED,HGF
[system ID] ( [user ID] )
Limitation: the system ID must not include parentheses.
The general algorithm described in above sections 2.4 Text Analysis and 2.5 LRM Addition applies, with the following adaptations:
Logical order (without LRM): MY_HOST(MY_USERID)
Display (without LRM): DIRESU_YM)TSOH_YM)
Logical order (with LRM): MY_HOST@(MY_USERID)
Display (with LRM): TSOH_YM(DIRESU_YM)
Windows full path: [drive letter]:\ [sub-path] \ . . . \ [sub-path]
Windows relative path: [sub-path] \ . . . \ [sub-path]
Windows full file path: [drive letter]:\ [sub-path] \ . . . \ [sub-path] \ [file name] . [extension]
Windows relative file path: [sub-path] \ . . . \ [sub-path] \ [file name] . [extension]
Linux full path: / [sub-path] / . . . / [sub-path]
Linux relative path: [sub-path] / . . . / [sub-path]
Linux full file path: / [sub-path] / . . . / [sub-path] / [file name] . [extension]
Linux relative file path: [sub-path] / . . . / [sub-path] / [file name] . [extension]
The general algorithm described in above sections 2.4 Text Analysis and 2.5 LRM Addition applies, with the following adaptation:
Logical order (without LRM): c:\DIR1\DIR2\MYFILE.ext
Display (without LRM): c:\ELIFYM\2RID\1RID.ext
Logical order (with LRM): c:\DIR1@\DIR2@\MYFILE.ext
Display (with LRM): c:\1RID\2RID\ELIFYM.ext
http:// [domain label] . . . . . [domain label]
http:// [domain label] . . . . . [domain label] / [sub-path] / . . . / [sub-path] / [file name] . [extension]
http:// [domain label] . . . . . [domain label] / [sub-path] / . . . / [sub-path] / [file name] . [extension] # [local reference]
http:// [domain label] . . . . . [domain label] / [sub-path] / . . . / [sub-path] / [file name] . [extension] ? [key1] = [value1] & [key2] = [value2]
The general algorithm described in above sections 2.4 Text Analysis and 2.5 LRM Addition applies, with the following adaptations:
Logical order (without LRM): www.DOC.MYDOMAIN.com\HEB\LESSON1.html
Display (without LRM): www.NIAMODYM.COD.com\1NOSSEL\BEH.html
Logical order (with LRM): www.DOC@.MYDOMAIN.com\HEB@\LESSON1.html
Display (with LRM): www.COD.NIAMODYM.com\BEH\1NOSSEL.html
Preserve the relative order of the formula components according to the base text direction of the formula.
The general algorithm described in above sections 2.4 Text Analysis and 2.5 LRM Addition applies, with the following adaptations:
Logical order (without LRM): PROFIT = REVENUE - COST
Display (without LRM): TSOC - EUNEVER = TIFORP
Logical order (with LRM): PROFIT @= REVENUE @- COST
Display (with LRM): TIFORP = EUNEVER - TSOC
Example (Arabic, ampersand represents RLM):
Logical order (without RLM): DIVIDEND = SHARE x 0.10
Display (without RLM): x 0.10 ERAHS = DNEDIVID
Logical order (with RLM): DIVIDEND = SHARE x& 0.10
Display (with RLM): 0.10 x ERAHS = DNEDIVID
Preserve the relative order of the regular expression components identical to the order in which they appear when exclusively Latin characters are used.
The general algorithm described in above sections 2.4 Text Analysis and 2.5 LRM Addition applies, with the following adaptations:
Logical order (without LRM): ABC(?'DEF'GHI
Display (without LRM): IHG'FED'?(CBA
Logical order (with LRM): A@B@C@(?'DEF'@G@H@I
Display (with LRM): ABC(?'FED'GHI
Example (Arabic):
Logical order (without LRM): ABC(?'DEF'GHI
Display (without LRM): IHG'FED'?(CBA
Logical order (with LRM): ABC(?'DEF'GHI
Display (with LRM): IHG'FED'?(CBA
We can classify elements of a Java program as:
The requirement is to make the relative order of elements left-to-right, while each element by itself will be presented according to the UBA.
The general algorithm described in above sections 2.4 Text Analysis and 2.5 LRM Addition applies, with the following adaptations:
Logical order (without LRM): A = /*B+C*/ D;
Display (without LRM): D /*C+B*/ = A;
Logical order (with LRM): A@ = /*B+C@*/ D;
Display (with LRM): A = /*C+B*/ D;
Other programming languages can be handled like Java, with adaptation to the characteristics of each language.
/ book / chapter / paragraph
/ year / month [@name = "April"]
The general algorithm described in above sections 2.4 Text Analysis and 2.5 LRM Addition applies, with the following adaptations:
Logical order (without LRM): DEF!GHI 'A!B'=JK
Display (without LRM): KJ='B!A' IHG!FED
Logical order (with LRM): DEF@!GHI@ 'A!B'@=JK
Display (with LRM): FED!IHG 'B!A'=KJ
Products often use template messages where placeholders are replaced by custom data at run time.
The display considerations must ensure correct presentation of both the template text and the custom data replacing the placeholders, taking into account that these data might have an internal structure which should be preserved.
Logical order (template without LRM): err012: FILE "%1" NOT FOUND!
Logical order (%1 without LRM): c:\DIR1\MYFILE.ext
Display (without LRM): !DNUOF TON "ext.ELIFYM\1RID\:c" ELIF :err012
Logical order (with LRM): err012: FILE ">@c:\DIR1@\MYFILE.ext@^" NOT FOUND!
Display (with LRM): !DNUOF TON "c:\1RID\ELIFYM.ext" ELIF :err012
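The substitution step in the example above can be sketched in Java as follows. The sketch assumes that ">" and "^" in the example stand for LRE and PDF, that the placeholder value has already been transformed into full text, and that it has an LTR base direction inside an RTL message. The class and method names are hypothetical.

```java
// Hypothetical sketch of placeholder substitution, for illustration only.
final class PlaceholderSketch {
    static final char LRM = '\u200E';
    static final char LRE = '\u202A';
    static final char PDF = '\u202C';

    /** Wraps an LTR structured value as LRE+LRM ... LRM+PDF (assumed reading of ">@ ... @^"). */
    static String isolateLtr(String fullValue) {
        return new StringBuilder()
                .append(LRE).append(LRM)
                .append(fullValue)
                .append(LRM).append(PDF)
                .toString();
    }

    /** Replaces every occurrence of the placeholder with the wrapped value. */
    static String fill(String template, String placeholder, String fullValue) {
        return template.replace(placeholder, isolateLtr(fullValue));
    }

    public static void main(String[] args) {
        // Hebrew letters stand in for the template words; the value is a file path.
        String template = "err012: \u05D0\u05D1 \"%1\" \u05D2\u05D3!";
        String message = fill(template, "%1", "c:\\dir\\file.ext");
        System.out.println(message.length() - template.length());
    }
}
```

Wrapping the value this way keeps its internal structure intact regardless of the direction of the surrounding template text.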
Eclipse provides support for correct presentation of structured text. This support is oriented towards three categories of users:
This support is provided by the package "org.eclipse.equinox.bidi". It is appropriate when the following conditions are satisfied:
In this case, the user will essentially use the methods process() and deprocess() and their variants in class STextProcessor.
This support is provided by the package "org.eclipse.equinox.bidi.advanced" and the API specified mainly in its interface ISTextExpert. With this package, the user can:
A non-default environment can be instantiated with the class STextEnvironment. One of the items which can be specified is the orientation of the GUI component which will display the structured text. This orientation may have a number of values, and depending on its value and on the base text direction of the structured text, directional formatting characters may be added when transforming lean text to full text, as follows:
If the orientation is ORIENT_LTR and the structured text has a RTL base direction, RLE+RLM will be added at the head of the full text and RLM+PDF at its end.
If the orientation is ORIENT_RTL and the structured text has a LTR base direction, LRE+LRM will be added at the head of the full text and LRM+PDF at its end.
If the orientation is ORIENT_CONTEXTUAL_LTR or ORIENT_CONTEXTUAL_RTL and the data content would resolve to a RTL orientation while the structured text has a LTR base direction, LRM will be added at the head of the full text.
If the orientation is ORIENT_CONTEXTUAL_LTR or ORIENT_CONTEXTUAL_RTL and the data content would resolve to a LTR orientation while the structured text has a RTL base direction, RLM will be added at the head of the full text.
If the orientation is ORIENT_UNKNOWN and the structured text has a LTR base direction, LRE+LRM will be added at the head of the full text and LRM+PDF at its end.
If the orientation is ORIENT_UNKNOWN and the structured text has a RTL base direction, RLE+RLM will be added at the head of the full text and RLM+PDF at its end.
If the orientation is ORIENT_IGNORE, nothing is added as either prefix or suffix of the full text.
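The orientation rules above can be restated as a small decision function. The Java sketch below is self-contained and does not use the Eclipse classes: the enum `Orientation` merely mirrors the ORIENT_* names, and the way a contextual orientation is resolved (first strong character, falling back to the LTR or RTL default) is an assumption made for illustration.

```java
// Hypothetical restatement of the orientation rules, for illustration only.
final class OrientationSketch {
    enum Orientation { LTR, RTL, CONTEXTUAL_LTR, CONTEXTUAL_RTL, UNKNOWN, IGNORE }

    static final String LRM = "\u200E", RLM = "\u200F";
    static final String LRE = "\u202A", RLE = "\u202B", PDF = "\u202C";

    /** Assumed contextual rule: direction of the first strong character, else the default. */
    static boolean resolvesRtl(String text, boolean defaultRtl) {
        for (int i = 0; i < text.length(); i++) {
            byte d = Character.getDirectionality(text.charAt(i));
            if (d == Character.DIRECTIONALITY_LEFT_TO_RIGHT) return false;
            if (d == Character.DIRECTIONALITY_RIGHT_TO_LEFT
                    || d == Character.DIRECTIONALITY_RIGHT_TO_LEFT_ARABIC) return true;
        }
        return defaultRtl;
    }

    /** Adds the prefix and suffix called for by the orientation and the base direction. */
    static String addMarks(String fullText, Orientation orientation, boolean baseRtl) {
        String mark = baseRtl ? RLM : LRM;
        String wrapped = (baseRtl ? RLE : LRE) + mark + fullText + mark + PDF;
        switch (orientation) {
            case LTR:
                return baseRtl ? wrapped : fullText;
            case RTL:
                return baseRtl ? fullText : wrapped;
            case CONTEXTUAL_LTR:
            case CONTEXTUAL_RTL:
                boolean contentRtl =
                        resolvesRtl(fullText, orientation == Orientation.CONTEXTUAL_RTL);
                return contentRtl != baseRtl ? mark + fullText : fullText;
            case UNKNOWN:
                return wrapped;
            default: // IGNORE
                return fullText;
        }
    }

    public static void main(String[] args) {
        String marked = addMarks("\u05D0=5", Orientation.CONTEXTUAL_LTR, false);
        System.out.println(marked.length());
    }
}
```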
Developers wishing to create handlers for types of structured text not currently supported by Eclipse "out of the box" will create extensions for class STextTypeHandler in package "org.eclipse.equinox.bidi.advanced". They will probably also use the other classes in this package.
The best way to learn how to write a type handler, beyond the javadoc in the "advanced" package, is to study the code of existing type handlers. Start with very simple ones like STextComma, then proceed to somewhat more complex ones like STextMath and STextEmail, then to the most complex ones like STextJava and STextRegex (all to be found in the "org.eclipse.equinox.bidi.internal.consumable" package). The longest one has fewer than 300 lines of source code (including comments and blank lines), so this is a fairly light task.
Plug-ins which implement new types of structured text handlers for general use should register them using the extension point bidiTypes (identifier "org.eclipse.equinox.bidi.bidiTypes" in plugin "org.eclipse.equinox.bidi").