RE: [hyades-dev] More info on Java UTF-8

There is still some debate about using UTF-8 as a canonical string format, 
and a suggestion that we should use a pluggable transcoder stack to handle 
impedance matching among components. I don't think we should do that. I 
think it will benefit us to pick UTF-8 as a canonical string format, just 
like we will pick a canonical byte order for multi-byte numeric values. I 
will try to lay out my logic and see if people agree.
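
To make "canonical format" concrete, here is a tiny Java sketch of what I 
mean. This is just an illustration, not a proposed API: a string field 
goes on the wire as a big-endian length followed by UTF-8 bytes, whatever 
the sender's native encoding happens to be.

    import java.io.ByteArrayOutputStream;
    import java.io.DataOutputStream;
    import java.io.IOException;

    // Illustration only: one canonical string encoding (UTF-8) and one
    // canonical byte order (big-endian length prefix) for every sender.
    public final class CanonicalStringField {
        public static byte[] encode(String value) throws IOException {
            byte[] utf8 = value.getBytes("UTF-8");
            ByteArrayOutputStream buffer = new ByteArrayOutputStream();
            DataOutputStream out = new DataOutputStream(buffer);
            out.writeInt(utf8.length);   // DataOutputStream writes big-endian
            out.write(utf8);
            out.flush();
            return buffer.toByteArray();
        }
    }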

First let me say that I know Java and .Net aren't the only target 
environments, and Eclipse/Java isn't the only client environment. I think 
they're going to be high-runner cases. While we must not make things 
impossibly complex for a non-Java agent or client, I think we can consider 
the common cases when deciding which way to streamline things.

Second I'll say this: I'm arguing against us inventing a pluggable 
transcoding stack of our own. My resistance is based on the added 
complexity and the difficulty of getting the design and implementation 
right. If we find and adopt an established standard message system that 
already has such a feature, and has an implementation on the platforms of 
interest, and meets our other criteria, that's OK with me.

Third, I want to comment that I've just finished making my Java agent 
extension component work on z/OS and OS/400; Java uses Unicode, and the 
RAC on z/OS wants ASCII while the one on AS/400 wants EBCDIC. Different 
command-line options and pragmas in C source are used to control ASCII vs. 
EBCDIC string constants. My battle scars are fresh.

Now that the preliminaries are over, I'll start with the bones of my 
argument, then flesh them out: 

        1. A transcoding stack is beneficial in a limited range of 
scenarios, compared to picking UTF-8 as a common format.

        2. In actual use of the HCE protocol, the scenario where such a 
stack is beneficial will be almost nonexistent.

        3. A transcoding stack adds noticeable complexity to the design 
and implementation, so it should be done only if justified.

        4. I believe it is not justified. Therefore we should use UTF-8, 
and not invent and use a transcoding stack.

Point 1: when does a transcoding stack help you? It saves useless 
transcoding when two components with identical encodings want to talk. 
Let's say the only two encodings of interest are UTF-8 and EBCDIC. (We can 
consider ASCII a subset of UTF-8.) If we pick UTF-8 as the canonical 
encoding, then two EBCDIC components that want to talk will go through two 
useless conversion steps: EBCDIC to UTF-8 and back again. That's where a 
transcoding stack wins: it recognizes that no conversion is necessary.
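
For what it's worth, the "two useless conversion steps" are just this kind 
of round trip. Here is a Java sketch (real agents would do this in C, and 
Cp1047 is just one z/OS EBCDIC code page whose availability depends on the 
JRE, but the shape is the same):

    import java.io.UnsupportedEncodingException;
    import java.util.Arrays;

    // Sketch of the round trip two EBCDIC components pay for under a
    // canonical UTF-8 wire format: EBCDIC -> UTF-8 on the sending side,
    // UTF-8 -> EBCDIC on the receiving side.
    public final class RoundTripDemo {
        public static void main(String[] args) throws UnsupportedEncodingException {
            byte[] ebcdicIn = "probe started".getBytes("Cp1047"); // sender's native bytes
            String decoded = new String(ebcdicIn, "Cp1047");      // decode EBCDIC
            byte[] wire = decoded.getBytes("UTF-8");              // canonical UTF-8 on the wire
            String received = new String(wire, "UTF-8");          // receiver decodes UTF-8
            byte[] ebcdicOut = received.getBytes("Cp1047");       // back to native EBCDIC
            System.out.println(Arrays.equals(ebcdicIn, ebcdicOut)); // true: nothing was gained
        }
    }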

But that's the ONLY scenario where the stack is helpful compared to 
picking UTF-8 as the canonical format. Two interacting UTF-8 components 
don't need to transcode in either case, and mismatched components must go 
through one transcode step regardless.

And look again: I think this team acknowledges that transcoding is not a 
big deal for low-volume traffic. So the only real benefit scenario is 
*high-volume* communication between two EBCDIC agents.

And here's something else: even EBCDIC agents will want to be in a Unicode 
universe if they're using strings that come from a Java or .Net process - 
strings like package and class names, or log messages. I know that's not the 
only kind of agent, but it's a kind worth considering. Therefore even 
conversations among local components on an EBCDIC platform might want to 
use a Unicode-capable format like UTF-8.

Point 2: We've established that the only winning scenario is high-volume 
communication between EBCDIC components that don't need to operate in a 
Unicode world. We have to ask ourselves, how likely is that? While it may 
be that the Workbench isn't the only client, it is certainly the major 
one. Even considering non-Workbench clients, they will overwhelmingly be 
running on non-EBCDIC machines, even when there are EBCDIC machines in the 
system being observed. Finally, in order to matter, the communication has 
to involve high-volume STRINGS: passing numbers and binary data has no 
bearing on this argument.

It would be easier to assign a "likelihood" value if we knew what the use 
cases were for agent-to-agent interactions. Right now we don't have such 
use cases in front of us. Use cases are valuable for settling questions 
like this and for documenting the foundation for a decision so we don't 
constantly revisit it.

Point 3: The transcoding stack is heavy-weight. It adds complexity and 
risk to the project. We'll have to define a communication broker that all 
messages go through. That broker will need a plug-in architecture for 
transcoders. Each transcoder registers the input and output formats it 
knows how to transform. Every agent (or even every message in the system) 
will be tagged with the string encoding it uses. As messages pass through 
the broker, it will determine whether the sender and recipient are 
compatible. If not, it will consult its roster of transcoders and find one 
(or possibly more than one) to transform the sender's encoding to the 
recipient's. 
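
To be clear about how much machinery that implies, here is a rough Java 
sketch of the kind of plug-in interface such a broker would need. Every 
name here is invented for illustration; nothing like this exists today.

    import java.util.HashMap;
    import java.util.Map;

    // Invented-for-illustration sketch: each transcoder advertises the
    // encodings it converts between, the broker keeps a roster keyed by
    // (from, to), and every message carries an encoding tag.
    interface Transcoder {
        String fromEncoding();              // e.g. "IBM-1047"
        String toEncoding();                // e.g. "UTF-8"
        byte[] transcode(byte[] input);
    }

    final class TranscodingBroker {
        private final Map roster = new HashMap(); // "from->to" -> Transcoder

        void register(Transcoder t) {
            roster.put(t.fromEncoding() + "->" + t.toEncoding(), t);
        }

        // Deliver a message tagged with the sender's encoding to a recipient
        // with a declared encoding; transcode only when they differ.
        byte[] route(byte[] message, String senderEncoding, String recipientEncoding) {
            if (senderEncoding.equals(recipientEncoding)) {
                return message;   // the one case where the stack "wins"
            }
            Transcoder t = (Transcoder) roster.get(senderEncoding + "->" + recipientEncoding);
            if (t == null) {
                throw new IllegalStateException("no transcoder for "
                        + senderEncoding + " -> " + recipientEncoding);
            }
            return t.transcode(message);
        }
    }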

Then there are a million details: who loads transcoders? When? Reading 
information from what registry store, using what mechanism? Are they 
native shared libraries, Java classes, something else, or possibly all of 
the above? Can they be unloaded when they're not needed any more?

All that work, all that complexity, implementation, and testing for a 
broker that almost always sees UTF-8 to UTF-8 or UTF-8 to EBCDIC traffic: 
cases in which it does either no transcoding at all or exactly the 
transcoding we would have done if we'd made the simple choice. That's what 
I see. To keep this new HCE protocol system achievable, I think this is 
added risk and complexity we can live without.

-- Allan Pratt, apratt@xxxxxxxxxx
Rational software division of IBM


