Re: [hyades-dev] UTF-8 as the data exchange format to and from the data collection engine

I'm not sure I understand the last point, "This is expected to be a rare 
case if the data are expected to be in ASCII format and the client can 
ignore this exception." What's rare? What's ASCII? After reading this 
twice I *think* you're saying "the two-byte encoding of interior nulls is 
a difference we can ignore, since none of the strings in the protocol have 
interior nulls." That has nothing to do with being ASCII or not. We have a 
worldwide charter and I don't think it's wise to think to ourselves, 
"Mostly everything's ASCII and things that are not ASCII are the rare 
case."

Also, I'd like to clarify the point that says, "Null byte is transformed 
into two bytes." I think it would be more clear to say "INTERIOR null 
bytes are transformed into two bytes," to distinguish from the null byte 
at the end in the Java format.

Finally, I don't like permitting multiple formats in our spec - one for 
UTF-8 strings that come from JVMs and one for UTF-8 strings that don't. I'd 
rather see us decide on a single format and require agents and clients to 
use it, converting if necessary. Here is part of my thinking: if you 
assume that every client has one "native" UTF-8 format (either standard or 
Java style), and you further permit agents to put a "format bit" in their 
metadata, then you're forcing clients to write a converter from the 
non-native format (whichever one it is) to their native format. Since the 
client is known to be an Eclipse plug-in (right?) we know it's going to be 
happy with Java-style UTF-8.

Since the canonical client for an agent is an Eclipse / Hyades plug-in, I 
propose that we standardize on Java-style UTF-8: null terminated, and 
interior nulls are encoded with two bytes. We can further say that strings 
with interior nulls are not allowed; therefore the difference in their 
encoding is moot.
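
As a rough illustration of that proposal, here is a minimal Python sketch 
that rejects interior nulls and appends the terminating null byte. The name 
`encode_protocol_string` is hypothetical, not part of any Hyades API. The 
sketch also assumes the text has no characters above U+FFFF, since Java-style 
and standard UTF-8 encode those differently even when nulls are excluded.

```python
def encode_protocol_string(text: str) -> bytes:
    """Encode text as null-terminated UTF-8, refusing interior nulls.

    Hypothetical helper for illustration only.
    """
    if "\x00" in text:
        # The proposal: strings with interior nulls are simply not allowed,
        # so the two-byte null encoding never has to appear on the wire.
        raise ValueError("strings with interior nulls are not allowed")
    if any(ord(ch) > 0xFFFF for ch in text):
        # Assumption of this sketch: stay where both formats agree.
        raise ValueError("characters above U+FFFF are outside this sketch")
    # Within these limits, standard and Java-style UTF-8 are byte-identical.
    return text.encode("utf-8") + b"\x00"
```

With those two restrictions in place, an agent written in C and one written 
in Java would emit the same bytes for the same string.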

We also have to say that the length count before every string is a count 
of BYTES in the encoded string, not a count of CHARACTERS in the resulting 
Unicode string. And we have to say that the count includes the null byte 
at the end. Otherwise the spec leaves implementers wondering.
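
To make the byte-versus-character distinction concrete, the Python sketch 
below frames a string with a length prefix that counts encoded bytes plus 
the trailing null. The 4-byte big-endian prefix and the name `frame_string` 
are assumptions for illustration; nothing in this thread says how wide the 
count field is.

```python
import struct


def frame_string(text: str) -> bytes:
    """Return length prefix + UTF-8 bytes + terminating null.

    The prefix counts BYTES of the encoded string, including the null.
    Prefix width and byte order are assumptions of this sketch.
    """
    payload = text.encode("utf-8") + b"\x00"
    return struct.pack(">I", len(payload)) + payload


text = "na\u00efve"          # 5 characters, one of them non-ASCII
framed = frame_string(text)
char_count = len(text)                    # 5 characters
byte_count = struct.unpack(">I", framed[:4])[0]  # 7 = 6 encoded bytes + null
```

A reader that treated the count as characters would stop two bytes short 
here, which is exactly the ambiguity the spec needs to rule out.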

-- Allan Pratt, apratt@xxxxxxxxxx
Rational software division of IBM




"Nguyen, Hoang M" <hoang.m.nguyen@xxxxxxxxx> 
Sent by: hyades-dev-admin@xxxxxxxxxxx
08/15/2004 10:32 PM
Please respond to: hyades-dev

To: <hyades-dev@xxxxxxxxxxx>
cc:
Subject: [hyades-dev] UTF-8 as the data exchange format to and from the data 
collection engine

 
In our last "Data Collection" weekly meeting, there was a question about how 
UTF-8 is supported
in cross-platform environments and between Java and C libraries. Here is 
the follow-up.
 
Brief introduction to UTF-8
UTF stands for Unicode Transformation Format
UTF uses bit-shifting techniques to encode Unicode characters as byte 
values
In UTF-8, each Unicode character is represented by between 1 and 4 bytes
For the complete UTF-8 encoding and decoding rules, see the spec at
                 http://www.unicode.org/versions/Unicode4.0.0/ch03.pdf
 
UTF-8 in Java and JNI 
In Java, UTF-8 strings are always 0-terminated
UTF-8 is upwards-compatible with 7-bit ASCII
Ref: 
http://java.sun.com/docs/books/tutorial/native1.1/implementing/string.html
 
But Java supports "Modified UTF-8 Strings" and not standard UTF-8
Null byte is transformed into two bytes
No four-byte transformation (characters above U+FFFF are encoded as a 
surrogate pair, three bytes per surrogate)
Ref: http://java.sun.com/j2se/1.5.0/docs/guide/jni/spec/types.html
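
The two differences can be sketched with a small hand-written encoder. This 
is an illustrative Python sketch, not code from the JVM or from Hyades; the 
name `encode_modified_utf8` is hypothetical, and the input is assumed to 
contain no lone surrogates.

```python
def _three_bytes(unit: int) -> bytes:
    """Encode one 16-bit value in the three-byte UTF-8 form."""
    return bytes([
        0xE0 | (unit >> 12),
        0x80 | ((unit >> 6) & 0x3F),
        0x80 | (unit & 0x3F),
    ])


def encode_modified_utf8(text: str) -> bytes:
    """Encode text the way Java's modified UTF-8 does (sketch)."""
    out = bytearray()
    for ch in text:
        cp = ord(ch)
        if cp == 0:
            # Difference 1: the null character becomes two bytes.
            out += b"\xc0\x80"
        elif cp < 0x80:
            out.append(cp)
        elif cp < 0x800:
            out.append(0xC0 | (cp >> 6))
            out.append(0x80 | (cp & 0x3F))
        elif cp < 0x10000:
            out += _three_bytes(cp)
        else:
            # Difference 2: no four-byte form. Split into a UTF-16
            # surrogate pair and encode each half in three bytes.
            cp -= 0x10000
            out += _three_bytes(0xD800 | (cp >> 10))
            out += _three_bytes(0xDC00 | (cp & 0x3FF))
    return bytes(out)


with_null = encode_modified_utf8("a\x00b")       # b"a\xc0\x80b"
supplementary = encode_modified_utf8("\U0001F600")  # 6 bytes, not 4
```

For text with no null characters and nothing above U+FFFF, the output is 
byte-for-byte the same as standard UTF-8.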
 
Recommendation:
Adopting UTF-8 is the right approach, as it is the universal and widely 
accepted choice for a cross-platform data format.
But we do need to handle the case of modified UTF-8 when the engine 
is running on a JVM:
- We want to simply expose this as a public attribute/property of the 
engine so that the client can choose to build the appropriate format of the 
UTF-8 stream.
- This is expected to be a rare case if the data are expected to be in 
ASCII format and the client can ignore this exception.
 
We can discuss this issue further at our next weekly meeting.
 


