Re: [hyades-dev] UTF-8 as the data exchange format to and from the data collection engine
I'm not sure I understand the last point, "This is expected to be a rare
case if the data are expected to be in ASCII format and the client can
ignore this exception." What's rare? What's ASCII? After reading this
twice I *think* you're saying "the two-byte encoding of interior nulls is
a difference we can ignore, since none of the strings in the protocol have
interior nulls." That has nothing to do with being ASCII or not. We have a
worldwide charter and I don't think it's wise to think to ourselves,
"Mostly everything's ASCII and things that are not ASCII are the rare
case."
Also, I'd like to clarify the point that says, "Null byte is transformed
into two bytes." I think it would be clearer to say "INTERIOR null
bytes are transformed into two bytes," to distinguish them from the null
byte at the end in the Java format.
Finally, I don't like permitting multiple formats in our spec - one for
UTF8 strings that come from JVMs and one for UTF8 strings that don't. I'd
rather see us decide on a single format and require agents and clients to
use it, converting if necessary. Here is part of my thinking: if you
assume that every client has one "native" UTF8 format (either standard or
Java style), and you further permit agents to put a "format bit" in their
metadata, then you're forcing clients to write a converter from the
non-native format (whichever one it is) to their native format. Since the
client is known to be an Eclipse plug-in (right?) we know it's going to be
happy with Java-style UTF8.
Since the canonical client for an agent is an Eclipse / Hyades plug-in, I
propose that we standardize on Java-style UTF-8: null terminated, and
interior nulls are encoded with two bytes. We can further say that strings
with interior nulls are not allowed; therefore the difference in their
encoding is moot.
We also have to say that the length count before every string is a count
of BYTES in the encoded string, not a count of CHARACTERS in the resulting
Unicode string. And we have to say that the count includes the null byte
at the end. Otherwise the spec leaves implementers wondering.
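
To make the byte-count point concrete, here is a minimal sketch of such a length-prefixed string. The 4-byte big-endian prefix and the names frame_string and unframe_string are assumptions for illustration only; the thread does not specify the width or byte order of the count. The sketch uses the standard UTF-8 encoder, which produces the same bytes as the Java-style encoding for strings that contain no nulls and no characters above U+FFFF.

```python
import struct

def frame_string(text: str) -> bytes:
    """Hypothetical framing: byte count, then UTF-8 bytes, then a null byte."""
    if "\x00" in text:
        # Interior nulls are disallowed, as proposed above.
        raise ValueError("interior nulls are not allowed")
    payload = text.encode("utf-8") + b"\x00"
    # The count is in BYTES and includes the terminating null.
    # The 4-byte big-endian width is an assumption, not part of the proposal.
    return struct.pack(">I", len(payload)) + payload

def unframe_string(data: bytes) -> str:
    """Reverse of frame_string; checks the count and the terminator."""
    (length,) = struct.unpack_from(">I", data, 0)
    payload = data[4:4 + length]
    if len(payload) != length or not payload.endswith(b"\x00"):
        raise ValueError("truncated or unterminated string")
    return payload[:-1].decode("utf-8")

text = "na\u00efve"            # 5 characters; the i-diaeresis takes 2 bytes
framed = frame_string(text)
print(len(text), framed.hex(" "))
```

Here the string has 5 characters, but the count field holds 7: six encoded bytes plus the terminating null.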
-- Allan Pratt, apratt@xxxxxxxxxx
Rational software division of IBM
"Nguyen, Hoang M" <hoang.m.nguyen@xxxxxxxxx>
Sent by: hyades-dev-admin@xxxxxxxxxxx
08/15/2004 10:32 PM
Please respond to: hyades-dev
To: <hyades-dev@xxxxxxxxxxx>
cc:
Subject: [hyades-dev] UTF-8 as the data exchange format to and from the data
collection engine
In our last "Data Collection" weekly meeting, there was a question about how
UTF-8 is supported in cross-platform environments and between Java and C
libraries. Here is the follow-up.
Brief introduction to UTF-8
- UTF stands for Unicode Transformation Format
- UTF uses bit-shifting techniques to encode Unicode characters as byte
values
- In UTF-8, each Unicode character is represented by 1 to 4 bytes
- For the complete UTF-8 encoding and decoding rules, see the spec at
http://www.unicode.org/versions/Unicode4.0.0/ch03.pdf
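
As a quick illustration of the 1-to-4-byte range, the following sketch encodes one sample character from each length class with Python's built-in UTF-8 codec. The sample characters and the helper name utf8_length are chosen here for illustration and are not from the meeting notes.

```python
# One sample character per UTF-8 length class (samples chosen for illustration).
samples = {
    "A": "U+0041, ASCII",
    "\u00e9": "U+00E9, Latin small letter e with acute",
    "\u20ac": "U+20AC, euro sign",
    "\U0001F600": "U+1F600, outside the Basic Multilingual Plane",
}

def utf8_length(ch: str) -> int:
    """Number of bytes the character occupies in standard UTF-8."""
    return len(ch.encode("utf-8"))

for ch, desc in samples.items():
    # Print the description and the encoded bytes in hex, not the raw character.
    print(f"{desc}: {utf8_length(ch)} byte(s): {ch.encode('utf-8').hex(' ')}")
```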
UTF-8 in Java and JNI
- In Java, UTF-8 strings are always 0-terminated
- UTF-8 is upwards-compatible with 7-bit ASCII
Ref:
http://java.sun.com/docs/books/tutorial/native1.1/implementing/string.html
- But Java supports "Modified UTF-8 Strings" and not standard UTF-8:
  - Null byte is transformed into two bytes (0xC0 0x80)
  - No four-byte transformation (a character above U+FFFF is written as a
    surrogate pair, each half encoded in three bytes)
Ref: http://java.sun.com/j2se/1.5.0/docs/guide/jni/spec/types.html
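
The two differences can be sketched as a small encoder. This is an illustrative reimplementation, not the JVM's own code; the function name encode_modified_utf8 is hypothetical, and the terminating null that JNI appends is left out so the two encodings can be compared byte for byte.

```python
def encode_modified_utf8(text: str) -> bytes:
    """Illustrative encoder for Java-style modified UTF-8 (no terminator added)."""
    out = bytearray()
    for ch in text:
        cp = ord(ch)
        if cp == 0:
            # Difference 1: the null character becomes the two bytes C0 80.
            out += b"\xc0\x80"
        elif cp < 0x10000:
            # Everything else up to U+FFFF is encoded as in standard UTF-8.
            out += ch.encode("utf-8", "surrogatepass")
        else:
            # Difference 2: no four-byte form. Split into a UTF-16 surrogate
            # pair and encode each half separately as three bytes.
            cp -= 0x10000
            high = 0xD800 | (cp >> 10)
            low = 0xDC00 | (cp & 0x3FF)
            for unit in (high, low):
                out += chr(unit).encode("utf-8", "surrogatepass")
    return bytes(out)

print("a\x00b".encode("utf-8").hex(" "))             # standard UTF-8
print(encode_modified_utf8("a\x00b").hex(" "))       # modified UTF-8
print(encode_modified_utf8("\U0001F600").hex(" "))   # six bytes instead of four
```

For strings that contain neither nulls nor characters above U+FFFF, the two encodings produce identical bytes.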
Recommendation:
Adopting UTF-8 is the right approach, as it is the universal and widely
accepted choice for a cross-platform data format.
But we do need to handle the case of modified UTF-8 when the engine
is running on a JVM.
- We want to simply expose this as a public attribute/property of
the engine
so that the client can choose to build the appropriate format of the UTF-8
stream.
- This is expected to be a rare case if the data are expected to
be in ASCII format
and the client can ignore this exception.
We can discuss this issue further at our next weekly meeting.