Re: [hyades-dev] UTF-8 in HCE Data Exchange Format Proposal
Two things:
1. Wherever this ends up in our specs, we should mention whose standard
we're borrowing from. Where does this encoding come from? People will be
more accepting of it if they know we didn't just invent it, that we're
leveraging an existing design. I know it makes *me* more accepting of this
system that encodes the highest-runner case (short strings) in three
bytes, not one.
2. I think this description misses the point of prohibiting embedded
nulls. The point of prohibiting them is so the encoded string can be
handled using C functions like strdup, strcmp (for equality only), and
strlen (to get the encoded size, not the decoded size). The right thing to
say about this is that the Unicode string is allowed to have interior null
characters, but the UTF-8 encoded string must use the two-byte encoding
for those characters. Just like Java. It doesn't really matter: interior
nulls are something you have to account for in the spec, but I don't think
they're used much. Somebody correct me if I'm wrong.
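To illustrate the two-byte encoding mentioned above, here is a minimal sketch. It assumes the Java-style convention in which U+0000 is written as the byte pair 0xC0 0x80, so the encoded stream never contains a zero byte. The function names are hypothetical, and the sketch covers only the null-character rule, not the other ways Java's modified UTF-8 differs from standard UTF-8.

```python
# Hypothetical helpers: only the U+0000 rule of Java-style "modified UTF-8".

def encode_no_zero_bytes(text: str) -> bytes:
    """Encode text as UTF-8, writing U+0000 as the two bytes C0 80."""
    standard = text.encode("utf-8")
    # In standard UTF-8 a 0x00 byte can only be the encoding of U+0000,
    # because every byte of a multi-byte sequence has its high bit set.
    return standard.replace(b"\x00", b"\xc0\x80")


def decode_no_zero_bytes(data: bytes) -> str:
    """Reverse encode_no_zero_bytes."""
    # 0xC0 never occurs in valid standard UTF-8, so the pair is unambiguous.
    return data.replace(b"\xc0\x80", b"\x00").decode("utf-8")


encoded = encode_no_zero_bytes("a\x00b")
# The Unicode string has an interior null, but the encoded bytes have none,
# so C functions that stop at the first zero byte would see the whole string.
print(encoded, b"\x00" in encoded)
```

With this convention the byte count of the encoded form is what strlen would report, which is the encoded size and not the number of characters.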
-- Allan Pratt, apratt@xxxxxxxxxx
Rational software division of IBM
"Nguyen, Hoang M" <hoang.m.nguyen@xxxxxxxxx>
Sent by: hyades-dev-admin@xxxxxxxxxxx
08/31/2004 05:53 AM
Please respond to hyades-dev
To: <hyades-dev@xxxxxxxxxxx>
cc:
Subject: [hyades-dev] UTF-8 in HCE Data Exchange Format Proposal
Hello all,
We discussed UTF-8 usage in our last HCE team meeting and arrived at the
following proposal:
When a string is exchanged between the client and the HCE, it will be in
the following format:
- one byte indicating the size of the string length field: 2, 4 or 8 bytes
- the actual length field (2, 4 or 8 bytes)
- a UTF-8 byte stream with no embedded null byte and no terminating null
byte
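The layout above can be sketched as follows. The proposal does not state the byte order of the length field or whether the length counts bytes or characters, so this sketch assumes big-endian order and a length in UTF-8 bytes. It also assumes the sender picks the smallest width that fits. The function names are hypothetical.

```python
import struct

# Assumption: big-endian unsigned length fields of 2, 4 or 8 bytes.
_LENGTH_FORMATS = {2: ">H", 4: ">I", 8: ">Q"}


def encode_string(text: str) -> bytes:
    """Return: width byte, length field, UTF-8 payload (no null bytes)."""
    payload = text.encode("utf-8")
    if b"\x00" in payload:
        raise ValueError("embedded null byte is not allowed")
    size = len(payload)
    # Assumption: choose the smallest length field that can hold the size.
    if size <= 0xFFFF:
        width = 2
    elif size <= 0xFFFFFFFF:
        width = 4
    else:
        width = 8
    return bytes([width]) + struct.pack(_LENGTH_FORMATS[width], size) + payload


def decode_string(data: bytes) -> tuple[str, int]:
    """Return the decoded string and the number of bytes consumed."""
    width = data[0]
    fmt = _LENGTH_FORMATS.get(width)
    if fmt is None:
        raise ValueError(f"invalid length-field size: {width}")
    (size,) = struct.unpack_from(fmt, data, 1)
    start = 1 + width
    payload = data[start:start + size]
    if len(payload) != size:
        raise ValueError("truncated string data")
    if b"\x00" in payload:
        raise ValueError("embedded null byte is not allowed")
    return payload.decode("utf-8"), start + size


print(encode_string("abc"))
```

Under these assumptions every string of up to 65535 encoded bytes carries three bytes of overhead: one for the width indicator and two for the length.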
Here are some important points about this proposal:
This applies to the string data type only. Other data types (int, float,
double, etc.) are not affected.
No embedded null byte in the string.
If the data would otherwise require embedded nulls, multiple strings
should be created instead.
This should help C/C++ programs convert the data to their own string
representation without additional checking.
No null byte at the end.
Since there is a length field at the beginning, senders should not be
forced to add a null byte at the end (just for C program convenience).
Doing so would be very unnatural for Java programs.
The expectation is that most Java or C++ programs will probably convert
the data to String objects before manipulating it anyway.
The first length indicator byte is there for compactness and scalability:
- most string lengths will fit in a 2-byte length field (I expect about
99%)
- a 4-byte length is also possible (e.g. Allan's environment variable)
- an 8-byte length is conceivable with the latest 64-bit memory
addressing (as Frank de Jong pointed out)
As always, all inputs and comments are welcome and appreciated.
Regards,