Re: [hyades-dev] UTF-8 in HCE Data Exchange Format Proposal

Two things:

1. Wherever this ends up in our specs, we should mention whose standard 
we're borrowing from. Where does this encoding come from? People will be 
more accepting of it if they know we didn't just invent it, that we're 
leveraging an existing design. I know it makes *me* more accepting of this 
system that encodes the highest-runner case (short strings) in three 
bytes, not one.

2. I think this description misses the point of prohibiting embedded 
nulls. The point of prohibiting them is so the encoded string can be 
handled using C functions like strdup, strcmp (for equality only), and 
strlen (to get the encoded size, not the decoded size). The right thing to 
say about this is that the Unicode string is allowed to have interior null 
characters, but the UTF-8 encoded string must use the two-byte encoding 
for those characters. Just like Java.  It doesn't really matter: interior 
nulls are something you have to account for in the spec, but I don't think 
they're used much. Somebody correct me if I'm wrong.
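
For reference, the two-byte form mentioned above is the one Java's "modified UTF-8" uses: U+0000 is written as the bytes 0xC0 0x80, so the encoded stream never contains a zero byte. Below is a minimal sketch of just that substitution. The function names are hypothetical, and it ignores the other difference in Java's modified UTF-8 (supplementary characters are written as surrogate pairs). Note that a strict UTF-8 decoder rejects 0xC0 0x80 as an overlong sequence, so both ends would have to agree on this.

```python
def encode_modified_nulls(text):
    """Encode text as UTF-8, writing U+0000 as the two bytes C0 80."""
    payload = text.encode("utf-8")
    # Standard UTF-8 encodes U+0000 as a single 0x00 byte; swap it for
    # the two-byte form so C string functions never see a zero byte.
    return payload.replace(b"\x00", b"\xc0\x80")


def decode_modified_nulls(payload):
    """Reverse encode_modified_nulls."""
    # 0xC0 never appears in valid standard UTF-8, so the pair C0 80 can
    # only have come from the substitution above.
    return payload.replace(b"\xc0\x80", b"\x00").decode("utf-8")
```

With this, a string with an interior null such as "a\x00b" becomes four bytes with no zero byte among them, so strlen on the encoded form would report the full encoded size.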

-- Allan Pratt, apratt@xxxxxxxxxx
Rational software division of IBM




"Nguyen, Hoang M" <hoang.m.nguyen@xxxxxxxxx> 
Sent by: hyades-dev-admin@xxxxxxxxxxx
08/31/2004 05:53 AM
Please respond to
hyades-dev


To
<hyades-dev@xxxxxxxxxxx>
cc

Subject
[hyades-dev] UTF-8 in HCE Data Exchange Format Proposal






 
Hello all,
 
We discussed UTF-8 usage in our last HCE team meeting and arrived at the 
following proposal:
 
When string data is exchanged between the client and the HCE, it will 
use the following format:
- one byte indicating the size of the string length field: 2, 4 or 8 bytes
- the actual length field (2, 4 or 8 bytes)
- a UTF-8 byte stream with no embedded null byte and no terminating 
  null byte
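
To make the layout concrete, here is a minimal sketch of an encoder and decoder for it. Several details are assumptions, since the proposal does not state them: the indicator byte holds the literal value 2, 4 or 8; the length field is unsigned, big-endian, and counts UTF-8 bytes rather than characters; and the function names are made up for illustration.

```python
import struct

# Assumed mapping from length-field width (in bytes) to an unsigned,
# big-endian struct format. The proposal does not specify byte order.
_LENGTH_FORMATS = {2: ">H", 4: ">I", 8: ">Q"}


def encode_string(text, width=2):
    """Encode text as: indicator byte, length field, UTF-8 payload."""
    payload = text.encode("utf-8")
    if b"\x00" in payload:
        raise ValueError("embedded null bytes are not allowed")
    fmt = _LENGTH_FORMATS[width]  # KeyError for widths other than 2, 4, 8
    # struct.pack raises struct.error if the length does not fit the field.
    return bytes([width]) + struct.pack(fmt, len(payload)) + payload


def decode_string(data):
    """Decode one string from the start of data; return (text, bytes_used)."""
    width = data[0]
    fmt = _LENGTH_FORMATS[width]
    (length,) = struct.unpack_from(fmt, data, 1)
    start = 1 + width
    payload = data[start:start + length]
    if len(payload) != length:
        raise ValueError("truncated string payload")
    return payload.decode("utf-8"), start + length
```

Under these assumptions, the two-character string "hi" is sent as five bytes: 02 00 02 68 69.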
 
Here are some important points about this proposal:
 
This applies to the string data type only. Other data types (int, float, 
double, etc.) are not affected.
 
No embedded null byte in the string.
If embedded nulls would otherwise be needed, multiple strings should be 
created instead.
This should help C/C++ programs convert the data to their own string type 
without additional checking.
 
No null byte at the end.
Since there is a length field at the beginning, senders should not be 
forced to add a null byte at the end (just for the convenience of C 
programs).
It would be very unnatural for Java programs to do so.
 
The expectation is that most Java or C++ programs will probably convert 
the data to String objects before manipulating it anyway.
 
The leading length-indicator byte is there for efficiency and 
scalability:
- most string lengths will fit in a 2-byte length field (my expectation 
  is 99%)
- a 4-byte length is also possible (e.g. Allan's environment variable)
- an 8-byte length is conceivable with the latest 64-bit memory 
  addressing (as Frank de Jong pointed out)
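
As a rough illustration of the trade-off, the sketch below picks the smallest length field that can hold a given payload size and reports the resulting header overhead. It assumes the length fields are unsigned, which the proposal does not say (with signed fields, as Java might prefer, the thresholds would be halved), and the function names are hypothetical.

```python
def length_field_width(payload_length):
    """Return the smallest length-field width (2, 4 or 8) that fits."""
    if payload_length < 0:
        raise ValueError("length cannot be negative")
    if payload_length <= 0xFFFF:
        return 2
    if payload_length <= 0xFFFFFFFF:
        return 4
    if payload_length <= 0xFFFFFFFFFFFFFFFF:
        return 8
    raise ValueError("length does not fit in 8 bytes")


def header_overhead(payload_length):
    """Bytes sent before the payload: indicator byte plus length field."""
    return 1 + length_field_width(payload_length)
```

Any string of up to 65535 encoded bytes therefore costs three bytes of header, longer ones cost five, and only payloads of 4 GiB or more would need nine.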
 
As always, all inputs and comments are welcome and appreciated.
 
Regards,
 
 


