Re: [hyades-dev] More info on Java UTF-8
I don't agree that we should use this structure. The page you refer to
describes the CONSTANT_Utf8_info structure in the constant pool of a class
file. That format stores the string length in two bytes, so a string is
limited to 65,535 encoded bytes, which might prove inadequate for our needs.
Sure, an agent's name string won't be that long, but what about a large
amount of data that's transferred in a text format like XML? Java class
files are rife with 16-bit limitations that people are regretting today: on
the size of the bytecode in a single method, and on the number of constant
pool entries, for example. It's dangerous to assume that 64K is "big enough"
for strings. Heck, I've seen CLASSPATH variables longer than that.
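To make the 16-bit ceiling concrete, here is a minimal sketch using DataOutputStream.writeUTF(), which carries the same two-byte length prefix. The class name WriteUtfLimitDemo and the helper fits() are made up for illustration; they are not part of any proposal in this thread.

```java
import java.io.ByteArrayOutputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.UTFDataFormatException;
import java.io.UncheckedIOException;
import java.util.Arrays;

class WriteUtfLimitDemo {
    /** Returns true if writeUTF() accepts a string of n ASCII characters. */
    static boolean fits(int n) {
        char[] chars = new char[n];
        Arrays.fill(chars, 'x');            // one encoded byte per character
        DataOutputStream dos = new DataOutputStream(new ByteArrayOutputStream());
        try {
            dos.writeUTF(new String(chars));
            return true;
        } catch (UTFDataFormatException e) {
            // Thrown when the encoded form needs more than 65535 bytes.
            return false;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println("65535 bytes fit: " + fits(65535));
        System.out.println("65536 bytes fit: " + fits(65536));
    }
}
```

The limit applies to encoded bytes, not characters, so a string containing nulls or non-ASCII characters hits it sooner.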
The first program in the message proves that the Java charset named UTF-8
is NOT the same as Java's "modified UTF-8" format: the charset writes an
interior null as a single 0x00 byte. The second program proves that
writeUTF() uses the modified format, in which each null becomes the two
bytes 0xC0 0x80.
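For anyone who wants to see the difference without opening the output files in a hex editor, the following sketch encodes the same test string both ways in memory and prints the bytes. The class and helper names (Utf8FormatsDemo, standardUtf8, modifiedUtf8, hex) are mine, chosen only for this illustration.

```java
import java.io.ByteArrayOutputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;

class Utf8FormatsDemo {
    /** Encodes with the charset named "UTF-8". */
    static byte[] standardUtf8(String s) {
        return s.getBytes(StandardCharsets.UTF_8);
    }

    /** Encodes with writeUTF(): 2-byte length, then modified UTF-8. */
    static byte[] modifiedUtf8(String s) {
        ByteArrayOutputStream bos = new ByteArrayOutputStream();
        try (DataOutputStream dos = new DataOutputStream(bos)) {
            dos.writeUTF(s);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return bos.toByteArray();
    }

    /** Formats bytes as space-separated upper-case hex pairs. */
    static String hex(byte[] bytes) {
        StringBuilder sb = new StringBuilder();
        for (byte b : bytes) {
            if (sb.length() > 0) sb.append(' ');
            sb.append(String.format("%02X", b & 0xFF));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // Same test data as the two attached programs.
        String s = new String(new char[] {'A', '\u0000', 'B', '\u0080', 'C', '\u0000'});
        System.out.println(hex(standardUtf8(s)));  // 41 00 42 C2 80 43 00
        System.out.println(hex(modifiedUtf8(s)));  // 00 09 41 C0 80 42 C2 80 43 C0 80
    }
}
```

The two encodings agree on every character except the null, and the writeUTF() output additionally starts with the length prefix 00 09.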
The ideas of encoding interior nulls as two bytes and providing a trailing
null are for C and C++ convenience. Their value rises in proportion to the
likelihood of having native code consuming the strings (and, to a lesser
extent, generating them).
I don't really care what format we use. But if nothing else this
discussion should make it clear that we have to think about it, and
specify our encoding EXACTLY.
-- Allan Pratt, apratt@xxxxxxxxxx
Rational software division of IBM
"Nguyen, Hoang M" <hoang.m.nguyen@xxxxxxxxx>
Sent by: hyades-dev-admin@xxxxxxxxxxx
08/18/2004 03:47 PM
Please respond to: hyades-dev
To: <hyades-dev@xxxxxxxxxxx>
cc:
Subject: [hyades-dev] More info on Java UTF-8
Hello all,
If we want to adopt the Java UTF-8 form, we may want to consider adopting
its data structure as well.
Here is the spec of the UTF-8 data structure in the Java Virtual Machine (JVM):
http://java.sun.com/docs/books/vmspec/2nd-edition/html/ClassFile.doc.html#7963
- a 2-byte length
- followed by the UTF-8 byte stream
- the length does not include the null character
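As a rough illustration of that layout, the sketch below writes a string with DataOutputStream.writeUTF() (which uses the same 2-byte-length-plus-bytes layout) and reads the length field back. The names LengthPrefixDemo, encode and declaredLength are hypothetical helpers introduced only for this example.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;

class LengthPrefixDemo {
    /** Produces the record: 2-byte length followed by the encoded bytes. */
    static byte[] encode(String s) {
        ByteArrayOutputStream bos = new ByteArrayOutputStream();
        try (DataOutputStream dos = new DataOutputStream(bos)) {
            dos.writeUTF(s);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return bos.toByteArray();
    }

    /** Reads the big-endian unsigned 16-bit length at the start of a record. */
    static int declaredLength(byte[] record) {
        try (DataInputStream in = new DataInputStream(new ByteArrayInputStream(record))) {
            return in.readUnsignedShort();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        byte[] record = encode("A\u0080");  // 'A' is 1 byte, U+0080 is 2 bytes
        System.out.println("declared length: " + declaredLength(record));  // 3
        System.out.println("total record size: " + record.length);         // 5
    }
}
```

The length field counts encoded bytes rather than characters, and no terminating null is written after the payload.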
In addition, I have verified that Java can also write a null character as
a single byte.
Please see attached programs.
We can discuss more and get some resolution on this issue in our weekly
meeting.
Regards,
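/*
   This demo program shows that Java's "UTF-8" charset writes
   a null character as a single byte (0x00).
*/
import java.io.FileOutputStream;
import java.io.IOException;
import java.io.OutputStreamWriter;
import java.io.Writer;
import java.nio.charset.StandardCharsets;

public class MyUTF8Output
{
    public static void main(String[] args)
    {
        // Test data: interior and trailing nulls plus one non-ASCII character.
        char[] msg = {'A', '\u0000', 'B', '\u0080', 'C', '\u0000'} ;
        String s = new String(msg) ;
        // try-with-resources closes the writer and the underlying file.
        try (Writer osw = new OutputStreamWriter(
                new FileOutputStream("myoutput.txt"), StandardCharsets.UTF_8))
        {
            osw.write(s) ;
            osw.flush() ;
            System.out.println("See \"myoutput.txt\" file.") ;
        }
        catch (IOException e)
        {
            // Report the failure instead of silently ignoring it.
            e.printStackTrace() ;
        }
    }
}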
/*
   This demo program shows that the format written by writeUTF() is:
   - a 2-byte length of the UTF-8 buffer
   - followed by the buffer, with each null mapped to two bytes
*/
import java.io.DataOutputStream;
import java.io.FileOutputStream;
import java.io.IOException;

public class MyUTF8Conversion
{
    public static void main(String[] args)
    {
        // Same test data as MyUTF8Output, so the two files can be compared.
        char[] msg = {'A', '\u0000', 'B', '\u0080', 'C', '\u0000'} ;
        String s = new String(msg) ;
        // try-with-resources closes the stream and the underlying file.
        try (DataOutputStream dos = new DataOutputStream(
                new FileOutputStream("myoutput2.txt")))
        {
            dos.writeUTF(s) ;
            dos.flush() ;
            System.out.println("See \"myoutput2.txt\" file.") ;
        }
        catch (IOException e)
        {
            // Report the failure instead of silently ignoring it.
            e.printStackTrace() ;
        }
    }
}