Re: [hyades-dev] More info on Java UTF-8

I don't agree that we should use this structure. The page you refer to 
describes the CONSTANT_Utf8_info structure in the constant pool of a class 
file. That format stores the string length in two bytes, so a string can 
be at most 65535 bytes long, which might prove inadequate for our needs. 
Sure, an agent's name string won't come anywhere near that, but what about 
a large amount of data that's transferred in a text format like XML? Java 
class files are rife with 16-bit limitations that people are regretting 
today: on the size of the bytecode in a single method, and on the number 
of constant pool entries, for example. It's dangerous to assume that 64K 
is "big enough" for strings. Heck, I've seen CLASSPATH variables longer 
than that. 
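
To make the limit concrete, here is a minimal sketch (the class and method 
names are made up for illustration). It relies on the documented behavior 
of DataOutputStream.writeUTF(), which uses the same 2-byte length prefix 
and throws UTFDataFormatException when the encoded string needs more than 
65535 bytes.

```java
import java.io.ByteArrayOutputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.UTFDataFormatException;
import java.util.Arrays;

class WriteUtfLimitDemo {
    // Returns true if writeUTF() accepts a string made of n ASCII characters.
    static boolean fitsInWriteUTF(int n) {
        char[] chars = new char[n];
        Arrays.fill(chars, 'x');            // each 'x' encodes to exactly one byte
        String s = new String(chars);
        try {
            DataOutputStream dos = new DataOutputStream(new ByteArrayOutputStream());
            dos.writeUTF(s);
            return true;
        } catch (UTFDataFormatException e) {
            return false;                   // encoded form does not fit in a 2-byte length
        } catch (IOException e) {
            throw new IllegalStateException(e);   // not expected for an in-memory stream
        }
    }

    public static void main(String[] args) {
        System.out.println(fitsInWriteUTF(65535));   // prints true
        System.out.println(fitsInWriteUTF(65536));   // prints false
    }
}
```

Anything that has to carry a large XML document through this format would 
need to be split up by the caller.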

The first program in the message proves that the Java charset named UTF-8 
is NOT the same as Java's "modified UTF-8" format: the charset writes an 
interior null as a single 0x00 byte. The second program proves that 
writeUTF() uses the modified format, in which each null becomes the two 
bytes 0xC0 0x80. 
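
The difference is easier to see without going through files. The sketch 
below (class and helper names are mine, purely illustrative) encodes the 
same six-character string as the attached programs both ways and dumps 
the bytes in hex.

```java
import java.io.ByteArrayOutputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.nio.charset.StandardCharsets;

class Utf8VersusModified {
    // Formats a byte array as space-separated two-digit hex values.
    static String hex(byte[] bytes) {
        StringBuilder sb = new StringBuilder();
        for (byte b : bytes) {
            sb.append(String.format("%02X ", b & 0xFF));
        }
        return sb.toString().trim();
    }

    // Standard UTF-8, as produced by the charset named "UTF-8".
    static byte[] standardUtf8(String s) {
        return s.getBytes(StandardCharsets.UTF_8);
    }

    // Output of writeUTF(): 2-byte length followed by modified UTF-8.
    static byte[] viaWriteUTF(String s) {
        ByteArrayOutputStream bos = new ByteArrayOutputStream();
        try {
            new DataOutputStream(bos).writeUTF(s);
        } catch (IOException e) {
            throw new IllegalStateException(e);   // not expected for an in-memory stream
        }
        return bos.toByteArray();
    }

    public static void main(String[] args) {
        String s = new String(new char[] {'A', '\u0000', 'B', '\u0080', 'C', '\u0000'});
        System.out.println(hex(standardUtf8(s)));   // 41 00 42 C2 80 43 00
        System.out.println(hex(viaWriteUTF(s)));    // 00 09 41 C0 80 42 C2 80 43 C0 80
    }
}
```

The first two bytes of the second line are the length (9), and both nulls 
show up as C0 80 rather than 00.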

The ideas of encoding interior nulls as two bytes and providing a trailing 
null are for C and C++ convenience. Their value rises in proportion to the 
likelihood of having native code consuming the strings (and, to a lesser 
extent, generating them).
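
As a rough illustration of why this matters to native code, the sketch 
below imitates C's strlen() in Java (cStrlen is a hypothetical helper, not 
a library call) and applies it to both encodings of the same string. With 
standard UTF-8 the scan stops at the first interior null; with the 
modified form it sees the whole payload.

```java
import java.io.ByteArrayOutputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

class CStringViewDemo {
    // Imitates strlen(): counts bytes up to the first 0x00 byte or the end of the buffer.
    static int cStrlen(byte[] buf) {
        int n = 0;
        while (n < buf.length && buf[n] != 0) {
            n++;
        }
        return n;
    }

    static byte[] standardUtf8(String s) {
        return s.getBytes(StandardCharsets.UTF_8);
    }

    // Modified UTF-8 payload of writeUTF(), with the 2-byte length prefix removed.
    static byte[] modifiedUtf8Payload(String s) {
        ByteArrayOutputStream bos = new ByteArrayOutputStream();
        try {
            new DataOutputStream(bos).writeUTF(s);
        } catch (IOException e) {
            throw new IllegalStateException(e);   // not expected for an in-memory stream
        }
        byte[] all = bos.toByteArray();
        return Arrays.copyOfRange(all, 2, all.length);
    }

    public static void main(String[] args) {
        String s = new String(new char[] {'A', '\u0000', 'B', '\u0080', 'C', '\u0000'});
        System.out.println(cStrlen(standardUtf8(s)));          // prints 1
        System.out.println(cStrlen(modifiedUtf8Payload(s)));   // prints 9
    }
}
```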

I don't really care what format we use. But if nothing else this 
discussion should make it clear that we have to think about it, and 
specify our encoding EXACTLY.

-- Allan Pratt, apratt@xxxxxxxxxx
Rational software division of IBM




"Nguyen, Hoang M" <hoang.m.nguyen@xxxxxxxxx> 
Sent by: hyades-dev-admin@xxxxxxxxxxx
08/18/2004 03:47 PM
Please respond to
hyades-dev


To
<hyades-dev@xxxxxxxxxxx>
cc

Subject
[hyades-dev] More info on Java UTF-8






 
Hello all,
 
If we want to adopt the Java UTF-8 form, we may want to consider adopting 
its data structure as well.
 
Here is the spec of the UTF-8 data structure in the Java Virtual Machine (JVM):
   
http://java.sun.com/docs/books/vmspec/2nd-edition/html/ClassFile.doc.html#7963
 
2-byte length
followed by the UTF-8 byte stream
length does not contain the null character
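
For reference, DataOutputStream.writeUTF() produces the same shape, so the 
layout can be checked with a few lines. This is only a sketch; the class 
and method names are invented for the example.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInputStream;
import java.io.DataOutputStream;
import java.io.IOException;

class LengthPrefixDemo {
    // Encodes with writeUTF(): 2-byte big-endian length, then the encoded bytes.
    static byte[] encode(String s) {
        ByteArrayOutputStream bos = new ByteArrayOutputStream();
        try {
            new DataOutputStream(bos).writeUTF(s);
        } catch (IOException e) {
            throw new IllegalStateException(e);   // not expected for an in-memory stream
        }
        return bos.toByteArray();
    }

    // Reads back only the 2-byte length prefix.
    static int lengthPrefix(byte[] encoded) {
        try {
            return new DataInputStream(new ByteArrayInputStream(encoded)).readUnsignedShort();
        } catch (IOException e) {
            throw new IllegalStateException(e);
        }
    }

    // Decodes the whole structure back into a String.
    static String decode(byte[] encoded) {
        try {
            return new DataInputStream(new ByteArrayInputStream(encoded)).readUTF();
        } catch (IOException e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        byte[] encoded = encode("hello");
        System.out.println(lengthPrefix(encoded));   // prints 5
        System.out.println(encoded.length);          // prints 7: 2 length bytes + 5 data bytes, no terminator
        System.out.println(decode(encoded));         // prints hello
    }
}
```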
 
In addition, I have verified that Java can also write a null character as 
a single byte.
Please see attached programs.
 
We can discuss more and get some resolution on this issue in our weekly 
meeting.
 
Regards,
 
 
 
/*
   This demo program shows that Java's "UTF-8" charset writes a
      null character as a single byte.
 */

import java.io.* ;

public class MyUTF8Output
{
     public static void main(String args[])
     {
          FileOutputStream    fos ;
          OutputStreamWriter  osw ;

          // Interior null, a two-byte character (U+0080), and a trailing null.
          char[] msg = {'A', '\u0000', 'B', '\u0080', 'C', '\u0000'} ;

          try
          {
              String s = new String(msg) ;

              fos = new FileOutputStream("myoutput.txt");
              osw = new OutputStreamWriter(fos, "UTF-8");

              osw.write(s) ;
              osw.close() ;   // flushes the encoder and closes the file

              System.out.println("See \"myoutput.txt\" file.") ;
          }
          catch (Exception e)
          {
              e.printStackTrace() ;   // report failures instead of hiding them
          }
    }
}
 
 
 
/*
   This demo program shows that the Java writeUTF() format is:
        - a 2-byte length of the UTF-8 buffer
        - followed by the data, with each null mapped into two bytes
 */

import java.io.* ;

public class MyUTF8Conversion
{
     public static void main(String args[])
     {
          FileOutputStream    fos ;
          DataOutputStream    dos ;

          // Interior null, a two-byte character (U+0080), and a trailing null.
          char[] msg = {'A', '\u0000', 'B', '\u0080', 'C', '\u0000'} ;

          try
          {
              String s = new String(msg) ;

              fos = new FileOutputStream("myoutput2.txt");
              dos = new DataOutputStream(fos);

              dos.writeUTF(s) ;
              dos.close() ;   // flushes and closes the file

              System.out.println("See \"myoutput2.txt\" file.") ;
          }
          catch (Exception e)
          {
              e.printStackTrace() ;   // report failures instead of hiding them
          }
    }
}
 
 


