Hello all,
If we want to adopt the Java UTF-8 form, we may want to
consider adopting its data structure as well.
Here is the spec of the UTF-8 data structure used by the Java Virtual
Machine (JVM):
http://java.sun.com/docs/books/vmspec/2nd-edition/html/ClassFile.doc.html#7963
- 2-byte length (a count of bytes, not characters)
- followed by the UTF-8 byte stream
- the string is not null-terminated, so the length does not include a
  null character
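
As a rough illustration of the layout above, here is a minimal sketch. It
uses DataOutputStream.writeUTF as a stand-in for the class-file structure,
since writeUTF emits the same 2-byte length followed by the encoded bytes.
The class name Utf8LengthPrefixSketch and its helper methods are made up
for this example and are not part of any proposal.

```java
import java.io.ByteArrayOutputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;

public class Utf8LengthPrefixSketch {

    /** Encodes s with DataOutputStream.writeUTF into an in-memory buffer. */
    static byte[] encode(String s) {
        try {
            ByteArrayOutputStream bos = new ByteArrayOutputStream();
            DataOutputStream dos = new DataOutputStream(bos);
            dos.writeUTF(s);
            dos.flush();
            return bos.toByteArray();
        } catch (IOException e) {
            // Not expected for short strings written to memory.
            throw new UncheckedIOException(e);
        }
    }

    /** Reads the big-endian, unsigned 2-byte length at the start of the buffer. */
    static int prefixLength(byte[] encoded) {
        return ((encoded[0] & 0xFF) << 8) | (encoded[1] & 0xFF);
    }

    public static void main(String[] args) {
        byte[] encoded = encode("AB");
        // The prefix counts only the encoded bytes that follow it.
        System.out.println("prefix length = " + prefixLength(encoded)); // 2
        System.out.println("total bytes   = " + encoded.length);        // 4
    }
}
```

The prefix counts the bytes after it, so the total size is always the
prefix value plus two, with no terminator at the end.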
In addition, I have verified that Java also supports the standard
translation, in which a null character is written as a single null byte.
Please see attached programs.
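
For a quick in-memory check of the single-byte case, without writing a
file, something like the following sketch should do. It assumes a Java
version that has java.nio.charset.StandardCharsets; the class name
NullByteStandardUtf8 and the hex helper are invented for this example.

```java
import java.nio.charset.StandardCharsets;

public class NullByteStandardUtf8 {

    /** Encodes s with the standard UTF-8 charset. */
    static byte[] standardUtf8(String s) {
        return s.getBytes(StandardCharsets.UTF_8);
    }

    /** Formats bytes as space-separated, two-digit uppercase hex. */
    static String hex(byte[] bytes) {
        StringBuilder sb = new StringBuilder();
        for (byte b : bytes) {
            if (sb.length() > 0) {
                sb.append(' ');
            }
            sb.append(String.format("%02X", b & 0xFF));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        char[] msg = {'A', '\u0000', 'B', '\u0080', 'C', '\u0000'};
        byte[] utf8 = standardUtf8(new String(msg));
        // Each '\u0000' becomes the single byte 00; '\u0080' becomes C2 80.
        System.out.println(hex(utf8)); // 41 00 42 C2 80 43 00
    }
}
```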
We can discuss this further and try to reach a resolution on this issue
in our weekly meeting.
Regards,
/*
This demo program shows that Java can write a UTF-8 file
in which a null character is translated as one byte.
*/
import java.io.* ;
public class MyUTF8Output
{
    public static void main(String args[])
    {
        char[] msg = {'A', '\u0000', 'B', '\u0080', 'C', '\u0000'} ;
        try
        {
            String s = new String(msg) ;
            FileOutputStream fos = new FileOutputStream("myoutput.txt") ;
            // The "UTF-8" encoder writes '\u0000' as the single byte 0x00.
            OutputStreamWriter osw = new OutputStreamWriter(fos, "UTF-8") ;
            osw.write(s) ;
            // Closing the writer flushes it and closes the file stream.
            osw.close() ;
            System.out.println("See \"myoutput.txt\" file.") ;
        }
        catch (IOException e)
        {
            // Report the failure instead of silently ignoring it.
            e.printStackTrace() ;
        }
    }
}
/*
This demo program shows that the Java UTF-8 format is:
- a 2-byte length of the UTF-8 buffer
- a null character is mapped into two bytes (0xC0 0x80)
*/
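import java.io.* ;
public class MyUTF8Conversion
{
    public static void main(String args[])
    {
        char[] msg = {'A', '\u0000', 'B', '\u0080', 'C', '\u0000'} ;
        try
        {
            String s = new String(msg) ;
            FileOutputStream fos = new FileOutputStream("myoutput2.txt") ;
            DataOutputStream dos = new DataOutputStream(fos) ;
            // writeUTF emits a 2-byte length, then the encoded bytes,
            // with each '\u0000' written as two bytes.
            dos.writeUTF(s) ;
            // Closing the data stream flushes it and closes the file stream.
            dos.close() ;
            System.out.println("See \"myoutput2.txt\" file.") ;
        }
        catch (IOException e)
        {
            // Report the failure instead of silently ignoring it.
            e.printStackTrace() ;
        }
    }
}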
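
To confirm that the two-byte form of the null character reads back
correctly, a round trip through DataInputStream.readUTF can be sketched as
follows. This is only an in-memory illustration under the same test
message; the class name ModifiedUtf8RoundTrip and its write/read helpers
are hypothetical.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;

public class ModifiedUtf8RoundTrip {

    /** Writes s with writeUTF and returns the raw bytes. */
    static byte[] write(String s) {
        try {
            ByteArrayOutputStream bos = new ByteArrayOutputStream();
            DataOutputStream dos = new DataOutputStream(bos);
            dos.writeUTF(s);
            dos.flush();
            return bos.toByteArray();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    /** Reads one length-prefixed string back with readUTF. */
    static String read(byte[] data) {
        try {
            DataInputStream dis = new DataInputStream(new ByteArrayInputStream(data));
            return dis.readUTF();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        char[] msg = {'A', '\u0000', 'B', '\u0080', 'C', '\u0000'};
        String original = new String(msg);
        byte[] data = write(original);
        // 2 length bytes + 9 encoded bytes (each '\u0000' takes two bytes).
        System.out.println("encoded bytes: " + data.length);                  // 11
        System.out.println("round trip ok: " + original.equals(read(data))); // true
    }
}
```

With this form the encoded data never contains a 0x00 byte, which is the
main practical difference from the single-byte translation.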