RE: [hyades-dev] More info on Java UTF-8


From: Allan K Pratt <apratt@xxxxxxxxxx>
Date: Thu, 19 Aug 2004 11:49:56 -0700
Delivered-to: hyades-dev@xxxxxxxxxxx
List-archive: <http://dev.eclipse.org/pipermail/hyades-dev/>

Actually, I'd noticed that. Since messages like the filter specification 
include Java class names, this implies that the filtering doesn't work for 
non-ASCII class names on EBCDIC targets, even though such class names are 
perfectly acceptable to the JVM. I would expect these class names to be 
encoded as UTF-8 in the message, but the conversion to EBCDIC will fail 
for non-ASCII char values. Am I right? 

This is one reason I think we should use UTF-8 throughout the new 
protocol, regardless of the local machine encoding. Transcoding should 
occur only when a string enters or leaves the "protocol space." But in the 
new protocol we're defining, a message from an Eclipse workbench to a 
Java-focused agent should never undergo transcoding to EBCDIC, even if 
it's transmitted to an EBCDIC machine.
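
To illustrate the failure mode, here is a minimal sketch of what happens to a non-ASCII class name on a lossy transcoding hop. The RAC's actual conversion code is not shown in this thread, so US-ASCII stands in for the ASCII side of the EBCDIC/ASCII conversion, and the class name `com.example.Größe` and the helper `roundTrip` are made up for the example.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

// Hypothetical demo; not taken from the RAC or HCE sources.
final class TranscodeDemo {
    // Encode with the given charset and decode again, as a transcoding hop would.
    static String roundTrip(String s, Charset cs) {
        return new String(s.getBytes(cs), cs);
    }

    public static void main(String[] args) {
        // "Größe" is a legal Java identifier, so this is a legal class name.
        String className = "com.example.Gr\u00f6\u00dfe";

        // Assumption: US-ASCII stands in for the single-byte conversion step.
        String viaAscii = roundTrip(className, StandardCharsets.US_ASCII);
        String viaUtf8 = roundTrip(className, StandardCharsets.UTF_8);

        System.out.println("US-ASCII hop preserves name: " + className.equals(viaAscii));
        System.out.println("UTF-8 hop preserves name:    " + className.equals(viaUtf8));
    }
}
```

The unmappable characters are replaced on the US-ASCII hop, so a filter comparing the transcoded name against the real class name would no longer match. Carrying the UTF-8 bytes through untouched avoids that.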

-- Allan Pratt, apratt@xxxxxxxxxx
Rational software division of IBM




Dave Smith <smith@xxxxxxxxxx> 
Sent by: hyades-dev-admin@xxxxxxxxxxx
08/19/2004 08:08 AM
Please respond to: hyades-dev
To: hyades-dev@xxxxxxxxxxx
Subject: RE: [hyades-dev] More info on Java UTF-8

The only conversion the current RAC does on string data in control 
messages is EBCDIC/ASCII conversion on EBCDIC machines.  It doesn't do any 
internationalization conversion for the local machine. 

Dave N. Smith
Data Collection Tooling 
IBM Toronto Laboratory
D3/RNQ/8200/MKM
External: 905-413-3916
Tieline: 969-3916
smith@xxxxxxxxxx




"Kaylor, Andrew" <andrew.kaylor@xxxxxxxxx> 
Sent by: hyades-dev-admin@xxxxxxxxxxx
08/18/2004 07:35 PM
Please respond to: hyades-dev
To: <hyades-dev@xxxxxxxxxxx>
Subject: RE: [hyades-dev] More info on Java UTF-8

I also don't think we need to match Java's internal representation of 
strings.  The only things we need to worry about are: 
  
1. If the data is received via a Java socket, how easily can it be 
translated into a Java string? 
2. If the data is sent via a Java socket, how easily can it be gotten from 
a Java string? 
3. If the data is received via a C socket, how easily can it be translated 
into a standard C format? 
4. If the data is sent via a C socket, how easily can it be gotten from a 
standard C format? 
5. If the data is sent to an agent which is primarily implemented in Java, 
how easily can it make the JNI transition between the C stub and the Java 
agent? 
6. If the data is sent from an agent which is primarily implemented in 
Java, how easily can it make the JNI transition between the Java agent and 
the C stub? 
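
For questions 1 and 2, a minimal sketch of the Java side, assuming the wire carries standard UTF-8 bytes. The 4-byte length prefix, the class name and the `send`/`receive` helpers are assumptions made for illustration only; the wire format has not been decided in this thread. Byte-array streams stand in for the socket's streams.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;
import java.nio.charset.StandardCharsets;

// Hypothetical sketch; the framing below is assumed, not part of any agreed protocol.
final class WireStringSketch {
    // Question 2: Java string -> bytes for the socket's OutputStream.
    static void send(OutputStream out, String s) throws IOException {
        byte[] utf8 = s.getBytes(StandardCharsets.UTF_8);
        DataOutputStream data = new DataOutputStream(out);
        data.writeInt(utf8.length); // assumed 4-byte big-endian byte count
        data.write(utf8);
        data.flush();
    }

    // Question 1: bytes from the socket's InputStream -> Java string.
    static String receive(InputStream in) throws IOException {
        DataInputStream data = new DataInputStream(in);
        byte[] utf8 = new byte[data.readInt()];
        data.readFully(utf8);
        return new String(utf8, StandardCharsets.UTF_8);
    }

    public static void main(String[] args) throws IOException {
        // Stands in for socket.getOutputStream() / socket.getInputStream().
        ByteArrayOutputStream wire = new ByteArrayOutputStream();
        send(wire, "caf\u00e9");
        String back = receive(new ByteArrayInputStream(wire.toByteArray()));
        System.out.println("round trip ok: " + back.equals("caf\u00e9"));
    }
}
```

Both directions are a single library call on the Java side, which suggests that standard UTF-8 on the wire costs little there even though it differs from Java's internal representation.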
  
In the case of the third and fourth questions, a lot depends on how we 
actually plan to represent character strings in the HCE code itself.  It 
seems likely that we'll be translating out of the wire format and into 
some "local" format early on in the process.  So we need a strategy for 
making the HCE code internationalizable. 
  
Does the current RAC worry about that? 
  
-Andy 
  
-----Original Message-----
From: hyades-dev-admin@xxxxxxxxxxx [mailto:hyades-dev-admin@xxxxxxxxxxx] 
On Behalf Of Nguyen, Hoang M
Sent: Wednesday, August 18, 2004 3:47 PM
To: hyades-dev@xxxxxxxxxxx
Subject: [hyades-dev] More info on Java UTF-8 
  
  
Hello all, 
  
If we want to adopt the Java UTF-8 form, we may want to consider adopting 
its data structure as well. 
  
Here is the spec of the UTF-8 data structure in the Java Virtual Machine (JVM): 
   
http://java.sun.com/docs/books/vmspec/2nd-edition/html/ClassFile.doc.html#7963 

  
· 2-byte length 
· followed by the UTF-8 byte stream 
· the length does not include a null terminator 
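
As a rough illustration of that layout, the sketch below uses `DataOutputStream.writeUTF`, which writes a 2-byte length followed by the encoded bytes, and then reads the length prefix back by hand. The class and helper names are made up for the example, and treating `writeUTF` output as a stand-in for the class-file structure is an assumption of this sketch.

```java
import java.io.ByteArrayOutputStream;
import java.io.DataOutputStream;
import java.io.IOException;

// Hypothetical sketch; names are illustrative only.
final class Utf8InfoLayout {
    // Produce the length-prefixed form of a string in memory.
    static byte[] encode(String s) throws IOException {
        ByteArrayOutputStream buf = new ByteArrayOutputStream();
        DataOutputStream out = new DataOutputStream(buf);
        out.writeUTF(s);
        out.flush();
        return buf.toByteArray();
    }

    // Read the first two bytes as an unsigned big-endian length.
    static int lengthPrefix(byte[] encoded) {
        return ((encoded[0] & 0xFF) << 8) | (encoded[1] & 0xFF);
    }

    public static void main(String[] args) throws IOException {
        // 'H' and 'i' take one byte each, e-acute takes two.
        byte[] encoded = encode("Hi\u00e9");
        System.out.println("length prefix = " + lengthPrefix(encoded)); // prints 4
        System.out.println("total bytes   = " + encoded.length);        // prints 6
    }
}
```

The total is exactly the prefix plus its own two bytes, so nothing follows the encoded string: there is no terminating null byte.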
  
In addition, I have verified that Java can also write a null character as 
a single byte when it uses its standard UTF-8 encoder. 
Please see the attached programs. 
  
We can discuss more and get some resolution on this issue in our weekly 
meeting. 
  
Regards, 
  
  
  
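/* 
   This demo program shows that Java's standard UTF-8 encoder
   (OutputStreamWriter with "UTF-8") writes a null character
   as a single 0x00 byte.
 */ 
  
import java.io.FileOutputStream ; 
import java.io.IOException ; 
import java.io.OutputStreamWriter ; 
import java.io.Writer ; 
import java.nio.charset.StandardCharsets ; 
  
public class MyUTF8Output 
{ 
     public static void main(String[] args) 
     { 
          char[] msg = {'A', '\u0000', 'B', '\u0080', 'C', '\u0000'} ; 
          String s = new String(msg) ; 
  
          // try-with-resources closes the writer, which flushes it and
          // closes the underlying file stream.
          try (Writer osw = new OutputStreamWriter(
                    new FileOutputStream("myoutput.txt"), StandardCharsets.UTF_8)) 
          { 
              osw.write(s) ; 
              System.out.println("See \"myoutput.txt\" file.") ; 
          } 
          catch (IOException e) 
          { 
              // Report the failure instead of silently ignoring it.
              e.printStackTrace() ; 
          } 
     } 
} 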
  
  
  
/* 
   This demo program shows that the Java UTF-8 format
   written by DataOutputStream.writeUTF is:
        - a 2-byte length of the UTF-8 buffer 
        - a null character mapped into two bytes (0xC0 0x80) 
 */ 
  
import java.io.DataOutputStream ; 
import java.io.FileOutputStream ; 
import java.io.IOException ; 
  
public class MyUTF8Conversion 
{ 
     public static void main(String[] args) 
     { 
          char[] msg = {'A', '\u0000', 'B', '\u0080', 'C', '\u0000'} ; 
          String s = new String(msg) ; 
  
          // try-with-resources closes the data stream, which flushes it and
          // closes the underlying file stream.
          try (DataOutputStream dos = new DataOutputStream(
                    new FileOutputStream("myoutput2.txt"))) 
          { 
              dos.writeUTF(s) ; 
              System.out.println("See \"myoutput2.txt\" file.") ; 
          } 
          catch (IOException e) 
          { 
              // Report the failure instead of silently ignoring it.
              e.printStackTrace() ; 
          } 
     } 
}
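
For anyone who would rather not open the two output files in a hex viewer, here is a small sketch that produces both encodings of the same string in memory and prints them as hex. The class and helper names are made up for this example; only `String.getBytes` and `DataOutputStream.writeUTF` are standard library calls.

```java
import java.io.ByteArrayOutputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.nio.charset.StandardCharsets;

// Hypothetical sketch; names are illustrative only.
final class NullByteComparison {
    // Format bytes as space-separated upper-case hex pairs.
    static String hex(byte[] bytes) {
        StringBuilder sb = new StringBuilder();
        for (byte b : bytes) {
            if (sb.length() > 0) {
                sb.append(' ');
            }
            sb.append(String.format("%02X", b & 0xFF));
        }
        return sb.toString();
    }

    // Standard UTF-8, the same encoding OutputStreamWriter uses for "UTF-8".
    static byte[] standardUtf8(String s) {
        return s.getBytes(StandardCharsets.UTF_8);
    }

    // Length-prefixed modified UTF-8, as written by DataOutputStream.writeUTF.
    static byte[] writeUtf(String s) throws IOException {
        ByteArrayOutputStream buf = new ByteArrayOutputStream();
        DataOutputStream out = new DataOutputStream(buf);
        out.writeUTF(s);
        out.flush();
        return buf.toByteArray();
    }

    public static void main(String[] args) throws IOException {
        String s = new String(new char[] {'A', '\u0000', 'B', '\u0080', 'C', '\u0000'});
        System.out.println(hex(standardUtf8(s))); // 41 00 42 C2 80 43 00
        System.out.println(hex(writeUtf(s)));     // 00 09 41 C0 80 42 C2 80 43 C0 80
    }
}
```

In the first line each null character is a single 00 byte. In the second, the leading 00 09 is the byte count and each null character becomes C0 80, so the payload never contains a zero byte.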
