Driver: M6
Platform: Novell SUSE Linux Enterprise Server 9 SP1, ja_JP.UTF-8 locale
Java: IBM J9 1.4.2 sr1a j9xia32142sr1a-20050209

Steps:
1. Create a DBCS-named file and an English-named file in a simple project.
2. Export > Archive the simple project as a tar file.
3. Import > Archive it.

Result:
The import dialog displays the DBCS file name as BBBB. That is the same result I get with the external tar command in the Konsole shell. Zip file export/import works fine. Importing a tar file created by the external command also displays bogus names in the import wizard.
Created attachment 19734 [details] test data for recreation contains .project, 1111.txt and "DBCS name".txt
Created attachment 19736 [details] tar file created by external tar command contains 111.txt and DBCS name.txt
any status here Tod?
Billy owns the tar archives.
Fixed in HEAD. We now use UTF-8 encoding for filenames in tar files, which seems to be the correct standard for tar/pax.
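For illustration only, here is a minimal sketch of what storing a filename as UTF-8 in the 100-byte ustar name field could look like. It is not the actual Eclipse code; the class and method names (`TarNameEncoding`, `encodeName`, `decodeName`) are hypothetical, and ISO-8859-1 merely stands in for any non-UTF-8 locale encoding.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

// Hypothetical helper, not taken from the Eclipse source.
class TarNameEncoding {
    // Size of the "name" field in a ustar header block.
    static final int NAME_FIELD_LENGTH = 100;

    // Encode a name into a NUL-padded name field using the given charset.
    static byte[] encodeName(String name, Charset cs) {
        byte[] raw = name.getBytes(cs);
        if (raw.length > NAME_FIELD_LENGTH) {
            throw new IllegalArgumentException("name too long for ustar name field");
        }
        return Arrays.copyOf(raw, NAME_FIELD_LENGTH); // pads with zero bytes
    }

    // Decode a name field up to the first NUL byte using the given charset.
    static String decodeName(byte[] field, Charset cs) {
        int len = 0;
        while (len < field.length && field[len] != 0) {
            len++;
        }
        return new String(field, 0, len, cs);
    }

    public static void main(String[] args) {
        String name = "\u65e5\u672c\u8a9e.txt"; // a Japanese (DBCS) file name
        byte[] field = encodeName(name, StandardCharsets.UTF_8);

        // Same charset on both sides: the name round-trips on any platform.
        System.out.println(decodeName(field, StandardCharsets.UTF_8).equals(name));

        // Different charset on the reading side: the name comes out garbled.
        System.out.println(decodeName(field, StandardCharsets.ISO_8859_1).equals(name));
    }
}
```

The point of the sketch is that the round trip depends only on writer and reader agreeing on the charset, not on the platform default.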
I think that file name encoding should use the system locale encoding. Using UTF-8 in all cases would be an issue.
I do not think that using the system locale setting is correct in a file format as it may change between systems. Furthermore, UTF-8 is only used for encoding and decoding filenames as represented in the tar file, which is not necessarily the representation used when the file is created on disk. I believe that using UTF-8 is a reasonable compromise for tar files. By IEEE Std 1003.1-2001, filenames in tar files should be limited to 7-bit ASCII, but this is violated by the tar file you posted, so we must be more liberal. However, since this is technically against the standard, what encoding is used seems more like a matter of taste. For pax extended headers, the standard is UTF-8 and all filenames must be encoded as such. Since I do intend to support pax extended headers (bug 84756), using the same idea here seems reasonable. If you have a specific case in mind where my approach causes problems, please re-open this bug again.
I have verified the fix in I20050509-2010 for M7.
I've read the pax header block spec and found that the charset is not limited to ASCII, as was mentioned in comment #7. The following quote from the description shows that additional encodings are allowed:

charset: "The name of the character set used to encode the data in the following file(s). The entries in the following table are defined to refer to known standards; additional names may be agreed on between the originator and recipient."

If we create a tar file with Eclipse on a non-UTF-8 Japanese locale (e.g. on AIX or Solaris), the tar command on these UNIX systems will produce garbled file names from it, and vice versa. I'd like to request an encoding selection in the export/import wizards.
My understanding is that the charset parameter you quote refers to the charset of the data in the file and not the filename. Please see the documentation for the header of the "ustar Interchange Format". In particular: "All characters in the header logical record shall be represented in the coded character set of the ISO/IEC 646:1991 standard. For maximum portability between implementations, names should be selected from characters represented by the portable filename character set as octets with the most significant bit zero." This is improved with the pax extended headers where filenames are always stored as UTF-8. See the section "pax Archive Character Set Encoding/Decoding". Do you have an example .tar archive created on an AIX or Solaris machine by their tar or pax utilities which includes Japanese filenames? If possible I would rather implement the pax header for bug 84756 than try to work around ustar's limitations in a more complicated way.
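To make the ustar restriction quoted above concrete, here is a small sketch that checks whether a name stays within it. The class and method names are hypothetical. `isSevenBit` tests for octets with the most significant bit zero, and `isPortableComponent` tests a single path component against the POSIX portable filename character set (A-Z, a-z, 0-9, period, underscore, hyphen).

```java
import java.nio.charset.StandardCharsets;

// Hypothetical helper for illustration only.
class UstarNameCheck {
    // True if every octet has its most significant bit set to zero.
    static boolean isSevenBit(byte[] nameBytes) {
        for (byte b : nameBytes) {
            if ((b & 0x80) != 0) {
                return false;
            }
        }
        return true;
    }

    // True if a single path component uses only the POSIX portable
    // filename character set: A-Z a-z 0-9 . _ -
    static boolean isPortableComponent(String component) {
        if (component.isEmpty()) {
            return false;
        }
        for (int i = 0; i < component.length(); i++) {
            char c = component.charAt(i);
            boolean ok = (c >= 'A' && c <= 'Z') || (c >= 'a' && c <= 'z')
                    || (c >= '0' && c <= '9') || c == '.' || c == '_' || c == '-';
            if (!ok) {
                return false;
            }
        }
        return true;
    }

    public static void main(String[] args) {
        byte[] ascii = "abc.txt".getBytes(StandardCharsets.UTF_8);
        byte[] dbcs = "\u65e5\u672c\u8a9e.txt".getBytes(StandardCharsets.UTF_8);
        System.out.println(isSevenBit(ascii)); // within the ustar restriction
        System.out.println(isSevenBit(dbcs));  // outside it, whatever the encoding
    }
}
```

A DBCS name fails the seven-bit check in any multi-byte encoding, which is why such archives fall outside what ustar strictly defines.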
Created attachment 20949 [details] sample tar file created on AIX
Created on AIX 5.1 in the Ja_JP locale. The encoding is equivalent to the Windows Japanese encoding. The archive contains two files: abc.txt and a DBCS-named file with a .txt extension. It was created with the command "tar -cf 90963.tar *.txt".
Created attachment 20950 [details] sample pax file created on AIX
Created on AIX 5.1 in the Ja_JP locale. The encoding is equivalent to the Windows Japanese encoding. The archive contains two files: abc.txt and a DBCS-named file with a .txt extension. It was created with the command "pax -w -f 90963.pax *.txt".
By the way, RHEL4's pax command does not seem to support internationalization; it is unable to handle UTF-8 file names. Do you have any information about Linux's pax?
My understanding is that the pax command distributed on Linux systems is quite old and does not support the new POSIX pax format which I believe was first defined in POSIX 1003.1-2001. However, I believe pax format archives can be read by newer versions of GNU tar, and created using --format=posix.
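As background on the POSIX pax format mentioned here: an extended header record has the form "length keyword=value\n", where the decimal length counts every byte of the record, including the length digits themselves, and the value is UTF-8. The following is a rough sketch of building such a record for the `path` keyword. The class and method names are hypothetical and this is not how any particular tool implements it.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical helper for illustration only.
class PaxRecord {
    // Build one pax extended header record: "<length> <key>=<value>\n".
    // <length> is the total record size in bytes, including its own digits.
    static byte[] build(String key, String value) {
        byte[] body = (" " + key + "=" + value + "\n").getBytes(StandardCharsets.UTF_8);
        int len = body.length + 1; // first guess: a one-digit length field
        while (true) {
            int total = body.length + Integer.toString(len).length();
            if (total == len) {
                break; // the length now accounts for its own digits
            }
            len = total;
        }
        byte[] prefix = Integer.toString(len).getBytes(StandardCharsets.US_ASCII);
        byte[] record = new byte[len];
        System.arraycopy(prefix, 0, record, 0, prefix.length);
        System.arraycopy(body, 0, record, prefix.length, body.length);
        return record;
    }

    public static void main(String[] args) {
        byte[] record = build("path", "abc.txt");
        System.out.print(new String(record, StandardCharsets.UTF_8)); // 16 path=abc.txt
    }
}
```

Because the value is always UTF-8, a reader of this record does not need to know the locale of the system that wrote the archive.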
So Billy, is there anything else you plan to do here? The interesting point is that Import cannot display the name of the file you just exported.
With the current code in RC1, Eclipse will always be able to correctly import tar files exported from Eclipse. The remaining issue is with importing tar files created on the command line on systems that do not use UTF-8 for their filenames, such as AIX 5.1. There is no way to detect the filename encoding used in a tar file. Therefore, the only solution would be to add an input field for the filename encoding to the import wizards as suggested in comment #9. I am worried about the complexity of this solution. Since ustar does not really support i18n filenames, and since the problem is limited to interoperability with external applications and not Eclipse itself, I think we should defer finding a more robust solution to post 3.1.
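A sketch of what the wizard-side decoding might look like if the user supplied the filename encoding, purely as an assumption about one possible design. The class and method names are hypothetical. A strict decoder cannot detect which encoding was used, but it can at least report when the raw name bytes are not valid in the chosen charset, so the UI could warn instead of showing garbage. The sample bytes 0x93 0xFA are the Shift_JIS encoding of the first character of the Japanese name used in the example.

```java
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.Charset;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;

// Hypothetical helper for illustration only.
class TarNameDecoder {
    // Decode raw name bytes with the user-chosen charset.
    // Returns null if the bytes are not valid in that charset.
    static String decodeOrNull(byte[] raw, Charset chosen) {
        try {
            return chosen.newDecoder()
                    .onMalformedInput(CodingErrorAction.REPORT)
                    .onUnmappableCharacter(CodingErrorAction.REPORT)
                    .decode(ByteBuffer.wrap(raw))
                    .toString();
        } catch (CharacterCodingException e) {
            return null; // caller can warn that the selected encoding does not fit
        }
    }

    public static void main(String[] args) {
        byte[] utf8Name = "\u65e5.txt".getBytes(StandardCharsets.UTF_8);
        // Name bytes as a Shift_JIS system might have written them.
        byte[] sjisName = {(byte) 0x93, (byte) 0xFA, '.', 't', 'x', 't'};

        System.out.println(decodeOrNull(utf8Name, StandardCharsets.UTF_8) != null);
        System.out.println(decodeOrNull(sjisName, StandardCharsets.UTF_8) == null);
    }
}
```

This only validates a choice; bytes that happen to be valid in several charsets still decode without error, so the selection itself would have to come from the user.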
Steve, this is being deferred, please respond if there are any issues.
Since the original eclipse round trip bug is fixed and the way to fix this for some externally created files is not fully understood, I think it is fine to defer the rest of the work.
Changing summary to reflect remaining issue. Also see bug 84756.
"As per http://wiki.eclipse.org/Platform_UI/Bug_Triage_Change_2009"
This bug hasn't had any activity in quite some time. Maybe the problem got resolved, was a duplicate of something else, or became less pressing for some reason - or maybe it's still relevant but just hasn't been looked at yet. If you have further information on the current state of the bug, please add it. The information can be, for example, that the problem still occurs, that you still want the feature, that more information is needed, or that the bug is (for whatever reason) no longer relevant.