Bug 90963 - [Import/Export] should be able to import/export using varying filename encoding
Summary: [Import/Export] should be able to import/export using varying filename encoding
Status: NEW
Alias: None
Product: Platform
Classification: Eclipse Project
Component: IDE
Version: 3.1
Hardware: PC Linux-GTK
Importance: P5 enhancement
Target Milestone: ---
Assignee: Platform UI Triaged CLA
QA Contact:
URL:
Whiteboard:
Keywords: helpwanted, nl
Depends on:
Blocks:
 
Reported: 2005-04-11 07:03 EDT by Masayuki Fuse CLA
Modified: 2019-09-06 15:36 EDT (History)
8 users (show)

See Also:


Attachments
test data for recreation (543 bytes, application/octet-stream)
2005-04-11 07:06 EDT, Masayuki Fuse CLA
no flags Details
tar file created by external tar command (10.00 KB, application/octet-stream)
2005-04-11 07:08 EDT, Masayuki Fuse CLA
no flags Details
sample tar file created on AIX (10.00 KB, application/octet-stream)
2005-05-11 08:12 EDT, Masayuki Fuse CLA
no flags Details
sample pax file created on AIX (10.00 KB, application/octet-stream)
2005-05-11 08:14 EDT, Masayuki Fuse CLA
no flags Details

Description Masayuki Fuse CLA 2005-04-11 07:03:18 EDT
Driver : M6
Platform : Novell SUSE Linux Enterprise Server 9 SP1, ja_JP.UTF-8 locale
Java : IBM J9 1.4.2 sr1a j9xia32142sr1a-20050209

Steps
1. Create a DBCS-named file and an English-named file in a simple project
2. Export > Archive the simple project as a tar file
3. Import > Archive it

Result
The import dialog displays the DBCS file name as BBBB. That is the same result
I get from the external tar command in the Konsole shell. Zip file
export/import works fine.

Importing a tar file created by the external command also displays bogus names
in the import wizard.
Comment 1 Masayuki Fuse CLA 2005-04-11 07:06:02 EDT
Created attachment 19734 [details]
test data for recreation

contains .project, 1111.txt and "DBCS name".txt
Comment 2 Masayuki Fuse CLA 2005-04-11 07:08:43 EDT
Created attachment 19736 [details]
tar file created by external tar command

contains 111.txt and DBCS name.txt
Comment 3 Michael Van Meekeren CLA 2005-04-29 10:21:16 EDT
Any status here, Tod?
Comment 4 Tod Creasey CLA 2005-04-29 10:57:32 EDT
Billy owns the tar archives.
Comment 5 Billy Biggs CLA 2005-05-08 12:42:32 EDT
Fixed in HEAD.  We now use UTF-8 encoding for filenames in tar files, which
seems to be the correct standard for tar/pax.
Comment 6 Masayuki Fuse CLA 2005-05-10 03:26:03 EDT
I think that file name encoding should use the system locale encoding. Using
UTF-8 in all cases would be an issue.
Comment 7 Billy Biggs CLA 2005-05-10 10:54:07 EDT
I do not think that using the system locale setting is correct in a file format
as it may change between systems.  Furthermore, UTF-8 is only used for encoding
and decoding filenames as represented in the tar file, which is not necessarily
the representation used when the file is created on disk.

I believe that using UTF-8 is a reasonable compromise for tar files.  By IEEE
Std 1003.1-2001, filenames in tar files should be limited to 7-bit ASCII, but
this is violated by the tar file you posted, so we must be more liberal. 
However, since this is technically against the standard, what encoding is used
seems more like a matter of taste.  For pax extended headers, the standard is
UTF-8 and all filenames must be encoded as such.  Since I do intend to support
pax extended headers (bug 84756), using the same idea here seems reasonable.

If you have a specific case in mind where my approach causes problems, please
re-open this bug again.
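
The 7-bit restriction mentioned above can be checked directly on the raw header. The following is only a sketch using Python's standard tarfile module, not the Eclipse implementation. The helper names and the sample file name are hypothetical. It relies on the ustar layout, where the name field occupies the first 100 bytes of the header block.

```python
import io
import tarfile


def ustar_with_name(name: str, encoding: str) -> bytes:
    """Build an in-memory ustar archive holding one empty member called `name`."""
    buf = io.BytesIO()
    with tarfile.open(fileobj=buf, mode="w",
                      format=tarfile.USTAR_FORMAT, encoding=encoding) as tar:
        tar.addfile(tarfile.TarInfo(name))  # zero-length member, header only
    return buf.getvalue()


def first_header_name(archive: bytes) -> bytes:
    """Raw bytes of the name field (offset 0, 100 bytes, NUL-padded)."""
    return archive[:100].split(b"\0", 1)[0]


def is_seven_bit(raw: bytes) -> bool:
    """True if every octet has its most significant bit clear."""
    return all(octet < 0x80 for octet in raw)


ascii_name = first_header_name(ustar_with_name("abc.txt", "utf-8"))
dbcs_name = first_header_name(ustar_with_name("日本語.txt", "utf-8"))
print(is_seven_bit(ascii_name), is_seven_bit(dbcs_name))
```

A name that fails this check is outside what the ustar header strictly allows, so a reader has to guess how the bytes were encoded.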
Comment 8 Billy Biggs CLA 2005-05-10 23:06:40 EDT
I have verified the fix in I20050509-2010 for M7.
Comment 9 Masayuki Fuse CLA 2005-05-11 00:21:11 EDT
I've read the pax header block spec and found that the charset is not limited
to ASCII as mentioned in comment #7. The description is quoted below;
additional encodings are allowed.

charset
The name of the character set used to encode the data in the following
file(s). The entries in the following table are defined to refer to known
standards; additional names may be agreed on between the originator and
recipient.

If we create a tar file with Eclipse on a non-UTF-8 Japanese locale (e.g. AIX
or Solaris), the tar command on these UNIX systems will produce garbled file
names when it extracts the archive, and vice versa.

I'd like to ask for an encoding selection to be added to the export/import
wizards.
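
The mismatch described above can be reproduced without a tar file. This is a minimal sketch in Python, assuming Shift_JIS as a stand-in for the non-UTF-8 Japanese locale encoding. The file name is made up for illustration and is not taken from the attachments.

```python
# Hypothetical DBCS file name used only for illustration.
name = "日本語.txt"

# Bytes a writer on a Shift_JIS locale would store in the tar header (assumption).
raw = name.encode("shift_jis")

# A reader that assumes UTF-8 cannot decode those bytes cleanly;
# undecodable octets are replaced with U+FFFD.
garbled = raw.decode("utf-8", errors="replace")

# A reader that is told the writer's encoding recovers the original name.
recovered = raw.decode("shift_jis")

print(garbled != name, recovered == name)
```

The bytes are the same in both cases; only the reader's assumption about the encoding differs, which is why a selectable encoding would resolve the garbling.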
Comment 10 Billy Biggs CLA 2005-05-11 07:48:07 EDT
My understanding is that the charset parameter you quote refers to the charset
of the data in the file and not the filename.  Please see the documentation for
the header of the "ustar Interchange Format".  In particular:

"All characters in the header logical record shall be represented in the coded
character set of the ISO/IEC 646:1991 standard. For maximum portability between
implementations, names should be selected from characters represented by the
portable filename character set as octets with the most significant bit zero."

This is improved with the pax extended headers where filenames are always stored
as UTF-8.  See the section "pax Archive Character Set Encoding/Decoding".

Do you have an example .tar archive created on an AIX or Solaris machine by
their tar or pax utilities which includes Japanese filenames?  If possible I
would rather implement the pax header for bug 84756 than try to work around
ustar's limitations in a more complicated way.
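
As an illustration of the pax behaviour described above, the sketch below uses Python's standard tarfile module rather than Eclipse code. The helper names and the sample file name are hypothetical. It writes a pax-format archive and then reads it back with two different `encoding` settings; since the name travels in a UTF-8 extended header, the reader's setting should not matter.

```python
import io
import tarfile


def pax_archive(name: str) -> bytes:
    """Build an in-memory pax-format archive with one empty member."""
    buf = io.BytesIO()
    with tarfile.open(fileobj=buf, mode="w", format=tarfile.PAX_FORMAT) as tar:
        tar.addfile(tarfile.TarInfo(name))  # zero-length member
    return buf.getvalue()


def names_as_read(archive: bytes, encoding: str) -> list:
    """List member names, decoding plain header fields with `encoding`."""
    with tarfile.open(fileobj=io.BytesIO(archive), mode="r",
                      encoding=encoding) as tar:
        return tar.getnames()


archive = pax_archive("日本語.txt")
# The extended header record stores the path as UTF-8 bytes.
print("日本語.txt".encode("utf-8") in archive)
# The reader's encoding setting does not affect the pax path.
print(names_as_read(archive, "shift_jis") == names_as_read(archive, "ascii"))
```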
Comment 11 Masayuki Fuse CLA 2005-05-11 08:12:23 EDT
Created attachment 20949 [details]
sample tar file created on AIX

Created on AIX 5.1, Ja_JP locale. The encoding is equivalent to Windows
Japanese.

It contains two files: abc.txt and a DBCS-named file with a .txt extension.
Created with the tar command: "tar -cf 90963.tar *.txt"
Comment 12 Masayuki Fuse CLA 2005-05-11 08:14:21 EDT
Created attachment 20950 [details]
sample pax file created on AIX

Created on AIX 5.1, Ja_JP locale. The encoding is equivalent to Windows
Japanese.

It contains two files: abc.txt and a DBCS-named file with a .txt extension.
Created with the pax command: "pax -w -f 90963.pax *.txt"
Comment 13 Masayuki Fuse CLA 2005-05-11 08:18:09 EDT
By the way, RHEL4's pax command does not seem to support internationalization
and is unable to handle UTF-8 file names. Do you have any information on
Linux's pax?
Comment 14 Billy Biggs CLA 2005-05-15 07:37:32 EDT
My understanding is that the pax command distributed on Linux systems is quite
old and does not support the new POSIX pax format which I believe was first
defined in POSIX 1003.1-2001.  However, I believe pax format archives can be
read by newer versions of GNU tar, and created using --format=posix.
Comment 15 Michael Van Meekeren CLA 2005-06-03 10:00:28 EDT
So Billy, is there anything else you plan to do here?

The interesting point is that Import cannot display the name of the file you
just exported.

Comment 16 Billy Biggs CLA 2005-06-03 10:36:10 EDT
With the current code in RC1, Eclipse will always be able to correctly import
tar files exported from Eclipse.  The remaining issue is with importing tar
files created on the command line on systems that do not use UTF-8 for their
filenames, such as AIX 5.1.

There is no way to detect the filename encoding used in a tar file.  Therefore,
the only solution would be to add an input field for the filename encoding to
the import wizards as suggested in comment #9.

I am worried about the complexity of this solution.  Since ustar does not really
support i18n filenames, and since the problem is limited to interoperability
with external applications and not Eclipse itself, I think we should defer
finding a more robust solution to post 3.1.
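
To make the proposed input field concrete, here is a rough sketch of what a caller-supplied filename encoding does on import. It uses Python's standard tarfile module, not the Eclipse tar classes, and the helper names, the sample file name and the choice of Shift_JIS are all assumptions made for illustration.

```python
import io
import tarfile


def make_ustar(name: str, encoding: str) -> bytes:
    """Build an in-memory ustar archive whose header stores `name` in `encoding`."""
    buf = io.BytesIO()
    with tarfile.open(fileobj=buf, mode="w",
                      format=tarfile.USTAR_FORMAT, encoding=encoding) as tar:
        data = b"hello"
        info = tarfile.TarInfo(name)
        info.size = len(data)
        tar.addfile(info, io.BytesIO(data))
    return buf.getvalue()


def list_names(archive: bytes, encoding: str) -> list:
    """List member names using the filename encoding chosen by the user."""
    with tarfile.open(fileobj=io.BytesIO(archive), mode="r",
                      encoding=encoding) as tar:
        return tar.getnames()


# Simulates an archive written on a system whose locale is not UTF-8.
archive = make_ustar("日本語.txt", "shift_jis")

print(list_names(archive, "shift_jis") == ["日本語.txt"])  # matching choice
print(list_names(archive, "utf-8") == ["日本語.txt"])      # mismatched choice
```

Nothing in the archive records which encoding was used, so the reader can only get the name right when the user supplies the same encoding the writer used.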
Comment 17 Michael Van Meekeren CLA 2005-06-06 09:32:22 EDT
Steve, this is being deferred, please respond if there are any issues.
Comment 18 Steven Wasleski CLA 2005-06-08 09:29:02 EDT
Since the original eclipse round trip bug is fixed and the way to fix this for 
some externally created files is not fully understood, I think it is fine to 
defer the rest of the work.
Comment 19 Karice McIntyre CLA 2006-07-05 17:31:11 EDT
Changing summary to reflect remaining issue.  Also see bug 84756.
Comment 20 Susan McCourt CLA 2009-07-15 12:11:57 EDT
"As per http://wiki.eclipse.org/Platform_UI/Bug_Triage_Change_2009"
Comment 21 Eclipse Webmaster CLA 2019-09-06 15:36:56 EDT
This bug hasn't had any activity in quite some time. Maybe the problem got resolved, was a duplicate of something else, or became less pressing for some reason - or maybe it's still relevant but just hasn't been looked at yet.

If you have further information on the current state of the bug, please add it. The information can be, for example, that the problem still occurs, that you still want the feature, that more information is needed, or that the bug is (for whatever reason) no longer relevant.