Re: [stellation-res] Stellation status

On Mon, 2002-11-25 at 11:03, Mark C. Chu-Carroll wrote:

> > > My first question, before we get into any more detail, is how far
> > > does the internationalization need to go? There's a bunch of
> > > options:
> > > 
> > > (1) Be able to deal with international data. That is, make sure that
> > >   all artifacts, comments, labels, names, etc., all work even in
> > >   non-latin character sets. Since Java is working in UTF-8, as long as
> > >   the databases can handle UTF-8 characters correctly, we're already
> > >   there. But this requires testing to be sure in all the DBs - do you
> > >   have any good UTF-8 non-latin data that you can give us for testing?
> > > 
> > > (2) Everything in (1) plus have all system messages be in translatable
> > >   resources. Java and Eclipse both provide a lot of support for this,
> > >   but it's still difficult. 
> > > 
> > > (3) Everything in (2) plus have all command names and options also be
> > >   in translatable resources.
> > > 
> > > My opinion is that right now, (3) is probably not doable. Jonathan's
> > > experiments with pulling all string literals out into translatable
> > > resources showed what was involved. I think coping with it is just too
> > > much complexity for us right now. 
> > 
> > I needed (1) working yesterday, and I will need (2) in the near future.
> > 
> > My plan is to use Stellation as a text repository with version control 
> > for storing XML documents in many languages. Correct handling of Asian
> > languages is a must. I'm trying to standardize on UTF-8 encoding, but
> > anything can show up.
> 
> This is potentially a *very* serious problem, which is going to be
> extremely hard for us to address. We're using Java IO, and Java IO
> assumes that everything is encoded in UTF-8. If you push non-UTF8 text
> containing 8 bit character data through Java IO, 
> the misinterpretations can cause data loss and/or corruption.

I know how serious the problem is. I spent the last two months here in
Thailand fixing code that only worked with Latin characters, and I had to
add full UTF-8/Unicode support to several tools. I can do the same with
Stellation, and at least I will be able to contact the original
developers.


> 
> The only workaround with the current code is to use binary artifacts,
> which use binary IO, and so avoid the Java UTF-8 assumptions. But binary
> artifact storage is currently very inefficient, and doesn't support
> merges. 
> 
> Since I don't know much about the character encodings in use, I'm not
> sure of exactly what will work. *If* you can easily identify 
> line breaks, *and* you can configure the database so that it will accept
> and correctly store strings in your character encodings, then you can
> probably create a variant of the current TextArtifact/TextArtifactAgent
> classes that will use binary IO, and work with byte[], instead of
> String. 
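
The byte[] idea quoted above can be sketched roughly as follows. This is
only an illustration, not Stellation code: the class name ByteLines and
the method splitLines are hypothetical, and it assumes an encoding in
which the byte 0x0A always means a line feed. That holds for UTF-8 and
for single-byte ASCII-compatible encodings such as TIS-620, but not for
UTF-16.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical helper, not part of Stellation.
public class ByteLines {
    // Splits raw bytes into lines at LF (0x0A) without decoding them.
    // Each line keeps its terminator, so joining the pieces in order
    // reproduces the input exactly.
    // Assumption: 0x0A never occurs inside a multi-byte character
    // (true for UTF-8, false for UTF-16).
    public static List<byte[]> splitLines(byte[] data) {
        List<byte[]> lines = new ArrayList<>();
        int start = 0;
        for (int i = 0; i < data.length; i++) {
            if (data[i] == '\n') {
                lines.add(Arrays.copyOfRange(data, start, i + 1));
                start = i + 1;
            }
        }
        // Trailing bytes after the last LF form a final, unterminated line.
        if (start < data.length) {
            lines.add(Arrays.copyOfRange(data, start, data.length));
        }
        return lines;
    }
}
```

Because nothing is decoded, text in a mix of encodings passes through
unchanged, and the line-based pieces remain available for diff and merge.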

Don't worry too much. I had enough headaches learning that the Java
String class does not really support Unicode and UTF-8 the way the
Javadocs say it does. Even reading a Chinese String encoded as UTF-8 and
writing it back to disk with the standard functions is not as easy as it
should be.
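
As an illustration of where this usually goes wrong: readers and writers
created without an explicit charset fall back to the platform default
encoding (on JDKs before 18), so naming UTF-8 at every IO boundary avoids
the problem. A minimal sketch, assuming a JDK with java.nio.file (Java 7
or later); the class name Utf8RoundTrip and its methods are hypothetical
and not part of Stellation.

```java
import java.io.IOException;
import java.io.Writer;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;

// Hypothetical helper, not part of Stellation.
public class Utf8RoundTrip {
    // Writes text as UTF-8 regardless of the platform default encoding.
    public static void writeUtf8(Path file, String text) throws IOException {
        try (Writer out = Files.newBufferedWriter(file, StandardCharsets.UTF_8)) {
            out.write(text);
        }
    }

    // Reads the whole file and decodes its bytes as UTF-8.
    public static String readUtf8(Path file) throws IOException {
        return new String(Files.readAllBytes(file), StandardCharsets.UTF_8);
    }

    public static void main(String[] args) throws IOException {
        Path file = Files.createTempFile("utf8-demo", ".txt");
        String original = "\u4e2d\u6587\u6587\u672c"; // four Chinese characters
        writeUtf8(file, original);
        // Reports whether the text survived the round trip unchanged.
        System.out.println(original.equals(readUtf8(file)));
        Files.delete(file);
    }
}
```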

Line breaks are another serious problem in some languages. Thai, for
example, does not put white space between words, and it has no
punctuation marks or anything else that you can use to break a sentence.
However, almost all the problems I faced with Asian languages had a
solution.
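
One such solution, for what it is worth, is the JDK's own
java.text.BreakIterator, which can propose break positions for a given
locale. The sketch below is hedged: the class name ThaiSegments and the
method segments are hypothetical, and how good the Thai breaks are
depends on the locale data shipped with the JDK in use.

```java
import java.text.BreakIterator;
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

// Hypothetical helper, not part of Stellation.
public class ThaiSegments {
    // Cuts text at the line-break opportunities that BreakIterator
    // reports for the given locale. Joining the returned pieces in
    // order reproduces the input.
    public static List<String> segments(String text, Locale locale) {
        BreakIterator it = BreakIterator.getLineInstance(locale);
        it.setText(text);
        List<String> parts = new ArrayList<>();
        int start = it.first();
        for (int end = it.next(); end != BreakIterator.DONE; start = end, end = it.next()) {
            parts.add(text.substring(start, end));
        }
        return parts;
    }

    public static void main(String[] args) {
        // A short Thai phrase written without spaces, as escapes.
        String thai = "\u0e20\u0e32\u0e29\u0e32\u0e44\u0e17\u0e22";
        for (String part : segments(thai, Locale.forLanguageTag("th"))) {
            System.out.println(part);
        }
    }
}
```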


> > > (2) we could do if it was really necessary, but we'd need you to really
> > > dig in and help.
> > 
> > I will help with it.
> 
> Good. Make sure to grab a very recent copy of Eclipse. The latest
> builds have better refactoring support, and I've heard that one of
> the refactorings is an improved way of pulling inline String constants
> out.

I wish I could. I have a slow ISDN connection that drops from time to
time, and there's nothing I can do about it; communications are pretty
bad in Thailand.

In a few days I will be in Singapore, where I will have better Internet
access.

Rodolfo

-- 
Rodolfo M.Raya <rmraya@xxxxxxxxxxxxxxx>
Maxprograms


