255700 – [import] If import string == English string, don't import

Status:	RESOLVED FIXED

Alias:	None

Product:	Babel
Classification:	Technology
Component:	Server
Version:	unspecified
Hardware:	PC Linux

Importance:	P3 normal
Target Milestone:	---
Assignee:	Babel server inbox
QA Contact:

URL:
Whiteboard:
Keywords:

Depends on:
Blocks:	254964

Reported:	2008-11-18 14:42 EST by Denis Roy
Modified:	2008-11-20 13:55 EST
CC List:	3 users

See Also:

Attachments
Patch v1 (1.46 KB, patch) 2008-11-18 15:02 EST, Denis Roy	no flags
Additional Patch (4.16 KB, patch) 2008-11-19 14:37 EST, Denis Roy	no flags

Description Denis Roy

2008-11-18 14:42:48 EST

cast=(Window^),flags=gcobject -> cast=(Window^),flags=gcobject [de: OS_Window_1WindowStyle__I_0] eclipse 3.4

When running an import, we should ignore incoming strings that are identical to the English one.  First, the 'translation' is useless; second, it makes the database (and subsequently the language pack) much bigger.

The specific culprit here is SWT -- they have millions of externalized strings that really aren't translatable.

The downside of doing this is that, after running the import, the %complete statistics won't show 100%, because Babel will still think these strings need to be translated.

Marking this as a blocker to the Adobe contribution, bug 254964.
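
A minimal sketch of the check proposed in this description, written in Python purely for illustration. The function name `filter_import` and the dictionary shapes are assumptions, not taken from the Babel import script or from the attached patches. It drops every incoming translation whose value is identical to the English string for the same key:

```python
def filter_import(english, incoming):
    """Return only the incoming translations that differ from English.

    english  -- dict: string key -> English value (hypothetical shape)
    incoming -- dict: string key -> translated value for one language
    """
    kept = {}
    for key, translated in incoming.items():
        # An incoming value identical to the English value carries no
        # information, so it is skipped rather than stored.
        if translated == english.get(key):
            continue
        kept[key] = translated
    return kept


# Illustrative data only; the second entry mirrors the SWT-style string above.
english = {"open": "Open File", "style": "cast=(Window^),flags=gcobject"}
german = {"open": "Datei öffnen", "style": "cast=(Window^),flags=gcobject"}
print(filter_import(english, german))
```

With this filter only the `open` entry would be stored; the `style` entry would stay untranslated in the database, which is why the %complete concern above arises.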

Comment 1 Denis Roy

2008-11-18 15:02:19 EST

Created attachment 118180
Patch v1

Here's a quick patch for this.

Kit, there's nothing wrong with leaving the English string as-is right?

Comment 2 Denis Roy

2008-11-18 15:03:15 EST

Comment on attachment 118180
Patch v1

Flagging as a patch.

FWIW, I examined some of the SWT strings, and as it turns out, the culprit .properties files were deleted from CVS.

Comment 3 Kit Lo

2008-11-19 01:17:46 EST

>Kit, there's nothing wrong with leaving the English string as-is right?

Leaving the translation in English serves one purpose: it means a translator has actually looked at this string and determined that it should not be translated.

If we find any strings like this, technically, we (either the translators or the developers) should go back to the English file and mark the strings non-translatable. But, like you said, there are too many of them. It's not easy to mark them all individually. We may have to wait for the PDE enhancement I mentioned in bug 217263 to help us mark certain files or directories as non-translatable.

Talking about "translatable" strings, I think the import script is not checking to see if the string we are importing is translatable or not.

I think we've fixed the stats not to count non-translatable strings. If we don't import non-translatable strings, the stats should still show 100%.

Comment 4 Hendrik Maryns

2008-11-19 05:32:01 EST

I think this, together with the issues with Eclipse Genie, implies that we need a more elaborate translation system.  There needs to be something like a ‘suggested translation’, which a reviewer can then accept with one click.  I would propose all imported and syncup translations are marked like this per default (think of the ‘fuzzy’ feature of gettext).  These strings could still be used when creating langpacks, but we would need a separate statistic saying how many of those fuzzy strings are still around.

A similar mechanism is needed for marking strings as non-translatable, so that the syncup script could do this, and the same could be done for the subject of this bug: strings that are the same as in English.  But notice that in a lot of languages ‘OK’ will be translated with ‘OK’, so you’d have to check whether the translations are the same *in several languages* before you mark a string as non-translatable.  And then this should still be a ‘fuzzy’ non-translatable which, again, a reviewer could accept with one click or reject and correct.

As a consequence, we’d need an extension of recent.php which allows one to see only fuzzy strings.

I think it really is a good idea to look at gettext and mediawiki and other translation systems.  They are that complicated for a reason: years of translation experience have gone into those systems (at least gettext).  Learn from the experience of others.
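
The "same in several languages" check suggested in this comment could be sketched as follows. This is an illustration only: the function name, the data shapes, and the `min_languages` threshold are assumptions, since the comment does not say how many languages would be enough. The result is a list of candidates for a reviewer to accept or reject, not a final decision.

```python
def suggest_non_translatable(english, translations, min_languages=3):
    """Return keys whose translation equals the English value in at
    least min_languages languages (hypothetical threshold).

    english      -- dict: string key -> English value
    translations -- dict: language code -> dict of key -> translated value
    """
    candidates = []
    for key, source in english.items():
        # Count the languages whose stored value is identical to English.
        same = sum(
            1 for values in translations.values() if values.get(key) == source
        )
        if same >= min_languages:
            candidates.append(key)
    return candidates


# Illustrative data: 'OK' matches English in two languages only,
# while the technical string matches in all three.
english = {"ok": "OK", "style": "flags=gcobject"}
translations = {
    "de": {"ok": "OK", "style": "flags=gcobject"},
    "fr": {"ok": "OK", "style": "flags=gcobject"},
    "ja": {"ok": "了解", "style": "flags=gcobject"},
}
print(suggest_non_translatable(english, translations))
```

Here only `style` is suggested, because `ok` is identical to English in just two of the three languages.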

Comment 5 Hendrik Maryns

2008-11-19 05:58:12 EST

See also bug 254847 (I think this one should depend on it, but I seem not to be able to set such dependencies) and bug 254418 for other issues with Eclipse Genie translations.

Comment 6 Denis Roy

2008-11-19 09:49:22 EST

(In reply to comment #4)
> I would
> propose all imported and syncup translations are marked like this per default
> (think of the ‘fuzzy’

Agreed, but that is off-topic for this bug. I have opened bug 255798 for this.

Comment 7 Denis Roy

2008-11-19 14:37:52 EST

Created attachment 118303
Additional Patch

As a result of some discussion on the mailing list re: import overwriting values that may be good, here is a patch for the import routine.

I'm basing this on the new 'import policy' [1]:
-- Determine the accuracy of the contribution.
   1. Reviewed: Translations were done by professionals, and were reviewed and tested in context by loading them up in Eclipse
   2. Fuzzy: Translations were not done by professional translators; translations were done using software and dictionaries; translations were done by professional translators, but were not reviewed and tested in context.
   3. If unsure, ask the contributor. If unsure, choose Fuzzy. 

[1] http://wiki.eclipse.org/Babel_/_Large_Contribution_Import_Process

When the Fuzzy factor is set in the import script:
  - incoming translations that overwrite non-fuzzy translations will be flagged as fuzzy automatically.
  - incoming translations that overwrite fuzzy translations will be flagged as non-fuzzy automatically.
  - incoming translations that are version 1 of a string's translation will not be flagged as fuzzy.
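
The three rules above can be written as a small decision function. This is a sketch in Python for illustration, not the code in the attached patch; the function name and the use of `None` for "no existing translation" are assumptions. It covers only the case where the Fuzzy factor is set, because the comment does not describe the other case.

```python
from typing import Optional


def flag_incoming_as_fuzzy(existing_is_fuzzy: Optional[bool]) -> bool:
    """Decide the fuzzy flag for an incoming translation when the
    Fuzzy factor is set on the import.

    existing_is_fuzzy -- None if the string has no translation yet,
                         otherwise the fuzzy flag of the translation
                         that is about to be overwritten.
    """
    if existing_is_fuzzy is None:
        # Version 1 of a translation is never flagged.
        return False
    # Overwriting a non-fuzzy translation -> flagged as fuzzy.
    # Overwriting a fuzzy translation     -> flagged as non-fuzzy.
    return not existing_is_fuzzy


for state in (None, False, True):
    print(state, "->", flag_incoming_as_fuzzy(state))
```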

I haven't done any work on syncup because we're not ready to run it again, but I have the distinct feeling that syncup will flag all its translations as fuzzy.

This patch is absolutely not tested, but I will stage it and re-import the Adobe contribution to test.


(In reply to comment #3)
> If we find any strings like this, technically, we (either we the translators or
> the developers) should go back to the English file and mark the strings
> non-translatable.

This is likely easier to do with syncup. Hard to do while importing.

> Talking about "translatable" strings, I think the import script is not checking
> to see if the string we are importing is translatable or not.

Good point.  I have addressed that with the patch.

Comment 8 Antoine Toulmé

2008-11-19 14:41:44 EST

(In reply to comment #7)
The patch looks good. I could not test it, but I trust you will know very quickly how it performs with the test on the staging machine.

Comment 9 Denis Roy

2008-11-19 17:02:00 EST

It seems to be working as expected on staging.  I re-loaded the Adobe contribution, and it looks good.

I'll keep examining the fuzzy translations to make sure they were correctly flagged.

mysql> select l.language_id, l.name, count(1) from translations as t inner join languages as l on l.language_id = t.language_id where t.userid = 33696 group by l.language_id, l.name;
+-------------+----------+----------+
| language_id | name     | count(1) |
+-------------+----------+----------+
|           2 | French   |    11229 |
|           4 | German   |    10978 |
|           8 | Japanese |    10761 |
|           9 | Korean   |     8565 |
|          11 | Chinese  |     7908 |
|          12 | Chinese  |     8632 |
+-------------+----------+----------+
6 rows in set (0.26 sec)


mysql> select l.language_id, l.name, count(1) from translations as t inner join languages as l on l.language_id = t.language_id where t.userid = 33696 and possibly_incorrect group by l.language_id, l.name;
+-------------+----------+----------+
| language_id | name     | count(1) |
+-------------+----------+----------+
|           2 | French   |     6997 |
|           4 | German   |     6592 |
|           8 | Japanese |     6848 |
|           9 | Korean   |     2362 |
|          11 | Chinese  |     3758 |
|          12 | Chinese  |     2503 |
+-------------+----------+----------+
6 rows in set (0.41 sec)
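
To put the two result sets side by side, the share of this contributor's translations flagged as possibly incorrect can be computed per language. The snippet below is an illustrative Python sketch that copies the counts from the query output above and keys them by `language_id`, since two rows share the name "Chinese"; the variable names are assumptions.

```python
# Counts copied from the two query results above, keyed by language_id.
total = {2: 11229, 4: 10978, 8: 10761, 9: 8565, 11: 7908, 12: 8632}
flagged = {2: 6997, 4: 6592, 8: 6848, 9: 2362, 11: 3758, 12: 2503}

# Percentage of imported translations with possibly_incorrect set.
share = {lang: 100.0 * flagged[lang] / total[lang] for lang in total}

for lang in sorted(share):
    print(f"language_id {lang}: {share[lang]:.1f}% flagged")
```

The shares range from roughly 28% (Korean) to roughly 64% (Japanese).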

Comment 10 Denis Roy

2008-11-20 13:55:35 EST

This is fixed > R_0_200811201351