Bug 562884 - Cleanup Babel translation database
Summary: Cleanup Babel translation database
Status: NEW
Alias: None
Product: Babel
Classification: Technology
Component: Server
Version: unspecified
Hardware: All All
Importance: P3 normal
Target Milestone: ---
Assignee: Denis Roy CLA
QA Contact:
URL:
Whiteboard: stalebug
Keywords:
Depends on:
Blocks:
 
Reported: 2020-05-06 08:50 EDT by Kit Lo CLA
Modified: 2024-04-17 16:23 EDT

Description Kit Lo CLA 2020-05-06 08:50:26 EDT
The translations table has 30 million rows and occupies 11 GB of disk space. The earliest translations date from 12 years ago (2008-01-30), against Eclipse 3.4.

Since translations are copied from one version to the next, deleting those old translations would have no impact on the most recent branches.

MariaDB [babel]> select count(1) from translations where created_on < "2018-01-01";
+----------+
| count(1) |
+----------+
| 24496275 |
+----------+

A database cleanup could also speed up the server's queries.
Comment 1 Denis Roy CLA 2020-05-06 09:24:42 EDT
I'm executing this:
delete from translations where created_on < "2016-01-01";

I think keeping 4 years of translations is better than 2 years, for those projects that have longer release cycles.

Should we consider pruning old project versions?
Comment 2 Kit Lo CLA 2020-05-06 09:32:25 EDT
>Should we consider pruning old project versions?
Yes, for example we probably do not reference Eclipse 3.x projects anymore.
Comment 3 Andrew Johnson CLA 2020-05-06 09:35:46 EDT
Eclipse Memory Analyzer has several versions defined in Babel:

    1.10    (37.2%)
    1.8     (40.2%)
    1.3     (39.2%)
    1.2     (41.7%)
    1.1     (42.7%)
    1.0     (42.8%)

but definitions exist for:
1.10 2019-12
1.9 2019-06
1.8 photon
1.7 oxygen
1.6 neon
1.5 mars
1.4 luna
1.3 kepler
1.2 juno
1.1 indigo 
1.0 helios
0.8

The more interesting translations for us are the current version (1.10) and perhaps the previous (1.9).

I may need the 1.3 or 1.4 translations too, for the following reason. Eclipse Memory Analyzer 1.4 was distributed by IBM with IBM translations of the messages into Japanese and Chinese (Simplified). As the English messages were under the EPL, the translations should be too (though I'll check with IBM). Therefore, if I could import those messages into Babel at, say, the MAT 1.4 definition, Babel Syncup might then copy those translations into matching messages at the latest level. I don't know how Syncup works if there is already a translation at, say, the 1.0 level: is Syncup a one-time occurrence that applies only if there is no translation at the later level, or does the latest independent previous translation win?

I would therefore like to keep the old definitions and translations for MAT in case I can get my idea to work.
Comment 4 Andrew Johnson CLA 2020-05-06 09:43:39 EDT
Also, are we losing all the translation credits for the original translations?

Most of the recent translations for Eclipse Memory Analyzer are by 'Babel Syncup'. If I go back to previous versions I see 'Satoru Yoshida' translated some of the messages originally.

I think losing the top translators from the leader board will be sad and will discourage them.

Also, it might be nice to find the original source of the translation.
Comment 5 Kit Lo CLA 2020-05-06 09:44:45 EDT
Denis, while you are designing this, can you come up with some scripts that I can run periodically (for example once a year) to perform the cleanup?
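
A periodic cleanup of the kind requested here might look like the following minimal sketch. It is not the script used on the Babel server: it runs against an in-memory SQLite table as a stand-in for MariaDB, and the function names, the batch size, and the rowid-based batching are assumptions made for illustration. Only the table name translations, the column created_on, and the four-year retention window come from this bug. Deleting in batches keeps each transaction small; on MariaDB the same effect could likely be had with a repeated DELETE ... LIMIT statement.

```python
import sqlite3
from datetime import date


def cutoff_for(today, keep_years=4):
    """Return the ISO date of January 1st, keep_years before today's year."""
    return date(today.year - keep_years, 1, 1).isoformat()


def prune_translations(conn, cutoff, batch_size=10000):
    """Delete rows created before cutoff, batch by batch; return rows removed.

    Assumes created_on holds ISO dates (YYYY-MM-DD), so string comparison
    matches chronological order.
    """
    total = 0
    while True:
        cur = conn.execute(
            "DELETE FROM translations WHERE rowid IN ("
            "SELECT rowid FROM translations WHERE created_on < ? LIMIT ?)",
            (cutoff, batch_size),
        )
        conn.commit()  # commit each batch so no single transaction grows large
        if cur.rowcount == 0:
            break
        total += cur.rowcount
    return total
```

Run once a year, prune_translations(conn, cutoff_for(date.today())) would keep roughly the last four years of rows. For a run in 2020 the computed cutoff is 2016-01-01, the same date used in comment 1.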
Comment 6 Denis Roy CLA 2020-05-06 10:13:38 EDT
Syncup works on the current "train". In this case, 2020-03. It looks for exact English string matches across projects that have been translated, and applies those translations.
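
The exact-match behaviour described above can be illustrated with a small sketch. This is not Babel's Syncup code: the data model (pairs of string id and English text, plus a dictionary of existing translations for one language) and the function name are hypothetical, the restriction to the current train is left out, and the rule that the first translation seen wins is an assumption, since this bug does not say how conflicts are resolved.

```python
def syncup(strings, translations):
    """Propose translations for untranslated strings by exact English match.

    strings: list of (string_id, english_text) pairs (hypothetical model).
    translations: dict mapping string_id to translated text for one language.
    Returns a dict of string_id -> proposed translation.
    """
    # Index already-translated English texts; the first one seen wins
    # (an assumption, not confirmed behaviour).
    by_text = {}
    for string_id, text in strings:
        if string_id in translations:
            by_text.setdefault(text, translations[string_id])

    # Apply a translation only where the English text matches exactly.
    proposed = {}
    for string_id, text in strings:
        if string_id not in translations and text in by_text:
            proposed[string_id] = by_text[text]
    return proposed
```

Because the match is exact, a string that differs only in case or punctuation from a translated one would not receive a translation in this sketch.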

> I think losing the top translators from the leader board will be sad and
> will discourage them.

The leader board only takes into account the last 18 months, and excludes Syncup. Unfortunately, as Babel copies all English strings from one version to the next, it's difficult (not impossible) to know who contributed a particular translation for a string.

Since everyone's cycles are low, I was not planning on doing anything beyond a plain DELETE.
Comment 7 Eclipse Genie CLA 2022-04-27 09:30:55 EDT
This bug hasn't had any activity in quite some time. Maybe the problem got resolved, was a duplicate of something else, or became less pressing for some reason - or maybe it's still relevant but just hasn't been looked at yet.

If you have further information on the current state of the bug, please add it. The information can be, for example, that the problem still occurs, that you still want the feature, that more information is needed, or that the bug is (for whatever reason) no longer relevant.

--
The automated Eclipse Genie.