409427 – [model] replace xml persitence with binary

Bug 409427 - [model] replace xml persitence with binary

Summary: [model] replace xml persitence with binary

Status:	RESOLVED FIXED

Alias:	None

Product:	z_Archived
Classification:	Eclipse Foundation
Component:	Mylyn (show other bugs)
Version:	unspecified
Hardware:	All All

Importance:	P3 enhancement (vote)
Target Milestone:	2.0.1
Assignee:	Miles Parker
QA Contact:

URL:
Whiteboard:
Keywords:

Depends on:
Blocks:	410520
	Show dependency tree

Reported:	2013-05-29 16:02 EDT by Miles Parker
Modified:	2013-06-19 14:08 EDT (History)
CC List:	4 users (show)

See Also:

Attachments
Add an attachment (proposed patch, testcase, etc.)

Note You need to log in before you can comment on or make changes to this bug.

Description Miles Parker

2013-05-29 16:02:51 EDT

For release, we should rpelace the existing xml format with the much denser and faster loading EMF BinaryResource implementation. This is a pretty straightforward swap, but we'll have to manage replacement of existing artifacts.

Comment 1 Steffen Pingel

2013-06-03 09:16:12 EDT

Do we have indication that using XML has a noticeable performance impact? If not I would suggest to postpone this change since there is some risk with using a different format particularly concerning handling of existing data and we are getting close to RC3.

Comment 2 Miles Parker

2013-06-03 12:50:22 EDT

Yes, from my own experience, I've seen vast improvements in load time performance (1) which is critical here, as well as file size, which is also important as there many files that we'll be storing in user config and in worst case those files can be multiple MBs each. The real problem w/ xml load size is that it is actually super-linear.

1. http://milesparker.blogspot.ca/2011/01/supporting-multiple-resource-types-with.html
2. http://www.slideshare.net/kenn.hussey/performance-and-extensibility-with-emf  slide 34

But yes, it does introduce additional risk as well. I'm not wanting to touch this until we get bug 409343 resolved and by then it may be too late to make a significant change like this for 2.0. OTOH, you could argue that it is better to get a risky change like this out of the way rather than leave users with a lot of files that need to be deleted after release. The real risk seems to come in when we're messing around with the configuration state on people's machines, so there is some argument for getting it out of the way now.

Comment 3 Miles Parker

2013-06-10 20:11:56 EDT

Reschedule for service release. I think this is a key change, but it doesn't make any sense to introduce such a large change right before RC4.

Comment 4 Sam Davis

2013-06-12 17:12:57 EDT

Wouldn't it make sense to store the patch sets in the local git repo instead?

Comment 5 Miles Parker

2013-06-12 17:56:16 EDT

(In reply to comment #4)
> Wouldn't it make sense to store the patch sets in the local git repo instead?

This isn't really germane to this bug as we'd make this change either way, but..

Yes, something like that was under discussion with R4E team. In fact, we've already changed the model to support access to org.eclipse.team.core.history.IFileRevision for just this reason as well as for better integration with IDE. But note that storing it this might not be a good general solution and would require a good deal of additional engineering, while the immediate benefits of doing that aren't really clear. (Note we can't assume that all backing repos -- in contrast to a git repos specifically for this purpose -- will be git, as Reviews is intended to be Review system agnostic.

Comment 6 Miles Parker

2013-06-12 19:18:20 EDT

https://git.eclipse.org/r/13782

I did some quick memory and performance benchmarks. (Ed, cc'ing you bcause I thought you might be interested in these..)

For a repos with only one set of Synced patch sets, the results aren't that impressive. 15.7MB binary v. 20.6 MB xml. This is because the initial files don't have patch sets and both have overhead for the files themselves.

For very large Reviews, the advantage is more clear.  5.6 MB binary vs. 19.9 MB xml. More important for users are load times for a large review: 334ms vs. 934ms.

The Binary serialization should also be more robust to encoding issues, corruption, etc..

Comment 7 Ed Merks

2013-06-13 00:08:21 EDT

Miles, depending on the type of content, you might get further improvements from BinaryResourceImpl.OPTION_STYLE_DATA_CONVERTER.  I.e., this allows you to define binary data converters in your model's factory for values of data type that can be more compactly stored than via their string representation.  If if you don't have your own data converters, you will likely still benefit from this option because even strings themselves will be decomposed at delimiters and will be tabulated so repeated substrings will be serialized just once.

Comment 8 Miles Parker

2013-06-13 13:35:27 EDT

(In reply to comment #7)
> Miles, depending on the type of content, you might get further improvements from
> BinaryResourceImpl.OPTION_STYLE_DATA_CONVERTER.  I.e., this allows you to define
> binary data converters in your model's factory for values of data type that can
> be more compactly stored than via their string representation.  If if you don't
> have your own data converters, you will likely still benefit from this option
> because even strings themselves will be decomposed at delimiters and will be
> tabulated so repeated substrings will be serialized just once.

Thanks, Ed, that's cool! Unfortunately, we're not on 2.9 yet, so we can't do that, but downstream I could see doing that with the code.

What I'm not clear on is why strings themselves aren't stored in a more compact binary format. It looks like from API that this option could handle some kind of compression algorithm pretty cleanly, but maybe you wanted to avoid that overhead OOTB.

Comment 9 Miles Parker

2013-06-13 13:40:48 EDT

Tomek wonders about the XML data that is left when we switch to binary. I'd thought about deleting this, but I'm always a little leery of using File.delete and a raw OS path on someone else's computer in the same statement. Just paranoid, I guess. ;) Plus we have the issue of needing to check this at startup for at least SR1. OTOH, we will be leaving 10-100MB of dross per workspace around, so it's probably worth doing..?

Comment 10 Sam Davis

2013-06-13 14:07:58 EDT

I don't think users should be storing their own data in that location and expecting it not to be deleted, so I think we should clean it up.

Comment 11 Miles Parker

2013-06-13 14:13:04 EDT

(In reply to comment #10)
> I don't think users should be storing their own data in that location and
> expecting it not to be deleted, so I think we should clean it up.

Right, I was just justifying my laziness by making a very weak claim to risk-aversion. ;) I'll add that to review; I'm not sure where to put the check though. Perhaps when we get a client for a particular repository? That would mean that we'd still leave the data around for inactive repositories, but see also bug 408475.

Comment 12 Steffen Pingel

2013-06-13 16:48:51 EDT

Cleaning up makes sense particularly if it's just cached data. Miles, please also note the old locations on bug 308733: implement clean up to remove legacy files and settings. I remember that we already migrated away from a "model/reviews" directory I believe.

Comment 13 Miles Parker

2013-06-13 17:01:59 EDT

(In reply to comment #12)
> Cleaning up makes sense particularly if it's just cached data. Miles, please
> also note the old locations on bug 308733: implement clean up to remove legacy
> files and settings. I remember that we already migrated away from a
> "model/reviews" directory I believe.

Yep, that's already part of the change I'll be committing RSN.

Comment 14 Ed Merks

2013-06-14 02:57:10 EDT

Given that EMF 2.9 can be installed in Eclipse 3.5, and is part of the 4.3 Eclipse platform, and given that it makes significant footprint and performance improvements to the URI implementation as well as to the binary resource implementation, there don't appear to be compelling reasons not to use it.

Note that instead of using the literal for the new option, you could use your own copy of the string, and in that case the option will be recognized if the user is using EMF 2.9 and ignored otherwise. Of course the serializations based on that option can't be read with an older version of EMF...

Strings are already stored in a form that typically takes one byte per character so they are relatively compact, the best way to save space is to not store so many of the same strings (or substrings).  Of course one could try to compress the strings themselves using a zip-like algorithm, but even a UTF encoding is relatively time consuming (and isn't typically more compact than what we do now). Also, a binary resource implementation isn't just designed to be more compact, but also to load and save faster, so the complexity of the algorithm for processing the bytes has a significant impact on that performance.

In terms of cleaning up, you should be able to use the URI converter's delete if you know exactly the URI of the old resource.  I agree that it really ought to be cleaned up.

Comment 15 Miles Parker

2013-06-14 13:52:28 EDT

Please see https://git.eclipse.org/r/#/c/13782/4 for binary file cleanup. (That's quite simple as we just need to delete the entire directory, but I took the opportunity to clean up some of the tests as well, so it looks like a much bigger change then it really is.)

(In reply to comment #14)
> Given that EMF 2.9 can be installed in Eclipse 3.5, and is part of the 4.3
> Eclipse platform, and given that it makes significant footprint and performance
> improvements to the URI implementation as well as to the binary resource
> implementation, there don't appear to be compelling reasons not to use it.

There "shouldn't" be but Steffen had some concerns about build chain that are frankly a bit above my head. :) See bug 407104 comment 3. So it is good to have one more data point for updating to 2.9 across the board. It does seem like we should be able to just move everything to 2.9, but there may be an issue I'm missing there.

> Note that instead of using the literal for the new option, you could use your
> own copy of the string, and in that case the option will be recognized if the
> user is using EMF 2.9 and ignored otherwise. Of course the serializations based
> on that option can't be read with an older version of EMF...
> 
> Strings are already stored in a form that typically takes one byte per character
> so they are relatively compact, the best way to save space is to not store so
> many of the same strings (or substrings).  Of course one could try to compress

Most of our data is large text blocks, which aren't likely to be identical. Since this is code, and the way to really save space is with diffs, we will probably end up putting it in a locally managed git repos anyway, but that's a long-term job.

> the strings themselves using a zip-like algorithm, but even a UTF encoding is
> relatively time consuming (and isn't typically more compact than what we do
> now). Also, a binary resource implementation isn't just designed to be more
> compact, but also to load and save faster, so the complexity of the algorithm
> for processing the bytes has a significant impact on that performance.

Got it. Yeah I wouldn't think it made sense to do the compression OOTB -- w/ memory so cheap it rarely makes sense to compress just for a constant improvement in size. I was niavly thinking there might be a simple way to compress all of the text w/ some kind of Lempel-Ziv variant. I think that's basically what you're doing but at field level granulatiry. Perhpas it's worth considering an option for more aggressive compression so people are able to explicitly decide where they want to be on the size/performance tradeoff.

> In terms of cleaning up, you should be able to use the URI converter's delete if
> you know exactly the URI of the old resource.  I agree that it really ought to
> be cleaned up.

Yeah, see above, we don't even have to go that far. When I said I was being lazy, I really meant I was being lazy :# -- it's in essence 8 lines of code.

Comment 16 Ed Merks

2013-06-15 06:03:24 EDT

Note that the new option will break a large string into substrings at delimiters, i.e., at non-letter/non-digit characters.  The string table is prepopulated with the single character delimiter strings (and with single digit strings because strings like "0" and "1" are so common, so that such strings can be written as a single byte (a compressed int ID), i.e., no bigger than their one-byte character representation within a longer string.  The string table is then further populated with all such substrings, and those have a much higher chance of being duplicates (i.e., many uses of the same identifier/word in any language-based text) and hence a much higher chance of being compressed.

Comment 17 Miles Parker

2013-06-17 15:34:43 EDT

Tired an experiment with the new option. Didn't see an appreciable difference for most reviews. The only exception is *very* large reviews, with many patch sets. 17.5 MB v. 13 MB --  probably wasn't enough to justify using on it's own.

It would be nice to see these eventually support sequences across objects, as we have a *lot* of boilerplate e.g "Build Unstable"..

Comment 18 Miles Parker

2013-06-17 16:36:37 EDT

I'm ready to merge https://git.eclipse.org/r/#/c/13782/5 but it's a big enough change that I'd like to see a sanity check/verification from someone else..

Comment 19 Miles Parker

2013-06-17 19:13:04 EDT

Merged  https://git.eclipse.org/r/#/c/13782

Comment 20 Miles Parker

2013-06-17 20:32:46 EDT

This change may have caused an integration test failure.  See https://hudson.eclipse.org/hudson/job/mylyn-reviews-nightly/2279/...

Comment 21 Ed Merks

2013-06-18 01:59:34 EDT

I'm not sure what you mean by the https://bugs.eclipse.org/bugs/show_bug.cgi?id=409427#c17 comment? 

I wonder if you were you testing the new option in combination with using the ZIP option?

Comment 22 Miles Parker

2013-06-18 12:11:42 EDT

(In reply to comment #21)
> I'm not sure what you mean by the
> https://bugs.eclipse.org/bugs/show_bug.cgi?id=409427#c17 comment?

Well, in looking at the binary files, there was a lot of repeated text -- when the text was part of different object instances. That seems to miss a significant opportunity for compression. For instance, in our data a lot of the comments on reviews are machine generated boiler-plate.

> I wonder if you were you testing the new option in combination with using the
> ZIP option?

Wait, what zip option? I don't see anything like that in BinaryResourceImpl. :)

(By the way, Mylyn does compress offline task data with zip.)

Comment 23 Ed Merks

2013-06-18 19:38:01 EDT

If you set a breakpoint after the guards in this method, do you really see it write the same string more than once (and do you see it writing long strings, rather parts of longer strings)?

    public void writeString(String value) throws IOException
    {
      if (value == null)
      {
        writeCompressedInt(-1);
      }
      else
      {
        if (segmentToIDMap != null)
        {
          Integer id = segmentToIDMap.get(value);
          if (id != null)
          {
            writeCompressedInt(id);
            return;
          }
          else
          {
            int idValue = segmentToIDMap.size();
            segmentToIDMap.put(value, idValue);
            writeCompressedInt(idValue);
          }
        }

        int length = value.length();

There's a zip option for Resource itself and it's possible to use a URI that writes into a zip file, but I don't imagine this should save significant space.

Comment 24 Ed Merks

2013-06-19 06:51:32 EDT

Another question that strikes me is what you mean by "across objects".  The tabulated values are mapped for the lifetime of the resource save, so that should be sufficient.

Comment 25 Miles Parker

2013-06-19 14:07:46 EDT

Now, this is more like it! After tracing through the code, I realized that there was nothing happening because I'd neglected to set BinaryResourceImpl version to BinaryResourceImpl.BinaryIO.Version.VERSION_1_1. With that change in place I saw an improvement much more in line w/ what I'd seen before (e.g. w/ my Butterflyzer project).

For a bunch of reviews w/o patch sets (smaller, less opportunity for packing):

Binary		w/Option
16.4MB		2.7MB

For a single very large review:

Binary		w/Option
15.1MB		1.5MB

That's not a misplaced decimal. :D

So this is another strong data point for considering updating our EMF dependencies to 2.9 for Luna. (My guess is that that would be too disruptive to even consider for one of the Kepler SRs, but Steffen can speak to that if he wishes.)

I've updated https://git.eclipse.org/r/#/c/13856/ so we can track this possibility. (Re) Closing the actual bug because integration test failures appear to be unrelated.

Comment 26 Miles Parker

2013-06-19 14:08:30 EDT

(Note that the build obviously fails on the change because we don't have the 2.9 dependencies.)