Bug 477007 - Artifact and Metadata repositories should not use whitespace
Summary: Artifact and Metadata repositories should not use whitespace
Status: RESOLVED WONTFIX
Alias: None
Product: Equinox
Classification: Eclipse Project
Component: p2 (show other bugs)
Version: 4.5.0 Mars   Edit
Hardware: All All
: P3 enhancement (vote)
Target Milestone: 4.18   Edit
Assignee: Alex Blewitt CLA
QA Contact:
URL:
Whiteboard:
Keywords: noteworthy
: 432654 (view as bug list)
Depends on:
Blocks:
 
Reported: 2015-09-09 17:10 EDT by Alex Blewitt CLA
Modified: 2020-11-10 03:37 EST (History)
6 users (show)

See Also:


Attachments

Note You need to log in before you can comment on or make changes to this bug.
Description Alex Blewitt CLA 2015-09-09 17:10:43 EDT
When the artifact and metadata repositories are written with the XMLWriter in org.eclipse.equinox.internal.p2.persistence with indentation (generated by the indent method)

https://github.com/eclipse/rt.equinox.p2/blob/master/bundles/org.eclipse.equinox.p2.repository/src/org/eclipse/equinox/internal/p2/persistence/XMLWriter.java#L271

Although compression largely gets rid of the space in the compressed formats, it is still smaller still not to generate indentation. It also improves performance since the XML parser doesn't have to skip whitepsace when parsing the files; when starting up, for example, Eclipse will read the artifacts.xml file in uncompressed format. On my local Eclipse 4.5 install (which isn't heavily loaded) this is the size difference that it can make:

-rw-r--r--@ 1  180626  9 Sep 22:06 artifacts-no-space.xml
-rw-r--r--@ 1  211754  9 Sep 22:04 artifacts.xml

On a 512-block system, it's a difference of 352 blocks vs 413 to read the data off disk.

This is with both whitespace and newlines removed; a configuration property to allow the data to be written without whitespace and newlines could be passed in, and then have the printwriters add indentation where applicable.
Comment 1 Ed Merks CLA 2020-02-19 01:01:28 EST
*** Bug 432654 has been marked as a duplicate of this bug. ***
Comment 2 Lars Vogel CLA 2020-02-24 03:36:34 EST
Alex, could you provide a  Gerrit for this?
Comment 3 Lars Vogel CLA 2020-06-24 08:08:03 EDT
(In reply to Lars Vogel from comment #2)
> Alex, could you provide a  Gerrit for this?

Ping?
Comment 4 Eclipse Genie CLA 2020-06-26 17:22:27 EDT
New Gerrit change created: https://git.eclipse.org/r/165521
Comment 6 Lars Vogel CLA 2020-07-02 10:14:15 EDT
Thanks, Alex. Please set milestone and update bug status, if this work is done.
Comment 7 Lars Vogel CLA 2020-07-07 06:25:29 EDT
Alex, can you add this to the N&N and also describe the p2.useWhitespace option in this entry? I guess some people would like to make the generated files readable for debugging purpose.
Comment 8 Alex Blewitt CLA 2020-07-07 09:27:38 EDT
Sure, I can add a N&N once we've bedded down RC1 and we're happy with how we're generating the update sites. I wasn't sure if the code for generating the sites hosted on Eclipse would use the nightly builds, or if they would use a stable version (so we might not have seen it in use yet).

Given that they are XML files, you can pipe them through either xmltidy or open them up in the browser; the browser's built-in CSS for navigating XML file should show them as navigable, even if the underlying source isn't.
Comment 9 Ed Merks CLA 2020-08-05 01:30:00 EDT
I'm quite unhappy with this change. 

Oomph uses the XMLWriter to present information about IUs to the user in the Repository Explorer, and that feature has now become unusable.  

Firstly because it's unreadable and secondly because it's one giant long string, presenting that in StyledText performs so badly that the UI freezes for more than a minute in some cases.  Of course setting the text and styling it must be done in the UI thread, so there's no avoiding that.

These static final constants were added.  

org.eclipse.equinox.internal.p2.persistence.XMLWriter.useWhitespace
org.eclipse.equinox.internal.p2.update.XMLWriter.useWhitespace

The naming is already poor because final constants should be all upper case.

I would like to change each of these to:

	protected boolean useWhitespace = Boolean.getBoolean("p2.useWhitespace"); //$NON-NLS-1$

This way a client can change the value without having global impact.

I suppose even private would be okay because I could set and restore the system property, though that's pretty questionable as well. 

But the current design of a global final singleton value is not workable.
Comment 10 Lars Vogel CLA 2020-08-05 03:57:26 EDT
(In reply to Ed Merks from comment #9)
> I'm quite unhappy with this change. 
> 
> Oomph uses the XMLWriter to present information about IUs to the user in the
> Repository Explorer, and that feature has now become unusable.  
> 
> Firstly because it's unreadable and secondly because it's one giant long
> string, presenting that in StyledText performs so badly that the UI freezes
> for more than a minute in some cases.  Of course setting the text and
> styling it must be done in the UI thread, so there's no avoiding that.
> 
> These static final constants were added.  
> 
> org.eclipse.equinox.internal.p2.persistence.XMLWriter.useWhitespace
> org.eclipse.equinox.internal.p2.update.XMLWriter.useWhitespace
> 
> The naming is already poor because final constants should be all upper case.

Can be changed, as they have not found their way into a release. Please provide Gerrit.

> I would like to change each of these to:
> 
> 	protected boolean useWhitespace = Boolean.getBoolean("p2.useWhitespace");
> //$NON-NLS-1$
> 
> This way a client can change the value without having global impact.

Sounds good to me, please upload a Gerrit.
Comment 11 Alex Blewitt CLA 2020-08-05 06:11:21 EDT
(In reply to Ed Merks from comment #9)
> I'm quite unhappy with this change. 
> 
> Oomph uses the XMLWriter to present information about IUs to the user in the
> Repository Explorer, and that feature has now become unusable.  
> 
> Firstly because it's unreadable and secondly because it's one giant long
> string, presenting that in StyledText performs so badly that the UI freezes
> for more than a minute in some cases.  Of course setting the text and
> styling it must be done in the UI thread, so there's no avoiding that.
> 
> These static final constants were added.  
> 
> org.eclipse.equinox.internal.p2.persistence.XMLWriter.useWhitespace
> org.eclipse.equinox.internal.p2.update.XMLWriter.useWhitespace
> 
> The naming is already poor because final constants should be all upper case.
> 
> I would like to change each of these to:
> 
> 	protected boolean useWhitespace = Boolean.getBoolean("p2.useWhitespace");
> //$NON-NLS-1$
> 
> This way a client can change the value without having global impact.
> 
> I suppose even private would be okay because I could set and restore the
> system property, though that's pretty questionable as well. 
> 
> But the current design of a global final singleton value is not workable.

Why does Oomph use a P2-specific internal XML writer for this, instead of using its own mechanism?
Comment 12 Ed Merks CLA 2020-08-05 06:20:50 EDT
Why does anyone reuse any software rather than rewriting the whole thing from scratch themselves?

The underlying thinking behind this change seems to be that no human ever looks at these things so who cares that it's now become one giant unreadable line of gibberish.  Certainly no one will be able to read it now and if no one needs to read it ever, why bother writing it in XML in the first place?

I also question to change the default behavior rather than allowing for opt-in as opposed to forcing opt-out?
Comment 13 Alex Blewitt CLA 2020-08-05 06:23:19 EDT
It's an internal implementation detail of the P2 persistence mechanism. It shouldn't be used by those outside of P2. If you want to have an Oomph specific XMLWriter that you have control of, it would make sense to use that instead of having an internal dependency on something that may break/change without notice.
Comment 14 Ed Merks CLA 2020-08-05 07:23:17 EDT
(In reply to Alex Blewitt from comment #13)
> It's an internal implementation detail of the P2 persistence mechanism. It
> shouldn't be used by those outside of P2. If you want to have an Oomph
> specific XMLWriter that you have control of, it would make sense to use that
> instead of having an internal dependency on something that may break/change
> without notice.

API-rights activists discussions will get us nowhere.  Yes, these classes are internal, but unfortunately p2 is so rife with internals that one can't actually implement useful utilities on top of it without using internals. If there were a toXML utility API somewhere, that would be great, but there isn't.

If you look at the history of org.eclipse.equinox.internal.p2.persistence.XMLWriter, you'll see how it evolved to support generic requirements (Bug 528387) and that should make it clear that duplicating this code base is also not a solution to changing implementation details in p2.  It just means that I would have to keep duplicate code in lock-step just to appease the API-rights activists.
Comment 15 Lars Vogel CLA 2020-08-05 08:13:45 EDT
I think you both are right (Alex with we should not use internal API and Ed with sometimes it is just practical impossible to do not use internals).

I think Ed parameter suggestion does not do any harm so if you could provide a Gerrit that should be fine. 

Ed, please open also a bug to request a toXML API so that this can be used to official usage.
Comment 16 Ed Merks CLA 2020-08-05 08:19:23 EDT
I can provide a patch with the suggested changes for review.

Note that I was just suggesting to make protected non-final variables (initialize the same way as now) instead.  Okay?
Comment 17 Lars Vogel CLA 2020-08-05 08:33:15 EDT
(In reply to Ed Merks from comment #16)
> I can provide a patch with the suggested changes for review.
> 
> Note that I was just suggesting to make protected non-final variables
> (initialize the same way as now) instead.  Okay?

Fine for me.
Comment 18 Ed Merks CLA 2020-08-09 01:24:55 EDT
The more I look at and think about the impact of this change, it really bothers me a lot.

I know for sure that this will break long-established processing steps that have been written based on the fact that elements are on separate lines, e.g.,

https://git.eclipse.org/c/oomph/org.eclipse.oomph.git/tree/releng/org.eclipse.oomph.releng/src/ArtifactRepositoryAdjuster.java

So I look at how I could pass a system property to the tycho-p2-repository-plugin to turn this new behavior off. But I don't see how I can do that. 

I believe this change also impacts org.eclipse.equinox.internal.p2.persistence.CompositeWriter because it extends org.eclipse.equinox.internal.p2.persistence.XMLWriter and I'm sure that many people edit the composite manually or via scripts, though perhaps no so many applications actually create the composite using p2...

I really don't think the new behavior should be the new default.  I think new options such as this should be opt-in not opt-out.  But that begs the question of how to enable the opt-in behavior in things like tycho-p2-repository-plugin.  I believe that has been given no thought.

I think the change would be less likely to be disruptive if linefeeds were not eliminated, but rather only the indentation was eliminated. That way at least the XML is less readable rather than unreadable.  Certainly Eclipse editors *do not* like (does not properly display at least on Windows) lines that are several megabytes long.
Comment 19 Lars Vogel CLA 2020-08-10 04:26:04 EDT
Alexander, WDYT?
Comment 20 Alex Blewitt CLA 2020-08-11 14:15:07 EDT
(In reply to Ed Merks from comment #18)
> The more I look at and think about the impact of this change, it really
> bothers me a lot.
> 
> I know for sure that this will break long-established processing steps that
> have been written based on the fact that elements are on separate lines,
> e.g.,
> 
> https://git.eclipse.org/c/oomph/org.eclipse.oomph.git/tree/releng/org.
> eclipse.oomph.releng/src/ArtifactRepositoryAdjuster.java

Since it's an XML file, the processing should be done using XML tools, not line/text/regex oriented ones. It's good that we're catching such bugs and can fix them. In this case, it would seem that XPath would be the right candidate for determining the greatest version, and the Transformer would be appropriate for re-writing attributes. Then again, it's such a simple transformation that using a SAX parser would do the trick as well, since you'd only have to process an attribute.

https://docs.oracle.com/javase/7/docs/api/javax/xml/xpath/XPath.html
https://docs.oracle.com/javase/6/docs/api/javax/xml/transform/Transformer.html

> So I look at how I could pass a system property to the
> tycho-p2-repository-plugin to turn this new behavior off. But I don't see
> how I can do that.

Why should anyone using the artifacts/repository files care whether they're compressed or not? The APIs exist to be able to read/write them already, and we already have XML processing tools. In fact, a lot of work in Eclipse (e.g. EMF, E4) works almost exclusively with XML files.

> I believe this change also impacts
> org.eclipse.equinox.internal.p2.persistence.CompositeWriter because it
> extends org.eclipse.equinox.internal.p2.persistence.XMLWriter and I'm sure
> that many people edit the composite manually or via scripts, though perhaps
> no so many applications actually create the composite using p2...
> 
> I really don't think the new behavior should be the new default.  I think
> new options such as this should be opt-in not opt-out.  But that begs the
> question of how to enable the opt-in behavior in things like
> tycho-p2-repository-plugin.  I believe that has been given no thought.

No, because there wasn't the intention that this would have to be an opt-in change. We want to be able to write out more minimal files; having it as an opt-in property wouldn't make sense, because then you'd have to enable it at every build tool that might be using the P2 APIs. Having it opt-out was only there as a break-glass incident/access rather than an expected form of using the files.

> I think the change would be less likely to be disruptive if linefeeds were
> not eliminated, but rather only the indentation was eliminated. That way at
> least the XML is less readable rather than unreadable.  Certainly Eclipse
> editors *do not* like (does not properly display at least on Windows) lines
> that are several megabytes long.

But who is opening an editor on an XML file, and certainly the artifacts.xml/content.xml files, and why? I get that writing hand-held compositeContent/Artifacts.xml files might make sense, but in general they're pretty small in either case.

In the interest in moving forwards, perhaps a half-way house would be to re-enable the end-of-line markers for now (so they don't cause problems with those editing it) and would allow a transition for processing tools like ArtifactRepositoryAdjuster to be adjusted to do XML transformations instead of text ones? Then if necessary, a subsequent change could enable the stripping of newlines as well.
Comment 21 Ed Merks CLA 2020-08-12 01:13:30 EDT
> > So I look at how I could pass a system property to the
> > tycho-p2-repository-plugin to turn this new behavior off. But I don't see
> > how I can do that.
> 
> Why should anyone using the artifacts/repository files care whether they're
> compressed or not? The APIs exist to be able to read/write them already, and
> we already have XML processing tools. In fact, a lot of work in Eclipse
> (e.g. EMF, E4) works almost exclusively with XML files.
> 

I just explained why someone cares and I've pointed out two things that are broken or unusable as a result of the assumption that no one ought to care.

> > I believe this change also impacts
> > org.eclipse.equinox.internal.p2.persistence.CompositeWriter because it
> > extends org.eclipse.equinox.internal.p2.persistence.XMLWriter and I'm sure
> > that many people edit the composite manually or via scripts, though perhaps
> > no so many applications actually create the composite using p2...
> > 
> > I really don't think the new behavior should be the new default.  I think
> > new options such as this should be opt-in not opt-out.  But that begs the
> > question of how to enable the opt-in behavior in things like
> > tycho-p2-repository-plugin.  I believe that has been given no thought.
> 
> No, because there wasn't the intention that this would have to be an opt-in
> change. We want to be able to write out more minimal files; having it as an
> opt-in property wouldn't make sense, because then you'd have to enable it at
> every build tool that might be using the P2 APIs. Having it opt-out was only
> there as a break-glass incident/access rather than an expected form of using
> the files.
> 

I'm not sure who "we" are in this case. I would hope that "we" want to make changes that have maximal positive impact and minimal negative impact.


> > I think the change would be less likely to be disruptive if linefeeds were
> > not eliminated, but rather only the indentation was eliminated. That way at
> > least the XML is less readable rather than unreadable.  Certainly Eclipse
> > editors *do not* like (does not properly display at least on Windows) lines
> > that are several megabytes long.
> 
> But who is opening an editor on an XML file, and certainly the
> artifacts.xml/content.xml files, and why? I get that writing hand-held
> compositeContent/Artifacts.xml files might make sense, but in general
> they're pretty small in either case.
> 

Have you never needed to understand what p2 ends up publishing and why something isn't ending up in a repository?  Certainly the repository explorer tries to eliminate the need to do this manually, but assuming no human ever does this is a bad arbitrary assumption.

> In the interest in moving forwards, perhaps a half-way house would be to
> re-enable the end-of-line markers for now (so they don't cause problems with
> those editing it) and would allow a transition for processing tools like
> ArtifactRepositoryAdjuster to be adjusted to do XML transformations instead
> of text ones? Then if necessary, a subsequent change could enable the
> stripping of newlines as well.

In the interest of gatherings facts, I compared the size of this repo's content jar:

http://download.eclipse.org/releases/2020-09/202008071000/

with the size of what's produced now.  The saving is roughly 7% so that seems fairly significant.

But then I though, what about the *.xml.xz?  Unfortunately the mirror application doesn't support using org.eclipse.equinox.p2.internal.repository.tools.XZCompressor for that purpose.  In fact *nothing* in the platform in actually uses this class!  So I guess only Tycho uses it, but it's internal! Certainly I use it for producing EMF's update sites and for producing JustJ's update sites, but then I'm a bad person because I use p2 internal "APIs".  Of course the fact that there no way to produce the *.xml.xz without using internals kind of loops back into a pointlessly circular API discussion.

In any case, if I modify the MirrorApplication to use this class, the process of compressing the XML does not complete for many minutes.  In fact, it takes so long that I have lost patience waiting for it to complete after more than 10 minutes.

If I disable the whitespace "optimization" via the system property, the processing completes quickly as normal and the size difference is 814K versus 817K which is less than .4% size savings.

Clearly the unacceptable performance of XZCompressor for a single very long line of XML is a disaster waiting to happen.  And this will not be noticed except by downstream consumers using p2 internal APIs, which is the only "API" for producing a p2.index and the *.xml.xz variants.

In my opinion, this change should simply be reverted.  It does not significantly improve the size of the *.xml.xz and its definitely going to break clients, including Oomph, EMF, JustJ and Tycho.

If the goal is to improve p2 performance we should rather ensure that projects all produce *.xml.xz rather than trying to make the content.jar smaller.  Fortunately that is Tychos's default:

https://www.eclipse.org/tycho/sitedocs/tycho-p2/tycho-p2-repository-plugin/assemble-repository-mojo.html#xzCompress
Comment 22 Eike Stepper CLA 2020-08-13 01:26:49 EDT
I'm also either for reverting this change or for making it opt-in.

My project's build and promotion system uses line-by-line processing of p2 files since more than 10 years. Whether one likes that or not, it works and is fast. The size reduction through whitespace removal is less than 12 KB per content.jar.

In addition I personally have to look at the inner details of these files on almost a daily basis. And of course I use to use my machine's text editor for this purpose.

(In reply to Alex Blewitt from comment #0)
> On a 512-block system, it's a difference of 352 blocks vs 413 to read the
> data off disk.

Even for 1000 copies of this file that would save about 30 MB of disk space. The cost of this space can not be expressed in cents, I guess.

Size reduction is probably more interesting when it comes to (repeated) network transfers.

Also, I haven't seen any processing time benchmarks/comparisons which would justify this non-optional change.

> [...] a configuration property
> to allow the data to be written without whitespace and newlines could be
> passed in, and then have the printwriters add indentation where applicable.

This sounded more like an opt-in approach.
Comment 23 Alexander Kurtakov CLA 2020-08-13 01:33:48 EDT
I'll revert the change as we are really close to M3 and I won't be available to review new patch prior to M3. Alex, if you enhance it to be opt-in I'll review but it will probably be for the December release.
Comment 24 Eclipse Genie CLA 2020-08-13 01:40:38 EDT
New Gerrit change created: https://git.eclipse.org/r/c/equinox/rt.equinox.p2/+/167636
Comment 25 Ed Merks CLA 2020-08-13 01:43:22 EDT
Today I tried to measure better what was going wrong with XZCompressor performance and today it did not take exceedingly long with the whitespace missing.  I'm not sure what was going on yesterday.  Perhaps some Windows anti-virus behavior...  :-(

In any case, the 814K content.xml.xz is reduced to 793K, so less than a 3.5% reduction.

I just don't believe this saving justifies the reduced human readability.  Just last week when working on JustJ trying to generate an "a.jre.*" IU (for use in a Tycho build as an environment) using a p2.inf, I was constantly looking at the p2 metadata of the repo and even the target/p2content.xml (generated by Tycho using XMLWriter) so I could see what the publisher is actually publishing.  This is when I noticed the repository explorer performs poorly now and that led to finding the changes in this Bugzilla as the cause...
Comment 26 Eclipse Genie CLA 2020-08-13 03:40:59 EDT
New Gerrit change created: https://git.eclipse.org/r/c/equinox/rt.equinox.p2/+/167639
Comment 27 Alex Blewitt CLA 2020-08-13 04:06:10 EDT
Ed/Eike, can you file bugs that highlight the areas that you have found where tools have been written assuming a particular layout of the XML file so they can be fixed? Feel free to assign them to me.
Comment 29 Alexander Kurtakov CLA 2020-08-13 04:21:14 EDT
Alex, with the change being made opt-in now - it would need a way to tell tycho to opt-in. Are you interested in this?
Comment 30 Alex Blewitt CLA 2020-08-13 04:49:57 EDT
Yes, we'll need a way of making sure that the builds hosted at Eclipse.org produce the compact repositories by default, but I think that some of the processing in Oomph as highlighted by Ed needs to be resolved before we can make it so.

Either way, there should be a follow up to make this the default either for all P2 users (and allow Oomph to opt out where necessary) or for making the Tycho builds opt-in. Do you think it would make sense for that bug to live in P2 so it can co-ordinate users of P2 code?
Comment 31 Alexander Kurtakov CLA 2020-08-13 04:53:19 EDT
(In reply to Alex Blewitt from comment #30)
> Yes, we'll need a way of making sure that the builds hosted at Eclipse.org
> produce the compact repositories by default, but I think that some of the
> processing in Oomph as highlighted by Ed needs to be resolved before we can
> make it so.
> 
> Either way, there should be a follow up to make this the default either for
> all P2 users (and allow Oomph to opt out where necessary) or for making the
> Tycho builds opt-in. Do you think it would make sense for that bug to live
> in P2 so it can co-ordinate users of P2 code?

It's fine IMO.
Comment 32 Alex Blewitt CLA 2020-08-14 04:29:04 EDT
Raised bug 566070 to investigate surrounding issues. Ed, can you mark any bugs you raise regarding Oomph etc as blockers on that bug?
Comment 33 Thomas Watson CLA 2020-08-27 11:01:06 EDT
I'm moving this out to 4.18 although my vote would be to not bother with this at all and keep the human readable whitespace.
Comment 34 Alexander Kurtakov CLA 2020-11-10 03:37:29 EST
Let's say there is bigger fish to catch now.