Bug 545145 - Major performance regression in ChecksumProducer
Summary: Major performance regression in ChecksumProducer
Status: CLOSED WONTFIX
Alias: None
Product: Equinox
Classification: Eclipse Project
Component: p2 (show other bugs)
Version: 4.11   Edit
Hardware: All All
: P3 normal (vote)
Target Milestone: ---   Edit
Assignee: Mykola Nikishov CLA
QA Contact:
URL:
Whiteboard: stalebug
Keywords: performance
Depends on:
Blocks: 545159
  Show dependency tree
 
Reported: 2019-03-06 14:56 EST by Sebastian Ratz CLA
Modified: 2022-02-18 08:11 EST (History)
6 users (show)

See Also:


Attachments

Note You need to log in before you can comment on or make changes to this bug.
Description Sebastian Ratz CLA 2019-03-06 14:56:06 EST
In Bug 423715, support for multiple checksum algorithms was added.

This introduced a major performance impact due to inefficient handling of the InputStream:
https://git.eclipse.org/r/#/c/69560/32/bundles/org.eclipse.equinox.p2.repository/src/org/eclipse/equinox/internal/p2/repository/helpers/ChecksumProducer.java

The old code used

    while ((read = fis.read(buffer, 0, bufferSize)) != -1) {

to read in 4k chunks, while the new code uses

    while (fis.read(buffer) != -1) {

which reads byte by byte.

This is extremely inefficient.
Comment 1 Eclipse Genie CLA 2019-03-06 14:57:14 EST
New Gerrit change created: https://git.eclipse.org/r/138216
Comment 2 Alexander Kurtakov CLA 2019-03-06 14:58:43 EST
Do you have any numbers showing how much your patch improves the situation?
Comment 3 Sebastian Ratz CLA 2019-03-06 15:30:22 EST
Sorry, I actually didn't read the code correctly.

I though it called read(), not not read(buffer).

So all is good here.

Looks like the reason is just that SHA-256 is way slower than MD5 :/
Comment 4 Alexander Kurtakov CLA 2019-03-06 15:32:51 EST
Can the reason be that both checksums are calculated? I would be fine with skipping md5 calculation if sha is there . WDYT? Ready to try it out?
Comment 5 Sebastian Ratz CLA 2019-03-07 03:51:21 EST
Yes, creating both checksums causes the file to be read twice, so that's already a factor of two.

We could introduce a custom MultiDigestInputStream that is able to update the digest for multiple algorithms in one go while only reading the file once.

Is disabling of MD5 just like that reasonable?
Comment 6 Alexander Kurtakov CLA 2019-03-07 03:55:29 EST
(In reply to Sebastian Ratz from comment #5)
> Yes, creating both checksums causes the file to be read twice, so that's
> already a factor of two.
> 
> We could introduce a custom MultiDigestInputStream that is able to update
> the digest for multiple algorithms in one go while only reading the file
> once.
> 
> Is disabling of MD5 just like that reasonable?

It's not disabling it like that it's when the stronger checksum (sha256) is checked, don't bother with md5. If there is no sha256 - md5 should still be checked.
Comment 7 Mykola Nikishov CLA 2019-03-07 06:43:29 EST
(In reply to Sebastian Ratz from comment #5)
> Yes, creating both checksums causes the file to be read twice, so that's
> already a factor of two.
If this logic is correct, p2 would read the file not twice but 4+ times.

Check o.e.e.internal.provisional.p2.artifact.repository.processing.ProcessingStep
- p2 reads artifact bytes only once and applyes a multiple ProcessingSteps
by chaining them.
Comment 8 Sebastian Ratz CLA 2019-03-07 06:48:31 EST
This is about the production of the checksum.

See
org.eclipse.equinox.internal.p2.artifact.processors.checksum.ChecksumUtilities.calculateChecksums(File, Map<String, String>, Collection<String>)

It calls
ChecksumProducer.produce(pathOnDisk, algorithm)

for every algorithm which each time instantiates a new InputStream.
Comment 9 Mykola Nikishov CLA 2019-03-07 06:53:44 EST
(In reply to Mykola Nikishov from comment #7)
Please disregard, I constantly mixing checksum generation/verification ;-)
Comment 10 Mykola Nikishov CLA 2019-03-07 06:56:21 EST
org.eclipse.equinox.internal.p2.repository.helpers.ChecksumProducer.produce(File, String) should use ProcessingSteps approach and not read file every single time for each MessageDigest supported.
Comment 11 Mickael Istria CLA 2019-03-07 15:02:00 EST
I think we really need some A/B testing here to sort out
1. what's the actual performance drop factor (is it a constant factor, a N-power-factor...), and see whether it's an algorithmic issue or just some useless steps to remove
2. How many times we invoke the ChecksumProducer.produce(...) methods and whether they do match expectation
3. Whether it's specifically the SHA-256 that's taking long, and if so, why?

In bug 545159, you mention 30+ minutes for 300 bundles IIRC. But here locally with `sha256sum` command, I can generate the sha-256 checksum of the 685 plugins of a local Eclipse instance in less than a second. I would expect similar order for SHA-256 computed with Java. If the SHA-256 computation itself is the bottleneck, then it narrows down where to improve things.
Comment 12 Thomas Watson CLA 2019-11-19 09:33:37 EST
Dropping milestone until someone agrees to progress this issue for a release.
Comment 13 Ed Merks CLA 2020-02-28 13:33:34 EST
So I think this issue boils down to:  The computation could be made (much?) faster if the file were read only once.

That doesn't sound critical to me, so I'll reduce the severity.
Comment 14 Eclipse Genie CLA 2022-02-18 08:11:26 EST
This bug hasn't had any activity in quite some time. Maybe the problem got resolved, was a duplicate of something else, or became less pressing for some reason - or maybe it's still relevant but just hasn't been looked at yet. As such, we're closing this bug.

If you have further information on the current state of the bug, please add it and reopen this bug. The information can be, for example, that the problem still occurs, that you still want the feature, that more information is needed, or that the bug is (for whatever reason) no longer relevant.

--
The automated Eclipse Genie.