545145 – Major performance regression in ChecksumProducer

Bug 545145 - Major performance regression in ChecksumProducer

Summary: Major performance regression in ChecksumProducer

Status:	CLOSED WONTFIX

Alias:	None

Product:	Equinox
Classification:	Eclipse Project
Component:	p2 (show other bugs)
Version:	4.11
Hardware:	All All

Importance:	P3 normal (vote)
Target Milestone:	---
Assignee:	Mykola Nikishov
QA Contact:

URL:
Whiteboard:	stalebug
Keywords:	performance

Depends on:
Blocks:	545159
	Show dependency tree

Reported:	2019-03-06 14:56 EST by Sebastian Ratz
Modified:	2022-02-18 08:11 EST (History)
CC List:	6 users (show)

See Also:	Gerrit Change 545159

Attachments
Add an attachment (proposed patch, testcase, etc.)

Note You need to log in before you can comment on or make changes to this bug.

Description Sebastian Ratz

2019-03-06 14:56:06 EST

In Bug 423715, support for multiple checksum algorithms was added.

This introduced a major performance impact due to inefficient handling of the InputStream:
https://git.eclipse.org/r/#/c/69560/32/bundles/org.eclipse.equinox.p2.repository/src/org/eclipse/equinox/internal/p2/repository/helpers/ChecksumProducer.java

The old code used

    while ((read = fis.read(buffer, 0, bufferSize)) != -1) {

to read in 4k chunks, while the new code uses

    while (fis.read(buffer) != -1) {

which reads byte by byte.

This is extremely inefficient.

Comment 1 Eclipse Genie

2019-03-06 14:57:14 EST

New Gerrit change created: https://git.eclipse.org/r/138216

Comment 2 Alexander Kurtakov

2019-03-06 14:58:43 EST

Do you have any numbers showing how much your patch improves the situation?

Comment 3 Sebastian Ratz

2019-03-06 15:30:22 EST

Sorry, I actually didn't read the code correctly.

I though it called read(), not not read(buffer).

So all is good here.

Looks like the reason is just that SHA-256 is way slower than MD5 :/

Comment 4 Alexander Kurtakov

2019-03-06 15:32:51 EST

Can the reason be that both checksums are calculated? I would be fine with skipping md5 calculation if sha is there . WDYT? Ready to try it out?

Comment 5 Sebastian Ratz

2019-03-07 03:51:21 EST

Yes, creating both checksums causes the file to be read twice, so that's already a factor of two.

We could introduce a custom MultiDigestInputStream that is able to update the digest for multiple algorithms in one go while only reading the file once.

Is disabling of MD5 just like that reasonable?

Comment 6 Alexander Kurtakov

2019-03-07 03:55:29 EST

(In reply to Sebastian Ratz from comment #5)
> Yes, creating both checksums causes the file to be read twice, so that's
> already a factor of two.
> 
> We could introduce a custom MultiDigestInputStream that is able to update
> the digest for multiple algorithms in one go while only reading the file
> once.
> 
> Is disabling of MD5 just like that reasonable?

It's not disabling it like that it's when the stronger checksum (sha256) is checked, don't bother with md5. If there is no sha256 - md5 should still be checked.

Comment 7 Mykola Nikishov

2019-03-07 06:43:29 EST

(In reply to Sebastian Ratz from comment #5)
> Yes, creating both checksums causes the file to be read twice, so that's
> already a factor of two.
If this logic is correct, p2 would read the file not twice but 4+ times.

Check o.e.e.internal.provisional.p2.artifact.repository.processing.ProcessingStep
- p2 reads artifact bytes only once and applyes a multiple ProcessingSteps
by chaining them.

Comment 8 Sebastian Ratz

2019-03-07 06:48:31 EST

This is about the production of the checksum.

See
org.eclipse.equinox.internal.p2.artifact.processors.checksum.ChecksumUtilities.calculateChecksums(File, Map<String, String>, Collection<String>)

It calls
ChecksumProducer.produce(pathOnDisk, algorithm)

for every algorithm which each time instantiates a new InputStream.

Comment 9 Mykola Nikishov

2019-03-07 06:53:44 EST

(In reply to Mykola Nikishov from comment #7)
Please disregard, I constantly mixing checksum generation/verification ;-)

Comment 10 Mykola Nikishov

2019-03-07 06:56:21 EST

org.eclipse.equinox.internal.p2.repository.helpers.ChecksumProducer.produce(File, String) should use ProcessingSteps approach and not read file every single time for each MessageDigest supported.

Comment 11 Mickael Istria

2019-03-07 15:02:00 EST

I think we really need some A/B testing here to sort out
1. what's the actual performance drop factor (is it a constant factor, a N-power-factor...), and see whether it's an algorithmic issue or just some useless steps to remove
2. How many times we invoke the ChecksumProducer.produce(...) methods and whether they do match expectation
3. Whether it's specifically the SHA-256 that's taking long, and if so, why?

In bug 545159, you mention 30+ minutes for 300 bundles IIRC. But here locally with `sha256sum` command, I can generate the sha-256 checksum of the 685 plugins of a local Eclipse instance in less than a second. I would expect similar order for SHA-256 computed with Java. If the SHA-256 computation itself is the bottleneck, then it narrows down where to improve things.

Comment 12 Thomas Watson

2019-11-19 09:33:37 EST

Dropping milestone until someone agrees to progress this issue for a release.

Comment 13 Ed Merks

2020-02-28 13:33:34 EST

So I think this issue boils down to:  The computation could be made (much?) faster if the file were read only once.

That doesn't sound critical to me, so I'll reduce the severity.

Comment 14 Eclipse Genie

2022-02-18 08:11:26 EST

This bug hasn't had any activity in quite some time. Maybe the problem got resolved, was a duplicate of something else, or became less pressing for some reason - or maybe it's still relevant but just hasn't been looked at yet. As such, we're closing this bug.

If you have further information on the current state of the bug, please add it and reopen this bug. The information can be, for example, that the problem still occurs, that you still want the feature, that more information is needed, or that the bug is (for whatever reason) no longer relevant.

--
The automated Eclipse Genie.