Bug 573660 - JGit clone time takes ages, because of PackWriter.searchForReuse()
Summary: JGit clone time takes ages, because of PackWriter.searchForReuse()
Status: NEW
Alias: None
Product: JGit
Classification: Technology
Component: JGit
Version: 5.11
Hardware: All
OS: Linux
Importance: P3 major
Target Milestone: ---
Assignee: Project Inbox
Depends on: 573675
Reported: 2021-05-20 05:21 EDT by Luca Milanesio
Modified: 2021-06-03 11:18 EDT

Description Luca Milanesio 2021-05-20 05:21:30 EDT
The issue was raised a few months ago on the Gerrit mailing list [1], but no official JGit bug was filed at that time.

I'm reporting here a short summary of the problem, for traceability.

I have a situation where *certain* Git fetches take 10x to 100x the expected execution time. See the graph at [2].

The blue dots represent the total execution time of the upload-pack, whilst the red dots represent the time spent searching for object reuse in existing packfiles.

The normal, expected execution time is the blue band around 10s. The slower executions are the ones over 100s, and the *super slow* ones reach up to 1000s.
The repository is *exactly the same* for all those upload-pack executions.

Looking at the code called by searchForReuse(), I ended up in:

	@Override
	public void selectObjectRepresentation(PackWriter packer,
			ProgressMonitor monitor, Iterable<ObjectToPack> objects)
			throws IOException, MissingObjectException {
		for (ObjectToPack otp : objects) {
			db.selectObjectRepresentation(packer, otp, this);
			monitor.update(1);
		}
	}


The above is a loop over *all* objects, and each iteration triggers another scan over *all* packfiles inside selectObjectRepresentation().

The slow clones were going through 2M objects on a repository with 4k packfiles … the math says that is a nested cycle of 2M x 4k => 8BN operations.
I am not surprised it is slow, after all :-)

So, it looks like it works the way it is designed: very very slowly.
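
To make the shape of that cost concrete, here is a minimal, self-contained model of the nesting (illustrative only, not JGit code):

    public class SearchForReuseCostModel {
        public static void main(String[] args) {
            long packs = 4_000L; // packfiles in the repository described above
            // One pack-index probe per (object, pack) pair:
            for (long objects : new long[] { 20_000L, 200_000L, 2_000_000L }) {
                System.out.println(objects + " objects -> " + (objects * packs) + " probes");
            }
            // 2,000,000 objects x 4,000 packs => 8BN probes, matching the estimate above.
        }
    }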

My questions on the above are:
1. Is there anyone else in the world, using Gerrit or JGit, with the same problem?
2. How can the search for reuse be disabled? (Even if I disable reuseDelta or reuseObjects in the [pack] section of gerrit.config, the searchForReuse() phase is triggered anyway.)
3. Would it make sense to estimate the combinatorial explosion of the phase beforehand (it is simple: just multiply the number of objects by the number of packfiles) and automatically disable that phase? See the sketch right after this list.
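
A minimal sketch of the guard from question 3 (shouldSearchForReuse and REUSE_SEARCH_BUDGET are hypothetical names, not JGit API, and the threshold is an arbitrary assumption):

    public class ReuseSearchHeuristic {
        // Assumed budget of objects-x-packs index probes; not a tuned constant.
        private static final long REUSE_SEARCH_BUDGET = 100_000_000L;

        // The reuse search is O(objects x packs), so its cost can be
        // estimated up front from the two counts alone.
        static boolean shouldSearchForReuse(long objectCount, long packCount) {
            return objectCount * packCount <= REUSE_SEARCH_BUDGET;
        }

        public static void main(String[] args) {
            // The case from this report: ~2M objects against ~4k packfiles.
            System.out.println(shouldSearchForReuse(2_000_000L, 4_000L)); // false
            // A well-packed repository: ~2M objects against 2 packfiles.
            System.out.println(shouldSearchForReuse(2_000_000L, 2L)); // true
        }
    }

The appeal of such a guard is that both counts are known before the scan starts, so the decision itself costs nothing.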

P.S. Of course, having 4k packfiles is an issue in itself, and we should aim for at most 2 packfiles, or even just one. However, in real terms, busy repositories will always end up with at least a few hundred packfiles, and optimising the search-for-reuse algorithm would benefit them all.

Thank you in advance for your comments, having read the ones at [1].

[1] https://groups.google.com/g/repo-discuss/c/xccn2CbLlBs/m/8Rz9-IxiBgAJ
[2] https://groups.google.com/group/repo-discuss/attach/65f4c17810110/Screenshot%202020-10-01%20at%2022.41.17.png?part=0.1.1&view=1
Comment 1 Matthias Sohn 2021-05-20 18:14:50 EDT
Performance of object lookup degrades as the number of packfiles grows, since pack indexes are stored per packfile and multiple indexes may have to be inspected for a single object lookup.

This could be improved with multi-pack-index and a multi-pack-index based reachability bitmap. See bug 573675.
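
To illustrate (a sketch, not JGit's actual implementation): each packfile has its own sorted index, so in the worst case a lookup probes every index before finding the object.

    import java.util.Arrays;
    import java.util.List;

    public class PerPackLookup {
        // Returns the position of the pack whose index contains the key, or -1.
        static int findPack(List<long[]> sortedPackIndexes, long key) {
            for (int i = 0; i < sortedPackIndexes.size(); i++) {
                // O(log n) per index; O(packs * log n) worst case overall.
                if (Arrays.binarySearch(sortedPackIndexes.get(i), key) >= 0) {
                    return i;
                }
            }
            return -1; // probed every index without a hit
        }

        public static void main(String[] args) {
            List<long[]> indexes = List.of(new long[] { 1, 5 }, new long[] { 2, 9 });
            System.out.println(findPack(indexes, 9)); // 1
            System.out.println(findPack(indexes, 4)); // -1
        }
    }

A multi-pack-index collapses those per-pack indexes into one sorted structure, replacing the linear scan over packs with a single lookup.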
Comment 2 Luca Milanesio 2021-05-21 11:42:59 EDT
> This could be improved with multi-pack-index and a multi-pack-index based reachability bitmap. See bug 573675.

Yes, I was thinking about the same thing.
We will look into that.
Comment 3 Fabio Ponciroli 2021-06-01 10:37:37 EDT
In addition to Luca's comment: "selectObjectRepresentation" is, by design, searching all the packfiles in reverse chronological order.

Once a candidate is found, the iteration is not short-circuited: it continues analysing all the packfiles, looking for the best candidate.

"PackWriter::select" is responsible for appointing the best candidate. Here is a snippet of the "select" Javadoc: "An org.eclipse.jgit.lib.ObjectReader implementation should invoke this method once for each representation available for an object, to allow the writer to find the most suitable one for the output."

What is not clear to me yet is:
1) what the definition of "most suitable object" is;
2) whether we can just use a "good enough" object when forecasting an expensive search-for-reuse phase (see the sketch below).
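
For illustration, a hypothetical sketch of the "good enough" idea from question 2 (none of these names are JGit API, and the acceptance test is an assumption): instead of offering every representation to the writer and keeping the best, the scan stops at the first acceptable one.

    public class GoodEnoughSelection {
        static class Rep {
            final long size;
            final boolean isDelta;
            Rep(long size, boolean isDelta) { this.size = size; this.isDelta = isDelta; }
        }

        // Returns the first representation passing the (assumed) acceptance
        // test, else the smallest one seen across all packs.
        static Rep select(Iterable<Rep> repsNewestFirst, long goodEnoughBytes) {
            Rep best = null;
            for (Rep r : repsNewestFirst) { // packs in reverse chronological order
                if (best == null || r.size < best.size) {
                    best = r;
                }
                if (r.isDelta && r.size <= goodEnoughBytes) {
                    return r; // short-circuit: the remaining packs are never scanned
                }
            }
            return best;
        }
    }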
Comment 4 Fabio Ponciroli 2021-06-01 12:26:46 EDT
I wrote the following test, which shows the slowdown Luca was talking about:

    // A cursor into the object database (kept from the original test, unused below).
    WindowCursor repo1Cur = new WindowCursor(db.getObjectDatabase());
    GC externalGCOnRepo = new GC(db);
    // Generous grace periods so freshly written packs are never pruned,
    // letting packfiles accumulate across the GC runs below.
    externalGCOnRepo.setExpireAgeMillis(1000000);
    externalGCOnRepo.setPackExpireAgeMillis(1000000);

    // 100 commits interleaved with GCs, so packfiles pile up (~one per commit).
    ObjectId id = commitFile("file.txt", "test", "master").getId();
    for (int i = 0; i < 100; i++) {
        externalGCOnRepo.gc();
        commitFile("file" + i + ".txt", "test " + i, "master").getId();
    }

    // Another 100 commits, left as loose objects.
    for (int i = 0; i < 100; i++) {
        commitFile("another-file" + i + ".txt", "test " + i, "master").getId();
    }

    // Time the final GC, which repacks everything and runs the reuse search.
    System.out.println("Start GC");
    long startTime = System.currentTimeMillis();
    externalGCOnRepo.gc();
    long endTime = System.currentTimeMillis();
    long difference = endTime - startTime;
    System.out.println("End GC in " + difference + " ms");
The last GC took around 13000 ms without any change to the code. However, if I comment out PackWriter::searchForReuse, the GC takes around 300 ms!?

I guess this was probably expected, but Luca's suggestion of disabling the search-for-reuse phase when we estimate a potentially costly computation seems like a good idea.

Am I missing anything in the picture? Does PackWriter::searchForReuse have side effects I have not considered, meaning it cannot simply be skipped?