Re: [jgit-dev] Gerrit clone time takes ages, because of PackWriter.searchForReuse()
Thanks, Martin, for the prompt reply.
> On 1 Oct 2020, at 23:44, Martin Fick <mfick@xxxxxxxxxxxxxx> wrote:
>
> On Thursday, October 1, 2020 10:51:19 PM MDT Luca Milanesio wrote:
>> Looking at the code called by the searchForReuse, I ended up in:
>>
>> @Override
>> public void selectObjectRepresentation(PackWriter packer,
>> 		ProgressMonitor monitor, Iterable<ObjectToPack> objects)
>> 		throws IOException, MissingObjectException {
>> 	for (ObjectToPack otp : objects) {
>> 		db.selectObjectRepresentation(packer, otp, this);
>> 		monitor.update(1);
>> 	}
>> }
>>
>>
>> The above is a loop over *all* objects, and each iteration goes into another
>> scan of *all* packfiles inside selectObjectRepresentation().
>>
>> The slow clones were going through 2M objects on a repository with 4k
>> packfiles … the math says that is a nested loop of 2M x 4k => 8BN
>> operations. I am not surprised it is slow after all :-)
>
> Yes, it is terrible to have 4K pack files (or even 300) in a repo. It clearly
> needs to be repacked (but you knew that)!
Yeah, and the GC of the repo solves the problem … but that takes time (this is a *veeeery big* repo) and during the day pack files are piling up again.
So the problem is really *in between* GC cycles.
>
>> So, it looks like it works the way it is designed: very very slowly.
>>
>> My questions on the above are:
>> 1. Is there anyone else in the world, using Gerrit or JGit, with the same
>> problem?
>
> Well, I do think most people (and especially you) know that you should expect a
> repo with 4K packfiles to perform atrociously! We have certainly experienced it
> and we avoid it!
I have to say that *IF* we sort out this problem, then it won’t be *SO bad* after all.
>
>> 2. How to disable the search for reuse? (Even if I disable
>> reuseDelta or reuseObjects in the [pack] section of gerrit.config, the
>> searchForReuse() phase is triggered anyway)
>> 3. Would it make sense to estimate the combinatorial explosion of the
>> phase beforehand (it is simple: just multiply the number of objects x
>> the number of packfiles) and automatically disable that phase?
>
> I don't think the search for reuse is technically the problem. I think the
> problem is not short circuiting the search when one is found. If I remember
> correctly, the loop searches for all the possibilities to attempt to find the
> best one. So I do believe that some mechanism to short circuit this is needed,
> not just for the degenerate case of a repo that has not been repacked.
>
> We have repos that we call siblings, they share objects via the alternatives
> mechanisms. They are different copies of the kernel/msm repo. Each copy
> points to the other copy via alternatives. When they get repacked, each one
> ensures that it has a copy of all the objects it references, and since there
> is a lot of shared history in these repos, the main objects are in many of
> these repos. In the past, I measured clones to take along the lines of 20%
> longer for each alternative than if there were no alternatives. I tracked it
> down to this same problem. I welcome a solution in this area.
>
> My thought for solving this is to introduce ways to short circuit. Of course,
> short circuiting could lead to subpar performance in some cases too, so it is
> tricky. I would guess that once a delta is found, it would usually make sense
> to just send it. If however a non deltafied copy is found, it might still be
> worth looking a bit further for a deltafied one, maybe a configurable amount,
> one or two?
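That is exactly the direction I was thinking of. Just to make it concrete, here is a
rough sketch of such a short-circuiting selection, written against made-up stand-in
types rather than the real PackWriter/ObjectToPack API (PackCandidate, Representation,
isDelta() and the maxExtraPacksAfterWholeObject knob are all invented for illustration):

import java.util.List;

class ReuseSearchSketch {
	/** How many more packs to scan after a whole-object copy was found. */
	private final int maxExtraPacksAfterWholeObject = 2;

	Representation select(List<PackCandidate> packs, long objectId) {
		Representation whole = null; // first non-delta copy we saw
		int packsAfterWhole = 0;     // packs scanned since then
		for (PackCandidate pack : packs) {
			if (whole != null && ++packsAfterWhole > maxExtraPacksAfterWholeObject) {
				break; // looked a bit further, give up on finding a delta
			}
			Representation r = pack.find(objectId);
			if (r == null) {
				continue; // object not in this pack
			}
			if (r.isDelta()) {
				return r; // short circuit: a delta is usually good enough to send
			}
			if (whole == null) {
				whole = r; // remember the non-delta copy as a fallback
			}
		}
		return whole;
	}

	interface PackCandidate {
		Representation find(long objectId);
	}

	interface Representation {
		boolean isDelta();
	}
}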
>
> To make short circuiting work well, I believe it would make sense to order the
> packfiles in a way that fewer searches are likely to be needed before finding
> an object. I have thought about ordering based on dates: older packfiles are
> more likely to have deltas of more (most) objects. Similarly, the largest
> packfiles are also more likely to have deltafied objects in them since the
> largest packfiles are likely the ones that have been repacked, whereas small
> ones are likely to be new pushes which are more likely to have been thickened
> (which removes some deltas) on receipt of the pack. Additionally it might make
> sense to dynamically sort the packs based on the results of the searches
> themselves. As certain packfiles start to shine as the best candidates, they
> would get searched first. This might help transition well from packfile to
> packfile dynamically, especially with large object counts (clones), or if they
> have "islands" in them.
>
> Lastly, what does git do in this situation? What tradeoffs, if any, does it
> make when deciding which copy of an object to send?
Good point, let me repeat the tests with the latest and greatest C-based Git implementation.
>
>> P.S. I am planning to prepare a patch implementing 3., if we believe it’s a
>> good idea to auto-disable the phase.
>
> I look forward to testing out what you come up with,
+1
Luca.
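P.S. To make 3. concrete: the guard I have in mind is nothing more than the
multiplication above compared against a threshold, roughly like the sketch below.
The names and the default value are made up for illustration (they are not existing
JGit or Gerrit options), and the real patch would obviously make the limit configurable.

class SearchForReuseGuard {
	/** Above this many object-times-pack probes the reuse search is skipped.
	    Purely illustrative default, to be made configurable. */
	private final long searchForReuseLimit = 100_000_000L;

	boolean shouldSearchForReuse(long objectCount, long packCount) {
		// e.g. 2M objects x 4K packs = 8 billion probes: far over the limit,
		// so the phase would be skipped automatically in between GC cycles.
		return objectCount * packCount <= searchForReuseLimit;
	}
}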
>
> -Martin
>
> --
> The Qualcomm Innovation Center, Inc. is a member of Code
> Aurora Forum, hosted by The Linux Foundation
>