Skip to main content

[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index] [List Home]
Re: [jgit-dev] Implementing shallow clones

Matt Fischer <mattfischer84@xxxxxxxxx> wrote:
> >> 2)  StartGenerator makes use of a new DepthGenerator to tag the
> >> commits, but it has to be told to include this generator.  At the
> >> moment, I just did this by adding a new setter on StartGenerator,
> >> which gets the job done, but feels kind of ugly.
> >
> > Why not:
> >
> >  if (walker instanceof DepthWalk)
> >    g = new DepthGenerator(g);
> >
> > No need for a flag, its just done automatically if the walker was
> > allocated to perform depth computations.
> >
> I thought about that, but I'll need to do it for both DepthWalk and
> DepthObjectWalk.  I guess two classes isn't all that bad, but it felt
> like it might be getting a little ugly to have a big list in there.
> I'll go for that, though.

Yikes, I forgot that we need both a DepthWalk and a DepthObjectWalk.
Which is cute because the ObjectWalk has a much higher overhead
because of the need to track the trees, so you can't always use it.

Oh well, as you point out, two classes isn't that bad.
 
> > I don't think you need to enforce topo order here at all.
>
> Actually we do.  In addition to chopping off the walk at the
> appropriate times, I also need to send the list of boundary commits
> back, so that the client can mark them as shallow.  This requires me
> to know the depth on these commits exactly, or else I might mistakenly
> send back a commit that looks like it's at the boundary but really
> isn't.

OK.  That's reason enough to enforce topo order.

> > The depth filter may cause the pending generator to cut early because
> > of this late depth update.  This could cause us to exclude some
> > commits whose depth used to be below the cut-off, but are now above
> > it because we later found a shorter path to them.  But I think we
> > can fix this by just re-inserting the moved parent into the pending
> > queue, but only if it had been previously discarded by the filter.
> > To do that you probably need to allocate a RevFlag and share it
> > between this generator and the filter.  If the filter rejects a
> > commit because its depth is higher than the cutOffDepth it should
> > add the flag.  Then here if the flag is set we shove it back into
> > the queue when it moved above the cut off.  We'll pick it up again
> > during a future next() call and it will appear in the output.
>
> Interesting idea.  It seems a bit odd that the DepthGenerator knows
> about the cutoff level from the DepthFilter, though.  Would it be so
> bad to just make the DepthGenerator do all the filtering itself, and
> dispose of the DepthFilter entirely?

Just because you reach the depth on one branch doesn't mean you've
reached it on all of the other branches.  So the DepthGenerator
needs to tag a commit as being at depth, carry that onto all of
its parents, and then exclude those commits as they come out of
the PendingGenerator.

I guess that is no different than if you had a RevFilter performing
the logic, because the filter doesn't influence the UNINTERESTING
flag, which is how PendingGenerator decides what to cut.

So I think you may be right, you can just do the filtering in
DepthGenerator.

> I wasn't completely familiar with the concept of non-incremental
> generators until now, but now that I understand them, I'm wondering if
> the best way to do this would be to just make DepthGenerator descend
> from TopoSortGenerator, and tag them as they come out of it that way.

Interesting idea.  Though I really hate extending stuff in the
pipeline, I prefer to just wrap one generator with another because
it makes setting up the pipeline a bit more flexible.

> If I do the filtering as part of the generator itself, then I won't
> have to worry about the PendingGenerator applying the filter too
> early, and everything should work out.  At that point, I could even
> make the generator have a == mode and a <= mode, so that I can use one
> for the boundary generation, and the other for the pack generation.
> Does this sound like it would work?

Probably.

> This still has the disadvantage of walking the entire tree instead of
> cutting off once the depth has been reached, but that doesn't seem
> like it's the end of the world.  Do you feel like that's going to be a
> big performance drain, or would it be ok?

On small projects like JGit the difference isn't really that
noticable between spinning the entire history through the topo
generator, or spinning only a recent slice.

But on bigger projects like the Linux kernel, its very noticable.
Walking the commit chain just isn't fast enough, and never will be
in Java.  Even if pack v4 comes around and magically improves things.
Over time, its only going to get worse since its O(N), where N is
the life of the project.  :-(

I know it sounds kind of messy to be mucking with the depth in one
place, and then correcting for mistakes in another, but its sort
of par-for-the-course when doing commit walking, and is partly why
this code is so bloody complex.  Its all based around approximations
that are usually correct, but may yield incorrect results, and then
we try to patch them up.  Because for the really common cases, the
cheap-and-fast approximation is correct.  And for the less-common
cases, we can patch up the mistake with fairly little effort.
And then the really corner cases, well, we just fail and produce
incorrect results.  There's some work going on in C Git to try and
configure how much effort will be spent to patch up a corner case,
but their apparently leaning towards "not much" because a single
bad case years ago in a project might hurt performance forever.

Because topo sort requires complete knowledge of the commit graph,
we have to limit the size of the graph we feed it to be as small as
we can get away with and still have a good chance at being correct,
otherwise performance would just suck.

That's what the OVER_SCAN stuff is about in PendingGenerator.
We walk a few more commits than we think we need to, just in case
there is some clock skew going on.  C Git has recently changed that
to be configurable in the configuration file, because some retard
in the Linux kernel community managed to have their computer clock
be months off when they made a single commit.  But always scanning a
few extra months of commits "just in case" really hurts performance,
so they aren't sure they want to be doing that forever.

Since shallow support is all about grabbing only a recent slice of
the project, ideally we would be able to do it without needing to
enumerate the complete history of the project.  :-\

-- 
Shawn.


Back to the top