Re: [jgit-dev] Gerrit clone time takes ages, because of PackWriter.searchForReuse()

On Friday, October 2, 2020 2:24:17 PM MDT Terry Parker wrote:
> On Thu, Oct 1, 2020 at 4:05 PM Martin Fick <mfick@xxxxxxxxxxxxxx> wrote:
> > On Thursday, October 1, 2020 4:51:46 PM MDT Martin Fick wrote:
> > > On Thursday, October 1, 2020 4:44:22 PM MDT Martin Fick wrote:
> One thing that may help is ignoring duplicated content that a client sends
> to the server (as a follow on to the "goodput" measurement
> <https://git.eclipse.org/r/c/jgit/jgit/+/160004>). The idea is that
> when a server receives a pack from a client and realizes it contains objects
> the server already has, we can drop those objects from the received pack's
> index (making them invisible) and the next GC or compaction will drop them
> and reclaim the storage.
> 
> I'm not sure if that solves all of the searchForReuse inefficiencies, but it
> should help.
> 
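[Roughly, the dedup pass described above could look like the sketch below. This is illustration only, not the real JGit receive path: the receivedIds list and the later step that hides entries in the new pack's index are assumed here, and the check has to run against the packs that existed before the new pack becomes visible, or it would trivially match everything.]

import java.io.IOException;
import java.util.ArrayList;
import java.util.List;

import org.eclipse.jgit.lib.ObjectDatabase;
import org.eclipse.jgit.lib.ObjectId;
import org.eclipse.jgit.lib.Repository;

class DuplicateObjectScan {
    // Given ids parsed out of a newly received pack, report which ones the
    // repository already had beforehand; those entries are candidates to
    // hide in the new pack's index and reclaim at the next GC/compaction.
    static List<ObjectId> alreadyPresent(Repository repo,
            List<ObjectId> receivedIds) throws IOException {
        ObjectDatabase odb = repo.getObjectDatabase();
        List<ObjectId> dups = new ArrayList<>();
        for (ObjectId id : receivedIds) {
            if (odb.has(id)) {
                // already stored in a pre-existing pack
                dups.add(id);
            }
        }
        return dups;
    }
}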
> The Gerrit workflow of creating new refs/changes branches from a client
> that may not have been synced for a few days, plus the lack of negotiation
> in push, makes for this conversation:
> client: hello server!
> server: here are sha-1s for my current refs (including active branches
> updated in the last N minutes)
> client: I synced two days ago. I don't recognize most of those sha-1s. You
> didn't mention the parent commit for my new change, which was the tip of
> the 'main' branch when I started working on it. I see a sha-1 for the build
> we tagged last week. Let me send you all of history from that sha-1 to my
> shiny new change.
> 
> The situation is bad. Those goodput metrics show that for active Gerrit
> repos, over 90% of the data sent in pushes is data the server already has.
> Negotiation in push is the best way to solve it, but until that is widely
> available in clients (and even beyond to deal with cases where
> negotiation breaks down), I think ignoring duplicate objects at the
> point where the server receives them will make life a lot better.
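
[The mismatch in that conversation can be made visible from the client side: only the advertised tips can serve as common bases, and a stale clone recognizes few of them. A rough sketch follows; the "origin" remote name is a placeholder, and this is not how Gerrit/JGit actually decide what to send.]

import java.util.Map;

import org.eclipse.jgit.lib.Ref;
import org.eclipse.jgit.lib.Repository;
import org.eclipse.jgit.transport.PushConnection;
import org.eclipse.jgit.transport.Transport;

class AdvertisedRefCheck {
    // Count how many of the server's advertised ref tips the local repository
    // already has; a low count means the client will fall back to an old
    // common point (e.g. last week's tag) and push redundant history.
    static void report(Repository local) throws Exception {
        Transport tn = Transport.open(local, "origin"); // placeholder remote
        try {
            PushConnection conn = tn.openPush();
            try {
                Map<String, Ref> advertised = conn.getRefs();
                int known = 0;
                for (Ref r : advertised.values()) {
                    if (r.getObjectId() != null
                            && local.getObjectDatabase().has(r.getObjectId())) {
                        known++; // a tip the client could use as a common base
                    }
                }
                System.out.println(known + " of " + advertised.size()
                        + " advertised tips are known locally");
            } finally {
                conn.close();
            }
        } finally {
            tn.close();
        }
    }
}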

I haven't looked at those measurements; do they take into account the actual 
transferred bytes? I wonder whether the amount of data sent by the client (MBs 
transferred) is anywhere near as large as the amount of data that hits the disk 
in those packfiles. If the transfer is much smaller than the pack file, then it 
would mean that the duplicate data on disk is mostly the result of packfile 
thickening, not duplicate data being sent.
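
[To make that comparison concrete, the on-disk side can be summed up directly from the pack directory and set against whatever the goodput change records for wire bytes. A small sketch, assuming the wire-byte figure is supplied by the caller; the path and argument handling are just placeholders.]

import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.stream.Stream;

class PackBytesOnDisk {
    // Sum the sizes of all *.pack files under objects/pack in a git directory.
    static long packBytes(Path gitDir) throws IOException {
        Path packDir = gitDir.resolve("objects").resolve("pack");
        try (Stream<Path> files = Files.list(packDir)) {
            return files.filter(p -> p.toString().endsWith(".pack"))
                    .mapToLong(p -> p.toFile().length())
                    .sum();
        }
    }

    public static void main(String[] args) throws IOException {
        long onDisk = packBytes(Paths.get(args[0])); // path to the bare repo / .git dir
        long onWire = Long.parseLong(args[1]);       // transferred bytes, supplied by caller
        System.out.printf("disk/wire ratio: %.1f%n", (double) onDisk / onWire);
    }
}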

Packfile thickening is very wasteful, potentially much more so than sending 
duplicate deltas, since deltas tend to be space efficient, and thickening 
undoes all that efficiency by copying a full, non-deltified object from an 
existing pack into the new pack, making it balloon. So even though an 
out-of-date client may send a bunch of extra deltas the server doesn't need 
(because it thinks the server doesn't have them), it usually does so in a 
fairly space-efficient manner, but that space efficiency is lost upon receipt! 
It saves transfer data, but not storage data. I would not be surprised if this 
is where most of the extra data comes from.
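
[To put made-up but plausible numbers on that (pure illustration, not a measurement): a small edit to a roughly 1 MB file travels as a ~2 KB thin delta, but thickening appends the full base object to the stored pack.]

class ThickeningBlowUp {
    public static void main(String[] args) {
        long deltaOnWire = 2_000;                 // thin delta sent by the client (assumed size)
        long baseCopiedByThickening = 1_000_000;  // full, non-deltified base appended on receipt (assumed size)
        long storedBytes = deltaOnWire + baseCopiedByThickening;
        System.out.printf("sent %d bytes, stored %d bytes (~%dx)%n",
                deltaOnWire, storedBytes, storedBytes / deltaOnWire);
    }
}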

It would be nice if new packfiles could be appended to existing packfiles so 
that they would not need to be thickened.

-Martin

-- 
The Qualcomm Innovation Center, Inc. is a member of Code 
Aurora Forum, hosted by The Linux Foundation


