Re: [jgit-dev] Question on large object streams

From: Shawn Pearce <spearce@xxxxxxxxxxx>
Date: Tue, 5 Oct 2010 12:17:32 -0700
Delivered-to: jgit-dev@xxxxxxxxxxx

On Tue, Oct 5, 2010 at 2:36 AM, Dmitry Neverov <dmitry.neverov@xxxxxxxxx> wrote:
>
> Is it correct that, since the temp file grows slowly, it is the
> inflating of the base object from the pack file that slows everything
> down? Can you please explain why using streams makes inflating slower
> than opening the blob content in memory?

A delta is stored as a sequence of instructions to either copy data
from the base object, or to insert a section of data which doesn't
appear in the base (but needs to appear in the result).  A copy
instruction has two arguments: the offset within the base to copy
from, and the total number of bytes to copy.  The offset is absolute
from the start of the base.

It is not uncommon for a delta instruction stream to skip around
within the base.  That is, it might be something like:

  COPY from=45, len=12
  INSERT "foo"
  COPY from=12, len=4
  INSERT "bar"
  COPY from=63, len=8
  INSERT "q"
  COPY from=0, len=8

Stepping from a COPY from=45 to COPY from=12 requires seeking
backwards in the base to reposition from offset 57 (where the copy
ended) to offset 12 (where the next copy starts).  When the base
object is stored as a byte[] in memory this backward seek is free:
we set a variable to the new position (12) and use that as the array
index to copy with.  When the base object is an inflater stream we
can't just seek backwards.  We have to close the stream, open it
again, and skip forwards to the position.  Skipping forwards in an
inflate stream requires inflating to a discard buffer.
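
To make the cost concrete, here is a minimal sketch of the difference.
It is not JGit's code and does not use Git's binary delta encoding; the
class DeltaSeekCost, its Op type and the streamedBytes() counter are
hypothetical names made up for this illustration.  apply() treats the
base as a byte[], so each COPY is a plain array index.  streamedBytes()
assumes a forward-only stream that has to be reopened and re-inflated
from offset 0 whenever a COPY starts before the current position.

```java
import java.io.ByteArrayOutputStream;
import java.nio.charset.StandardCharsets;
import java.util.List;

// Hypothetical model for illustration only; not JGit's implementation
// and not Git's on-disk delta format.
public class DeltaSeekCost {
    /** One instruction: COPY len bytes from base offset from, or INSERT literal data. */
    static final class Op {
        final int from;     // COPY only: absolute offset in the base
        final int len;      // COPY only: number of bytes to copy
        final byte[] data;  // INSERT only: literal bytes; null for COPY

        private Op(int from, int len, byte[] data) {
            this.from = from;
            this.len = len;
            this.data = data;
        }
        static Op copy(int from, int len) { return new Op(from, len, null); }
        static Op insert(String s) {
            return new Op(0, 0, s.getBytes(StandardCharsets.US_ASCII));
        }
        boolean isCopy() { return data == null; }
    }

    /** The instruction sequence from the example above. */
    static List<Op> example() {
        return List.of(
            Op.copy(45, 12), Op.insert("foo"),
            Op.copy(12, 4),  Op.insert("bar"),
            Op.copy(63, 8),  Op.insert("q"),
            Op.copy(0, 8));
    }

    /** Base held in memory: seeking is just choosing a different array index. */
    static byte[] apply(byte[] base, List<Op> ops) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        for (Op op : ops) {
            if (op.isCopy()) {
                out.write(base, op.from, op.len);
            } else {
                out.write(op.data, 0, op.data.length);
            }
        }
        return out.toByteArray();
    }

    /** Bytes a forward-only stream must inflate, counting skipped bytes too. */
    static long streamedBytes(List<Op> ops) {
        long total = 0;
        int pos = 0; // current position within the inflated base
        for (Op op : ops) {
            if (!op.isCopy()) {
                continue; // INSERT never touches the base
            }
            if (op.from < pos) {
                pos = 0; // backward seek: reopen and start inflating again
            }
            total += (op.from - pos) + op.len; // skip forward, then copy
            pos = op.from + op.len;
        }
        return total;
    }

    public static void main(String[] args) {
        byte[] base = new byte[80];
        for (int i = 0; i < base.length; i++) {
            base[i] = (byte) ('a' + i % 26);
        }
        List<Op> ops = example();
        System.out.println("result bytes:   " + apply(base, ops).length);
        System.out.println("streamed bytes: " + streamedBytes(ops));
    }
}
```

Under those assumptions the seven instructions above inflate 136 bytes
of base to deliver 32 bytes of copied data, with two reopens, while the
byte[] version touches only the 32 bytes it copies.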

That is why this is slow.  The examples above are contrived and on a
tiny object.  Most objects have much larger distances involved,
requiring a lot more skipping within the stream.  Unfortunately the
delta instruction generators for Git (both C and Java versions) always
favor the first N locations (typically N=64) within a file where a
given content can occur.  Now imagine an XML file that needs to use a
particular element very often... we may wind up copying that element
name from the first occurrence in the base each time we need it, but
then need to seek around to inject other common parts.

> We use a tip of the master with this patch on top of it
> (http://egit.eclipse.org/r/#change,1681). Jgit creates loose
> objects. One thought on loose objects: if we run gc on this repo, it
> will pack all these loose objects back to pack file, so gc will hurt
> jgit performance, won't it?

Yes.  Junio C Hamano and I talked about this.  We might want to
propose a change to C Git's gc command that leaves loose objects like
these around for a short period of time, e.g. 2 weeks, so that the
loose object directory can also act as a cache.  But until that's
working, gc will hurt JGit performance on big things that are stored
as deltas.

>> Increase the core.streamFileThreshold in your WindowCacheConfig to a
>> value larger than the default.  Right now the default is 5 MiB.  But I
>> thought I had patches queued on Gerrit to increase this to 50 MiB.
>>
> It turns out that I setDeltaBaseCacheLimit to 16Kb. Use of default
> WindowCacheConfig settings speeds up inflating to a temporary file,
> now it takes ~12 minutes per Mb.

16 KiB is incredibly tiny for deltaBaseCacheLimit.  At 16 KiB you
cannot fit very much in there, which means your application probably
needs to unpack every delta chain every time something is referenced.
This limit only applies to the objects under the streamFileThreshold
size, but it's still an important cache for performance (as you found
out).

If you have to make a tradeoff, use a higher deltaBaseCacheLimit and a
lower packedGitLimit.  IIRC this still tends to perform better because
you don't need to do 50+ delta unpacks on each object read, but you
may need to do more syscalls to copy a block of data from kernel space
to the JVM.
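
As a rough sketch of that tradeoff, the same limits can be written as
core.* settings in the repository's config file.  This assumes your
application builds its WindowCacheConfig from the repository config
rather than setting the values in code, and the numbers below are
illustrative only, not recommendations:

```
[core]
	# illustrative values; tune for your heap size
	deltaBaseCacheLimit = 32m
	packedGitLimit = 8m
	streamFileThreshold = 50m
```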

>> You can also use the -delta gitattribute when you pack the repository
>
> I tried that but with no noticeable effect on performance.

Did you repack with the -f flag to force recomputing deltas?  The idea
of the -delta gitattribute is to force git to store the object *not*
as a delta... but instead as a whole object.  That makes it possible
to read in a single inflate scan through the pack file, at the expense
of higher disk usage and network bandwidth.  Classic space/time
tradeoff.  :-)
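
For example, a .gitattributes entry along these lines turns off delta
compression for matching paths.  The *.xml pattern is only an
assumption, so substitute whatever matches your large files:

```
*.xml -delta
```

Objects already stored as deltas stay that way until the pack is
rewritten, so follow it with something like "git repack -a -d -f".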

-- 
Shawn.
