Re: [jgit-dev] Question on large object streams

From: Dmitry Neverov <dmitry.neverov@xxxxxxxxx>
Date: Tue, 5 Oct 2010 13:36:12 +0400

On Mon, Oct 4, 2010 at 10:13 PM, Shawn Pearce <spearce@xxxxxxxxxxx> wrote:

On Mon, Oct 4, 2010 at 10:17 AM, Dmitry Neverov
<dmitry.neverov@xxxxxxxxx> wrote:
> Why is reading from large object streams so slow?

It depends on how the large object is stored. :-)

If it's in a pack file, and is stored as a delta to another object, it's
so slow that it's unusable. If it's stored whole in the pack (not as a
delta), or is a loose object, its performance is acceptable. It's
slower than the fast path, but it's still something that a user won't
mind waiting for.

> We have a file of size ~10 MB and reading its content from ObjectStream
> takes forever.
> I see a file 'noz5208794269214828797.tmp' in the .git dir; its size grows
> slowly (~2 MB per hour).
> And whenever I pause execution I see this in the stack trace:
>
> main@1, prio=5, in group 'main', status: 'runnable'
> java.lang.Thread.State: RUNNABLE
>     locked <0xb05> (a java.util.zip.Inflater)
>     locked <0x94b> (a java.io.BufferedInputStream)
>     locked <0x603> (a jetbrains.buildServer.vcs.patches.PatchBuilderImpl)
>     at java.util.zip.Inflater.inflateBytes(Inflater.java:-1)
>     at java.util.zip.Inflater.inflate(Inflater.java:215)
>     at java.util.zip.InflaterInputStream.read(InflaterInputStream.java:128)
>     at org.eclipse.jgit.storage.pack.DeltaStream.fill(DeltaStream.java:263)

Right, this object is a delta in a pack file. What's happening here
is you are inflating the base object into a temporary file, and then
doing random seeks on that temporary file in order to apply the delta.
If the delta is at the end of a delta chain that is, say, 15 objects
long, you need to do this 15 times before you can get to the data for
the requested object. That can be a lot of work.

Is it correct that, since the temp file grows slowly, inflating the
base object from the pack file is what slows everything down? Can you please
explain why using streams makes inflating slower than opening the blob
content in memory?
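
To illustrate the delta chain described above, here is a minimal Java sketch. It is not JGit's DeltaStream code and does not use the real pack delta encoding; the `Op` class and its `copy`/`insert` helpers are hypothetical names made up for this example. It only shows why each delta needs random access to its fully reconstructed base, and why a chain of N deltas means N such passes.

```java
import java.io.ByteArrayOutputStream;
import java.nio.charset.StandardCharsets;
import java.util.List;

public class DeltaChainSketch {
    /** One simplified delta instruction: copy a range of the base, or insert literal bytes. */
    static final class Op {
        final int baseOffset;  // start of the range to copy from the base (copy ops only)
        final int length;      // number of bytes to copy from the base (copy ops only)
        final byte[] literal;  // bytes to insert (insert ops only, otherwise null)

        private Op(int baseOffset, int length, byte[] literal) {
            this.baseOffset = baseOffset;
            this.length = length;
            this.literal = literal;
        }

        static Op copy(int baseOffset, int length) {
            return new Op(baseOffset, length, null);
        }

        static Op insert(String text) {
            return new Op(0, 0, text.getBytes(StandardCharsets.US_ASCII));
        }
    }

    /** Applies one delta. Copy ops may jump to any offset, so the whole base must be available. */
    static byte[] apply(byte[] base, List<Op> delta) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        for (Op op : delta) {
            if (op.literal != null) {
                out.write(op.literal, 0, op.literal.length);
            } else {
                out.write(base, op.baseOffset, op.length); // random access into the base
            }
        }
        return out.toByteArray();
    }

    /** Resolves a chain: every delta needs the complete result of the previous step as its base. */
    static byte[] resolve(byte[] root, List<List<Op>> chain) {
        byte[] current = root;
        for (List<Op> delta : chain) {
            current = apply(current, delta);
        }
        return current;
    }

    public static void main(String[] args) {
        byte[] root = "hello world".getBytes(StandardCharsets.US_ASCII);
        List<List<Op>> chain = List.of(
                List.of(Op.copy(6, 5), Op.insert(", "), Op.copy(0, 5)), // -> "world, hello"
                List.of(Op.copy(7, 5), Op.insert("!")));                // -> "hello!"
        System.out.println(new String(resolve(root, chain), StandardCharsets.US_ASCII)); // prints "hello!"
    }
}
```

In this sketch the base is a byte array, so the random access is cheap. When the base is too large to hold in memory, it has to be written out somewhere seekable first, which is the role the temporary file plays in the description above.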

What version of JGit is this? Tip of master should be inflating these
objects into loose objects in the loose objects directory, such that
subsequent access is faster because it's just streaming from the loose
object rather than the packed form. But it's still slow for the
initial read. :-(

We use the tip of master with this patch on top of it
(http://egit.eclipse.org/r/#change,1681). JGit creates loose
objects. One thought on loose objects: if we run gc on this repo, it
will pack all these loose objects back into a pack file, so gc will hurt
JGit performance, won't it?

One thing we should do is teach IndexPack about this and have it cache
the large delta object as a loose object immediately during
fetch/clone so that during checkout we have fast access to that
content. But I hadn't thought about doing that until just now, so
whatever.

> How can we speed it up?

Increase the core.streamFileThreshold in your WindowCacheConfig to a
value larger than the default. Right now the default is 5 MiB. But I
thought I had patches queued on Gerrit to increase this to 50 MiB.

It turns out that I had called setDeltaBaseCacheLimit with 16 KB. Using
the default WindowCacheConfig settings speeds up inflating to the
temporary file; it now takes ~12 minutes per MB.
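
As a rough sketch of the threshold setting discussed above, the Git config fragment below raises core.streamFileThreshold to the 50 MiB value mentioned in this thread. It assumes the application builds its WindowCacheConfig from the repository or user config. An embedding application that constructs WindowCacheConfig in code, as appears to be the case here, would need to set the equivalent value there instead.

```ini
[core]
	# Objects larger than this are streamed instead of being inflated in memory.
	# 50m is the value mentioned in this thread, not a tested recommendation.
	streamFileThreshold = 50m
```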

You can also use the -delta gitattribute when you pack the repository
to try and keep these "large" files from being delta compressed. The
resulting pack will be bigger, but JGit will perform better when
accessing it because the bigger objects can be directly streamed.

I tried that, but with no noticeable effect on performance.
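
For reference, a -delta entry in .gitattributes might look like the fragment below. The `*.bin` pattern is only a placeholder for whatever paths hold the large files. Note that a repack normally reuses deltas already present in existing packs, so the attribute may only take effect after a forced repack such as `git repack -a -d -f`.

```
# Hypothetical pattern: do not attempt delta compression for these paths
*.bin -delta
```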

--
Shawn.

Dmitry
