Re: [jgit-dev] Comparing JGit and CGit

On Thu, Jan 27, 2011 at 13:21, Robin Rosenberg
<robin.rosenberg@xxxxxxxxxx> wrote:
> I discovered some funny numbers.

It's not that funny.  JGit uses HistogramDiff, which my testing showed
was 8x faster than MyersDiff.  C Git is still using MyersDiff.  Junio
and I talked about porting HistogramDiff to C Git, but I haven't had
the time to try and do it.... let alone run it through the gauntlet
that is the git development list's code review process.

> Seems JGit beats the pants off C Git. A quick scan implies both versions
> produce correct diffs, but don't post this to the Git ML list just yet :)

I hope we produce a correct diff.  If it's incorrect, we need to fix
that, and possibly sacrifice speed.  :-)

> There are some important differences. C Git produces more readable
> diffs, i.e. something that a human would produce. It also has the extra
> information in the hunk header that explains

I had hoped that HistogramDiff would be more readable.  I suspect it's
not because I'm cheating and using only one side to be unique, rather
than both like PatienceDiff calls for when splitting the files into
three regions (before, common, after) and recursing.  Therefore we may
split on a common point that is unique in A, but is not unique in B...
and get a less readable diff as a result of that.  It's a fudge I tried
in order to be fast.  It may not be working out.
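
To make the difference concrete, here is a rough Python sketch (not JGit
code; the function name and the sample lines are made up for illustration).
It lists which lines of A could serve as a split point under each rule:
unique in A only, or unique in both A and B.

```python
from collections import Counter

def split_candidates(a, b, require_unique_in_b):
    """Return the lines of `a` that could serve as a split point.

    A candidate must occur exactly once in `a` and at least once in `b`.
    With require_unique_in_b=True it must also occur exactly once in `b`,
    which is the stricter two-sided rule.
    """
    count_a = Counter(a)
    count_b = Counter(b)
    result = []
    for line in a:
        if count_a[line] != 1 or count_b[line] == 0:
            continue  # not unique in A, or not common to both sides
        if require_unique_in_b and count_b[line] != 1:
            continue  # common, but ambiguous on the B side
        result.append(line)
    return result

# Hypothetical input: "return conn;" is unique in A but appears twice in B.
a = ["return conn;", "}", "void close() {", "}"]
b = ["return conn;", "}", "return conn;", "}", "void close() {", "}"]

one_sided = split_candidates(a, b, require_unique_in_b=False)
two_sided = split_candidates(a, b, require_unique_in_b=True)
```

With the one-sided rule, "return conn;" is accepted as a split point even
though it is ambiguous in B, which is the kind of choice that can pair up
the wrong lines and give a less readable hunk.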

> Whether 4 or 16 seconds is important or not is another issue for this
> particular case. Some other systems would probably require hours of
> processing for this.

:-)

> Another surprise (to me) is that it seems JGit uses more than one core,
> i.e. it used 7.1 seconds of CPU time in 3.8 seconds, while it seems C
> Git only used one core since real time is larger than CPU time. Maybe it
> is just GC; I haven't investigated that.

Probably.  We're single-threaded the way you ran log.  There's no way
we're using more than one core for actual data processing.  I'm
surprised that even with GC activity we still used less CPU overall;
usually you pay a penalty for that.  We'd probably see the GC penalty
show up if we ported HistogramDiff.  :-)
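
The "more than one core" reading is just CPU time divided by wall-clock
time.  A minimal Python sketch of that arithmetic follows; the helper names
are hypothetical, and the measuring function times a Python callable, not a
JGit or C Git run.

```python
import time

def busy_cores(cpu_seconds, wall_seconds):
    """Average number of cores kept busy: CPU time over wall-clock time."""
    if wall_seconds <= 0:
        raise ValueError("wall_seconds must be positive")
    return cpu_seconds / wall_seconds

def measure(work):
    """Run `work()` and return (cpu_seconds, wall_seconds) for this process.

    time.process_time() sums CPU time over all threads of the process, so
    background threads can push CPU time above wall-clock time.
    """
    wall_start = time.perf_counter()
    cpu_start = time.process_time()
    work()
    cpu = time.process_time() - cpu_start
    wall = time.perf_counter() - wall_start
    return cpu, wall

# The numbers quoted above: 7.1 s of CPU time in 3.8 s of real time.
ratio = busy_cores(7.1, 3.8)
```

A ratio above 1.0 only says that some other thread was burning CPU at the
same time; it does not say which one, so background GC threads are
consistent with it.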

> ---JGit diff---
> @@ -410,6 +427,21 @@
>                conn.setReadTimeout(getTimeout() * 1000);
>                authMethod.configureRequest(conn);
>                return conn;
> +       }
...
>        }

Yea, I think this is because we aren't splitting on a unique point on
both sides, just on the A side.  It's ugly.  :-(

FWIW, C Git can sometimes produce the same-looking diff; the Myers O(ND)
diff algorithm also has cases where this is the result.  PatienceDiff
tries to avoid this by looking for the longest common sequence of lines
that are unique on both sides, and is often more successful than Myers or
HistogramDiff.  It's also slower.
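
For reference, here is a small Python sketch of that two-sided anchor step as
patience diff is commonly described.  It is not JGit's PatienceDiff code, and
the function name is made up: keep only lines that occur exactly once in both
A and B, then take the longest run of them that appears in the same order on
both sides.

```python
from bisect import bisect_left
from collections import Counter

def patience_anchors(a, b):
    """Return (index_in_a, index_in_b) pairs for lines unique in both
    sequences, keeping the longest subsequence ordered the same way in
    `a` and `b`."""
    count_a, count_b = Counter(a), Counter(b)
    pos_b = {line: j for j, line in enumerate(b) if count_b[line] == 1}
    pairs = [(i, pos_b[line]) for i, line in enumerate(a)
             if count_a[line] == 1 and line in pos_b]

    # Longest increasing subsequence over the B positions.
    tails = []      # tails[k]: index into pairs ending the best run of length k+1
    tail_vals = []  # B positions for tails, kept sorted for bisect
    prev = [None] * len(pairs)
    for idx, (_, j) in enumerate(pairs):
        k = bisect_left(tail_vals, j)
        prev[idx] = tails[k - 1] if k > 0 else None
        if k == len(tails):
            tails.append(idx)
            tail_vals.append(j)
        else:
            tails[k] = idx
            tail_vals[k] = j

    # Walk the back-pointers from the end of the longest run.
    result = []
    idx = tails[-1] if tails else None
    while idx is not None:
        result.append(pairs[idx])
        idx = prev[idx]
    return result[::-1]

# "x" moved from the front to the back; the other three lines keep their order.
anchors = patience_anchors(["x", "a", "b", "c"], ["a", "b", "c", "x"])
```

Each anchor then splits the files into the before/common/after regions to
recurse on.  The extra counting on the B side and the subsequence search are
part of why this approach costs more than the one-sided shortcut.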

-- 
Shawn.

