Skip to main content

[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index] [List Home]
Re: [jgit-dev] JGit low hanging fruit?

On Fri, Feb 11, 2011 at 21:54, Ketan Padegaonkar
<ketanpadegaonkar@xxxxxxxxx> wrote:
> I am looking to hack into some jgit and looked at bugzilla and gerrit
> for some low hanging fruit.

Did you find any low hanging fruit you want to work on?

> I flipped through the git internals ebook from peepcode
> (http://peepcode.com/products/git-internals-pdf) and looked at grit
> (https://github.com/schacon/grit/tree/master/lib/grit). Grit is a pure
> ruby git implementation in a few hundred lines of code, that maps
> nicely to git concepts. So I can say I understand a bit of some low
> level internals and how the filesystem works.
>
> I'm trying to map this understanding so far to JGit but have been
> unsuccessful so far. Are there any pointers that could help me get my
> hands dirty with JGit ?

Its complicated.  :-(

JGit used to use a simple layout like you might see in Grit. But we
couldn't get the performance I wanted. So I hacked at it until it ran
like it should, and often rivals the CGit implementation. But this
level of optimization was hard-won, I had to obfuscate some code in
the performance critical sections in order to get efficiency, and this
lost readability. FWIW, these similar areas of CGit aren't clean
either, and have even some more optimizations that weren't put into
JGit yet.

RevCommit and RevTag are the two JGit objects responsible for parsing
the commits and tags out of the repository for processing by an
application. However these cannot be parsed into a Java object in
isolation, they must be parsed as part of a RevWalk, which contains an
internal pool of objects that ensures reference equality when the same
object SHA-1 is presented to the RevWalk. These three classes are
JGit's "ORM-ish thingy" that handles most of the history traversal.
They aren't pretty, but they damn fast.

TreeWalk handles iterating over trees of any type, including the trees
that come from the repository. Its marginally fast. Its complicated
because it does a merge-join across the sorted trees its fed, but it
uses a really fast tree parser called CanonicalTreeParser that knows
how to scan through the records in a Git tree. NameConflictTreeWalk is
a subclass which has special (but expensive in some uncommon corner
cases) logic to handle merge-joins of N trees where a file or
directory might have changed type (e.g. converting a directory to be a
symlink to point to the new location). Neither TreeWalk class is
pretty because you cannot do random access of entries, you can only
iterate over them. And they aren't even standard Java Iterator style,
its a custom style more similar to java.sql.ResultSet in usage. This
is what allows it to be marginally fast; the merge performance isn't
what it should be, because often we are scanning an unsorted list from
the filesystem and need to sort it first, resulting in O(N log N) cost
rather than O(N). Its faster when the working directory isn't involved
(e.g. just two tree objects coming from the repository).


As for object storage, every Repository has an ObjectDatabase that
corresponds to the $GIT_DIR/objects directory within the Repository.
The standard implementation of ObjectDatabase is called
ObjectDirectory, and is horrifically complex. I cannot really follow
its logic most days, but its because there are rules about the order
we need to search for an object in in order to have any reasonable
performance, especially on systems with slow VFS layers in their
kernels like *cough* Win32... but even Linux benefits from these
rules, they just make the code nasty to dig through.

ObjectDirectory has a File that corresponds to the $GIT_DIR/objects
that it checks basically last, and also a list of PackFile. Each
PackFile corresponds to a pair of pack-$ID.pack and pack-$ID.idx files
in the $GIT_DIR/objects/pack directory. Each PackFile has a PackIndex,
which may occur in either the V1 or V2 formats... and there are two
different subclasses to handle the variations in the formats. Object
lookup proceeds by scanning the PackIndex of each PackFile for the
object, and then looking in the objects directory if it wasn't found.

JGit code must read objects using an ObjectReader, which can be
obtained either from the Repository, or from the ObjectDatabase. The
ObjectReader is that thread's connection to the database, and stores
the state the database uses on behalf of that thread to read objects.
Right now it has a tiny cache that tries to help the common case
RevWalk and TreeWalk see where they need to access about the same area
of the repository again on the next read request. Due to historical
stupidity, this implementation is called WindowCursor for the
storage.file implementation.

But really, most application code is working with a RevWalk or
TreeWalk, which maintain their own ObjectReader, or can borrow one
from the caller. If you need to read a blob to get file contents,
you'll need to use an ObjectReader's open() method to get the
ObjectLoader, which has the blob's contents.


Clear as mud?

-- 
Shawn.


Back to the top