Re: [jgit-dev] JGit DFS backend - has anyone tried to implement Cassandra?

On Thu, Jan 7, 2016 at 5:56 AM, Luca Milanesio <luca.milanesio@xxxxxxxxx> wrote:
> Hi Alex,
> thank you for your quick reply: as "you know someone" who did a real JGit
> DFS implementation ... and you believe is possible ... we get more
> confidence in starting this work.
> Should you have spare time to support us with answers or code, it will be
> really appreciated :-)

FWIW, Alex wrote his DFS implementation with almost no help from me. :)

> Cassandra is getting momentum for his ability of being scalable, very fast
> in read, distributed on single or multiple geographical zones,

How does current Cassandra do on the Jepsen tests?
https://aphyr.com/posts/294-jepsen-cassandra

> which would
> make it a perfect candidate for Gerrit.
> We may have a Cassandra expert helping us with this work ... and maybe
> someone from DataStax could help as well.
>
> Waiting for Shawn to wake up if he has some updates on his 5 years old post
> on this topic.

The Cassandra work I did 5 years ago was based on JGit DHT, which is a
different design. No database could keep up with JGit DHT, so I
abandoned that approach and deleted the code from JGit. JGit DFS was
the outcome of all of that.

_If_ you wanted to put everything into Cassandra, I would chunk pack
files into say 1 MiB chunks and store the chunks in individual rows.
This means configuring the DfsBlockCache using
DfsBlockCacheConfig.setBlockSize(1 * MB). When creating a new pack,
generate a random unique name for the DfsPackDescription and use that
name plus the block offset as the row key.
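
As a rough sketch of that row key scheme (not code from JGit or from
any real backend): the PackRowKey class, the ":" separator and the use
of a UUID for the pack name are all assumptions made for illustration.
Only the idea of "pack name plus block offset" comes from the text
above.

```java
import java.util.UUID;

/** Hypothetical helper: builds row keys of the form "<packName>:<blockOffset>". */
final class PackRowKey {
    /** 1 MiB, chosen to match the block size configured on the block cache. */
    static final int BLOCK_SIZE = 1024 * 1024;

    private PackRowKey() {}

    /** Offset of the first byte of the block that contains position. */
    static long blockStart(long position) {
        return (position / BLOCK_SIZE) * BLOCK_SIZE;
    }

    /** Row key for the block of the named pack that contains position. */
    static String forPosition(String packName, long position) {
        return packName + ":" + blockStart(position);
    }

    /** Random unique pack name; a UUID is an assumption, any unique id works. */
    static String newPackName() {
        return "pack-" + UUID.randomUUID();
    }
}
```

With this layout, every byte position in a pack maps to exactly one
row, and the key can be computed without a lookup.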

DfsOutputStream buffers 1 MiB of data in RAM and then passes that
buffer off as a row insert into Cassandra.
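
A minimal sketch of that buffering, using only java.io so it stands
alone: ChunkingOutputStream and BlockSink are hypothetical names, and
BlockSink.insert() is a placeholder for whatever Cassandra driver call
performs the row insert. A real backend would extend JGit's
DfsOutputStream instead of plain OutputStream.

```java
import java.io.IOException;
import java.io.OutputStream;
import java.util.Arrays;

/** Hypothetical sink standing in for a Cassandra row insert. */
interface BlockSink {
    void insert(String packName, long blockOffset, byte[] data) throws IOException;
}

/** Buffers up to blockSize bytes in RAM, then hands each full block to the sink. */
final class ChunkingOutputStream extends OutputStream {
    private final String packName;
    private final BlockSink sink;
    private final byte[] buf;
    private int used;          // bytes currently buffered
    private long blockOffset;  // pack offset of buf[0]

    ChunkingOutputStream(String packName, int blockSize, BlockSink sink) {
        this.packName = packName;
        this.sink = sink;
        this.buf = new byte[blockSize];
    }

    @Override
    public void write(int b) throws IOException {
        buf[used++] = (byte) b;
        if (used == buf.length) {
            flushBlock();
        }
    }

    @Override
    public void write(byte[] src, int off, int len) throws IOException {
        while (len > 0) {
            int n = Math.min(len, buf.length - used);
            System.arraycopy(src, off, buf, used, n);
            used += n;
            off += n;
            len -= n;
            if (used == buf.length) {
                flushBlock();
            }
        }
    }

    @Override
    public void close() throws IOException {
        if (used > 0) {
            flushBlock(); // final block may be shorter than blockSize
        }
    }

    private void flushBlock() throws IOException {
        sink.insert(packName, blockOffset, Arrays.copyOf(buf, used));
        blockOffset += used;
        used = 0;
    }
}
```

Only close() emits a short block, so every row except the last one in
a pack stays aligned to the block size.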

The DfsObjDatabase.openFile() method supplies a ReadableChannel that
is accessed in aligned blockSize units, so on 1 MiB boundaries. If your
row keys are the pack name and the offset of the first byte of the
block (so 0, 1048576, 2097152, ...), read() calls line up nicely with
row reads from Cassandra. The DfsBlockCache will smooth out
frequent calls for rows.
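
To illustrate how a positioned read maps onto row fetches, here is a
small sketch. BlockSource and BlockReader are hypothetical names, and
fetch() stands in for a single-row read from Cassandra; a real backend
would put this logic behind the ReadableChannel returned by
openFile() rather than in a standalone class.

```java
import java.nio.ByteBuffer;

/** Hypothetical source standing in for a single-row read; returns null if absent. */
interface BlockSource {
    byte[] fetch(String packName, long blockOffset);
}

/** Translates positioned reads into fetches of block-aligned rows. */
final class BlockReader {
    private final String packName;
    private final int blockSize;
    private final BlockSource source;

    BlockReader(String packName, int blockSize, BlockSource source) {
        this.packName = packName;
        this.blockSize = blockSize;
        this.source = source;
    }

    /** Copies bytes starting at position into dst; returns the count, or -1 at EOF. */
    int read(long position, ByteBuffer dst) {
        if (!dst.hasRemaining()) {
            return 0;
        }
        int total = 0;
        while (dst.hasRemaining()) {
            long start = (position / blockSize) * blockSize; // row key offset
            byte[] block = source.fetch(packName, start);
            int skip = (int) (position - start);
            if (block == null || skip >= block.length) {
                break; // past the end of the pack
            }
            int n = Math.min(dst.remaining(), block.length - skip);
            dst.put(block, skip, n);
            position += n;
            total += n;
        }
        return total == 0 ? -1 : total;
    }
}
```

When the caller reads exactly one aligned block at a time, as the
block cache does, the loop body runs once and costs one row read.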

Use another row in Cassandra to store the list of packs. The
listPacks() method then just loads that row. commitPacks() updates
that row by inserting some values and removing other values. What you
really want to store here is the pack name and the length so that you
can generate the row keys.
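
A sketch of the pack list idea, with an in-memory map standing in for
the single Cassandra row. The PackList class and its method signatures
are assumptions for illustration and do not match the real
DfsObjDatabase signatures, which work with DfsPackDescription objects.

```java
import java.util.ArrayList;
import java.util.Collection;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/** In-memory stand-in for the pack list row: pack name -> pack length in bytes. */
final class PackList {
    private final Map<String, Long> packs = new LinkedHashMap<>();

    /** Returns a snapshot of the current pack list. */
    synchronized Map<String, Long> listPacks() {
        return new LinkedHashMap<>(packs);
    }

    /** Drops the replaced packs and records the new ones in one step. */
    synchronized void commitPacks(Map<String, Long> added, Collection<String> removed) {
        packs.keySet().removeAll(removed);
        packs.putAll(added);
    }

    /** Generates every block row key for a pack from its name and length. */
    static List<String> rowKeys(String packName, long length, int blockSize) {
        List<String> keys = new ArrayList<>();
        for (long off = 0; off < length; off += blockSize) {
            keys.add(packName + ":" + off);
        }
        return keys;
    }
}
```

Storing the length alongside the name is what makes rowKeys()
possible, for example when deleting all the rows of a pack that has
been replaced.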

The reference API in DfsRefDatabase is simple. But I just committed a
change to JGit to allow other uses of RefDatabases. Because...


The new RefTree type[1] is part of a larger change set to allow
storing references inside of Git tree objects. (Git, in Git! Ahh the
recursion!) This may simplify things a little bit as we only really
need to store the pack and object data. Reference data is derived from
pack data.

[1] https://git.eclipse.org/r/62967

RefTree on its own is incomplete. I should get another few commits
uploaded today that provide a full RefDatabase around the RefTree
type. I have it coded and working, and am just finishing the unit
tests to verify it's working.


The longer term trend here is I'm doing some Git multi-master work
inside JGit now. RefTree is an important building block, but is far
from complete. $DAY_JOB is evolving our Git multi-master system for
$REASONS, and in the process trying to put support into JGit.

