Re: [jgit-dev] JGit DFS backend - has anyone tried to implement Cassandra?

On Fri, Jan 8, 2016 at 3:26 AM, Luca Milanesio <luca.milanesio@xxxxxxxxx> wrote:
>
>> On 7 Jan 2016, at 15:37, Shawn Pearce <spearce@xxxxxxxxxxx> wrote:
>>
>> On Thu, Jan 7, 2016 at 5:56 AM, Luca Milanesio <luca.milanesio@xxxxxxxxx> wrote:
>>
>>> Cassandra is gaining momentum for its ability to scale, to be very fast
>>> on reads, and to be distributed across single or multiple geographical zones,
>>
>> How does current Cassandra do on the Jepsen tests?
>> https://aphyr.com/posts/294-jepsen-cassandra

Did you read this ^^ ?

> Actually Cassandra chooses to let the last write win: if you have client A and client B in two different zones writing the same key at exactly the same time, whichever comes last (in Cassandra's timing) will be written to disk.

Last time I looked, Cassandra's timing is just the wall/system clock
of the computer it's running on. If you have 5 machines, even in the
same rack of the same data center with NTP, their times will be off by
at least a few milliseconds. It's a well-known concept in distributed
computing that you cannot rely on the clock. This is why vector clocks
came about. And then of course Google went and proved otherwise with
Spanner[2]. But Cassandra does not incorporate what Spanner does to
make system clocks work.

[2] http://research.google.com/archive/spanner.html
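
To make the failure mode concrete, here is a toy illustration (plain
Java, not Cassandra code) of two replicas with skewed clocks applying
last-write-wins on the same key; the write that physically happened
last loses because its node's clock runs behind:

  public class LwwSkewDemo {
    static long valueTimestamp = Long.MIN_VALUE;
    static String value;

    // Apply a write only if its timestamp is newer (last-write-wins).
    static void applyWrite(String newValue, long writeTimestampMicros) {
      if (writeTimestampMicros > valueTimestamp) {
        valueTimestamp = writeTimestampMicros;
        value = newValue;
      }
    }

    public static void main(String[] args) {
      long now = System.currentTimeMillis() * 1000; // microseconds
      long skewA = 0;      // node A's clock is accurate
      long skewB = -5_000; // node B's clock runs 5 ms behind

      applyWrite("CA", now + skewA);           // A writes at time t
      applyWrite("CB", (now + 2_000) + skewB); // B writes at t + 2 ms

      // B wrote later in real time, but its skewed timestamp is older,
      // so A's value survives.
      System.out.println(value); // prints "CA"
    }
  }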

> I was thinking about Cassandra for extensibility and scalability of storage in the *same* geographical zone (local consistency), and Cassandra serves that scenario well and quickly.
>
> If we want to push Cassandra to the global consistency level across remote geo-locations without a powerful dedicated CDN, it wouldn't work in practice as the introduced latency for the consistency checks would be too high.
>
> Another approach is to "endorse inconsistency" and manage it.
>
> Let's say that client A wants to push commit CA to the master branch and, in the same millisecond, client B in a different zone wants to push commit CB to the same master branch of the same project.
> CA and CB, assuming that they are different commits with different content, will have different SHA-1s for their blobs, trees and commit objects: we then have no conflict from a Cassandra perspective.
> The trouble comes when we want to update the ref: will the new master point to CA or CB?
>
> A solution could be: append both and let the read operation resolve the conflict.
> If you append both CA and CB, clients A and B will find that after the push there are two values appended instead of one => they can then treat this case as a conflict and retry the operation.

You may not be able to see CA yet at the end of the CB push. How long
does client B wait while polling to see if a CA is created?

What if the timestamps of CA and CB differ by only a couple of
milliseconds? From the point of view of Cassandra, CB wins because its
timestamp is 2 milliseconds later than CA's. But from the point of view
of the users, CA and CB happened at the same time. Humans can't
really perceive a 2 millisecond difference. Heck, that 2 millisecond
gap could have just been introduced by a kernel scheduling hiccup
swapping one thread off a core to run some other thread.

> As both CA and CB already have all their objects pushed (only the ref wasn't updated), their second push attempt will be very quick. The retry wouldn't then create too much trouble for either.

For that 2nd push to be really quick you need the ls-remote
advertisement during push to include both CA and CB. But only one value
can be shown on the reference. So the other will have to be hidden in a
".have" line.
Entirely doable, but requires a bit more coding. You have to override
RefDatabase.getAdditionalRefs() and return values there so they can be
folded into the ".have" advertisement. IIRC Gerrit ignores this list
by default, so you may also have to tweak Gerrit.
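
A rough sketch of that override (the pendingRefValues() lookup is a
made-up placeholder for however you store the extra appended values;
the rest is plain JGit API):

  import java.io.IOException;
  import java.util.ArrayList;
  import java.util.List;
  import java.util.Map;

  import org.eclipse.jgit.lib.ObjectId;
  import org.eclipse.jgit.lib.ObjectIdRef;
  import org.eclipse.jgit.lib.Ref;

  // In your RefDatabase implementation: expose the "losing" values of
  // conflicted refs so they can be folded into the ".have" advertisement.
  public abstract class ExampleRefDatabase {
    // Hypothetical: extra values currently appended to conflicted refs.
    protected abstract Map<String, ObjectId> pendingRefValues() throws IOException;

    public List<Ref> getAdditionalRefs() throws IOException {
      List<Ref> additional = new ArrayList<>();
      for (Map.Entry<String, ObjectId> e : pendingRefValues().entrySet()) {
        // Hidden names keep these out of the normal ref namespace.
        String name = "refs/pending/" + e.getKey();
        additional.add(new ObjectIdRef.Unpeeled(Ref.Storage.PACKED, name, e.getValue()));
      }
      return additional;
    }
  }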

> Do you (or Alex) foresee problems with this approach?

Yes. :)

>> _If_ you wanted to put everything into Cassandra, I would chunk pack
>> files into say 1 MiB chunks and store the chunks in individual rows.
>> This means configuring the DfsBlockCache using
>> DfsBlockCacheConfig.setBlockSize(1 * MB). When creating a new pack
>> generate a random unique name for the DfsPackDescription and use that
>> name and the block offset as the row key.
>>
>> DfsOutputStream buffers 1 MiB of data in RAM and then passes that
>> buffer off as a row insert into Cassandra.
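
In rough code, the write side boils down to buffering one block and
inserting it as a row (untested sketch; RowStore is a made-up stand-in
for whatever Cassandra client you use):

  import java.nio.ByteBuffer;

  // Collect blockSize bytes, then store them as one row keyed by
  // (packName, offsetOfFirstByte). This is what a DfsOutputStream
  // subclass would do internally.
  class ChunkedPackWriter {
    static final int BLOCK_SIZE = 1024 * 1024; // match DfsBlockCacheConfig.setBlockSize()

    private final String packName;   // random unique name from the DfsPackDescription
    private final RowStore rowStore; // hypothetical Cassandra wrapper
    private final byte[] buffer = new byte[BLOCK_SIZE];
    private int buffered;
    private long packOffset; // offset of buffer[0] within the pack stream

    ChunkedPackWriter(String packName, RowStore rowStore) {
      this.packName = packName;
      this.rowStore = rowStore;
    }

    void write(byte[] data, int off, int len) {
      while (len > 0) {
        int n = Math.min(len, BLOCK_SIZE - buffered);
        System.arraycopy(data, off, buffer, buffered, n);
        buffered += n;
        off += n;
        len -= n;
        if (buffered == BLOCK_SIZE)
          flushBlock();
      }
    }

    void close() {
      if (buffered > 0)
        flushBlock(); // the last block may be short
    }

    private void flushBlock() {
      // Row key is (packName, offset); offsets are 0, 1048576, 2097152, ...
      rowStore.insert(packName, packOffset, ByteBuffer.wrap(buffer, 0, buffered));
      packOffset += buffered;
      buffered = 0;
    }

    interface RowStore { // hypothetical
      void insert(String packName, long offset, ByteBuffer data);
    }
  }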
>>
>> The DfsObjDatabase.openFile() method supplies a ReadableChannel that
>> is accessed in aligned blockSize units, so 1 MB alignments. If your
>> row keys are the pack name and the offset of the first byte of the
>> block (so 0, 1048576, 2097152, ...) read method calls nicely line up
>> to row reads from Cassandra. The DfsBlockCache will smooth out
>> frequent calls for rows.
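
The matching read side can implement ReadableChannel so that each
aligned read maps to exactly one row (sketch; RowStore is again a
made-up stand-in for the Cassandra client):

  import java.io.IOException;
  import java.nio.ByteBuffer;

  import org.eclipse.jgit.internal.storage.dfs.ReadableChannel;

  class ChunkedPackChannel implements ReadableChannel {
    static final int BLOCK_SIZE = 1024 * 1024;

    private final String packName;
    private final long packLength;   // stored next to the pack name in the pack list
    private final RowStore rowStore; // hypothetical Cassandra wrapper
    private long position;
    private boolean open = true;

    ChunkedPackChannel(String packName, long packLength, RowStore rowStore) {
      this.packName = packName;
      this.packLength = packLength;
      this.rowStore = rowStore;
    }

    @Override public long position() { return position; }
    @Override public void position(long newPosition) { position = newPosition; }
    @Override public long size() { return packLength; }
    @Override public int blockSize() { return BLOCK_SIZE; }
    @Override public void setReadAheadBytes(int bytes) { /* optionally prefetch rows */ }
    @Override public boolean isOpen() { return open; }
    @Override public void close() { open = false; }

    @Override
    public int read(ByteBuffer dst) throws IOException {
      if (position >= packLength)
        return -1;
      long rowOffset = (position / BLOCK_SIZE) * BLOCK_SIZE; // aligned row key
      ByteBuffer row = rowStore.fetch(packName, rowOffset);
      row.position((int) (position - rowOffset));
      int n = Math.min(dst.remaining(), row.remaining());
      for (int i = 0; i < n; i++)
        dst.put(row.get());
      position += n;
      return n;
    }

    interface RowStore { // hypothetical
      ByteBuffer fetch(String packName, long offset) throws IOException;
    }
  }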
>>
>> Use another row in Cassandra to store the list of packs. The
>> listPacks() method then just loads that row. commitPacks() updates
>> that row by inserting some values and removing other values. What you
>> really want to store here is the pack name and the length so that you
>> can generate the row keys.
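
The pack list row itself can be tiny, e.g. one line of "name length"
per pack (sketch; the fixed row key and the encoding are arbitrary
choices, and turning the parsed entries back into DfsPackDescription
objects is left out):

  import java.util.ArrayList;
  import java.util.List;

  // One well-known row holds the current pack list. listPacks() reads
  // it; commitPacks() rewrites it with new packs added and replaced
  // packs removed.
  class PackListRow {
    static final String ROW_KEY = "pack-list"; // hypothetical fixed key

    static String encode(List<String[]> packs) { // each entry: {name, length}
      StringBuilder sb = new StringBuilder();
      for (String[] p : packs)
        sb.append(p[0]).append(' ').append(p[1]).append('\n');
      return sb.toString();
    }

    static List<String[]> decode(String row) {
      List<String[]> packs = new ArrayList<>();
      for (String line : row.split("\n"))
        if (!line.isEmpty())
          packs.add(line.split(" ")); // the length lets you derive the row keys
      return packs;
    }
  }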
>>
>> Reference API in DfsRefDatabase is simple. But I just committed a
>> change to JGit to allow other uses of RefDatabases. Because...
>>
>>
>> The new RefTree type[1] is part of a larger change set to allow
>> storing references inside of Git tree objects. (Git, in Git! Ahh the
>> recursion!) This may simplify things a little bit as we only really
>> need to store the pack and object data. Reference data is derived from
>> pack data.
>>
>> [1] https://git.eclipse.org/r/62967
>>
>> RefTree on its own is incomplete. I should get another few commits
>> uploaded today that provide a full RefDatabase around the RefTree
>> type. I have it coded and working; I'm just finishing the unit tests
>> that verify it's working.
>>
>>
>> The longer term trend here is I'm doing some Git multi-master work
>> inside JGit now. RefTree is an important building block, but is far
>> from complete. $DAY_JOB is evolving our Git multi-master system for
>> $REASONS, and in the process trying to put support into JGit.
>
> Thanks for the suggestion, that would definitely work.
>
> If we choose 50MiB for a Cassandra row (which is still very acceptable),

Be careful about that. If the Cassandra server and client treat that
as a single byte array 50 MiB in size, moving that around is going to
be very, very slow. Latency to the first byte, for example, is about the
same as latency to the last byte, as the entire array has to be loaded in
from Cassandra's storage files, then put into the wire protocol
buffer, then read in by the client library in the JGit process before
JGit can see any of the data.

If you use smaller chunks (e.g. 1 MiB) then latency for the first MiB
is far lower. You spread the latency out over the chunks of the file.
That may help if you don't need very much of the file. E.g. to compute
a diff between two commits for Gerrit you may only need 512 KiB worth
of data, but it's spread across 3 parts of the pack file. At 50 MiB per
chunk you may have to load 50 MiB to use 512 KiB. At 1 MiB per chunk
you may load 3 MiB to use 512 KiB. Latency here to move 3 MiB is
probably going to be lower than to move 50 MiB.

This is one reason our DFS system at $DAY_JOB actually uses a block
size of 64 KiB. Small reads are common in things like Gerrit and
Gitiles because the entire repository is not being accessed.
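
If you go with small blocks, the cache is configured once per JVM and
the block size must match the chunk size you store (sizes here are just
illustrative):

  import org.eclipse.jgit.internal.storage.dfs.DfsBlockCache;
  import org.eclipse.jgit.internal.storage.dfs.DfsBlockCacheConfig;

  public class CacheSetup {
    public static void main(String[] args) {
      DfsBlockCacheConfig cfg = new DfsBlockCacheConfig();
      cfg.setBlockSize(64 * 1024);           // 64 KiB blocks, same as the storage rows
      cfg.setBlockLimit(256L * 1024 * 1024); // keep roughly 256 MiB of hot blocks in RAM
      DfsBlockCache.reconfigure(cfg);
    }
  }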

During clone we catch this and increase our read size considerably
using the setReadAheadBytes() method of JGit's dfs.ReadableChannel
interface. We use the advice to just start async loading larger
amounts of data, but the small chunk size means we can start making
data available to the Git client almost immediately rather than
waiting for 50 MiB to come into our server process.
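
Honoring the read-ahead hint can be as simple as firing off async row
fetches for the next few aligned offsets (sketch; fetchRowAsync is a
made-up method on the hypothetical row store):

  import java.nio.ByteBuffer;
  import java.util.concurrent.CompletableFuture;

  // Called from the ReadableChannel when a large setReadAheadBytes()
  // hint arrives (e.g. during clone): start loading the next rows
  // before they are actually requested.
  class ReadAheadHelper {
    static void prefetch(String packName, long position, int readAheadBytes,
        int blockSize, RowStore rowStore) {
      long first = (position / blockSize) * blockSize;
      for (long off = first; off < first + readAheadBytes; off += blockSize)
        rowStore.fetchRowAsync(packName, off); // fire-and-forget; cache the futures
    }

    interface RowStore { // hypothetical
      CompletableFuture<ByteBuffer> fetchRowAsync(String packName, long offset);
    }
  }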

> the number of packs across multiple rows will be quite limited anyway.
> It shouldn't then create a significant performance penalty.
>
> One of the reasons for choosing Cassandra is to make the Git storage virtually "unlimited" with zero downtime.
> Imagine that I now have 10 Cassandra nodes with 10 TB capacity ... if the volume increases a lot, it would be very easy to add more nodes and get extra storage without a significant performance penalty on reads or writes.

True, IIRC you can grow a Cassandra cluster while it is online, so you
can add capacity with no impact.

