Re: [jgit-dev] JGit DFS backend - has anyone tried to implement Cassandra?

Here is my 2c on getting Cassandra to work when updating a ref; see below.

On Friday, January 8, 2016 at 6:19:50 PM UTC, lucamilanesio wrote:

> On 8 Jan 2016, at 16:28, Shawn Pearce <spearce@xxxxxxxxxxx> wrote:
>
> On Fri, Jan 8, 2016 at 3:26 AM, Luca Milanesio <luca.milanesio@xxxxxxxxx> wrote:
>>
>>> On 7 Jan 2016, at 15:37, Shawn Pearce <spearce@xxxxxxxxxxx> wrote:
>>>
>>> On Thu, Jan 7, 2016 at 5:56 AM, Luca Milanesio <luca.milanesio@xxxxxxxxx> wrote:
>>>
>>>> Cassandra is gaining momentum for its ability to scale, its very fast
>>>> reads, and its distribution across single or multiple geographical zones,
>>>
>>> How does current Cassandra do on the Jepsen tests?
>>> https://aphyr.com/posts/294-jepsen-cassandra
>
> Did you read this ^^ ?

Yes I did, and I discarded the idea of updating cells :-) You cannot reliably know who was the last to update the cell: you may think it was you (on your replica), but in fact someone else on a replica on the other side of the planet can think the same thing ... and at the end of the day only Cassandra knows who won, not you, which isn't nice as one of the two has lost their write.

I am not concerned about Cassandra losing data on updates, as long as the client of that write has a way to detect the situation and try again.
The problem with cell updates is that you can never know whether you lost your data or not :-(

>
>> Actually Cassandra chooses to let the last write win: if you have client A and client B in two different zones writing exactly the same key at exactly the same time, whoever comes last (in Cassandra's timing) will be written to disk.
>
> Last time I looked, Cassandra's timing is just the wall/system clock
> of the computer it's running on. If you have 5 machines, even in the
> same rack of the same data center with NTP, their times will be off by
> at least a few milliseconds. It's a well-known concept in distributed
> computing that you cannot rely on the clock. This is why vector clocks
> came about. And then of course Google went and proved otherwise with
> Spanner[2]. But Cassandra does not incorporate what Spanner does to
> make system clocks work.
>
> [2] http://research.google.com/archive/spanner.html

Agreed, that's bad and makes the updates unreliable.

>
>> I was thinking about Cassandra for extensibility and scalability of storage in the *same* geographical zone (local consistency); Cassandra serves that scenario well and quickly.
>>
>> If we want to push Cassandra to the global consistency level across remote geo-locations without a powerful dedicated CDN, it wouldn't work in practice as the introduced latency for the consistency checks would be too high.
>>
>> Another approach is to "endorse inconsistency" and manage it.
>>
>> Let's say that client A wants to push commit CA to the master branch and, in the same millisecond, client B wants to push commit CB to the same master branch of the same project from a different zone.
>> CA and CB, assuming they are different commits with different content, will have different SHA1s for their BLOBs, Trees and commit objects: we then don't have any conflict from a Cassandra perspective.
>> The trouble comes when we want to update the ref: will the new master point to CA or CB?
>>
>> A solution could be: append both and let the read operation resolve the conflict.
>> If we append both CA and CB, client A and client B will find that after the push there are two values appended instead of _one_ => they can then treat this case as a conflict and retry the operation.
>
> You may not be able to see CA yet at the end of the CB push. How long
> does client B wait while polling to see if a CA is created?

The approach I was proposing was for geographically co-located Cassandra instances, for disk-space scalability: it wouldn't work if the nodes are in completely different locations with very high latency or even moments of temporary disconnection.

I need to investigate Cassandra further to understand what it does in these conditions: will the nodes just be dropped from the cluster? How quickly does that happen? That needs a bit of investigation ...


My proposal would be the following:
- Keep two keys on Cassandra for each ref: <ref> and <ref>.updates
- Always append data to the <ref>.updates collection of values: we then do not have the update-loss problem
- Each element of the <ref>.updates collection contains a pair (old-sha1, new-sha1)
- During a read, we get all the values in the <ref>.updates collection and "navigate" them from the initial SHA1 of <ref> (see the sketch below)
- When we reach the end of the chain => that's the final tip of the ref

A collision is when more than one pair in the <ref>.updates collection has the same old-sha1.
I now need to run some intensive Cassandra write tests to see if the above algorithm works ... but it should in theory :-)

The <ref>.updates collection can be resolved, compacted, and reduced to <ref> at GC time.
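To make the read-time resolution concrete, here is a rough Java sketch of the algorithm. RefUpdateEntry and the list of pairs are hypothetical stand-ins for whatever we actually read back from the <ref>.updates collection:

    import java.util.HashMap;
    import java.util.List;
    import java.util.Map;

    class RefUpdateEntry {
      final String oldSha1;
      final String newSha1;

      RefUpdateEntry(String oldSha1, String newSha1) {
        this.oldSha1 = oldSha1;
        this.newSha1 = newSha1;
      }
    }

    class RefTipResolver {
      // Walks the (old-sha1, new-sha1) pairs starting from the base value
      // of <ref>. Returns the final tip, or null when a collision is
      // detected (two entries sharing the same old-sha1 => "no winner").
      static String resolveTip(String base, List<RefUpdateEntry> updates) {
        Map<String, String> byOld = new HashMap<>();
        for (RefUpdateEntry e : updates) {
          if (byOld.put(e.oldSha1, e.newSha1) != null) {
            return null; // two pushes raced from the same parent: conflict
          }
        }
        String tip = base;
        while (byOld.containsKey(tip)) {
          tip = byOld.remove(tip); // remove as we go so a cycle cannot loop forever
        }
        return tip;
      }
    }

At GC time the same walk can be run once, <ref> set to the resolved tip, and the consumed entries dropped from <ref>.updates.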


>
> What if the timestamps of CA and CB differ by only a couple of
> milliseconds? From the point of view of Cassandra, CB wins because its
> timestamp is 2 milliseconds later than CA's. But from the point of view
> of the users, CA and CB happened at the same time. Humans can't
> really perceive a 2 millisecond difference. Heck, that 2 millisecond
> difference could have just been introduced by a kernel scheduling
> hiccup swapping one thread off a core to run some other thread.

If both add to the same collection, Cassandra allows the two objects to be added but cannot guarantee their order, as ordering is not reliable.
The decision of "who won" cannot be based on "the last one", as both will be there in the collection. It will need to be resolved at read time, by making a decision on who is to be considered the winner.

Let's put it this way: the "winner" will eventually be elected somehow amongst the commits that ended up in the collection. We may even decide that there is "no winner" if the collection contains more than one element.
There must be a moment in time when the collection is compacted to a single element: that is the moment when you read it and declare the "winner" or "no winners".

In the above example:
For the cell project.master that contains the SHA1 of the master branch, there will be a collection associated with project.master.pushed.
In project.master.pushed there will be both CA and CB.

If we use the "no winners" logic, this means that both CA and CB will be considered failed.
The problem is the cleanup or compaction logic and when to run it ... I still have no answer or thoughts on that :-(

It could even be that Cassandra is the right tool for storing Git objects and packs but NOT AT ALL for refs, and I may need to use something else.
To be honest with you, my current scalability problems are with the Git objects (BLOBs, Trees, Commits) and not with the refs. It would still solve my problem if I chose to use Cassandra for objects and something else for refs.

>
>> As both CA and CB have all their objects already pushed (only the ref wasn't updated), their second push attempt will be very quick. The retry wouldn't then create too much trouble for either client.
>
> For that 2nd push to be really quick you need the ls-remotes during
> push to advertise both CA and CB. But only one value can be shown on
> the reference. So the other will have to be hidden in a ".have" line.
> Entirely doable, but requires a bit more coding. You have to override
> RefDatabase.getAdditionalRefs() and return values there so they can be
> folded into the ".have" advertisement. IIRC Gerrit ignores this list
> by default, so you may also have to tweak Gerrit.

Good to know; more work on the horizon then :-(
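For my own notes, this is how I read the suggestion, as a hedged sketch: lookupPendingNewSha1s() is a hypothetical accessor for the new-sha1 values sitting in <ref>.updates, and the refs/pending/ naming is made up purely for illustration:

    import java.io.IOException;
    import java.util.ArrayList;
    import java.util.Collection;
    import java.util.List;
    import org.eclipse.jgit.lib.ObjectId;
    import org.eclipse.jgit.lib.ObjectIdRef;
    import org.eclipse.jgit.lib.Ref;

    abstract class CassandraRefAdvertiser {
      // Hypothetical: reads the new-sha1 values appended to <ref>.updates.
      abstract Collection<String> lookupPendingNewSha1s(String refName)
          throws IOException;

      // Refs returned from RefDatabase.getAdditionalRefs() can be folded
      // into ".have" advertisement lines, so a client whose push lost the
      // race sees its objects already on the server and retries cheaply.
      public List<Ref> getAdditionalRefs() throws IOException {
        List<Ref> additional = new ArrayList<>();
        for (String sha1 : lookupPendingNewSha1s("refs/heads/master")) {
          additional.add(new ObjectIdRef.Unpeeled(
              Ref.Storage.NETWORK, "refs/pending/" + sha1,
              ObjectId.fromString(sha1)));
        }
        return additional;
      }
    }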

>
>> Do you (or Alex) foresee problems with this approach?
>
> Yes. :)

That's the type of feedback I needed :) You guys have done this before me, and "knowledge review" is always as good as "code review"!

>
>>> _If_ you wanted to put everything into Cassandra, I would chunk pack
>>> files into say 1 MiB chunks and store the chunks in individual rows.
>>> This means configuring the DfsBlockCache using
>>> DfsBlockCacheConfig.setBlockSize(1 * MB). When creating a new pack
>>> generate a random unique name for the DfsPackDescription and use that
>>> name and the block offset as the row key.
>>>
>>> DfsOutputStream buffers 1 MiB of data in RAM and then passes that
>>> buffer off as a row insert into Cassandra.
>>>
>>> The DfsObjDatabase.openFile() method supplies a ReadableChannel that
>>> is accessed in aligned blockSize units, so 1 MB alignments. If your
>>> row keys are the pack name and the offset of the first byte of the
>>> block (so 0, 1048576, 2097152, ...) read method calls nicely line up
>>> to row reads from Cassandra. The DfsBlockCache will smooth out
>>> frequent calls for rows.
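
To check my understanding of the row-key scheme, a quick sketch of how an aligned read maps to rows; fetchRow is a stand-in for whatever Cassandra client call ends up doing the lookup, and the requested range is assumed to lie within the pack:

    import java.util.function.BiFunction;

    class ChunkedPackReader {
      // Must match DfsBlockCacheConfig.setBlockSize(1 * MB).
      static final int BLOCK_SIZE = 1024 * 1024;

      // Reads len bytes starting at pos by fetching the 1 MiB rows that
      // cover the range. Row keys are (packName, alignedOffset) with the
      // offset at 0, 1048576, 2097152, ...
      static byte[] read(String packName, long pos, int len,
                         BiFunction<String, Long, byte[]> fetchRow) {
        byte[] out = new byte[len];
        int copied = 0;
        while (copied < len) {
          long aligned = ((pos + copied) / BLOCK_SIZE) * BLOCK_SIZE;
          byte[] row = fetchRow.apply(packName, aligned);
          int inRow = (int) (pos + copied - aligned);
          int n = Math.min(len - copied, row.length - inRow);
          System.arraycopy(row, inRow, out, copied, n);
          copied += n;
        }
        return out;
      }
    }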
>>>
>>> Use another row in Cassandra to store the list of packs. The
>>> listPacks() method then just loads that row. commitPacks() updates
>>> that row by inserting some values and removing other values. What you
>>> really want to store here is the pack name and the length so that you
>>> can generate the row keys.
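
And the pack-list row as I read it, with an in-memory map standing in for the single Cassandra row; storing name -> length is enough to regenerate the block row keys without another lookup:

    import java.util.HashMap;
    import java.util.Map;

    class PackListRow {
      // Stand-in for the single Cassandra row holding the pack list.
      private final Map<String, Long> packs = new HashMap<>();

      // listPacks(): just load the row.
      synchronized Map<String, Long> listPacks() {
        return new HashMap<>(packs);
      }

      // commitPacks(): insert the new packs, remove the replaced ones.
      synchronized void commitPacks(Map<String, Long> add,
                                    Iterable<String> remove) {
        packs.putAll(add);
        for (String name : remove) {
          packs.remove(name);
        }
      }
    }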
>>>
>>> Reference API in DfsRefDatabase is simple. But I just committed a
>>> change to JGit to allow other uses of RefDatabases. Because...
>>>
>>>
>>> The new RefTree type[1] is part of a larger change set to allow
>>> storing references inside of Git tree objects. (Git, in Git! Ahh the
>>> recursion!) This may simplify things a little bit as we only really
>>> need to store the pack and object data. Reference data is derived from
>>> pack data.
>>>
>>> [1] https://git.eclipse.org/r/62967
>>>
>>> RefTree on its own is incomplete. I should get another few commits
>>> uploaded today that provide a full RefDatabase around the RefTree
>>> type. I have it coded and working; just finishing the unit tests to
>>> verify it's working.
>>>
>>>
>>> The longer term trend here is I'm doing some Git multi-master work
>>> inside JGit now. RefTree is an important building block, but is far
>>> from complete. $DAY_JOB is evolving our Git multi-master system for
>>> $REASONS, and in the process trying to put support into JGit.
>>
>> Thanks for the suggestion, that would definitely work.
>>
>> If we choose 50 MiB for a Cassandra row (which is still very acceptable),
>
> Be careful about that. If the Cassandra server and client treat that
> as a single byte array 50 MiB in size, moving that around is going to
> be very very slow. Latency to first byte for example is about the same
> as latency for the last byte, as the entire array has to be loaded in
> from Cassandra's storage files, then put into the wire protocol
> buffer, then read in by the client library in the JGit process before
> JGit can see any of the data.
>
> If you use smaller chunks (e.g. 1 MiB) then latency for the first MiB
> is far lower. You spread the latency out over the chunks of the file,
> which may help if you don't need very much of the file. E.g. to compute
> a diff between two commits for Gerrit you may only need 512 KiB worth
> of data, but it's spread around 3 parts of the pack file. At 50 MiB per
> chunk you may have to load 50 MiB to use 512 KiB. At 1 MiB per chunk
> you may load 3 MiB to use 512 KiB. Latency here to move 3 MiB is
> probably going to be lower than to move 50 MiB.
>
> This is one reason our DFS system at $DAY_JOB actually uses a block
> size of 64 KiB. Small reads are common in things like Gerrit and
> Gitiles because the entire repository is not being accessed.

Good to know; it's true that even if I used Cassandra as a "distributed disk for JGit", it would still need to go through the network, and latency is then going to be an issue.
What is acceptable and very fast on your local disk (maybe even an SSD) may take much longer over a network of Cassandra nodes, maybe even geographically distributed!

>
> During clone we catch this and increase our read size considerably
> using the setReadAheadBytes() method of JGit's dfs.ReadableChannel
> interface. We use the advice to just start async loading larger
> amounts of data, but the small chunk sizes mean we can start making
> data available to the Git client almost immediately rather than
> waiting for 50 MiB to come into our server process.
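
Noted. So on a sequential scan such as a clone the backend could be hinted roughly like this; the 8 MiB figure is my own illustration, not a number from this thread:

    import java.io.IOException;
    import org.eclipse.jgit.internal.storage.dfs.ReadableChannel;

    class CloneReadAhead {
      // Hint the channel to start async-loading data ahead of the reader;
      // with small blocks the first bytes can stream to the Git client
      // immediately instead of waiting for one huge chunk to arrive.
      static void adviseSequentialScan(ReadableChannel ch) throws IOException {
        ch.setReadAheadBytes(8 * 1024 * 1024); // illustrative value only
      }
    }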
>
>> the number of packs across multiple rows will be quite limited anyway.
>> It shouldn't then create a significant performance penalty.
>>
>> One of the reasons for choosing Cassandra is to make the Git storage virtually "unlimited" with zero downtime.
>> Imagine that I have now 10 Cassandra nodes with 10 TB capacity ... if the volume increases a lot, it would be very easy to add more nodes and get extra storage without significant performance penalty in read or write.
>
> True, IIRC you can grow a Cassandra cluster while it is online, so you
> can add capacity with no impact.

