
Re: [jgit-dev] Distributed Git Server using JGit



On 16 May 2018, at 22:01, Matthias Sohn <matthias.sohn@xxxxxxxxx> wrote:

On Wed, May 16, 2018 at 10:12 PM, Mincong Huang <mincong.h@xxxxxxxxx> wrote:
Hi,

I'm creating a Git server, and I'd like to use JGit as the implementation. JGit
contains a module called `org.eclipse.jgit.http.server`, which makes this
easy via GitServlet[1]. However, I need the Git server to be clustered
to provide a scalable solution. I have two possible solutions, and I'd like
your opinions on them.

Solution 1: N GitServlets + 1 NFS
Use N Git servlets that share the same network filesystem. Each server
points to the same file system on the network. This solution is used by GitLab.
Personally, I'm afraid of concurrent file access to a Git repository, which could
lead to data corruption. According to this post[2], Git has mechanisms to protect
itself, e.g. using the index lock. But a bare Git repository does not have an index,
right? I'm confused.
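
On the locking question: a bare repository has no index, but Git also guards ref updates with `<ref>.lock` files. The lock file is created exclusively, and the new value is then renamed into place. Below is a minimal sketch of that convention using only the Java standard library. The class and method names (`RefLockSketch`, `updateRef`) are hypothetical and this is not JGit's actual code. Whether exclusive file creation is truly atomic on a given NFS setup depends on the NFS version and client, so treat this as an illustration of the idea rather than a safety guarantee.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.FileAlreadyExistsException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardCopyOption;

/** Hypothetical sketch of the "<ref>.lock" convention; not JGit's implementation. */
public class RefLockSketch {

    /**
     * Tries to write newObjectId into refFile.
     * Returns true on success, false if another writer already holds the lock.
     */
    static boolean updateRef(Path refFile, String newObjectId) throws IOException {
        Path lock = refFile.resolveSibling(refFile.getFileName() + ".lock");
        try {
            // Exclusive create: fails if the lock file already exists.
            Files.createFile(lock);
        } catch (FileAlreadyExistsException e) {
            return false; // someone else is updating this ref
        }
        try {
            // Write the new value into the lock file first...
            Files.write(lock, (newObjectId + "\n").getBytes(StandardCharsets.UTF_8));
            // ...then rename it over the ref, so readers never see a partial write.
            // Replacing an existing target with ATOMIC_MOVE is platform-dependent;
            // POSIX file systems replace it.
            Files.move(lock, refFile, StandardCopyOption.ATOMIC_MOVE);
            return true;
        } finally {
            // No-op after a successful rename; cleans up the lock if the write failed.
            Files.deleteIfExists(lock);
        }
    }
}
```

With this scheme, two servers racing to update the same ref cannot both succeed: the loser sees the existing lock file and has to retry or report an error.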

Solution 2: N GitServlets + N DfsRepository + KeyValue DB
JGit provides an abstract class `DfsRepository`[3] to create a DFS repository.
This solution is used by Palantir[4] and Google[5], where data is stored in a
distributed database. I think this solution is for big companies and requires a
complex setup. I'm not confident that I could implement DfsRepository correctly
and maintain an extra DB.

My implementation will host thousands of repositories, but only a few of
them are actively used. Therefore, concurrent access should be very limited.

I'd like to hear your comments on this subject.

Thanks,
Mincong


For option 1 I'd recommend you give Gerrit with its high-availability plugin a try
and, if you face issues, collaborate with the Gerrit community to improve this solution
instead of starting your own implementation, which is certainly more work.

For option 2 you may consider joining an initiative started by Luca Milanesio, one
of the Gerrit maintainers, to build an open source implementation of the
JGit DFS API on Cassandra. The current PoC patch series is maintained here

I don't get why you need a scalable server if only a few of your thousands of
repositories are actively used. There are many Gerrit installations serving
thousands of repositories from a single server.

Agreed.

We have 40k repositories in a single node on GerritHub, with HA configuration and two sites (Main in Canada and DR in Germany).
We are running at 10% of our potential capacity (32 CPUs, 128GB) and we could easily host 10x more repos with the same configuration, up to almost 500k repos.

After that, we may start using sharding, which is supported from v2.15, and with that we could then easily scale up to millions of repositories.
Having said that, when you start going towards millions of repositories you may want an active-active multi-site setup, to minimise push latency, or more flexible storage allocation using something like Cassandra, ScyllaDB or CockroachDB.

How many repos, users and locations are you going to support?


-Matthias
_______________________________________________
jgit-dev mailing list
jgit-dev@xxxxxxxxxxx
To change your delivery options, retrieve your password, or unsubscribe from this list, visit
https://dev.eclipse.org/mailman/listinfo/jgit-dev

