[egit-dev] Re: [JGit-io-RFC-PATCH v2 2/4] Add JGit IO SPI and default implementation

Hi Shawn,

First, thanks for the reply; my comments are inline.

On Tue, Oct 13, 2009 at 10:15 PM, Shawn O. Pearce <spearce@xxxxxxxxxxx> wrote:
> Imran M Yousuf <imyousuf@xxxxxxxxx> wrote:
>> Firstly, I am sorry but I am not intelligent enough to perceive: how
>> does the user decide which instance of Config to use? I personally think
>> that there is no API to achieve what you just mentioned :(; i.e. the
>> user will have to know CassandraConfig directly.
>
> Yes.  Well, almost.
>
> The user will have to know that s/he wants a CassandraRepository or
> a JdbcRepository in order to obtain the abstract Repository handle.
> Each of these will need different configuration, possibly data which
> is too complex to simply cram into a URL string, so I was expecting
> the application would construct the concrete Repository class and
> configure it with the proper arguments required for contact with
> the underlying storage.
>
> Since the Repository wants several things associated with it, each
> concrete Repository class knows what concrete Config, ObjectDatabase
> and RefDatabase it should create.  Those concrete classes know how
> to read a repository stored on that medium.
>
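
As a rough illustration of the arrangement described above, here is a
minimal Java sketch. All type and method names below are hypothetical
stand-ins, not the real JGit classes; the point is only that the
application constructs the concrete repository, and the concrete
repository in turn decides which Config, ObjectDatabase and
RefDatabase it creates.

```java
// Hypothetical sketch; these are illustrative types, not the real JGit API.
interface Config { String describe(); }
interface ObjectDatabase { String describe(); }
interface RefDatabase { String describe(); }

abstract class Repository {
    // Each concrete repository knows which concrete parts belong to it.
    abstract Config createConfig();
    abstract ObjectDatabase createObjectDatabase();
    abstract RefDatabase createRefDatabase();

    // Callers only see the abstract handle.
    String summary() {
        return createConfig().describe() + ", "
                + createObjectDatabase().describe() + ", "
                + createRefDatabase().describe();
    }
}

class JdbcRepository extends Repository {
    private final String connectionUrl; // assumed: backend-specific setup data
    private final int repositoryId;     // assumed: key into the SQL tables

    JdbcRepository(String connectionUrl, int repositoryId) {
        this.connectionUrl = connectionUrl;
        this.repositoryId = repositoryId;
    }

    String connectionUrl() { return connectionUrl; }

    @Override Config createConfig() { return () -> "jdbc config #" + repositoryId; }
    @Override ObjectDatabase createObjectDatabase() { return () -> "jdbc objects #" + repositoryId; }
    @Override RefDatabase createRefDatabase() { return () -> "jdbc refs #" + repositoryId; }
}

class RepositorySketchDemo {
    public static void main(String[] args) {
        // The application picks the concrete class, then works with the abstraction.
        Repository repo = new JdbcRepository("jdbc:example", 7);
        System.out.println(repo.summary());
    }
}
```

A CassandraRepository would follow the same shape, taking whatever
arguments that backend needs in its constructor.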

Hmm, when trying to come up with an API, where I essentially wanted
to abstract all records, I noticed that everything uses java.io.File,
and I never actually thought along these lines.

Well, I was thinking in terms of URIs, as I think Git, and in turn
JGit, follows REST (somehow it feels so), and from my understanding of
git storage (which could very well be incorrect) a URI could be a
perfect match. The question of how an implementation requiring
configuration is fed to the SPI manager is simple: before JGit is
used, the instance itself is registered with the manager. So, e.g.,
for JDBC the URI could very well look like jdbc://etc/X11/ for a repo
path, and the JDBC implementor will already know the connection config
specs etc., so there is no need to cram all the info into the URL.
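
To make the registration idea concrete, here is a minimal sketch under
stated assumptions. StorageManager, StorageProvider and
JdbcStorageProvider are hypothetical names invented for illustration;
nothing here is existing JGit API. The provider is configured with its
connection details up front and registered under a URI scheme, so the
repository URI itself stays short.

```java
import java.net.URI;
import java.util.Locale;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// Hypothetical SPI: one provider per storage backend.
interface StorageProvider {
    String open(URI repo); // returns a description only, to keep the sketch small
}

// Hypothetical manager: maps a URI scheme to a pre-configured provider.
final class StorageManager {
    private final Map<String, StorageProvider> byScheme = new ConcurrentHashMap<>();

    void register(String scheme, StorageProvider provider) {
        byScheme.put(scheme.toLowerCase(Locale.ROOT), provider);
    }

    StorageProvider lookup(URI repo) {
        String scheme = repo.getScheme();
        if (scheme == null) {
            throw new IllegalArgumentException("URI has no scheme: " + repo);
        }
        StorageProvider provider = byScheme.get(scheme.toLowerCase(Locale.ROOT));
        if (provider == null) {
            throw new IllegalArgumentException("No provider registered for: " + scheme);
        }
        return provider;
    }
}

// The provider already holds its connection settings, so none go in the URI.
final class JdbcStorageProvider implements StorageProvider {
    private final String connectionUrl; // assumed: configured by the application

    JdbcStorageProvider(String connectionUrl) {
        this.connectionUrl = connectionUrl;
    }

    @Override
    public String open(URI repo) {
        // Note: java.net.URI parses "jdbc://etc/X11/" with "etc" as the
        // authority and "/X11/" as the path, so both are combined here.
        return "repo " + repo.getAuthority() + repo.getPath() + " via " + connectionUrl;
    }
}

class StorageManagerDemo {
    public static void main(String[] args) {
        StorageManager manager = new StorageManager();
        manager.register("jdbc", new JdbcStorageProvider("jdbc:example"));
        URI repo = URI.create("jdbc://etc/X11/");
        System.out.println(manager.lookup(repo).open(repo));
    }
}
```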

>> Secondly, I was instead
>> thinking of porting JGit to any system supporting
>> streams (not any specific subclass of them), such as HBase/BigTable or
>> HDFS, anything.... Thirdly, I think we actually have several tasks in
>> hand and I would state them as -
>>
>> 1. First introduce the I/O API such that it completely replaces java.io.File
>> 2. Secondly segregate persistence of config (or config-like
>> objects) and introduce an SPI for them for smarter storage.
>
> Supporting streams on an arbitrary backend is difficult.  DHTs like
> BigTable/Cassandra aren't very good at providing streams, they tend
> to have a limit on how big a row can be.  They tend to have very
> slow read latencies, but can return a small block of consecutive
> rows in one reply.
>
> I want to talk about the DHT backend more with Scott Chacon at the
> GitTogether, but I have this feeling that just laying a pack file
> into a stream in a DHT is going to perform very poorly.
>
> Likewise JDBC has similar performance problems, you can only store
> so much in a row before performance of the RDBMS drops off sharply.
> You can get a handful of rows in a single reply pretty efficiently,
> but each query takes longer than you'd like.  Yes, there is often
> a BLOB type that allows large file storage, but different RDBMS
> support these differently and have different performance when it
> comes to accessing the BLOB types.  Some don't support random access,
> some do.  Even if they do support random access read, writing a large
> 2 GiB repository's pack file after repacking it would take ages.
>
> Once you get outside of the pack file, *everything* else git stores
> is either a loose object, or very tiny text files (aka refs, their
> logs, config).  The loose object case should be handled by the same
> thing that handles the bulk of the object store, loose objects are
> a trivial thing compared to packed objects.
>
> The refs, the ref logs, and the config are all structured text.
> If you lay a Git repository down into a database of some sort,
> I think it's reasonable to expect that the schema for these items
> in that database permits query and update using relatively native
> primitives in that database.  E.g. if you put this in SQL I would
> expect a schema like:
>
>  CREATE TABLE refs (
>   repository_id INT NOT NULL
>  ,name VARCHAR(255) NOT NULL
>  ,id CHAR(40)
>  ,target VARCHAR(255)
>  ,PRIMARY KEY (repository_id, name)
>  ,CHECK (id IS NOT NULL OR target IS NOT NULL));
>
>  CREATE TABLE reflogs (
>   repository_id INT NOT NULL
>  ,name VARCHAR(255) NOT NULL
>  ,old_id CHAR(40) NOT NULL
>  ,new_id CHAR(40) NOT NULL
>  ,committer_name VARCHAR(255)
>  ,committer_email VARCHAR(255)
>  ,committer_date TIMESTAMP NOT NULL
>  ,message VARCHAR(255)
>  ,PRIMARY KEY (repository_id, name, committer_date));
>
>  CREATE TABLE config (
>   repository_id INT NOT NULL
>  ,section VARCHAR(255) NOT NULL
>  ,"group" VARCHAR(255)
>  ,name VARCHAR(255) NOT NULL
>  ,value VARCHAR(255)
>  ,PRIMARY KEY(repository_id, section, "group", name, value));
>
> This makes it easier to manipulate settings, you can use direct
> SQL UPDATE to modify the configuration, or SELECT to scan through
> reflogs.  Etc.
>
> If we just threw everything as streams into the database this
> would be a lot more difficult to work with through the database's
> own native query and update interface.  You'd lose a lot of the
> benefits of using a database, but still be paying the massive price
> in performance.
>

To be honest, I understood your point the last time you mentioned it
:). I fully agree with the performance part and I too have doubts; that
is why, while you were mentioning HBase, I was mentioning HDFS :), and
I was actually thinking in terms of a FS. But after your elaboration,
this order of changes just makes more sense -
* First, we decouple from a particular FS and have our own I/O for
packed and loose objects, so that we can easily retain the current
behavior.
* Then we rework the key objects, e.g. Refs, Configs etc., to
segregate their persistence, that is, introduce a persistence layer
for them which should be friendly enough for native operations on
platforms such as RDBMS or HBase or BigTable. Using packs will depend
on the setup and the repositories, but certain implementations
* Finally, we implement this SPI to support different persistence platforms :).
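
A small sketch of what such a persistence layer for refs might look
like follows. RefRecord, RefStore and InMemoryRefStore are hypothetical
names made up for illustration, not JGit API; the fields simply mirror
the refs table quoted above (name, id, target), and the depth limit is
an arbitrary value chosen for the sketch. An RDBMS or HBase
implementation would map the same three operations onto its own native
queries.

```java
import java.util.Map;
import java.util.Optional;
import java.util.TreeMap;

// Mirrors the quoted refs schema: a ref has a name and an id, a target, or both.
final class RefRecord {
    final String name;
    final String id;     // object id, or null for a purely symbolic ref
    final String target; // name of another ref, or null

    RefRecord(String name, String id, String target) {
        if (name == null) {
            throw new IllegalArgumentException("name is required");
        }
        if (id == null && target == null) {
            // Same rule as CHECK (id IS NOT NULL OR target IS NOT NULL).
            throw new IllegalArgumentException("either id or target must be set");
        }
        this.name = name;
        this.id = id;
        this.target = target;
    }
}

// Hypothetical SPI: the operations a backend must provide for refs.
interface RefStore {
    int MAX_DEPTH = 5; // arbitrary guard against symbolic-ref loops

    Optional<RefRecord> get(String name);
    void put(RefRecord ref);
    void delete(String name);

    // Follows symbolic targets until a ref with an object id is found.
    default Optional<String> resolve(String name) {
        String current = name;
        for (int depth = 0; depth < MAX_DEPTH; depth++) {
            Optional<RefRecord> ref = get(current);
            if (!ref.isPresent()) return Optional.empty();
            if (ref.get().id != null) return Optional.of(ref.get().id);
            current = ref.get().target;
        }
        return Optional.empty();
    }
}

// Trivial backend used here in place of a real database.
final class InMemoryRefStore implements RefStore {
    private final Map<String, RefRecord> refs = new TreeMap<>();

    @Override public Optional<RefRecord> get(String name) {
        return Optional.ofNullable(refs.get(name));
    }
    @Override public void put(RefRecord ref) { refs.put(ref.name, ref); }
    @Override public void delete(String name) { refs.remove(name); }
}

class RefStoreDemo {
    public static void main(String[] args) {
        RefStore store = new InMemoryRefStore();
        store.put(new RefRecord("refs/heads/master",
                "0123456789abcdef0123456789abcdef01234567", null));
        store.put(new RefRecord("HEAD", null, "refs/heads/master"));
        System.out.println(store.resolve("HEAD").orElse("unresolved"));
    }
}
```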

The thing is, I first want to make JGit independent of java.io.File :);
that was my motto to start with, but you showed me a path beyond that
:), namely freedom from java.io.File plus an optimized persistence API
:). I want to have them both and let the implementor choose how to
implement it :).

>> I am not thinking of storing "only" the bare content of a git
>> repository, but I intend to be able to also store the versioned
>> contents itself as well.
>
> When I say "bare repository" I only mean a repository without a
> working directory.  It still holds the complete revision history.
> If you wanted a git repository on Cassandra but wanted to actually
> have a working directory checkout, you'd need the local filesystem
> for the checkout and .git/index, but could otherwise keep the objects
> and refs in Cassandra.  It's nuts... but in theory one could do it.
>

My requirement also involves checking it out on HDFS :), which is why
I was mentioning it, but that could be a separate topic from JGit
itself.

Eagerly waiting for a reply.

Thank you,

-- 
Imran M Yousuf
Entrepreneur & Software Engineer
Smart IT Engineering
Dhaka, Bangladesh
Email: imran@xxxxxxxxxxxxxxxxxxxxxx
Blog: http://imyousuf-tech.blogs.smartitengineering.com/
Mobile: +880-1711402557

