Skip to main content

[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index] [List Home]
Re: [rdf4j-dev] New Memory Store - Planning

Hi Håvard,

It is an interesting proposal to produce a new in memory store.
I think you can do some really interesting things, that are a bit different than most implementations if you take some observations on common dataset properties into acount.

In my (limited) experience most datasets have a limited number of predicates and a few graphs. I know cases where this is not true.

The standard approach for a Quad store is roughly

class Store {
	Resource[] subjects;
	IRI[] predicate;
	Value[] object;
	Resource[] graph;
}

As in java all of these are arrays of pointers/references so this layout will already give you the standard quad/value table split.

However, if you assume there will be only a few (<10_000) predicates in any store you could do something like this.

class Store {
	Map<IRI, SOG> predicateToSogMap = ...
}

class SOG {
	Resource[] subjects;
	Value[] object;
	Resource[]graph;
}

This optimization allows you to very quickly find all triples with a certain predicate and not avoid storing this repetitively again and again.

Now assuming you normalize your values on insert a single array layout
will give you more benefit for than using unsafe structures.

class SOG {
	Value[] subjectsObjectsGraphs;
}

basically you this big array stores all values linearly and you waste no memory in this approach. It is super word optimization friendly for the full scan cases.

Now this structure is rather basic so I would not have one or three big array's at all.

I would structure the "SOG" part as Van Emde Boas structured search tree based on the value of the Subject. Or in any case try to have a structure that allows fast searching on "S".

If you have fast search on P and on S in your primary datastore you can have much smaller indexes and need to maintain much less data.

Assuming you will have few graphs you can even do a structure like this.


class SOG {
	Resource[] subjects;
	Value[] object;
	Map<Resource, BitSet> graphBitSet = ...
}

This will allow very fast filtering of potential quads from a result set by removing all rows in SO that do not have a corresponding bit set in the bitset belonging corresponding to the graph. If this property holds you could probably survive with just an OPS index.

Now most of these thoughts where aimed at a fully static store but I am sure some of it could be applied to an in-memory dynamic one as well.

Of course the array approaches will limit you to 2 billion triples per predicate... but I assume that is fine as that is already going into the direction of multi terabyte heaps... Having said that there will be real value by splitting theses arrays into smaller blocks as that will allow more tricks and future optimization e.g. inline values as well as dealing with deleting "quad rows" in an atomic and threadsafe fashion.

Regards,
Jerven


On 07/15/2017 11:04 AM, "Håvard M. Ottestad" wrote:
Hi,

I’m look at creating a new Memory Store that is configurable so that it can be optimised for any scenario.

Here is what I’m thinking of doing:

  - B+ tree indexes (because they support range queries)
  - Thread based indexing (async)
  - Deletes in diff index (merged when index.size > some max)
  - Hashmap index for exists(triple1) queries

I don’t want to have transactional support. I use transactions with disk based databases, but the memory store I usually only use for embedded purposes that are single threaded.

Scenarios I want to support:
  - read heavy
  - write heavy
  - transforms (deserialise, query, update, serialise)

In general, are there any recommendations or requirements that others might have before I get too committed?

I also know that a lot of triple stores convert IRIs and literals to a hash/integer so that the hash/integer is stored in the indexes and the IRIs/literals are stored in a lookup table. I’m considering doing this too, but it might not be of much benefit unless I migrate to manual memory management using the unsafe library. Any thoughts on using sun.misc.Unsafe?

Håvard
_______________________________________________
rdf4j-dev mailing list
rdf4j-dev@xxxxxxxxxxx
To change your delivery options, retrieve your password, or unsubscribe from this list, visit
https://dev.eclipse.org/mailman/listinfo/rdf4j-dev


--
-------------------------------------------------------------------
Jerven Bolleman                        Jerven.Bolleman@sib.swiss
SIB Swiss Institute of Bioinformatics  Tel: +41 (0)22 379 58 85
CMU, rue Michel Servet 1               Fax: +41 (0)22 379 58 58
1211 Geneve 4,
Switzerland     www.sib.swiss - www.uniprot.org
Follow us at https://twitter.com/#!/uniprot
-------------------------------------------------------------------


Back to the top