Re: [rdf4j-dev] New Memory Store

[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index] [List Home]

Re: [rdf4j-dev] New Memory Store - Planning

From: Jerven Tjalling Bolleman <jerven.bolleman@sib.swiss>
Date: Mon, 17 Jul 2017 11:14:34 +0200
Delivered-to: rdf4j-dev@xxxxxxxxxxx
List-archive: <https://dev.eclipse.org/mailman/private/rdf4j-dev>
List-help: <mailto:rdf4j-dev-request@eclipse.org?subject=help>
List-subscribe: <https://dev.eclipse.org/mailman/listinfo/rdf4j-dev>, <mailto:rdf4j-dev-request@eclipse.org?subject=subscribe>
List-unsubscribe: <https://dev.eclipse.org/mailman/options/rdf4j-dev>, <mailto:rdf4j-dev-request@eclipse.org?subject=unsubscribe>
User-agent: Mozilla/5.0 (X11; Linux x86_64; rv:52.0) Gecko/20100101 Thunderbird/52.2.0

Hi Håvard,

It is an interesting proposal to produce a new in memory store.

I think you can do some really interesting things, that are a bitdifferent than most implementations if you take some observations oncommon dataset properties into acount.

In my (limited) experience most datasets have a limited number ofpredicates and a few graphs. I know cases where this is not true.


The standard approach for a Quad store is roughly

class Store {
	Resource[] subjects;
	IRI[] predicate;
	Value[] object;
	Resource[] graph;
}

As in java all of these are arrays of pointers/references so this layoutwill already give you the standard quad/value table split.

However, if you assume there will be only a few (<10_000) predicates inany store you could do something like this.


class Store {
	Map<IRI, SOG> predicateToSogMap = ...
}

class SOG {
	Resource[] subjects;
	Value[] object;
	Resource[]graph;
}

This optimization allows you to very quickly find all triples with acertain predicate and not avoid storing this repetitively again and again.


Now assuming you normalize your values on insert a single array layout
will give you more benefit for than using unsafe structures.

class SOG {
	Value[] subjectsObjectsGraphs;
}

basically you this big array stores all values linearly and you waste nomemory in this approach. It is super word optimization friendly for thefull scan cases.

Now this structure is rather basic so I would not have one or three bigarray's at all.

I would structure the "SOG" part as Van Emde Boas structured search treebased on the value of the Subject. Or in any case try to have astructure that allows fast searching on "S".

If you have fast search on P and on S in your primary datastore you canhave much smaller indexes and need to maintain much less data.


Assuming you will have few graphs you can even do a structure like this.


class SOG {
	Resource[] subjects;
	Value[] object;
	Map<Resource, BitSet> graphBitSet = ...
}

This will allow very fast filtering of potential quads from a result setby removing all rows in SO that do not have a corresponding bit set inthe bitset belonging corresponding to the graph. If this property holdsyou could probably survive with just an OPS index.

Now most of these thoughts where aimed at a fully static store but I amsure some of it could be applied to an in-memory dynamic one as well.

Of course the array approaches will limit you to 2 billion triples perpredicate... but I assume that is fine as that is already going into thedirection of multi terabyte heaps... Having said that there will be realvalue by splitting theses arrays into smaller blocks as that will allowmore tricks and future optimization e.g. inline values as well asdealing with deleting "quad rows" in an atomic and threadsafe fashion.


Regards,
Jerven


On 07/15/2017 11:04 AM, "Håvard M. Ottestad" wrote:

Hi,

I’m look at creating a new Memory Store that is configurable so that it can be optimised for any scenario.

Here is what I’m thinking of doing:

  - B+ tree indexes (because they support range queries)
  - Thread based indexing (async)
  - Deletes in diff index (merged when index.size > some max)
  - Hashmap index for exists(triple1) queries

I don’t want to have transactional support. I use transactions with disk based databases, but the memory store I usually only use for embedded purposes that are single threaded.

Scenarios I want to support:
  - read heavy
  - write heavy
  - transforms (deserialise, query, update, serialise)

In general, are there any recommendations or requirements that others might have before I get too committed?

I also know that a lot of triple stores convert IRIs and literals to a hash/integer so that the hash/integer is stored in the indexes and the IRIs/literals are stored in a lookup table. I’m considering doing this too, but it might not be of much benefit unless I migrate to manual memory management using the unsafe library. Any thoughts on using sun.misc.Unsafe?

Håvard
_______________________________________________
rdf4j-dev mailing list
rdf4j-dev@xxxxxxxxxxx
To change your delivery options, retrieve your password, or unsubscribe from this list, visit
https://dev.eclipse.org/mailman/listinfo/rdf4j-dev


--
-------------------------------------------------------------------
Jerven Bolleman                        Jerven.Bolleman@sib.swiss
SIB Swiss Institute of Bioinformatics  Tel: +41 (0)22 379 58 85
CMU, rue Michel Servet 1               Fax: +41 (0)22 379 58 58
1211 Geneve 4,
Switzerland     www.sib.swiss - www.uniprot.org
Follow us at https://twitter.com/#!/uniprot
-------------------------------------------------------------------

Follow-Ups:
- Re: [rdf4j-dev] New Memory Store - Planning
  - From: Håvard Ottestad

References:
- [rdf4j-dev] New Memory Store - Planning
  - From: Håvard M. Ottestad

Prev by Date: Re: [rdf4j-dev] New Memory Store - Planning
Next by Date: Re: [rdf4j-dev] New Memory Store - Planning
Previous by thread: Re: [rdf4j-dev] New Memory Store - Planning
Next by thread: Re: [rdf4j-dev] New Memory Store - Planning
Index(es):
- Date
- Thread

Breadcrumbs