[
Date Prev][
Date Next][
Thread Prev][
Thread Next][
Date Index][
Thread Index]
[
List Home]
Re: [rdf4j-dev] New Memory Store - Planning
|
Hi Håvard,
On 07/17/2017 11:42 AM, Håvard Ottestad wrote:
Hi Jerven,
I want to make lookups for when S and P are known to be as fast as when
only S is known. And lookups when S, P and O are known should be just as
fast too. B+ trees handle this really well, since lookups are O(log n)
regardless. Which indexes (eg. SPOC, POSC, OPSC ....) the user wants
will be configurable.
Yes if you want S only queries to return as fast as SP or P only queries
then my approach does not work as is. Now a trick I was actually working
on for my read only store is to note which IRI namespaces are in the S
section. This follows from the assumption that while there are in the
order of 100 different predicates most subjects only use a few of them
and that there is a correlation between subject IRI namespace and
predicate use. Which quickly allows you to filter out P's that won't match.
B+ trees also return the data sorted, which is very useful when doing
joins. I don't think the rdf4j join algorithm can take advantage of
this, but extending them to do merge-join will make joins much faster.
Van Emde Boas search trees have this behavior as well or at least burst
trees that are derived from them can have this behavior on iteration.
I have been considering your approach, by extending it with multiple
layers of hashmaps so that you can do the following when S and P are known:
store.getSubject(S).getPredicate(P).iterator()
However hashmaps in java use arrays to store their data, and these are
usually initialised to some size which will mean that it takes up a lot
more memory than needed. Hashmaps also aren't sorted.
Yes but they can be set to 2 entries per default.
I think Jena uses this approach for their Model (which is functionally
closer to MemoryStore than to RDF4J Model). Jena has an optimisation
though which is that the nested hashmaps are arraylists as long as they
are small.
Which makes sense. However, I think you need to be clear if you want to
be a Model or a SPARQL store in main access because they do have
different behaviors in practice.
In any case I think what you are aiming towards is more a row orientated
data layout while if you want real speed you need to think columnar.
Regards,
Jerven
Håvard
_______________________________________________
rdf4j-dev mailing list
rdf4j-dev@xxxxxxxxxxx
To change your delivery options, retrieve your password, or unsubscribe from this list, visit
https://dev.eclipse.org/mailman/listinfo/rdf4j-dev
--
-------------------------------------------------------------------
Jerven Bolleman Jerven.Bolleman@sib.swiss
SIB Swiss Institute of Bioinformatics Tel: +41 (0)22 379 58 85
CMU, rue Michel Servet 1 Fax: +41 (0)22 379 58 58
1211 Geneve 4,
Switzerland www.sib.swiss - www.uniprot.org
Follow us at https://twitter.com/#!/uniprot
-------------------------------------------------------------------