Skip to main content

[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index] [List Home]
Re: [rdf4j-dev] New Memory Store - Planning

Hi Håvard,


On 07/17/2017 11:42 AM, Håvard Ottestad wrote:
Hi Jerven,

I want to make lookups for when S and P are known to be as fast as when only S is known. And lookups when S, P and O are known should be just as fast too. B+ trees handle this really well, since lookups are O(log n) regardless. Which indexes (eg. SPOC, POSC, OPSC ....) the user wants will be configurable.

Yes if you want S only queries to return as fast as SP or P only queries
then my approach does not work as is. Now a trick I was actually working on for my read only store is to note which IRI namespaces are in the S section. This follows from the assumption that while there are in the order of 100 different predicates most subjects only use a few of them and that there is a correlation between subject IRI namespace and predicate use. Which quickly allows you to filter out P's that won't match.

B+ trees also return the data sorted, which is very useful when doing joins. I don't think the rdf4j join algorithm can take advantage of this, but extending them to do merge-join will make joins much faster.
Van Emde Boas search trees have this behavior as well or at least burst trees that are derived from them can have this behavior on iteration.

I have been considering your approach, by extending it with multiple layers of hashmaps so that you can do the following when S and P are known:

store.getSubject(S).getPredicate(P).iterator()

However hashmaps in java use arrays to store their data, and these are usually initialised to some size which will mean that it takes up a lot more memory than needed. Hashmaps also aren't sorted.
Yes but they can be set to 2 entries per default.

I think Jena uses this approach for their Model (which is functionally closer to MemoryStore than to RDF4J Model). Jena has an optimisation though which is that the nested hashmaps are arraylists as long as they are small.
Which makes sense. However, I think you need to be clear if you want to be a Model or a SPARQL store in main access because they do have different behaviors in practice.

In any case I think what you are aiming towards is more a row orientated data layout while if you want real speed you need to think columnar.

Regards,
Jerven

Håvard


_______________________________________________
rdf4j-dev mailing list
rdf4j-dev@xxxxxxxxxxx
To change your delivery options, retrieve your password, or unsubscribe from this list, visit
https://dev.eclipse.org/mailman/listinfo/rdf4j-dev


--
-------------------------------------------------------------------
Jerven Bolleman                        Jerven.Bolleman@sib.swiss
SIB Swiss Institute of Bioinformatics  Tel: +41 (0)22 379 58 85
CMU, rue Michel Servet 1               Fax: +41 (0)22 379 58 58
1211 Geneve 4,
Switzerland     www.sib.swiss - www.uniprot.org
Follow us at https://twitter.com/#!/uniprot
-------------------------------------------------------------------


Back to the top