Re: [rdf4j-dev] New Memory Store

[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index] [List Home]

Re: [rdf4j-dev] New Memory Store - Planning

From: Jerven Tjalling Bolleman <jerven.bolleman@sib.swiss>
Date: Mon, 17 Jul 2017 12:20:22 +0200
Delivered-to: rdf4j-dev@xxxxxxxxxxx
List-archive: <https://dev.eclipse.org/mailman/private/rdf4j-dev>
List-help: <mailto:rdf4j-dev-request@eclipse.org?subject=help>
List-subscribe: <https://dev.eclipse.org/mailman/listinfo/rdf4j-dev>, <mailto:rdf4j-dev-request@eclipse.org?subject=subscribe>
List-unsubscribe: <https://dev.eclipse.org/mailman/options/rdf4j-dev>, <mailto:rdf4j-dev-request@eclipse.org?subject=unsubscribe>
User-agent: Mozilla/5.0 (X11; Linux x86_64; rv:52.0) Gecko/20100101 Thunderbird/52.2.0

Hi Håvard,


On 07/17/2017 11:42 AM, Håvard Ottestad wrote:

Hi Jerven,
I want to make lookups for when S and P are known to be as fast as whenonly S is known. And lookups when S, P and O are known should be just asfast too. B+ trees handle this really well, since lookups are O(log n)regardless. Which indexes (eg. SPOC, POSC, OPSC ....) the user wantswill be configurable.


Yes if you want S only queries to return as fast as SP or P only queries

then my approach does not work as is. Now a trick I was actually workingon for my read only store is to note which IRI namespaces are in the Ssection. This follows from the assumption that while there are in theorder of 100 different predicates most subjects only use a few of themand that there is a correlation between subject IRI namespace andpredicate use. Which quickly allows you to filter out P's that won't match.

B+ trees also return the data sorted, which is very useful when doingjoins. I don't think the rdf4j join algorithm can take advantage ofthis, but extending them to do merge-join will make joins much faster.

Van Emde Boas search trees have this behavior as well or at least bursttrees that are derived from them can have this behavior on iteration.

I have been considering your approach, by extending it with multiplelayers of hashmaps so that you can do the following when S and P are known:
store.getSubject(S).getPredicate(P).iterator()
However hashmaps in java use arrays to store their data, and these areusually initialised to some size which will mean that it takes up a lotmore memory than needed. Hashmaps also aren't sorted.

Yes but they can be set to 2 entries per default.

I think Jena uses this approach for their Model (which is functionallycloser to MemoryStore than to RDF4J Model). Jena has an optimisationthough which is that the nested hashmaps are arraylists as long as theyare small.

Which makes sense. However, I think you need to be clear if you want tobe a Model or a SPARQL store in main access because they do havedifferent behaviors in practice.

In any case I think what you are aiming towards is more a row orientateddata layout while if you want real speed you need to think columnar.


Regards,
Jerven


Håvard


_______________________________________________
rdf4j-dev mailing list
rdf4j-dev@xxxxxxxxxxx
To change your delivery options, retrieve your password, or unsubscribe from this list, visit
https://dev.eclipse.org/mailman/listinfo/rdf4j-dev


--
-------------------------------------------------------------------
Jerven Bolleman                        Jerven.Bolleman@sib.swiss
SIB Swiss Institute of Bioinformatics  Tel: +41 (0)22 379 58 85
CMU, rue Michel Servet 1               Fax: +41 (0)22 379 58 58
1211 Geneve 4,
Switzerland     www.sib.swiss - www.uniprot.org
Follow us at https://twitter.com/#!/uniprot
-------------------------------------------------------------------

References:
- [rdf4j-dev] New Memory Store - Planning
  - From: Håvard M. Ottestad
- Re: [rdf4j-dev] New Memory Store - Planning
  - From: Jerven Tjalling Bolleman
- Re: [rdf4j-dev] New Memory Store - Planning
  - From: Håvard Ottestad

Prev by Date: Re: [rdf4j-dev] New Memory Store - Planning
Next by Date: Re: [rdf4j-dev] New Memory Store - Planning
Previous by thread: Re: [rdf4j-dev] New Memory Store - Planning
Next by thread: [rdf4j-dev] Question about READ_UNCOMMITTED
Index(es):
- Date
- Thread

Breadcrumbs