
Re: [rdf4j-dev] Question about parsing large files
  • From: Jerven Tjalling Bolleman <Jerven.Bolleman@sib.swiss>
  • Date: Wed, 17 Jan 2024 21:17:58 +0000

Hi Dan,

I agree with Håvard that this would be best discussed on GitHub; the current emails are rather vague, which makes them hard to answer.

Rio parsers are streaming, so something else must be going on; we need a lot more details to diagnose it.
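For context: Rio's parsers push each statement to a handler callback as it is read, so memory use stays flat unless the handler itself accumulates statements (e.g. by collecting them into a Model). A minimal stdlib-only sketch of that push pattern follows; it is an illustration, not the actual RDF4J API, and the `StatementHandler` interface and naive line parsing are hypothetical stand-ins.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.Reader;
import java.io.UncheckedIOException;

public class StreamingSketch {

    // Hypothetical handler interface, analogous in spirit to Rio's RDFHandler:
    // the parser "pushes" each statement to the handler as soon as it is read.
    interface StatementHandler {
        void handle(String subject, String predicate, String object);
    }

    // Deliberately naive line-based parse of "<s> <p> <o> ." statements;
    // real N-Triples parsing also handles literals, escapes, and comments.
    static void parse(Reader in, StatementHandler handler) {
        BufferedReader reader = new BufferedReader(in);
        try {
            String line;
            while ((line = reader.readLine()) != null) {
                line = line.trim();
                if (line.isEmpty() || line.startsWith("#")) {
                    continue;
                }
                String[] parts = line.split("\\s+", 3);
                String object = parts[2].replaceAll("\\s*\\.$", "");
                // Push the statement out immediately instead of collecting it,
                // so memory use stays constant regardless of file size.
                handler.handle(parts[0], parts[1], object);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```

With this pattern, memory only grows if the supplied handler retains the statements it receives.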

Please open a discussion and give us some code snippets.

Regards,
Jerven

Jerven Tjalling Bolleman
Principal Software Developer
SIB | Swiss Institute of Bioinformatics
1, rue Michel Servet - CH 1211 Geneva 4 - Switzerland
t +41 22 379 58 85
Jerven.Bolleman@sib.swiss - www.sib.swiss


From: rdf4j-dev <rdf4j-dev-bounces@xxxxxxxxxxx> on behalf of Dan S via rdf4j-dev <rdf4j-dev@xxxxxxxxxxx>
Sent: 17 January 2024 21:38
To: rdf4j developer discussions <rdf4j-dev@xxxxxxxxxxx>
Cc: Dan S <danielms853@xxxxxxxxx>
Subject: Re: [rdf4j-dev] Question about parsing large files
 
Hi Håvard,

To clarify, we think this question does relate to internal RDF4J development. The parser takes a filename/handle and returns an iterator, and a major selling point of the iterator abstraction is that elements can be consumed without loading all of them into memory at once. What we're asking is whether the parser can be enhanced to read files larger than available memory (this may be something we could help with). Presumably, given a large enough lookahead, it could parse out the next few triples and, as the iterator advances, free the triples that have already been consumed. This would be especially useful because the parser is a standalone component of RDF4J that can be imported via Maven/Gradle into other projects. A parser for very large files would be super useful, and we were wondering whether the RDF4J one could eventually become that solution.

Thanks,

Dan 




On Wed, Jan 17, 2024, 20:00 Håvard Ottestad via rdf4j-dev <rdf4j-dev@xxxxxxxxxxx> wrote:
Hi Benjamin,

Could you post this on the GitHub discussion section? We prefer to keep the dev list focused on the internal development of RDF4J.

That being said, I would recommend trying N-Quads instead. And if you are inserting the data into a database, make sure to use isolation level NONE.
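For reference, the bulk-load pattern described here looks roughly like the following with the RDF4J Repository API. This is a sketch, not a drop-in implementation: it assumes an already-configured `repo`, the `IsolationLevels` import path is for RDF4J 4.x, and the file name is illustrative.

```java
import java.io.File;

import org.eclipse.rdf4j.common.transaction.IsolationLevels;
import org.eclipse.rdf4j.repository.Repository;
import org.eclipse.rdf4j.repository.RepositoryConnection;
import org.eclipse.rdf4j.rio.RDFFormat;

public class BulkLoad {
    // Load an N-Quads file into an already-configured repository.
    static void load(Repository repo, File dataFile) throws Exception {
        try (RepositoryConnection conn = repo.getConnection()) {
            // Isolation level NONE skips transaction-isolation bookkeeping,
            // which is what makes large bulk inserts feasible.
            conn.begin(IsolationLevels.NONE);
            conn.add(dataFile, null, RDFFormat.NQUADS);
            conn.commit();
        }
    }
}
```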

Cheers,
Håvard

On 17 Jan 2024, at 20:35, Benjamin Herber (BLOOMBERG/ 919 3RD A) via rdf4j-dev <rdf4j-dev@xxxxxxxxxxx> wrote:


Hi everyone!

I'm currently working on a project that involves bulk loading larger files (currently limited to N3, Turtle, and the associated family). I was trying to parse about 100 million triples (~13 GB), and the parser ran out of memory with the JVM heap set to 32 GB.

I looked quickly into the parser implementation, and it seems the parser cannot be configured to parse one statement per iteration. So what I am doing now is chunking the larger triple files into a series of smaller ones and loading each individually, but this is proving error-prone and unmaintainable in the long term across different formats.
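One note on the chunking workaround: N-Triples and N-Quads are line-based (one statement per line), so splitting them is at least mechanical, whereas Turtle/N3 cannot safely be split on line boundaries because of prefix directives and multi-line statements. A stdlib-only sketch of line-count chunking (the method name and the in-memory chunk representation are illustrative):

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.util.ArrayList;
import java.util.List;

public class Chunker {
    // Split the lines of a line-based RDF format (N-Triples/N-Quads) into
    // chunks of at most maxLines statements each. In practice each chunk
    // would be written to its own file and loaded separately.
    static List<List<String>> split(BufferedReader in, int maxLines) throws IOException {
        List<List<String>> chunks = new ArrayList<>();
        List<String> current = new ArrayList<>();
        String line;
        while ((line = in.readLine()) != null) {
            // Skip blank lines and comments; they carry no statements.
            if (line.isBlank() || line.startsWith("#")) {
                continue;
            }
            current.add(line);
            if (current.size() == maxLines) {
                chunks.add(current);
                current = new ArrayList<>();
            }
        }
        if (!current.isEmpty()) {
            chunks.add(current);
        }
        return chunks;
    }
}
```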

Does anyone have insight into a better approach, or a workaround to get streaming parsing? I'm also new to the codebase, so any pointers in case I missed something would be appreciated!

Thank you!
- Benjamin Herber



_______________________________________________
rdf4j-dev mailing list
rdf4j-dev@xxxxxxxxxxx
To unsubscribe from this list, visit https://www.eclipse.org/mailman/listinfo/rdf4j-dev
